Hadoop 1.x vs 2

Post on 26-Jan-2015

Hadoop 2 is a big shift from Hadoop 1 at both the architecture and API levels, particularly with YARN. We held our first meetup to talk about this (http://www.meetup.com/Atlanta-YARN-User-Group/) on 10/13/2013.

Hadoop 1.x vs Hadoop 2

Rommel Garcia, Solutions Engineer - Big Data

Hortonworks

Transition To Big Data

Relational

Dimensional (EDW)

Big Data

Data Explosion

3 Design Dimensions

Key Hadoop Data Types

Sentiment

Clickstream

Sensor/Machine

Geographic

Server Logs

Text

Hadoop is NOT

ESB

NoSQL

HPC

Relational

Real-time

The “Jack of all Trades”

Hadoop 1

Limited to about 4,000 nodes per cluster

Load on the JobTracker grows as O(# of tasks in the cluster)

JobTracker is a bottleneck: it handles resource management, job scheduling, and monitoring

Only one namespace for managing HDFS

Map and reduce slots are static

MapReduce is the only type of job that can run
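To make the static-slot limitation concrete, here is a minimal Python sketch. It is a toy model, not Hadoop code: the slot counts and the `run_static_slots` helper are hypothetical. The only point is that a waiting map task cannot borrow an idle reduce slot.

```python
def run_static_slots(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Return (running_maps, running_reduces, idle_slots) for one node.

    Hypothetical model: each slot type accepts only its own task type,
    as with fixed map and reduce slots on a Hadoop 1 TaskTracker.
    """
    running_maps = min(map_slots, pending_maps)
    running_reduces = min(reduce_slots, pending_reduces)
    idle = (map_slots - running_maps) + (reduce_slots - running_reduces)
    return running_maps, running_reduces, idle


# A node with 4 map slots and 2 reduce slots during a map-heavy phase.
maps, reduces, idle = run_static_slots(4, 2, pending_maps=6, pending_reduces=0)
print(maps, reduces, idle)  # 4 0 2
```

Four maps run and two more wait, even though two reduce slots sit idle on the same node.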

Hadoop 1 - Basics

Diagram: data blocks A, B, and C spread across the cluster

MapReduce (Computation Framework)

HDFS (Storage Framework)

Hadoop 1 - Reading Files

Diagram: Hadoop Client, NameNode, SNameNode, and Rack1 through RackN of DN | TT nodes

Hadoop Client to NameNode: read file

NameNode to Hadoop Client: return DNs, block ids, etc.

Hadoop Client to DNs: read blocks

DNs to NameNode: heartbeat/block report

SNameNode: checkpoint of the NameNode's fsimage/edit
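The read path can be sketched as a toy model in Python. This is not the HDFS client API: the `NameNode` and `DataNode` classes, the file path, and the block ids are all hypothetical. It only illustrates that the NameNode hands back locations while the bytes come from the DataNodes.

```python
# Toy model of the Hadoop 1 read path; all names and data are hypothetical.
class NameNode:
    def __init__(self):
        # file path -> ordered list of (block id, DataNodes holding a replica)
        self.block_map = {
            "/logs/a.txt": [("blk_1", ["dn1", "dn2", "dn3"]),
                            ("blk_2", ["dn2", "dn4", "dn5"])],
        }

    def get_block_locations(self, path):
        """Return DNs and block ids for a file; no file data passes through here."""
        return self.block_map[path]


class DataNode:
    def __init__(self, blocks):
        self.blocks = blocks  # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, datanodes, path):
    """Client side: ask the NameNode where blocks live, then read from DNs."""
    data = b""
    for block_id, locations in namenode.get_block_locations(path):
        # Take the first listed replica; a real client prefers the closest one.
        data += datanodes[locations[0]].read_block(block_id)
    return data


datanodes = {
    "dn1": DataNode({"blk_1": b"hello "}),
    "dn2": DataNode({"blk_1": b"hello ", "blk_2": b"world"}),
    "dn3": DataNode({"blk_1": b"hello "}),
    "dn4": DataNode({"blk_2": b"world"}),
    "dn5": DataNode({"blk_2": b"world"}),
}
print(read_file(NameNode(), datanodes, "/logs/a.txt"))  # b'hello world'
```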

Hadoop 1 - Writing Files

Diagram: Hadoop Client, NameNode, SNameNode, and Rack1 through RackN of DN | TT nodes

Hadoop Client to NameNode: request write

NameNode to Hadoop Client: return DNs, etc.

Hadoop Client to DNs: write blocks

DN to DN: replication pipelining

DNs to NameNode: block report

SNameNode: checkpoint of the NameNode's fsimage/edit
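Replication pipelining can be sketched as follows. This is a simplified Python model under stated assumptions, not HDFS code: `write_block` and the DataNode names are hypothetical, and real acknowledgements, packets, and failure handling are left out. The client writes to the first DataNode only, and each DataNode forwards the block to the next one.

```python
def write_block(block_id, data, pipeline, storage):
    """Store one block on every DataNode in the pipeline, in order.

    The client sends data only to the first DataNode; each DataNode
    stores the block and forwards it to the next one in the pipeline.
    Returns the list of DataNodes that stored the block.
    """
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, {})[block_id] = data  # store locally
    downstream = write_block(block_id, data, rest, storage)  # forward
    return [head] + downstream


storage = {}  # DataNode name -> {block id: bytes}
acks = write_block("blk_1", b"payload", ["dn1", "dn4", "dn5"], storage)
print(acks)             # ['dn1', 'dn4', 'dn5']
print(sorted(storage))  # ['dn1', 'dn4', 'dn5']
```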

Hadoop 1 - Running Jobs

Diagram: Hadoop Client, JobTracker, and Rack1 through RackN of DN | TT nodes

Hadoop Client to JobTracker: submit job

JobTracker to TTs: deploy job

Labels on the nodes: map, shuffle, reduce, part 0
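The map, shuffle, and reduce steps can be illustrated with a word count in plain Python. This is a single-process sketch, not the Hadoop MapReduce API; the function names and input lines are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values by key so each key reaches one reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts for one word."""
    return key, sum(values)


lines = ["a b a", "b c"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

On a cluster, each map call would run on a TaskTracker near its input block, and the shuffle would move data across the network between nodes.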

Hadoop 1 - Security

Diagram labels: Users, FIREWALL, LDAP/AD, Client Node/Spoke Server, KDC, Hadoop Cluster, Encryption Plugin

Flow labels: authN/authZ, service request, block token, delegation token

* The block token is for accessing data

* The delegation token is for running jobs

Hadoop 1 - APIs

org.apache.hadoop.mapreduce.Partitioner

org.apache.hadoop.mapreduce.Mapper

org.apache.hadoop.mapreduce.Reducer

org.apache.hadoop.mapreduce.Job
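As one illustration of what a Partitioner does, the sketch below mirrors in Python the arithmetic of Hadoop's default hash partitioner as I understand it: take the key's Java hash code, clear the sign bit, and take the modulus by the number of reduce tasks. The helper names are hypothetical, and `java_string_hash` assumes keys made of basic (non-surrogate) characters.

```python
def java_string_hash(s):
    """Reproduce Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # wrap to 32 bits like Java int
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key, num_reduce_tasks):
    """Pick a reducer index: clear the sign bit, then take the modulus."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks


print(java_string_hash("hello"))   # 99162322
print(hash_partition("hello", 4))  # 2
```

Every occurrence of the same key lands on the same reducer, which is what lets a Reducer see all values for a key.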

Hadoop 2

Potentially up to 10,000 nodes per cluster

Load on the ResourceManager grows as O(cluster size)

Supports multiple namespaces for managing HDFS

Efficient cluster utilization (YARN)

Backward and forward compatible with MRv1

Any application can integrate with Hadoop

Beyond Java

Hadoop 2 - Basics

Hadoop 2 - Reading Files (w/ NN Federation)

Diagram: Hadoop Client, federated NameNodes NN1/ns1, NN2/ns2, NN3/ns3, NN4/ns4, and Rack1 through RackN of DN | NM nodes

Hadoop Client to NN1/ns1: read file

NN1/ns1 to Hadoop Client: return DNs, block ids, etc.

Hadoop Client to DNs: read blocks

DNs to NameNodes: register/heartbeat/block report

Per NN: an SNameNode (fsimage/edit copy, checkpoint) or a Backup NN (fs sync, checkpoint)

Block Pools

ns1: dn1, dn2

ns2: dn1, dn3

ns3: dn4, dn5

ns4: dn4, dn5
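With federation, a client first has to know which namespace owns a path. The Python sketch below models that with a client-side mount table. The mount points are invented for illustration, and the block pool contents follow the dn labels on the slide as best they can be read (ns1 through ns4 paired in order with the four dn lists); treat both as assumptions.

```python
# Hypothetical client-side mount table: path prefix -> namespace.
MOUNT_TABLE = {
    "/user": "ns1",
    "/data": "ns2",
    "/tmp": "ns3",
    "/apps": "ns4",
}

# Block pools per namespace, following the dn labels on the slide.
BLOCK_POOLS = {
    "ns1": ["dn1", "dn2"],
    "ns2": ["dn1", "dn3"],
    "ns3": ["dn4", "dn5"],
    "ns4": ["dn4", "dn5"],
}


def resolve(path):
    """Return (namespace, DataNodes in its block pool) for a path."""
    for prefix, namespace in MOUNT_TABLE.items():
        if path == prefix or path.startswith(prefix + "/"):
            return namespace, BLOCK_POOLS[namespace]
    raise KeyError(f"no namespace mounted for {path}")


print(resolve("/data/clicks/2013-10-13.log"))  # ('ns2', ['dn1', 'dn3'])
```

Note that dn1 appears in two block pools: a DataNode can store blocks for more than one namespace.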

Hadoop 2 - Writing Files

Diagram: Hadoop Client, federated NameNodes NN1/ns1, NN2/ns2, NN3/ns3, NN4/ns4, and Rack1 through RackN of DN | NM nodes

Hadoop Client to NN1/ns1: request write

NN1/ns1 to Hadoop Client: return DNs, etc.

Hadoop Client to DNs: write blocks

DN to DN: replication pipelining

DNs to NameNodes: block report

Per NN: an SNameNode (fsimage/edit copy, checkpoint) or a Backup NN (fs sync, checkpoint)

Hadoop 2 - Running Jobs

Diagram: Hadoop Client 1 and Hadoop Client 2, the ResourceManager (ASM, Scheduler, queues), and Rack1 through RackN with three NodeManagers each

Hadoop Client 1: create app1, submit app1

Hadoop Client 2: create app2, submit app2

Containers on the NodeManagers: AM1 with C1.1, C1.2, C1.3, C1.4; AM2 with C2.1, C2.2, C2.3

Other labels: status report

ASM negotiates Containers

NM reports to ASM

Scheduler partitions Resources
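The container model can be sketched in Python as below. This is a toy first-fit allocator, not the YARN API or any real YARN scheduler: the class names reuse YARN terms, but the memory sizes, node names, and placement policy are assumptions made for the example.

```python
class NodeManager:
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []


class ResourceManager:
    """Toy scheduler: first-fit placement of containers by memory."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, container_id, memory_mb):
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                node.containers.append(container_id)
                return node.name
        return None  # no capacity; a real scheduler would queue the request


rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 4096)])

# app1: the ApplicationMaster gets a container first, then asks for workers.
placements = {}
for cid, mb in [("AM1", 1024), ("C1.1", 2048), ("C1.2", 2048), ("C1.3", 2048)]:
    placements[cid] = rm.allocate(cid, mb)
print(placements)
# {'AM1': 'nm1', 'C1.1': 'nm1', 'C1.2': 'nm2', 'C1.3': 'nm2'}
```

Unlike static map and reduce slots, a container here is just a slice of a node's resources, so any kind of task can use whatever capacity is free.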

Hadoop 2 - Security

Diagram labels: JDBC Client, REST Client, Browser (HUE), Native Hive/HBase, FIREWALL, DMZ, Knox Gateway Cluster, FIREWALL, Hadoop Cluster, LDAP/AD, KDC, Enterprise/Cloud SSO Provider, Encryption

Hadoop 2 - APIs

org.apache.hadoop.yarn.api.ApplicationClientProtocol

org.apache.hadoop.yarn.api.ApplicationMasterProtocol

org.apache.hadoop.yarn.api.ContainerManagementProtocol

Resources

http://hortonworks.com/products/hortonworks-sandbox/

http://hortonworks.com/products/hdp-2/

http://hortonworks.com/resources/

http://hadoopsummit.org/san-jose/

Hadoop Summit 2014

Thank you!

www.linkedin.com/in/rommelgarcia

twitter.com/rommelgarcia

rgarcia@hortonworks.com

Hortonworks