Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition
big-data-joe-rossi
© 2014 Trace3, All rights reserved.
BIG DATA INTELLIGENCE PRACTICE
HADOOP: PAST, PRESENT AND FUTURE
Roadmap
~1 hour
1 - What Makes Up Hadoop 1.x?
2 - What’s New In Hadoop 2.x?
3 - The Future Of Hadoop …
WHAT MAKES UP HADOOP 1.0?
What’s a “Node”?
Node aka Server
Compute
Storage
Processes / Daemons / Services
Memory
Operating System
Hadoop 1.0: HDFS + MapReduce
[Diagram: a Client with items labeled 1-1, 1-2 and 1-3; a NameNode and a JobTracker; four worker nodes, each running DataNode / TaskTracker]
Hadoop 1.0: HDFS + MapReduce
[Diagram: the same cluster (Client with 1-1, 1-2 and 1-3; NameNode; JobTracker), now with labels 2-1, 3-2, 3-3, 4-1, 2-3, 4-2, 2-2, 3-1 and 4-3 spread across the four DataNode / TaskTracker nodes, which run Map and Reduce tasks]
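The Map and Reduce tasks in the diagram can be sketched in plain Python. This is only an illustration of the programming model, not Hadoop code: the `blocks` data, the sensor readings, and the function names `map_task`, `shuffle` and `reduce_task` are all made up for the example, and everything runs in one process instead of on TaskTrackers.

```python
from collections import defaultdict

# Hypothetical input: (sensor_id, reading) records split across two blocks,
# standing in for file blocks that separate map tasks would process.
blocks = [
    [("s1", 20), ("s2", 35), ("s1", 27)],
    [("s2", 31), ("s3", 12), ("s1", 25)],
]


def map_task(block):
    """Map: emit one (key, value) pair per input record."""
    for sensor_id, reading in block:
        yield sensor_id, reading


def shuffle(mapped_pairs):
    """Shuffle: group the values of every map task's output by key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: collapse all values of one key into a single result."""
    return key, max(values)


# Run the three phases in order: map every block, shuffle, then reduce.
mapped = [pair for block in blocks for pair in map_task(block)]
grouped = shuffle(mapped)
result = dict(reduce_task(key, values) for key, values in grouped.items())
print(result)  # highest reading seen per sensor
```

In a real cluster each block would be mapped on the node that stores it, and the shuffle would move data between nodes; the sketch keeps only the data flow.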
MapReduce v1 Limitations
Scalability: Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000
Availability: JobTracker failure kills all queued and running jobs
Resources Partitioned into Map and Reduce: Hard partitioning of Map and Reduce slots led to low resource utilization
No Support for Alternate Paradigms / Services: Only MapReduce batch jobs, nothing else
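The utilization point can be made concrete with a little arithmetic. The sketch below is a deliberately simplified model, assuming every task needs exactly one slot or one container of the same size; the function names and the workload numbers are hypothetical and do not come from Hadoop.

```python
def slot_utilization(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Fraction of a node in use when slots are hard-partitioned by type.

    A map task can only use a map slot and a reduce task only a reduce
    slot, so idle slots of one type cannot absorb work of the other type.
    """
    running = min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)
    return running / (map_slots + reduce_slots)


def container_utilization(containers, pending_maps, pending_reduces):
    """Fraction of a node in use when any task can take any free container."""
    running = min(containers, pending_maps + pending_reduces)
    return running / containers


# Hypothetical map-heavy workload on a node with capacity for 8 tasks.
fixed = slot_utilization(map_slots=4, reduce_slots=4,
                         pending_maps=7, pending_reduces=1)
shared = container_utilization(containers=8,
                               pending_maps=7, pending_reduces=1)
print(fixed, shared)
```

With these numbers the partitioned node runs 5 of 8 possible tasks (three reduce slots sit idle while three map tasks wait), whereas the shared pool runs all 8.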
Hadoop 1.0: Single Use System
HADOOP 1.0
Single Use System: Batch Apps
HDFS (redundant, reliable storage)
MapReduce (cluster resource management and data processing)
Pig, Hive
WHAT’S NEW IN HADOOP 2.0?
YARN
YARN Replaces MapReduce for Cluster Resource Management
Yet Another Resource Negotiator
YARN will be the de facto distributed operating system for Big Data
YARN = BIG DATA
YARN: No Longer Just Batch Apps
Store DATA in one place
Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service
Applications Run Natively IN Hadoop
HDFS2 (redundant, reliable storage)
YARN (cluster resource management)
BATCH (MapReduce)
INTERACTIVE (Tez)
ONLINE (HBase)
STREAMING (DataTorrent)
GRAPH (Giraph)
YARN: Applications
All running on the same Hadoop cluster, giving applications access to the same source data!
MapReduce v2
Real-Time Stream Processing
Master-Worker Online
In-Memory
Apache Storm
YARN: Quickly Maturing
Timeline: 2010, 2011, 2012, 2013, 2014 (Today)
Conceived at Yahoo!
Alpha Releases – 2.0
Beta Releases – 2.1
GA Released – 2.2
200,000+ nodes, 800,000+ jobs daily, 10 million+ hours of compute daily
Version 2.3
Version 2.4
Version 2.5
YARN: What Has Changed?
YARN                                        MRv1
RM ResourceManager, AM ApplicationMaster    JT JobTracker
Scheduler                                   Scheduler
NM NodeManager                              TT TaskTracker
Container                                   Map & Reduce Slot
[Diagram: the YARN side shows a ResourceManager with a Scheduler above NodeManagers hosting an ApplicationMaster and Containers; the MRv1 side shows a JobTracker with a Scheduler above TaskTrackers with Map and Reduce slots]
The 6 Benefits Of YARN
• Scale
• New programming models and services
• Improved cluster utilization
• Agility
• Backwards compatible with MapReduce v1
• Mixed workloads on the same source of data
THE FUTURE OF HADOOP
SQL on Hadoop
Speed: Deliver interactive query performance.
SQL: Support an array of SQL semantics for analytic applications running against Hadoop.
Scale: SQL interface to Hadoop designed for queries that scale from Terabytes to Petabytes.
SQL on Hadoop
Hive on Apache Tez (Hortonworks HDP2)
Hive on Apache Spark (Cloudera CDH5)
Apache Drill (MapR M7)
Cloudera Impala (Cloudera CDH5)
Pivotal HAWQ (Pivotal Big Data Suite)
Apache Spark
HDFS2 (redundant, reliable storage)
YARN (cluster resource management)
Apache Spark (Databricks)
Programming Languages: Java, Scala, Python, R*
Interactive Shell: Ability to write code and get output.
Faster by up to ~100x: Due to how it handles data in memory.
Apache Spark – Wordcount
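A word count is usually written in Spark as three chained steps: split lines into words, pair each word with 1, then sum the pairs per word. The sketch below walks through those steps using only the Python standard library so that it runs anywhere; it is not Spark code, and the `lines` input and the `add_pair` helper are made up for illustration. The comments name the PySpark RDD methods (`flatMap`, `map`, `reduceByKey`) that each step would typically correspond to.

```python
from functools import reduce
from itertools import chain

# Hypothetical input; in Spark this would come from a file in HDFS.
lines = ["to be or not to be", "that is the question"]

# Step 1 (flatMap in Spark terms): split every line into words.
words = list(chain.from_iterable(line.split() for line in lines))

# Step 2 (map): pair each word with a count of 1.
pairs = [(word, 1) for word in words]


# Step 3 (reduceByKey): sum the counts for each distinct word.
def add_pair(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts


counts = reduce(add_pair, pairs, {})
print(counts)
```

In Spark the same three steps would be distributed across the cluster, with intermediate data kept in memory where possible.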
HOYA: HBase (NoSQL) on YARN
Dynamic Scaling: On-demand cluster size. Increase and decrease the size with load.
Easier Deployment: APIs to create, start, stop and delete HBase clusters.
Availability: Recover from Region Server loss with a new container.
Apache REEF
Retainable Evaluator Execution Framework
Machine Learning: Framework well suited for building machine learning jobs.
Scalable / Fault Tolerant: Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models.
Maintain State: Users can build jobs that utilize data from where it’s needed and also maintain state after jobs are done.
Real-Time Stream Processing
Apache Storm
Streaming
Heterogeneous Storage
THEN: NameNode, Storage
NOW: NameNode, SATA, SSD, Fusion IO
Hadoop Roadmap
• Apache Hadoop 2.5 (Q3 2014)
– NodeManager Restart w/o disruption
• Apache Hadoop 2.6 (Q4 2014)
– Memory As Storage Tier
– Dynamic Resource Configuration
– Support For Docker Containers
I KNOW YOU HAVE QUESTIONS
THANK YOU!
http://bigdatajoe.io/
http://bigdatacentric.com/
@bigdatajoerossi