Hadoop 201 -- Deeper into the Elephant

Deeper into the elephant: a whirlwind tour of Hadoop ecosystem Roman Shaposhnik Director of Open Source @Pivotal (Twitter: @rhatr)

Transcript of Hadoop 201 -- Deeper into the Elephant

Page 1: Hadoop 201 -- Deeper into the Elephant

Deeper into the elephant: a whirlwind tour of Hadoop ecosystem

Roman Shaposhnik Director of Open Source @Pivotal

(Twitter: @rhatr)

Page 2: Hadoop 201 -- Deeper into the Elephant

Who’s this guy?

•  Director of Open Source, building a team of OS contributors

•  Apache Software Foundation guy (VP of Apache Incubator, ASF member, committer on Hadoop, Giraph, Sqoop, etc.)

•  Used to be root@Cloudera

•  Used to be PHB@Yahoo! (original Hadoop team)

•  Used to be a hacker at Sun Microsystems (Sun Studio compilers and tools)

Page 3: Hadoop 201 -- Deeper into the Elephant

Agenda


Page 4: Hadoop 201 -- Deeper into the Elephant

Agenda

Page 5: Hadoop 201 -- Deeper into the Elephant

Long, long time ago…

[Diagram: the original stack, HDFS and MapReduce. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]

Page 6: Hadoop 201 -- Deeper into the Elephant

In a blink of an eye:

[Diagram: the Hadoop ecosystem stack. Legend: ASF Projects, FLOSS Projects, Pivotal Products. Boxes shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Crunch, Mahout, Giraph, HBase, Phoenix, SolrCloud, Sqoop, Flume, Coordination and workflow management (Zookeeper, Oozie), Hadoop UI (Hue), Command Center, GemFire XD, Spark (Shark, Streaming, MLlib, GraphX), Impala, HAWQ, SpringXD, MADlib, PivotalR.]

Page 7: Hadoop 201 -- Deeper into the Elephant

Genesis of Hadoop

• Google papers on GFS and MapReduce

• A subproject of Apache Nutch

• A bet by Yahoo!

Page 8: Hadoop 201 -- Deeper into the Elephant

Data brings value

• What features to add to the product

• Data analysis must enable decisions

• V3: volume, velocity, variety

Page 9: Hadoop 201 -- Deeper into the Elephant

Big Data brings big value

Page 10: Hadoop 201 -- Deeper into the Elephant
Page 11: Hadoop 201 -- Deeper into the Elephant

Entering: Industrial Data

Page 12: Hadoop 201 -- Deeper into the Elephant

Hadoop’s childhood

• HDFS: Hadoop Distributed Filesystem

• MapReduce: computational framework

Page 13: Hadoop 201 -- Deeper into the Elephant
Page 14: Hadoop 201 -- Deeper into the Elephant

HDFS: not a POSIX filesystem

• Huge blocks: 64 MB (128 MB)

• Mostly immutable files (append, truncate)

• Streaming data access

• Block replication
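The block-size and replication bullets can be made concrete with a little arithmetic. The following is a minimal sketch, not HDFS code: it assumes a 128 MB block size and a replication factor of 3 (a common default that the slide does not state), and the function name `hdfs_footprint` is made up for illustration.

```python
MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (blocks, block replicas, raw bytes stored) for one file.

    Assumed defaults: 128 MB blocks, replication factor 3.
    """
    if file_size_bytes == 0:
        return 0, 0, 0
    # Integer ceiling division: a partial last block still counts as a block.
    blocks = -(-file_size_bytes // block_size_bytes)
    replicas = blocks * replication
    # The last block is not padded, so raw usage is just size * replication.
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


blocks, replicas, raw = hdfs_footprint(1000 * MB)
print(blocks, replicas, raw // MB)  # 8 24 3000
```

Large blocks keep the block count per file small, which is what makes streaming reads of huge files practical.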

Page 15: Hadoop 201 -- Deeper into the Elephant

How do I use it?

$ hadoop fs -lsr /
# hadoop-fuse-dfs dfs://hadoop-hdfs /mnt
$ ls /mnt
# mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt
$ ls /mnt

Page 16: Hadoop 201 -- Deeper into the Elephant

Principle #1

HDFS is the datalake

Page 17: Hadoop 201 -- Deeper into the Elephant

Pivotal’s Focus on Data Lakes

[Diagram: data lake architecture. Inputs: Traditional Data Sources (ERP, HR, SFDC) and New Data Sources/Formats (Machine). Data Lake: In-Memory Parallel Ingest; Raw "untouched" Data; Processed Data; Data Management (Search Engine); ELT Processing with Hadoop (HDFS, MapReduce/SQL/Pig/Hive); Security and Control. Outputs: Existing EDW / Datamarts; Analytical Data Marts/Sandboxes; In-Memory Services; BI / Analytical Tools. Business Users callouts: "Finally! I now have full transparency on the data with amazing speed!", "All data is now accessible!", "I can now afford 'Big Data'".]

Page 18: Hadoop 201 -- Deeper into the Elephant

HDFS enables the stack

[Diagram: the Hadoop ecosystem stack with HDFS at its base. Legend: ASF Projects, FLOSS Projects, Pivotal Products. Boxes shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Crunch, Mahout, Giraph, HBase, Phoenix, SolrCloud, Sqoop, Flume, Coordination and workflow management (Zookeeper, Oozie), Hadoop UI (Hue), Command Center, GemFire XD, Spark (Shark, Streaming, MLlib, GraphX), Impala, HAWQ, SpringXD, MADlib, PivotalR.]

Page 19: Hadoop 201 -- Deeper into the Elephant

Principle #2

Apps share their internal state

Page 20: Hadoop 201 -- Deeper into the Elephant

MapReduce

• Batch oriented (long jobs; final results)

• Brings the computation to the data

• Very constrained programming model

• Embarrassingly parallel programming model

• Used to be the only game in town for compute

Page 21: Hadoop 201 -- Deeper into the Elephant

MapReduce Overview

• Record = (Key, Value)

• Key : Comparable, Serializable

• Value: Serializable

• Logical Phases: Input, Map, Shuffle, Reduce, Output

Page 22: Hadoop 201 -- Deeper into the Elephant

Map

• Input: (Key1, Value1)

• Output: List(Key2, Value2)

• Projections, Filtering, Transformation

Page 23: Hadoop 201 -- Deeper into the Elephant

Shuffle

• Input: List(Key2, Value2)

• Output: Sort(Partition(List(Key2, List(Value2))))

• Provided by Hadoop; several customizations possible
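The partition, sort and group steps of the shuffle can be sketched in a few lines. This is an illustration under assumptions, not Hadoop's implementation: Hadoop's default partitioner hashes the key modulo the number of reducers, and here `zlib.crc32` stands in for that hash so the result is stable across runs. The helper names `partition` and `shuffle` are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # Stand-in hash (assumption): the same key always maps to the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_output, num_reducers):
    """Turn [(key, value), ...] into one sorted [(key, [values])] list per reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_output:
        buckets[partition(key, num_reducers)][key].append(value)
    # Within each partition, keys arrive at the reducer in sorted order.
    return [sorted(bucket.items()) for bucket in buckets]


map_output = [("a", 1), ("b", 1), ("c", 1), ("a", 1), ("c", 1), ("a", 1)]
for reducer_id, groups in enumerate(shuffle(map_output, 2)):
    print(reducer_id, groups)
```

Swapping the `partition` function is the kind of customization the last bullet refers to: it decides which reducer sees which keys.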

Page 24: Hadoop 201 -- Deeper into the Elephant

Reduce

• Input: List(Key2, List(Value2))

• Output: List(Key3, Value3)

• Aggregations
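Putting the Map, Shuffle and Reduce signatures together, a word count over two short input lines might look like the sketch below. It runs in a single process and only mimics the data flow; the function names and the sample input are chosen for illustration, and a plain sort stands in for the distributed shuffle.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(line):
    # (Key1, Value1) -> List(Key2, Value2); Key1 (the line offset) is ignored here.
    return [(word, 1) for word in line.split()]


def reduce_phase(key, values):
    # (Key2, List(Value2)) -> List(Key3, Value3): a simple aggregation.
    return [(key, sum(values))]


def run_job(lines):
    mapped = [kv for line in lines for kv in map_phase(line)]
    mapped.sort(key=itemgetter(0))  # stand-in for the shuffle's sort
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reduce_phase(key, [value for _, value in group]))
    return output


print(run_job(["a b c", "a c a"]))  # [('a', 3), ('b', 1), ('c', 2)]
```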

Page 25: Hadoop 201 -- Deeper into the Elephant

Anatomy of MapReduce

HDFS → mappers → reducers → HDFS

Input: "a b c", "a c a"

Mappers emit: (a 1, b 1, c 1) and (a 1, c 1, a 1)

Shuffle groups: a → 1 1 1; b → 1; c → 1 1

Reducers write: a 3, b 1, c 2

Page 26: Hadoop 201 -- Deeper into the Elephant

MapReduce DataFlow

Page 27: Hadoop 201 -- Deeper into the Elephant

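How do I use it?

// Mapper: emits (word, 1) for every token in the input line.
// Needs java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text
// and org.apache.hadoop.mapreduce.Mapper.
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}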

Page 28: Hadoop 201 -- Deeper into the Elephant

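How do I use it?

// Reducer: sums the counts collected for each word.
// Needs java.io.IOException, org.apache.hadoop.io.IntWritable,
// org.apache.hadoop.io.Text and org.apache.hadoop.mapreduce.Reducer.
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}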

Page 29: Hadoop 201 -- Deeper into the Elephant

How do I run it?

$ hadoop jar hadoop-examples.jar wordcount \
    input \
    output

Page 30: Hadoop 201 -- Deeper into the Elephant

Principle #3

MapReduce is assembly language of Hadoop

Page 31: Hadoop 201 -- Deeper into the Elephant

Hadoop’s childhood

• Compact (pretty much a single jar)

• Challenged in scalability and SPOFs

• Extremely batch oriented

• Hard for non-Java programmers

Page 32: Hadoop 201 -- Deeper into the Elephant

Then, something happened

Page 33: Hadoop 201 -- Deeper into the Elephant

Hadoop 1.0

[Diagram: HDFS and MapReduce. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]

Page 34: Hadoop 201 -- Deeper into the Elephant

Hadoop 2.0

[Diagram: HDFS, YARN, and MapReduce, Tez and Hamster on YARN. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]

Page 35: Hadoop 201 -- Deeper into the Elephant

Hadoop 2.0

• HDFS 2.0

• Yet Another Resource Negotiator (YARN)

• MapReduce is just an “application” now

• Tez is another “application”

• Pivotal’s Hamster (OpenMPI) yet another one

Page 36: Hadoop 201 -- Deeper into the Elephant

MapReduce 1.0

[Diagram: a single Job Tracker handing tasks (task1 … taskN) to Task Trackers, each labeled "(HDFS)".]

Page 37: Hadoop 201 -- Deeper into the Elephant

YARN (AKA MR2.0)

[Diagram: Resource Manager, Job Tracker, and a Task Tracker running tasks.]

Page 38: Hadoop 201 -- Deeper into the Elephant

YARN (AKA MR2.0)

[Diagram: Resource Manager, Job Tracker, and a Task Tracker running tasks.]

Page 39: Hadoop 201 -- Deeper into the Elephant

YARN

• Yet Another Resource Negotiator

• Resource Manager

• Node Managers

• Application Masters

• Specific to paradigm, e.g. MR Application master (aka JobTracker)
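The division of labor between the three YARN roles can be illustrated with a toy model. This is a sketch under loud assumptions, not the YARN API: the class and method names are invented, only memory is tracked, and the placement rule (pick the node with the most free memory) is made up, whereas real YARN schedulers are pluggable.

```python
class NodeManager:
    """Per-machine agent; here it only tracks free memory (assumption)."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb


class ResourceManager:
    """Cluster-wide arbiter that hands out containers."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb):
        """Grant one container on the node with the most free memory, or None."""
        node = max(self.nodes, key=lambda n: n.free_mb)
        if node.free_mb < memory_mb:
            return None
        node.free_mb -= memory_mb
        return (node.name, memory_mb)


class ApplicationMaster:
    """Paradigm-specific logic: decides how many containers a job needs."""

    def __init__(self, rm):
        self.rm = rm

    def run(self, num_tasks, memory_per_task_mb):
        granted = []
        for _ in range(num_tasks):
            container = self.rm.allocate(memory_per_task_mb)
            if container is None:
                break  # a real application master would wait and retry
            granted.append(container)
        return granted


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 2048)])
containers = ApplicationMaster(rm).run(num_tasks=4, memory_per_task_mb=1024)
print(containers)
```

The point of the split is that the resource manager knows nothing about MapReduce; any framework that can express its needs as container requests can share the cluster.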

Page 40: Hadoop 201 -- Deeper into the Elephant

YARN: beyond MR

Resource Manager

MPI

MPI

Page 41: Hadoop 201 -- Deeper into the Elephant

Hamster

•  Hadoop and MPI on the same cluster

•  OpenMPI Runtime on Hadoop YARN

•  Hadoop provides: resource scheduling, process monitoring, distributed file system

•  Open MPI provides: process launching, communication, I/O forwarding

Page 42: Hadoop 201 -- Deeper into the Elephant

Hamster Components

• Hamster Application Master

• Gang Scheduler, YARN Application Preemption

• Resource Isolation (lxc Containers)

• ORTE: Hamster Runtime

• Process launching, Wireup, Interconnect

Page 43: Hadoop 201 -- Deeper into the Elephant

Hamster Architecture

Page 44: Hadoop 201 -- Deeper into the Elephant

Hadoop 2.0

[Diagram: HDFS, YARN, and MapReduce, Tez and Hamster on YARN. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]

Page 45: Hadoop 201 -- Deeper into the Elephant

Hadoop ecosystem

[Diagram: the Hadoop ecosystem stack. Legend: ASF Projects, FLOSS Projects, Pivotal Products. Boxes shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Crunch, Mahout, Giraph, HBase, Phoenix, SolrCloud, Sqoop, Flume, Coordination and workflow management (Zookeeper, Oozie), Hadoop UI (Hue), Command Center.]

Page 46: Hadoop 201 -- Deeper into the Elephant

There’s way too much stuff

• Tracking dependencies

• Integration testing

• Optimizing the defaults

• Rationalizing the behaviour

Page 47: Hadoop 201 -- Deeper into the Elephant

Wait! We’ve seen this!

GNU Software

Linux kernel

Page 48: Hadoop 201 -- Deeper into the Elephant

Apache Bigtop

Hadoop ecosystem (HBase, Pig, Hive)

Hadoop (HDFS, YARN, MR)

Page 49: Hadoop 201 -- Deeper into the Elephant

Principle #4

Apache Bigtop is how the Hadoop distros get defined

Page 50: Hadoop 201 -- Deeper into the Elephant

The ecosystem

• Apache HBase

• Apache Crunch, Pig, Hive and Phoenix

• Apache Giraph

• Apache Oozie

• Apache Mahout

• Apache Sqoop and Flume

Page 51: Hadoop 201 -- Deeper into the Elephant

Apache HBase

• Small mutable records vs. HDFS files

• HFiles kept in HDFS

• Memcached for HDFS

• Built on HDFS and Zookeeper

• Google’s Bigtable

Page 52: Hadoop 201 -- Deeper into the Elephant

HBase data model

• Driven by the original Webtable use case:

Row key        Column          Value
com.cnn.www    content:        <html>...
com.cnn.www    anchor:a.com    CNN
com.cnn.www    anchor:b.com    CNN.co
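One way to picture this data model is as a sorted, nested map: row key, then column (family:qualifier), then timestamp. The sketch below is an in-memory toy written to show that shape; `ToyTable` and its methods are invented for illustration and are not the HBase client API, and the second `content:` version is made-up sample data.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> "family:qualifier" -> {timestamp: value}."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts):
        self.rows[row][column][ts] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if it does not exist."""
        versions = self.rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield (row, columns) for start_row <= row < stop_row, in key order."""
        for row in sorted(self.rows):
            if start_row <= row < stop_row:
                yield row, dict(self.rows[row])


table = ToyTable()
table.put("com.cnn.www", "content:", "<html>...", ts=1)
table.put("com.cnn.www", "anchor:a.com", "CNN", ts=1)
table.put("com.cnn.www", "content:", "<html>v2", ts=2)  # hypothetical newer version

print(table.get("com.cnn.www", "content:"))            # <html>v2
print([row for row, _ in table.scan("com.a", "com.d")])  # ['com.cnn.www']
```

Reversed domain names as row keys keep pages from the same site adjacent in key order, so a range scan can fetch them together.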

Page 53: Hadoop 201 -- Deeper into the Elephant

How do I use it?

HTable table = new HTable(config, "table");
Put p = new Put(Bytes.toBytes("row"));
p.add(Bytes.toBytes("family"),
      Bytes.toBytes("qualifier"),
      Bytes.toBytes("data"));
table.put(p);

Page 54: Hadoop 201 -- Deeper into the Elephant

Dataflow model

HBase

HDFS

Producer Consumer

Page 55: Hadoop 201 -- Deeper into the Elephant

When do I use it?

• Serving up large amounts of data

• Fast random access

• Scan operations

Page 56: Hadoop 201 -- Deeper into the Elephant

Principle #5

HBase: when you need OLAP + OLTP

Page 57: Hadoop 201 -- Deeper into the Elephant

What if it's OLTP?

[Diagram: the Hadoop ecosystem stack. Legend: ASF Projects, FLOSS Projects, Pivotal Products. Boxes shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Crunch, Mahout, Giraph, HBase, Phoenix, SolrCloud, Sqoop, Flume, Coordination and workflow management (Zookeeper, Oozie), Hadoop UI (Hue), Command Center.]

Page 58: Hadoop 201 -- Deeper into the Elephant

GemFire XD

[Diagram: the Hadoop ecosystem stack with GemFire XD added. Legend: ASF Projects, FLOSS Projects, Pivotal Products. Boxes shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Crunch, Mahout, Giraph, HBase, Phoenix, SolrCloud, Sqoop, Flume, Coordination and workflow management (Zookeeper, Oozie), Hadoop UI (Hue), Command Center, GemFire XD.]

Page 59: Hadoop 201 -- Deeper into the Elephant

GemFire XD: a better HBase?

• Closed source but extremely mature

• SQL/Objects/JSON data model

• High concurrency, high update load

• Mostly selective point queries (no scans)

• Tiered storage architecture

Page 60: Hadoop 201 -- Deeper into the Elephant

YCSB Benchmark; Throughput is 2-12X

[Charts: throughput (ops/sec, axis 0 to 800,000) for YCSB workloads AU, BU, CU, D, FU and LOAD; one panel for HBase and one for GemFire XD, each with series labeled 4, 8, 12 and 16.]

Page 61: Hadoop 201 -- Deeper into the Elephant

YCSB Benchmark; Latency is 2X – 20X better

[Charts: latency (μsec, axis 0 to 14,000); one panel for HBase and one for GemFire XD, each with series labeled 4, 8, 12 and 16.]

Page 62: Hadoop 201 -- Deeper into the Elephant

Principle #6

There are always 3 implementations

Page 63: Hadoop 201 -- Deeper into the Elephant

Querying data

• MapReduce: “an assembly language”

• Apache Pig: a data manipulation DSL (now Turing complete!)

• Apache Hive: a batch-oriented SQL on top of Hadoop

Page 64: Hadoop 201 -- Deeper into the Elephant

How do I use Pig?

grunt> A = load './input.txt';

grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

grunt> C = group B by word;

grunt> D = foreach C generate COUNT(B), group;

Page 65: Hadoop 201 -- Deeper into the Elephant

How do I use Hive?

CREATE TABLE docs (line STRING);

LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;

Page 66: Hadoop 201 -- Deeper into the Elephant

Can we short Oracle now?

• No indexing

• Batch oriented scheduling

• Optimization for long running queries

• Metadata management is still in flux

Page 67: Hadoop 201 -- Deeper into the Elephant

[Close to] real-time SQL

• Impala (inspired by Google’s F1)

• Hive/Tez (AKA Stinger)

• Facebook’s Presto (Hive’s lineage)

• Pivotal’s HAWQ

Page 68: Hadoop 201 -- Deeper into the Elephant

HAWQ

• GreenPlum MPP database core

• True ANSI SQL support

• HDFS storage backend

• Parquet support

Page 69: Hadoop 201 -- Deeper into the Elephant

Principle #7

SQL on Hadoop

Page 70: Hadoop 201 -- Deeper into the Elephant

Feeding the elephant

Page 71: Hadoop 201 -- Deeper into the Elephant

Getting data in: Flume

• Designed for collecting log data

• Flexible deployment topology

Page 72: Hadoop 201 -- Deeper into the Elephant

Sqoop: RDBMS connection

• Sqoop 1

• A MapReduce tool

• Must use Oozie for workflows

• Sqoop 2

• Well, 0.99.x really

• A standalone service

Page 73: Hadoop 201 -- Deeper into the Elephant

Spring XD

• Unified, distributed, extensible system for data ingestion, real-time analytics and data export

• Apache Licensed, not ASF

• A runtime service, not a library

• AKA “Oozie + Flume + Sqoop + Morphlines”

Page 74: Hadoop 201 -- Deeper into the Elephant

How do I use it?

# deployment: ./xd-singlenode

$ ./xd-shell

xd:> hadoop config fs --namenode hdfs://nn:8020

xd:> stream create --definition "time | hdfs" --name ticktock

xd:> stream destroy --name ticktock

Page 75: Hadoop 201 -- Deeper into the Elephant

Feeding the Elephant

[Diagram: the Hadoop ecosystem stack with GemFire XD and SpringXD added. Legend: ASF Projects, FLOSS Projects, Pivotal Products. Boxes shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Crunch, Mahout, Giraph, HBase, Phoenix, SolrCloud, Sqoop, Flume, Coordination and workflow management (Zookeeper, Oozie), Hadoop UI (Hue), Command Center, GemFire XD, SpringXD.]

Page 76: Hadoop 201 -- Deeper into the Elephant

Spark the disruptor

[Diagram: the Hadoop ecosystem stack with Spark added. Legend: ASF Projects, FLOSS Projects, Pivotal Products. Boxes shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Crunch, Mahout, Giraph, HBase, Phoenix, SolrCloud, Sqoop, Flume, Coordination and workflow management (Zookeeper, Oozie), Hadoop UI (Hue), Command Center, GemFire XD, SpringXD, Spark (Shark, Streaming, MLlib, GraphX).]

Page 77: Hadoop 201 -- Deeper into the Elephant

What’s wrong with MR?

Source: UC Berkeley Spark project (just the image)

Page 78: Hadoop 201 -- Deeper into the Elephant

Spark innovations

• Resilient Distributed Datasets (RDDs)

• Distributed on a cluster

• Manipulated via parallel operators (map, etc.)

• Automatically rebuilt on failure

• A parallel ecosystem

• A solution to iterative and multi-stage apps
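The "automatically rebuilt on failure" bullet rests on lineage: a dataset remembers the chain of operations that produced it, so lost data can be recomputed from the source rather than restored from a replica. The toy class below sketches that idea in one process; `ToyRDD` and its methods are invented for illustration and are not the Spark API, and the log lines are made-up sample data.

```python
class ToyRDD:
    """Lazily evaluated dataset that records how it was derived."""

    def __init__(self, compute, parent=None, op="source"):
        self._compute = compute  # function that (re)builds the data on demand
        self.parent = parent
        self.op = op

    @staticmethod
    def from_lines(lines):
        return ToyRDD(lambda: iter(lines))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)), self, "filter")

    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._compute()), self, "map")

    def collect(self):
        # Nothing is cached: every call replays the whole lineage from the source.
        return list(self._compute())

    def lineage(self):
        node, ops = self, []
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))


log = ["info boot ok", "warning disk nearly full", "warning fan slow"]
warnings = (ToyRDD.from_lines(log)
            .filter(lambda line: "warning" in line)
            .map(lambda line: line.split(" ")[1]))

print(warnings.lineage())  # ['source', 'filter', 'map']
print(warnings.collect())  # ['disk', 'fan']
print(warnings.collect())  # same result, rebuilt from the source again
```

Because each step is a deterministic function of its parent, recomputation gives the same answer, which is what makes lineage a substitute for replication.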

Page 79: Hadoop 201 -- Deeper into the Elephant

RDDs

warnings = textFile(…).filter(_.contains("warning")).map(_.split(' ')(1))

HadoopRDD (path = hdfs://) → FilteredRDD (contains…) → MappedRDD (split…)

Page 80: Hadoop 201 -- Deeper into the Elephant

Parallel operators

• map, reduce

• sample, filter

• groupBy, reduceByKey

• join, leftOuterJoin, rightOuterJoin

• union, cross

Page 81: Hadoop 201 -- Deeper into the Elephant

An alternative backend

• Shark: a Hive on Spark (now Spark SQL)

• Spork: a Pig on Spark

• MLlib: machine learning on Spark

• GraphX: Graph processing on Spark

• Also featuring its own streaming engine

Page 82: Hadoop 201 -- Deeper into the Elephant

How do I use it?

val file = spark.textFile("hdfs://...")

val counts = file.flatMap(line => line.split(" "))

.map(word => (word, 1))

.reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://...")

Page 83: Hadoop 201 -- Deeper into the Elephant

Principle #8

Spark is the technology of 2014

Page 84: Hadoop 201 -- Deeper into the Elephant

Where’s the cloud?

Page 85: Hadoop 201 -- Deeper into the Elephant

What’s new?

• True elasticity

• Resource partitioning

• Security

• Data marketplace

• Data leaks/breaches

Page 86: Hadoop 201 -- Deeper into the Elephant

Hadoop Maturity

ETL Offload: accommodate massive data growth with existing EDW investments

Data Lakes: unify unstructured and structured data access

Big Data Apps: build analytic-led applications impacting top-line revenue

Data-Driven Enterprise: app dev and operational management on HDFS

Data Architecture

Page 87: Hadoop 201 -- Deeper into the Elephant

Pivotal HD on Pivotal CF

• Enterprise PaaS Management System

• Flexible multi-language 'buildpack' architecture

• Deployed applications enjoy built-in services

• On-Premise Hadoop as a Service

• Single cluster deployment of Pivotal HD

• Developers instantly bind to shared Hadoop Clusters

• Speeds up time-to-value

Page 88: Hadoop 201 -- Deeper into the Elephant

Pivotal Data Fabric Evolution

[Diagram: Pivotal Data Platform. Labels: Analytic Data Marts, SQL Services, Operational Intelligence, In-Memory Database, Run-Time Applications, In-Memory Grid, Data Staging Platform, Data Mgmt. Services, Stream Ingestion, Streaming Services, Software-Defined Datacenter, New Data-fabrics, ...ETC.]

Page 89: Hadoop 201 -- Deeper into the Elephant

Principle #9

Hadoop in the Cloud is one of many distributed frameworks

Page 90: Hadoop 201 -- Deeper into the Elephant

2014 is the year of Hadoop

[Diagram: the full Hadoop ecosystem stack. Legend: ASF Projects, FLOSS Projects, Pivotal Products. Boxes shown: HDFS, YARN, MapReduce, Tez, Hamster, Pig, Hive, Crunch, Mahout, Giraph, HBase, Phoenix, SolrCloud, Sqoop, Flume, Coordination and workflow management (Zookeeper, Oozie), Hadoop UI (Hue), Command Center, GemFire XD, Spark (Shark, Streaming, MLlib, GraphX), Impala, HAWQ, SpringXD, MADlib, PivotalR.]

Page 91: Hadoop 201 -- Deeper into the Elephant

A NEW PLATFORM FOR A NEW ERA


Page 92: Hadoop 201 -- Deeper into the Elephant

Credits

• Apache Software Foundation

• Milind Bhandarkar

• Konstantin Boudnik

• Robert Geiger

• Susheel Kaushik

• Mak Gokhale

Page 93: Hadoop 201 -- Deeper into the Elephant

Questions?