Hadoop 201 -- Deeper into the Elephant


Deeper into the elephant: a whirlwind tour of the Hadoop ecosystem

Roman Shaposhnik Director of Open Source @Pivotal

(Twitter: @rhatr)

Who’s this guy?

•  Director of Open Source, building a team of OS contributors

•  Apache Software Foundation guy (VP of Apache Incubator, ASF member, committer on Hadoop, Giraph, Sqoop, etc)

•  Used to be root@Cloudera

•  Used to be PHB@Yahoo! (original Hadoop team)

•  Used to be a hacker at Sun Microsystems (Sun Studio compilers and tools)

Agenda

Long, long time ago…

[Diagram: just HDFS and MapReduce. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]

In a blink of an eye:

[Stack diagram: HDFS and YARN at the base; MapReduce, Tez and Hamster running on YARN; Pig, Hive, Sqoop, Flume, Oozie, Zookeeper (coordination and workflow management), Hue (Hadoop UI), SolrCloud, Phoenix, HBase, Crunch, Mahout, Giraph, Spark (Shark, Streaming, MLlib, GraphX), Impala, HAWQ, SpringXD, MADlib, PivotalR, GemFire XD and Command Center layered above. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]

Genesis of Hadoop

• Google papers on GFS and MapReduce

• A subproject of Apache Nutch

• A bet by Yahoo!

Data brings value

• Decide what features to add to the product

• Data analysis must enable decisions

• The three V's: volume, velocity, variety

Big Data brings big value

Entering: Industrial Data

Hadoop’s childhood

• HDFS: Hadoop Distributed Filesystem

• MapReduce: computational framework

HDFS: not a POSIX fs

• Huge blocks: 64 MB (or 128 MB)

• Mostly immutable files (append, truncate)

• Streaming data access

• Block replication

How do I use it?

$ hadoop fs -lsr /
# hadoop-fuse-dfs dfs://hadoop-hdfs /mnt
$ ls /mnt
# mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt
$ ls /mnt
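The same data is reachable programmatically. A minimal sketch using the HDFS Java API (org.apache.hadoop.fs); the NameNode address and file path are assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://nn:8020"); // assumed NameNode address
    FileSystem fs = FileSystem.get(conf);
    // Stream a file off HDFS, line by line.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/user/demo/input.txt"))))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}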

Principle #1

HDFS is the datalake

Pivotal’s Focus on Data Lakes

[Diagram: parallel ingest from traditional data sources (ERP, HR, SFDC, existing EDW/data marts) and new data sources/formats (machine data) into a Data Lake holding raw "untouched" data alongside processed data, with in-memory services, data management (search engine) and BI/analytical tools on top. Business users: "Finally! I now have full transparency on the data with amazing speed!", "All data is now accessible!", "I can now afford Big Data!"]

ELT Processing with Hadoop

[Diagram: data lands in HDFS, gets transformed with MapReduce/SQL/Pig/Hive, and feeds analytical data marts/sandboxes, all under common security and control.]

HDFS enables the stack

[The same ecosystem stack diagram as above: every layer of the stack sits on top of HDFS.]

Principle #2

Apps share their internal state

MapReduce

• Batch oriented (long jobs; final results)

• Brings the computation to the data

• Very constrained programming model

• Embarrassingly parallel programming model

• Used to be the only game in town for compute

MapReduce Overview

• Record = (Key, Value)

• Key: Comparable and Serializable (Hadoop's WritableComparable; sketched below)

• Value: Serializable (Hadoop's Writable)

• Logical phases: Input, Map, Shuffle, Reduce, Output
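In Hadoop terms, the key contract means implementing WritableComparable; a minimal sketch (the class and its field are made up for illustration):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type: a year that serializes compactly and sorts naturally.
public class YearKey implements WritableComparable<YearKey> {
  private int year;

  public void write(DataOutput out) throws IOException {
    out.writeInt(year);               // serialization
  }

  public void readFields(DataInput in) throws IOException {
    year = in.readInt();              // deserialization
  }

  public int compareTo(YearKey other) {
    return Integer.compare(year, other.year); // sort order used by the shuffle
  }

  public int hashCode() {
    return year;                      // the default HashPartitioner relies on this
  }
}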

Map

• Input: (Key1, Value1)

• Output: List(Key2, Value2)

• Projections, Filtering, Transformation

Shuffle

• Input: List(Key2, Value2)

• Output

• Sort(Partition(List(Key2, List(Value2))))

• Provided by Hadoop; several customizations are possible, e.g. the custom Partitioner sketched below
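The most common shuffle customization is routing keys to reducers yourself; a hedged sketch (the class is illustrative, plugged in via job.setPartitionerClass):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical: route words by first letter, so all words sharing an
// initial land on the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    char first = s.isEmpty() ? '\0' : Character.toLowerCase(s.charAt(0));
    return (first & Integer.MAX_VALUE) % numPartitions; // always in range
  }
}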

Reduce

• Input: List(Key2, List(Value2))

• Output: List(Key3, Value3)

• Aggregations

Anatomy of MapReduce

[Diagram: word-count dataflow. Mappers read blocks from HDFS and emit (word, 1) pairs; the shuffle groups the pairs by word; reducers sum the counts (e.g. a 3, b 1, c 2) and write the results back to HDFS.]

MapReduce DataFlow

How do I use it?

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

How do I use it?

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

How do I run it?

$ hadoop jar hadoop-examples.jar wordcount \
    input \
    output

Principle #3

MapReduce is the assembly language of Hadoop

Hadoop’s childhood

• Compact (pretty much a single jar)

• Challenged by scalability limits and SPOFs

• Extremely batch oriented

• Hard for non-Java programmers

Then, something happened

Hadoop 1.0

[Diagram: MapReduce running directly on HDFS. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]

Hadoop 2.0

[Diagram: HDFS at the base, YARN on top, and MapReduce, Tez and Hamster running as YARN applications.]

Hadoop 2.0

• HDFS 2.0

• Yet Another Resource Negotiator (YARN)

• MapReduce is just an “application” now

• Tez is another “application”

• Pivotal’s Hamster (OpenMPI) yet another one

MapReduce 1.0

[Diagram: a single JobTracker schedules task slots (task1 … taskN) onto TaskTrackers co-located with HDFS.]

YARN (AKA MR2.0)

[Diagram: a global ResourceManager hands out containers; the per-job JobTracker runs in a container and manages task containers on the worker nodes.]

YARN

• Yet Another Resource Negotiator

• Resource Manager

• Node Managers

• Application Masters

• Application Masters are paradigm-specific, e.g. the MR Application Master (aka JobTracker); a client-side sketch follows
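From code, any application can ask the ResourceManager about the cluster through the YARN client API; a minimal sketch using org.apache.hadoop.yarn.client.api.YarnClient:

import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListNodes {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration()); // picks up yarn-site.xml
    yarn.start();
    // Ask the ResourceManager for every healthy NodeManager.
    List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
    for (NodeReport n : nodes) {
      System.out.println(n.getNodeId() + " -> " + n.getCapability());
    }
    yarn.stop();
  }
}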

YARN: beyond MR

[Diagram: the ResourceManager hosting non-MapReduce workloads, e.g. MPI processes.]

Hamster

•  Hadoop and MPI on the same cluster

•  OpenMPI Runtime on Hadoop YARN

•  Hadoop provides: resource scheduling, process monitoring, distributed file system

•  Open MPI provides: process launching, communication, I/O forwarding

Hamster Components

• Hamster Application Master

• Gang Scheduler, YARN Application Preemption

• Resource Isolation (LXC containers)

• ORTE: Hamster Runtime

• Process launching, Wireup, Interconnect

Hamster Architecture

Hadoop 2.0

[The same Hadoop 2.0 stack diagram as above.]

Hadoop ecosystem

[Stack diagram: the core ASF ecosystem on HDFS and YARN: MapReduce, Tez, Hamster, Pig, Hive, Sqoop, Flume, Oozie, Zookeeper, Hue, SolrCloud, Phoenix, HBase, Crunch, Mahout, Giraph, plus Command Center.]

There’s way too much stuff

• Tracking dependencies

• Integration testing

• Optimizing the defaults

• Rationalizing the behaviour

Wait! We’ve seen this!

GNU software : Linux kernel :: Apache Bigtop and the Hadoop ecosystem (HBase, Pig, Hive) : Hadoop (HDFS, YARN, MR)

Principle #4

Apache Bigtop is how the Hadoop distros get defined

The ecosystem

• Apache HBase

• Apache Crunch, Pig, Hive and Phoenix

• Apache Giraph

• Apache Oozie

• Apache Mahout

• Apache Sqoop and Flume

Apache HBase

• Small mutable records vs. HDFS files

• HFiles kept in HDFS

• "Memcached for HDFS"

• Built on HDFS and Zookeeper

• Modeled on Google's Bigtable

HBase data model

• Driven by the original Webtable use case:

[Table: row key "com.cnn.www"; the "content:" family holds the page HTML ("<html>…"); the "anchor:" family holds one column per referring site, e.g. anchor:a.com = "CNN", anchor:b.com = "CNN.co…".]

How do I use it?

HTable table = new HTable(config, "table");
Put p = new Put(Bytes.toBytes("row"));
p.add(Bytes.toBytes("family"),
      Bytes.toBytes("qualifier"),
      Bytes.toBytes("data"));
table.put(p);

Dataflow model

[Diagram: producers write into HBase, HBase persists HFiles to HDFS, consumers read back out.]

When do I use it?

• Serving up large amounts of data

• Fast random access

• Scan operations (both access patterns are sketched below)
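Continuing with the table from the earlier snippet, a hedged sketch of both access patterns (row keys, family and qualifier are placeholders):

// Random access: fetch a single row by key.
Get g = new Get(Bytes.toBytes("row"));
Result r = table.get(g);
byte[] cell = r.getValue(Bytes.toBytes("family"), Bytes.toBytes("qualifier"));
System.out.println(Bytes.toString(cell));

// Scan: iterate a key range in sorted order.
Scan scan = new Scan(Bytes.toBytes("row-000"), Bytes.toBytes("row-999"));
ResultScanner scanner = table.getScanner(scan);
for (Result row : scanner) {
  System.out.println(Bytes.toString(row.getRow()));
}
scanner.close();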

Principle #5

HBase: when you need OLAP + OLTP

What if it's OLTP?

[The same core ecosystem stack diagram as above.]

GemFire XD

[The same stack diagram with GemFire XD added alongside the ASF projects.]

GemFire XD: a better HBase?

• Closed source but extremely mature

• SQL/Objects/JSON data model

• High concurrency, high update load

• Mostly selective point queries (no scans)

• Tiered storage architecture

YCSB Benchmark: throughput is 2-12X

[Bar charts: throughput (ops/sec, 0 to 800,000) for YCSB workloads AU, BU, CU, D, FU and LOAD at 4, 8, 12 and 16 nodes; HBase vs. GemFire XD.]

YCSB Benchmark: latency is 2X-20X better

[Bar charts: latency (μsec, 0 to 14,000) for the same workloads at 4, 8, 12 and 16 nodes; HBase vs. GemFire XD.]

Principle #6

There are always 3 implementations

Querying data

• MapReduce: “an assembly language”

• Apache Pig: a data manipulation DSL (now Turing complete!)

• Apache Hive: a batch-oriented SQL on top of Hadoop

How do I use Pig?

grunt> A = load './input.txt';
grunt> B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
grunt> C = group B by word;
grunt> D = foreach C generate COUNT(B), group;

How do I use Hive?

CREATE TABLE docs (line STRING);

LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;

CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
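Hive is also reachable from plain Java through the HiveServer2 JDBC driver; a minimal sketch (host, port and credentials are assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveTopWords {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 driver
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://hive-host:10000/default", "user", "");
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT word, `count` FROM word_counts " +
             "ORDER BY `count` DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
    conn.close();
  }
}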

Can we short Oracle now?

• No indexing

• Batch oriented scheduling

• Optimization for long running queries

• Metadata management is still in flux

[Close to] real-time SQL

• Impala (inspired by Google’s F1)

• Hive/Tez (AKA Stinger)

• Facebook’s Presto (Hive’s lineage)

• Pivotal’s HAWQ

HAWQ

• Greenplum MPP database core

• True ANSI SQL support

• HDFS storage backend

• Parquet support

Principle #7

SQL on Hadoop

Feeding the elephant

Getting data in: Flume

• Designed for collecting log data

• Flexible deployment topology

Sqoop: RDBMS connection

• Sqoop 1

  • A MapReduce tool

  • Must use Oozie for workflows

• Sqoop 2

  • Well, 0.99.x really

  • A standalone service

Spring XD

• Unified, distributed, extensible system for data ingestion, real-time analytics and data export

• Apache Licensed, not ASF

• A runtime service, not a library

• AKA “Oozie + Flume + Sqoop + Morphlines”

How do I use it?

# deployment: ./xd-singlenode

$ ./xd-shell

xd:> hadoop config fs --namenode hdfs://nn:8020

xd:> stream create --definition "time | hdfs" --name ticktock

xd:> stream destroy --name ticktock

Feeding the Elephant

[The same stack diagram with SpringXD and GemFire XD included as ingest paths.]

Spark the disruptor

[The same stack diagram with Spark (Shark, Streaming, MLlib, GraphX) added.]

What’s wrong with MR?

[Figure omitted; source: UC Berkeley Spark project]

Spark innovations

• Resilient Distributed Datasets (RDDs)

• Distributed on a cluster

• Manipulated via parallel operators (map, etc.)

• Automatically rebuilt on failure

• A parallel ecosystem

• A solution to iterative and multi-stage apps

RDDs

warnings = textFile(…).filter(_.contains("warning"))
                      .map(_.split(' ')(1))

[Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (contains…) → MappedRDD (split…)]

Parallel operators

• map, reduce

• sample, filter

• groupBy, reduceByKey

• join, leftOuterJoin, rightOuterJoin

• union, cross

An alternative backend

• Shark: a Hive on Spark (now Spark SQL)

• Spork: a Pig on Spark

• MLlib: machine learning on Spark

• GraphX: Graph processing on Spark

• Also featuring its own streaming engine

How do I use it?

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Principle #8

Spark is the technology of 2014

Where’s the cloud?

What’s new?

• True elasticity

• Resource partitioning

• Security

• Data marketplace

• Data leaks/breaches

Hadoop Maturity

• ETL Offload: accommodate massive data growth with existing EDW investments

• Data Lakes: unify unstructured and structured data access

• Big Data Apps: build analytic-led applications impacting top-line revenue

• Data-Driven Enterprise: app dev and operational management on HDFS

[Maturity curve plotted against data architecture sophistication.]

Pivotal HD on Pivotal CF

• Enterprise PaaS Management System

• Flexible multi-language 'buildpack' architecture

• Deployed applications enjoy built-in services

• On-Premise Hadoop as a Service

• Single-cluster deployment of Pivotal HD

• Developers instantly bind to shared Hadoop clusters

• Speeds up time-to-value

Pivotal Data Fabric Evolution

[Diagram: the Pivotal Data Platform spanning analytic data marts (SQL services), operational intelligence (in-memory database), run-time applications, a data staging platform (data management services), stream ingestion (streaming services), new data fabrics and an in-memory grid, all on a software-defined datacenter.]

Principle #9

Hadoop in the Cloud is one of many distributed frameworks

2014 is the year of Hadoop

[The full ecosystem stack diagram once more, from HDFS and YARN up through Spark, Impala, HAWQ, SpringXD, MADlib, Hamster and PivotalR.]

A NEW PLATFORM FOR A NEW ERA


Credits

• Apache Software Foundation

• Milind Bhandarkar

• Konstantin Boudnik

• Robert Geiger

• Susheel Kaushik

• Mak Gokhale

Questions?