Hackathon Bonn

Hortonworks: We Do Hadoop. Our mission is to enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop

YARN, Tez, Stinger

June 2014

Our Mission:

Enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop

Our Commitment

Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process

Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind

Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills

Headquarters: Palo Alto, CA

Employees: 300+ and growing

Trusted Partners

Driving Our Innovation Through Apache

147,933 lines

614,041 lines

End Users

449,768 lines

Total Net Lines Contributed to Apache Hadoop

Yahoo: 10

Cloudera: 7

IBM: 3

10 Others

21

Facebook: 5

LinkedIn: 3

Total Number of Committers to Apache Hadoop

63 total

Hortonworks' mission is to power your modern data architecture by enabling Hadoop to be an enterprise data platform that deeply integrates with your data center technologies

Apache Project   Committers   PMC Members
Hadoop           21           13
Tez              10           4
Hive             11           3
HBase            8            3
Pig              6            5
Sqoop            1            0
Ambari           20           12
Knox             6            2
Falcon           2            2
Oozie            2            2
Zookeeper        2            1
Flume            1            0
Accumulo         2            2
Storm            1            0
Drill            1            0
TOTAL            95           48

Broad Ecosystem Integration

APPLICATIONS

DATA SYSTEM

SOURCES

RDBMS EDW MPP

Emerging Sources (Sensor, Sentiment, Geo, Unstructured)

HANA

BusinessObjects BI

OPERATIONAL TOOLS

DEV & DATA TOOLS

Existing Sources (CRM, ERP, Clickstream, Logs)

INFRASTRUCTURE

UDA Diagram

Relying on Hortonworks…

Teradata Portfolio for Hadoop

• Seamless data access between Teradata and Hadoop (SQL-H)

• Simple management & monitoring with Viewpoint integration

• Flexible deployment options

HDInsight & HDP for Windows

• Only Hadoop Distribution for Windows Azure & Windows Server

• Native integration with SQL Server, Excel, and System Center

• Extends Hadoop to .NET community

Complete Portfolio for Hadoop

Appliances

Instant Access + Infinite Scale

• SAP can assure their customers they are deploying an SAP HANA + Hadoop architecture fully supported by SAP

• Enables analytics apps (BOBJ) to interact with Hadoop

HDP 2.1: Enterprise Hadoop Platform

Hortonworks Data Platform (HDP)

• The ONLY 100% open source and most current platform

• Integrates full range of enterprise-ready services

• Certified and tested at scale

• Engineered for deep ecosystem interoperability

[Diagram: HORTONWORKS DATA PLATFORM (HDP), deployable on OS/VM, Cloud, or Appliance]

Layers: CORE SERVICES, DATA SERVICES, OPERATIONAL SERVICES

Components: HDFS, YARN, MAP REDUCE, TEZ, HIVE & HCATALOG, PIG, HBASE, SQOOP, FLUME, NFS, WebHDFS, LOAD & EXTRACT, KNOX*, OOZIE, AMBARI, FALCON*

Functions: Storage, Resource Management, Process, Data Access, Data Movement, Schedule, Cluster Mgmnt, Dataset Mgmnt

Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots

Our Vision: Hadoop as Next-Gen Platform

HADOOP 1.0: Single Use System (Batch Apps)

MapReduce (cluster resource management & data processing)

HDFS (redundant, reliable storage)

HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)

MapReduce (data processing), Others

YARN (cluster resource management)

HDFS2 (redundant, highly-available & reliable storage)

The 1st Generation of Hadoop: Batch

HADOOP 1.0: Built for Web-Scale Batch Apps

[Diagram: Single App silos labelled BATCH, INTERACTIVE and ONLINE on top of HDFS]

• All other usage patterns must leverage that same infrastructure

• Forces the creation of silos for managing mixed workloads

Hadoop MapReduce Classic

• JobTracker

–Manages cluster resources and job scheduling

• TaskTracker

–Per-node agent

–Manage tasks

YARN: Taking Hadoop Beyond Batch

Applications Run Natively in Hadoop

HDFS2 (Redundant, Reliable Storage)

YARN (Cluster Resource Management)

BATCH (MapReduce)

INTERACTIVE (Tez)

STREAMING (Storm, S4,…)

GRAPH (Giraph)

IN-MEMORY (Spark)

HPC MPI (OpenMPI)

ONLINE (HBase)

OTHER (Search) (Weave…)

Store ALL DATA in one place…

Interact with that data in MULTIPLE WAYS

with Predictable Performance and Quality of Service

5 Key Benefits of YARN

1. Scale

2. New Programming Models & Services

3. Improved cluster utilization

4. Agility

5. Beyond Java

Concepts

• Application
  – Application is a temporal job or a service submitted to YARN
  – Examples
    – MapReduce job (job)
    – HBase cluster (service)

• Container
  – Basic unit of allocation
  – Fine-grained resource allocation across multiple resource types (memory, cpu, disk, network, gpu etc.)
    – container_0 = 2GB, 1 CPU
    – container_1 = 1GB, 6 CPU
  – Replaces the fixed map/reduce slots
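
The container concept above can be pictured with a toy model. This is a minimal sketch, not YARN's API: the `Resource`, `Node` and `allocate` names, the node sizes, and the first-fit placement rule are all assumptions made for illustration, and only memory and CPU are modeled. The two container sizes are the ones listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Resource:
    """A multi-dimensional resource amount (hypothetical model, not a YARN class)."""
    memory_gb: int
    vcores: int


@dataclass
class Node:
    """A worker node that tracks its remaining capacity."""
    name: str
    free: Resource
    containers: List[Resource] = field(default_factory=list)

    def can_fit(self, req: Resource) -> bool:
        # Every resource dimension must fit, not just a slot count.
        return req.memory_gb <= self.free.memory_gb and req.vcores <= self.free.vcores

    def launch(self, req: Resource) -> None:
        self.free = Resource(self.free.memory_gb - req.memory_gb,
                             self.free.vcores - req.vcores)
        self.containers.append(req)


def allocate(nodes: List[Node], req: Resource) -> Optional[str]:
    """Place the request on the first node with enough free memory and CPU."""
    for node in nodes:
        if node.can_fit(req):
            node.launch(req)
            return node.name
    return None


# Assumed cluster: two nodes of different sizes.
nodes = [Node("node-a", Resource(4, 4)), Node("node-b", Resource(8, 8))]
container_0 = Resource(memory_gb=2, vcores=1)
container_1 = Resource(memory_gb=1, vcores=6)

placed_0 = allocate(nodes, container_0)
placed_1 = allocate(nodes, container_1)
print(placed_0, placed_1)
```

Because each request names its own memory and CPU, a small-memory but CPU-heavy container such as container_1 can land on whichever node has spare cores, which a fixed map/reduce slot cannot express.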

Design Centre

• Split up the two major functions of JobTracker
  – Cluster resource management
  – Application life-cycle management

• MapReduce becomes user-land library

YARN Applications

• Data processing applications and services
  – Online Serving – HOYA (HBase on YARN)
  – Real-time event processing – Storm, S4, other commercial platforms
  – Interactive SQL – Tez (Generalization of MR)
  – Machine Learning – MPI (OpenMPI, MPICH2)
  – In-Memory: Spark
  – Graph processing: Giraph
  – Enabled by allowing the use of paradigm-specific application master

Run all on the same Hadoop cluster!

© Hortonworks Inc. 2012

YARN as OS for Data Lake

[Diagram: a ResourceManager with its Scheduler above four NodeManagers running containers from three workloads]

Batch: map1.1, map1.2, reduce1.1

Interactive SQL: vertex1.1.1, vertex1.1.2, vertex1.2.1, vertex1.2.2

Real-Time: nimbus0, nimbus1, nimbus2


Multi-Tenant YARN

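[Diagram: ResourceManager Scheduler with hierarchical queues under root]

Queues: Adhoc 10%, DW 60%, Mrkting 30%

Sub-queues: Dev 10%, Reserved 20%, Prod 70%

Sub-queues: Prod 80%, Dev 20%

Sub-queues: P0 70%, P1 30%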

Multi-Tenancy with CapacityScheduler

• Queues

• Economics as queue-capacity
  – Hierarchical Queues

• SLAs
  – Preemption

• Resource Isolation
  – Linux: cgroups
  – MS Windows: Job Control
  – Roadmap: Virtualization (Xen, KVM)

• Administration
  – Queue ACLs
  – Run-time re-configuration for queues
  – Charge-back
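
With hierarchical queues, a queue's share of the whole cluster is the product of the percentages along its path from root. The sketch below works that arithmetic through. The percentages are the ones shown on the Capacity Scheduler slide, but the parent/child layout of the tree is an assumption made for illustration, and the `absolute_capacities` and `validate` helpers are hypothetical, not part of any Hadoop API.

```python
# Each entry maps a queue name to (percent of parent, children).
# The parent/child layout is an assumption made for this sketch.
queues = {
    "Adhoc": (10, {}),
    "DW": (70, {
        "Dev": (10, {}),
        "Reserved": (20, {}),
        "Prod": (70, {"P0": (70, {}), "P1": (30, {})}),
    }),
    "Mrkting": (20, {"Dev": (20, {}), "Prod": (80, {})}),
}


def absolute_capacities(tree, parent_share=100.0, prefix="root"):
    """Return {queue path: percent of the whole cluster} for every queue."""
    result = {}
    for name, (percent, children) in tree.items():
        path = f"{prefix}.{name}"
        share = parent_share * percent / 100.0
        result[path] = share
        result.update(absolute_capacities(children, share, path))
    return result


def validate(tree):
    """Sibling percentages must sum to 100 at every level."""
    if not tree:
        return True
    total = sum(percent for percent, _ in tree.values())
    return total == 100 and all(validate(children) for _, children in tree.values())


caps = absolute_capacities(queues)
for path, share in sorted(caps.items()):
    print(f"{path}: {share:.1f}% of cluster")
```

Under this assumed layout, P0 is guaranteed 70% of 70% of 70% of the cluster, which is how capacity doubles as an economic unit for charge-back.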

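[Diagram: ResourceManager Scheduler with hierarchical queues under root]

Queues: Adhoc 10%, DW 70%, Mrkting 20%

Sub-queues: Dev 10%, Reserved 20%, Prod 70%

Sub-queues: Prod 80%, Dev 20%

Sub-queues: P0 70%, P1 30%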

Capacity Scheduler

Hierarchical Queues

Tez (“Speed”)

• What is it?
  – A data processing framework as an alternative to MapReduce
  – A new incubation project in the ASF

• Who else is involved?
  – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft

• Why does it matter?
  – Widens the platform for Hadoop use cases
  – Crucial to improving the performance of low-latency applications
  – Core to the Stinger initiative
  – Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop

Moving Hadoop Beyond MapReduce

• Low level data-processing execution engine

• Built on YARN

• Enables pipelining of jobs

• Removes task and job launch times

• Does not write intermediate output to HDFS
  – Much lighter disk and network usage

• New base of MapReduce, Hive, Pig, Cascading etc.

• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline

Tez - Core Idea

Task with pluggable Input, Processor & Output

YARN ApplicationMaster to run DAG of Tez Tasks

Tez Task - <Input, Processor, Output>

[Diagram: a Task wraps Input → Processor → Output]

Building Blocks for Tasks

• MapReduce ‘Map’ Task: HDFS Input → Map Processor → Sorted Output

• MapReduce ‘Reduce’ Task: Shuffle Input → Reduce Processor → HDFS Output

• Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output

• Special Pig/Hive ‘Map’ (Tez Task): HDFS Input → Map Processor → Pipeline Sorter Output

• Special Pig/Hive ‘Reduce’ (Tez Task): Shuffle Skip-merge Input → Reduce Processor → Sorted Output

• In-memory Map (Tez Task): HDFS Input → Map Processor → In-memory Sorted Output
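
The idea of a task as a pluggable <Input, Processor, Output> triple can be sketched in a few lines. This is an illustration only, assuming plain Python callables stand in for the three pieces; the `Task` class and the helper functions are hypothetical names and do not mirror the actual Tez (Java) API.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple

KV = Tuple[str, int]

# Hypothetical stand-ins for the pluggable pieces; these are not Tez classes.
Input = Callable[[], Iterable]
Processor = Callable[[Iterable], Iterable[KV]]
Output = Callable[[Iterable[KV]], List[KV]]


class Task:
    """A task is just the composition <Input, Processor, Output>."""

    def __init__(self, source: Input, processor: Processor, sink: Output):
        self.source, self.processor, self.sink = source, processor, sink

    def run(self) -> List[KV]:
        return self.sink(self.processor(self.source()))


def sorted_output(pairs: Iterable[KV]) -> List[KV]:
    """Sort by key, as a sorted output would before a shuffle."""
    return sorted(pairs)


def map_processor(lines: Iterable[str]) -> Iterable[KV]:
    """Word-count style map: emit (word, 1) for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def reduce_processor(pairs: Iterable[KV]) -> Iterable[KV]:
    """Sum the counts for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return totals.items()


# 'Map' task: in-memory lines stand in for HDFS input.
lines = ["tez on yarn", "hive on tez"]
map_task = Task(lambda: lines, map_processor, sorted_output)
shuffled = map_task.run()

# 'Reduce' task: the map output stands in for shuffle input.
reduce_task = Task(lambda: shuffled, reduce_processor, sorted_output)
result = reduce_task.run()
print(result)
```

Swapping any one of the three pieces (for example an in-memory output in place of a sorted one) yields a different building block without touching the other two, which is the point of the composition.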

Pig/Hive-MR versus Pig/Hive-Tez

SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

Pig/Hive - MR: Job 1, Job 2 and Job 3, separated by I/O synchronization barriers

Pig/Hive - Tez: Single Job
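
The single-job plan can be pictured as one DAG whose vertices run as soon as their inputs are ready. The sketch below is a rough illustration using Python's standard `graphlib` module (Python 3.9+); the vertex names and the shape of the plan are assumptions, and the plan Hive or Pig would actually generate may differ.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical vertices for the query; each maps to the vertices it depends on.
dag = {
    "scan_a": set(),
    "scan_b": set(),
    "scan_c": set(),
    "join_a_b": {"scan_a", "scan_b"},
    "join_ab_c": {"join_a_b", "scan_c"},
    "group_by_state": {"join_ab_c"},
}


def execution_waves(graph):
    """Group vertices into waves; every vertex in a wave has all its inputs done."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dag)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

All six vertices belong to one application here, so there is no job boundary between the joins and the aggregation at which intermediate results would have to be written out and re-read.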

Tez on YARN: Going Beyond Batch

Tez Optimizes Execution: New runtime engine for more efficient data processing

Always-On Tez Service: Low latency processing for all Hadoop data processing

Tez Task

SQL-in-Hadoop with Apache Hive

• Apache Hive is the standard for SQL interaction with Hadoop
  – Enterprises make the final purchasing decision on two key characteristics: 'compatibility' with existing investments (60%) and skills (20%)
  – Most applications claim Hive compatibility TODAY*

• Stinger Initiative: Simple Focus
  – Performance
  – SQL-Compatibility
  – Scalability

* Claims publicly made by: Teradata, Microsoft, Oracle, Microstrategy, IBM, Information Builders, SAS, QlikTech, SAP, Tableau, Tibco, Actuate, Jaspersoft, Alteryx, Datameer, Pentaho

[Diagram: Business Analytics and Custom Apps send SQL to Hive, which runs on Tez and MapReduce over YARN and HDFS inside Hadoop]

Improves existing tools & preserves investments

Stinger Project (announced February 2013)

Batch AND Interactive SQL-IN-Hadoop

Stinger Initiative: A broad, community-based effort to drive the next generation of HIVE

Goals:

• Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)

• Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB

• SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop

…all IN Hadoop

Hive 0.11, May 2013:
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format

Hive 0.12, October 2013:
• VARCHAR, DATE Types
• ORCFile predicate pushdown
• Advanced Optimizations
• Performance Boosts via YARN

Hive 0.13, April 2014:
• Hive on Apache Tez
• Query Service
• Buffer Cache
• Cost Based Optimizer (Optiq)
• Vectorized Processing

Hortonworks: The Value of “Open” for You

Validate & Try
1. Download the Hortonworks Sandbox
2. Learn Hadoop using the technical tutorials
3. Investigate a business case using the step-by-step business case scenarios
4. Validate YOUR business case using your data in the sandbox

Connect With the Hadoop Community: We employ a large number of Apache project committers & innovators so that you are represented in the open source community

Avoid Vendor Lock-In: Hortonworks Data Platform remains as close to the open source trunk as possible and is developed 100% in the open so you are never locked in

The Partners You Rely On, Rely On Hortonworks: We work with partners to deeply integrate Hadoop with data center technologies so you can leverage existing skills and investments

Certified for the Enterprise: We engineer, test and certify the Hortonworks Data Platform at scale to ensure the reliability and stability you require for enterprise use

Support from the Experts: We provide the highest quality of support for deploying at scale. You are supported by hundreds of years of Hadoop experience

Engage
1. Execute a Business Case Discovery Workshop with our architects
2. Build a business case for Hadoop today
