Hackathon bonn
Uploaded by emil-andreas-siemes
Hortonworks: We Do Hadoop.
Our mission is to enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop
YARN, Tez, Stinger
June 2014
Our Mission:
Our Commitment
• Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process
• Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind
• Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills
Headquarters: Palo Alto, CA
Employees: 300+ and growing
Trusted Partners
Enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop
Driving Our Innovation Through Apache
[Chart: Total Net Lines Contributed to Apache Hadoop. Values shown: 147,933 lines; 614,041 lines; 449,768 lines (only the End Users series label is shown)]
[Chart: Total Number of Committers to Apache Hadoop, 63 total. Yahoo: 10; Cloudera: 7; IBM: 3; Facebook: 5; LinkedIn: 3; 10 Others; one unlabeled segment: 21]
Hortonworks' mission is to power your modern data architecture by enabling Hadoop to be an enterprise data platform that deeply integrates with your data center technologies
Apache Project   Committers   PMC Members
Hadoop           21           13
Tez              10           4
Hive             11           3
HBase            8            3
Pig              6            5
Sqoop            1            0
Ambari           20           12
Knox             6            2
Falcon           2            2
Oozie            2            2
Zookeeper        2            1
Flume            1            0
Accumulo         2            2
Storm            1            0
Drill            1            0
TOTAL            95           48
Broad Ecosystem Integration
APPLICATIONS
DATA SYSTEMS
SOURCES
RDBMS EDW MPP
Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
HANA
BusinessObjects BI
OPERATIONAL TOOLS
DEV & DATA TOOLS
Existing Sources (CRM, ERP, Clickstream, Logs)
INFRASTRUCTURE
UDA Diagram
Relying on Hortonworks…
Teradata Portfolio for Hadoop
• Seamless data access between Teradata and Hadoop (SQL-H)
• Simple management & monitoring with Viewpoint integration
• Flexible deployment options
HDInsight & HDP for Windows
• Only Hadoop Distribution for Windows Azure & Windows Server
• Native integration with SQL Server, Excel, and System Center
• Extends Hadoop to .NET community
Complete Portfolio for Hadoop
Appliances
Instant Access + Infinite Scale
• SAP can assure their customers they are deploying an SAP HANA + Hadoop architecture fully supported by SAP
• Enables analytics apps (BOBJ) to interact with Hadoop
HDP 2.1: Enterprise Hadoop Platform
Hortonworks Data Platform (HDP)
• The ONLY 100% open source and most current platform
• Integrates full range of enterprise-ready services
• Certified and tested at scale
• Engineered for deep ecosystem interoperability
[Diagram: HORTONWORKS DATA PLATFORM (HDP), deployable on OS/VM, Cloud, or Appliance]
• OPERATIONAL SERVICES: Ambari (cluster management), Falcon* (dataset management), Oozie (schedule)
• DATA SERVICES: Hive & HCatalog, Pig, HBase (data access); Load & Extract with Sqoop, Flume, NFS, WebHDFS, Knox* (data movement)
• CORE SERVICES: HDFS (storage), YARN (resource management), MapReduce and Tez (process)
• Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
Our Vision: Hadoop as Next-Gen Platform
HADOOP 1.0: Single Use System (Batch Apps)
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)
HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
• MapReduce (data processing) and Others
• YARN (cluster resource management)
• HDFS2 (redundant, highly-available & reliable storage)
The 1st Generation of Hadoop: Batch
HADOOP 1.0: Built for Web-Scale Batch Apps
• All other usage patterns must leverage that same infrastructure
• Forces the creation of silos for managing mixed workloads
[Diagram: separate Single App silos (BATCH, INTERACTIVE, ONLINE), each on its own HDFS]
Hadoop MapReduce Classic
• JobTracker
–Manages cluster resources and job scheduling
• TaskTracker
–Per-node agent
–Manages tasks
YARN: Taking Hadoop Beyond Batch
Applications Run Natively in Hadoop
• BATCH (MapReduce)
• INTERACTIVE (Tez)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• ONLINE (HBase)
• OTHER (Search) (Weave…)
YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)
Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS
with Predictable Performance and Quality of Service
5 Key Benefits of YARN
1. Scale
2. New Programming Models & Services
3. Improved cluster utilization
4. Agility
5. Beyond Java
Concepts
• Application
–An application is a temporal job or a service submitted to YARN
–Examples
  – MapReduce job (job)
  – HBase cluster (service)
• Container
–Basic unit of allocation
–Fine-grained resource allocation across multiple resource types (memory, CPU, disk, network, GPU etc.)
  – container_0 = 2GB, 1 CPU
  – container_1 = 1GB, 6 CPU
–Replaces the fixed map/reduce slots
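The container idea above can be illustrated with a minimal Python sketch. It is not YARN's API: the `Resource` and `Node` classes and the 8GB / 8 CPU node size are assumptions made for illustration, and only memory and CPU are modeled. The two container sizes are the ones given on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A bundle of resources; only memory and CPU are modeled here."""
    memory_gb: int
    vcores: int


class Node:
    """Hypothetical worker node that hands out containers until it is full."""

    def __init__(self, capacity: Resource):
        self.capacity = capacity
        self.allocated = []  # list of (container_name, Resource)

    def free(self) -> Resource:
        """Return what is left after subtracting every granted container."""
        used_mem = sum(r.memory_gb for _, r in self.allocated)
        used_cpu = sum(r.vcores for _, r in self.allocated)
        return Resource(self.capacity.memory_gb - used_mem,
                        self.capacity.vcores - used_cpu)

    def allocate(self, name: str, request: Resource) -> bool:
        """Grant the request only if every resource type still fits."""
        free = self.free()
        if request.memory_gb <= free.memory_gb and request.vcores <= free.vcores:
            self.allocated.append((name, request))
            return True
        return False


node = Node(Resource(memory_gb=8, vcores=8))  # assumed node size
granted_0 = node.allocate("container_0", Resource(2, 1))  # 2GB, 1 CPU (from the slide)
granted_1 = node.allocate("container_1", Resource(1, 6))  # 1GB, 6 CPU (from the slide)
granted_2 = node.allocate("container_2", Resource(1, 2))  # hypothetical third request
remaining = node.free()
```

Unlike fixed map/reduce slots, each request here names its own size, so a memory-light, CPU-heavy container and a memory-heavy, CPU-light one can share a node; the third request is refused because the node has too few CPUs left, even though memory is still free.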
Design Centre
• Split up the two major functions of JobTracker
–Cluster resource management
–Application life-cycle management
• MapReduce becomes user-land library
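The split described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not YARN code: the class names mirror the YARN roles, but the methods (`request`, `release`, `run`), the container counting, and the `wordcount` job are hypothetical.

```python
class ResourceManager:
    """Cluster-wide role: tracks free capacity and grants containers."""

    def __init__(self, total_containers: int):
        self.free_containers = total_containers

    def request(self, count: int) -> int:
        """Grant up to `count` containers and return how many were granted."""
        granted = min(count, self.free_containers)
        self.free_containers -= granted
        return granted

    def release(self, count: int) -> None:
        """Return containers to the cluster once their tasks have finished."""
        self.free_containers += count


class ApplicationMaster:
    """Per-application role: owns the life cycle of one job's tasks."""

    def __init__(self, name: str, tasks: list, rm: ResourceManager):
        self.name = name
        self.pending = list(tasks)
        self.done = []
        self.rm = rm

    def run(self) -> None:
        while self.pending:
            granted = self.rm.request(len(self.pending))
            if granted == 0:
                raise RuntimeError("no capacity available")
            batch, self.pending = self.pending[:granted], self.pending[granted:]
            self.done.extend(batch)    # pretend each task ran to completion
            self.rm.release(granted)   # hand the containers back


rm = ResourceManager(total_containers=2)
job = ApplicationMaster("wordcount", ["map1", "map2", "reduce1"], rm)
job.run()
```

The resource manager in this sketch knows nothing about maps or reduces; all job-specific logic lives in the per-application master, which is what lets MapReduce become one user-land library among several.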
YARN Applications
• Data processing applications and services
–Online Serving – HOYA (HBase on YARN)
–Real-time event processing – Storm, S4, other commercial platforms
–Interactive SQL – Tez (Generalization of MR)
–Machine Learning – MPI (OpenMPI, MPICH2)
–In-Memory: Spark
–Graph processing: Giraph
–Enabled by allowing the use of paradigm-specific application master
Run all on the same Hadoop cluster!
© Hortonworks Inc. 2012
YARN as OS for Data Lake
[Diagram: a ResourceManager (Scheduler) above a grid of NodeManagers running containers for three workloads]
• Batch: map1.1, map1.2, reduce1.1
• Interactive SQL: vertex1.1.1, vertex1.1.2, vertex1.2.1, vertex1.2.2
• Real-Time: nimbus0, nimbus1, nimbus2
Multi-Tenant YARN
[Diagram: ResourceManager Scheduler with hierarchical queues under root; queue labels as shown]
• Adhoc 10%, DW 60%, Mrkting 30%
• Dev 10%, Reserved 20%, Prod 70%
• Prod 80%, Dev 20%
• P0 70%, P1 30%
Multi-Tenancy with CapacityScheduler
• Queues
• Economics as queue-capacity
–Hierarchical Queues
• SLAs
–Preemption
• Resource Isolation
–Linux: cgroups
–MS Windows: Job Control
–Roadmap: Virtualization (Xen, KVM)
• Administration
–Queue ACLs
–Run-time re-configuration for queues
–Charge-back
Capacity Scheduler: Hierarchical Queues
[Diagram: ResourceManager Scheduler with hierarchical queues under root; queue labels as shown]
• Adhoc 10%, DW 70%, Mrkting 20%
• Dev 10%, Reserved 20%, Prod 70%
• Prod 80%, Dev 20%
• P0 70%, P1 30%
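With hierarchical queues, each percentage is a share of the parent queue, so a leaf queue's share of the whole cluster is the product of the percentages along its path. The Python sketch below works that arithmetic through. The percentages come from the slide, but the parent/child links in `QUEUES` are an assumption (the transcript lists the labels without the tree structure), and the function names are hypothetical.

```python
# Assumed tree: the transcript gives the percentages but not the parent/child links.
QUEUES = {
    "root": {"Adhoc": 10, "DW": 70, "Mrkting": 20},
    "root.DW": {"Dev": 10, "Reserved": 20, "Prod": 70},
    "root.DW.Prod": {"P0": 70, "P1": 30},
    "root.Mrkting": {"Prod": 80, "Dev": 20},
}


def absolute_capacity(path: str) -> float:
    """Return the fraction of the whole cluster guaranteed to the queue at `path`."""
    parts = path.split(".")
    share = 1.0
    for depth in range(1, len(parts)):
        parent = ".".join(parts[:depth])
        share *= QUEUES[parent][parts[depth]] / 100.0
    return share


def siblings_sum_to_100() -> bool:
    """Check that the children of every parent queue add up to 100%."""
    return all(sum(children.values()) == 100 for children in QUEUES.values())


p0_share = absolute_capacity("root.DW.Prod.P0")          # 0.7 * 0.7 * 0.7
mrkting_dev_share = absolute_capacity("root.Mrkting.Dev")  # 0.2 * 0.2
```

Under the assumed tree, the P0 queue is guaranteed roughly a third of the cluster (0.7 × 0.7 × 0.7), while the marketing Dev queue is guaranteed 4%.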
Tez (“Speed”)
• What is it?
–A data processing framework as an alternative to MapReduce
–A new incubation project in the ASF
• Who else is involved?
–22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
• Why does it matter?
–Widens the platform for Hadoop use cases
–Crucial to improving the performance of low-latency applications
–Core to the Stinger initiative
–Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop
Moving Hadoop Beyond MapReduce
• Low level data-processing execution engine
• Built on YARN
• Enables pipelining of jobs
• Removes task and job launch times
• Does not write intermediate output to HDFS
–Much lighter disk and network usage
• New base of MapReduce, Hive, Pig, Cascading etc.
• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
Tez - Core Idea
• Task with pluggable Input, Processor & Output
• YARN ApplicationMaster to run DAG of Tez Tasks
Tez Task - <Input, Processor, Output>
Building Blocks for Tasks
• MapReduce ‘Map’ Task: HDFS Input, Map Processor, Sorted Output
• MapReduce ‘Reduce’ Task: Shuffle Input, Reduce Processor, HDFS Output
• Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input, Reduce Processor, Sorted Output
• Special Pig/Hive ‘Map’ (Tez Task): HDFS Input, Map Processor, Pipeline Sorter Output
• Special Pig/Hive ‘Reduce’ (Tez Task): Shuffle Skip-merge Input, Reduce Processor, Sorted Output
• In-memory Map (Tez Task): HDFS Input, Map Processor, In-memory Sorted Output
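The <Input, Processor, Output> building-block idea can be sketched in plain Python. This is an illustration only and does not use the Tez API: the `Task` class and the helper functions are hypothetical stand-ins, and the "inputs" are in-memory lists rather than HDFS or shuffle data.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, int]


class Task:
    """Hypothetical stand-in for a Tez task: <Input, Processor, Output>."""

    def __init__(self,
                 read_input: Callable[[], Iterable],
                 process: Callable[[Iterable], Iterable[Pair]],
                 write_output: Callable[[Iterable[Pair]], List[Pair]]):
        self.read_input = read_input
        self.process = process
        self.write_output = write_output

    def run(self) -> List[Pair]:
        """Pull from the input, apply the processor, push to the output."""
        return self.write_output(self.process(self.read_input()))


def map_processor(lines):
    """Emit (word, 1) for every word, like a word-count map."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def reduce_processor(pairs):
    """Sum the values per key, like a word-count reduce."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals.items()


def sorted_output(pairs):
    """Output that sorts by key, standing in for a 'Sorted Output'."""
    return sorted(pairs)


def plain_output(pairs):
    """Output that keeps pairs as produced."""
    return list(pairs)


lines = ["tez yarn", "yarn hive yarn"]
map_task = Task(lambda: lines, map_processor, sorted_output)
map_result = map_task.run()
reduce_task = Task(lambda: map_result, reduce_processor, plain_output)
word_counts = dict(reduce_task.run())
```

Because the three parts are passed in separately, swapping `sorted_output` for `plain_output` (or one input for another) yields a different kind of task without touching the processor, which is the composition the list above describes.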
Pig/Hive-MR versus Pig/Hive-Tez
    SELECT a.state, COUNT(*), AVG(c.price)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemId = c.itemId)
    GROUP BY a.state
• Pig/Hive - MR: Job 1, Job 2 and Job 3, separated by I/O Synchronization Barriers
• Pig/Hive - Tez: Single Job
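The difference between the two plans can be counted with a small Python model. It is a deliberately simplified sketch, not how Hive or Pig actually compile queries: it assumes each join and the group-by needs its own shuffle, that classic MapReduce runs one job per shuffle, and that Tez runs all stages as one DAG. The stage list and function names are hypothetical.

```python
# Shuffle-requiring stages of the example query (an assumed, simplified plan).
STAGES = ["join a with b on id", "join with c on itemId", "group by state"]


def mapreduce_plan(stages):
    """Classic MapReduce: one job per shuffle stage, chained through HDFS."""
    jobs = [[stage] for stage in stages]
    hdfs_barriers = max(len(jobs) - 1, 0)  # intermediate output between jobs
    return jobs, hdfs_barriers


def tez_plan(stages):
    """Tez: all stages become vertices of one DAG inside a single job."""
    jobs = [list(stages)] if stages else []
    return jobs, 0


mr_jobs, mr_barriers = mapreduce_plan(STAGES)
tez_jobs, tez_barriers = tez_plan(STAGES)
```

Under these assumptions the model reproduces the picture on the slide: three chained jobs with two I/O synchronization barriers for MapReduce, against a single job with none for Tez.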
Tez on YARN: Going Beyond Batch
• Tez Optimizes Execution: New runtime engine for more efficient data processing
• Always-On Tez Service: Low latency processing for all Hadoop data processing
SQL-in-Hadoop with Apache Hive
• Apache Hive is the standard for SQL interaction with Hadoop
–Enterprises make the final purchasing decision on two key characteristics: 'compatibility' with existing investments (60%) and skills (20%)
–Most applications claim Hive compatibility TODAY*
• Stinger Initiative: Simple Focus
–Performance
–SQL-Compatibility
–Scalability
* Claims publicly made by: Teradata, Microsoft, Oracle, Microstrategy, IBM, Information Builders, SAS, QlikTech, SAP, Tableau, Tibco, Actuate, Jaspersoft, Alteryx, Datameer, Pentaho
[Diagram: Business Analytics and Custom Apps send SQL to Hive, which runs on Tez and MapReduce over YARN and HDFS in Hadoop]
Improves existing tools & preserves investments
Stinger Project (announced February 2013)
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative: A broad, community-based effort to drive the next generation of HIVE
Goals:
• Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds)
• Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop
…all IN Hadoop
Hive 0.11, May 2013:
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format
Hive 0.12, October 2013:
• VARCHAR, DATE Types
• ORCFile predicate pushdown
• Advanced Optimizations
• Performance Boosts via YARN
Hive 0.13, April 2014:
• Hive on Apache Tez
• Query Service
• Buffer Cache
• Cost Based Optimizer (Optiq)
• Vectorized Processing
Hortonworks: The Value of “Open” for You
Validate & Try
1. Download the Hortonworks Sandbox
2. Learn Hadoop using the technical tutorials
3. Investigate a business case using the step-by-step business case scenarios
4. Validate YOUR business case using your data in the sandbox
Engage
1. Execute a Business Case Discovery Workshop with our architects
2. Build a business case for Hadoop today
• Connect With the Hadoop Community: We employ a large number of Apache project committers & innovators so that you are represented in the open source community
• Avoid Vendor Lock-In: Hortonworks Data Platform remains as close to the open source trunk as possible and is developed 100% in the open so you are never locked in
• The Partners You Rely On, Rely On Hortonworks: We work with partners to deeply integrate Hadoop with data center technologies so you can leverage existing skills and investments
• Certified for the Enterprise: We engineer, test and certify the Hortonworks Data Platform at scale to ensure the reliability and stability you require for enterprise use
• Support from the Experts: We provide the highest quality of support for deploying at scale. You are supported by hundreds of years of Hadoop experience