Hadoop World Vertica

Hadoop World 2009 Presentation on the Vertica Hadoop Connector

Transcript of Hadoop World Vertica

Page 1: Hadoop World Vertica

Vertica Integration with Apache Hadoop

Hadoop World NYC 2009

[Diagram: Hadoop compute cluster running Map and Reduce tasks over HDFS]

Page 2: Hadoop World Vertica

Vertica Systems Confidential – Do Not Distribute

Vertica® Analytic Database

> MPP columnar architecture

> Second to sub-second queries

> 300GB/node load times

> Scales to hundreds of TBs

> Standard ETL & Reporting Tools

www.vertica.com

Page 3: Hadoop World Vertica

What do people do with Hadoop?

> Transform data

> Archive data

> Look for patterns

> Parse logs

Page 4: Hadoop World Vertica

Big Data comes in Three Forms

> Unstructured

Images, sound, video

> Semi-structured

Logs, data feeds, event streams

> Fully Structured

Relational tables

Page 5: Hadoop World Vertica

Availability, Scalability and Efficiency

…how fast can you go from data to answers?

> Unstructured data needs to be analyzed before it makes sense.

> Semi-structured data is parsed based on a spec (or by brute force).

> Structured data can be optimized for ad-hoc analysis.

Page 6: Hadoop World Vertica

Hadoop / Vertica

> Distributed processing framework (MapReduce)

> Distributed storage layer (HDFS)

> Vertica can be used as a data source and target for MapReduce

> Data can also be moved between Vertica and HDFS (Sqoop)

> Hadoop talks to Vertica via custom Input and Output Formatters

Page 7: Hadoop World Vertica

Hadoop / Vertica

Vertica serves as a structured data repository for Hadoop

[Diagram: Hadoop compute cluster running Map and Reduce tasks]

Page 8: Hadoop World Vertica

Hadoop / Vertica

> Vertica’s input formatter takes a parameterized query

> Relational Map operations can be pushed down to the database

> Vertica’s output formatter takes an existing table name or a description of a new table

> Vertica output tables can be optimized directly from Hadoop
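
The idea of pushing a relational map operation down to the database can be sketched as follows. This is only an illustration, not the connector's API: Python's built-in sqlite3 module stands in for Vertica so the example runs anywhere, and the ticks table, its columns, and both function names are hypothetical.

```python
import sqlite3

# ASSUMPTION: sqlite3 is a stand-in for Vertica; the "ticks" table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (symbol TEXT, price REAL)")
conn.executemany(
    "INSERT INTO ticks VALUES (?, ?)",
    [("IBM", 120.5), ("HPQ", 47.1), ("IBM", 121.0)],
)


def map_side_filter(symbol):
    """No push down: fetch every row, then filter and project in the mapper."""
    rows = conn.execute("SELECT symbol, price FROM ticks").fetchall()
    return [price for sym, price in rows if sym == symbol]


def pushed_down(symbol):
    """Push down: the database filters and projects; only matching rows are read."""
    rows = conn.execute(
        "SELECT price FROM ticks WHERE symbol = ?", (symbol,)
    ).fetchall()
    return [price for (price,) in rows]
```

Both functions return the same prices; the difference is how much data leaves the database before the map step sees it.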

Page 9: Hadoop World Vertica

Hadoop / Vertica

Federate multiple Vertica database clusters with Hadoop

[Diagram: Hadoop compute cluster running Map and Reduce tasks, shown four times]

Page 10: Hadoop World Vertica

What is the Interface?

> Input Formatter

Query specifies which data to read

Query can be parameterized (map push down)

Each input split gets one parameter

OR, input can be split with ORDER BY and LIMIT (slower)

> Output Formatter

Job specifies format for output table

Vertica converts reduced output into trickle loads

Vertica can optimize new tables
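
The two ways of dividing input described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the connector's code: the function names are hypothetical, the "?" placeholder syntax is assumed, and the use of OFFSET alongside LIMIT is an assumption (the slide mentions only ORDER BY and LIMIT).

```python
from typing import List, Sequence, Tuple


def splits_from_parameters(
    query: str, params: Sequence[str]
) -> List[Tuple[str, Tuple[str, ...]]]:
    """One input split per parameter: same query, one bound value each."""
    return [(query, (p,)) for p in params]


def splits_from_limit(
    query: str, order_by: str, total_rows: int, num_splits: int
) -> List[str]:
    """Fallback: cut an ordered result into contiguous LIMIT/OFFSET windows.

    ASSUMPTION: OFFSET is used to position each window; total_rows would
    have to be known up front (for example from a COUNT(*) query).
    """
    if num_splits <= 0:
        raise ValueError("num_splits must be positive")
    if total_rows <= 0:
        return []
    chunk = -(-total_rows // num_splits)  # ceiling division
    return [
        f"SELECT * FROM ({query}) AS q ORDER BY {order_by} "
        f"LIMIT {chunk} OFFSET {offset}"
        for offset in range(0, total_rows, chunk)
    ]
```

With parameters, each split is an independent, selective query. With windows, every split has to order the full result before skipping to its offset, which is consistent with the slide calling that path slower.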

Page 11: Hadoop World Vertica

Some Hadoop / Vertica Applications

> Elastic MapReduce parsing and loading CloudFront logs

> Tickstore algorithm with map push down

> Analyze time series

> Sessionize click streams

> Parse and load logs

Page 12: Hadoop World Vertica

Basic Example

> Elastic MapReduce parsing and loading CloudFront logs

> Mapper reads from S3 CloudFront Logs

> Parses into records, transmits to reducer

> Reducer loads into Vertica

> All done with streaming API

~10 lines of Python

Limitless SQL
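
The slides do not include the script itself, so the following is only a sketch of what the map and reduce steps of such a streaming job might look like. It assumes tab-separated CloudFront log lines with "#" header lines and keeps the first nine fields; the function names and the pipe-delimited output format are hypothetical, and the actual load into Vertica is left out.

```python
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Parse log lines into 'key<TAB>record' pairs for the reducer."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and '#Version' / '#Fields' style header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:  # ASSUMPTION: at least nine fields per record
            continue
        record = "|".join(fields[:9])
        yield fields[0] + "\t" + record  # key on the first field (the date)


def reduce_lines(lines: Iterable[str]) -> Iterator[str]:
    """Drop the shuffle key and emit pipe-delimited rows for a bulk load."""
    for line in lines:
        _key, _sep, record = line.rstrip("\n").partition("\t")
        if record:
            yield record
```

In a streaming job, each script would feed sys.stdin to its function and print the results; the reducer's rows would then be handed to whatever load mechanism the job uses.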

Page 13: Hadoop World Vertica

Advanced Example

> Tickstore algorithm with map push down

> Input formatter queries Vertica using map push down

> Identity Mapper passes through to reducer

> Reducer runs a proprietary algorithm (moving averages, correlations, secret sauce)

> Results are stored in a new table for further analysis

> Vertica optimizes the new table
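
The proprietary algorithm is not shown in the slides. As a stand-in, the reduce step can be illustrated with a simple moving average per symbol. This sketch assumes the rows arrive as (symbol, timestamp, price) tuples already sorted by symbol and then timestamp; the function name, the tuple layout, and the window size are all hypothetical.

```python
from collections import deque
from itertools import groupby
from operator import itemgetter


def moving_average_reducer(rows, window=3):
    """Yield (symbol, timestamp, average of the last `window` prices).

    ASSUMPTION: rows are sorted by symbol, then timestamp, which is how
    a reducer would see them after the shuffle with a suitable key.
    """
    for symbol, group in groupby(rows, key=itemgetter(0)):
        recent = deque(maxlen=window)  # oldest price drops out automatically
        for _symbol, ts, price in group:
            recent.append(price)
            yield symbol, ts, sum(recent) / len(recent)
```

Each yielded tuple corresponds to one row of the new results table described above.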

Page 14: Hadoop World Vertica

How to get started

> Get a copy of Hadoop from Apache or Cloudera

> Get Vertica from www.vertica.com, via Amazon or RightScale, or as a VM

> Grab the formatter and Vertica JDBC drivers from vertica.com/MapReduce

> Included in contrib from Hadoop 0.21.0 (MR-775)

> Put the jars in hadoop/lib

> Run your Hadoop/Vertica job

Page 15: Hadoop World Vertica

Future Directions and Questions

> Archiving information lifecycle (Sqoop)

> Invoking Hadoop jobs from Vertica

> Joining Vertica data mid job

> Using Vertica for (structured) transient job data

> [email protected]

> Vertica.com/MapReduce