PhillyDB Talk - Beyond Batch

43
©MapR Technologies Beyond Batch Drill & Storm Brad Anderson

description

The venerable MapReduce framework has allowed Hadoop to prove its worth in the big data space, and to store and analyze much larger data sets than was possible before. But there is a lot of activity in the big data ecosystem currently surrounding other major categories of workflows beyond batch. These emerging tools include low latency i/o (HBase), interactive queries (Drill), stream processing (Storm), and text processing / indexing (Solr). This talk discusses some of the more interesting developments in Drill and Storm, their capabilities, and how they are being put to use in real world situations.

Transcript of PhillyDB Talk - Beyond Batch

Page 1: PhillyDB Talk - Beyond Batch

©MapR Technologies

Beyond Batch

Drill & Storm

Brad Anderson

Page 2: PhillyDB Talk - Beyond Batch

©MapR Technologies

whoami

• Brad Anderson

• Solutions Architect at MapR (Atlanta)

•ATLHUG co-chair

• ‘boorad’ most places (twitter, github)

[email protected]

Page 3: PhillyDB Talk - Beyond Batch

MapRBenchmark MapR 2.1.1 CDH 4.1.1 MapR Speed

Increase

Terasort (1x replication, compression disabled)

Total 13m 35s 26m 6s 1.9x

Map 7m 58s 21m 8s 2.7x

Reduce 13m 32s 23m 37s 1.7x

DFSIO throughput/node

Read 1003 MB/s 656 MB/s 1.5x

Write 924 MB/s 654 MB/s 1.4x

YCSB (50% read, 50% update)

Throughput 36,584.4 op/s 12,500.5 op/s 2.9x

Runtime 3.80 hr 11.11 hr 2.9x

YCSB (95% read, 5% update)

Throughput 24,704.3 op/s 10,776.4 op/s 2.3x

Runtime 0.56 hr 1.29 hr 2.3x

MapR/Google

Apache Hadoop

Time 54 sec 62 sec

Nodes 1,003 1,460

Disks 1,003 5,840

Cores 4,012 11,680

Benchmark hardware configuration: 10 servers, 12 x 2 cores (2.4 GHz), 12 x 2TB, 48 GB, 1 x 10GbE

- Faster and More Scalable

Page 4: PhillyDB Talk - Beyond Batch

©MapR Technologies

Beyond Batch

HBase & M7

Apache Drill

Storm

Solr & Elastic Search

Page 5: PhillyDB Talk - Beyond Batch

©MapR Technologies

Latency Matters

Batch Interactive Streaming

Page 6: PhillyDB Talk - Beyond Batch

©MapR Technologies

Big Data PictureBatch processing Interactive analysis Stream processing

Query runtime Minutes to hours Milliseconds to minutes Never-ending

Data volume TBs to PBs GBs to PBs Continuous stream

Programming model MapReduce Queries DAG

Users Developers Analysts and Developers Developers

Google project MapReduce Dremel

Open source project Hadoop MapReduce Storm, S4

Apache Drill

Page 7: PhillyDB Talk - Beyond Batch

Interactive SQL Initiatives for Hadoop

SQL conversion to MapReduce

SQL based OLTP SQL based analytics

Impala*

Real-time interactive

queries

Real-time interactive

queries

* Does not work with other distributions

Page 8: PhillyDB Talk - Beyond Batch

©MapR Technologies

Page 9: PhillyDB Talk - Beyond Batch

©MapR Technologies

Google Dremel• Interactive analysis of large-scale datasets

• Trillion records at interactive speeds• Complementary to MapReduce• Used by thousands of Google employees• Paper published at VLDB 2010

• Model• Nested data model with schema

• Most data at Google is stored/transferred in Protocol Buffers

• SQL-like query language with nested data support

• Implementation• Column-based storage and processing• In-situ data access (GFS and Bigtable)• Tree architecture as in Web search (and databases)

Page 10: PhillyDB Talk - Beyond Batch

©MapR Technologies

Google BigQuery• Hosted Dremel (Dremel as a Service)• CLI (bq) and Web UI• Import data from Google Cloud Storage or local files

• Files must be in CSV format • Nested data not supported [yet] except built-in datasets• Schema definition required

Page 11: PhillyDB Talk - Beyond Batch

©MapR Technologies

Drill Design PrinciplesFlexible• Pluggable query languages• Extensible execution engine• Pluggable data formats

• Columns and Rows• Schema and Schema-less

• Pluggable data sources

Easy• Unzip and run• Zero configuration• Reverse DNS not needed• IP addresses can change• Clear and concise log messages

Dependable• No SPOF• Instant recovery from crashes

Fast• C/C++ core with Java support

• Google C++ style guide• Min latency and max throughput

(limited only by hardware)

Page 12: PhillyDB Talk - Beyond Batch

©MapR Technologies

DrQL ExampleDocId: 10Links Forward: 20 Forward: 40 Forward: 60Name Language Code: 'en-us' Country: 'us' Language Code: 'en' Url: 'http://A'Name Url: 'http://B'Name Language Code: 'en-gb' Country: 'gb'

SELECT DocId AS Id, COUNT(Name.Language.Code) WITHIN Name AS Cnt, Name.Url + ',' + Name.Language.Code AS StrFROM tWHERE REGEXP(Name.Url, '^http') AND DocId < 20;

Id: 10Name Cnt: 2 Language Str: 'http://A,en-us' Str: 'http://A,en'Name Cnt: 0

* Example from the Dremel paper

Page 13: PhillyDB Talk - Beyond Batch

©MapR Technologies

Data Flow

Page 14: PhillyDB Talk - Beyond Batch

©MapR Technologies

Extensibility• Nested query languages

• DrQL

• Mongo Query Language

• Cascading, Hive, Pig

• Distributed execution engine• Extensible model (eg, Dryad)

• Low-latency

• Fault tolerant

Page 15: PhillyDB Talk - Beyond Batch

©MapR Technologies

ExtensibilityNested data formats

• Pluggable model• Column-based (ColumnIO/Dremel, Trevni, RCFile) • Row-based (RecordIO, Avro, JSON, CSV)• Schema (Protocol Buffers, Avro, CSV) • Schema-less (JSON, BSON)

Scalable data sources• Pluggable model• Hadoop• HBase

Page 16: PhillyDB Talk - Beyond Batch

Drill Architecture

Public interfaces enable extensibility– Add a new query language by implementing a parser– Add a new data source by implementing an API– Provide a plan directly to the execution engine to control execution

Each level of the plan has a human readable representation– Facilitates debugging and development

Page 17: PhillyDB Talk - Beyond Batch

Drill Architecture (2)

Page 18: PhillyDB Talk - Beyond Batch

©MapR Technologies

Query Components• Query components:

• SELECT• FROM• WHERE• GROUP BY• HAVING• JOIN

• Key logical operators:• Scan• Filter• Aggregate• Join

Page 19: PhillyDB Talk - Beyond Batch

©MapR Technologies

Scan Operators

Scan with schema Scan without schema

Operator output

Protocol Buffers JSON-like (MessagePack)

Supported data formats

ColumnIO (column-based protobuf/Dremel)RecordIO (row-based protobuf)CSV

JSONHBase

SELECT …FROM …

ColumnIO(proto URI, data URI)RecordIO(proto URI, data URI)

Json(data URI)HBase(table name)

• Drill supports multiple data formats by having per-format scan operators• Queries involving multiple data formats/sources are supported

• Fields and predicates can be pushed down into the scan operator

• Scan operators may have adaptive side-effects (database cracking)• Produce ColumnIO from RecordIO• Google PowerDrill stores materialized expressions with the data

Page 20: PhillyDB Talk - Beyond Batch

©MapR Technologies

Execution Engine Layers• Drill execution engine has two layers

• Operator layer is serialization-aware• Processes individual records

• Execution layer is not serialization-aware• Processes batches of records (blobs)• Responsible for communication, dependencies and fault tolerance

Page 21: PhillyDB Talk - Beyond Batch

©MapR Technologies

Hadoop Integration• Hadoop data sources

• Hadoop FileSystem API (HDFS/MapR-FS)

• HBase

• Hadoop data formats• Apache Avro

• RCFile

• MapReduce-based tools to create column-based formats

• Table registry in HCatalog

• Run long-running services in YARN

Page 22: PhillyDB Talk - Beyond Batch

©MapR Technologies

Fully Open

Page 23: PhillyDB Talk - Beyond Batch

MomentumOver 200 people on the Drill mailing list

Over 200 members of the Bay Area Drill User Group

Over 100 participants the first meetup in Sunnyvale, CA

• MapR, Cisco, Intel, eBay, Google, Yahoo!, LinkedIn, …

Drill meetups across the US and Europe

OpenDremel team and source code merged with Apache Drill

Simba Technologies – ODBC inventor developing a Drill ODBC driver

• Tableau, MicroStrategy, Excel, SAP Crystal Reports, …

Page 24: PhillyDB Talk - Beyond Batch

©MapR Technologies

Storm

Page 25: PhillyDB Talk - Beyond Batch

©MapR Technologies

Before Storm

Queues Workers

Page 26: PhillyDB Talk - Beyond Batch

©MapR Technologies

Example

(simplified)

Page 27: PhillyDB Talk - Beyond Batch

©MapR Technologies

Storm

Guaranteed data processing

Horizontal scalability

Fault-tolerance

No intermediate message brokers!

Higher level abstraction than message passing

“Just works”

Page 28: PhillyDB Talk - Beyond Batch

©MapR Technologies

Concepts

Page 29: PhillyDB Talk - Beyond Batch

©MapR Technologies

Unbounded sequence of tuples

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Streams

Page 30: PhillyDB Talk - Beyond Batch

©MapR Technologies

Source of streams

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Spouts

Page 31: PhillyDB Talk - Beyond Batch

©MapR Technologies

public interface ISpout extends Serializable { void open(Map conf, TopologyContext context, SpoutOutputCollector collector); void close(); void nextTuple(); void ack(Object msgId); void fail(Object msgId);}

Spouts

Page 32: PhillyDB Talk - Beyond Batch

©MapR Technologies

Processes input streams and produces new streams

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Tuple Tuple Tuple Tuple

Bolts

Page 33: PhillyDB Talk - Beyond Batch

©MapR Technologies

public class DoubleAndTripleBolt extends BaseRichBolt { private OutputCollectorBase _collector;

public void prepare(Map conf, TopologyContext context, OutputCollectorBase collector) { _collector = collector; }

public void execute(Tuple input) { int val = input.getInteger(0); _collector.emit(input, new Values(val*2, val*3)); _collector.ack(input); }

public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("double", "triple")); } }

Bolts

Page 34: PhillyDB Talk - Beyond Batch

©MapR Technologies

Network of spouts and bolts

Topologies

Page 35: PhillyDB Talk - Beyond Batch

©MapR Technologies

Cascading for Storm

Trident

Page 36: PhillyDB Talk - Beyond Batch

©MapR Technologies

TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")) .parallelismHint(6);

Trident

Page 37: PhillyDB Talk - Beyond Batch

©MapR Technologies

Interoperability

Page 38: PhillyDB Talk - Beyond Batch

©MapR Technologies

Kafka (with transactions)KestrelJMSAMQP

Spouts

Page 39: PhillyDB Talk - Beyond Batch

©MapR Technologies

FunctionsFiltersAggregationJoinsTalk to databases, Hadoop write-

behind

Bolts

Page 40: PhillyDB Talk - Beyond Batch

©MapR Technologies

Hadoop

batchprocesses

Apps

Business

Value

RawData

realtime processes

Storm

Queue

Parallel Cluster Ingest

Page 41: PhillyDB Talk - Beyond Batch

Georg

Queue

©MapR Technologies

Hadoop

batchprocesses

Apps

Business

Value

RawData

realtime processes

Storm

TailSpout

Page 42: PhillyDB Talk - Beyond Batch

Georg and TailSpout

Page 43: PhillyDB Talk - Beyond Batch

©MapR Technologies

Get Involved!• Slides

• http://slideshare.net/boorad/phillydb

• Join the Apache Drill mailing list• [email protected]

• Watch TailSpout & Georg development• https://github.com/{tdunning | boorad | rlankenau}/mapr-spout

• Join MapR• [email protected][email protected]

• @boorad