PhillyDB Talk - Beyond Batch

©MapR Technologies

Beyond Batch

Drill & Storm

Brad Anderson

©MapR Technologies

whoami

• Brad Anderson

• Solutions Architect at MapR (Atlanta)

•ATLHUG co-chair

• ‘boorad’ most places (twitter, github)

• [email protected]

mailto:[email protected]

MapRBenchmark MapR 2.1.1 CDH 4.1.1 MapR Speed

Increase

Terasort (1x replication, compression disabled)

Total 13m 35s 26m 6s 1.9x

Map 7m 58s 21m 8s 2.7x

Reduce 13m 32s 23m 37s 1.7x

DFSIO throughput/node

Read 1003 MB/s 656 MB/s 1.5x

Write 924 MB/s 654 MB/s 1.4x

YCSB (50% read, 50% update)

Throughput 36,584.4 op/s 12,500.5 op/s 2.9x

Runtime 3.80 hr 11.11 hr 2.9x

YCSB (95% read, 5% update)

Throughput 24,704.3 op/s 10,776.4 op/s 2.3x

Runtime 0.56 hr 1.29 hr 2.3x

MapR/Google

Apache Hadoop

Time 54 sec 62 sec

Nodes 1,003 1,460

Disks 1,003 5,840

Cores 4,012 11,680

Benchmark hardware configuration: 10 servers, 12 x 2 cores (2.4 GHz), 12 x 2TB, 48 GB, 1 x 10GbE

- Faster and More Scalable

©MapR Technologies

Beyond Batch

HBase & M7

Apache Drill

Storm

Solr & Elastic Search

©MapR Technologies

Latency Matters

Batch Interactive Streaming

©MapR Technologies

Big Data PictureBatch processing Interactive analysis Stream processing

Query runtime Minutes to hours Milliseconds to minutes Never-ending

Data volume TBs to PBs GBs to PBs Continuous stream

Programming model MapReduce Queries DAG

Users Developers Analysts and Developers Developers

Google project MapReduce Dremel

Open source project Hadoop MapReduce Storm, S4

Apache Drill

Interactive SQL Initiatives for Hadoop

SQL conversion to MapReduce

SQL based OLTP SQL based analytics

Impala*

Real-time interactive

queries

Real-time interactive

queries

* Does not work with other distributions

©MapR Technologies

©MapR Technologies

Google Dremel• Interactive analysis of large-scale datasets

• Trillion records at interactive speeds• Complementary to MapReduce• Used by thousands of Google employees• Paper published at VLDB 2010

• Model• Nested data model with schema

• Most data at Google is stored/transferred in Protocol Buffers

• SQL-like query language with nested data support

• Implementation• Column-based storage and processing• In-situ data access (GFS and Bigtable)• Tree architecture as in Web search (and databases)

©MapR Technologies

Google BigQuery• Hosted Dremel (Dremel as a Service)• CLI (bq) and Web UI• Import data from Google Cloud Storage or local files

• Files must be in CSV format • Nested data not supported [yet] except built-in datasets• Schema definition required

©MapR Technologies

Drill Design PrinciplesFlexible• Pluggable query languages• Extensible execution engine• Pluggable data formats

• Columns and Rows• Schema and Schema-less

• Pluggable data sources

Easy• Unzip and run• Zero configuration• Reverse DNS not needed• IP addresses can change• Clear and concise log messages

Dependable• No SPOF• Instant recovery from crashes

Fast• C/C++ core with Java support

• Google C++ style guide• Min latency and max throughput

(limited only by hardware)

©MapR Technologies

DrQL ExampleDocId: 10Links Forward: 20 Forward: 40 Forward: 60Name Language Code: 'en-us' Country: 'us' Language Code: 'en' Url: 'http://A'Name Url: 'http://B'Name Language Code: 'en-gb' Country: 'gb'

SELECT DocId AS Id, COUNT(Name.Language.Code) WITHIN Name AS Cnt, Name.Url + ',' + Name.Language.Code AS StrFROM tWHERE REGEXP(Name.Url, '^http') AND DocId < 20;

Id: 10Name Cnt: 2 Language Str: 'http://A,en-us' Str: 'http://A,en'Name Cnt: 0

* Example from the Dremel paper

©MapR Technologies

Data Flow

©MapR Technologies

Extensibility• Nested query languages

• DrQL

• Mongo Query Language

• Cascading, Hive, Pig

• Distributed execution engine• Extensible model (eg, Dryad)

• Low-latency

• Fault tolerant

©MapR Technologies

ExtensibilityNested data formats

• Pluggable model• Column-based (ColumnIO/Dremel, Trevni, RCFile) • Row-based (RecordIO, Avro, JSON, CSV)• Schema (Protocol Buffers, Avro, CSV) • Schema-less (JSON, BSON)

Scalable data sources• Pluggable model• Hadoop• HBase

Drill Architecture

Public interfaces enable extensibility– Add a new query language by implementing a parser– Add a new data source by implementing an API– Provide a plan directly to the execution engine to control execution

Each level of the plan has a human readable representation– Facilitates debugging and development

Drill Architecture (2)

©MapR Technologies

Query Components• Query components:

• SELECT• FROM• WHERE• GROUP BY• HAVING• JOIN

• Key logical operators:• Scan• Filter• Aggregate• Join

©MapR Technologies

Scan Operators

Scan with schema Scan without schema

Operator output

Protocol Buffers JSON-like (MessagePack)

Supported data formats

ColumnIO (column-based protobuf/Dremel)RecordIO (row-based protobuf)CSV

JSONHBase

SELECT …FROM …

ColumnIO(proto URI, data URI)RecordIO(proto URI, data URI)

Json(data URI)HBase(table name)

• Drill supports multiple data formats by having per-format scan operators• Queries involving multiple data formats/sources are supported

• Fields and predicates can be pushed down into the scan operator

• Scan operators may have adaptive side-effects (database cracking)• Produce ColumnIO from RecordIO• Google PowerDrill stores materialized expressions with the data

©MapR Technologies

Execution Engine Layers• Drill execution engine has two layers

• Operator layer is serialization-aware• Processes individual records

• Execution layer is not serialization-aware• Processes batches of records (blobs)• Responsible for communication, dependencies and fault tolerance

©MapR Technologies

Hadoop Integration• Hadoop data sources

• Hadoop FileSystem API (HDFS/MapR-FS)

• HBase

• Hadoop data formats• Apache Avro

• RCFile

• MapReduce-based tools to create column-based formats

• Table registry in HCatalog

• Run long-running services in YARN

©MapR Technologies

Fully Open

MomentumOver 200 people on the Drill mailing list

Over 200 members of the Bay Area Drill User Group

Over 100 participants the first meetup in Sunnyvale, CA

• MapR, Cisco, Intel, eBay, Google, Yahoo!, LinkedIn, …

Drill meetups across the US and Europe

OpenDremel team and source code merged with Apache Drill

Simba Technologies – ODBC inventor developing a Drill ODBC driver

• Tableau, MicroStrategy, Excel, SAP Crystal Reports, …

©MapR Technologies

Storm

©MapR Technologies

Before Storm

Queues Workers

©MapR Technologies

Example

(simplified)

©MapR Technologies

Storm

Guaranteed data processing

Horizontal scalability

Fault-tolerance

No intermediate message brokers!

Higher level abstraction than message passing

“Just works”

©MapR Technologies

Concepts

©MapR Technologies

Unbounded sequence of tuples

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Streams

©MapR Technologies

public interface ISpout extends Serializable { void open(Map conf, TopologyContext context, SpoutOutputCollector collector); void close(); void nextTuple(); void ack(Object msgId); void fail(Object msgId);}

Spouts

©MapR Technologies

public class DoubleAndTripleBolt extends BaseRichBolt { private OutputCollectorBase _collector;

public void prepare(Map conf, TopologyContext context, OutputCollectorBase collector) { _collector = collector; }

public void execute(Tuple input) { int val = input.getInteger(0); _collector.emit(input, new Values(val*2, val*3)); _collector.ack(input); }

public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("double", "triple")); } }

Bolts

©MapR Technologies

TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")) .parallelismHint(6);

Trident

©MapR Technologies

Hadoop

batchprocesses

Apps

Business

Value

RawData

realtime processes

Storm

Queue

Parallel Cluster Ingest

Georg

Queue

©MapR Technologies

Hadoop

batchprocesses

Apps

Business

Value

RawData

realtime processes

Storm

TailSpout

Georg and TailSpout

©MapR Technologies

Get Involved!• Slides

• http://slideshare.net/boorad/phillydb

• Join the Apache Drill mailing list• [email protected]

• Watch TailSpout & Georg development• https://github.com/{tdunning | boorad | rlankenau}/mapr-spout

• Join MapR• [email protected]• [email protected]

• @boorad

http://slideshare.net/boorad/phillydb

http://slideshare.net/boorad/phillydb


https://github.com/boorad/mapr-spout





PhillyDB Talk - Beyond Batch

Technology

Transcript of PhillyDB Talk - Beyond Batch