PhillyDB Talk - Beyond Batch
-
Upload
boorad -
Category
Technology
-
view
117 -
download
5
description
Transcript of PhillyDB Talk - Beyond Batch
©MapR Technologies
Beyond Batch
Drill & Storm
Brad Anderson
©MapR Technologies
whoami
• Brad Anderson
• Solutions Architect at MapR (Atlanta)
•ATLHUG co-chair
• ‘boorad’ most places (twitter, github)
MapRBenchmark MapR 2.1.1 CDH 4.1.1 MapR Speed
Increase
Terasort (1x replication, compression disabled)
Total 13m 35s 26m 6s 1.9x
Map 7m 58s 21m 8s 2.7x
Reduce 13m 32s 23m 37s 1.7x
DFSIO throughput/node
Read 1003 MB/s 656 MB/s 1.5x
Write 924 MB/s 654 MB/s 1.4x
YCSB (50% read, 50% update)
Throughput 36,584.4 op/s 12,500.5 op/s 2.9x
Runtime 3.80 hr 11.11 hr 2.9x
YCSB (95% read, 5% update)
Throughput 24,704.3 op/s 10,776.4 op/s 2.3x
Runtime 0.56 hr 1.29 hr 2.3x
MapR/Google
Apache Hadoop
Time 54 sec 62 sec
Nodes 1,003 1,460
Disks 1,003 5,840
Cores 4,012 11,680
Benchmark hardware configuration: 10 servers, 12 x 2 cores (2.4 GHz), 12 x 2TB, 48 GB, 1 x 10GbE
- Faster and More Scalable
©MapR Technologies
Beyond Batch
HBase & M7
Apache Drill
Storm
Solr & Elastic Search
©MapR Technologies
Latency Matters
Batch Interactive Streaming
©MapR Technologies
Big Data PictureBatch processing Interactive analysis Stream processing
Query runtime Minutes to hours Milliseconds to minutes Never-ending
Data volume TBs to PBs GBs to PBs Continuous stream
Programming model MapReduce Queries DAG
Users Developers Analysts and Developers Developers
Google project MapReduce Dremel
Open source project Hadoop MapReduce Storm, S4
Apache Drill
Interactive SQL Initiatives for Hadoop
SQL conversion to MapReduce
SQL based OLTP SQL based analytics
Impala*
Real-time interactive
queries
Real-time interactive
queries
* Does not work with other distributions
©MapR Technologies
©MapR Technologies
Google Dremel• Interactive analysis of large-scale datasets
• Trillion records at interactive speeds• Complementary to MapReduce• Used by thousands of Google employees• Paper published at VLDB 2010
• Model• Nested data model with schema
• Most data at Google is stored/transferred in Protocol Buffers
• SQL-like query language with nested data support
• Implementation• Column-based storage and processing• In-situ data access (GFS and Bigtable)• Tree architecture as in Web search (and databases)
©MapR Technologies
Google BigQuery• Hosted Dremel (Dremel as a Service)• CLI (bq) and Web UI• Import data from Google Cloud Storage or local files
• Files must be in CSV format • Nested data not supported [yet] except built-in datasets• Schema definition required
©MapR Technologies
Drill Design PrinciplesFlexible• Pluggable query languages• Extensible execution engine• Pluggable data formats
• Columns and Rows• Schema and Schema-less
• Pluggable data sources
Easy• Unzip and run• Zero configuration• Reverse DNS not needed• IP addresses can change• Clear and concise log messages
Dependable• No SPOF• Instant recovery from crashes
Fast• C/C++ core with Java support
• Google C++ style guide• Min latency and max throughput
(limited only by hardware)
©MapR Technologies
DrQL ExampleDocId: 10Links Forward: 20 Forward: 40 Forward: 60Name Language Code: 'en-us' Country: 'us' Language Code: 'en' Url: 'http://A'Name Url: 'http://B'Name Language Code: 'en-gb' Country: 'gb'
SELECT DocId AS Id, COUNT(Name.Language.Code) WITHIN Name AS Cnt, Name.Url + ',' + Name.Language.Code AS StrFROM tWHERE REGEXP(Name.Url, '^http') AND DocId < 20;
Id: 10Name Cnt: 2 Language Str: 'http://A,en-us' Str: 'http://A,en'Name Cnt: 0
* Example from the Dremel paper
©MapR Technologies
Data Flow
©MapR Technologies
Extensibility• Nested query languages
• DrQL
• Mongo Query Language
• Cascading, Hive, Pig
• Distributed execution engine• Extensible model (eg, Dryad)
• Low-latency
• Fault tolerant
©MapR Technologies
ExtensibilityNested data formats
• Pluggable model• Column-based (ColumnIO/Dremel, Trevni, RCFile) • Row-based (RecordIO, Avro, JSON, CSV)• Schema (Protocol Buffers, Avro, CSV) • Schema-less (JSON, BSON)
Scalable data sources• Pluggable model• Hadoop• HBase
Drill Architecture
Public interfaces enable extensibility– Add a new query language by implementing a parser– Add a new data source by implementing an API– Provide a plan directly to the execution engine to control execution
Each level of the plan has a human readable representation– Facilitates debugging and development
Drill Architecture (2)
©MapR Technologies
Query Components• Query components:
• SELECT• FROM• WHERE• GROUP BY• HAVING• JOIN
• Key logical operators:• Scan• Filter• Aggregate• Join
©MapR Technologies
Scan Operators
Scan with schema Scan without schema
Operator output
Protocol Buffers JSON-like (MessagePack)
Supported data formats
ColumnIO (column-based protobuf/Dremel)RecordIO (row-based protobuf)CSV
JSONHBase
SELECT …FROM …
ColumnIO(proto URI, data URI)RecordIO(proto URI, data URI)
Json(data URI)HBase(table name)
• Drill supports multiple data formats by having per-format scan operators• Queries involving multiple data formats/sources are supported
• Fields and predicates can be pushed down into the scan operator
• Scan operators may have adaptive side-effects (database cracking)• Produce ColumnIO from RecordIO• Google PowerDrill stores materialized expressions with the data
©MapR Technologies
Execution Engine Layers• Drill execution engine has two layers
• Operator layer is serialization-aware• Processes individual records
• Execution layer is not serialization-aware• Processes batches of records (blobs)• Responsible for communication, dependencies and fault tolerance
©MapR Technologies
Hadoop Integration• Hadoop data sources
• Hadoop FileSystem API (HDFS/MapR-FS)
• HBase
• Hadoop data formats• Apache Avro
• RCFile
• MapReduce-based tools to create column-based formats
• Table registry in HCatalog
• Run long-running services in YARN
©MapR Technologies
Fully Open
MomentumOver 200 people on the Drill mailing list
Over 200 members of the Bay Area Drill User Group
Over 100 participants the first meetup in Sunnyvale, CA
• MapR, Cisco, Intel, eBay, Google, Yahoo!, LinkedIn, …
Drill meetups across the US and Europe
OpenDremel team and source code merged with Apache Drill
Simba Technologies – ODBC inventor developing a Drill ODBC driver
• Tableau, MicroStrategy, Excel, SAP Crystal Reports, …
©MapR Technologies
Storm
©MapR Technologies
Before Storm
Queues Workers
©MapR Technologies
Example
(simplified)
©MapR Technologies
Storm
Guaranteed data processing
Horizontal scalability
Fault-tolerance
No intermediate message brokers!
Higher level abstraction than message passing
“Just works”
©MapR Technologies
Concepts
©MapR Technologies
Unbounded sequence of tuples
Tuple Tuple Tuple Tuple Tuple Tuple Tuple
Streams
©MapR Technologies
Source of streams
Tuple Tuple Tuple Tuple Tuple Tuple Tuple
Tuple Tuple Tuple Tuple Tuple Tuple Tuple
Spouts
©MapR Technologies
public interface ISpout extends Serializable { void open(Map conf, TopologyContext context, SpoutOutputCollector collector); void close(); void nextTuple(); void ack(Object msgId); void fail(Object msgId);}
Spouts
©MapR Technologies
Processes input streams and produces new streams
Tuple Tuple Tuple Tuple Tuple Tuple Tuple
Tuple Tuple Tuple Tuple Tuple Tuple Tuple
Tuple Tuple Tuple Tuple
Bolts
©MapR Technologies
public class DoubleAndTripleBolt extends BaseRichBolt { private OutputCollectorBase _collector;
public void prepare(Map conf, TopologyContext context, OutputCollectorBase collector) { _collector = collector; }
public void execute(Tuple input) { int val = input.getInteger(0); _collector.emit(input, new Values(val*2, val*3)); _collector.ack(input); }
public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("double", "triple")); } }
Bolts
©MapR Technologies
Network of spouts and bolts
Topologies
©MapR Technologies
Cascading for Storm
Trident
©MapR Technologies
TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")) .parallelismHint(6);
Trident
©MapR Technologies
Interoperability
©MapR Technologies
Kafka (with transactions)KestrelJMSAMQP
Spouts
©MapR Technologies
FunctionsFiltersAggregationJoinsTalk to databases, Hadoop write-
behind
Bolts
©MapR Technologies
Hadoop
batchprocesses
Apps
Business
Value
RawData
realtime processes
Storm
Queue
Parallel Cluster Ingest
Georg
Queue
©MapR Technologies
Hadoop
batchprocesses
Apps
Business
Value
RawData
realtime processes
Storm
TailSpout
Georg and TailSpout
©MapR Technologies
Get Involved!• Slides
• http://slideshare.net/boorad/phillydb
• Join the Apache Drill mailing list• [email protected]
• Watch TailSpout & Georg development• https://github.com/{tdunning | boorad | rlankenau}/mapr-spout
• Join MapR• [email protected]• [email protected]
• @boorad