Real-time Streaming Analysis - O'Reilly Media - Technology Books
Real-time Streaming Analysis for Hadoop and Flume
Aaron Kimball
odiago, inc.
OSCON Data 2011
The plan
• Background: Flume introduction
• The need for online analytics
• Introducing FlumeBase
• Demo!
• FlumeBase architecture
• Wrap up
Flume is…
• A distributed data transport and aggregation system for event- or log-structured data
• Principally designed for continuous data ingestion into Hadoop… But more flexible than that
[Diagram: data origins such as click streams (a.k.a. Flume “agents”) feed a Flume “collector” node, which writes aggregate data to HDFS]
Flume terminology
• Every machine in Flume is a “node”
• Each node has a “source” and a “sink”
– Example source: tail(“/var/log/httpd/access_log”)
– Example sink: dfs(“hdfs://namenode/logs/%{host}/%Y%M%D”)
• Some sinks send data to “collector” nodes, which aggregate data from many agents before writing to HDFS
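The source/sink pairing above can be sketched in a few lines of Python. This is only an illustration of the terminology, not Flume's API: the `Node` class, the in-memory lists standing in for the collector's queue and for HDFS, and the sample log lines are all assumptions made for the example.

```python
class Node:
    """A node pairs a source (an iterable of events) with a sink (a callable)."""

    def __init__(self, source, sink):
        self.source = source
        self.sink = sink

    def run(self):
        # Push every event produced by the source into the sink.
        for event in self.source:
            self.sink(event)


hdfs = []       # hypothetical stand-in for files written to HDFS
collected = []  # events that agents have delivered to the collector

# Two agents, each tailing its own (made-up) log, both sinking to the collector.
agents = [
    Node(["GET /a", "GET /b"], collected.append),
    Node(["GET /c"], collected.append),
]
for agent in agents:
    agent.run()

# The collector aggregates events from all agents before writing them out.
collector = Node(collected, hdfs.append)
collector.run()
```

After the run, `hdfs` holds the events from both agents in arrival order.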
Flume control plane
• All Flume nodes heartbeat to the master and receive their configuration from it
• Operator tools interact with the master via a Thrift API
– e.g., the Flume shell
• Nodes can be reconfigured to use different sources, sinks
[Diagram: agents, collector, and HDFS, with the Flume master alongside; the operator reaches the master through the Thrift API]
Real-time data moves through Flume
• Events enter Flume within seconds of generation
• Hadoop MapReduce analysis runs at best once every 10 minutes
• Desirable behavior: analyze this data on-the-fly
– Ad campaign cut-off
– Real-time personalization, recommendations
– Load and performance monitoring
– Error alerting
But Flume isn’t an analytic system
• No ability to inspect message bodies
• No notion of aggregates, rolling counters, etc
– … or even filters
• This leads to fascinating hacks (see right)
[Diagram labels: Agents, Collector, HDFS, Web backend, clickstream, mysql, Real-time aggregates]
Flume and Flexibility
• New sources, sinks can be added from plugins
• Flume topology can be dynamically reconfigured by sending commands to master over Thrift API
• Contents of Flume events (messages) are uninterpreted
• …Meaning we can define new endpoints for Flume data, store arbitrary data in events, and control Flume programmatically.
FlumeBase: Online Analytics for Flume
FlumeBase server
• Runs persistent queries analyzing data streams
– Events interpreted relative to a user-specified schema, parser
• Transparently reconfigures source Flume nodes to tee data
• Acts as a Flume node
– Output events are just another Flume data stream
rtsql: FlumeBase’s query language
• SQL-like language for defining event schemas, queries
CREATE STREAM foo(status INT, msg STRING, priority INT)
  FROM NODE 'backend-server-5';
SELECT * FROM foo WHERE priority > 10;
rtsql language features
• Lots of standard SQL features available
– SELECT, WHERE, GROUP BY, HAVING, JOIN…
• Streams are infinite, so GROUP BY and JOIN both operate over rolling time windows of events
– Standard aggregate functions: COUNT, MIN, MAX, SUM, AVG
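To make the windowing idea concrete, here is a minimal Python sketch of a rolling COUNT over a time window. It is not FlumeBase's implementation: the `RollingCount` class, the window boundary (events older than `window_secs` relative to the newest event are evicted), and the assumption that timestamps arrive in non-decreasing order are all choices made for this example.

```python
from collections import deque


class RollingCount:
    """Counts events whose timestamps fall within the last `window_secs` seconds."""

    def __init__(self, window_secs):
        self.window_secs = window_secs
        self.timestamps = deque()  # event timestamps, oldest first

    def add(self, ts):
        """Record an event at time `ts` (seconds) and return the current count.

        Assumes timestamps are passed in non-decreasing order.
        """
        self.timestamps.append(ts)
        # Evict events that have fallen out of the window ending at `ts`.
        while self.timestamps[0] <= ts - self.window_secs:
            self.timestamps.popleft()
        return len(self.timestamps)


counter = RollingCount(window_secs=10)
counts = [counter.add(ts) for ts in (0, 3, 9, 12, 25)]
print(counts)  # [1, 2, 3, 3, 1]
```

Because old events are evicted as new ones arrive, memory stays bounded by the number of events in one window even though the stream never ends.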
Demo time
• (Buckle your seatbelts)
Under the hood…
[Diagram: a client process submits queries to the FlumeBase server and prints output data to its console. Inside the server: the rtsql compiler (generates a “flow”), the stream dictionary, the Flume controller, a Flume “in” node (incoming data from the Flume network) feeding the EventParser and the flow operator DAG, and a Flume “out” node (emitted records return to the Flume network).]
Life of a query
• Clients submit rtsql queries as simple strings to server
• Compiler parses query to an AST, generates a logical plan (DAG), and maps that to a DAG of physical operators (“HashJoin”, “Filter”, etc)
• Physical operators form a “flow”, which is injected into the execution thread, where it continuously reads from input and processes events
• Many flows (queries) may run in parallel.
• Flows must be explicitly dropped when they’re no longer useful
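The idea of a flow as a chain of physical operators can be sketched with Python generators. This is an illustration only: the operator names `filter_op` and `project_op`, the hand-built plan in `build_flow`, and the sample events are hypothetical, and a real compiler would derive the plan from the parsed query rather than hard-code it. The column projection is also an assumption; the slides only show `SELECT *`.

```python
def filter_op(predicate, upstream):
    """Physical operator: pass through only events matching `predicate`."""
    for event in upstream:
        if predicate(event):
            yield event


def project_op(fields, upstream):
    """Physical operator: keep only the named fields of each event."""
    for event in upstream:
        yield {name: event[name] for name in fields}


def build_flow(source):
    """Hand-built plan, roughly: select msg and priority where priority > 10."""
    filtered = filter_op(lambda e: e["priority"] > 10, source)
    return project_op(["msg", "priority"], filtered)


# Made-up events matching the foo(status, msg, priority) schema from the slides.
events = [
    {"status": 200, "msg": "ok", "priority": 1},
    {"status": 500, "msg": "disk full", "priority": 20},
    {"status": 503, "msg": "overloaded", "priority": 15},
]
flow = build_flow(iter(events))
results = list(flow)
```

Each operator pulls from the one upstream of it, so events are processed one at a time as they arrive. With an unbounded source the flow would keep running until something stops it, which matches the need to drop flows explicitly.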
Schemas, types and serialization
• Event data can enter FlumeBase in any format
• Each stream has:
– A schema, specifying which fields it has and their types
– An EventParser, which can extract fields from the input event
• Data is internally represented in Avro generic records
• Output events have Avro binary-encoded records for bodies
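As a rough illustration of how a schema and a parser work together, the Python sketch below turns a raw event body into typed fields. The `DelimitedEventParser` class, the comma-delimited input format, and the use of a plain dict in place of an Avro generic record are all assumptions made for this example, not FlumeBase's actual classes.

```python
# Schema for the stream foo(status INT, msg STRING, priority INT) from the slides,
# expressed as (field name, Python type) pairs.
SCHEMA = [("status", int), ("msg", str), ("priority", int)]


class DelimitedEventParser:
    """Hypothetical parser: splits a delimited event body into typed fields."""

    def __init__(self, schema, delimiter=","):
        self.schema = schema
        self.delimiter = delimiter

    def parse(self, body):
        """Convert raw bytes into a dict keyed by field name."""
        parts = body.decode("utf-8").split(self.delimiter)
        if len(parts) != len(self.schema):
            raise ValueError("expected %d fields, got %d" % (len(self.schema), len(parts)))
        # Apply each field's type to the corresponding raw string.
        return {name: typ(raw.strip()) for (name, typ), raw in zip(self.schema, parts)}


parser = DelimitedEventParser(SCHEMA)
record = parser.parse(b"500, disk full, 20")
```

Keeping the parser separate from the schema is what lets event data arrive in any format: a different input format only needs a different parser, while queries continue to see the same typed fields.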
Interacting with Flume
• CREATE STREAM defines a schema that could be applied to the output of a Flume node or source
• Submitting a query against that stream requires reading from Flume
– The Flume controller reconfigures the upstream node to send data to FlumeBase, or hosts a new source locally
• Dropping a query restores the upstream node’s original configuration
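The reconfigure-then-restore behaviour can be sketched as a Python context manager. Everything here is hypothetical: `FakeMaster` stands in for the Flume master, node configurations are reduced to a source label plus a list of sink labels, and `tee_to_flumebase` is a made-up helper; the real controller talks to the master over its Thrift API.

```python
from contextlib import contextmanager


class FakeMaster:
    """Hypothetical stand-in for the Flume master's table of node configs."""

    def __init__(self):
        self.configs = {}

    def configure(self, node, source, sinks):
        self.configs[node] = (source, list(sinks))


@contextmanager
def tee_to_flumebase(master, node):
    """While a query runs, also send the node's data to FlumeBase."""
    source, sinks = master.configs[node]
    original = (source, list(sinks))  # remember the configuration to restore
    master.configure(node, source, sinks + ["flumebase"])
    try:
        yield
    finally:
        # Dropping the query puts the original configuration back.
        master.configure(node, *original)


master = FakeMaster()
master.configure("backend-server-5", "access-log-tail", ["collector"])
with tee_to_flumebase(master, "backend-server-5"):
    during = master.configs["backend-server-5"]
after = master.configs["backend-server-5"]
```

Saving the original configuration before changing it, and restoring it in a `finally` clause, means the upstream node returns to its earlier state even if the query fails partway through.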
FlumeBase components and processes
• FlumeBase abstracts the “server” concept into an ExecEnvironment
• Everything can run in a single process: client shell, ExecEnvironment, even Flume nodes and master
• It is better to leave a long-lived FlumeBase server running and connect clients as needed to examine output and to submit or modify queries
Conclusions
• Real-time analytics require a different system than Hadoop MapReduce
• Flume provides a suitable basis for an online analytic system
• SQL-like language allows sophisticated queries with a low learning curve
Check it out!
• Web site: flumebase.org (docs, blog, etc.)
– Binary release is “batteries included” with a data set + walkthrough
• Get the source: github.com/flumebase/flumebase
• 100% Apache 2.0 licensed – contributors welcome!
Thanks for listening!