Learning Stream Processing with Apache Storm
Eugene Dvorkin
Category: Engineering
CONTACT ME @edvorkin
real-time medical news from curated Twitter feed
Every second, on average, around 6,000 tweets are tweeted on Twitter, which corresponds to over 350,000 tweets sent per minute, 500 million tweets per day
A 1% sample of 350,000 tweets per minute is 3,500 tweets per minute
• How to scale
• How to deal with failures
• What to do with failed messages
• A lot of infrastructure concerns
• Complexity
• Tedious coding
*Image credit: Nathan Marz, SlideShare: Storm
Inherently BATCH-Oriented System
• Exponential rise in real-time data
• New business opportunity
• Economics of OSS and commodity hardware
Stream processing has emerged as a key use case*
*Source: Discover HDP2.1: Apache Storm for Stream Data Processing. Hortonworks. 2014
• Detecting fraud while someone is swiping a credit card
• Placing an ad on a website while someone is reading a specific article
• Alerts on application and machine failures
• Using stream processing in a batch-oriented fashion
Created by Nathan Marz
Acquired by Twitter
Apache Incubator Project
Open sourced
Part of Hortonworks HDP2 platform
Top Level Apache Project
Most mature, widely adopted framework
Source: http://storm.incubator.apache.org/
Process an endless stream of data.
1M+ messages / sec on a 10-15 node cluster
Guaranteed message
processing
Tuples, Streams, Spouts, Bolts and Topologies
TUPLE
Storm data type: an immutable list of key/value pairs of any data type
word: “Hello” | count: 25 | frequency: 0.25
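In code, a component declares the names of the fields its tuples carry once, then emits `Values` objects that line up with those names. A minimal sketch (the field names mirror the example above; `collector` is assumed to be provided by Storm at runtime):

```java
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class TupleSketch {
    // Declaring the schema of the tuples this component emits
    // (done once, in declareOutputFields):
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count", "frequency"));
    }

    // Emitting a tuple whose values line up positionally with the
    // declared fields (Values is just a List<Object>):
    // collector.emit(new Values("Hello", 25, 0.25));
}
```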
STREAM
Unbounded sequence of tuples between nodes
SPOUT
The Source of the Stream
Reads from a stream of data: queues, web logs, API calls, databases
Spout responsibilities
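A spout's responsibilities map onto a few lifecycle methods: `open` for initialization, `nextTuple` to emit the next tuple into the stream, and `declareOutputFields` for the schema. A minimal sketch of a hypothetical sentence-emitting spout (the sentence data is illustrative):

```java
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] sentences = { "Two households, both alike in dignity" };
    private int index = 0;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector; // called once when the spout task starts
    }

    @Override
    public void nextTuple() {
        // Storm calls this in a loop; emit the next sentence into the stream.
        collector.emit(new Values(sentences[index]));
        index = (index + 1) % sentences.length;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}
```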
BOLT
• Process tuples and perform actions: calculations, API calls, DB calls
• Produce new output stream based on computations
• A topology is a network of spouts and bolts
• Defines data flow
• May have multiple spouts
• Each spout and bolt may have many instances that perform all the processing in parallel
How tuples are sent between instances of spouts and bolts
• Shuffle grouping: random distribution of tuples across bolt tasks.
• Fields grouping: routes tuples to a bolt task based on the value of the field; the same value always routes to the same task.
• All grouping: replicates the tuple stream across all the bolt tasks; each task receives a copy of the tuple.
• Global grouping: routes all tuples in the stream to a single task. Should be used with caution.
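The grouping is chosen when the topology is wired up: each `setBolt` call picks a grouping on its input component. A sketch, assuming the word-count components named elsewhere in this deck (`SentenceSpout`, `SplitSentenceBolt`, `WordCountBolt`, `PrinterBolt`) exist:

```java
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentence-spout", new SentenceSpout(), 1);

// Shuffle grouping: random, even distribution across bolt tasks
builder.setBolt("split-bolt", new SplitSentenceBolt(), 4)
       .shuffleGrouping("sentence-spout");

// Fields grouping: the same "word" value always goes to the same task
builder.setBolt("count-bolt", new WordCountBolt(), 4)
       .fieldsGrouping("split-bolt", new Fields("word"));

// Global grouping: everything funnels into a single task
builder.setBolt("printer-bolt", new PrinterBolt(), 1)
       .globalGrouping("count-bolt");
```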
compile 'org.apache.storm:storm-core:0.9.2'
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-core</artifactId>
<version>0.9.2</version>
</dependency>
Word count example: "Two households, both alike in dignity"
1. The sentence is split into words: Two | Households | Both | alike | in | dignity
2. Each word is emitted with a count: Two 1, Households 1, Both 1, Alike 1, In 1, Dignity 1
3. Final counts after the whole stream: Two 20, Households 24, Both 22, Alike 1, In 1, Dignity 10
Data Source
SplitSentenceBolt
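A minimal sketch of what the SplitSentenceBolt could look like, using `BaseBasicBolt` (which acks tuples automatically after `execute` returns):

```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Splits each incoming sentence into individual words.
public class SplitSentenceBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String sentence = input.getStringByField("sentence");
        for (String word : sentence.split("\\s+")) {
            collector.emit(new Values(word)); // one tuple per word
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```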
Resource initialization
WordCountBolt
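A sketch of the WordCountBolt. The in-memory map is created in `prepare` (the "resource initialization" step above), and keeping state locally is safe only because the fields grouping sends a given word to the same task every time:

```java
import java.util.HashMap;
import java.util.Map;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Keeps a running count per word.
public class WordCountBolt extends BaseBasicBolt {
    private Map<String, Long> counts;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        counts = new HashMap<String, Long>(); // resource initialization
    }

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String word = input.getStringByField("word");
        Long count = counts.get(word);
        count = (count == null) ? 1L : count + 1;
        counts.put(word, count);
        collector.emit(new Values(word, count)); // emit the updated count
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
```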
PrinterBolt
Linking it all together
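Linking it all together could look like the sketch below, in a hypothetical `WordCountTopology` main class, run on an in-process `LocalCluster` for development (the component names and parallelism numbers are illustrative):

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new SentenceSpout(), 1);
        builder.setBolt("split", new SplitSentenceBolt(), 2).shuffleGrouping("spout");
        builder.setBolt("count", new WordCountBolt(), 2).fieldsGrouping("split", new Fields("word"));
        builder.setBolt("print", new PrinterBolt(), 1).globalGrouping("count");

        Config conf = new Config();
        LocalCluster cluster = new LocalCluster(); // in-process cluster for development
        cluster.submitTopology("word-count", conf, builder.createTopology());
        Thread.sleep(10000); // let it run briefly, then shut down
        cluster.shutdown();
    }
}
```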
How to scale stream processing
Storm main components
• Nodes: machines in a Storm cluster
• Workers: JVM processes running on a node; one or more per node
• Executors: Java threads running within a worker JVM process
• Tasks: instances of spouts and bolts
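These layers are controlled through the topology configuration; a sketch (all numbers are illustrative):

```java
import backtype.storm.Config;
import backtype.storm.topology.TopologyBuilder;

public class ParallelismSketch {
    public static void main(String[] args) {
        Config conf = new Config();
        conf.setNumWorkers(2); // two worker JVM processes for this topology

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new SentenceSpout(), 1);
        // parallelism hint = number of executors (threads)
        builder.setBolt("split", new SplitSentenceBolt(), 4) // 4 executors
               .setNumTasks(8)                               // 8 tasks (instances)
               .shuffleGrouping("spout");
    }
}
```

With 8 tasks over 4 executors, each executor thread runs 2 bolt instances; tasks can later be rebalanced across more executors without changing the routing of a fields grouping.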
How tuples are sent between instances of spouts and bolts
Tuple tree
Reliable vs unreliable topologies
Methods from ISpout interface
Reliability in Bolts
Anchoring, Ack, Fail
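With `BaseRichBolt`, anchoring, ack, and fail are all explicit; a sketch of a reliable version of the sentence-splitting bolt:

```java
import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ReliableSplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            for (String word : input.getStringByField("sentence").split("\\s+")) {
                // Anchoring: passing `input` links the emitted tuple to it,
                // making the new tuple part of the input's tuple tree.
                collector.emit(input, new Values(word));
            }
            collector.ack(input);  // mark the input tuple as fully processed
        } catch (Exception e) {
            collector.fail(input); // trigger a replay from the spout
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```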
Unit testing Storm components
BDD style of testing
Extending OutputCollector
Physical View
deploying topology to a cluster
storm jar wordcount-1.0.jar com.demo.storm.WordCountTopology word-count-topology
Monitoring and performance tuning
Run Storm daemons under supervision: Monit, supervisord
• If a node dies, Nimbus moves its work to another node
• If a worker dies, the Supervisor restarts it
Micro-Batch Stream Processing
Functions, Filters, aggregations, joins, grouping
Ordered batches of tuples. Batches can be partitioned.
Similar to Pig or Cascading
Transactional spouts
Trident has first class abstraction for reading and writing to stateful sources
Stream processed in small batches
• Each batch has a unique ID, which is always the same on each replay
• If one tuple fails, the whole batch is reprocessed
• Higher throughput than core Storm, but higher latency as well
How does Trident provide exactly-once semantics?
Store the count along with the batch ID:
COUNT 100, BATCHID 1
After 10 more tuples with batchId 2:
COUNT 110, BATCHID 2
Failure: batch 2 is replayed with the same batchId (2); since the state already records batchId 2, the replay is detected and the count is not applied twice.
• The spout should replay a batch exactly as it was played before
• The Trident API hides the complexity of dealing with batch IDs
Word count with Trident
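The Trident word count looks roughly like this (`spout` is assumed to be defined elsewhere; `MemoryMapState` is an in-memory state suitable for demos, not production):

```java
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

// Trident function that splits sentences into words.
class Split extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        for (String word : tuple.getString(0).split("\\s+")) {
            collector.emit(new Values(word));
        }
    }
}

// Building the topology:
// TridentTopology topology = new TridentTopology();
// topology.newStream("sentences", spout)
//         .each(new Fields("sentence"), new Split(), new Fields("word"))
//         .groupBy(new Fields("word"))
//         // persistentAggregate stores counts in state, batch ID included,
//         // which is what gives the exactly-once behavior described above
//         .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
```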
Styles of computation
Enhancing Twitter feed with lead Image and Title
• Readability enhancements
• Image scaling
• Remove duplicates
• Custom business logic
Writing a Twitter spout
• Use the Twitter4J Java library (Status objects)
• Or use an existing spout from the Storm contrib project on GitHub
• Spouts exist for: Twitter, Kafka, JMS, RabbitMQ, Amazon SQS, Kinesis, MongoDB…
• Storm takes care of scalability and fault tolerance
• What happens if there is a burst in traffic?
Introducing Queuing Layer with Kafka
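With the storm-kafka module, a Kafka-backed spout can be configured roughly like this (the ZooKeeper host, topic, and id values are illustrative):

```java
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

public class KafkaSpoutSketch {
    public static void main(String[] args) {
        // Broker discovery goes through ZooKeeper.
        ZkHosts zkHosts = new ZkHosts("localhost:2181");
        SpoutConfig spoutConfig = new SpoutConfig(
                zkHosts,
                "tweets",        // Kafka topic to consume
                "/kafka-storm",  // ZooKeeper root for consumer offsets
                "twitter-feed"); // consumer id

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 2);
    }
}
```

Kafka buffers the burst: producers write at whatever rate traffic arrives, while the topology consumes at its own pace and catches up afterwards.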
Solr Indexing
Processing Groovy rules (DSL) at scale in real time
Statsd and Storm Metrics API
http://www.michael-noll.com/blog/2013/11/06/sending-metrics-from-storm-to-graphite/
• Use cache if you can: for example Google Guava caching utilities
• In memory DB
• Tick tuples (for batch updates)
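Tick tuples are requested per component and then detected inside `execute`; a sketch of the two pieces involved (the 10-second frequency is illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import backtype.storm.Config;
import backtype.storm.Constants;
import backtype.storm.tuple.Tuple;

public class TickTupleSketch {
    // Ask Storm to send this bolt a tick tuple every 10 seconds
    // (override this method in the bolt class):
    public Map<String, Object> getComponentConfiguration() {
        Map<String, Object> conf = new HashMap<String, Object>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10);
        return conf;
    }

    // In execute(), detect ticks and flush the batched DB updates:
    private static boolean isTickTuple(Tuple tuple) {
        return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
            && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
    }
}
```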
• Linear classification (Perceptron, Passive-Aggressive, Winnow, AROW)
• Linear regression (Perceptron, Passive-Aggressive)
• Clustering (KMeans)
• Feature scaling (standardization, normalization)
• Text feature extraction
• Stream statistics (mean, variance)
• Pre-Trained Twitter sentiment classifier
Trident-ML
http://www.michael-noll.com
http://www.bigdata-cookbook.com/post/72320512609/storm-metrics-how-to
http://svendvanderveken.wordpress.com/
edvorkin/Storm_Demo_Spring2GX
Go ahead. Ask away.