Learning Stream Processing with Apache Storm

description

Over the last couple of years, Apache Storm has become a de facto standard for developing real-time analytics and complex event processing applications. Storm lets you tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data, so companies can have "Fast Data" alongside "Big Data". Use cases for Storm include fraud detection, operational intelligence, machine learning, ETL, and analytics. In this meetup, Eugene Dvorkin, Architect @WebMD and NYC Storm User Group organizer, will teach Apache Storm and stream processing fundamentals. While this meeting is geared toward new Storm users, experienced users may find something interesting as well. The following topics will be covered:
• Why use Apache Storm?
• Common use cases
• Storm architecture: components, concepts, topology
• Building a simple Storm topology with Java and Groovy
• Trident and micro-batch processing
• Fault tolerance and guaranteed message delivery
• Running and monitoring Storm in production
• Kafka
• Storm at WebMD
• Resources

Transcript of Learning Stream Processing with Apache Storm

Page 1: Learning Stream Processing with Apache Storm
Page 2: Learning Stream Processing with Apache Storm

Page 3: Learning Stream Processing with Apache Storm

CONTACT ME @edvorkin

Page 4: Learning Stream Processing with Apache Storm

Page 5: Learning Stream Processing with Apache Storm

Page 6: Learning Stream Processing with Apache Storm

real-time medical news from curated Twitter feed

Page 7: Learning Stream Processing with Apache Storm
Page 8: Learning Stream Processing with Apache Storm

Every second, on average, around 6,000 tweets are tweeted on Twitter, which corresponds to over 350,000 tweets sent per minute, 500 million tweets per day

350,000 tweets per minute

1% = 3,500 tweets per minute

Page 9: Learning Stream Processing with Apache Storm

• How to scale

• How to deal with failures

• What to do with failed messages

• A lot of infrastructure concerns

• Complexity

• Tedious coding

*Image credit: Nathan Marz (nathanmarz), SlideShare: Storm

Page 10: Learning Stream Processing with Apache Storm

Inherently BATCH-Oriented System

Page 11: Learning Stream Processing with Apache Storm

• Exponential rise in real-time data

• New business opportunity

• Economics of OSS and commodity hardware

Stream processing has emerged as a key use case*

*Source: Discover HDP2.1: Apache Storm for Stream Data Processing. Hortonworks. 2014

Page 12: Learning Stream Processing with Apache Storm

• Detecting fraud while someone is swiping a credit card

• Placing an ad on a website while someone is reading a specific article

• Alerting on application and machine failures

• Using stream processing in a batch-oriented fashion

Page 13: Learning Stream Processing with Apache Storm

Page 14: Learning Stream Processing with Apache Storm

Page 15: Learning Stream Processing with Apache Storm

Created by Nathan Marz

Acquired by Twitter

Open sourced

Apache Incubator Project

Part of Hortonworks HDP2 platform

Top Level Apache Project

Page 16: Learning Stream Processing with Apache Storm
Page 17: Learning Stream Processing with Apache Storm

Most mature, widely adopted framework

Source: http://storm.incubator.apache.org/

Page 18: Learning Stream Processing with Apache Storm

Process an endless stream of data.

1M+ messages/sec on a 10-15 node cluster

Page 19: Learning Stream Processing with Apache Storm

Guaranteed message processing

Page 20: Learning Stream Processing with Apache Storm

Tuples, Streams, Spouts, Bolts and Topologies

Page 21: Learning Stream Processing with Apache Storm

TUPLE

Storm data type: an immutable list of key/value pairs of any data type

word: “Hello”, count: 25, frequency: 0.25
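To make the idea concrete, here is a minimal plain-Java model of such a tuple: an ordered, read-only list of values that can be looked up by field name or by position. It does not use the Storm library; the class name `NamedTuple` is hypothetical, and the accessor names only loosely mirror what a Storm tuple offers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Plain-Java model of a tuple: an ordered, read-only list of named values. */
final class NamedTuple {
    private final List<String> fields;
    private final List<Object> values;

    NamedTuple(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("one value per field is required");
        }
        // Defensive copies keep the tuple unchanged if the caller later mutates its lists.
        this.fields = Collections.unmodifiableList(new ArrayList<>(fields));
        this.values = Collections.unmodifiableList(new ArrayList<>(values));
    }

    /** Looks a value up by field name. */
    Object getValueByField(String field) {
        int index = fields.indexOf(field);
        if (index < 0) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return values.get(index);
    }

    /** Looks a value up by position. */
    Object getValue(int index) {
        return values.get(index);
    }

    int size() {
        return values.size();
    }

    public static void main(String[] args) {
        // The example tuple from the slide: word, count, frequency.
        NamedTuple tuple = new NamedTuple(
                Arrays.asList("word", "count", "frequency"),
                Arrays.<Object>asList("Hello", 25, 0.25));
        System.out.println(tuple.getValueByField("word") + " " + tuple.getValueByField("count"));
    }
}
```

Values may be of any type, which is why the sketch stores them as `Object` and leaves casting to the reader of the tuple.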

Page 22: Learning Stream Processing with Apache Storm

STREAM

Unbounded sequence of tuples between nodes

Page 23: Learning Stream Processing with Apache Storm

SPOUT

The Source of the Stream

Page 24: Learning Stream Processing with Apache Storm

Read from stream of data – queues, web logs, API calls, databases

Spout responsibilities

Page 25: Learning Stream Processing with Apache Storm

BOLT

Page 26: Learning Stream Processing with Apache Storm

• Process tuples and perform actions: calculations, API calls, DB calls

• Produce new output stream based on computations

Bolt

F(x)

Page 27: Learning Stream Processing with Apache Storm

• A topology is a network of spouts and bolts

• Defines data flow

Page 28: Learning Stream Processing with Apache Storm

• May have multiple spouts

Page 29: Learning Stream Processing with Apache Storm

• Each spout and bolt may have many instances that perform all the processing in parallel

Page 30: Learning Stream Processing with Apache Storm

How tuples are sent between instances of spouts and bolts

Shuffle grouping: random distribution.

Fields grouping: routes tuples to a bolt task based on the value of a field. The same values always route to the same task.

All grouping: replicates the tuple stream across all the bolt tasks. Each task receives a copy of the tuple.

Global grouping: routes all tuples in the stream to a single task. Should be used with caution.
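The four routing strategies above can be modelled in a few lines of plain Java. This is a sketch of the idea only, not Storm's implementation: the class `GroupingSketch` is hypothetical, random distribution is modelled as a uniform random pick, and the single-task strategy simply uses task index 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Plain-Java model of how each routing strategy picks target task indexes. */
final class GroupingSketch {
    private final int numTasks;
    private final Random random;

    GroupingSketch(int numTasks, long seed) {
        this.numTasks = numTasks;
        this.random = new Random(seed);
    }

    /** Random distribution: any task may receive the tuple. */
    int shuffle() {
        return random.nextInt(numTasks);
    }

    /** Routing by field value: equal values always hash to the same task. */
    int fields(Object fieldValue) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    /** Replication: every task receives a copy of the tuple. */
    List<Integer> all() {
        List<Integer> targets = new ArrayList<>();
        for (int task = 0; task < numTasks; task++) {
            targets.add(task);
        }
        return targets;
    }

    /** Single task: the whole stream lands on one task (index 0 in this sketch). */
    int global() {
        return 0;
    }

    public static void main(String[] args) {
        GroupingSketch grouping = new GroupingSketch(4, 42L);
        System.out.println("shuffle -> task " + grouping.shuffle());
        System.out.println("fields(\"storm\") -> task " + grouping.fields("storm"));
        System.out.println("all -> tasks " + grouping.all());
        System.out.println("global -> task " + grouping.global());
    }
}
```

Routing by field value is what makes a per-word counter correct when it runs as several parallel tasks: every occurrence of a given word reaches the same task, so no two tasks hold partial counts for it. Routing everything to one task removes parallelism for that step, which is why it should be used with caution.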

Page 31: Learning Stream Processing with Apache Storm

Page 32: Learning Stream Processing with Apache Storm

Gradle:

compile 'org.apache.storm:storm-core:0.9.2-incubating'

Maven:

<dependency>
  <groupId>org.apache.storm</groupId>
  <artifactId>storm-core</artifactId>
  <version>0.9.2-incubating</version>
</dependency>

Page 33: Learning Stream Processing with Apache Storm

Word count data flow (diagram):

Sentence: "Two households, both alike in dignity"

Split into words: Two, Households, Both, Alike, In, Dignity

Count each word: Two 1, Households 1, Both 1, Alike 1, In 1, Dignity 1

Final count: Two 20, Households 24, Both 22, Alike 1, In 1, Dignity 10

Page 34: Learning Stream Processing with Apache Storm

Data Source

Page 35: Learning Stream Processing with Apache Storm
Page 36: Learning Stream Processing with Apache Storm

SplitSentenceBolt

Resource initialization

Page 37: Learning Stream Processing with Apache Storm

WordCountBolt

Page 38: Learning Stream Processing with Apache Storm

PrinterBolt

Page 39: Learning Stream Processing with Apache Storm

Linking it all together
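The code from these slides is not preserved in this transcript. As a stand-in, here is a minimal plain-Java sketch of the same flow (sentence source, split step, count step, print step) chained together in a single process. It deliberately avoids the Storm API, so the class and method names are hypothetical; lower-casing the words is an assumption made so that "Both" and "both" are counted together.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Plain-Java model of the split -> count -> print word count flow. */
final class WordCountSketch {

    /** Split step: one sentence in, one lower-cased word out per token. */
    static List<String> split(String sentence) {
        List<String> words = new ArrayList<>();
        for (String token : sentence.split("[^A-Za-z]+")) {
            if (!token.isEmpty()) {
                words.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    /** Count step: keeps a running total per word across all sentences seen. */
    static Map<String, Integer> count(List<String> sentences) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : sentences) {
            for (String word : split(sentence)) {
                Integer current = counts.get(word);
                counts.put(word, current == null ? 1 : current + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Data source: a fixed list stands in for an endless stream of sentences.
        List<String> sentences = new ArrayList<>();
        sentences.add("Two households, both alike in dignity");

        // Print step: report the final counts.
        for (Map.Entry<String, Integer> entry : count(sentences).entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

In a real topology each step would be a separate component with its own parallelism, and the link between the split and count steps would route by word so that each word is always counted by the same task.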

Page 40: Learning Stream Processing with Apache Storm

How to scale stream processing

Page 41: Learning Stream Processing with Apache Storm

Storm main components

Nodes: machines in a Storm cluster.

Workers: JVM processes running on a node. One or more per node.

Executors: Java threads running within a worker JVM process.

Tasks: instances of spouts and bolts.

Page 42: Learning Stream Processing with Apache Storm

Page 43: Learning Stream Processing with Apache Storm

Page 44: Learning Stream Processing with Apache Storm

How tuples are sent between instances of spouts and bolts

Page 45: Learning Stream Processing with Apache Storm
Page 46: Learning Stream Processing with Apache Storm

Page 47: Learning Stream Processing with Apache Storm

Tuple tree

Reliable vs unreliable topologies

Page 48: Learning Stream Processing with Apache Storm

Methods from ISpout interface

Page 49: Learning Stream Processing with Apache Storm

Reliability in Bolts

Anchoring Ack Fail
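Storm tracks a tuple tree cheaply by XOR-ing tuple ids: an id is XOR-ed in when a tuple is emitted and again when it is acked, so the running value returns to zero once every tuple in the tree has been acked. The following plain-Java sketch models that bookkeeping for a single tree. The class `AckTrackerSketch` is hypothetical, the ids are fixed small numbers rather than random 64-bit values, and timeouts are not modelled.

```java
/** Plain-Java model of XOR-based tracking for one tuple tree. */
final class AckTrackerSketch {
    private long ackVal = 0L;
    private boolean hasFailed = false;

    /** A tuple joined the tree: the root from the spout, or a child anchored to a parent. */
    void emitted(long tupleId) {
        ackVal ^= tupleId;
    }

    /** A bolt acked the tuple: XOR-ing the same id a second time cancels it out. */
    void acked(long tupleId) {
        ackVal ^= tupleId;
    }

    /** A bolt failed a tuple: the whole tree counts as failed and should be replayed. */
    void fail() {
        hasFailed = true;
    }

    /** True once every emitted tuple has been acked and nothing failed. */
    boolean isComplete() {
        return !hasFailed && ackVal == 0L;
    }

    public static void main(String[] args) {
        AckTrackerSketch tree = new AckTrackerSketch();
        tree.emitted(11L);      // root tuple from the spout
        tree.emitted(22L);      // child anchored to the root
        tree.emitted(33L);      // second child anchored to the root
        tree.acked(11L);
        tree.acked(22L);
        System.out.println("complete after two acks: " + tree.isComplete());
        tree.acked(33L);
        System.out.println("complete after all acks: " + tree.isComplete());
    }
}
```

Anchoring is what adds a child to the tree: an unanchored emit is never tracked, so a failure downstream of it would not trigger a replay from the spout.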

Page 50: Learning Stream Processing with Apache Storm

Unit testing Storm components

Page 51: Learning Stream Processing with Apache Storm
Page 52: Learning Stream Processing with Apache Storm

BDD style of testing

Page 53: Learning Stream Processing with Apache Storm

Extending OutputCollector

Page 54: Learning Stream Processing with Apache Storm

Extending OutputCollector

Page 55: Learning Stream Processing with Apache Storm

Page 56: Learning Stream Processing with Apache Storm

Physical View

Page 57: Learning Stream Processing with Apache Storm

Deploying a topology to a cluster

storm jar wordcount-1.0.jar com.demo.storm.WordCountTopology word-count-topology

Page 58: Learning Stream Processing with Apache Storm

Monitoring and performance tuning

Page 59: Learning Stream Processing with Apache Storm

Page 60: Learning Stream Processing with Apache Storm
Page 61: Learning Stream Processing with Apache Storm

Run under supervision: Monit, supervisord

Page 62: Learning Stream Processing with Apache Storm

Nimbus moves work to another node

Page 63: Learning Stream Processing with Apache Storm
Page 64: Learning Stream Processing with Apache Storm

The supervisor will restart the worker

Page 65: Learning Stream Processing with Apache Storm

Micro-Batch Stream Processing

Page 66: Learning Stream Processing with Apache Storm

Functions, Filters, aggregations, joins, grouping

Ordered batches of tuples. Batches can be partitioned.

Similar to Pig or Cascading

Transactional spouts

Trident has first-class abstractions for reading from and writing to stateful sources

Page 67: Learning Stream Processing with Apache Storm

Stream processed in small batches

• Each batch has a unique ID, which stays the same on each replay

• If one tuple fails, the whole batch is reprocessed

• Higher throughput than core Storm, but higher latency as well

Page 68: Learning Stream Processing with Apache Storm

How does Trident provide exactly-once semantics?

Page 69: Learning Stream Processing with Apache Storm

Store the count along with the batch ID:

COUNT 100, BATCHID 1

10 more tuples arrive with batchId 2:

COUNT 110, BATCHID 2

Failure: batch 2 is replayed with the same batchId (2), so the stored count is not updated a second time.

• The spout should replay a batch exactly as it was played before

• The Trident API hides the complexity of dealing with batch IDs
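The batch ID check described above can be sketched in plain Java as follows. This is an illustration of the idea, not Trident code: the class `BatchCountStore` is hypothetical, and it assumes that batches are applied in order, that a replayed batch carries the same tuples as the original, and that the count and batch ID are written together atomically.

```java
/** Plain-Java model of a count stored together with the last applied batch ID. */
final class BatchCountStore {
    private long count = 0L;
    private long lastBatchId = -1L;   // -1 means no batch has been applied yet

    /**
     * Applies a batch's partial count unless that batch was already applied.
     * Returns true if the stored count changed.
     */
    boolean apply(long batchId, long partialCount) {
        if (batchId == lastBatchId) {
            return false;   // replay of the batch already stored: skip the update
        }
        count += partialCount;
        lastBatchId = batchId;
        return true;
    }

    long getCount() {
        return count;
    }

    public static void main(String[] args) {
        BatchCountStore store = new BatchCountStore();
        store.apply(1L, 100L);   // COUNT 100, BATCHID 1
        store.apply(2L, 10L);    // COUNT 110, BATCHID 2
        store.apply(2L, 10L);    // batch 2 replayed after a failure: ignored
        System.out.println("count = " + store.getCount());
    }
}
```

Without the stored batch ID, the replay would add the same 10 tuples twice; with it, reprocessing is harmless, which is how at-least-once delivery of batches yields exactly-once updates to the state.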

Page 70: Learning Stream Processing with Apache Storm

Word count with Trident

Page 71: Learning Stream Processing with Apache Storm

Word count with Trident

Page 72: Learning Stream Processing with Apache Storm

Word count with Trident

Page 73: Learning Stream Processing with Apache Storm
Page 74: Learning Stream Processing with Apache Storm

Style of computation

Page 75: Learning Stream Processing with Apache Storm

By styles of computation

Page 76: Learning Stream Processing with Apache Storm

Page 77: Learning Stream Processing with Apache Storm
Page 78: Learning Stream Processing with Apache Storm
Page 79: Learning Stream Processing with Apache Storm

Enhancing Twitter feed with lead Image and Title

• Readability enhancements

• Image scaling

• Remove duplicates

• Custom business logic

Page 80: Learning Stream Processing with Apache Storm
Page 81: Learning Stream Processing with Apache Storm

Writing a Twitter spout

Page 82: Learning Stream Processing with Apache Storm

Status

Page 83: Learning Stream Processing with Apache Storm

Use the Twitter4J Java library

Page 84: Learning Stream Processing with Apache Storm

Use an existing spout from the Storm contrib project on GitHub

Spouts exist for: Twitter, Kafka, JMS, RabbitMQ, Amazon SQS, Kinesis, MongoDB, ...

Page 85: Learning Stream Processing with Apache Storm

• Storm takes care of scalability and fault tolerance

• What happens if there is a burst in traffic?

Page 86: Learning Stream Processing with Apache Storm

Introducing Queuing Layer with Kafka

Page 87: Learning Stream Processing with Apache Storm

Page 88: Learning Stream Processing with Apache Storm
Page 89: Learning Stream Processing with Apache Storm

Solr Indexing

Page 90: Learning Stream Processing with Apache Storm

Processing Groovy rules (DSL) at scale in real time

Page 91: Learning Stream Processing with Apache Storm

Page 92: Learning Stream Processing with Apache Storm
Page 93: Learning Stream Processing with Apache Storm

Statsd and Storm Metrics API

http://www.michael-noll.com/blog/2013/11/06/sending-metrics-from-storm-to-graphite/

Page 94: Learning Stream Processing with Apache Storm

• Use cache if you can: for example Google Guava caching utilities

• In memory DB

• Tick tuples (for batch updates)
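The tick tuple tip can be illustrated with a plain-Java model: regular tuples are only buffered, and the periodic tick triggers one bulk write. In Storm the tick interval is set through the `topology.tick.tuple.freq.secs` configuration; everything else here, including the class `TickBatcherSketch` and its method names, is hypothetical, and the "database" is just an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of buffering updates and flushing them on a periodic tick. */
final class TickBatcherSketch {
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> flushedBatches = new ArrayList<>();

    /** Called for every regular tuple: buffer it, do not touch the database. */
    void onTuple(String record) {
        buffer.add(record);
    }

    /** Called for every tick: write everything buffered so far as one batch. */
    void onTick() {
        if (buffer.isEmpty()) {
            return;   // nothing to write since the last tick
        }
        flushedBatches.add(new ArrayList<>(buffer));   // stand-in for one bulk DB write
        buffer.clear();
    }

    int pending() {
        return buffer.size();
    }

    List<List<String>> batches() {
        return flushedBatches;
    }

    public static void main(String[] args) {
        TickBatcherSketch batcher = new TickBatcherSketch();
        batcher.onTuple("a");
        batcher.onTuple("b");
        batcher.onTick();   // one write carrying two records
        System.out.println("batches written: " + batcher.batches().size());
    }
}
```

The trade-off is latency for throughput: records wait up to one tick interval before being written, and anything still buffered is lost if the process dies, unless acking is deferred until the flush succeeds.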

Page 95: Learning Stream Processing with Apache Storm

• Linear classification (Perceptron, Passive-Aggressive, Winnow, AROW)

• Linear regression (Perceptron, Passive-Aggressive)

• Clustering (KMeans)

• Feature scaling (standardization, normalization)

• Text feature extraction

• Stream statistics (mean, variance)

• Pre-Trained Twitter sentiment classifier

Trident-ML

Page 96: Learning Stream Processing with Apache Storm
Page 97: Learning Stream Processing with Apache Storm

http://www.michael-noll.com

http://www.bigdata-cookbook.com/post/72320512609/storm-metrics-how-to

http://svendvanderveken.wordpress.com/

Page 98: Learning Stream Processing with Apache Storm

edvorkin/Storm_Demo_Spring2GX

Page 99: Learning Stream Processing with Apache Storm
Page 100: Learning Stream Processing with Apache Storm

Go ahead. Ask away.