Real-time streams and logs with Storm and Kafka


Description

Some of the biggest issues at the center of analyzing large amounts of data are query flexibility, latency, and fault tolerance. Modern technologies that build upon the success of "big data" platforms, such as Apache Hadoop, have made it possible to spread the load of data analysis across commodity machines, but these analyses can still take hours to run and do not respond well to rapidly-changing data sets. A new generation of data processing platforms -- which we call "stream architectures" -- convert data sources into streams of data that can be processed and analyzed in real-time. This has led to the development of various distributed real-time computation frameworks (e.g. Apache Storm) and multi-consumer data integration technologies (e.g. Apache Kafka). Together, they offer a way to do predictable computation on real-time data streams. In this talk, we will give an overview of these technologies and how they fit into the Python ecosystem. As part of this presentation, we also released streamparse, a new Python library that makes it easy to run and debug Storm topologies.

Links:

* http://parse.ly/code
* https://github.com/Parsely/streamparse
* https://github.com/getsamsa/samsa

Transcript of Real-time streams and logs with Storm and Kafka

Page 1: Real-time Streams & Logs

Andrew Montalenti, CTO
Keith Bourgoin, Backend Lead

Page 2: Agenda

Parse.ly problem space

Aggregating the stream (Storm)

Organizing around logs (Kafka)

Page 3: Admin

Our presentations and code:

http://parse.ly/code

This presentation's slides:

http://parse.ly/slides/logs

This presentation's notes:

http://parse.ly/slides/logs/notes

Page 4: What is Parse.ly?

Page 5: What is Parse.ly?

Web content analytics for digital storytellers.

Page 6: Velocity

Average post has <48-hour shelf life.

Page 7: Volume

Top publishers write thousands of posts per day.

Page 8: Time series data

Page 9: Summary data

Page 10: Ranked data

Page 11: Benchmark data

Page 12: Information radiators

Page 13: Architecture evolution

Page 14: Queues and workers

Queues: RabbitMQ => Redis => ZeroMQ

Workers: Cron Jobs => Celery

Page 15: Workers and databases

Page 16: Lots of moving parts

Page 17: In short: it started to get messy

Page 18: Introducing Storm

Storm is a distributed real-time computation system.

Hadoop provides a set of general primitives for doing batch processing.

Storm provides a set of general primitives for doing real-time computation.

Perfect as a replacement for ad-hoc workers-and-queues systems.

Page 19: Storm features

Speed

Fault tolerance

Parallelism

Guaranteed Messages

Easy Code Management

Local Dev

Page 20: Storm primitives

Streaming Data Set, typically from Kafka.

ZeroMQ used for inter-process communication.

Bolts & Spouts; Storm's Topology is a DAG.

Nimbus & Workers manage execution.

Tuneable parallelism + built-in fault tolerance.

Page 21: Wired Topology

Page 22: Tuple Tree

Tuple tree, anchoring, and retries.
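Storm tracks every tuple a bolt emits as a child of the spout tuple it was "anchored" to, forming a tree; the spout tuple is acked only once the whole tree is acked, and a failure or timeout anywhere in the tree causes a replay from the spout. A toy sketch of that bookkeeping in plain Python (illustrative only; Storm's real acker tracks trees in constant space with an XOR trick):

# Toy model of Storm's tuple tree: a spout tuple completes only after
# every tuple anchored to it, directly or transitively, has been acked.
class TupleTree:
    def __init__(self, root_id):
        self.pending = {root_id}       # ids emitted but not yet acked

    def anchor(self, parent_id, child_id):
        # A bolt emitted child_id anchored to parent_id.
        if parent_id not in self.pending:
            raise ValueError('cannot anchor to an unknown or acked tuple')
        self.pending.add(child_id)

    def ack(self, tuple_id):
        # Returns True once the entire tree has completed.
        self.pending.discard(tuple_id)
        return not self.pending

tree = TupleTree('spout-1')
tree.anchor('spout-1', 'count-a')      # count-bolt emits an anchored tuple
tree.ack('count-a')                    # -> False: the root is still pending
tree.ack('spout-1')                    # -> True: the tree is complete
# If a tree does not complete within the topology timeout, Storm fails the
# spout tuple and the spout re-emits ("replays") it.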

Page 23: Word Stream Spout (Storm)

;; spout configuration
{"word-spout" (shell-spout-spec
   ;; Python spout implementation:
   ;; - fetches words (e.g. from Kafka)
   ["python" "words.py"]
   ;; - emits (word,) tuples
   ["word"]
   )}

Page 24: Word Stream Spout in Python

import itertools

from streamparse import storm

class WordSpout(storm.Spout):

    def initialize(self, conf, ctx):
        self.words = itertools.cycle(['dog', 'cat',
                                      'zebra', 'elephant'])

    def next_tuple(self):
        word = next(self.words)
        storm.emit([word])

WordSpout().run()

Page 25: Word Count Bolt (Storm)

;; bolt configuration
{"count-bolt" (shell-bolt-spec
   ;; bolt input: spout and field grouping on "word"
   {"word-spout" ["word"]}
   ;; Python bolt implementation:
   ;; - maintains a Counter of words
   ;; - increments as new words arrive
   ["python" "wordcount.py"]
   ;; emits latest word count for most recent word
   ["word" "count"]
   ;; parallelism = 2
   :p 2
   )}

Page 26: Word Count Bolt in Python

from collections import Counter

from streamparse import storm

class WordCounter(storm.Bolt):

    def initialize(self, conf, ctx):
        self.counts = Counter()

    def process(self, tup):
        word = tup.values[0]
        self.counts[word] += 1
        storm.emit([word, self.counts[word]])
        storm.log('%s: %d' % (word, self.counts[word]))

WordCounter().run()
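Why this works with :p 2: the field grouping on "word" in the bolt spec routes every tuple for a given word to the same WordCounter instance, so each instance's local Counter holds the complete count for the words it owns.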

Page 27: streamparse

sparse provides a CLI front-end to streamparse, a framework for creating Python projects for running, debugging, and submitting Storm topologies for data processing. (Still in development.)

After installing lein (the only dependency), you can run:

pip install streamparse

This installs a command-line tool, sparse. Use:

sparse quickstart

Page 28: Running and debugging

You can then run the local Storm topology using:

$ sparse run
Running wordcount topology...
Options: {:spec "topologies/wordcount.clj", ...}
#<StormTopology StormTopology(spouts:{word-spout=...
storm.daemon.nimbus - Starting Nimbus with conf {...
storm.daemon.supervisor - Starting supervisor with id 4960ac74...
storm.daemon.nimbus - Received topology submission with conf {...
... lots of output as topology runs ...

Interested? Lightning talk!

Page 29: Organizing around logs

Page 30: Not all logs are application logs

A "log" could be any stream of structured data:

Web logs

Raw data waiting to be processed

Partially processed data

Database operations (e.g. mongo's oplog)

A series of timestamped facts about a given system.
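For example, a single timestamped fact in a web-analytics stream might look like the following record (illustrative only, not Parse.ly's actual schema):

{"ts": "2014-04-12T19:32:01Z", "action": "pageview",
 "url": "http://example.com/post/1", "visitor_id": "abc123"}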

Page 31: LinkedIn's lattice problem

Page 32: Enter the unified log

Page 33: Log-centric is simpler

Page 34: Parse.ly is log-centric, too

Page 35: Introducing Apache Kafka

Log-centric messaging system developed at LinkedIn.

Designed for throughput; efficient resource use.

Persists to disk; recent data is served from memory.

Little to no overhead for new consumers.

Scalable to tens of thousands of messages per second.

As of 0.8, full replication of topic data.

Page 36: Kafka concepts

Concept         Description
Cluster         An arrangement of Brokers & ZooKeeper nodes
Broker          An individual node in the Cluster
Topic           A group of related messages (a stream)
Partition       Part of a topic, used for replication
Producer        Publishes messages to a stream
Consumer Group  A group of related processes reading a topic
Offset          Point in a topic that the consumer has read to
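The code examples later in this deck are consumers; for symmetry, here is a minimal producer sketch against the same 0.8-era kafka-python library shown on page 42 (treat the exact class names as assumptions about that version's API):

from kafka.client import KafkaClient
from kafka.producer import SimpleProducer

# Publish two messages onto the 'raw_data' topic.
kafka = KafkaClient('localhost:9092')
producer = SimpleProducer(kafka)
producer.send_messages('raw_data', b'first event', b'second event')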

Page 37: What's the catch?

Replication isn't perfect. Network partitions can cause problems.

No out-of-order acknowledgement:

The "offset" is a marker of where the consumer is in the log; nothing more.

On a restart, you know where to start reading, but not whether individual messages before the stored offset were fully processed.

In practice, not as much of a problem as it sounds.
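One reason it's manageable: if downstream processing is idempotent, replaying the messages after the last committed offset is harmless. A minimal sketch of that pattern (keying writes by a stable message id; the names are illustrative):

# At-least-once delivery made safe: writes are keyed by message id, so
# reprocessing a replayed message overwrites instead of double-counting.
store = {}

def process(msg_id, value):
    store[msg_id] = value

process('evt-001', {'url': '/post/1', 'pageviews': 10})
process('evt-001', {'url': '/post/1', 'pageviews': 10})  # replay: no change
assert len(store) == 1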

Page 38: Kafka is a "distributed log"

Topics are logs, not queues.

Consumers read from their own offsets into the log.

Logs are maintained for a configurable period of time.

Messages can be "replayed".

Consumers can share identical logs easily.

Page 39: Multi-consumer

Even if Kafka's availability and scalability story isn't interesting to you, the multi-consumer story should be.

Page 40: Queue problems, revisited

Traditional queues (e.g. RabbitMQ / Redis):

not distributed / highly available at core

not persistent ("overflows" easily)

more consumers mean more queue server load

Kafka solves all of these problems.

Page 41: Kafka + Storm

Good fit for at-least-once processing.

No need for out-of-order acks.

Community work is ongoing for at-most-once processing.

Able to keep up with Storm's high-throughput processing.

Great for handling backpressure during traffic spikes.

Page 42: Kafka in Python (1): kafka-python (0.8+)

https://github.com/mumrah/kafka-python

import time

from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

kafka = KafkaClient('localhost:9092')
consumer = SimpleConsumer(kafka, 'test_consumer', 'raw_data')

count = 0
start = time.time()
for msg in consumer:
    count += 1
    if count % 1000 == 0:
        dur = time.time() - start
        # 1000 messages took `dur` seconds => 1000/dur messages/sec
        print('Reading at {:.2f} messages/sec'.format(1000 / dur))
        start = time.time()

Page 43: Kafka in Python (2): samsa (0.7.x)

https://github.com/getsamsa/samsa

import time

from kazoo.client import KazooClient
from samsa.cluster import Cluster

zk = KazooClient()
zk.start()
cluster = Cluster(zk)
queue = cluster.topics['raw_data'].subscribe('test_consumer')

count = 0
start = time.time()
for msg in queue:
    count += 1
    if count % 1000 == 0:
        dur = time.time() - start
        print('Reading at {:.2f} messages/sec'.format(1000 / dur))
        queue.commit_offsets()  # commit to ZooKeeper every 1k msgs
        start = time.time()

Page 44: Other Log-Centric Companies

Company     Logs    Workers
LinkedIn    Kafka*  Samza
Twitter     Kafka   Storm*
Pinterest   Kafka   Storm
Spotify     Kafka   Storm
Wikipedia   Kafka   Storm
Outbrain    Kafka   Storm
LivePerson  Kafka   Storm
Netflix     Kafka   ???

(* = where the project originated)

Page 45: Conclusion

Page 46: What we've learned

There is no silver bullet data processing technology.

Log storage is very cheap, and getting cheaper.

"Timestamped facts" is rawest form of data available.

Storm and Kafka allow you to develop atop those facts.

Organizing around real-time logs is a wise decision.

Page 47: Questions?

Go forth and stream!

Parse.ly:

http://parse.ly/code

http://twitter.com/parsely

Andrew & Keith:

http://twitter.com/amontalenti

http://twitter.com/kbourgoin
