Event Stream Processing with Kafka and Samza

  • Date posted: 12-Jul-2015

Transcript of Event Stream Processing with Kafka and Samza

  • Event Stream Processing with Kafka and Samza

    Zach Cox - @zcox - zcox522@gmail.com
    Iowa Code Camp - 1 Nov 2014

  • Why?

    Businesses generate and process events
    Unified event log promotes data integration
    Process event streams to take actions quickly

  • References

    Kafka
        Kafka Documentation
        The Log: What every software engineer should know about real-time data's unifying abstraction
        Benchmarking Apache Kafka

    Samza
        Samza Documentation
        Questioning the Lambda Architecture
        Moving faster with data streams: The rise of Samza at LinkedIn
        Why local state is a fundamental primitive in stream processing
        Real time insights into LinkedIn's performance using Apache Samza

  • Why?

    Businesses generate and process events
    Unified event log promotes data integration
    Process event streams to take actions quickly

  • Event

    Something happened
    Record that fact so we can process it

  • Event

    Describes what happened
        Who did it?
        What did they do?
        What was the result?

    Provides context
        When did it happen?
        Where did it happen?
        How did they do it?
        Why did they do it?

  • Event Example: Pageview

    User viewed web page

    User
        ID: a2be9031-9465-4ecb-9302-9b962fa854ac
        IP: 65.121.142.238
        User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36

    Web Page
        URL: https://www.mycompany.com/page.html

    Context
        Time: 2014-10-14T10:49:24.438-05:00
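The Pageview example above can be modelled as a small typed record. This is only a sketch: the names `EventSketch`, `User`, `Context` and `Pageview` are hypothetical and do not come from the talk, and the User-Agent value is shortened for readability.

```scala
import java.time.OffsetDateTime
import java.util.UUID

object EventSketch {
  // Who did it (hypothetical field names).
  final case class User(id: UUID, ip: String, userAgent: String)

  // Context: when it happened.
  final case class Context(time: OffsetDateTime)

  // What happened: a user viewed a web page.
  final case class Pageview(user: User, url: String, context: Context)

  // The values from the Pageview slide, with the User-Agent shortened.
  val example: Pageview = Pageview(
    User(
      UUID.fromString("a2be9031-9465-4ecb-9302-9b962fa854ac"),
      "65.121.142.238",
      "Mozilla/5.0"),
    "https://www.mycompany.com/page.html",
    Context(OffsetDateTime.parse("2014-10-14T10:49:24.438-05:00")))
}
```

Keeping the "who", the "what" and the context in separate fields means a consumer can use one part (say, the timestamp) without parsing the rest.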

  • Event Example: Clickthrough

    User clicked link

    User
        ID: a2be9031-9465-4ecb-9302-9b962fa854ac
        IP: 65.121.142.238
        User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36

    Link
        URL: https://www.mycompany.com/product.html
        Referer: https://www.othersite.com/foo.html

    Context
        Time: 2014-10-14T10:49:24.438-05:00

  • Event Example: User Update

    User changed first name

    User
        ID: 161fa4bf-6ae9-4f4e-b72e-01c40e7783e5
        First name: Zach

    Context
        Time: 2014-10-14T10:59:56.481-05:00
        IP: 65.121.142.238

  • Event Example: User Update

    User uploaded a new profile image

    User
        ID: 161fa4bf-6ae9-4f4e-b72e-01c40e7783e5

    Profile Image
        URL: http://profile-images.s3.amazonaws.com/katy-perry.jpg

    Context
        Time: 2014-10-14T10:59:56.481-05:00
        IP: 65.121.142.238
        Using: webcam

  • Event Example: Tweet

    User posted a tweet

    User
        ID:
        Username: @zcox
        Name: Zach Cox
        Bio: Developer @BannoHQ | @iascala organizer | co-founded @Pongr

    Tweet
        ID: 527152511568719872
        URL: https://twitter.com/zcox/status/527152511568719872
        URL: http://iowacodecamp.com/session/list#66
        Text: Going to talk about processing event streams using @apachekafka and @samzastream this Saturday @iowacodecamp
        Mentions: @apachekafka, @samzastream, @iowacodecamp
        URLs: http://iowacodecamp.com/session/list#66

    Context
        Time: 2014-10-14T10:59:56.481-05:00
        Using: Twitter for Android
        Location: 41.7146365,-93.5914038

  • Event Example: HTTP Request Latency

    Some measured code took some time to execute

    Code
        production.my-app.some-server.http.get-user-profile

    Time to execute
        Min: 20 msec
        Max: 950 msec
        Average: 190 msec
        Median: 110 msec
        50%: 100 msec
        75%: 120 msec
        95%: 150 msec
        99%: 500 msec

    Context
        Time: 2014-10-14T11:17:01.597-05:00

  • Event Example: Runtime Exception

    Some code threw a runtime exception

    Some code
        Stack trace: [...]

    Exception
        Message: HBase read timed out

    Context
        Time: 2014-10-14T11:21:23.749-05:00
        Application: my-app
        Machine: some-server.my-company.com

  • Event Example: Application Logging

    Some code logged some information

    [INFO] [2014-10-14 11:25:44,750] [sentry-akka.actor.default-dispatcher-2] a.e.s.Slf4jEventHandler: Slf4jEventHandler started

    Message: Slf4jEventHandler started
    Level: INFO
    Time: 2014-10-14 11:25:44,750
    Thread: sentry-akka.actor.default-dispatcher-2
    Logger: akka.event.slf4j.Slf4jEventHandler

  • Why?

    Businesses generate and process events
    Unified event log promotes data integration
    Process event streams to take actions quickly

  • Unified Log

    Events need to be sent somewhere
    Events should be accessible to any program
    Log provides a place for events to be sent and accessed
    Kafka is a great log service

  • Data Integration

  • Log

    Sequence of records
    Append-only
    Ordered by time
    Each record assigned unique sequential number
    Records stored persistently on disk
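The log properties listed above can be sketched in a few lines. This is an in-memory stand-in, not how Kafka is implemented: it skips the on-disk persistence entirely, and the names `LogSketch`, `append` and `readFrom` are made up for illustration.

```scala
import scala.collection.mutable.ArrayBuffer

object LogSketch {
  // Records are only ever appended, never updated or removed,
  // and each one gets the next sequential number (its offset).
  final class Log[A] {
    private val records = ArrayBuffer.empty[A]

    // Appends a record and returns the offset assigned to it.
    def append(record: A): Long = {
      records += record
      records.size - 1L
    }

    // Returns every record from the given offset onwards, in append order.
    // Each reader tracks its own offset, so readers do not affect each other.
    def readFrom(offset: Long): Vector[A] =
      records.drop(offset.toInt).toVector
  }
}
```

For example, after appending "a" and then "b", `readFrom(0L)` yields both records and `readFrom(1L)` yields only "b".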

  • Log Service

  • Logs in Distributed Databases

  • Traditional Cache

    Cache misses
    Cache invalidation

  • Infrastructure as Distributed Database

    Cache is now replicated from DB

  • Infrastructure as Distributed Database

    Cache can be in-process with web app
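One way to picture a cache that is replicated from the database, assuming the database publishes its changes as a log of puts and deletes: the web app replays that log into an in-process map, so there are no cache misses to fill and no manual invalidation. The `Change`, `Put` and `Delete` types below are hypothetical, not part of Kafka or any database.

```scala
object ReplicatedCacheSketch {
  // One entry in the (assumed) database changelog.
  sealed trait Change
  final case class Put(key: String, value: String) extends Change
  final case class Delete(key: String) extends Change

  // Applies a single changelog entry to the in-process cache.
  def applyChange(cache: Map[String, String], change: Change): Map[String, String] =
    change match {
      case Put(key, value) => cache + (key -> value)
      case Delete(key)     => cache - key
    }

  // Rebuilds the whole cache by replaying the changelog from the start.
  def replay(changes: Seq[Change]): Map[String, String] =
    changes.foldLeft(Map.empty[String, String])(applyChange)
}
```

Because the cache is a pure function of the changelog, a restarted web app can rebuild it by reading the log again from the first offset.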

  • Log for Event Streams

    Simple to send events to
    Broadcasts events to all consumers
    Buffers events on disk: producers and consumers decoupled
    Consumers can start reading at any offset

  • Kafka

    Apache OSS, mainly from LinkedIn
    Handles all the logs/event streams
    High-throughput: millions events/sec
    High-volume: TBs - PBs of events
    Low-latency: single-digit msec from producer to consumer
    Scalable: topics are partitioned across cluster
    Durable: topics are replicated across cluster
    Available: auto failover

  • Twitter Example

    Receive messages via long-lived HTTP connection as JSON
    Write messages to a Kafka topic

    Twitter Streaming API

  • Twitter Example

    Twitter rate-limits clients

  • Why?

    Businesses generate and process events
    Unified event log promotes data integration
    Process event streams to take actions quickly

  • Event Stream Processing

    Turn events into valuable, actionable information
    Process events as they happen, not later (batch)
    Do all of this reliably, at scale

  • Event Stream Processor

  • Event Stream Processor: Input

  • Event Stream Processor: Output

  • Samza

    Event stream processing framework
    Apache OSS, mainly from LinkedIn
    Simple Java API
    Scalable: runs jobs in parallel across cluster
    Reliable: fault-tolerance and durability built-in
    Tools for stateful stream processing

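  • Samza Job

    1) Class that extends StreamTask:

    import org.apache.samza.system.IncomingMessageEnvelope
    import org.apache.samza.task.{MessageCollector, StreamTask, TaskCoordinator}

    class MyTask extends StreamTask {
      // Called once for every message that arrives on the job's input streams
      override def process(
          envelope: IncomingMessageEnvelope,
          collector: MessageCollector,
          coordinator: TaskCoordinator): Unit = {
        // process message in envelope
      }
    }

    2) my-task.properties config file:

    job.factory.class=org.apache.samza.job.local.ThreadJobFactory
    job.name=my-task
    task.class=com.banno.MyTask
    ...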

  • Stateless Processing

    One event at a time
    Take action using only that event

    SELECT * FROM raw_messages WHERE message_type = 'status';

  • Samza Job: Separate Message Types

    Many message types from Twitter
    Samza job to separate into type-specific streams
    Other jobs process specific message types
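The routing decision in such a job could look roughly like the sketch below. It is plain Scala rather than a Samza task, and the `RawMessage` type, the "delete" type and every stream name except `statuses` are assumptions made for illustration. The point is that the choice of output stream depends only on the message in hand, so no state is needed.

```scala
object SeparateTypesSketch {
  // Hypothetical shape of a message read from the raw stream.
  final case class RawMessage(messageType: String, body: String)

  // Stateless: the output stream is decided from this one message alone.
  def outputStream(message: RawMessage): String =
    message.messageType match {
      case "status" => "statuses"
      case "delete" => "deletes"
      case _        => "unknown-messages"
    }

  // Splits a batch of raw messages into type-specific streams.
  def separate(messages: Seq[RawMessage]): Map[String, Seq[RawMessage]] =
    messages.groupBy(outputStream)
}
```

In a real Samza job, the body of `process` would make the same decision and send each message to the chosen output stream through the collector.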

  • Stateful Stream Processing

    One event at a time
    Take action using that event and state
    State = data built up from past events
        Aggregation
        Grouping
        Joins

  • Aggregation

    State = aggregated values (e.g. count)
    Incorporate each new event into that aggregation
    Output aggregated values as events to new stream
    What happens if job stops?
        Crash, deploy, ...
        Can't lose state!
        Samza handles this all for you

    SELECT COUNT(*) FROM statuses;

  • Samza Job: Total Status Count

    Increment a counter on every status (tweet)
    Periodically output current count
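The two steps of this job can be sketched without Samza as follows. The method names `process` and `window` are borrowed from Samza's StreamTask and WindowableTask interfaces, but the class itself is a hypothetical stand-in. Note the simplification: the counter here lives in memory only, so it would be lost on a crash or deploy, which is exactly the problem Samza's durable state is meant to solve.

```scala
object TotalCountSketch {
  final class TotalStatusCount {
    // State: the aggregated value built up from all past events.
    private var count: Long = 0L

    // Called for every status (tweet): fold the new event into the count.
    def process(status: String): Unit = {
      count += 1
    }

    // Called periodically: returns the current count so it can be
    // emitted as an event on an output stream.
    def window(): Long = count
  }
}
```

Calling `process` three times and then `window()` gives a count of 3, the streaming equivalent of `SELECT COUNT(*) FROM statuses`.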

  • Grouping

    State = some data per group
    Two Samza jobs:
        Output statuses by user (map)
        Count statuses per user (reduce)
    Output: (user, count)
    Could use as input to job that sorts by count (most active users)

    SELECT user_id, COUNT(user_id) FROM statuses GROUP BY user_id;

    SELECT user_id, COUNT(user_id) FROM statuses GROUP BY user_id ORDER BY COUNT(user_id) DESC LIMIT 5;
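The two queries above translate into a per-user counter plus a sort. This is a single-process sketch with hypothetical names (`Status`, `countByUser`, `mostActive`). It ignores the map step of re-keying the stream by user, which in a partitioned job is what guarantees that all statuses for one user reach the same counter.

```scala
object GroupingSketch {
  final case class Status(userId: String, text: String)

  // Reduce step: state is one count per user, updated on every status.
  def countByUser(counts: Map[String, Long], status: Status): Map[String, Long] =
    counts.updated(status.userId, counts.getOrElse(status.userId, 0L) + 1L)

  // Most active users: sort the (user, count) pairs by count, descending.
  def mostActive(counts: Map[String, Long], n: Int): Seq[(String, Long)] =
    counts.toSeq.sortBy { case (_, count) => -count }.take(n)
}
```

Folding a stream of statuses through `countByUser` and then calling `mostActive(counts, 5)` mirrors the `ORDER BY ... DESC LIMIT 5` query.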

  • Joins

    Samza job has multiple input streams
    Stream-Stream join: ad impressions + ad clicks
    Stream-Table join: page views + user zip code
    Table-Table join: user data + user settings
    Joins involving tables need DB changelog

    SELECT u.username, s.text FROM statuses s JOIN users u ON u.id = s.user_id;
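A stream-table join in the spirit of the query above might be sketched like this, assuming the job reads two inputs: a changelog of user records and the stream of statuses. The class and method names are hypothetical and the table is a plain in-memory map, whereas a real job would keep it in durable local state.

```scala
object JoinSketch {
  final case class User(id: String, username: String)
  final case class Status(userId: String, text: String)

  final class StatusUserJoin {
    // Local copy of the users table, built from the user changelog.
    private var users = Map.empty[String, User]

    // Input 1: a user record changed, so update the local table.
    def processUser(user: User): Unit = {
      users = users + (user.id -> user)
    }

    // Input 2: a status arrived. Look up its user and emit
    // (username, text), or nothing if the user is not known yet.
    def processStatus(status: Status): Option[(String, String)] =
      users.get(status.userId).map(user => (user.username, status.text))
  }
}
```

Like the SQL inner join, a status with no matching user produces no output. Whether to drop or buffer such statuses is a design choice the sketch does not address.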

  • What else can we compute?

    Tweets per sec/min/hour (recent, not for-all-time)
    Enrich tweets with weather at current location
    Most active users, locations, etc
    Emojis: % of tweets that contain, top emojis
    Hashtags: % of tweets that contain, top #hashtags
    URLs: % of tweets that contain, top domains
    Photo URLs: % of tweets that contain, top domains
    Text analysis: sentiment, spam
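As one example from this list, tweets per minute can be computed by bucketing each tweet's timestamp into a one-minute window and discarding old buckets so that only recent counts are kept. The sketch below assumes timestamps in epoch milliseconds. The names `add` and `prune` are hypothetical.

```scala
object TweetsPerMinuteSketch {
  private val MinuteMillis = 60000L

  // Adds one tweet, identified by its timestamp, to the count of the
  // one-minute bucket it falls into.
  def add(counts: Map[Long, Long], timestampMillis: Long): Map[Long, Long] = {
    val minute = timestampMillis / MinuteMillis
    counts.updated(minute, counts.getOrElse(minute, 0L) + 1L)
  }

  // Keeps only the newest `keep` one-minute buckets, so the state covers
  // recent activity instead of growing for all time.
  def prune(counts: Map[Long, Long], keep: Int): Map[Long, Long] =
    if (counts.isEmpty) counts
    else {
      val newest = counts.keys.max
      counts.filter { case (minute, _) => minute > newest - keep }
    }
}
```

The same shape works for per-second or per-hour counts by changing the bucket size.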

  • Reprocessing

    http://samza.incubator.apache.org/learn/documentation/0.7.0/jobs/reprocessing.html

  • Other Stream Processing Frameworks

    Storm
    Spark Streaming
    Hadoop Streaming
    Akka
    Riemann
    Esper

  • Druid

    Send it events
        Druid reads from Kafka topic
        That Kafka topic is a Samza output stream
    Super fast time-series queries: aggregations, filters, top-n, etc

    http://druid.io

  • Why?

    Businesses generate and process events
    Unified event log promotes data integration
    Process event streams to take actions quickly


  • Let's chat!

    Zach Cox
    @zcox
    zcox522@gmail.com
    Banno is hiring!