Apache Kafka at LinkedIn
Posted: 14-Jul-2015
Category: Engineering
Transcript of Apache Kafka at LinkedIn
Apache Kafka at LinkedIn
Guozhang Wang
BDTC 2014
2013 LinkedIn Corporation. All Rights Reserved. KAFKA Team, Data Infrastructure

About Me

Agenda
Overview of Kafka
Kafka Design
Kafka Usage at LinkedIn
Roadmap
Q & A

Why We Build Kafka?

We Have a Lot of Data
User activity tracking: page views, ad impressions, etc.
Server logs and metrics: syslogs, request rates, etc.
Messaging: emails, news feeds, etc.
Computation derived: results of Hadoop / data warehousing, etc.
As a data-serving website, LinkedIn has a lot of data.

... and We Build Products on Data

Newsfeed
Recommendation
Based on relevance.

Recommendation

Search

Metrics and Monitoring
We have this variety of data, and we need to build all these products around it.

... and a LOT of Monitoring

The Problem:
How to integrate this variety of data and make it available to all products?

Life back in 2010:
Point-to-Point Pipelines
Messaging: ActiveMQ
User activity: in-house log aggregation
Logging: Splunk
Metrics: JMX => Zenoss
Database data: Databus, custom ETL

Example: User Activity Data Flow

What We Want
A centralized data pipeline

Apache Kafka
We tried some off-the-shelf systems such as ActiveMQ, but they did not fly.

What We REALLY Want
A centralized data pipeline that is
Elastically scalable
Durable
High-throughput
Easy to use

Apache Kafka
A distributed pub-sub messaging system
Scale-out from the ground up
Persistent to disks
High-throughput (10s of MB/sec per server)

Life Since Kafka in Production
Apache Kafka
Developed and maintained by 5 Devs + 2 SREs
Now you may be wondering why it works so well. For example, how can it be highly durable, persisting data to disk, while still maintaining high throughput?

Agenda
Overview of Kafka
Kafka Design
Kafka Usage at LinkedIn
Roadmap
Q & A

Key Idea #1:
Data-parallelism leads to scale-out
Produce/consume requests are randomly balanced among brokers.

Distribute Clients across Partitions
Topic = message stream
A topic has partitions; partitions are distributed to brokers.

Key Idea #2:
Disks are fast when used sequentially
Do not be afraid of disks.

Store Messages as a Log
Appends are effectively O(1)
Reads from a known offset are still fast when cached
[Figure: Partition i of Topic A as a log of numbered offsets; the producer writes at the end, Consumer1 reads (offset 7), Consumer2 reads (offset 7)]
File system caching

Key Idea #3:
Batching makes best use of network/IO

Batch Transfer
Batched send and receive
Batched compression
No message caching in JVM
Zero-copy from file to socket (Java NIO)
And finally, after all these tricks, the client interface we expose to users is very simple.

The API (0.8)
Producer: send(topic, message)
Consumer:
Iterable stream = createMessageStreams().get(topic)
for (message : stream) {
  // process the message
}

Agenda
Overview of Kafka
Kafka Design
Kafka Usage at LinkedIn
Pipeline deployment
Schema for data cleanliness
O(1) ETL
Auditing for correctness
Roadmap
Q & A
Now I will switch gears and talk a little bit about Kafka usage at LinkedIn.

Kafka Usage at LinkedIn
Mainly used for tracking user-activity and metrics data
16 - 32 brokers in each cluster (615+ total brokers)
527 billion messages/day
7500+ topics, 270k+ partitions
Byte rates: writes 97 TB/day, reads 430 TB/day
As of October 21st.
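
For a sense of scale, the daily figures above can be converted into average rates. The following is only a back-of-the-envelope sketch in Java: the class and method names are made up for illustration, the "+" lower bounds are taken at face value, terabytes are assumed to be decimal (10^12 bytes), and the average message size assumes that the 527 billion messages are the ones accounted for by the 97 TB of writes.

```java
final class KafkaUsageStats {
    // Figures quoted on the slide.
    static final double MESSAGES_PER_DAY = 527e9;
    static final double WRITE_TB_PER_DAY = 97.0;
    static final double READ_TB_PER_DAY = 430.0;
    static final double TOPICS = 7_500.0;
    static final double PARTITIONS = 270_000.0;

    static final double SECONDS_PER_DAY = 24 * 60 * 60;
    static final double BYTES_PER_TB = 1e12; // assumption: decimal terabytes

    /** Average message rate over a whole day (peaks will be higher). */
    static double messagesPerSecond() {
        return MESSAGES_PER_DAY / SECONDS_PER_DAY;
    }

    /** How many bytes are read for each byte written (consumer fan-out). */
    static double readsPerWrite() {
        return READ_TB_PER_DAY / WRITE_TB_PER_DAY;
    }

    /** Average message size, assuming all counted messages are writes. */
    static double averageMessageBytes() {
        return WRITE_TB_PER_DAY * BYTES_PER_TB / MESSAGES_PER_DAY;
    }

    /** Average number of partitions per topic. */
    static double partitionsPerTopic() {
        return PARTITIONS / TOPICS;
    }

    public static void main(String[] args) {
        System.out.printf("messages/sec:         %.0f%n", messagesPerSecond());
        System.out.printf("bytes read per write: %.2f%n", readsPerWrite());
        System.out.printf("avg message bytes:    %.1f%n", averageMessageBytes());
        System.out.printf("partitions per topic: %.0f%n", partitionsPerTopic());
    }
}
```

Under these assumptions the averages come out to roughly 6.1 million messages per second, about 4.4 bytes read for every byte written, on the order of 184 bytes per message, and 36 partitions per topic.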

[Figure: Kafka Usage at LinkedIn]
[Figure: Kafka Usage at LinkedIn, multi-colo]
[Figure: Kafka Usage at LinkedIn]

Agenda
Overview of Kafka
Kafka Design
Kafka Usage at LinkedIn
Pipeline deployment
Schema for data cleanliness
O(1) ETL
Auditing for correctness
Roadmap
Q & A

Problems
Hundreds of message types
Thousands of fields
What do they all mean?
What happens when they change?

Standardized Schema on Avro
Schema
Message structure contract
Performance gain
Workflow
Check in schema
Auto compatibility check
Code review
Ship it!
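
To make the "auto compatibility check" step concrete, here is a minimal sketch in Java of one rule such a check could enforce. It is a simplification of a single Avro schema-resolution rule (a field the reader expects but the writer did not send must have a default) and ignores types, promotions, unions, and aliases. The class name, method name, and field names are hypothetical; this is not the actual check used at LinkedIn, nor the Avro library API.

```java
import java.util.Map;
import java.util.Set;

final class SchemaCheck {
    /**
     * Returns true if data written with writerFields can be read by a
     * reader whose fields are given as name -> "has a default value".
     * Extra writer fields are fine: the reader simply ignores them.
     */
    static boolean canRead(Map<String, Boolean> readerFieldHasDefault,
                           Set<String> writerFields) {
        for (Map.Entry<String, Boolean> field : readerFieldHasDefault.entrySet()) {
            boolean presentInWriter = writerFields.contains(field.getKey());
            boolean hasDefault = field.getValue();
            if (!presentInWriter && !hasDefault) {
                return false; // reader cannot fill in this field
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> oldWriter = Set.of("memberId", "pageKey");
        // New schema adds "referrer" with a default: old data stays readable.
        System.out.println(canRead(
            Map.of("memberId", false, "pageKey", false, "referrer", true), oldWriter));
        // New schema adds "referrer" without a default: the check rejects it.
        System.out.println(canRead(
            Map.of("memberId", false, "pageKey", false, "referrer", false), oldWriter));
    }
}
```

Running a check of this kind when a schema is checked in, before code review, means an incompatible change is caught before any producer ships it.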

Agenda
Overview of Kafka
Kafka Design
Kafka Usage at LinkedIn
Pipeline deployment
Schema for data cleanliness
O(1) ETL
Auditing for correctness
Roadmap
Q & A

Kafka to Hadoop

Hadoop ETL (Camus)
Map/Reduce job does data load
One job loads all events
~10 minute ETA on average from producer to HDFS
Hive registration done automatically
Schema evolution handled transparently
Open sourced: https://github.com/linkedin/camus

Agenda
Overview of Kafka
Kafka Design
Kafka Usage at LinkedIn
Pipeline deployment
Schema for data cleanliness
O(1) ETL
Auditing for correctness
Roadmap
Q & A

Does it really work?
All published messages must be delivered to all consumers (quickly)

Audit Trail
99.99%

More Features in Kafka 0.8
Intra-cluster replication (0.8.0): high availability, reduced latency
Log compaction (0.8.1): state storage
Operational tools (0.8.2): topic management, automated leader rebalance, etc.
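
The idea behind log compaction as state storage can be sketched as follows: for each key, only the most recent record needs to be retained, and surviving records keep their original offsets. The Java below is a toy in-memory illustration under that assumption. The class and record names are hypothetical, and it leaves out everything a real broker has to deal with (log segments, delete markers, background cleaning).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LogCompactionSketch {
    /** One keyed message at a fixed position in the log. */
    static final class LogRecord {
        final long offset;
        final String key;
        final String value;

        LogRecord(long offset, String key, String value) {
            this.offset = offset;
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Keeps only the latest record for each key. The input is assumed to be
     * in increasing offset order; output order and offsets are preserved.
     */
    static List<LogRecord> compact(List<LogRecord> log) {
        Map<String, Long> lastOffsetForKey = new HashMap<>();
        for (LogRecord record : log) {
            lastOffsetForKey.put(record.key, record.offset); // later wins
        }
        List<LogRecord> compacted = new ArrayList<>();
        for (LogRecord record : log) {
            if (lastOffsetForKey.get(record.key) == record.offset) {
                compacted.add(record);
            }
        }
        return compacted;
    }

    public static void main(String[] args) {
        List<LogRecord> log = List.of(
            new LogRecord(0, "member-1", "title=Engineer"),
            new LogRecord(1, "member-2", "title=Designer"),
            new LogRecord(2, "member-1", "title=Manager"));
        for (LogRecord record : compact(log)) {
            System.out.println(record.offset + " " + record.key + " " + record.value);
        }
    }
}
```

Replaying the compacted log from the beginning is enough to rebuild the latest value per key, which is what makes a compacted topic usable as a state store.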
Check out our page for more: http://kafka.apache.org/
0.8.2: delete topic, automated leader rebalancing, controlled shutdown, offset management, parallel recovery, min.isr and clean leader election

Kafka 0.9
Clients Rewrite
Remove ZK dependency
Even better throughput
Security
More operability, multi-tenancy ready
Transactional Messaging
From at-least-once to exactly-once
Check out our page for more: http://kafka.apache.org/

Kafka Users: Next Maybe You?
Non-Java / Scala clients: C / C++ / .NET, Go, Clojure, Ruby, Node.js, PHP, Python, Erlang, HTTP REST, command line, etc.
https://cwiki.apache.org/confluence/display/KAFKA/Clients
Python - Pure Python implementation with full protocol support. Consumer and Producer implementations included, GZIP an