Big Data Everywhere Chicago: 1.5 Million Log Lines Per Second: Building and Maintaining Flume Flows at Conversant


Description

Mike Keane, Senior Software Engineer, Conversant. When faced with a rapid increase in the volume of log data, the regular addition of new log files, the evolution of content in log files, and business demands for quicker insight into production log data, it was time for Conversant to move away from hourly batch log collection and processing and into the event-driven world of Flume. In this talk, we'll discuss how Conversant migrated to Flume, and how we manage agents on nearly 1000 servers across 4 data centers, processing over 50 billion log lines per day with peak hourly averages of over 1.5 million log lines per second. We'll also discuss scalability, file channel performance and durability, dealing with duplicates, monitoring through TSDB, benefits to the business, and lessons learned in a large-scale implementation.

Transcript

Page 1: 1.5 Million Log Lines per Second

Big Data Everywhere Chicago 2014

Mike Keane [email protected]

Building and maintaining Flume flows at Conversant

Page 2: SLA for Event-Driven Logging with Flume

• Quicker insight into production data
• Reduce complexity of administering/managing new servers, data centers, etc.
• Scalable
• No data loss or duplication
• Replace TSV files with Avro objects
• Able to be monitored by the Network Operations Center (NOC)
• Able to recover from downtime quickly

Page 3: Simplistic Flume Overview

• A Flume Flow is a series of Flume agents that data follows from origination to final destination
• Data on a Flume Flow is packaged in FlumeEvent Avro objects
• A FlumeEvent is composed of:
  • Headers – a map of string key/value pairs
  • Body – a byte array
• A FlumeEvent is an atomic unit of data
• FlumeEvents are sent in batches
• When a batch of FlumeEvents only partially makes it to the next Flume agent in the flow, the entire batch is resent, resulting in duplicates
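
A minimal Java sketch of the event structure just described, using Flume's EventBuilder API (the header names and TSV content are illustrative):

    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.UUID;

    import org.apache.flume.Event;
    import org.apache.flume.event.EventBuilder;

    public class FlumeEventExample {
        public static void main(String[] args) {
            // Headers: a map of string key/value pairs (names illustrative)
            Map<String, String> headers = new HashMap<String, String>();
            headers.put("LogType", "impression");
            headers.put("uuid", UUID.randomUUID().toString());

            // Body: an opaque byte array; here, one TSV log line
            byte[] body = "2014-06-01T12:00:00\tsite123\tuser456"
                    .getBytes(StandardCharsets.UTF_8);

            // The FlumeEvent is the atomic unit that moves through the flow
            Event event = EventBuilder.withBody(body, headers);
            System.out.println(event.getHeaders());
        }
    }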

Page 4: Simplistic Flume Overview

[Diagram: a single Flume agent]

Page 5: Simplistic Flume Overview

[Diagram: flow from EmbeddedAgent to Compressor Agent to Landing Agent]

Page 6: Overview of existing network topology

• 3 data centers divided into 12 lanes participating in the OpenRTB market
  • 6 lanes in the east coast data center
  • 4 lanes in the west coast data center
  • 2 lanes in the European data center
• Each lane has approximately 75 servers handling OpenRTB operations
• 30 different logs
• Over 60,000,000,000 log lines per day
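
For scale: 60 billion lines per day averages roughly 700,000 lines per second (60e9 / 86,400 ≈ 694K), so the 1.5 million lines-per-second peak cited in the title is a bit more than twice the daily average.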

Page 7: Overview of existing network topology

[Diagram: network topology across the three data centers]

Page 8: P.O.C. – Can Flume handle our log volume reliably?

• 2-server Flume flow from the East Coast (IAD) to Chicago (ORD) with over 250K TSV lines per second
• No data loss
• Failover
• Compression performance

Page 9: P.O.C. Overview

[Diagram: proof-of-concept flow from IAD to ORD]

Page 10: P.O.C. passes

• Larger batch sizes helped, but could not reach 250K lines per second
• Multiple TSV lines per FlumeEvent hit over 360K lines per second
• Failover passed, with duplicates
• Compression passed, but needed to be parallelized across 7 sinks
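
A sketch of the batching trick that pushed throughput past 360K: packing several TSV lines into one FlumeEvent so that per-event overhead is amortized over many log lines. The grouping and header map are illustrative:

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import java.util.Map;

    import org.apache.flume.Event;
    import org.apache.flume.event.EventBuilder;

    public class TsvPacker {
        // Join a group of TSV lines into a single event body so each
        // event (and each acknowledgment) covers many log lines.
        public static Event pack(List<String> tsvLines, Map<String, String> headers) {
            StringBuilder body = new StringBuilder();
            for (String line : tsvLines) {
                body.append(line).append('\n');
            }
            return EventBuilder.withBody(
                    body.toString().getBytes(StandardCharsets.UTF_8), headers);
        }
    }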

Page 11: Taking Flume to Production

• Embedding the EmbeddedAgent in existing servers
• Modify EmbeddedAgent
  • Properties from existing infrastructure
  • Implement monitoring
• Create "Flume" implementation of the proprietary logging interface
• Replace POJO-to-TSV with Avro-to-AvroDataFile
• Preventing duplicates, not removing them
• Add LogType header
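
A minimal sketch of embedding an agent in an application, following the EmbeddedAgent pattern from the Flume developer guide. Hostnames, ports, and the agent name are illustrative; Conversant's property loading and monitoring hooks are not shown in the slides:

    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.flume.agent.embedded.EmbeddedAgent;
    import org.apache.flume.event.EventBuilder;

    public class EmbeddedAgentExample {
        public static void main(String[] args) throws Exception {
            Map<String, String> properties = new HashMap<String, String>();
            // File channel for durability (checkpoint/data dirs default under
            // the user's home directory; set channel.checkpointDir to override)
            properties.put("channel.type", "file");
            // Fail over between two downstream compressor agents
            properties.put("sinks", "sink1 sink2");
            properties.put("sink.sink1.type", "avro");
            properties.put("sink.sink1.hostname", "compressor01.example.com");
            properties.put("sink.sink1.port", "5564");
            properties.put("sink.sink2.type", "avro");
            properties.put("sink.sink2.hostname", "compressor02.example.com");
            properties.put("sink.sink2.port", "5564");
            properties.put("processor.type", "failover");
            properties.put("processor.priority.sink1", "10");
            properties.put("processor.priority.sink2", "5");

            EmbeddedAgent agent = new EmbeddedAgent("conversantAgent");
            agent.configure(properties);
            agent.start();

            // Application code hands events to the agent instead of writing TSV
            agent.put(EventBuilder.withBody(
                    "example log line".getBytes(StandardCharsets.UTF_8)));

            agent.stop();
        }
    }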

Page 12: Taking Flume to Production

• Custom sink for AvroDataFile body (based on HDFSEventSink)
  • Check if the UUID header is in HBase
    • Yes – increment the duplicate-count metric
    • No:
      • Write the AvroDataFile body to HDFS using a custom writer
      • Put the UUID to HBase
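
A rough sketch of that dedup check using the HBase client API. The table layout, column names, and helper are invented for illustration; note that the check-then-put is not atomic, so rare races still fall back to at-least-once semantics:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DedupCheck {
        // Returns true if the event was new and written, false if duplicate.
        public static boolean writeIfNew(Table dedupTable, String uuid,
                                         byte[] avroDataFileBody) throws IOException {
            byte[] row = uuid.getBytes(StandardCharsets.UTF_8);

            // Check if the UUID header is already in HBase
            if (dedupTable.exists(new Get(row))) {
                // Yes: caller increments its duplicate-count metric
                return false;
            }

            // No: write the AvroDataFile body to HDFS via the custom writer...
            writeToHdfs(avroDataFileBody);

            // ...then record the UUID so future copies are recognized
            Put put = new Put(row);
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("seen"), Bytes.toBytes(1L));
            dedupTable.put(put);
            return true;
        }

        private static void writeToHdfs(byte[] body) {
            // Placeholder for the custom writer based on HDFSEventSink
        }
    }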

Page 13: Taking Flume to Production

• Custom selector based on MultiplexingChannelSelector
  • Route FlumeEvents to channels by log type or groups of log types
  • Bifurcate each log to multiple locations, with each location getting its own percentage of the data
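
The slides don't show the selector's code, but a hedged sketch of the idea, routing by a LogType header with a per-log-type bifurcation percentage, might look like this (the header name, fields, and configuration parsing are assumptions):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    import org.apache.flume.Channel;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.channel.AbstractChannelSelector;

    public class BifurcatingSelector extends AbstractChannelSelector {
        private Map<String, List<Channel>> primaryByLogType =
                new HashMap<String, List<Channel>>();    // main route per log type
        private Map<String, List<Channel>> bifurcatedByLogType =
                new HashMap<String, List<Channel>>();    // extra copies
        private Map<String, Double> bifurcatePercent =
                new HashMap<String, Double>();           // fraction to copy
        private final Random random = new Random();

        @Override
        public void configure(Context context) {
            // Parse mapping.<logType>, bifurcate.<logType>, and
            // bifurcate.<logType>.percent properties here (omitted)
        }

        @Override
        public List<Channel> getRequiredChannels(Event event) {
            String logType = event.getHeaders().get("LogType");
            List<Channel> primary = primaryByLogType.get(logType);
            List<Channel> channels = new ArrayList<Channel>(
                    primary != null ? primary : getAllChannels());
            // Copy a configured percentage of this log type to its extra channels
            Double pct = bifurcatePercent.get(logType);
            List<Channel> extra = bifurcatedByLogType.get(logType);
            if (pct != null && extra != null && random.nextDouble() < pct) {
                channels.addAll(extra);
            }
            return channels;
        }

        @Override
        public List<Channel> getOptionalChannels(Event event) {
            return Collections.emptyList();
        }
    }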

Page 14: Configuring Flume Flows

• Configuring Flume can be tedious; use a templating engine
• In Q2 2014 Conversant expanded from 7 lanes in 2 data centers to 12 lanes in 3 data centers (~400 more servers to configure)
• Static headers are useful for tracking flows
• 15 minutes to configure the entire Q2 expansion

Template input:

    CompressorLane('iad6', [CompressorAgent("dtiad06flm01p"),
                            CompressorAgent("dtiad06flm02p"),
                            CompressorAgent("dtiad06flm03p")])

Generated property:

    compressor.list = dtiad06flm01p, dtiad06flm02p, dtiad06flm03p
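
For illustration, the generated per-agent configuration might expand to a standard Avro sink group plus a static interceptor for the flow-tracking header. This is a sketch using stock Flume properties; ports and component names are invented:

    # Load-balance across the lane's three compressor agents
    agent.sinks = k1 k2 k3
    agent.sinks.k1.type = avro
    agent.sinks.k1.hostname = dtiad06flm01p
    agent.sinks.k1.port = 5564
    agent.sinks.k2.type = avro
    agent.sinks.k2.hostname = dtiad06flm02p
    agent.sinks.k2.port = 5564
    agent.sinks.k3.type = avro
    agent.sinks.k3.hostname = dtiad06flm03p
    agent.sinks.k3.port = 5564
    agent.sinkgroups = g1
    agent.sinkgroups.g1.sinks = k1 k2 k3
    agent.sinkgroups.g1.processor.type = load_balance

    # Static header stamped on every event for tracking the flow
    agent.sources.r1.interceptors = i1
    agent.sources.r1.interceptors.i1.type = static
    agent.sources.r1.interceptors.i1.key = lane
    agent.sources.r1.interceptors.i1.value = iad6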

Page 15: Monitoring the Flume Flows

• Flume metrics are available via JMX or JSON over HTTP
• Metrics to monitor:
  • ChannelFillPercentage
  • Rate of change of EventDrainSuccessCount on failover sinks
• FLUME-2307 – file channel deletes fail after timeout (fixed in 1.5)
• Publishing metrics to TSDB provides great visual insight
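
A minimal polling sketch for the JSON path, assuming Flume was started with HTTP monitoring enabled (-Dflume.monitoring.type=http -Dflume.monitoring.port=41414) and that OpenTSDB's /api/put endpoint is reachable. The regex extraction is a stand-in for a real JSON parser, and all hostnames are illustrative:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class FlumeMetricsToTsdb {
        public static void main(String[] args) throws Exception {
            // Flume's HTTP monitoring serves all component metrics as JSON
            URL metrics = new URL("http://flume-host:41414/metrics");
            StringBuilder json = new StringBuilder();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(metrics.openStream(), StandardCharsets.UTF_8));
            for (String line; (line = in.readLine()) != null; ) {
                json.append(line);
            }
            in.close();

            // Naive extraction of one metric; use a JSON library in practice
            Matcher m = Pattern
                    .compile("\"ChannelFillPercentage\"\\s*:\\s*\"?([0-9.]+)")
                    .matcher(json);
            if (m.find()) {
                String datapoint = String.format(
                        "{\"metric\":\"flume.channel.fill_pct\",\"timestamp\":%d,"
                        + "\"value\":%s,\"tags\":{\"host\":\"flume-host\"}}",
                        System.currentTimeMillis() / 1000, m.group(1));

                // Push the datapoint to OpenTSDB over its HTTP API
                HttpURLConnection conn = (HttpURLConnection)
                        new URL("http://tsdb-host:4242/api/put").openConnection();
                conn.setRequestMethod("POST");
                conn.setRequestProperty("Content-Type", "application/json");
                conn.setDoOutput(true);
                OutputStream os = conn.getOutputStream();
                os.write(datapoint.getBytes(StandardCharsets.UTF_8));
                os.close();
                System.out.println("TSDB response: " + conn.getResponseCode());
            }
        }
    }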

Page 16: Monitoring the Flume Flows

[TSDB chart: ChannelFillPercentage]

Page 17: Monitoring the Flume Flows

[TSDB chart: rate of taking events off the "Critical Logs" file channel]

Page 18: Monitoring the Flume Flows

[TSDB chart: rate of Flume events by data center – East Coast, West Coast, Europe]

Page 19: Monitoring the Flume Flows

[TSDB chart: monitoring by groups]

Page 20: Benefits of migrating to Flume

• Business has insight into data in under 10 minutes
• Configuring expansion is trivial
• Failover enables automatic recovery from downtime
• Bifurcation
  • enables scaled, constant regression lane(s)
  • sends a subset of data to the analytics development cluster

Page 21: Benefits of migrating to Flume

[Chart: 5-minute aggregations delivered to the business within 10 minutes]

Page 22: Gotchas…

• Scaling for compression
• Auto-reloading of properties is inconsistent
• "It is recommended (though not required) to use a separate disk for the File Channel checkpoint." In practice: a RAID-6 array with Force Write Back (see the config sketch after this list)
• Bad configurations are not easy to spot, and not always clear in the log file
• NetcatSource – not too useful beyond trivial usage
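
For reference, the file channel's checkpoint and data locations are controlled by standard properties like the following (paths illustrative), which is where the separate-disk recommendation applies:

    agent.channels = c1
    agent.channels.c1.type = file
    # Ideally on its own disk, per the recommendation quoted above
    agent.channels.c1.checkpointDir = /flume/checkpoint/c1
    # Data directories may be comma-separated across multiple disks
    agent.channels.c1.dataDirs = /flume/data1/c1,/flume/data2/c1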

Page 23: Gotchas…

• POM file edits
• JUnit tests are not deterministic
• Hadoop jars are added to the classpath by the startup script, which complicates running in an IDE
• Avoiding the cost of Avro schema evolution

Page 24: What is next

• Upgrade to Flume 1.5
• Bifurcate to micro-batch processing (Storm? Spark?)
• Disable sink switch