Insight Data Engineering: Open source data ingestion

Open Source Data Collection/Ingestion Treasure Data, Inc. www.treasuredata.com

Transcript of Insight Data Engineering: Open source data ingestion

Page 1: Insight Data Engineering: Open source data ingestion

Open Source Data Collection/Ingestion

Treasure Data, Inc. www.treasuredata.com

Page 2: Insight Data Engineering: Open source data ingestion

Hello!

- “Committer” of Fluentd

- Treasure Data, Inc.

- Former Algorithmic Trader

- Stanford Math and CS

Page 3: Insight Data Engineering: Open source data ingestion

Table of Contents

1. Why you should care

2. Data Collection v. Data Ingestion

3. Examples: Data Collection Tools

4. Examples: Data Ingestion Tools

5. Case Study: Async App Logging

Links to be added after the talk.

Page 4: Insight Data Engineering: Open source data ingestion

Data Collection/Ingestion is HARD

Page 5: Insight Data Engineering: Open source data ingestion

(Big) Data Pipeline

Diagram: Data Sources -> (Data Collection and Ingestion) -> Raw Data Storage -> (Data Pre-processing) -> Processed Data -> (Data Fetching) -> Analysis Environment

Diagram label: Data Engineers

Page 6: Insight Data Engineering: Open source data ingestion

If Data Collection Goes Awry...

Diagram: Data Sources -> (Data Collection and Ingestion) -> Raw Data Storage -> (Data Pre-processing) -> Processed Data -> (Data Fetching) -> Analysis Environment

Diagram label: Data Engineers

Page 7: Insight Data Engineering: Open source data ingestion

Collection v. Ingestion

Page 8: Insight Data Engineering: Open source data ingestion

Data Collection

- Happens where data originates

- “logging code”

- Batch v. Streaming

- Pull v. Push

log.error("FUUUUU....WHY!?")

cln.send({"uid": 1, "action": "died"})

200 GET a.com/?utm=big%20data
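The first two snippets above are an unstructured error log and a structured event pushed to a collector. A minimal Python sketch of both, using only the standard library: the helper `make_event` and the logger name `"app"` are hypothetical, and the in-memory stream stands in for whatever collector client (the slide's `cln`) would actually receive the data.

```python
import io
import json
import logging

def make_event(uid, action):
    """Serialize a structured event as one JSON line (hypothetical helper)."""
    return json.dumps({"uid": uid, "action": action}, sort_keys=True)

# Capture log output in memory so the sketch is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

app_log = logging.getLogger("app")
app_log.addHandler(handler)
app_log.setLevel(logging.INFO)
app_log.propagate = False

app_log.error("FUUUUU....WHY!?")        # unstructured error log
app_log.info(make_event(1, "died"))     # structured event, one JSON object per line

log_lines = stream.getvalue().splitlines()
```

Structured events of the second kind are usually easier to parse downstream than free-form messages of the first kind.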

Page 9: Insight Data Engineering: Open source data ingestion

Data Ingestion

- Receives data

- Sometimes coupled with storage

- Routing data

Diagram label: Data Ingestion Layer

Page 10: Insight Data Engineering: Open source data ingestion

ex. Data Collection Tools

Page 11: Insight Data Engineering: Open source data ingestion

rsyslog

- The grandfather of data collectors

- Streaming

- Installed by default, widely understood

- Not as easy to extend/configure

Page 12: Insight Data Engineering: Open source data ingestion

rsyslog

https://github.com/rsyslog/rsyslog/blob/master/ChangeLog

Page 13: Insight Data Engineering: Open source data ingestion

Scribe

- Written originally at Facebook

- Streaming

- Fast (C++)

- Nightmare to build, largely abandoned

Page 14: Insight Data Engineering: Open source data ingestion

Flume-ng

- Written and maintained by Cloudera (successor to Flume)

- Commercial support by Cloudera. Track record for Hadoop

- Java can be heavy-handed for some orgs/cases

Page 15: Insight Data Engineering: Open source data ingestion

Logstash

- Pluggable architecture, rich ecosystem

- The “L” of the ELK stack by Elastic

- JRuby

- HA uses Redis as a queue

http://apuntesdetrabajo.es/?p=263

Page 16: Insight Data Engineering: Open source data ingestion

Heka

- Developed at Mozilla

- Written in Go, extensible w/ Lua

- Plugin system, but compilation needed (Go’s limitation, may change)

Page 17: Insight Data Engineering: Open source data ingestion

Fluentd

- Plugin architecture

- Built-in HA

- CRuby (JRuby on the roadmap)

- google-fluentd, td-agent

- Lightweight multi-source, multi-destination log routing

Page 18: Insight Data Engineering: Open source data ingestion

Embulk

- Plugin architecture

- Focuses on Batch workloads

- Java/JRuby

- Very new! (looking for contributors!)

Page 19: Insight Data Engineering: Open source data ingestion

ex. Data Ingestion Tools

Page 20: Insight Data Engineering: Open source data ingestion

RabbitMQ

- Written in Erlang, supported by Pivotal

- Implements AMQP

Page 21: Insight Data Engineering: Open source data ingestion

Kafka

- Begun at LinkedIn, now Confluent

- Topic-based Message Broker: Producer/Broker/Consumer

- Distributed design

- Provides at-least-once or at-most-once delivery, depending on how consumers are written
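The difference between the two delivery guarantees comes down to whether a consumer records its position before or after processing a message. The sketch below is an in-memory simulation only, not Kafka client code: `consume`, `crash_at`, and `commit_first` are hypothetical names, and a plain list stands in for a topic.

```python
def consume(messages, crash_at, commit_first):
    """Simulate one consumer crash and restart over an in-memory log.

    commit_first=True  -> offset committed before processing (at most once)
    commit_first=False -> offset committed after processing (at least once)
    """
    processed = []
    committed = 0      # offset the consumer resumes from after a restart
    crashed = False
    offset = 0
    while offset < len(messages):
        if commit_first:
            committed = offset + 1
            if offset == crash_at and not crashed:
                crashed = True
                offset = committed      # restart: this message is skipped
                continue
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])
            if offset == crash_at and not crashed:
                crashed = True
                offset = committed      # restart: this message is replayed
                continue
            committed = offset + 1
        offset += 1
    return processed

at_most_once = consume(["a", "b", "c"], crash_at=1, commit_first=True)
at_least_once = consume(["a", "b", "c"], crash_at=1, commit_first=False)
```

With a crash while handling `"b"`, committing first drops that message, while committing afterwards processes it twice.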

Page 22: Insight Data Engineering: Open source data ingestion

Fluentd!?

- Used (abused?) as a bus/MQ

- tag-based event routing

- Can be combined with

RabbitMQ/Kafka, etc.
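Tag-based routing means each event carries a tag and the router picks a destination by matching that tag against patterns. A simplified Python sketch of the idea follows. It is not Fluentd's implementation: the `ROUTES` table, the destination names, and the `route` function are hypothetical, and the shell-style wildcards from `fnmatch` are only an approximation of Fluentd's own match syntax, which differs in detail.

```python
from fnmatch import fnmatchcase

# (pattern, destination) pairs, checked in order; the first match wins.
ROUTES = [
    ("app.error.*", "alerting"),
    ("app.*", "storage"),
    ("*", "dev_null"),
]

def route(tag, routes=ROUTES):
    """Return the destination for a tag, or None if nothing matches."""
    for pattern, destination in routes:
        if fnmatchcase(tag, pattern):
            return destination
    return None
```

For example, `route("app.error.db")` selects `"alerting"`, while `route("app.access")` falls through to `"storage"`. Because order matters, more specific patterns go first.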

Page 23: Insight Data Engineering: Open source data ingestion

case study: Async App Logging

Page 24: Insight Data Engineering: Open source data ingestion

Application Logging

- Common ask: “How’s our new feature doing?”

Diagram: GET /foobar -> API Server -> 200 {...}

Page 25: Insight Data Engineering: Open source data ingestion

Application Logging

- What NOT to do: synchronous logging

Diagram: GET /foobar -> API Server -> 200 {...}

Diagram: API Server -> write -> Data Backend -> ack -> API Server

Page 26: Insight Data Engineering: Open source data ingestion

Application Logging

- What to do instead: asynchronous logging through a local data collector

Diagram: GET /foobar -> API Server -> 200 {...}

Diagram: API Server -> write -> Local Data Collector (Buffer) -> flush -> Data Backend -> ack
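The buffered flow in the diagram can be sketched in Python with a queue and a background thread. This is an illustration under stated assumptions, not any particular collector's design: `BufferedLogger`, `flush_fn`, and `batch_size` are hypothetical names, and a plain list stands in for the data backend.

```python
import queue
import threading

class BufferedLogger:
    """Hypothetical local collector: buffers events, flushes them in the background."""

    def __init__(self, flush_fn, batch_size=100):
        self._queue = queue.Queue()
        self._flush_fn = flush_fn
        self._batch_size = batch_size
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, event):
        # Returns immediately; the request path never waits on the backend.
        self._queue.put(event)

    def close(self):
        self._queue.put(None)   # sentinel: flush what is left, then stop
        self._worker.join()

    def _run(self):
        batch = []
        while True:
            event = self._queue.get()
            if event is None:
                break
            batch.append(event)
            if len(batch) >= self._batch_size:
                self._flush_fn(batch)
                batch = []
        if batch:
            self._flush_fn(batch)

backend = []   # stands in for the data backend
collector = BufferedLogger(flush_fn=backend.extend, batch_size=2)
for uid in range(5):
    collector.log({"uid": uid, "action": "viewed"})
collector.close()
```

The API server only pays for an in-process `put`, and the backend sees batched writes. The trade-off is that an in-memory buffer like this one is lost if the process dies before a flush.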

Page 27: Insight Data Engineering: Open source data ingestion

But wait...

- Is writing to a local log collector safe?

- What if the log collector retries by error?

- A lot of problems to think about!

Page 28: Insight Data Engineering: Open source data ingestion

“Much of the blame, little of the glory”

(Just kidding. The entire data team relies on YOU!)

Page 29: Insight Data Engineering: Open source data ingestion

Thank you!

(...and we are hiring!)

www.treasuredata.com/careers