Observability for Data Pipelines - Conferences

Observability for Data Pipelines
Jiaqi Liu, Software Engineer at Button; Director, @WomenWhoCodeNYC
@jiaqicodes #OSCON

Transcript of Observability for Data Pipelines - Conferences

Page 1: Observability for Data Pipelines - Conferences

Observability for Data Pipelines

Jiaqi Liu, Software Engineer at Button; Director, @WomenWhoCodeNYC

@jiaqicodes #OSCON

Page 2: Observability for Data Pipelines - Conferences

usebutton.com

Page 3: Observability for Data Pipelines - Conferences
Page 4: Observability for Data Pipelines - Conferences

Agenda

• Data Pipelines

• Challenges with Data Pipelines

• Designing Features:

• Immutable Data

• Dry Run Mode

• Data Lineage

• Testing, Monitoring & Alerting

Page 5: Observability for Data Pipelines - Conferences

Data Pipeline

Page 6: Observability for Data Pipelines - Conferences

ETL Pipeline

Extract data from a source: this could be scraping a site, reading a large file, or consuming a realtime stream of data feeds.

Transform the data: this could be joining the data with additional information to produce an enhanced data set, running it through a machine learning model, or aggregating it in some way.

Load the data into a data warehouse or a user-facing dashboard, wherever the end storage and display for the data might be.
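
The three stages above can be sketched in a few lines of Python. This is only an illustration: the function names (`extract`, `transform`, `load`), the sample records, and the in-memory list standing in for a warehouse are all hypothetical.

```python
def extract():
    """Extract: stand-in for scraping a site, reading a file, or consuming a feed."""
    return [
        {"talk": "Observability for Data Pipelines", "minutes": "40"},
        {"talk": "A Shorter Talk", "minutes": "30"},
    ]


def transform(rows):
    """Transform: cast types and add a derived field to each record."""
    out = []
    for row in rows:
        minutes = int(row["minutes"])
        out.append({"talk": row["talk"], "minutes": minutes, "is_long": minutes > 30})
    return out


def load(rows, warehouse):
    """Load: append rows to the destination and return how many were written."""
    rows = list(rows)
    warehouse.extend(rows)
    return len(rows)


warehouse = []  # hypothetical stand-in for a data warehouse table
written = load(transform(extract()), warehouse)
print(written, warehouse[0]["is_long"])  # prints: 2 True
```

Keeping each stage a separate function makes it possible to test and monitor the stages independently.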

Page 7: Observability for Data Pipelines - Conferences

Data Pipeline Example

Page 8: Observability for Data Pipelines - Conferences
Page 9: Observability for Data Pipelines - Conferences
Page 10: Observability for Data Pipelines - Conferences
Page 11: Observability for Data Pipelines - Conferences
Page 12: Observability for Data Pipelines - Conferences
Page 13: Observability for Data Pipelines - Conferences
Page 14: Observability for Data Pipelines - Conferences

OSCON

Convert HTML to structured data

Write to file

Select Random Talk

Write to File
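
A rough sketch of this example pipeline, using only the Python standard library. The sample HTML, the `talk` CSS class, the file names, and the function names are assumptions made up for illustration; the real conference page will be structured differently.

```python
import json
import random
import tempfile
from html.parser import HTMLParser
from pathlib import Path

# Hypothetical stand-in for a downloaded schedule page.
SAMPLE_HTML = """
<ul>
  <li class="talk">Observability for Data Pipelines</li>
  <li class="talk">Another Talk</li>
  <li class="sponsor">Not a talk</li>
</ul>
"""


class TalkParser(HTMLParser):
    """Collects the text of every <li class="talk"> element."""

    def __init__(self):
        super().__init__()
        self.talks = []
        self._in_talk = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "talk") in attrs:
            self._in_talk = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_talk = False

    def handle_data(self, data):
        if self._in_talk and data.strip():
            self.talks.append({"title": data.strip()})


def html_to_talks(html):
    """Step 1: convert HTML to structured data."""
    parser = TalkParser()
    parser.feed(html)
    return parser.talks


def run_pipeline(html, out_dir, seed=None):
    out_dir = Path(out_dir)
    # Step 2: write the structured data to a file.
    (out_dir / "talks.json").write_text(json.dumps(html_to_talks(html)))
    # Step 3: select a random talk, reading only the previous step's output.
    talks = json.loads((out_dir / "talks.json").read_text())
    choice = random.Random(seed).choice(talks)
    # Step 4: write the selection to a file.
    (out_dir / "selected.json").write_text(json.dumps(choice))
    return choice


with tempfile.TemporaryDirectory() as tmp:
    print(run_pipeline(SAMPLE_HTML, tmp, seed=0))
```

Writing the intermediate result to a file between steps means each step can be rerun or inspected on its own.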

Page 15: Observability for Data Pipelines - Conferences
Page 16: Observability for Data Pipelines - Conferences

Batch: Periodic process that reads data in bulk (typically from a filesystem or a database)

Stream: High-throughput, low-latency system that reads data from a stream or a queue

Page 17: Observability for Data Pipelines - Conferences

Apache Airflow

Luigi

Google Dataflow

Page 18: Observability for Data Pipelines - Conferences

What Could Go Wrong?

Page 19: Observability for Data Pipelines - Conferences

Problems

• Batch job is never scheduled

• Batch job takes too long to run

• Data is malformed or corrupt

• Data is lost

• Stream is backed up; stream data is lost

• Non-deterministic models
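
As one illustration, the first two problems (a batch job that never runs, or runs too long) can be caught with a simple freshness check. This is a sketch under assumptions: the function name, its parameters, and the timestamps below are hypothetical, and a real scheduler would supply the run times.

```python
from datetime import datetime, timedelta, timezone


def check_batch_job(now, last_success, running_since, max_age, max_runtime):
    """Return a list of problems detected for a single batch job."""
    problems = []
    # "Never scheduled": no successful run at all, or none recent enough.
    if last_success is None or now - last_success > max_age:
        problems.append("stale")
    # "Takes too long": a run is in progress and has exceeded its budget.
    if running_since is not None and now - running_since > max_runtime:
        problems.append("slow")
    return problems


# Hypothetical timestamps for a job expected to succeed at least daily.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
print(check_batch_job(
    now,
    last_success=now - timedelta(hours=30),
    running_since=now - timedelta(hours=3),
    max_age=timedelta(hours=24),
    max_runtime=timedelta(hours=1),
))  # prints: ['stale', 'slow']
```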

Page 20: Observability for Data Pipelines - Conferences

Batch Jobs

Page 21: Observability for Data Pipelines - Conferences

Confidential

Stream Jobs

Page 22: Observability for Data Pipelines - Conferences

Data Pipeline Concerns

Data Integrity: Data is exposed, lost, or malformed, or a statistical model is producing highly inaccurate results.

Delayed Processing: Speed in data processing could be core to the business.

Page 23: Observability for Data Pipelines - Conferences

It’s not enough to know that the pipeline is healthy; you also have to know that the data being processed is accurate.

Page 24: Observability for Data Pipelines - Conferences

Build data pipelines that support interpretability and observability

Page 25: Observability for Data Pipelines - Conferences

Interpretability

Understand not just what a model predicted but also why.

Allows for debugging and auditing machine learning models.

Page 26: Observability for Data Pipelines - Conferences

Because Button is a marketplace, we see the side effects of user behavior in our data and have to decipher what assumptions are safe to make.

Page 27: Observability for Data Pipelines - Conferences

Observability

Cindy Sridharan - https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c

Page 28: Observability for Data Pipelines - Conferences

Mystery Free Prod

Page 29: Observability for Data Pipelines - Conferences

Pipeline Features

Build features to support interpretability and observability

Page 30: Observability for Data Pipelines - Conferences

Features to Include

• Immutable Data

• Data Lineage

• Having a Test Run Feature

Page 31: Observability for Data Pipelines - Conferences

Immutable Data

Reproducible Outcomes
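
One possible way to keep data immutable is to write each run's output to its own location and refuse to overwrite it. The sketch below assumes a hypothetical `run=<id>/records.json` layout and hypothetical function names; it is not the layout used in the talk.

```python
import json
import tempfile
from pathlib import Path


def write_immutable(base_dir, run_id, records):
    """Write one run's records to a new file; never overwrite an existing one."""
    path = Path(base_dir) / f"run={run_id}" / "records.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError if the file already exists.
    with open(path, "x") as f:
        json.dump(records, f)
    return path


def read_run(base_dir, run_id):
    """Read back exactly what a given run produced."""
    path = Path(base_dir) / f"run={run_id}" / "records.json"
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    write_immutable(tmp, "run-001", [{"id": 1}])
    try:
        write_immutable(tmp, "run-001", [{"id": 2}])
    except FileExistsError:
        print("refused to overwrite existing run")
    print(read_run(tmp, "run-001"))  # prints: [{'id': 1}]
```

Because earlier outputs are never modified, rerunning a downstream step against the same run id reads the same input and can reproduce the same outcome.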

Page 32: Observability for Data Pipelines - Conferences
Page 33: Observability for Data Pipelines - Conferences

Data Lineage

Diagnostics

Page 34: Observability for Data Pipelines - Conferences
Page 35: Observability for Data Pipelines - Conferences

Tag records with metadata (version of code, source of data)

Log to a distributed tracer (use consistent unique identifiers to track)
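
A minimal sketch of both ideas. The `_lineage` field name, the `CODE_VERSION` constant, and the example source path are hypothetical, and the standard `logging` module stands in for a real distributed tracer.

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

CODE_VERSION = "abc1234"  # hypothetical, e.g. a commit hash injected at build time


def tag_record(record, source, trace_id=None):
    """Return a copy of the record with lineage metadata attached."""
    tagged = dict(record)
    tagged["_lineage"] = {
        "trace_id": trace_id or str(uuid.uuid4()),
        "source": source,
        "code_version": CODE_VERSION,
    }
    return tagged


def enrich(record):
    """A downstream step that keeps the lineage it received and logs the same id."""
    out = dict(record)
    out["title_length"] = len(record["title"])
    log.info("enriched record trace_id=%s", out["_lineage"]["trace_id"])
    return out


tagged = tag_record({"title": "Data Lineage"}, source="schedule.html")
result = enrich(tagged)
print(result["_lineage"]["trace_id"] == tagged["_lineage"]["trace_id"])  # prints: True
```

Using the same identifier in the record and in every log line lets a bad output row be traced back through each step to its source.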

Page 36: Observability for Data Pipelines - Conferences

Test Run

Validate Assumptions

Page 37: Observability for Data Pipelines - Conferences

What assumptions did you make about your data?

Page 38: Observability for Data Pipelines - Conferences

Schema

Page 39: Observability for Data Pipelines - Conferences

Best Practices

Protocol buffers are a language-neutral, platform-neutral, extensible mechanism for serializing structured data.
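
Protocol buffers require a `.proto` definition and generated code, so the sketch below swaps in a plain Python dataclass to illustrate the underlying idea of enforcing a schema at the pipeline boundary. The `Talk` schema and `parse_talk` helper are hypothetical, and this is not the Protocol Buffers API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Talk:
    """Hypothetical schema: every record must have exactly these typed fields."""
    title: str
    minutes: int

    def __post_init__(self):
        if not isinstance(self.title, str) or not self.title:
            raise ValueError("title must be a non-empty string")
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(self.minutes, bool) or not isinstance(self.minutes, int):
            raise ValueError("minutes must be an integer")
        if self.minutes <= 0:
            raise ValueError("minutes must be positive")


def parse_talk(raw):
    """Validate a raw dict against the schema; missing or extra fields are rejected."""
    expected = {"title", "minutes"}
    if set(raw) != expected:
        raise ValueError(f"fields {sorted(raw)} do not match schema {sorted(expected)}")
    return Talk(**raw)


print(parse_talk({"title": "Schema", "minutes": 40}))
try:
    parse_talk({"title": "Schema", "minutes": "40"})
except ValueError as err:
    print("rejected:", err)  # prints: rejected: minutes must be an integer
```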

Page 40: Observability for Data Pipelines - Conferences
Page 41: Observability for Data Pipelines - Conferences

Ability to test output of data transformation before committing to database
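
One way to sketch a dry run mode is to perform the writes inside a database transaction and roll it back unless the run is explicitly committed. The example uses an in-memory SQLite database; the `talks` table, the `run` function, and its `dry_run` flag are assumptions for illustration.

```python
import sqlite3


def transform(rows):
    """Clean up raw (title, minutes) pairs."""
    return [(title.strip(), int(minutes)) for title, minutes in rows]


def run(conn, rows, dry_run=True):
    """Apply the transformation and insert the result; commit only when dry_run is False."""
    transformed = transform(rows)
    conn.executemany("INSERT INTO talks (title, minutes) VALUES (?, ?)", transformed)
    if dry_run:
        conn.rollback()  # discard the inserts, keep the output for inspection
    else:
        conn.commit()
    return transformed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE talks (title TEXT, minutes INTEGER)")
conn.commit()

preview = run(conn, [(" Dry Run Mode ", "40")], dry_run=True)
print(preview)  # prints: [('Dry Run Mode', 40)]
print(conn.execute("SELECT COUNT(*) FROM talks").fetchone()[0])  # prints: 0
```

The dry run exercises the same code path as a real run, including the insert statement, so the output and any database errors can be reviewed before anything is persisted.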

Page 42: Observability for Data Pipelines - Conferences

Testing

Page 43: Observability for Data Pipelines - Conferences

Test Pyramid

https://martinfowler.com/articles/practical-test-pyramid.html

Page 44: Observability for Data Pipelines - Conferences

Different Types of Tests

• Unit

• Functional

• Regression

• System

Page 45: Observability for Data Pipelines - Conferences

Unit Tests

Page 46: Observability for Data Pipelines - Conferences

Functional Tests

• Also known as integration tests

• In the case of data, these are the golden tests

• Set the gold standard for data in and data out

• Don’t need to be logic-specific the way unit tests are

• Build a golden test framework and define fixtures (expected input and expected output)

Failed Gold Test
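
A golden test framework can be very small. The sketch below assumes fixtures are pairs of input and expected output and reports every mismatch; the `transform` function, the fixture contents, and `run_golden_tests` are hypothetical.

```python
def transform(record):
    """The transformation under test (hypothetical)."""
    return {"title": record["title"].strip().title(), "minutes": int(record["minutes"])}


# Fixtures: the gold standard for data in and data out.
GOLDEN_FIXTURES = [
    {"input": {"title": " data pipelines ", "minutes": "40"},
     "expected": {"title": "Data Pipelines", "minutes": 40}},
    {"input": {"title": "golden tests", "minutes": "5"},
     "expected": {"title": "Golden Tests", "minutes": 5}},
]


def run_golden_tests(fn, fixtures):
    """Run fn on each fixture input and return a list of mismatches."""
    failures = []
    for i, fixture in enumerate(fixtures):
        actual = fn(fixture["input"])
        if actual != fixture["expected"]:
            failures.append({"fixture": i, "expected": fixture["expected"], "actual": actual})
    return failures


print(run_golden_tests(transform, GOLDEN_FIXTURES))  # prints: []
```

The test only compares inputs and outputs, so the transformation's internals can change freely as long as the golden output stays the same.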

Page 47: Observability for Data Pipelines - Conferences

Regression Tests

https://www.ibeta.com/regression-testing-nutshell/

Page 48: Observability for Data Pipelines - Conferences

Champion/Challenger Model

Model A Precision: 93%

Model B Precision: 95%
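
A champion/challenger comparison on precision could be sketched as follows. The labels and predictions are made-up toy data, and `pick_model` with its `min_gain` parameter is a hypothetical helper, not the talk's implementation.

```python
def precision(y_true, y_pred):
    """Precision = true positives / (true positives + false positives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    return tp / (tp + fp) if (tp + fp) else 0.0


def pick_model(y_true, champion_pred, challenger_pred, min_gain=0.0):
    """Promote the challenger only if it beats the champion by more than min_gain."""
    champ = precision(y_true, champion_pred)
    chall = precision(y_true, challenger_pred)
    winner = "challenger" if chall - champ > min_gain else "champion"
    return winner, champ, chall


# Toy labelled data set scored by both models.
y_true = [1, 1, 0, 1, 0, 1]
champion_pred = [1, 1, 1, 1, 0, 0]    # 3 true positives, 1 false positive
challenger_pred = [1, 1, 0, 1, 0, 0]  # 3 true positives, 0 false positives

print(pick_model(y_true, champion_pred, challenger_pred))  # prints: ('challenger', 0.75, 1.0)
```

Scoring both models against the same labelled data on every change turns the comparison into a regression test for model quality.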

Page 49: Observability for Data Pipelines - Conferences

System Tests

• Black box

• Measures health of whole system

• Continuous

• Relies on other signals

Page 50: Observability for Data Pipelines - Conferences

Monitoring & Testing

Page 51: Observability for Data Pipelines - Conferences

Monitoring

Page 52: Observability for Data Pipelines - Conferences

Monitoring Tools

Page 53: Observability for Data Pipelines - Conferences
Page 54: Observability for Data Pipelines - Conferences
Page 55: Observability for Data Pipelines - Conferences

Metric Types

• Counter - single monotonically increasing counter

• Gauge - single numerical value that can go up or down (think temperature)

• Histogram - samples observations and counts them in buckets

• Summary - samples observations and calculates quantiles over a sliding window
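
To make the first three metric types concrete, here is a toy, in-process sketch. These classes are illustrative stand-ins and not the API of any monitoring library; the histogram stores per-bucket counts, and the summary type is left out for brevity.

```python
import bisect


class Counter:
    """A value that only ever goes up, e.g. rows processed."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """A value that can go up or down, e.g. queue depth."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations in buckets defined by upper bounds."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot catches values above every bound

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1


rows_processed = Counter()
rows_processed.inc(500)
queue_depth = Gauge()
queue_depth.set(12)
job_seconds = Histogram([1, 5, 30])
for seconds in (0.5, 5, 12, 90):
    job_seconds.observe(seconds)
print(rows_processed.value, queue_depth.value, job_seconds.counts)  # prints: 500.0 12 [1, 1, 1, 1]
```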

Page 56: Observability for Data Pipelines - Conferences

Time Series Metrics

Page 57: Observability for Data Pipelines - Conferences

Alerting

Page 58: Observability for Data Pipelines - Conferences

Monitoring with Time Series Data

Page 59: Observability for Data Pipelines - Conferences

Best Practices

Set a threshold that works for you. Establish a baseline and go from there.
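
One simple way to derive a threshold from a baseline is to alert when a value falls outside a band around the historical mean. The sketch below assumes a band of `k` standard deviations and made-up row counts; the right band width depends on the pipeline.

```python
from statistics import mean, stdev


def build_threshold(baseline, k=3.0):
    """Return (low, high) bounds k standard deviations around the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return mu - k * sigma, mu + k * sigma


def should_alert(value, baseline, k=3.0):
    """True if the value falls outside the band derived from the baseline."""
    low, high = build_threshold(baseline, k)
    return value < low or value > high


# Hypothetical baseline: rows processed by the last five runs.
baseline = [1000, 1020, 980, 1010, 990]
print(should_alert(1005, baseline))  # prints: False
print(should_alert(400, baseline))   # prints: True
```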

Page 60: Observability for Data Pipelines - Conferences

Page on symptoms, not root causes. Create a trail of causes for diagnostics.

Page 61: Observability for Data Pipelines - Conferences

Data Lineage, Immutable Data, Test Run -> Ease of development when working with evolving data

Page 62: Observability for Data Pipelines - Conferences

Monitoring & Alerting -> Overall Pipeline Observable

Page 63: Observability for Data Pipelines - Conferences

Rate today’s session

Session page on conference website

O’Reilly Events App

Page 64: Observability for Data Pipelines - Conferences

Thanks! We’re hiring!

www.usebutton.com [email protected]