Gobblin' Big Data With Ease @ QConSF 2014


Description: QConSF 2014 talk

Transcript of Gobblin' Big Data With Ease @ QConSF 2014

Page 1: Gobblin' Big Data With Ease @ QConSF 2014

©2014 LinkedIn Corporation. All Rights Reserved.

Gobblin’ Big Data with Ease

Lin Qiao

Data Analytics Infra @ LinkedIn

Page 2

Overview

• Challenges

• What does Gobblin provide?

• How does Gobblin work?

• Retrospective and lookahead


Page 4

Perception

[Diagram: the perceived architecture – an ingest framework loads and validates primary data sources into an analytics platform, where transformations produce business-facing insights and member-facing insights and data products.]

Page 5

Reality

[Diagram: the real architecture – many parallel pipelines feed Hadoop: Camus ingests Kafka activity (tracking) data; Lumos ingests profile data from R/W stores (Oracle/Espresso) via Databus changes and change dumps on filer; ad-hoc ingest utilities pull external partner data; Lassen builds facts and dimensions. On Hadoop, core data sets (tracking, database, external) and derived data sets feed DWH ETL (fact tables) into Teradata for product, sciences, and enterprise analytics, while computed results flow through a read store (Voldemort) to the site (member-facing products) and enterprise products.]

Page 6

Challenges @ LinkedIn

• Large variety of data sources

• Multi-paradigm: streaming data, batch data

• Different types of data: facts, dimensions, logs, snapshots, increments, changelog

• Operational complexity of multiple pipelines

• Data quality

• Data availability and predictability

• Engineering cost

Page 7

Open source solutions

• Sqoop

• Flume

• Morphline

• Aegisthus

• Logstash

• Camus

• RDBMS vendor-specific connectors

Page 8

Goals

• Unified and Structured Data Ingestion Flow

– RDBMS -> Hadoop

– Event Streams -> Hadoop

• Higher level abstractions

– Facts, Dimensions

– Snapshots, increments, changelog

• ELT oriented

– Minimize transformation in the ingest pipeline

Page 9

Central Ingestion Pipeline

[Diagram: the same landscape with Gobblin as the single central ingestion pipeline – Kafka tracking data, OLTP data from R/W stores (Oracle/Espresso) via Databus changes and change dumps on filer, and external partner data over REST, JDBC, SOAP, and custom protocols all enter Hadoop through Gobblin, followed by compaction. Core data sets (tracking, database, external) and derived data sets feed DWH ETL (fact tables) into Teradata for product, sciences, and enterprise analytics, the site (member-facing products), and enterprise products.]

Page 10

Overview

• Challenges

• What does Gobblin provide?

• How does Gobblin work?

• Retrospective and lookahead

Page 11

Gobblin Usage @ LinkedIn

• Business analytics – source data for sales analysis, product sentiment analysis, etc.

• Engineering – source data for issue tracking, monitoring, product release, security compliance, A/B testing

• Consumer product – source data for acquisition integration; performance analysis for email campaigns, ads campaigns, etc.

Page 12

Key Features

Horizontally scalable and robust framework

Unified computation paradigm

Turn-key solution

Customize your own Ingestion

Page 13

Scalable and Robust Framework

• Scalable – jobs are partitioned into tasks that run concurrently

• Centralized state management – state is carried over between jobs automatically, so metadata can be used to track offsets, checkpoints, watermarks, etc.

• Fault tolerant – the framework gracefully deals with machine and job failures

• Quality assurance – quality checking is baked in throughout the flow

Page 14

Unified computation paradigm

• Common execution flow – shared between batch ingestion and streaming ingestion pipelines

• Shared infra components – shared job state management, job metrics store, and metadata management

Page 15

Turn Key Solution

• Built-in exchange protocols – existing adapters can easily be re-used for sources speaking common protocols (e.g. JDBC, REST, SFTP, SOAP)

• Built-in source integration – fully integrated with commonly used sources, including MySQL, SQL Server, Oracle, Salesforce, HDFS, filer, and internal dropbox

• Built-in data ingestion semantics – covers full-dump and incremental ingestion for fact and dimension datasets

• Policy-driven flow execution and tuning – flow owners just specify a pre-defined policy for handling job failure, degree of parallelism, what data to publish, etc.
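Policy-driven execution like this typically surfaces as a small job-configuration file. The sketch below is hypothetical: the property keys are illustrative placeholders, not necessarily Gobblin's actual configuration names.

```properties
# Hypothetical Gobblin-style job configuration (key names are illustrative)
job.name=SalesforceAccountsIngest
source.class=com.example.SalesforceSource

# Failure-handling policy: fail the whole job vs. publish partial data
job.commit.policy=full

# Degree of parallelism: how many work units run concurrently
taskexecutor.threadpool.size=10

# What data to publish, and where, on success
data.publisher.final.dir=/data/ingest/salesforce
```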

Page 16

Customize Your Own Ingestion Pipeline

• Extendable operators – operators for extraction, conversion, quality checking, data persistence, etc. can be implemented or extended against a common API

• Configurable operator flow – the configuration exposes multiple plugin points for customized logic and code

Page 17

Overview

• Challenges

• What does Gobblin provide?

• How does Gobblin work?

• Lookahead

Page 18

Under the Hood

Page 19

Computation Model

• Gobblin standalone

– single process, multi-threading

– Testing, small data, sampling

• Gobblin on Map/Reduce

– Large datasets, horizontally scalable

• Gobblin on Yarn

– Better resource utilization

– More scheduling flexibility

Page 20

Scalable Ingestion Flow

[Diagram: the Source emits WorkUnits; each WorkUnit runs as a Task executing Extractor → Converter → Quality Checker → Writer; the Data Publisher commits the output of all tasks.]

Page 21

Sources

• Determines how to partition work

- Partitioning algorithm can leverage source sharding

- Group partitions intelligently for performance

• Creates work-units to be scheduled

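The partitioning step can be sketched as follows. This is an illustrative stand-in, not the actual Gobblin Source API: here a numeric ID range is split into evenly sized work units that tasks can pull concurrently.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of how a Source might partition work into WorkUnits
 * (illustrative; the real Gobblin Source interface differs).
 */
public class WorkUnitPlanner {
    /** A contiguous ID range that one task will pull. */
    public static final class WorkUnit {
        public final long lowId;
        public final long highId;
        WorkUnit(long lowId, long highId) { this.lowId = lowId; this.highId = highId; }
    }

    /** Split [0, maxId) into at most `partitions` evenly sized shards. */
    public static List<WorkUnit> plan(long maxId, int partitions) {
        List<WorkUnit> units = new ArrayList<>();
        long step = (maxId + partitions - 1) / partitions; // ceiling division
        for (long low = 0; low < maxId; low += step) {
            units.add(new WorkUnit(low, Math.min(low + step, maxId)));
        }
        return units;
    }
}
```

A real implementation would consult source sharding (e.g. table partitions or Kafka topic partitions) instead of a plain ID range, and group small partitions together for performance.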

Page 22

Job Management

• Job execution states

– Watermark

– Task state, job state, quality checker output, error code

• Job synchronization

• Job failure handling: policy driven

[Diagram: successive job runs (run 1, run 2, run 3) each read from and write to a central state store.]
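The watermark carry-over between job runs can be sketched as follows; this is a minimal in-memory illustration, not the actual Gobblin state-store API. Each run reads the watermark left by the previous run and records a new one on success.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch of watermark carry-over between job runs (illustrative).
 * A real state store persists this metadata durably, e.g. on HDFS.
 */
public class StateStoreSketch {
    private final Map<String, Long> store = new HashMap<>();

    /** Low watermark for this run: where the previous run stopped. */
    public long lowWatermark(String jobName) {
        return store.getOrDefault(jobName, 0L);
    }

    /** On successful commit, persist the new high watermark. */
    public void commit(String jobName, long highWatermark) {
        store.put(jobName, highWatermark);
    }
}
```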

Page 23

Gobblin Operator Flow

Extract Schema → Convert Schema → Extract Record → Convert Record → Check Record Data Quality → Write Record → Check Task Data Quality → Commit Task Data
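The record-level portion of this flow can be sketched as a simple loop. This is illustrative, not the real Gobblin task runner: extract, convert, check each record, write the records that pass, then let the caller run task-level checks before committing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Sketch of the per-task operator flow (illustrative). */
public class TaskFlowSketch {
    public static <R> List<R> runTask(List<R> extracted,
                                      Function<R, R> convert,
                                      Predicate<R> rowQualityCheck) {
        List<R> written = new ArrayList<>();
        for (R record : extracted) {                 // Extract Record
            R converted = convert.apply(record);     // Convert Record
            if (rowQualityCheck.test(converted)) {   // Check Record Data Quality
                written.add(converted);              // Write Record
            }
        }
        return written; // caller runs task-level checks, then commits
    }
}
```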

Page 24

Extractors

• Specifies how to get the schema and pull data from the source

• Returns a ResultSet iterator

• Tracks the high watermark

• Tracks extraction metrics
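An extractor along these lines can be sketched as an iterator that tracks its watermark and metrics as it pulls rows. This is illustrative, not the real Gobblin Extractor interface, and a plain list stands in for a JDBC ResultSet.

```java
import java.util.Iterator;
import java.util.List;

/** Sketch of an extractor tracking watermark and metrics (illustrative). */
public class ExtractorSketch implements Iterator<String> {
    private final Iterator<String> resultSet; // stands in for a JDBC ResultSet
    private long highWatermark;               // last offset successfully read
    private long recordsRead;                 // extraction metric

    public ExtractorSketch(List<String> rows) {
        this.resultSet = rows.iterator();
    }

    @Override public boolean hasNext() { return resultSet.hasNext(); }

    @Override public String next() {
        recordsRead++;
        // Here the watermark is just a row offset; real sources would use
        // timestamps, SCNs, or Kafka offsets.
        highWatermark = recordsRead;
        return resultSet.next();
    }

    public long getHighWatermark() { return highWatermark; }
    public long getRecordsRead() { return recordsRead; }
}
```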

Page 25

Converters

• Allow for schema and data transformation

– Filtering

– Projection

– Type conversion

– Structural change

• Composable: a list of converters can be specified and applied in the given order
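The composition described above can be sketched as a chain where each converter maps a record to zero-or-one output record, and returning an empty result filters the record out. This is an illustrative sketch, not Gobblin's actual Converter API.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Sketch of composable converters applied in order (illustrative). */
public class ConverterChain<R> {
    private final List<Function<R, Optional<R>>> converters;

    public ConverterChain(List<Function<R, Optional<R>>> converters) {
        this.converters = converters;
    }

    /** Apply each converter in order; an empty result filters the record. */
    public Optional<R> convert(R record) {
        Optional<R> current = Optional.of(record);
        for (Function<R, Optional<R>> c : converters) {
            if (current.isEmpty()) break; // filtered by an earlier converter
            current = c.apply(current.get());
        }
        return current;
    }
}
```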

Page 26

Quality Checkers

• Ensure the quality of any data produced by Gobblin

• Can be run on a per-record, per-task, or per-job basis

• A list of quality checkers can be applied

– Schema compatibility

– Audit check

– Sensitive fields

– Unique key

• Policy driven

– FAIL – if the check fails, so does the job

– OPTIONAL – if the check fails, the job continues

– ERR_FILE – the offending row is written to an error file
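The three policies can be sketched at the record level as follows; this is illustrative, not Gobblin's actual quality-checker API, and a plain list stands in for the error file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/**
 * Sketch of policy-driven record-level quality checking (illustrative):
 * FAIL aborts on a bad record, OPTIONAL lets the job continue, ERR_FILE
 * diverts the offending record to an error sink.
 */
public class QualityCheckerSketch {
    public enum Policy { FAIL, OPTIONAL, ERR_FILE }

    public static List<String> check(List<String> records,
                                     Predicate<String> passes,
                                     Policy policy,
                                     List<String> errFile) {
        List<String> good = new ArrayList<>();
        for (String r : records) {
            if (passes.test(r)) {
                good.add(r);
            } else if (policy == Policy.FAIL) {
                throw new RuntimeException("quality check failed: " + r);
            } else if (policy == Policy.ERR_FILE) {
                errFile.add(r); // offending row goes to the error file
            } // OPTIONAL: record is dropped and the job continues
        }
        return good;
    }
}
```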

Page 27

Writers

• Writing data in Avro format onto HDFS

– One writer per task

• Flexibility

– Configurable compression codec (Deflate, Snappy)

– Configurable buffer size

• Plan to support other data formats (Parquet, ORC)
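The configurable-codec idea can be sketched with plain JDK streams; this is illustrative only (the real writer emits Avro onto HDFS). "deflate" wraps the sink in a Deflater stream; "none" writes bytes through unchanged; Snappy would need a third-party library, so it is omitted here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.DeflaterOutputStream;

/** Sketch of a writer with a configurable compression codec (illustrative). */
public class WriterSketch {
    /** Wrap the sink according to the configured codec. */
    public static OutputStream open(OutputStream sink, String codec) {
        return "deflate".equals(codec) ? new DeflaterOutputStream(sink) : sink;
    }

    /** Write one record through the configured codec and return the bytes. */
    public static byte[] write(byte[] record, String codec) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = open(sink, codec)) {
            out.write(record);
        }
        return sink.toByteArray();
    }
}
```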

Page 28

Publishers

• Determines job success based on policy

– COMMIT_ON_FULL_SUCCESS

– COMMIT_ON_PARTIAL_SUCCESS

• Commits data to final directories based on job success

[Diagram: each task writes its file to a tmp dir; on job success the publisher moves the files to the final dir.]
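The commit decision can be sketched as follows; this is an illustrative stand-in for the publisher's policy check, not Gobblin's actual publisher API.

```java
import java.util.List;

/**
 * Sketch of policy-driven publishing (illustrative): tasks write to a tmp
 * dir, and the publisher decides from the commit policy and per-task
 * outcomes whether to move the output to the final dir.
 */
public class PublisherSketch {
    public enum Policy { COMMIT_ON_FULL_SUCCESS, COMMIT_ON_PARTIAL_SUCCESS }

    /** True if the job's output should be moved from tmp to final. */
    public static boolean shouldCommit(Policy policy, List<Boolean> taskSucceeded) {
        boolean any = taskSucceeded.stream().anyMatch(b -> b);
        boolean all = taskSucceeded.stream().allMatch(b -> b);
        return policy == Policy.COMMIT_ON_FULL_SUCCESS ? all : any;
    }
}
```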

Page 29

Gobblin Compaction

• Dimensions:

– Initial full dump followed by incremental extracts in Gobblin

– Maintain a consistent snapshot by doing regularly scheduled compaction

• Facts:

– Merge small files

[Diagram: ingestion writes to HDFS; compaction runs over the ingested data.]
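For dimensions, the compaction described above can be sketched as a last-write-wins merge: start from the last full snapshot, then replay the incremental extracts in order to produce a new consistent snapshot. This is an in-memory illustration, not Gobblin's actual compaction job.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of dimension compaction: snapshot + increments, latest per key wins. */
public class CompactionSketch {
    public static Map<String, String> compact(Map<String, String> snapshot,
                                              List<Map<String, String>> increments) {
        Map<String, String> result = new LinkedHashMap<>(snapshot);
        for (Map<String, String> delta : increments) {
            result.putAll(delta); // later extract overrides earlier value per key
        }
        return result;
    }
}
```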

Page 30

Overview

• Challenges

• What does Gobblin provide?

• How does Gobblin work?

• Retrospective and lookahead

Page 31

Gobblin in Production

• Data volume: > 350 datasets, ~60 TB per day

• Production instances: Salesforce, Responsys, RightNow, Timeforce, SlideShare, Newsle, A/B testing, LinkedIn JIRA, data retention

Page 32

Lessons Learned

• Data quality still needs a lot of work

• The small-data problem is not small

• Performance optimization opportunities

• Operational traits

Page 33

Gobblin Roadmap

• Gobblin on Yarn

• Streaming Sources

• Gobblin Workbench with ingestion DSL

• Data Profiling for richer quality checking

• Open source in Q4’14


Page 34