The “Big Data” Ecosystem at LinkedIn

Presented by Zhongfang Zhuang

• Based on the paper “The Big Data Ecosystem at LinkedIn”, written by Roshan Sumbaly, Jay Kreps, and Sam Shah.

The Ecosystems

Hadoop Ecosystem

• HDFS
• MapReduce
• Pig
• Hive
• Oozie
• HBase
• ZooKeeper
• Flume
• Sqoop

Data Integration Problem

“Some analysts performed this integration themselves. In other cases, analysts—especially application users and scripters—relied on the IT team to assemble this data for them.”

–Sean Kandel et al., Enterprise Data Analysis and Visualization: An Interview Study

Ecosystem at LinkedIn

Online datacenters

Offline production datacenters

Offline development datacenters

Hadoop for ETL

Online datacenters

Offline production datacenters

ETL
Extract-Transform-Load

More information about ETL is available from Cloudera

Online Datacenters
• Voldemort
• Oracle

Voldemort
• Combines in-memory caching with the storage system
• Possible to emulate the storage layer
• Reads and writes scale horizontally
• Simple API
• Transparent data partitioning

More information about Voldemort is available at: http://www.project-voldemort.com/voldemort/
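
A minimal sketch of what a simple API with transparent data partitioning can look like from the caller's side. The class and method names are hypothetical and are not Voldemort's client API, and the modulo hashing is an assumed stand-in for the real partitioning scheme.

```python
import hashlib


class PartitionedStore:
    """Toy key-value store that routes each key to one of several nodes."""

    def __init__(self, num_nodes):
        # Each "node" is an in-memory dict in this sketch.
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # A stable hash, so the same key always maps to the same node.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)

    def delete(self, key):
        node = self._node_for(key)
        if key in node:
            del node[key]
            return True
        return False
```

The caller only sees `put`, `get`, and `delete`; which node holds a key is decided inside the store.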

Next
• Ingress into and egress out of the Hadoop system
• Data flows

Ingress
From online to offline

Online datacenters

Hadoop for ETL

• Database data includes information about users, companies, connections, and other primary site data.

• Event data includes logs of page views being served, search queries, and clicks.

Online datacenters

Hadoop for ETL

Challenges
• Provide infrastructure that can make all data available without manual intervention or processing.

Challenges
• Datasets are very large
• Data is diverse
• Data schemas are evolving
• Data must be monitored and validated

Kafka
A high-throughput distributed messaging system for all activity data.

• Persists messages in a write-ahead log, partitioned and distributed over multiple brokers.

• Allows data publishers to add records to a log.

• Allows new data streams to be added and scaled independently of one another.

Example
• Collecting data about searches being performed. The search service would produce these records and publish them to a topic named “SearchQueries”, where any number of subscribers might read these messages.
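
The publish and subscribe flow in this example can be sketched with a toy in-memory log. This is only an illustration: `Topic`, `publish`, `poll`, the record fields, and the subscriber names are hypothetical and are not Kafka's client API.

```python
from collections import defaultdict


class Topic:
    """Append-only log that many subscribers read at their own pace."""

    def __init__(self, name):
        self.name = name
        self.log = []                    # records in publish order
        self.offsets = defaultdict(int)  # subscriber name -> next offset to read

    def publish(self, record):
        self.log.append(record)
        return len(self.log) - 1         # offset of the record just added

    def poll(self, subscriber, max_records=100):
        start = self.offsets[subscriber]
        batch = self.log[start:start + max_records]
        self.offsets[subscriber] = start + len(batch)
        return batch


search_queries = Topic("SearchQueries")
search_queries.publish({"member_id": 1, "query": "data engineer"})
search_queries.publish({"member_id": 2, "query": "hadoop"})

# Two hypothetical subscribers read the same records independently.
ranking_batch = search_queries.poll("related-searches")
loader_batch = search_queries.poll("hadoop-loader", max_records=1)
```

Because each subscriber keeps its own offset, a slow reader does not hold back a fast one, and new subscribers can be added without changing the publisher.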

Data Evolution
• Requires schema changes
• Two common solutions:
  • Load data streams in whatever form they appear.
  • Manually map the source data into a stable, well-thought-out schema.

LinkedIn’s Solution
• Retains the same structure throughout the data pipeline
• Enforces compatibility and other correctness conventions
• Schema evolves automatically
• Schema checks are done at compile time and run time
• Review process
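
One way to picture an automated compatibility check is the sketch below. The two rules (an added field needs a default, an existing field may not change type) are simplifying assumptions and not LinkedIn's actual conventions; representing schemas as plain dicts is also hypothetical.

```python
def compatibility_errors(old_schema, new_schema):
    """Return reasons the new schema cannot read data written with the old one."""
    errors = []
    for name, field in new_schema.items():
        if name not in old_schema:
            # Old records lack this field, so a reader needs a fallback value.
            if "default" not in field:
                errors.append(f"new field '{name}' has no default")
        elif old_schema[name]["type"] != field["type"]:
            errors.append(
                f"field '{name}' changed type "
                f"{old_schema[name]['type']} -> {field['type']}"
            )
    return errors


v1 = {"member_id": {"type": "long"}, "query": {"type": "string"}}
v2_ok = {**v1, "locale": {"type": "string", "default": "en_US"}}
v2_bad = {
    "member_id": {"type": "string"},  # type change
    "query": {"type": "string"},
    "locale": {"type": "string"},     # added without a default
}
```

A check like this could run when a schema change is proposed, so that an incompatible change such as `v2_bad` is rejected before any data is written with it.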

Load into Hadoop
• Pull data from Kafka to Hadoop with a map-only job.
• Recurs every 10 minutes.
• Scheduled by the Azkaban scheduling system.

• Two Kafka clusters are kept synchronized automatically.

• The primary Kafka cluster supports production.

[Diagram: Online Datacenters (Services, Kafka); Offline Development Datacenters (Kafka)]

Audit

• Each step in the flow publishes an audit trail.

• The audit trail consists of the topic, machine name, etc.

• Confirm that all published events reached all consumers by aggregating audit data.
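
The aggregation step can be sketched as follows. The record fields (`tier`, `machine`, `topic`, `window`, `count`) and the tier names are hypothetical; the slides only say that the audit trail carries the topic, machine name, and similar details.

```python
from collections import Counter


def missing_events(audit_records, source_tier, sink_tier):
    """Compare per-topic, per-window event counts reported by two tiers."""
    counts = {source_tier: Counter(), sink_tier: Counter()}
    for rec in audit_records:
        if rec["tier"] in counts:
            counts[rec["tier"]][(rec["topic"], rec["window"])] += rec["count"]
    gaps = {}
    for key, produced in counts[source_tier].items():
        consumed = counts[sink_tier][key]  # 0 if the sink reported nothing
        if consumed != produced:
            gaps[key] = produced - consumed
    return gaps


records = [
    {"tier": "producer", "machine": "app01", "topic": "SearchQueries",
     "window": "2013-06-01T10:00", "count": 120},
    {"tier": "producer", "machine": "app02", "topic": "SearchQueries",
     "window": "2013-06-01T10:00", "count": 80},
    {"tier": "hadoop", "machine": "etl01", "topic": "SearchQueries",
     "window": "2013-06-01T10:00", "count": 195},
]
gaps = missing_events(records, "producer", "hadoop")
```

Here the two producers report 200 events in the window and the Hadoop tier reports 195, so `gaps` records a shortfall of 5 for that topic and window.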

Workflow
Processed offline

Workflow in Hadoop

• A workflow is a directed acyclic graph of dependencies.

• Wrappers help restrict the data being read to a certain time range, based on the parameters specified.

• The Hadoop-Kafka data pull job places the data into time-partitioned folders.
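
A small sketch of how time-partitioned folders let a wrapper limit a job to a time range. The hourly `/data/<topic>/YYYY/MM/DD/HH` layout and both function names are assumptions for illustration; the slides do not give the actual folder scheme.

```python
from datetime import datetime, timedelta


def partition_path(topic, event_time):
    """Hourly folder for a record (hypothetical layout)."""
    return f"/data/{topic}/{event_time:%Y/%m/%d/%H}"


def paths_in_range(topic, start, end):
    """List the hourly folders that cover the half-open range [start, end)."""
    current = start.replace(minute=0, second=0, microsecond=0)
    paths = []
    while current < end:
        paths.append(partition_path(topic, current))
        current += timedelta(hours=1)
    return paths


inputs = paths_in_range(
    "SearchQueries",
    datetime(2013, 6, 1, 22, 30),
    datetime(2013, 6, 2, 1, 0),
)
```

A job given `inputs` reads only three folders (hours 22 and 23 of June 1 and hour 00 of June 2) instead of scanning the whole topic.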

Workflow in Hadoop

• A workflow is a directed acyclic graph of dependencies.

• A single workflow can contain 50-100 jobs.
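
Treating a workflow as a directed acyclic graph means a valid run order can be derived from the dependencies alone. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the job names are hypothetical, and this is not how Azkaban itself is configured.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it depends on (hypothetical names).
workflow = {
    "pull_events": set(),
    "extract_features": {"pull_events"},
    "train_model": {"extract_features"},
    "score_members": {"train_model", "extract_features"},
    "push_to_store": {"score_members"},
}

# static_order() yields every job after all of its dependencies.
run_order = list(TopologicalSorter(workflow).static_order())
```

If a dependency cycle were introduced, `static_order()` would raise `graphlib.CycleError`, which is why the graph has to stay acyclic.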

Azkaban
An open-source workflow scheduler.

Azkaban

• Supports multiple job types
• Jobs run individually or chained
• Configurations and dependencies are maintained
• Visualize and manipulate dependencies via graphs in the UI

Construction of Workflows

• Experiment with the data and massage it into a useful form

• Transform features into feature vectors

• Train models

• Iterate on workflows

[Diagram: Offline Development Datacenters, containing Kafka, Hadoop for ETL, Hadoop for development, and two Azkaban instances]

Offline Production Datacenters

Staging Voldemort Read-

Hadoop for production

Azkaban

Egress
Back online

Results
• Usually pushed to other systems (back online or for further consumption)

Three main mechanisms

• Key-Value
• Streams
• OLAP

Voldemort

• Based on Amazon’s Dynamo
• Distributed and elastic
• Horizontally scalable
• Bulk load pipeline from Hadoop
• Simple to use

Reference: Slides of The big data ecosystem at LinkedIn, presented at SIGMOD 2013

Stream output performed by Kafka

• Implemented using the Hadoop API
• Each MR slot acts as a Kafka producer
• More details available at http://kafka.apache.org

OLAP by Avatara

• Automatically generates the corresponding Azkaban job pipelines.

• Small cubes are multidimensional arrays of tuples.

• Each tuple is a combination of dimension and measure pairs.

• Cubes are output to Voldemort as a read-only store.
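
To make the cube description concrete, here is a toy cube held as a list of tuples, each pairing dimension values with a measure. The dimension names, the numbers, and the `roll_up` helper are hypothetical and do not reflect Avatara's storage format or query interface.

```python
from collections import defaultdict

# Each tuple: (dimension values, measure). All values are made up.
cube = [
    ({"viewer_industry": "software", "week": "2013-W22"}, 14),
    ({"viewer_industry": "finance", "week": "2013-W22"}, 5),
    ({"viewer_industry": "software", "week": "2013-W23"}, 9),
]


def roll_up(cube, dimension):
    """Sum the measure over every dimension except the one requested."""
    totals = defaultdict(int)
    for dims, measure in cube:
        totals[dims[dimension]] += measure
    return dict(totals)


by_industry = roll_up(cube, "viewer_industry")
by_week = roll_up(cube, "week")
```

Because each cube is small, a roll-up like this can be computed when the cube is read, after fetching it from the key-value store.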

Applications

Key-Value
• “People you may know”
• Collaborative Filtering
• Skill Endorsements
• Related Searches

Stream
• News Feed Updates
• Email
• Relationship Strength

OLAP
• “Who’s viewed my profile?”
• “Who’s viewed this job?”

Future Work

• MapReduce for processing large graphs

• Streaming system for nearline, low-latency data processing