A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn


Bhaskar Ghosh, Senior Director of Engineering, Data Infrastructure

LinkedIn Confidential ©2013 All Rights Reserved

Big Data Science: A Symposium in Honor of Martin Schultz, Yale University, 26 Oct 2012

Outline


1. Martin and Me
2. Company and Mission
3. Products and Science
4. Data Infrastructure
5. P, S, DI: People You May Know
6. LinkedIn + Yale
7. Conclusion

Martin and Me


Thank you Martin! Best mentor. Versatility, big-picture thinking and leadership. Yale CS Ph.D. 1995 (Parallel Algorithms)

12y @ Informix & Oracle building parallel database systems

4y @ Yahoo! building Ads systems & leading the Display Ads Exchange organization

2y+ @ LinkedIn building & leading the Data Infrastructure Engineering Organization

The World’s Largest Professional Network

175M+ Members Worldwide

2 new Members Per Second

100M+ Monthly Unique Visitors

2M+ Company Pages

Connecting Talent with Opportunity. At scale…


..and a bunch of Data-Driven Products


Pandora Search for People

Events You May Be Interested In

Groups browse maps

The LinkedIn Mission. Connect the world’s professionals to make them more productive and successful

LinkedIn Product Philosophy

Goals
• Provide a uniquely personalized experience to members (professionals)
• Build an ecosystem to balance the interests of members and partners (companies)

Approach
• Launch Often and Early
• Data-Driven Experiment and Test
• Fail Fast
• Prepare for Virality and Scale

Two Product Families


Data

Data Infrastructure

Science and Analytics

Professionals Companies

Connections

Profiles Actions

Content

For Members: People You May Know, Who’s Viewed My Profile, Jobs You May Be Interested In, News/Sharing, Today, Search, Subscriptions

For Partners: Hire, Market, Sell

The Big-Data Feedback Loop


Value ↑

Insights ↑

Scale ↑

Product

Science Data

Member

Engagement ↑

Virality ↑

Signals ↑

Refinement ↑

Infrastructure Analytics ↑


Member-Facing Products: Diversity at Scale

Product Family: Identity and Engagement
Products: 1. Profile and Connections 2. Activity Streams 3. Messages (email) 4. Endorsements & Skills
Science: Blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings)

Product Family: Search and Analysis
Products: 1. People Search 2. Group Search 3. Who Viewed My Profile

Product Family: Recommendations
Products: 1. People You May Know 2. Jobs You May Be Interested In 3. Events You May Be Interested In
Science: Entity disambiguation and matching

Product Family: Monetization
Products: 1. Subscription Packages 2. Sponsored Content
Science: Response Prediction, Inventory Forecasting

Recommendations Are Effective… and Drive
• > 50% of connections
• > 50% of job applications
• > 50% of group joins

Guiding Principle: Find data that is useful for Members
• Provide Relevant Content
• Establish Social Connections
• In Appropriate Context

LinkedIn Recommendation Engine

Recommendation Types: Behavior Analysis, Collaborative Filtering, Popularity

Recommendation Entities and Products:
• People: Similar Profiles, Referral Center, TalentMatch, People Browse Map
• Jobs: Jobs Browse Map, Similar Jobs, Jobs You May Be Interested In
• Groups: GYML, Groups Browse Map, Similar Groups
• … Ads, Companies, Searches, News, Events … and more

Shared, Dynamic, Unified Core Service:
• API
• User Feedback
• A/B
• (R-T) Feature Extraction, Entity Resolution & Enrichment
• (R-T) matching computations
• Offline data munging (Hadoop)


Member-Facing Products: Diversity at Scale

The same product families, products and science, with the Data Infra requirements they drive:
• Scale • Full text and secondary indexes • Real-time
• Faceted search • Near RT index freshness • Drill-down exploration
• Graph analysis • Content serving • Real-time tuning

LinkedIn Data Infrastructure: Three-Phase Abstraction


Users Online Data Infra

Near-Line Infra

Application Offline Data Infra

Infrastructure: Online
Latency & Freshness Requirements: Activity that should be reflected immediately
Products: • Member Profiles • Company Profiles • Connections • Messages • Endorsements • Skills

Infrastructure: Near-Line
Latency & Freshness Requirements: Activity that should be reflected soon
Products: • Activity Streams • Profile Standardization • News • Recommendations • Search • Messages

Infrastructure: Offline
Latency & Freshness Requirements: Activity that can be reflected later
Products: • People You May Know • Connection Strength • News • Recommendations • Next best idea…

LinkedIn Data Infrastructure: Sample Stack

Infra challenges in the 3-phase ecosystem are diverse, complex and specific.

Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms.

LinkedIn Data Infrastructure: Data Stores



Capabilities
• Transactions
• Rich structures (e.g. indexes)
• Change capture capability
• Key value / document storage

Systems: Voldemort

References: ICDE 2012 (Data Infra Overview); FAST 2012 (Voldemort for Serving)

LinkedIn Data Infrastructure: Specialized Indexes



Search platform: Zoie, Bobo, Sensei

Distributed graph engine: GraphDB

LinkedIn Data Infrastructure: Pipelines



• Kafka: messaging for site events and monitoring. High throughput.
• Databus: change data capture stream. Reliable, consistent, low-latency pipe.

References: ACM SOCC 2012: “Databus”; IEEE Data Eng. Bulletin 2012: “Kafka”
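The change-data-capture idea on this slide can be illustrated with a toy, in-memory sketch. It is not the Databus or Kafka API: the class names `ChangeLog`, `SourceStore` and `Replica` are hypothetical, and the only property it tries to show is that a consumer which remembers its own offset can replay ordered changes and catch up after falling behind.

```python
class ChangeLog:
    """Toy append-only log of (sequence, key, value) change events."""

    def __init__(self):
        self._events = []

    def append(self, key, value):
        seq = len(self._events)          # sequence numbers are dense and ordered
        self._events.append((seq, key, value))
        return seq

    def read_from(self, offset):
        """Return every event at or after `offset`, in commit order."""
        return self._events[offset:]


class SourceStore:
    """Primary store: every write is also published to the change log."""

    def __init__(self, log):
        self._data = {}
        self._log = log

    def put(self, key, value):
        self._data[key] = value
        self._log.append(key, value)


class Replica:
    """Downstream consumer (e.g. an index or cache) that tracks its own offset."""

    def __init__(self):
        self.data = {}
        self.offset = 0

    def catch_up(self, log):
        for seq, key, value in log.read_from(self.offset):
            self.data[key] = value
            self.offset = seq + 1        # checkpoint after applying each event


log = ChangeLog()
source = SourceStore(log)
replica = Replica()

source.put("member:1", "Engineer")
replica.catch_up(log)
source.put("member:1", "Manager")
source.put("member:2", "Designer")
replica.catch_up(log)                    # resumes from offset 1, not from scratch
print(replica.data, replica.offset)
```

Because the consumer owns its offset, a slow or restarted consumer does not hold up the producer; it simply resumes from its last checkpoint.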

LinkedIn Data Infrastructure: Off-line Analysis



• ML, Ranking, Relevance
• Insights and Analytics
• ETL, Metadata and Pipes
• Business Source of Truth

LinkedIn Data Infrastructure: Cluster Management






Generic framework for building distributed systems

Cluster Management Primitives

ACM SOCC 2012: Untangling Cluster Management with Helix

HELIX: Generalizing Cluster Management


STATE MACHINE: states O, S, M with transitions t1, t2, t3, t4

CONSTRAINTS: COUNT=2 (S); COUNT=1 (M); t1 ≤ 5

OBJECTIVE: minimize $\max_{n_j \in N} S(n_j)$; minimize $\max_{n_j \in N} M(n_j)$

Helix

• Declare distributed system behavior via {S, C, O}
• Enforce Partition constraints
• Fault detection and tolerance (e.g. promote S to M)
• Elasticity (e.g. Re-balance; Minimize migrations)

Used in Espresso, Search, Databus
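A minimal sketch of the {S, C, O} idea, written as plain Python rather than against the Helix API. It assumes O, S and M stand for offline, slave and master replicas, that each partition needs one M and two S replicas on distinct nodes, and it uses a simple greedy placement as a stand-in for the "minimize the maximum per-node count" objective. All function and variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical declarative spec: replica-count constraints per partition.
REPLICAS_PER_PARTITION = {"M": 1, "S": 2}


def assign(partitions, nodes):
    """Greedily place replicas so per-node counts of each state stay balanced."""
    load = {state: defaultdict(int) for state in REPLICAS_PER_PARTITION}
    placement = {}
    for p in partitions:
        used = set()                     # nodes already hosting this partition
        placement[p] = {}
        for state, count in REPLICAS_PER_PARTITION.items():
            for _ in range(count):
                candidates = [n for n in nodes if n not in used]
                # Least-loaded node for this state; name breaks ties.
                node = min(candidates, key=lambda n: (load[state][n], n))
                load[state][node] += 1
                used.add(node)
                placement[p][node] = state
    return placement


def fail_node(placement, dead):
    """Fault tolerance: if the dead node held a master, promote a slave."""
    for replicas in placement.values():
        state = replicas.pop(dead, None)
        if state == "M":
            survivor = min(n for n, s in replicas.items() if s == "S")
            replicas[survivor] = "M"
    return placement


placement = assign(range(6), ["n1", "n2", "n3"])
masters_per_node = defaultdict(int)
for replicas in placement.values():
    for node, state in replicas.items():
        if state == "M":
            masters_per_node[node] += 1
print(dict(masters_per_node))

fail_node(placement, "n1")
print(placement[0])
```

The sketch needs at least three nodes, one per replica of a partition. A real controller would also throttle transitions (the t1 ≤ 5 style constraint) and minimize data movement when re-balancing, which this greedy version does not attempt.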

LinkedIn Data Infrastructure: A few take-aways


1. Infrastructure decisions matter and are hard to transform in a hyper-growth environment.

2. Balance open-source products with home-grown platforms (**)

3. Operability, Capacity Planning and On-line Multi-tenancy are hard

4. Data Movement: Pipes and Feedback Loops are critical (**)

5. Data Model and Integration e2e are key (*)

6. Few vs Many: Balance over-specialized (agile) vs generic efforts (leverage-able) platforms (*)

7. Off-line Multi-Platform story is evolving.

Science and Infrastructure: Giving Back


Research Publications

ACM SOCC 2012 ACM RecSys 2012 SIGIR 2012 CIKM 2012 VLDB 2012 ICDE 2012 FAST 2012 NetDB 2011 …

Open Source Projects

Apache Helix (new)

ParSeq (new)

DataFu (new)

Apache Kafka

Sensei

Azkaban

Voldemort

A Recommendation Product:


People You May Know (PYMK)

Probability that you may know someone else?


[Figure: Alice is connected to both Bob and Carol. Do Bob and Carol know each other (??)]

Known as “triangle closing”
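Triangle closing can be sketched as counting common connections between a member and everyone who is two hops away. The adjacency-set representation and the function name `common_connection_counts` below are illustrative assumptions, not the production design.

```python
from collections import Counter


def common_connection_counts(graph, member):
    """For each non-connection of `member`, count shared connections."""
    direct = graph[member]
    counts = Counter()
    for friend in direct:
        for friend_of_friend in graph[friend]:
            # Skip the member and people they are already connected to.
            if friend_of_friend != member and friend_of_friend not in direct:
                counts[friend_of_friend] += 1
    return counts


# Symmetric connection graph matching the figure: Alice knows Bob and Carol.
graph = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Alice"},
    "Carol": {"Alice"},
}

print(common_connection_counts(graph, "Bob"))
```

Here Carol is the only candidate for Bob, with Alice as the single common connection that would close the triangle.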

PYMK: Science, Members and Connections


1) Feature selection is key: Common Connections, Geo, Company, Age

2) ML and data model
• Traditional ML (e.g. matrix factorization) over the O(n^2) member pairs of 175M members tends not to scale easily

3) Interplay: Data Model + ML + Parallel Computation model

4) Adding edges: Why do it?
• Creates positive-feedback social loops for members
• More useful content and activity available to members
• Denser graph improves signal strength in science-driven products
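To make the O(n^2) concern concrete, the arithmetic below counts unordered member pairs for the 175M members quoted in this deck. The 4 bytes per pair is purely an assumed figure to give a sense of storage scale.

```python
n = 175_000_000                      # members, from the deck
pairs = n * (n - 1) // 2             # unordered pairs of distinct members

bytes_per_score = 4                  # assumption: one 32-bit score per pair
petabytes = pairs * bytes_per_score / 1e15

print(f"{pairs:,} pairs, about {petabytes:.1f} PB at 4 bytes per pair")
```

Roughly 1.5 × 10^16 candidate pairs is why exhaustive pairwise scoring is impractical and candidate generation (for example, restricting to friends of friends) matters.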

Virality ↑

Value ↑

Insights ↑

Product

Science Data

Member

Signals ↑

The Feedback Loop

PYMK: Off-line Model Build



• Use generic off-line Infra (Hadoop and Pig) to build recommendations off-line.
• Very complex workflow due to extraction and selection of a large number of features. Built Azkaban for Hadoop.
• Small input and final look-up structure but large intermediate data (100’s of TB) due to MR.
• Problem (who you do not know) itself has an inherent blow-up.
• Special optimizations (e.g. Bloom Join to remove connected)
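The "Bloom Join to remove connected" optimization can be sketched in a few lines: build a compact Bloom filter over existing connections, then drop candidate pairs that hit the filter. This is a single-process illustration, not the Hadoop/Pig job; the filter size, hash count and the `pair_key` helper are assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def pair_key(a, b):
    """Order-independent key for an undirected connection."""
    return "|".join(sorted((a, b)))


connections = [("Alice", "Bob"), ("Alice", "Carol")]
candidates = [("Bob", "Carol"), ("Alice", "Bob"), ("Carol", "Alice")]

connected = BloomFilter()
for a, b in connections:
    connected.add(pair_key(a, b))

# Keep only candidate pairs that are not already connected.
survivors = [pair for pair in candidates if pair_key(*pair) not in connected]
print(survivors)
```

Since a Bloom filter has no false negatives, every already-connected pair is removed. A false positive would only discard an occasional valid candidate, which is usually an acceptable price for a filter small enough to ship to every task.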

PYMK: Off-line to Near-Line Serving



• Build serving structure on Hadoop. Scan versus Index compactness tradeoff.
• Voldemort: Partitioned k-v; Load-balancing; Pluggable storage layer; Failover.
• Bulk load for efficiency. Fast Rollback for safety. Atomic swap.
• Serving: Per-partition index in memory. PYMK blobs on disk. Retrieval ~msec.
• Decoration in App FE is more expensive.
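The bulk load, atomic swap and fast rollback pattern can be sketched with a toy versioned store. This is not the Voldemort API; `ReadOnlyStore` and its methods are hypothetical names, and snapshots live in memory rather than as index and data files on disk.

```python
class ReadOnlyStore:
    """Toy versioned read-only store: bulk load, atomic swap, fast rollback."""

    def __init__(self):
        self._versions = []      # retained snapshots, oldest first
        self._current = None     # index of the snapshot being served

    def bulk_load(self, snapshot):
        """Stage a fully built snapshot without affecting live reads."""
        self._versions.append(dict(snapshot))
        return len(self._versions) - 1

    def swap(self, version):
        """Start serving `version`; a single reference assignment."""
        self._current = version

    def rollback(self):
        """Go back to the previous snapshot, which is still retained."""
        if self._current is None or self._current == 0:
            raise RuntimeError("no earlier version to roll back to")
        self._current -= 1

    def get(self, key):
        if self._current is None:
            raise RuntimeError("no version is being served yet")
        return self._versions[self._current].get(key)


store = ReadOnlyStore()
v0 = store.bulk_load({"Bob": ["Carol"]})
store.swap(v0)

v1 = store.bulk_load({"Bob": ["Carol", "Dave"]})
print(store.get("Bob"))          # still the old snapshot until the swap
store.swap(v1)
print(store.get("Bob"))
store.rollback()                 # bad build? return to the previous one
print(store.get("Bob"))
```

Keeping the previous snapshot around is what makes rollback cheap: nothing is rebuilt, only the pointer to the served version moves.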

PYMK: Science and Feedback Loop



• Response vs Latency: Fast refresh helps user experience (e.g. showing connections of very recent connections). “Social” phenomenon.
• Very agile feature: Lots of on-line A/B testing and tweaking of features
• Huge Impact: > 50% of accepted invites are created by PYMK
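On-line A/B testing needs a stable way to split members between variants. One common approach, shown here as an assumption rather than LinkedIn's actual mechanism, is to hash the member id together with an experiment name; `assign_variant` and the experiment name are hypothetical.

```python
import hashlib


def assign_variant(member_id, experiment, variants=("control", "treatment")):
    """Deterministically map a member to a variant for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# The same member always lands in the same bucket for a given experiment.
print(assign_variant(42, "pymk-feature-tweak"))
print(assign_variant(42, "pymk-feature-tweak"))
```

Salting the hash with the experiment name keeps assignments independent across experiments, so a member in the treatment group of one test is not systematically in the treatment group of the next.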

PYMK: Tying It All Together


P(B knows C) ∝ large number of features
• Distance
• Common connections
• Organizational Overlap
• Age

Bob

Alice

Carol

Dave Eve

Offline Model

Near-Line Serving

Offline

Near-Line

User Interactions

PYMK Application
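One way to turn the features listed on this slide into a probability is a logistic model. The slide does not say which model is used, so logistic regression here is an assumption, as are the feature encodings (for instance treating "Age" as an age gap) and the hand-picked weights, which a real system would learn from accepted and ignored invitations.

```python
import math

# Hypothetical weights for illustration only; not learned from data.
WEIGHTS = {
    "common_connections": 0.6,
    "organizational_overlap": 1.2,
    "distance": -0.8,
    "age_gap": -0.05,
}
BIAS = -2.0


def score(features):
    """Logistic score in (0, 1) for 'B knows C' given a feature dict."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank(candidates):
    """Order candidate names by descending score."""
    return sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)


candidates_for_bob = {
    "Carol": {"common_connections": 5, "organizational_overlap": 1,
              "distance": 2, "age_gap": 3},
    "Eve": {"common_connections": 0, "organizational_overlap": 0,
            "distance": 3, "age_gap": 20},
}

print(rank(candidates_for_bob))
```

In the offline/near-line split described above, scores like these would be computed in the offline model build, and only the ranked lists would be loaded into the serving store.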

LinkedIn + Yale


Students
• What is my career path?
• How can I prepare?
• How do I get my first internship and first job?

• Where did my students go after they left the university?
• How is my school seeding the various industries with the best talent?
• How does my school compare with other institutions?

Students: Transformation of Careers

Yale: Get a data-driven view. Uncover opportunities. Wins based on data and insights.

Thank you colleagues for the beautiful slides!


David Henke SVP Operations

Amy Tang Sr. Program Manager

Sam Shah Principal Engineer

Shirshanka Das Principal Engineer

Kapil Surlaker Principal Engineer

Anmol Bhasin Sr. Engineering Manager

Daniel Tunkelang Principal Data Scientist

Summary


Read more @ data.linkedin.com

1. E2E: The Big-Data feedback loop of social-network product design is cool
2. Infrastructure
   1. Data Infrastructure needs continuous innovation and iteration to keep pace with scale and cost.
   2. Fast moving, Big, Clean Data + Agile Metadata = Goodness
   3. Data-driven products need agile feedback infrastructure and measurement methodology.
3. Methodology
   1. Data-Driven experimentation enables insights and agile products
   2. Recommendation-driven products have big impact.

Help us. Come Have Fun with Us!


Info: data.linkedin.com

1. Science and Data Mining: Recommendation and Optimization Problems
2. Next-generation ad-hoc and OLAP query processing on Hadoop
3. Graph Computations: Off-line mining and On-line integration loops
4. nRT Data Streams in Near-line infrastructure
5. And much more…

In Closing


[email protected]

Thank You!
