A Small Overview of Big Data Products, Analytics, and Infrastructure at LinkedIn
Bhaskar Ghosh, Senior Director of Engineering, Data Infrastructure
LinkedIn Confidential ©2013 All Rights Reserved
Big Data Science: A Symposium in Honor of Martin Schultz, Yale University, 26 Oct 2012
Outline
1. Martin and Me
2. Company and Mission
3. Products and Science
4. Data Infrastructure
5. P, S, DI: People You May Know
6. LinkedIn + Yale
7. Conclusion
Martin and Me
Thank you Martin! Best mentor. Versatility, big-picture thinking and leadership.
• Yale CS Ph.D. 1995 (Parallel Algorithms)
• 12y @ Informix & Oracle building parallel database systems
• 4y @ Yahoo! building Ads systems & leading the Display Ads Exchange organization
• 2y+ @ LinkedIn building & leading the Data Infrastructure Engineering Organization
The World’s Largest Professional Network
• 175M+ Members Worldwide
• 2 new Members Per Second
• 100M+ Monthly Unique Visitors
• 2M+ Company Pages
Connecting Talent and Opportunity. At scale…
…and a bunch of Data-Driven Products
Pandora Search for People
Events You May Be Interested In
Groups browse maps
The LinkedIn Mission. Connect the world’s professionals to make them more productive and successful
LinkedIn Product Philosophy
Goals
• Provide a uniquely personalized experience to members (professionals)
• Build an ecosystem to balance the interests of members and partners (companies)
Approach
• Launch Often and Early
• Data-Driven: Experiment and Test
• Fail Fast
• Prepare for Virality and Scale
Two Product Families
For Members (Professionals)
• People You May Know
• Who’s Viewed My Profile
• Jobs You May Be Interested In
• News/Sharing
• Today
• Search
• Subscriptions
For Partners (Companies)
• Hire
• Market
• Sell
[Diagram layers: Data (Connections, Profiles, Actions, Content); Science and Analytics; Data Infrastructure]
The Big-Data Feedback Loop
[Diagram: a loop connecting Member, Product, Data, Science and Infrastructure. Labels on the loop: Engagement ↑, Virality ↑, Signals ↑, Scale ↑, Analytics ↑, Insights ↑, Refinement ↑, Value ↑]
Member-Facing Products: Diversity at Scale
Identity and Engagement
  Products: 1. Profile and Connections 2. Activity Streams 3. Messages (email) 4. Endorsements & Skills
  Science: Blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings)
Search and Analysis
  Products: 1. People Search 2. Group Search 3. Who Viewed My Profile
Recommendations
  Products: 1. People You May Know 2. Jobs You May Be Interested In 3. Events You May Be Interested In
  Science: Entity disambiguation and matching
Monetization
  Products: 1. Subscription Packages 2. Sponsored Content
  Science: Response Prediction, Inventory Forecasting
Recommendations Are Effective … and Drive
• > 50% of connections
• > 50% of job applications
• > 50% of group joins
Guiding Principle: Find data that is useful for Members
• Provide Relevant Content
• Establish Social Connections
• In Appropriate Context
LinkedIn Recommendation Engine
Recommendation Types
• Behavior Analysis
• Collaborative Filtering
• Popularity
Recommendation Entities and Products
• People: Similar Profiles, Referral Center, TalentMatch, People Browse Map
• Jobs: Jobs Browse Map, Similar Jobs, Jobs You May Be Interested In
• Groups: GYML, Groups Browse Map, Similar Groups
• … Ads, Companies, Searches, News, Events … and more
Shared, Dynamic, Unified Core Service
• API
• User Feedback
• A/B
• (R-T) Feature Extraction, Entity Resolution & Enrichment
• (R-T) matching computations
• Offline data munging (Hadoop)
Member-Facing Products: Diversity at Scale
Identity and Engagement
  Products: 1. Profile and Connections 2. Activity Streams 3. Messages (email) 4. Endorsements & Skills
  Science: Blending and ranking of heterogeneous content (e.g. Network Updates, Group Discussions, Job Postings)
  Data Infra: • Scale • Full text and secondary indexes • Real-time
Search and Analysis
  Products: 1. People Search 2. Group Search 3. Who Viewed My Profile
  Data Infra: • Faceted search • Near RT index freshness • Drill-down exploration
Recommendations
  Products: 1. People You May Know 2. Jobs You May Be Interested In 3. Events You May Be Interested In
  Science: Entity disambiguation and matching
  Data Infra: • Graph analysis • Content serving • Real-time tuning
Monetization
  Products: 1. Subscription Packages 2. Sponsored Content
  Science: Response prediction
LinkedIn Data Infrastructure: Three-Phase Abstraction
[Diagram: Users, Application, and three infrastructure phases: Online Data Infra, Near-Line Infra, Offline Data Infra]
Online
  Latency & Freshness Requirements: Activity that should be reflected immediately
  Products: • Member Profiles • Company Profiles • Connections • Messages • Endorsements • Skills
Near-Line
  Latency & Freshness Requirements: Activity that should be reflected soon
  Products: • Activity Streams • Profile Standardization • News • Recommendations • Search • Messages
Offline
  Latency & Freshness Requirements: Activity that can be reflected later
  Products: • People You May Know • Connection Strength • News • Recommendations • Next best idea…
LinkedIn Data Infrastructure: Sample Stack
• Infra challenges in the 3-phase ecosystem are diverse, complex and specific
• Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms
LinkedIn Data Infrastructure: Data Stores
Systems: Voldemort
Capabilities
• Transactions
• Rich structures (e.g. indexes)
• Change capture capability
• Key value / document storage
References: ICDE 2012 (Data Infra Overview); FAST 2012 (Voldemort for Serving)
LinkedIn Data Infrastructure: Specialized Indexes
Systems and Capabilities
• Zoie, Bobo, Sensei: Search platform
• GraphDB: Distributed graph engine
LinkedIn Data Infrastructure: Pipelines
Systems and Capabilities
• Kafka: Messaging for site events, monitoring. High throughput.
• Databus: Change data capture stream. Reliable, consistent, low latency pipe.
References: ACM SOCC 2012: “Databus”; IEEE Data Eng. Bulletin 2012: “Kafka”
LinkedIn Data Infrastructure: Off-line Analysis
Capabilities
• ML, Ranking, Relevance
• Insights and Analytics
• ETL, Metadata and Pipes
• Business Source of Truth
LinkedIn Data Infrastructure: Cluster Management
Capabilities
• Generic framework for building distributed systems
• Cluster Management Primitives
Reference: ACM SOCC 2012: “Untangling Cluster Management with Helix”
HELIX: Generalizing Cluster Management
[Diagram: Helix describes a system by its State Machine, Constraints and Objective]
• State machine: states S (Slave), M (Master), O (Offline) with transitions t1, t2, t3, t4
• Constraints: COUNT=2, COUNT=1, t1 ≤ 5
• Objective: minimize $\max_{n_j \in N} S(n_j)$ and minimize $\max_{n_j \in N} M(n_j)$
Helix
• Declare distributed system behavior via {S, C, O}
• Enforce partition constraints
• Fault detection and tolerance (e.g. promote S to M)
• Elasticity (e.g. re-balance; minimize migrations)
Used in Espresso, Search, Databus
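The {S, C, O} idea above can be illustrated with a toy placement routine. This is only a sketch under stated assumptions, not Helix code or its API: the function names `assign` and `fail_node`, the greedy least-loaded heuristic, and the "one MASTER plus two SLAVE replicas per partition" constraint are all chosen here for illustration.

```python
from collections import defaultdict

def assign(partitions, nodes, slaves_per_partition=2):
    """Greedy placement: one MASTER and N SLAVE replicas per partition,
    each on a distinct node, always choosing the least-loaded node.
    This approximates 'minimize the maximum replicas per node'."""
    if len(nodes) < 1 + slaves_per_partition:
        raise ValueError("need a distinct node for every replica")
    masters = defaultdict(int)  # node -> number of MASTER replicas
    slaves = defaultdict(int)   # node -> number of SLAVE replicas
    placement = {}
    for p in partitions:
        master = min(nodes, key=lambda n: (masters[n], slaves[n]))
        masters[master] += 1
        chosen = []
        for _ in range(slaves_per_partition):
            candidates = [n for n in nodes if n != master and n not in chosen]
            s = min(candidates, key=lambda n: (slaves[n], masters[n]))
            slaves[s] += 1
            chosen.append(s)
        placement[p] = {"MASTER": master, "SLAVE": chosen}
    return placement

def fail_node(placement, dead):
    """Fault tolerance: drop replicas on the dead node and promote a
    surviving SLAVE to MASTER where the MASTER was lost."""
    for replicas in placement.values():
        replicas["SLAVE"] = [n for n in replicas["SLAVE"] if n != dead]
        if replicas["MASTER"] == dead:
            replicas["MASTER"] = replicas["SLAVE"].pop(0) if replicas["SLAVE"] else None
    return placement
```

A real cluster manager would also re-create the lost replicas elsewhere and limit how many transitions run at once (the t1 ≤ 5 style of constraint); both are left out of this sketch.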
LinkedIn Data Infrastructure: A few take-aways
1. Infrastructure decisions matter and are hard to transform in a hyper-growth environment.
2. Balance open-source products with home-grown platforms (**)
3. Operability, Capacity Planning and On-line Multi-tenancy are hard
4. Data Movement: Pipes and Feedback Loops are critical (**)
5. Data Model and Integration e2e are key (*)
6. Few vs Many: Balance over-specialized (agile) efforts vs generic (leverage-able) platforms (*)
7. Off-line Multi-Platform story is evolving.
Science and Infrastructure: Giving Back
Research Publications
ACM SOCC 2012, ACM RecSys 2012, SIGIR 2012, CIKM 2012, VLDB 2012, ICDE 2012, FAST 2012, NetDB 2011, …
Open Source Projects
• Apache Helix (new)
• ParSeq (new)
• DataFu (new)
• Apache Kafka
• Sensei
• Azkaban
• Voldemort
A Recommendation Product: People You May Know (PYMK)
Probability that you may know someone else?
[Diagram: Alice is connected to both Bob and Carol. Does Bob know Carol (??)]
Known as “triangle closing”
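Triangle closing can be sketched as counting common connections between a member and the people two hops away. The snippet below is a minimal in-memory illustration, not the production PYMK pipeline; the function name `pymk_candidates` and the tiny adjacency-set graph are assumptions made for the example.

```python
from collections import Counter

def pymk_candidates(graph, member, top_k=3):
    """Rank people the member is not yet connected to by the number of
    connections they share (each shared connection is an open triangle)."""
    direct = graph.get(member, set())
    counts = Counter()
    for friend in direct:
        for friend_of_friend in graph.get(friend, set()):
            if friend_of_friend != member and friend_of_friend not in direct:
                counts[friend_of_friend] += 1
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]

# Toy graph matching the slide: Alice knows Bob and Carol.
graph = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Alice"},
    "Carol": {"Alice"},
}
print(pymk_candidates(graph, "Bob"))
```

Here Carol is suggested to Bob because they share Alice; the common-connection count is only one of the features the deck mentions.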
PYMK: Science, Members and Connections
1) Feature selection is key
• Common Connections • Geo • Company • Age
2) ML and data model
• Traditional ML (e.g. matrix factorization) over the O(n^2) pairs of 175M members tends not to scale easily
3) Interplay: Data Model + ML + Parallel Computation model
4) Adding edges: Why do it?
• Creates positive-feedback social loops for members
• More useful content and activity available to members
• Denser graph improves signal strength in science-driven products
[Diagram: The Feedback Loop, connecting Member, Product, Data and Science. Labels: Virality ↑, Signals ↑, Insights ↑, Value ↑]
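To make the O(n^2) concern concrete, here is a back-of-the-envelope count of unordered member pairs, assuming n = 175M members as quoted in the deck:

$$\binom{n}{2} = \frac{n(n-1)}{2} \approx \frac{(1.75 \times 10^{8})^{2}}{2} \approx 1.5 \times 10^{16}$$

Any approach that materializes a score for every pair would have to handle on the order of 10^16 entries, which is why candidate generation and feature selection matter so much here.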
PYMK: Off-line Model Build
• Use generic off-line infra (Hadoop and Pig) to build recommendations off-line.
• Very complex workflow due to extraction and selection of a large number of features. Built Azkaban for Hadoop.
• Small input and final look-up structure, but large intermediate data (100s of TB) due to MR.
• The problem (who you do not know) itself has an inherent blow-up.
• Special optimizations (e.g. Bloom Join to remove connected)
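The deck only names the Bloom Join optimization, so the following is a hedged, single-process sketch of the general idea rather than the actual Hadoop job: a compact Bloom filter built from existing connections lets most candidate pairs skip the expensive exact check. The class `BloomFilter`, the helpers `pair_key` and `remove_connected`, and the filter sizes are all illustrative assumptions.

```python
import hashlib

class BloomFilter:
    """Small Bloom filter: no false negatives, some false positives."""
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def pair_key(a, b):
    """Order-independent key for an undirected connection."""
    return "|".join(sorted((a, b)))

def remove_connected(candidates, connections):
    """Drop candidate pairs that are already connected."""
    bloom = BloomFilter()
    exact = set()  # stands in for the expensive join side in a real job
    for a, b in connections:
        key = pair_key(a, b)
        bloom.add(key)
        exact.add(key)
    kept = []
    for a, b in candidates:
        key = pair_key(a, b)
        # "Definitely not connected" is answered by the filter alone;
        # only possible hits pay for the exact lookup.
        if not bloom.might_contain(key) or key not in exact:
            kept.append((a, b))
    return kept
```

Because possible hits are re-checked exactly, false positives in the filter cost extra work but never remove a valid candidate.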
PYMK: Off-line to Near-Line Serving
• Build serving structure on Hadoop. Scan versus Index compactness tradeoff.
• Voldemort: Partitioned k-v; Load-balancing; Pluggable storage layer; Failover.
• Bulk load for efficiency. Fast Rollback for safety. Atomic swap.
• Serving: Per-partition index in memory. PYMK blobs on disk. Retrieval ~msec. Decoration in App FE is more expensive.
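The bulk load, atomic swap and fast rollback pattern can be sketched with a versioned read-only store. This is a toy in-memory model written for illustration, not Voldemort's implementation or API; the class name `ReadOnlyStore` and its methods are hypothetical.

```python
import threading

class ReadOnlyStore:
    """Serves one immutable snapshot at a time and keeps the previous
    snapshot around so a bad load can be rolled back quickly."""
    def __init__(self, max_versions=2):
        self._versions = []        # oldest .. newest fully loaded snapshots
        self._current = {}         # snapshot currently visible to readers
        self._lock = threading.Lock()
        self._max_versions = max_versions

    def bulk_load_and_swap(self, snapshot):
        # Build the new version completely before it becomes visible.
        new_version = dict(snapshot)
        with self._lock:
            self._versions.append(new_version)
            self._versions = self._versions[-self._max_versions:]
            # The swap is a single reference assignment, so readers see
            # either the old snapshot or the new one, never a mix.
            self._current = new_version

    def rollback(self):
        with self._lock:
            if len(self._versions) < 2:
                raise RuntimeError("no previous version to roll back to")
            self._versions.pop()
            self._current = self._versions[-1]

    def get(self, key, default=None):
        return self._current.get(key, default)

store = ReadOnlyStore()
store.bulk_load_and_swap({"bob": ["carol"]})
store.bulk_load_and_swap({"bob": ["dave"]})
store.rollback()  # the first snapshot is served again
```

Keeping the previous version on hand is what makes rollback cheap: nothing is rebuilt, only the visible reference changes.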
PYMK: Science and Feedback Loop
• Response vs Latency: Fast refresh helps user experience (e.g. showing connections of very recent connections). “Social” phenomenon.
• Very agile feature: Lots of on-line A/B testing and tweaking of features
• Huge Impact: > 50% of accepted invites are created by PYMK
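The deck does not describe how its A/B tests assign members, so the snippet below is only a common pattern shown under that caveat: hashing the member id together with an experiment name gives a stable, roughly uniform split. The function name `ab_bucket` and the experiment label are hypothetical.

```python
import hashlib

def ab_bucket(member_id, experiment, treatment_fraction=0.5):
    """Deterministically assign a member to 'treatment' or 'control'.
    The same member always lands in the same bucket for a given
    experiment, and different experiments split independently."""
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode()).digest()
    # Map the first 4 bytes to a fraction in [0, 1).
    fraction = int.from_bytes(digest[:4], "big") / 2**32
    return "treatment" if fraction < treatment_fraction else "control"

print(ab_bucket(12345, "pymk-new-feature"))
```

A stable assignment means a member sees a consistent variant across visits, which keeps response measurements comparable over time.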
PYMK: Tying It All Together
P(B knows C) ∝ large number of features
• Distance
• Common connections
• Organizational Overlap
• Age
[Diagram: member graph of Bob, Alice, Carol, Dave and Eve; pipeline stages labelled Offline Model, Near-Line Serving, PYMK Application and User Interactions, spanning the Offline and Near-Line phases]
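One simple way to turn such features into a probability-like score is a logistic model. The deck does not say which model is used, so the logistic form, the feature names and every weight below are assumptions made purely for illustration.

```python
import math

# Hypothetical weights, chosen by hand for this example only.
WEIGHTS = {
    "common_connections": 0.8,   # more shared connections -> more likely
    "org_overlap_years": 0.5,    # time spent at the same organization
    "graph_distance": -1.0,      # farther apart in the graph -> less likely
    "age_gap_years": -0.05,      # larger age gap -> slightly less likely
}
BIAS = -2.0

def score(features):
    """Logistic score in (0, 1) from a dict of feature values.
    Missing features default to 0."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

weak = score({"common_connections": 1, "graph_distance": 2})
strong = score({"common_connections": 6, "org_overlap_years": 2, "graph_distance": 2})
print(weak < strong)
```

In practice the weights would be learned offline from accepted and ignored suggestions, which is where the feedback loop from user interactions comes in.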
LinkedIn + Yale
Students
• What is my career path?
• How can I prepare?
• How do I get my first internship and first job?
Yale
• Where did my students go after they left the university?
• How is my school seeding the various industries with the best talent?
• How does my school compare with other institutions?
Wins based on data and insights
• Students: Transformation of Careers
• Yale: Get a data-driven view. Uncover opportunities.
Thank you colleagues for the beautiful slides!
• David Henke, SVP Operations
• Amy Tang, Sr. Program Manager
• Sam Shah, Principal Engineer
• Shirshanka Das, Principal Engineer
• Kapil Surlaker, Principal Engineer
• Anmol Bhasin, Sr. Engineering Manager
• Daniel Tunkelang, Principal Data Scientist
Summary
1. E2E: The Big-Data feedback loop of social-network product design is cool
2. Infrastructure
   1. Data Infrastructure needs continuous innovation and iteration to keep pace with scale and cost.
   2. Fast moving, Big, Clean Data + Agile Metadata = Goodness
   3. Data-driven products need agile feedback infrastructure and measurement methodology.
3. Methodology
   1. Data-Driven experimentation enables insights and agile products
   2. Recommendation-driven products have big impact.
Read more @ data.linkedin.com
Help us. Come Have Fun with Us!
Info: data.linkedin.com
1. Science and Data Mining: Recommendation and Optimization Problems
2. Next-generation ad-hoc and OLAP query processing on Hadoop
3. Graph Computations: Off-line mining and On-line integration loops
4. nRT Data Streams in Near-line infrastructure
5. And much more…