The "Big Data" Ecosystem at LinkedInSIGMOD 2013Roshan Sumbaly, Jay Kreps, & Sam ShahJune 2013
©2012 LinkedIn Corporation. All Rights Reserved.
2
LinkedIn: the professional profile of record
225MMembers 225M MemberProfiles
1 2
3
Applications
4
Application examples
People You May Know (2 people) Year In Review Email (1 person, 1 month) Skills and Endorsements (2 people) Network Updates Digest (1 person, 3
months) Who’s Viewed My Profile (2 people) Collaborative Filtering (1 person) Related Searches (1 person, 3 months) and more…
5
Skill sets
©2013 LinkedIn Corporation. All Rights Reserved. 6
Rich Hadoop-based ecosystem
©2013 LinkedIn Corporation. All Rights Reserved. 7
“Last mile” problems
Ingress– Moving data from online to offline system
Workflow management– Managing offline processes
Egress– Moving results from offline to online
systems Key/Value Streams OLAP
8
Application examples
People You May Know (2 people) Year In Review Email (1 person, 1 month) Skills and Endorsements (2 people) Network Updates Digest (1 person, 3
months) Who’s Viewed My Profile (2 people) Collaborative Filtering (1 person) Related Searches (1 person, 3 months) and more…
9
People You May Know
10
People You May Know – Workflow
Perform triangle closing for all
members
Triangle closing
Rank by discounting previously shown recommendations
Push recommendations to
online service
Connection stream
Impression stream
©2013 LinkedIn Corporation. All Rights Reserved. 11
“Last mile” problems
Ingress– Moving data from online to offline system
Workflow management– Managing offline processes
Egress– Moving results from offline to online
systems Key/Value Streams OLAP
©2013 LinkedIn Corporation. All Rights Reserved. 12
Ingress - O(n2) data integration complexity
Point to point Fragile, delayed and potentially lossy Non-standardized
©2013 LinkedIn Corporation. All Rights Reserved. 13
Ingress - O(n) data integration
14
Ingress – Kafka
Distributed and elastic– Multi-broker system
Categorized topics– “PeopleYouMayKnowTopic”– “ConnectionUpdateTopic”
15
Ingress
Standardized schemas– Avro– Central repository– Programmatic compatibility
Audited ETL to Hadoop
©2013 LinkedIn Corporation. All Rights Reserved. 16
“Last mile” problems
Ingress– Moving data from online to offline system
Workflow management– Managing offline processes
Egress– Moving results form offline to online
systems Key/Value Streams OLAP
17
People You May Know – Workflow
Perform triangle closing for all
members
Rank by discounting previously shown recommendations
Push recommendations to
online service
Connection stream
Impression stream
18
People You May Know – Workflow (in reality)
19
Workflow Management - Azkaban
Dependency management– Historical logs
Diverse job types– Pig, Hive, Java
Scheduling Monitoring Visualization Configuration Retry/restart on failure Resource locking
20
People You May Know – Workflow
Perform triangle closing for all
members
Rank by discounting previously shown recommendations
Push recommendations to
online service
Connection stream
Impression stream
Member Id 1213 => [ Recommended member id 1734,
Recommended member id 1523 … Recommended member id 6332 ]
©2013 LinkedIn Corporation. All Rights Reserved. 21
“Last mile” problems
Ingress– Moving data from online to offline system
Workflow management– Managing offline processes
Egress– Moving results from offline to online
systems Key/Value Streams OLAP
22
Egress – Key/Value
Voldemort– Based on Amazon’s Dynamo
Distributed and Elastic Horizontally scalable Bulk load pipeline from Hadoop Simple to usestore results into ‘url’ using KeyValue(‘member_id’)
23
People You May Know - Summary
24
Application examples
People You May Know (2 people) Year In Review Email (1 person, 1 month) Skills and Endorsements (2 people) Network Updates Digest (1 person, 3
months) Who’s Viewed My Profile (2 people) Collaborative Filtering (1 person) Related Searches (1 person, 3 months) and more…
25
Year In Review Email
26
Year In Review EmailmemberPosition = LOAD '$latest_positions' USING BinaryJSON;memberWithPositionsChangedLastYear = FOREACH ( FILTER memberPosition BY ((start_date >= $start_date_low ) AND (start_date <= $start_date_high))) GENERATE member_id, start_date, end_date;
allConnections = LOAD '$latest_bidirectional_connections' USING BinaryJSON;
allConnectionsWithChange_nondistinct = FOREACH ( JOIN memberWithPositionsChangedLastYear BY member_id, allConnections BY dest ) GENERATE allConnections::source AS source, allConnections::dest AS dest;
allConnectionsWithChange = DISTINCT allConnectionsWithChange_nondistinct;
memberinfowpics = LOAD '$latest_memberinfowpics' USING BinaryJSON;pictures = FOREACH ( FILTER memberinfowpics BY ((cropped_picture_id is not null) AND ( (member_picture_privacy == 'N') OR (member_picture_privacy == 'E'))) ) GENERATE member_id, cropped_picture_id, first_name as dest_first_name, last_name as dest_last_name;
resultPic = JOIN allConnectionsWithChange BY dest, pictures BY member_id;connectionsWithChangeWithPic = FOREACH resultPic GENERATE allConnectionsWithChange::source AS source_id, allConnectionsWithChange::dest AS member_id, pictures::cropped_picture_id AS pic_id, pictures::dest_first_name AS dest_first_name, pictures::dest_last_name AS dest_last_name;
joinResult = JOIN connectionsWithChangeWithPic BY source_id, memberinfowpics BY member_id; withName = FOREACH joinResult GENERATE
connectionsWithChangeWithPic::source_id AS source_id, connectionsWithChangeWithPic::member_id AS member_id, connectionsWithChangeWithPic::dest_first_name as first_name, connectionsWithChangeWithPic::dest_last_name as last_name, connectionsWithChangeWithPic::pic_id AS pic_id, memberinfowpics::first_name AS firstName, memberinfowpics::last_name AS lastName, memberinfowpics::gmt_offset as gmt_offset, memberinfowpics::email_locale as email_locale, memberinfowpics::email_address as email_address;
resultGroup = GROUP withName BY (source_id, firstName, lastName, email_address, email_locale, gmt_offset);
-- Get the count of results per recipientresultGroupCount = FOREACH resultGroup GENERATE group , withName as toomany, COUNT_STAR(withName) as num_results;resultGroupPre = filter resultGroupCount by num_results > 2;resultGroup = FOREACH resultGroupPre { withName = LIMIT toomany 64; GENERATE group, withName, num_results;}
x_in_review_pre_out = FOREACH resultGroup GENERATE FLATTEN(group) as (source_id, firstName, lastName, email_address, email_locale, gmt_offset), withName.(member_id, pic_id, first_name, last_name) as jobChanger, '2013' as changeYear:chararray, num_results as num_results;
x_in_review = FOREACH x_in_review_pre_out GENERATE source_id as recipientID, gmt_offset as gmtOffset, firstName as first_name, lastName as last_name, email_address, email_locale, TOTUPLE( changeYear, source_id,firstName, lastName, num_results,jobChanger) as body;
rmf $xir;STORE x_in_review INTO '$url' USING Kafka();
27
Year In Review Email – Workflow
Find users that have changed jobs
Join with connections and metadata
(pictures)
Group by connections of these users
Push content to email service
©2013 LinkedIn Corporation. All Rights Reserved. 28
“Last mile” problems
Ingress– Moving data from online to offline system
Workflow management– Managing offline processes
Egress– Moving results from offline to online
systems Key/Value Streams OLAP
29
Egress - Streams
Service acts as consumer “EmailContentTopic”store emails into ‘url’ using Stream(“topic=x“)
30
Conclusion
Hadoop: simple programmatic model, rich developer ecosystem
Primitives for – Ingress:
Structured, complete data available Automatically handles data evolution
– Workflow management Run and operate production processes
– Egress 1-line command for data for exporting data Horizontally scalable, little need for capacity planning
Empowers data scientists to focus on new product ideas, not infrastructure
Future work: models of computation
• Alternating Direction Method of Multipliers (ADMM)
• Distributed Conjugate Gradient Descent (DCGD)• Distributed L-BFGS• Bayesian Distributed Learning (BDL)
Graphs
Distributed learning
Near-line processing
32
data.linkedin.com
Top Related