Facebook Retrospective - Big data-world-europe-2012
joydeep-sen-sarma -
Data Infrastructure at Facebook
A retrospective
Joydeep Sen Sarma
Ex-Facebook DI Lead, Founder Qubole
Intro
• File/Database Systems developer (ex- Netapp/Oracle)
• Yahoo (2005-07), Facebook (2007-11)
• @Facebook:
– SysAdmin: operated massive Hadoop/Hive installs
– Architect: conceived/wrote Apache Hive, made Hbase@FB happen
– Herded cats: first manager of Data Infra team
– IT engineer/DBA: built ETL tools, warehouse/reporting for FB Virtual Currency
– Vested my stock options!
• Founder Qubole Inc. (2011-)
What not to do: Yahoo
• Want to add ‘feed’ in warehouse? Fill form, schmooze PM, wait 2 months.
• Want to justify project? Take $100M, double count 5 times.
• Hard to find out what data exists in the company. Silos everywhere.
• Lots of grand architecture, but no progress
Goals going in
• Universal ability to log data and compute against it
• Build infrastructure for data processing
– Help people help themselves
– Get out of the way
• Done is better than perfect, Move Fast.
– Iterate, Fix Failures Fast, Do everything twice
State of the Union
• Sep, 2007:
– Use Case: compute relationship strength between friends
– Data Sets: user graph, interaction and page-view logs
– ~10TB cluster
…
• July, 2011:
– Ads reporting/data-mining, News Feed ranking, Spam classification, PYMK, Search Indexing, Entitization, Sentiment Analysis, Fraud Analysis ..
– ~10k queries a day, hundreds of users, scores of them concurrent
– 50PB cluster, 15 engineers/ops in total
User Feedback
• Ex-Yahoo Senior Director, Ads Product Mgmt.: "I haven't done SQL for ages - but I can use this stuff easily"
• Ex-Yahoo Data Scientist: "This is so amazing. That all data is stored in one place and I can get access instantly without having to wait months and contact multiple groups/silos"
• Ex-Paypal Fraud Analyst: "So much better data and infrastructure than I have ever had in the past"
Key Highways
• Hive
– Centrally managed Hadoop service, no setup
– SQL is easy, add scripts for map-reduce
– Browser based query wizards for SQL dummies
  • Download results to Excel
  • Schedule queries periodically with a few clicks
• Scribe
– Just log data using Scribe from any application
– Dead simple to add attributes to user page views
– Easy to pull data from RDBMS
Key Highways
• Simple Workflow authoring system (Databee)
• Reporting is easy
– Provision MySQL Data-marts in hours
– Easy self-service charting/dashboarding software
• Data Explorer
– Wiki like system for documenting tables, columns, types
– Keyword Search, find table authors, users
– Help people help people
Democracies – Ugh!
“Democracy may not be the perfect … but it is better than the alternatives.”
“The family that poops together stays together”
Maintaining Order
• Hadoop Fair Scheduler
– Guarantee resources to projects/users. Share excess capacity
• Multiple Compute tiers
– Production, Large Ad-hoc, Small Ad-hoc, Local-mode queries
• Kill the bad guys
– Code to hunt down bad queries/apps
– Track cpu/disk usage – go after biggies
• Ban assault rifles
– Basic ACLs – can’t delete important tables, directories
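The Fair Scheduler bullet above ("guarantee resources, share excess capacity") can be illustrated with a toy allocation routine. This is only a minimal sketch of the idea, not the actual Hadoop Fair Scheduler algorithm: the function name `fair_allocate`, the pool names, and the one-slot-at-a-time sharing rule are all assumptions made for illustration.

```python
def fair_allocate(total_slots, pools):
    """Toy fair-share allocation (illustrative only, not Hadoop's code).

    pools maps a pool name to (min_share, demand), both in slots.
    Returns a dict mapping pool name to allocated slots.
    """
    # Step 1: every pool receives its guaranteed minimum, capped by its demand.
    alloc = {name: min(min_share, demand)
             for name, (min_share, demand) in pools.items()}
    remaining = total_slots - sum(alloc.values())
    if remaining < 0:
        raise ValueError("guarantees exceed cluster capacity")

    # Step 2: share the excess one slot at a time, always giving the next
    # slot to the still-hungry pool that currently holds the fewest slots.
    while remaining > 0:
        hungry = [n for n, (_, demand) in pools.items() if alloc[n] < demand]
        if not hungry:
            break  # all demand is satisfied; leftover slots stay idle
        neediest = min(hungry, key=lambda n: (alloc[n], n))
        alloc[neediest] += 1
        remaining -= 1
    return alloc


if __name__ == "__main__":
    # Hypothetical pools: (guaranteed slots, slots currently wanted).
    pools = {"ads": (40, 80), "feed": (20, 20), "adhoc": (0, 50)}
    print(fair_allocate(100, pools))
```

In this sketch a pool with no guarantee (like the hypothetical "adhoc" pool) still gets work done whenever other pools leave capacity unused, which is the behaviour the slide describes.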
Why did we succeed?
DATA
DATA
All Hail Data Consolidation
(9pm, FB Hack Night)
Ads Engineering Director: “Hey Joy, I want to join user fb-currency purchases with friend request data to test a thesis – pointers?”
Hadoop
• Cheap
– Can consolidate everything.
– We made it cheaper (RCFile, HDFS-RAID)
• Reduces governance cost
– Only worry about really really large stuff.
– Fewer data replication processes to manage
• Separates compute from storage
– Most legacy vendors don’t get this
• Disk Based analytic systems degrade gracefully
– No tipping point (vs. in-memory only)
– Ability to catch up, go back into the past (vs. real-time stream processing only)
Things we missed
• SLOOOOOOW
– Extensive work on FB Hadoop repo for faster scheduling
– Make testing faster (approx. queries)
– Watch @Qubole
• SQL as rope
– Need higher level templates. Don’t need 10 versions of a 30-day moving average calculator
• Duplication of queries/jobs
– How to discover if there’s existing summaries?
– People help people, but still ..
• Didn’t build enough APIs
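As one illustration of the "higher level templates" point above, a single parameterised helper could stand in for many hand-written moving-average queries. This is a hedged sketch, not anything built at Facebook: the function `moving_average_sql`, the table and column names, the assumption of one row per day, and the use of standard SQL window functions are all assumptions for the example.

```python
import re

# Accept only plain identifiers so the template cannot be used to inject SQL.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def moving_average_sql(table, date_col, value_col, window_days=30):
    """Build a trailing N-day moving-average query (illustrative template).

    Assumes the table holds exactly one row per day and that the target
    engine supports standard SQL window functions.
    """
    for ident in (table, date_col, value_col):
        if not _IDENT.match(ident):
            raise ValueError(f"not a plain identifier: {ident!r}")
    if window_days < 1:
        raise ValueError("window_days must be >= 1")
    preceding = window_days - 1  # current row plus N-1 earlier rows
    return (
        f"SELECT {date_col}, {value_col},\n"
        f"       AVG({value_col}) OVER (\n"
        f"           ORDER BY {date_col}\n"
        f"           ROWS BETWEEN {preceding} PRECEDING AND CURRENT ROW\n"
        f"       ) AS {value_col}_ma{window_days}\n"
        f"FROM {table}"
    )


if __name__ == "__main__":
    # Hypothetical table and columns, for demonstration only.
    print(moving_average_sql("daily_revenue", "ds", "revenue"))
```

The same call with `window_days=7` yields a weekly average, so one template replaces a family of near-identical queries.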
Final Words
• It’s not the software, stupid
– Software is easy to write and fix
– Can be slow
• It’s the service that matters
– Making everything work seamlessly
– Ability to fix/improve things FAST
Q&A