Fb talk arch_summit
Transcript of Fb talk arch_summit
![Page 1: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/1.jpg)
Evolution of Big Data Architectures @ Facebook
Architecture Summit, Shenzhen, August 2012
Ashish Thusoo
![Page 2: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/2.jpg)
About Me
• Currently Co-founder/CEO of Qubole
• Ran the Data Infrastructure Team at Facebook till 2011
• Co-founded Apache Hive @ Facebook
![Page 3: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/3.jpg)
Outline
• Big Data @ Facebook - Scope & Scale
• Evolution of Big Data Architectures @ FB
• Qubole
![Page 4: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/4.jpg)
Big Data @ FB (2011): Scale
• 25 PB of compressed data ~ 150 PB of uncompressed data
• 400 TB/day (uncompressed) of new data
• 1 new job every second
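As a rough sanity check on these figures, the arithmetic below derives the implied compression ratio and daily job count. The compressed daily volume is only an estimate: it assumes new data compresses at the same ratio as the existing warehouse, which the slide does not state.

```python
# Back-of-the-envelope figures derived from the slide's 2011 numbers.
COMPRESSED_PB = 25
UNCOMPRESSED_PB = 150
NEW_DATA_TB_PER_DAY = 400  # uncompressed, from the slide
JOBS_PER_SECOND = 1

compression_ratio = UNCOMPRESSED_PB / COMPRESSED_PB
jobs_per_day = JOBS_PER_SECOND * 24 * 60 * 60

# Assumption (not on the slide): new data compresses like the existing data.
new_compressed_tb_per_day = NEW_DATA_TB_PER_DAY / compression_ratio

print(f"compression ratio: {compression_ratio:.0f}x")
print(f"jobs per day: {jobs_per_day:,}")
print(f"new compressed data: ~{new_compressed_tb_per_day:.1f} TB/day")
```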
![Page 5: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/5.jpg)
Big Data @ FB: Scope
• Simple reporting
• Model generation
• Adhoc analysis + data science
• Index generation
• Many many others...
![Page 6: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/6.jpg)
A/B Testing Email #1
![Page 7: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/7.jpg)
A/B Testing Email #2
![Page 8: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/8.jpg)
A/B Testing Email #2 is 3x Better
![Page 9: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/9.jpg)
Evolution: 2007-2011
DW Size in TB (bar chart by year):
• 2007: 15
• 2008: 250
• 2009: 800
• 2010: 8,000
• 2011: 25,000
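The growth implied by the chart's five values can be worked out directly. The sketch below computes year-over-year multiples and the overall 2007 to 2011 multiple; the pairing of each value with its year follows the order in which they appear on the chart.

```python
# Warehouse size in TB per year, as read from the chart.
dw_size_tb = {2007: 15, 2008: 250, 2009: 800, 2010: 8000, 2011: 25000}

years = sorted(dw_size_tb)
# Growth multiple of each year relative to the year before it.
growth = {
    year: dw_size_tb[year] / dw_size_tb[prev]
    for prev, year in zip(years, years[1:])
}
overall = dw_size_tb[years[-1]] / dw_size_tb[years[0]]

for year, factor in growth.items():
    print(f"{year}: {factor:.1f}x over {year - 1}")
print(f"2007 to 2011 overall: {overall:.0f}x")
```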
![Page 10: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/10.jpg)
2007: Traditional EDW
Web Clusters
Scribe Mid-Tier
MySQL Clusters
Summarization Cluster
NAS Filers
RDBMS Data Warehouse
![Page 11: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/11.jpg)
2007: Pain Points
Summarization Cluster
Web Clusters
Scribe Mid-Tier
MySQL Clusters
NAS Filers
RDBMS Data Warehouse
• Daily ETL > 24 hours
• Lots of tuning/indexes etc.
• Lots of hardware planning
• Compute close to storage (early map/reduce)
![Page 12: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/12.jpg)
2007: Limitations
• Most use cases were business metrics; data science, model building, etc. were not possible
• Only summary data was stored online - details archived away
![Page 13: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/13.jpg)
2008: Move to Hadoop
Web Clusters
Scribe Mid-Tier
MySQL Clusters
NAS Filers
Summarization Cluster
RDBMS Data Warehouse
![Page 14: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/14.jpg)
2008: Move to Hadoop
Web Clusters
Scribe Mid-Tier
MySQL Clusters
NAS Filers
RDBMS Data Mart
Hadoop/Hive Data Warehouse
Batch copier/loaders
![Page 15: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/15.jpg)
2008: Immediate Pros
• Data science at scale became possible
• For the first time all of the instrumented data could be held online
• Use cases expanded
![Page 16: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/16.jpg)
2009: Democratizing Data
Web Clusters
Scribe Mid-Tier
MySQL Clusters
NAS Filers
RDBMS Data Mart
Hadoop/Hive Data Warehouse
![Page 17: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/17.jpg)
2009: Democratizing Data
Hadoop/Hive Data Warehouse
Databee & Chronos: Data Pipeline Framework
HiPal: Adhoc Queries + Data Discovery
Nectar: instrumentation & schema aware data collection
Scrapes: Configuration Driven
![Page 18: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/18.jpg)
2009: Democratizing Data (Nectar)
• Typical Nectar Pipeline
• Simple schema evolution built in
• JSON-encoded short-term data
• Decomposing JSON for long-term storage
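The idea of decomposing JSON records into columns, with simple schema evolution, might look roughly like the sketch below. This is not Nectar's actual implementation: the `decompose` function, the field names, and the backfill-with-`None` policy are all assumptions made for illustration.

```python
import json


def decompose(json_lines, schema):
    """Turn JSON-encoded log lines into per-column value lists.

    `schema` is an ordered list of column names. It is extended in place
    whenever a record carries a field not seen before, which is one simple
    form of schema evolution (hypothetical, not Nectar's real behavior).
    """
    columns = {name: [] for name in schema}
    rows = 0
    for line in json_lines:
        record = json.loads(line)
        for key in record:
            if key not in columns:
                schema.append(key)
                # Backfill earlier rows so every column has equal length.
                columns[key] = [None] * rows
        for name in schema:
            columns[name].append(record.get(name))
        rows += 1
    return columns


# Hypothetical log lines; the second one introduces a new field.
lines = [
    '{"user": 1, "event": "click"}',
    '{"user": 2, "event": "view", "country": "CN"}',
]
schema = ["user", "event"]
columns = decompose(lines, schema)
print(schema)
print(columns)
```

Keeping the short-term copy as raw JSON makes adding a field cheap for the producer, while the columnar long-term copy is friendlier to compression and to queries that touch only a few fields.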
![Page 19: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/19.jpg)
2009: Democratizing Data (Tools)
• HiPal - data discovery and query authoring
• Charting and dashboard generation tools
![Page 20: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/20.jpg)
2009: Democratizing Data (Tools)
• Databee: Workflow language
• Chronos: Scheduling tool
![Page 21: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/21.jpg)
2009: Cons of Democratization
• Isolation to protect against Bad Jobs
• Fair sharing of the cluster - what is a high priority job and how to enforce it
![Page 22: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/22.jpg)
2010: Controlling Chaos
• Isolation
• Reducing operational overhead
• Better resource utilization
• Measurement, ownership, accountability
![Page 23: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/23.jpg)
2010: Isolation
Web Clusters
Scribe Mid-Tier
MySQL Clusters
NAS Filers
Hadoop/Hive Data Warehouse
![Page 24: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/24.jpg)
2010: Isolation
Web Clusters
Scribe Mid-Tier
MySQL Clusters
NAS Filers
Platinum Warehouse
Silver Warehouse
Hive Replication
![Page 25: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/25.jpg)
2010: Ops Efficiency
Web Clusters
Scribe HDFS
MySQL Clusters
Platinum Warehouse
Silver Warehouse
Hive Replication
ptail: parallel tail on HDFS
Near real time data consumers
![Page 26: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/26.jpg)
2010: Resource Utilization (Disk)
• HDFS-RAID: from 3 replicas to 2.2 replicas
• RCFile: Row columnar format for compressing Hive tables
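To see how a figure like 2.2 can arise, the sketch below computes effective replication for a parity-based layout and shows that a lost block can be rebuilt from XOR parity. The specific layout (one XOR parity block per stripe of 10 blocks, with data and parity each stored twice) is an assumption that happens to give 2.2; the slide does not state the configuration.

```python
from functools import reduce


def effective_replication(stripe_len, parity_blocks, data_replicas, parity_replicas):
    """Stored blocks per logical data block for a striped parity layout."""
    stored = stripe_len * data_replicas + parity_blocks * parity_replicas
    return stored / stripe_len


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Assumed layout: stripe of 10 data blocks, 1 XOR parity block, 2 copies each.
raid_factor = effective_replication(10, 1, 2, 2)
print(raid_factor)

# Recovering a lost block: XOR the surviving blocks with the parity block.
stripe = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_blocks(stripe)
recovered = xor_blocks([stripe[0], stripe[2], parity])
print(recovered == stripe[1])
```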
![Page 27: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/27.jpg)
2010: Resource Utilization (CPU)
• Continuous copier/loaders
• Incremental scrapes
• Hive optimizations to save CPU
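The slide does not say how scrapes were made incremental. One common approach, sketched here under that caveat, is to keep a watermark of the last change seen and pull only rows modified after it. The `updated_at` field and the `incremental_scrape` helper are hypothetical.

```python
def incremental_scrape(rows, watermark):
    """Return rows changed since `watermark` and the new watermark.

    `rows` is any iterable of dicts with a hypothetical `updated_at` value
    that increases with each modification (for example a timestamp).
    """
    changed = [row for row in rows if row["updated_at"] > watermark]
    new_watermark = max((row["updated_at"] for row in changed), default=watermark)
    return changed, new_watermark


# Stand-in for a source table; a real scrape would query MySQL instead.
table = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 250},
    {"id": 3, "updated_at": 300},
]

changed, watermark = incremental_scrape(table, watermark=200)
print([row["id"] for row in changed], watermark)

# A second pass with nothing new copies no rows and keeps the watermark.
changed_again, watermark_again = incremental_scrape(table, watermark)
print(changed_again, watermark_again)
```

Compared with re-copying the full table on every run, this limits CPU and I/O to the rows that actually changed.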
![Page 28: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/28.jpg)
2010: Monitoring (SLAs)
• Per job statistics rolled up to owner/group/team
• Expected time of arrival vs Actual time of arrival of data
• Simple data quality metrics
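A minimal sketch of the first two bullets: compare each job's expected and actual data arrival time, then roll lateness up to the owning team. The job record layout (`owner`, `expected`, `actual`) and the sample jobs are assumptions for illustration, not the monitoring system described in the talk.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sla_report(jobs):
    """Roll per-job lateness up to each owner."""
    report = defaultdict(
        lambda: {"jobs": 0, "late": 0, "total_delay": timedelta()}
    )
    for job in jobs:
        # Early arrivals count as zero delay rather than negative delay.
        delay = max(job["actual"] - job["expected"], timedelta())
        entry = report[job["owner"]]
        entry["jobs"] += 1
        entry["late"] += int(delay > timedelta())
        entry["total_delay"] += delay
    return dict(report)


day = datetime(2010, 6, 1)
jobs = [
    {"owner": "ads", "expected": day.replace(hour=6), "actual": day.replace(hour=7)},
    {"owner": "ads", "expected": day.replace(hour=8), "actual": day.replace(hour=8)},
    {"owner": "growth", "expected": day.replace(hour=9), "actual": day.replace(hour=8)},
]

report = sla_report(jobs)
for owner, stats in report.items():
    print(owner, stats["jobs"], stats["late"], stats["total_delay"])
```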
![Page 29: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/29.jpg)
2011: New Requirements
• More real time requirements for aggregations
• Optimizing resource utilization
![Page 30: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/30.jpg)
2011: Beyond Hadoop
• Puma for real time analytics
• Peregrine for simple and fast queries
![Page 31: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/31.jpg)
2011: Puma
Web Clusters
Scribe HDFS
MySQL Clusters
Platinum Warehouse
Silver Warehouse
Hive Replication
ptail: parallel tail on HDFS
Near real time data consumers
![Page 32: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/32.jpg)
2011: Puma
Scribe HDFS
ptail: parallel tail on HDFS
Puma Clusters
HBase Cluster
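The flow in this diagram (log lines tailed from HDFS, aggregated, then written to a key-value store) can be illustrated with a toy aggregator. This is not Puma's design: the tab-separated line format, the one-minute buckets, and the in-memory `Counter` standing in for HBase are all assumptions.

```python
from collections import Counter


def aggregate(lines, bucket_seconds=60):
    """Count events per (time bucket, event type).

    Each line is assumed to look like "<unix_seconds>\t<event>".
    """
    counts = Counter()
    for line in lines:
        timestamp, event = line.rstrip("\n").split("\t")
        # Round the timestamp down to the start of its bucket.
        bucket = int(timestamp) // bucket_seconds * bucket_seconds
        counts[(bucket, event)] += 1
    return counts


# Stand-in for lines arriving from a tail of the log stream.
stream = ["0\tlike", "30\tlike", "61\tlike", "62\tshare"]
counts = aggregate(stream)
for (bucket, event), n in sorted(counts.items()):
    print(bucket, event, n)
```

Because each line only increments a counter keyed by bucket and event, results are available shortly after the line arrives instead of after a batch job completes.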
![Page 33: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/33.jpg)
Some takeaways
• Operating and optimizing Data Infrastructure is a hard problem
• Lots of components: log collection, storage, compute, query processing, tools and interfaces
• Lots of choices within each part of the stack
![Page 34: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/34.jpg)
Qubole
• Mission:
• Data Infrastructure in the Cloud made Easy, Fast and Reliable
• We take care of operating and optimizing this infrastructure so that you can focus on your data, analysis, algorithms and building your data apps
![Page 35: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/35.jpg)
Qubole - Information
• Early Trial (by invitation):
• www.qubole.com
• Come talk to us to join a small and passionate team
• Follow us on Twitter/Facebook/LinkedIn
![Page 36: Fb talk arch_summit](https://reader034.fdocuments.net/reader034/viewer/2022052506/5577e088d8b42a7b7b8b4b92/html5/thumbnails/36.jpg)