Thursday, May 13, 2010

Evolving a New Analytical Platform
What Works and What’s Missing
Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
May 13, 2010

My Background
Thanks for Asking
▪ [email protected]
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led Data team at Facebook
▪ Nearly 30 amazing engineers and data scientists
▪ Several open source projects and research papers
▪ Founder of Cloudera
▪ Vice President of Products and Chief Scientist
▪ Also, check out the book “Beautiful Data”

Presentation Outline
▪ Architectures for large scale data analysis
▪ Reference architecture: ETL, DW, BI, Analytics
▪ New foundations: HDFS and MapReduce
▪ SQL Server 2008 R2
▪ The new platform emerges
▪ Building a new platform
▪ Motivations
▪ Implementation
▪ Questions and Discussion

Summary of the Presentation
(I have a short attention span, too)
▪ The abstractions provided by a relational database are no longer useful on their own for analytical data management.
▪ The abstraction layer needs to be redrawn to include the functionality provided by ETL, MDM, stream management, reporting, OLAP, and search tools, with a unified user interface for collaboration on investigation and results.
▪ I don’t think the cloud has much to do with the above, except to kill “scale up” once and for all.

Experiences at Facebook
Early 2006: The First Research Scientist
▪ Source data living on horizontally partitioned MySQL tier
▪ Intensive historical analysis difficult
▪ No way to assess impact of changes to the site
▪ First try: Python scripts pull data into MySQL
▪ Second try: Python scripts pull data into Oracle
▪ ...and then we turned on impression logging
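
The "first try" above, scripts that pull rows out of a horizontally partitioned tier into one database for analysis, might have looked roughly like the sketch below. It is only an illustration: in-memory SQLite databases stand in for the MySQL shards and the target database so the example runs anywhere, and the `users` table, its columns, and the function names are hypothetical, not taken from the talk.

```python
import sqlite3

USERS_DDL = "CREATE TABLE IF NOT EXISTS users (user_id INTEGER PRIMARY KEY, signup_date TEXT)"


def make_shard(rows):
    """Build an in-memory database standing in for one shard (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(USERS_DDL)
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    conn.commit()
    return conn


def pull_into_warehouse(shards, warehouse):
    """Copy every shard's rows into one table so history can be queried in one place."""
    warehouse.execute(USERS_DDL)
    total = 0
    for shard in shards:
        rows = shard.execute("SELECT user_id, signup_date FROM users").fetchall()
        # INSERT OR REPLACE keeps a re-run from failing on rows copied earlier.
        warehouse.executemany("INSERT OR REPLACE INTO users VALUES (?, ?)", rows)
        total += len(rows)
    warehouse.commit()
    return total


# Two shards, each holding a disjoint slice of the users.
shards = [
    make_shard([(1, "2006-01-03"), (3, "2006-01-05")]),
    make_shard([(2, "2006-01-03"), (4, "2006-02-11")]),
]
warehouse = sqlite3.connect(":memory:")
copied = pull_into_warehouse(shards, warehouse)

# A historical question that no single shard can answer on its own.
signups_by_month = warehouse.execute(
    "SELECT substr(signup_date, 1, 7) AS month, COUNT(*) "
    "FROM users GROUP BY month ORDER BY month"
).fetchall()
print(copied, signups_by_month)
```

The loop visits the shards one at a time and funnels everything through a single target database, which is workable at tens of GB and becomes the bottleneck once log volume grows.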

Facebook Data Infrastructure
2007
▪ “Data Warehousing”
▪ Began with Oracle database
▪ Schedule data collection via cron
▪ Collect data every 24 hours
▪ “ETL” scripts: hand-coded Python
▪ Data volumes quickly grew
▪ Started at tens of GB in early 2006
▪ Up to about 1 TB per day by mid-2007
▪ Log files largest source of data growth
[Diagram: Oracle Database Server, Data Collection Server, MySQL Tier, Scribe Tier]
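
A hand-coded, cron-scheduled "ETL" script of the kind this slide describes can be sketched as follows. Everything specific here is an assumption made for illustration: the tab-separated log format, the `daily_impressions` table, the file path in the crontab comment, and SQLite standing in for the Oracle database.

```python
import sqlite3
from collections import Counter

# Hypothetical crontab entry that would run such a script once every 24 hours:
#   0 2 * * * /usr/bin/python3 /opt/etl/daily_load.py


def extract(lines):
    """Parse tab-separated log lines of the assumed form: day, user_id, page."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines instead of aborting the nightly run
        day, user_id, page = parts
        yield day, int(user_id), page


def transform(records, day):
    """Count impressions per page for one day."""
    return Counter(page for rec_day, _user_id, page in records if rec_day == day)


def load(conn, day, counts):
    """Write the daily aggregate; re-running the same day overwrites its rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_impressions ("
        "day TEXT, page TEXT, impressions INTEGER, PRIMARY KEY (day, page))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO daily_impressions VALUES (?, ?, ?)",
        [(day, page, n) for page, n in counts.items()],
    )
    conn.commit()


log_lines = [
    "2007-06-01\t17\thome.php\n",
    "2007-06-01\t42\tprofile.php\n",
    "2007-06-01\t17\thome.php\n",
    "garbage line\n",
    "2007-06-02\t17\thome.php\n",
]
conn = sqlite3.connect(":memory:")
load(conn, "2007-06-01", transform(extract(log_lines), "2007-06-01"))
loaded = conn.execute(
    "SELECT page, impressions FROM daily_impressions WHERE day = ? ORDER BY page",
    ("2007-06-01",),
).fetchall()
print(loaded)
```

Keying the target table on `(day, page)` makes the load idempotent, which matters for a job that cron may re-run after a failure.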

Facebook Data Infrastructure
2008
[Diagram: MySQL Tier, Scribe Tier, Hadoop Tier, Oracle RAC Servers]
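
The 2008 diagram adds a Hadoop tier, and the outline names HDFS and MapReduce as the new foundations. As a rough illustration of the MapReduce programming model only, here is a single-process sketch in plain Python. It does not use Hadoop's API; the function names and the `(user_id, page)` record shape are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs: one pair per log record, keyed by page."""
    _user_id, page = record
    yield page, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence within one process."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


records = [(17, "home.php"), (42, "profile.php"), (17, "home.php"), (8, "home.php")]
result = run_job(records)
print(result)
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data between them; because each call depends only on its own inputs, the work can be spread out rather than funneled through one database server.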

SQL Server 2008 R2
Old Features
▪ ETL: SQL Server Integration Services
▪ DW: SQL Server
▪ Reporting: SQL Server Reporting Services
▪ Analytics: SQL Server Analysis Services
▪ Search: Full-Text Search

SQL Server 2008 R2
New Features
▪ Stream management: StreamInsight
▪ OLAP: PowerPivot
▪ Collaboration: SharePoint
▪ MDM: Master Data Services
▪ Scale-out: Parallel Data Warehouse

A New Foundation
Motivations and Implementation
▪ Orders of magnitude growth in data volumes and complexity
▪ Often from machine-generated logs
▪ Complex data is vast majority of data
▪ Built by consumer web teams and not enterprise software firms
▪ Open source
▪ Modular collection of tools, not an opaque abstraction
▪ Applications, not just analysis
▪ Solve user needs, don’t implement a spec

(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0