Cloud Optimized Big Data
![Page 1: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/1.jpg)
Cloud-Optimized Big-Data as a Service
Joydeep Sen Sarma
Co-Founder Qubole, Apache Hive
![Page 2: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/2.jpg)
About Me
• @Facebook (2007-2011):
– First Hadoop Engineer
– Founder - Apache Hive project, PMC Member
– Contributor to Apache Hadoop/HBase
• Founder Qubole (2012-)
– Hadoop-as-a-Service
– 30+ customers: Pinterest, Quora, Mediamath, Tubemogul …
– Design/Code/Ops/Support/…
![Page 3: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/3.jpg)
Big Data Cloud
• Elasticity:
– Workloads are Bursty
– Allows easy rolling upgrades and testing
• Lower TCO:
– Cloud Storage is Inexpensive (2-3c/GB/month, globally replicated)
– Zero cost to try new projects
– Upgrade to new hardware easily (no cluster migrations!)
![Page 4: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/4.jpg)
Big Data Cloud
• Global:
– Easily set up where employees/customers/entities are located
• Collaboration:
– Zero-Copy sharing of data with Partners and across Departments
– Easy access to great public data sets
• As-a-Service delivery model vastly lowers Operational Cost
![Page 5: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/5.jpg)
Cloud-Optimized Big Data?
• Optimized for lower TCO
• Optimized for Speed
• Optimized for Operations/Support
![Page 6: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/6.jpg)
Cloud-Optimized Big Data
Optimized for lower TCO
![Page 7: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/7.jpg)
select t.county, count(1) from (select transform(a.zip) using 'geo.py' as county from SMALL_TABLE a) t group by t.county;
insert overwrite table dest select a.id, a.zip, count(distinct b.uid) from ads a join LARGE_TABLE b on (a.id=b.ad_id) group by a.id, a.zip;
hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…
AdCo Hadoop
Automated LifeCycle Mgmt
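The first query above streams each zip value through a script named 'geo.py'. Hive's TRANSFORM passes rows to the script on standard input, one row per line with tab-separated columns, and reads the output rows back from standard output. The slides do not show the script, so the following is only a minimal sketch of what such a script might look like; the lookup table and the function names are hypothetical.

```python
import sys

# Hypothetical lookup table for illustration only; a real script would
# load a complete ZIP-to-county mapping from a file or service.
ZIP_TO_COUNTY = {
    "94301": "Santa Clara",
    "94105": "San Francisco",
    "10001": "New York",
}


def zip_to_county(zip_code):
    """Return the county for a ZIP code, or 'UNKNOWN' if it is not in the table."""
    return ZIP_TO_COUNTY.get(zip_code.strip(), "UNKNOWN")


def transform_rows(lines):
    """Map each input row (first tab-separated column = zip) to a county name."""
    for line in lines:
        zip_code = line.rstrip("\n").split("\t")[0]
        yield zip_to_county(zip_code)


# In the deployed script, the rows would come from Hive on stdin:
#   sys.stdout.writelines(county + "\n" for county in transform_rows(sys.stdin))
```

Keeping the row logic in `transform_rows` makes it easy to test without a cluster; only the last (commented) line ties it to Hive's streaming interface.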
![Page 8: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/8.jpg)
insert overwrite table dest select … from ads join campaigns on … group by …;
Auto-Scaling
[Figure: auto-scaling diagram. Labels: StarCluster, AWS, Master, Slaves, Job Tracker, Map Tasks, Reduce Tasks, Progress, Demand, Supply]
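The auto-scaling slide contrasts demand (map and reduce tasks) with supply (cluster nodes). The slides do not give the actual policy, so the sketch below is only one plausible way to turn that comparison into a node count. The function names, the slots-per-node parameter, and the min/max bounds are all assumptions, and the sketch ignores the job progress signal shown in the diagram.

```python
import math


def desired_nodes(pending_tasks, running_tasks, slots_per_node, min_nodes, max_nodes):
    """Nodes needed to run all outstanding tasks at once, clamped to [min_nodes, max_nodes]."""
    demand = pending_tasks + running_tasks          # total task slots wanted
    needed = math.ceil(demand / slots_per_node)     # nodes that would supply them
    return max(min_nodes, min(max_nodes, needed))


def scaling_action(current_nodes, target_nodes):
    """Describe the change needed to move from the current supply to the target."""
    delta = target_nodes - current_nodes
    if delta > 0:
        return ("add", delta)
    if delta < 0:
        return ("remove", -delta)
    return ("hold", 0)
```

For example, 95 pending and 5 running tasks with 10 slots per node give a target of 10 nodes; a cluster currently at 4 nodes would add 6.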
![Page 9: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/9.jpg)
Spot Instances
On average, 50-60% cheaper than regular instances
• Fallback to regular instances when Spot unavailable
• Replace regular instances with Spot when available
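The two rules above (fall back to regular instances, then swap them for Spot later) can be sketched as a pair of small planning functions. This is an illustration under stated assumptions, not Qubole's implementation: the function names are hypothetical, `spot_available` stands in for however many Spot instances can currently be obtained, and `min_regular` is an assumed floor of regular instances that are never swapped out.

```python
def plan_purchases(nodes_needed, spot_available):
    """Split a request for new nodes: Spot first, regular instances for the rest."""
    spot = min(nodes_needed, spot_available)
    return {"spot": spot, "regular": nodes_needed - spot}


def plan_replacements(regular_running, spot_available, min_regular=0):
    """How many running regular instances could be swapped for Spot right now."""
    replaceable = max(0, regular_running - min_regular)
    return min(replaceable, spot_available)
```

With 10 nodes needed and only 4 Spot instances obtainable, the plan is 4 Spot plus 6 regular; once more Spot capacity appears, `plan_replacements` says how many of those regular nodes can be traded back.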
![Page 10: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/10.jpg)
Using Fast but ‘Thin’ nodes
• C3 instances: 50% better performance at 20% lower cost
• Little local storage
![Page 11: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/11.jpg)
Using Fast but ‘Thin’ nodes
Modify Hadoop to use Network drives for overflow
[Figure: Map-Reduce and HDFS disk I/O goes to the local SSD, with overflow to network drives]
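The overflow idea can be sketched as a directory-selection rule: prefer the local SSD and spill to a network drive only when the SSD cannot hold the data. The slides do not describe how Hadoop was modified, so this is a simplified stand-in; the function names and the mount paths in the usage note are hypothetical.

```python
import shutil


def _free_bytes(path):
    """Free space on the volume holding `path`, in bytes."""
    return shutil.disk_usage(path).free


def choose_spill_dir(bytes_needed, local_dirs, network_dirs, free_bytes=_free_bytes):
    """Pick a directory for intermediate data: local SSD first, network drive on overflow."""
    for directory in list(local_dirs) + list(network_dirs):
        if free_bytes(directory) >= bytes_needed:
            return directory
    raise OSError("no volume has enough free space for %d bytes" % bytes_needed)
```

A call such as `choose_spill_dir(size, ["/mnt/ssd"], ["/mnt/net"])` returns the SSD path while it has room and the network path otherwise; the `free_bytes` argument exists so the rule can be tested without real volumes.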
![Page 12: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/12.jpg)
Cloud-Optimized Big Data
Optimized for Speed
![Page 13: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/13.jpg)
• Optimize I/O to AWS S3
– Faster Split Computation (8x)
– Prefetching S3 files (30%)
– Zero-Copy writes to S3
• JVM Reuse (1.2-2x speedup)
• Columnar File Caches on local disks (1.2-2x speedup)
• 30-50% cost savings because of cluster consolidation
Faster, Faster ..
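Of the I/O optimizations listed on this slide, prefetching is the easiest to illustrate: start fetching the next files while the current one is being processed. The slides do not say how it was implemented, so the sketch below only shows the general technique. `fetch` is a hypothetical caller-supplied function (for instance, one that downloads an S3 object), and `depth` is an assumed tuning knob.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def read_with_prefetch(keys, fetch, depth=2):
    """Yield fetch(key) for each key in order, keeping up to `depth` fetches in flight."""
    keys = iter(keys)
    with ThreadPoolExecutor(max_workers=depth) as pool:
        in_flight = deque()
        # Prime the pipeline with the first `depth` fetches.
        for key in islice(keys, depth):
            in_flight.append(pool.submit(fetch, key))
        while in_flight:
            result = in_flight.popleft().result()
            # Start the next fetch before handing this result to the consumer.
            for key in islice(keys, 1):
                in_flight.append(pool.submit(fetch, key))
            yield result
```

Results come back in the original key order, so a reader can swap this in for a plain `for key in keys: fetch(key)` loop without changing downstream logic.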
![Page 14: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/14.jpg)
• 5x Faster than nearest competitor (Hive against S3)
Faster, Faster ..
![Page 15: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/15.jpg)
• Presto-as-a-Service
– 3-22x faster SQL against S3 (as tested by customer)
Faster, Faster ..
![Page 16: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/16.jpg)
Cloud-Optimized Big Data
Optimized for Operations/Support
![Page 17: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/17.jpg)
Rolling Upgrades
• @Facebook: we spent months upgrading a large cluster
• @Qubole: Start new cluster, Reassign label
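The "start new cluster, reassign label" approach works because jobs address a label rather than a specific cluster. The sketch below illustrates that indirection with a small in-memory registry; it is not Qubole's API, and the class, method names, and cluster identifiers are hypothetical.

```python
class ClusterLabels:
    """Maps a stable label (what jobs refer to) to the cluster currently serving it."""

    def __init__(self):
        self._by_label = {}

    def assign(self, label, cluster_id):
        """Point `label` at `cluster_id`; return the cluster it pointed to before, if any."""
        previous = self._by_label.get(label)
        self._by_label[label] = cluster_id
        return previous

    def resolve(self, label):
        """Return the cluster that jobs submitted to `label` should run on."""
        return self._by_label[label]
```

An upgrade then amounts to starting the new cluster, calling `assign("default", "cluster-v2")`, and shutting down the returned old cluster once its running jobs finish.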
![Page 18: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/18.jpg)
Support
Chat
Email
![Page 19: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/19.jpg)
Visually browse Historical Jobs
![Page 20: Cloud Optimized Big Data](https://reader033.fdocuments.net/reader033/viewer/2022051413/5578f220d8b42a5c5c8b522e/html5/thumbnails/20.jpg)
Visually browse Historical Jobs