Cloud Optimized Big Data

Cloud-Optimized Big-Data as a Service

Joydeep Sen Sarma
Co-Founder, Qubole and Apache Hive

About Me

• @Facebook (2007-2011):
  – First Hadoop Engineer
  – Founder - Apache Hive project, PMC Member
  – Contributor to Apache Hadoop/HBase

• Founder Qubole (2012-)
  – Hadoop-as-a-Service
  – 30+ customers: Pinterest, Quora, Mediamath, Tubemogul …
  – Design/Code/Ops/Support/…

Big Data Cloud

• Elasticity:
  – Workloads are bursty
  – Allows easy rolling upgrades and testing

• Lower TCO:
  – Cloud storage is inexpensive (2-3c/GB/month, globally replicated)
  – Zero cost to try new projects
  – Upgrade to new hardware easily (no cluster migrations!)

Big Data Cloud

• Global:
  – Easily set up wherever employees/customers/entities are located

• Collaboration:
  – Zero-Copy sharing of data with Partners and across Departments
  – Easy access to great public data sets

• As-a-Service delivery model vastly lowers Operational Cost

Cloud-Optimized Big Data?

• Optimized for lower TCO

• Optimized for Speed

• Optimized for Operations/Support

Cloud-Optimized Big Data

Optimized for lower TCO

select t.county, count(1) from (select transform(a.zip) using 'geo.py' as county from SMALL_TABLE a) t group by t.county;
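For context, a transform script such as geo.py receives each input row on stdin as tab-separated text and writes result rows to stdout. The deck does not show the script, so the following is only a minimal sketch: the lookup table and the function names are hypothetical, and a real script would load a complete ZIP-to-county dataset.

```python
import sys

# Hypothetical lookup table (assumption): a real script would load a full dataset.
ZIP_TO_COUNTY = {
    "94301": "Santa Clara",
    "10001": "New York",
}


def zip_to_county(line):
    """Map one tab-separated input row (first column is the zip) to a county name."""
    zip_code = line.rstrip("\n").split("\t")[0]
    return ZIP_TO_COUNTY.get(zip_code, "UNKNOWN")


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Rows arrive one per line on stdin; one output row is written per input row.
    for line in stdin:
        stdout.write(zip_to_county(line) + "\n")


if __name__ == "__main__":
    main()
```

Because the script only touches stdin and stdout, it can be checked locally by piping a file of zip codes through it before running it on a cluster.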

insert overwrite table dest select a.id, a.zip, count(distinct b.uid) from ads a join LARGE_TABLE b on (a.id=b.ad_id) group by a.id, a.zip;

hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…

AdCo Hadoop

Automated LifeCycle Mgmt

insert overwrite table dest select … from ads join campaigns on … group by …;

Auto-Scaling

[Diagram labels: StarCluster, AWS, Master, Slaves, Job Tracker, Map Tasks, Reduce Tasks, Demand, Supply, Progress]
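One way to read the demand/supply labels on this slide is as a sizing rule: compare pending tasks (demand) with the task slots the current slaves provide (supply). The sketch below is an illustration under that assumption, not the scheduler described in the talk. The function names, the slots-per-slave figure, and the min/max bounds are all hypothetical, and task progress is ignored.

```python
import math


def desired_slaves(pending_tasks, slots_per_slave, min_slaves, max_slaves):
    """Slaves needed to run all pending tasks in one wave, clamped to the bounds."""
    needed = math.ceil(pending_tasks / slots_per_slave)
    return max(min_slaves, min(max_slaves, needed))


def scaling_delta(pending_tasks, current_slaves,
                  slots_per_slave=4, min_slaves=2, max_slaves=100):
    """Positive: add that many slaves. Negative: that many slaves can be released."""
    target = desired_slaves(pending_tasks, slots_per_slave, min_slaves, max_slaves)
    return target - current_slaves
```

For example, with 40 pending tasks, 4 slots per slave, and 5 running slaves, the rule asks for 5 more slaves; with no pending tasks it shrinks the cluster down to the minimum size.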

Spot Instances

On average, 50-60% cheaper

• Fallback to regular instances when Spot unavailable

• Replace regular instances with Spot when available
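The two bullets above describe a fallback and a replacement policy. A minimal sketch of that logic follows, assuming a simple count of available Spot nodes; the function names and the flat discount are hypothetical, and real Spot prices vary over time.

```python
def plan_nodes(desired, spot_available):
    """Fill the request with Spot nodes first, then fall back to regular nodes."""
    spot = min(desired, spot_available)
    return {"spot": spot, "regular": desired - spot}


def rebalance(current, spot_available):
    """Swap regular nodes for Spot nodes once Spot capacity is available again."""
    swap = min(current["regular"], spot_available)
    return {"spot": current["spot"] + swap, "regular": current["regular"] - swap}


def hourly_cost(nodes, regular_price, spot_discount=0.5):
    """Estimate hourly cost; 0.5 is the low end of the 50-60% savings quoted above."""
    spot_price = regular_price * (1 - spot_discount)
    return nodes["spot"] * spot_price + nodes["regular"] * regular_price
```

With 10 nodes requested and 6 Spot nodes available, the plan is 6 Spot plus 4 regular; at a regular price of 1.0 per node-hour and a 50% discount, that costs 7.0 per hour instead of 10.0.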

Using Fast but ‘Thin’ nodes

• C3 instances: 50% better performance at 20% lower cost
• Little local storage

Using Fast but ‘Thin’ nodes

Modify Hadoop to use Network drives for overflow

[Diagram labels: Map-Reduce, HDFS, Local SSD, Network Drives, Disk I/O, Overflow]
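The overflow idea can be illustrated as a directory-selection rule: write to the local SSD while it has room, and spill to a network drive otherwise. This is only a sketch of the concept, not the actual Hadoop modification; the mount points, the reserve parameter, and the function names are assumptions.

```python
import shutil


def free_bytes(path):
    """Free space on the filesystem holding `path` (standard-library call)."""
    return shutil.disk_usage(path).free


def choose_spill_dir(bytes_needed, local_free,
                     local_dir="/mnt/ssd", network_dir="/mnt/network", reserve=0):
    """Prefer the local SSD; overflow to the network drive when it would run short.

    `reserve` is headroom (in bytes) to keep free on the local disk.
    The default paths are hypothetical mount points.
    """
    if local_free - reserve >= bytes_needed:
        return local_dir
    return network_dir
```

A caller would pass `free_bytes(local_dir)` as `local_free`; keeping the measurement separate from the decision makes the rule easy to test without real disks.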


Optimized for Speed

• Optimize I/O to AWS S3
  – Faster Split Computation (8x)
  – Prefetching S3 files (30%)
  – Zero-Copy writes to S3

• JVM Reuse (1.2-2x speedup)

• Columnar File Caches on local disks (1.2-2x speedup)

• 30-50% cost savings because of cluster consolidation
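Prefetching, one of the S3 optimizations listed above, overlaps the download of the next object with the processing of the current one. The sketch below shows the general pattern with a single background thread. It does not reflect the actual implementation: `fetch` and `process` are hypothetical caller-supplied functions (for example, `fetch` could wrap an S3 GET).

```python
from concurrent.futures import ThreadPoolExecutor


def read_with_prefetch(keys, fetch, process):
    """Process objects in order while the next one downloads in the background."""
    results = []
    if not keys:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, keys[0])
        for next_key in keys[1:]:
            data = pending.result()                 # wait for the current object
            pending = pool.submit(fetch, next_key)  # start the next download early
            results.append(process(data))
        results.append(process(pending.result()))   # last object has no successor
    return results
```

The benefit depends on how download time compares with processing time: when the two are similar, most of the download latency is hidden behind processing.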

Faster, Faster ..

• 5x Faster than nearest competitor (Hive against S3)


Faster, Faster ..

• Presto-as-a-Service
  – 3-22x faster SQL against S3 (as tested by customer)


Optimized for Operations/Support

Rolling Upgrades

• @Facebook: we spent months upgrading a large cluster
• @Qubole: Start new cluster, Reassign label
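The "Reassign label" step can be pictured as an indirection table: jobs target a label, and the label points at whichever cluster is current. The following is a minimal sketch of that idea; the class, the method names, and the cluster identifiers are hypothetical and do not describe Qubole's API.

```python
class ClusterLabels:
    """Maps stable labels (what jobs reference) to cluster ids (what runs them)."""

    def __init__(self):
        self._labels = {}

    def assign(self, label, cluster_id):
        """Point `label` at `cluster_id`; return the previous cluster id, if any."""
        previous = self._labels.get(label)
        self._labels[label] = cluster_id
        return previous

    def resolve(self, label):
        """Return the cluster id that new jobs for `label` should run on."""
        return self._labels[label]
```

An upgrade then amounts to starting the new cluster, calling `assign("prod", new_id)`, and retiring the returned old cluster once its running jobs have drained.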


Support

• Chat
• Email

Visually browse Historical Jobs

Questions?

@jsensarma

www.qubole.com
