Data Lifecycles at Massive Scale
![Page 1: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure](https://reader034.fdocuments.net/reader034/viewer/2022050113/5f4a0f8c91bb81620f67239f/html5/thumbnails/1.jpg)
Data Lifecycles at Massive Scale
Pasadena, January 2016
![Page 2: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure](https://reader034.fdocuments.net/reader034/viewer/2022050113/5f4a0f8c91bb81620f67239f/html5/thumbnails/2.jpg)
Background
Qubole Confidential
• Co-founder and CEO of Qubole
  - Cloud Based Big Data as a Service
  - Processes 250PB+ data every month
• Led Data Infrastructure at Facebook
  - Made Big Data Self-service in Facebook
  - Nearly an Exabyte of data
• Co-creator of Apache Hive
  - Democratized Big Data through SQL
Operations Analyst
Marketing Ops
Analyst
Data Architect
Business Users
Product Support
Customer Support
Developer
Sales Ops
Product Managers
Data Infrastructure
![Page 3: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure](https://reader034.fdocuments.net/reader034/viewer/2022050113/5f4a0f8c91bb81620f67239f/html5/thumbnails/3.jpg)
Evolution of Data
![Page 4: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure](https://reader034.fdocuments.net/reader034/viewer/2022050113/5f4a0f8c91bb81620f67239f/html5/thumbnails/4.jpg)
Requirements from Modern Data Infrastructure
Scalability on Commodity Hardware
Multi-Persona support for Multiple Use-cases
DATA INFRASTRUCTURE
SQL
for Analysts
Machine Learning
for Data Scientists
Data Prep for
ETL Users
Streaming for
Developers
![Page 5: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure](https://reader034.fdocuments.net/reader034/viewer/2022050113/5f4a0f8c91bb81620f67239f/html5/thumbnails/5.jpg)
Data Infrastructure for a Modern Enterprise
Data Creation
Data Ingestion
Streaming Analytics
Batch Data Processing
Adhoc Analysis
Machine Learning
Visualization/Dashboards
Applications
CREATION COLLECTION PROCESSING USAGE
![Page 6: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure](https://reader034.fdocuments.net/reader034/viewer/2022050113/5f4a0f8c91bb81620f67239f/html5/thumbnails/6.jpg)
Rise of Open Source
Data Creation
Data Ingestion
Streaming Analytics
Batch Data Processing
Adhoc Analysis
Machine Learning
Visualization/Dashboards
Applications
CREATION COLLECTION PROCESSING USAGE
![Page 7: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure](https://reader034.fdocuments.net/reader034/viewer/2022050113/5f4a0f8c91bb81620f67239f/html5/thumbnails/7.jpg)
Qubole – a reflection of Usage of a Modern Data Infrastructure
Engine Usage on Qubole
Hive
Spark
MapReduce
Presto
HBase
A Modern Data Platform needs Multiple Engines
- Hive for Complex SQL
- Spark for Data Science and Streaming
- Presto for Interactive Simple SQL
- MapReduce for Batch ETL
250PB+ Data Processed Every Month
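The engine-to-workload pairing on this slide can be written down as a small lookup. This is only an illustrative sketch: the pairing itself comes from the slide, while the `Workload` enum, its member names, and `pick_engine` are hypothetical names invented for the example and are not part of any Qubole API.

```python
from enum import Enum


class Workload(Enum):
    """Hypothetical workload categories, named after the slide's bullets."""
    COMPLEX_SQL = "complex_sql"
    DATA_SCIENCE = "data_science"
    STREAMING = "streaming"
    INTERACTIVE_SQL = "interactive_sql"
    BATCH_ETL = "batch_etl"


# Pairing taken from the slide: one engine per kind of workload.
ENGINE_FOR_WORKLOAD = {
    Workload.COMPLEX_SQL: "Hive",
    Workload.DATA_SCIENCE: "Spark",
    Workload.STREAMING: "Spark",
    Workload.INTERACTIVE_SQL: "Presto",
    Workload.BATCH_ETL: "MapReduce",
}


def pick_engine(workload: Workload) -> str:
    """Return the engine the slide associates with the given workload."""
    return ENGINE_FOR_WORKLOAD[workload]


if __name__ == "__main__":
    for workload in Workload:
        print(f"{workload.value}: {pick_engine(workload)}")
```

A real platform would weigh more than the workload type (data size, latency target, cost), so treat the table as a summary of the slide, not a routing policy.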
![Page 8: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure](https://reader034.fdocuments.net/reader034/viewer/2022050113/5f4a0f8c91bb81620f67239f/html5/thumbnails/8.jpg)
Impediments for an Aspiring Data Driven Enterprise
Why companies struggle with Self-Service Big Data:
• 6-18 month implementation time
• Only 27% of Big Data initiatives are classified as “Successful” in 2014
Rigid and inflexible infrastructure
Non-adaptive software services
Highly specialized systems
Difficult to build and operate
• Only 13% of organizations achieve full-scale production
• 57% of organizations cite skills gap as a major inhibitor
![Page 9: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure](https://reader034.fdocuments.net/reader034/viewer/2022050113/5f4a0f8c91bb81620f67239f/html5/thumbnails/9.jpg)
Power of the Cloud
The Cloud Provides:
On-demand Infrastructure
Highly Scalable Object Stores
Self Service Infrastructure
![Page 10: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure](https://reader034.fdocuments.net/reader034/viewer/2022050113/5f4a0f8c91bb81620f67239f/html5/thumbnails/10.jpg)
Big Data Meets Cloud – Power of Object Stores
The Right Storage for Storing Big Data
• Elastic Scalability: petabyte scale without capacity constraints
• High Concurrency: throughput scales linearly
• High Availability: 99.99% availability
• High Durability: easily more durable than HDFS 3x replicas
• Enterprise Grade at a fraction of the cost
![Page 11: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure](https://reader034.fdocuments.net/reader034/viewer/2022050113/5f4a0f8c91bb81620f67239f/html5/thumbnails/11.jpg)
Big Data Meets Cloud – Power of Elastic Compute
The Right Compute Paradigm to Fit Usage
• On-Demand: provision entire clusters in less than two minutes with no lead time for sourcing
• Elastic: cluster size should match workload; run with thousands of nodes when you need it, de-provision all nodes when idle
• Flexible: change compute infrastructure to fit workloads
• Cost Efficiency: Multiple SLA options to fit the right budget to the workloads
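The "Elastic" bullet (cluster size should match workload, and all nodes are de-provisioned when idle) can be illustrated with a toy sizing rule. This is a minimal sketch under stated assumptions, not how Qubole or any cloud provider actually scales clusters: the function `target_nodes` and its parameters `pending_tasks`, `tasks_per_node`, and `max_nodes` are hypothetical.

```python
import math


def target_nodes(pending_tasks: int, tasks_per_node: int, max_nodes: int) -> int:
    """Return how many nodes a cluster should have for the current backlog.

    Assumptions (not from the slides): work is measured in pending tasks,
    every node runs `tasks_per_node` tasks at once, and `max_nodes` is a
    budget cap chosen by the operator.
    """
    if tasks_per_node <= 0:
        raise ValueError("tasks_per_node must be positive")
    if max_nodes < 0:
        raise ValueError("max_nodes must not be negative")
    if pending_tasks <= 0:
        # Idle cluster: de-provision all nodes.
        return 0
    # Enough nodes to run every pending task at once, capped by the budget.
    needed = math.ceil(pending_tasks / tasks_per_node)
    return min(needed, max_nodes)


if __name__ == "__main__":
    for backlog in (0, 17, 100_000):
        print(backlog, "->", target_nodes(backlog, tasks_per_node=8, max_nodes=2000))
```

A production autoscaler would also smooth decisions over time and account for node start-up delay; the sketch only captures the "match size to workload, scale to zero when idle" idea.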
![Page 12: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure](https://reader034.fdocuments.net/reader034/viewer/2022050113/5f4a0f8c91bb81620f67239f/html5/thumbnails/12.jpg)
Big Data Meets Cloud – Power of Self Service SaaS
![Page 13: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure](https://reader034.fdocuments.net/reader034/viewer/2022050113/5f4a0f8c91bb81620f67239f/html5/thumbnails/13.jpg)
Big Data Meets Cloud – Cloud Security and Compliance
Security is a no. 1 citizen: Cloud Built from Outside–In
• Multiple Encryption options
• Industry-standard authentication for every REST API request
• Virtual Private Cloud
• Auto-logging for auditability
• Industry Compliance
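The slide does not say which scheme backs "authentication for every REST API request". One widely used approach is to sign each request with a shared secret using HMAC-SHA256, sketched below with Python's standard library. The message layout, `sign_request`, and `verify_request` are assumptions made for illustration and do not describe any specific provider's signing protocol.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, body: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over method, path and body hash.

    The canonical message layout here is hypothetical; real services
    define their own (and usually include a timestamp to limit replay).
    """
    body_digest = hashlib.sha256(body).hexdigest()
    message = "\n".join([method.upper(), path, body_digest]).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret: bytes, method: str, path: str, body: bytes,
                   signature: str) -> bool:
    """Recompute the signature server-side and compare in constant time."""
    expected = sign_request(secret, method, path, body)
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    key = b"example-secret"  # placeholder value, never hard-code real keys
    sig = sign_request(key, "GET", "/clusters", b"")
    print(verify_request(key, "GET", "/clusters", b"", sig))
    print(verify_request(key, "GET", "/other", b"", sig))
```

Because the secret never travels with the request, a server holding the same key can check both who sent the request and that it was not altered in transit.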
![Page 14: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure](https://reader034.fdocuments.net/reader034/viewer/2022050113/5f4a0f8c91bb81620f67239f/html5/thumbnails/14.jpg)
Big Data Platform and the Cloud
Enterprise Grade Security, Governance and Reliability
Multi-User and SaaS architecture for Best Operational Efficiency
Auto-scaling and portability across Clouds
Successful Big Data Adoption at Scale with a Unified Big Data Platform Built for the Cloud
![Page 15: Data Lifecycles at Massive Scale...Qubole Confidential • Co-founder and CEO of Qubole! Cloud Based Big Data as a Service! Processes 250PB+ data every month • Lead Data Infrastructure](https://reader034.fdocuments.net/reader034/viewer/2022050113/5f4a0f8c91bb81620f67239f/html5/thumbnails/15.jpg)
Thank You