Page 1:

akass@ + dmi@

Page 2:

digitalocean.com

Building a Robust Analytics Platform with an open-source stack

Page 3:

What’s coming up:

1) DigitalOcean - a company background

2) Data @ DigitalOcean

3) The Big Data Tech Stack @ DO

4) Use-cases + Demo

Pages 4-11:

What is DigitalOcean?

A Cloud Hosting Company for Software Developers.

- 4 years old

- 12 Data Centers globally

- Over 30 Million VMs created

- In 196 Countries

- 700K Developer Users

- 30K Developer Teams

Page 12:

Data @ DO - Participating Teams

- Data & Analytics (“DnA” - Analysts + Data Scientists/Engineering)

  - (Alex & Dao live here)

- Platform/Infrastructure Engineering

- Product Engineering

- Security

Page 13:

Data @ DO

- Product Usage

Page 14:

Data @ DO - Product Usage

- Product Revenue Forecasting

- Churn Prediction

- User Segmentation

- Beta product uses and product cannibalization

- Support Team Efficacy

Page 15:

Data @ DO - Product Usage, v1.0

Pain Point 1: General Data Architecture

Huge, unwieldy SQL Tables → Slooooow, monolithic, unadaptive

Pain Point 2: Upstream Dependencies

E.g. Monthly Invoicing → Revenue analytics done on monthly basis

Page 16:

Data @ DO - Product Usage, v2.0

Revision 1: General Data Architecture

Microservices + Kafka pass application-level events → Faster and more robust, but teams must build their own consumers.

Revision 2: Active Downstream Consumption

E.g. Granular Billable Events → Daily, hourly, even near-RT processing of revenue for ingestion into analysis

Page 17:

Data @ DO

- Product Usage

- Sales/Marketing Leads

Pages 18-19:

Data @ DO - Sales/Marketing Leads

Problem:

- User behavioral data lives in AWS Redshift

- User metadata lives in MySQL on DO’s cloud

Solution:

- Migrate warehousing to our own cloud so that all data stays on-premise

Page 20:

Data @ DO

- Product Usage

- Sales/Marketing Leads

- Infrastructure

Pages 21-27:

Data @ DO - Infrastructure

Problem - nay - Conundrum:

- Every 5 minutes, our entire active VM fleet is polled for OS and HW data using Prometheus and other in-house scraping solutions

- Significant scale (too big for RDBMS), inherent silos

Pages 28-31:

We need to reimagine how we process and store everything.

To recap:

- Product Data in MySQL is slow and isolated

- Sales/Marketing data are isolated in different warehouses

- Infra data are prohibitively large and isolated

Pages 32-35:

What is DigitalOcean?

A Cloud Hosting Company for Software Developers.

A Cloud Hosting Company = We have cloud infrastructure.

= We can build our own big data environment.

Pages 36-46:

The DO Big Data Stack

1. Distributed Systems Management

2. Parallel Compute

3. Standardized Compute Environments

4. Distributed Data Warehousing

5. Distributed Streaming Events System

6. Massively Parallel Query Engine

7. Unifying IDE

8. Logging at Scale

Pages 47-48:

“Buzz.”

– Hive

Page 49:

Use Case 1: A New ETL Pipeline for Support Ticket Events

- Measuring Customer Satisfaction (CSAT)

- Integration with Internal Ticketing

- More transparency for Support Team

Pages 50-54:

Use Case 1: A New ETL Pipeline for Support Ticket Events

- Subscribe to new Kafka topic

- Prototype reading of Kafka data within a Spark-enabled Jupyter Notebook

- Update Spark Docker container to include new modules and configurations, if necessary

    $ cat spark-defaults.conf | grep docker
    spark.mesos.executor.docker.image docker.internal.digitalocean.com/platform/spark:1929b8d

- Write Spark script & deploy onto Mesos

- Analyze!
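As a rough illustration of the prototyping step above, the sketch below parses JSON payloads of the kind a Kafka consumer would hand to a notebook and derives a CSAT figure from them. The field names (`ticket_id`, `event_type`, `rating`) and the rule that a rating of 4 or 5 out of 5 counts as satisfied are assumptions made for this example, not DO's actual schema. In the pipeline described here the same logic would presumably run inside a Spark job rather than in plain Python.

```python
import json
from typing import Iterable, Optional

# Hypothetical event payloads, standing in for Kafka message values.
SAMPLE_MESSAGES = [
    b'{"ticket_id": 1, "event_type": "rated", "rating": 5}',
    b'{"ticket_id": 2, "event_type": "rated", "rating": 2}',
    b'{"ticket_id": 3, "event_type": "opened"}',
    b'not valid json',
    b'{"ticket_id": 4, "event_type": "rated", "rating": 4}',
]


def parse_event(raw: bytes) -> Optional[dict]:
    """Decode one message value; return None for malformed payloads."""
    try:
        event = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    return event if isinstance(event, dict) else None


def csat(messages: Iterable[bytes], satisfied_from: int = 4) -> Optional[float]:
    """Share of rated tickets whose rating is >= satisfied_from (assumed rule)."""
    ratings = []
    for raw in messages:
        event = parse_event(raw)
        if event and event.get("event_type") == "rated" and "rating" in event:
            ratings.append(event["rating"])
    if not ratings:
        return None  # no rated tickets, so CSAT is undefined
    return sum(r >= satisfied_from for r in ratings) / len(ratings)


if __name__ == "__main__":
    print(csat(SAMPLE_MESSAGES))
```

Malformed payloads are skipped rather than raised, which keeps an interactive prototype from failing on a single bad message.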

Page 55:

Use Case 2: Billable Data, from Months to Minutes

Reminder:

Pain Point 2: Upstream Dependencies

E.g. Monthly Invoicing → Revenue analytics done on monthly basis

Revision 2: Active Downstream Consumption

Goal → Increase temporal granularity to daily, hourly, even near-RT processing of revenue for ingestion into analysis
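To make the change in granularity concrete, here is a minimal sketch that rolls granular billable events up into hourly and daily revenue totals. The event shape (`ts` as an ISO-8601 timestamp, `amount_cents` as an integer) and the sample values are hypothetical; the slides do not describe the real event schema.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical billable events; the real schema is not given in the slides.
EVENTS = [
    {"ts": "2016-06-01T00:05:00", "amount_cents": 150},
    {"ts": "2016-06-01T00:40:00", "amount_cents": 50},
    {"ts": "2016-06-01T13:10:00", "amount_cents": 300},
    {"ts": "2016-06-02T09:00:00", "amount_cents": 700},
]


def revenue_by_bucket(events, fmt):
    """Sum amount_cents per time bucket; fmt is a strftime pattern naming the bucket."""
    totals = defaultdict(int)
    for event in events:
        bucket = datetime.fromisoformat(event["ts"]).strftime(fmt)
        totals[bucket] += event["amount_cents"]
    return dict(totals)


# The same events, viewed at two granularities.
hourly = revenue_by_bucket(EVENTS, "%Y-%m-%d %H:00")
daily = revenue_by_bucket(EVENTS, "%Y-%m-%d")
```

Amounts are kept as integer cents so that repeated summation does not accumulate floating-point error.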


Pages 58-60:

Use Case 3: Power Failure Detection + PDU Clustering

Real-World problem:

- many different drive models out in fleet…

- how to predict performance stability/failures with consistency?

Solution:

1) Measure PDU patterns & fluctuations on a podlet/rack/device level

2) Join to smartctl data to get vendor/drive-specific metadata

3) Mine for outliers to help DC teams react quicker to anomalies
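The three steps above might look roughly like this on a small scale. Everything concrete here is an assumption for illustration: the record fields (`rack`, `watts`, `drive_model`), the sample values, and the choice of a median-absolute-deviation score with a cutoff of 3.5 as the outlier test. The slides do not say which method was actually used.

```python
from statistics import median

# Hypothetical per-rack power readings (step 1).
PDU_READINGS = [
    {"rack": "r1", "watts": 410.0},
    {"rack": "r2", "watts": 405.0},
    {"rack": "r3", "watts": 398.0},
    {"rack": "r4", "watts": 402.0},
    {"rack": "r5", "watts": 655.0},
]

# Hypothetical drive metadata of the kind smartctl reports (step 2).
DRIVE_MODELS = {"r1": "model-a", "r2": "model-a", "r3": "model-b",
                "r4": "model-b", "r5": "model-c"}


def join_metadata(readings, models):
    """Attach the drive model to each reading, keyed on rack."""
    return [{**r, "drive_model": models.get(r["rack"], "unknown")} for r in readings]


def flag_outliers(rows, threshold=3.5):
    """Return rows whose modified z-score (based on the MAD) exceeds threshold."""
    values = [row["watts"] for row in rows]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread, so nothing can be called an outlier
    return [row for row in rows
            if abs(0.6745 * (row["watts"] - med) / mad) > threshold]


# Step 3: mine the joined rows for outliers.
outliers = flag_outliers(join_metadata(PDU_READINGS, DRIVE_MODELS))
```

A median-based score is used in this sketch because a single extreme reading would inflate a mean and standard deviation enough to hide itself.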

Page 61:

Use Case 3: Power Failure Detection + PDU Clustering

Pages 62-63:

Use Cases in Action:

Let’s have a DEMO...

Page 64:

Lessons learned from work thus far...

Pages 65-70:

Lessons Learned (a small sample thus far)

- Jupyter was indispensable for prototyping Spark + Kafka interactively

- Spark image containerization → consistency across Mesosphere

- Multiple Docker images → developmental flexibility

- Presto DB → significantly improved querying efficiency:

  - Naturally worked well with parallel data-stores on HDFS

  - Also improved RDBMS data retrieval through parallelization

- Laying pipelines to connect Kafka to Spark to HDFS was challenging; tuning everything was often even harder!

Page 71:

Recapping…

Pages 72-76:

Recapping…

1) Consolidation/Centralization of data from across DO

   Anyone can build reports/analyses

2) Faster Reporting, Better Granularity

3) Tight coupling with the other engineering groups

4) Modern stack allows massive scale with small headcount

Page 77:

Gazing into the Future...

Pages 78-82:

Future Work

- Hardware failure detection forecasting to help the DC team perform predictable maintenance.

- Better inform hardware acquisition costs with detailed machine performance metrics.

- Profile hypervisors by usage to optimize how VMs are allocated to particular racks.

- Beta-test/prototype/dogfood future products.

Page 83:

Questions?

Page 84:

Thank you!