(ISM213) Building and Deploying a Modern Big Data Architecture on AWS

Jason Woodlee, Datapipe Sr. Director Cloud Products

Ilya Krammer, Datapipe Software Engineer

October 2015

What to Expect from the Session

Presentation Overview

• Cloud analytics

• Refactoring to Amazon EMR

• Technical learnings

• Testing in a cloud Hadoop world

• Organization learnings

• Future of analytics

Cloud Analytics

Acquired in 2013

Data-mining and

intelligence tool to

govern and analyze

your AWS environment

Architecture Time Capsule

• Multiple disparate apps

• Heavy utilization of agents and APIs

• Multiple queues and high-frequency polling

• Heavy concurrent real-time MongoDB access

Architecture

The Big Data Problem

• Multiple consumers: dashboard, aggregated reports, query interfaces

• 2 GB files for each of our 1,500+ clients, hourly

• 1000s of large API payloads per second

• 40+ VMs with distinct ETL processes

• Massive Mongo instances

The Big Data Problem, continued

• Processing was slow and error prone

• Infrastructure was mostly static with single points of failure; maintenance was intense

• Release management became problematic

• Data store became a bottleneck to ETL and aggregation

• Always-on MongoDB infrastructure became expensive

• Spend misaligned with client usage

“Why are we paying so much for what is essentially data at rest?”

Eureka Moment

Redesign Goals

Analytics

• Improve Performance

• Increase Scale

• Reduce Cost

Data Layer

• Increase Performance

• Reduce Cost

Reduce support footprint

Designing with EMR

• Separate raw data and user visible data

• EMR with Amazon S3 instead of MongoDB

• AWS services

• On-demand infrastructure

• Store user-visible data (low latency) on SSD drives with TTLs for easy cleanup
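
A transient, on-demand cluster of the kind listed above could be requested roughly as follows. This is a minimal sketch, not the configuration used in the talk: the function name, bucket paths, instance types, release label, and the choice of a Hadoop streaming step are all assumptions made for illustration. The function only builds the keyword arguments for boto3's EMR `run_job_flow` call; setting `KeepJobFlowAliveWhenNoSteps` to `False` is what makes the cluster shut down once its steps finish.

```python
from typing import Any, Dict


def build_transient_cluster_request(
    name: str,
    log_uri: str,
    mapper_uri: str,
    reducer_uri: str,
    input_uri: str,
    output_uri: str,
    core_nodes: int = 10,
) -> Dict[str, Any]:
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All names and values here are hypothetical placeholders.
    """
    # Hadoop streaming refers to shipped files by their base name.
    mapper_name = mapper_uri.rsplit("/", 1)[-1]
    reducer_name = reducer_uri.rsplit("/", 1)[-1]
    step = {
        "Name": "hourly-analytics",
        # Stop paying for the cluster if the job fails.
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", f"{mapper_uri},{reducer_uri}",
                "-mapper", mapper_name,
                "-reducer", reducer_name,
                "-input", input_uri,    # raw data read directly from S3
                "-output", output_uri,  # results written back to S3
            ],
        },
    }
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-5.36.0",  # assumed release, adjust as needed
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m4.xlarge", "InstanceCount": 1,
                 "Market": "ON_DEMAND"},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m4.4xlarge", "InstanceCount": core_nodes,
                 "Market": "ON_DEMAND"},
            ],
            # Transient cluster: terminate when the last step completes.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [step],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Submitting the request needs boto3 and AWS credentials, for example:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the request builder separate from the API call means the cluster shape can be reviewed and tested without touching AWS.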

Learnings

Resource alignment: Wide transient clusters over static clusters reduced our cost significantly and allow massive

               Static Hadoop          EMR on Demand

               m4.4xlarge instance    Multiple jobs a day

Usage          Min 20% (~4.8 hours)   ~4 hours a day

Monthly cost   $800 a month           $170 a month
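
The arithmetic behind this comparison can be checked with a few lines. This sketch uses only the figures quoted on the slide (20% minimum utilization, $800 and $170 a month); the function names are illustrative.

```python
HOURS_PER_DAY = 24


def busy_hours_per_day(utilization: float) -> float:
    """Hours per day a cluster is actually working at a given utilization."""
    return utilization * HOURS_PER_DAY


def cost_reduction(static_monthly: float, on_demand_monthly: float) -> float:
    """Fraction of the static monthly cost saved by the on-demand setup."""
    return (static_monthly - on_demand_monthly) / static_monthly


static_busy = busy_hours_per_day(0.20)  # slide: "Min 20% ~ 4.8 hours"
saving = cost_reduction(800, 170)       # slide: $800 vs. $170 a month

print(f"{static_busy:.1f} busy hours a day")  # 4.8 busy hours a day
print(f"{saving:.0%} lower monthly cost")     # 79% lower monthly cost
```

A static cluster is billed for all 24 hours while doing roughly 4.8 hours of work; the on-demand setup is billed only for the roughly 4 hours its jobs run, which is consistent with a saving of about 79%.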

Learnings: Right Tool for the Job

With Amazon Elastic MapReduce we were able to process 90% of the analytics in a single pass through the whole dataset

performance improvement

EMR for performance
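
The idea of computing many analytics in a single pass can be illustrated in plain Python. This is a sketch of the principle only, not the team's EMR job: the record fields and the three aggregates are hypothetical. Each record is read once and every aggregate is updated from it, instead of scanning the dataset once per report.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class UsageRecord(NamedTuple):
    client: str
    service: str
    cost: float


class Aggregates(NamedTuple):
    cost_by_client: Dict[str, float]
    cost_by_service: Dict[str, float]
    record_count: int


def aggregate_single_pass(records: Iterable[UsageRecord]) -> Aggregates:
    """Update every aggregate from each record while reading the data once."""
    by_client: Dict[str, float] = defaultdict(float)
    by_service: Dict[str, float] = defaultdict(float)
    count = 0
    for record in records:
        by_client[record.client] += record.cost
        by_service[record.service] += record.cost
        count += 1
    return Aggregates(dict(by_client), dict(by_service), count)
```

Because the function accepts any iterable, it also works on a stream that can only be read once, such as a generator over a large file.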

Learnings: Data Management

• Utilize Hadoop merge instead of lookups

• Pipeline Hadoop to normalize data before processing
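
One way to read "merge instead of lookups" is a reduce-side join: rather than querying a data store for each record, both datasets are sorted and grouped by key and joined as they stream past. The slides do not say which join the team used, so the following is an assumed, in-memory stand-in for that pattern, with hypothetical dataset names; the sort plays the role of Hadoop's shuffle.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, Tuple


def reduce_side_join(
    usage: Iterable[Tuple[str, float]],
    accounts: Iterable[Tuple[str, str]],
) -> Iterator[Tuple[str, str, float]]:
    """Inner-join usage rows to account rows on their shared key."""
    # Tag each row with its source; "A" sorts before "U", so account rows
    # arrive ahead of the usage rows that share their key.
    tagged = [(key, "A", value) for key, value in accounts]
    tagged += [(key, "U", value) for key, value in usage]
    tagged.sort(key=itemgetter(0, 1))  # stands in for shuffle and sort

    for key, group in groupby(tagged, key=itemgetter(0)):
        account_rows = []
        for _, tag, value in group:
            if tag == "A":
                account_rows.append(value)
            else:
                # Usage rows with no matching account are dropped.
                for account in account_rows:
                    yield key, account, value


accounts = [("c1", "Acme"), ("c2", "Globex")]
usage = [("c2", 5.0), ("c1", 3.0), ("c1", 7.0), ("c3", 1.0)]
joined = list(reduce_side_join(usage, accounts))
```

Each dataset is read once and no per-record query is issued, which is the property that removes the data store from the critical path.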

Results

• Processing reduction 75%

• Cost reduction 80%

• Improved maintainability

Testing in a Cloud Hadoop World

Approach

• Old method of testing is incomplete

• Full data size needed to validate complex analytics

Best practices emerged

• Scripts and tooling developed to rapidly create environments

• Different strategies and approaches to validate changes
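
One possible strategy for validating a change, sketched under assumptions: run the current and the changed job on the same full-size input and compare their aggregated outputs key by key. The function name and the report format are hypothetical, and the slides do not describe the team's actual tooling.

```python
import math
from typing import Dict, List


def diff_outputs(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    rel_tol: float = 1e-9,
) -> List[str]:
    """List the differences between two job outputs keyed by metric name."""
    problems = []
    for key in sorted(baseline.keys() | candidate.keys()):
        if key not in candidate:
            problems.append(f"missing in candidate: {key}")
        elif key not in baseline:
            problems.append(f"unexpected in candidate: {key}")
        elif not math.isclose(baseline[key], candidate[key], rel_tol=rel_tol):
            problems.append(
                f"mismatch for {key}: {baseline[key]} != {candidate[key]}"
            )
    return problems
```

An empty list means the change preserved every metric within the tolerance; anything else gives a short list to investigate before release.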

Organizational Learnings

• Architecture drives adoption

• Early pioneers lead the charge

• Adoption is more complex than traditional stacks

• Ramp-up of teams is much slower

• Amazon EMR is very effective for rapid prototyping

Future of Analytics

Where Do We See Analytics Going?

Ecosystem alignment

• The Hadoop world is in a tug of war between vendors and tools

• BI vendors and platform providers will level to a small few, while open source competitors and startups explode

Key areas of pain will be resolved

• Current challenges will get better

• Job management

• Data reporting

• Log consolidation

Managed service providers

• Experience in handling broad sets of data from a large client base will continue to enable MSPs such as Datapipe to build expertise and evolve consulting capabilities

Thank you!

Questions

Remember to complete your evaluations!
