Post on 21-Jan-2017
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jason Woodlee, Datapipe Sr. Director Cloud Products
Ilya Krammer, Datapipe Software Engineer
October 2015
ISM213
Building and Deploying a Modern Big Data Architecture on AWS
What to Expect from the Session
Presentation Overview
• Cloud analytics
• Refactoring to Amazon EMR
• Technical learnings
• Testing in a cloud Hadoop world
• Organization learnings
• Future of analytics
Cloud Analytics
Acquired in 2013
Data-mining and intelligence tool to govern and analyze your AWS environment
Architecture Time Capsule
• Multiple disparate apps
• Heavy utilization of agents and APIs
• Multiple queues and high-frequency polling
• Heavy concurrent real-time access to MongoDB
The Big Data Problem
• Multiple consumers: dashboard, aggregated reports, query interfaces
• 2 GB files for each of our 1,500+ clients on an hourly basis
• 1000s of large API payloads per second
• 40+ VMs with distinct ETL processes
• Massive Mongo instances
The Big Data Problem, continued
• Processing was slow and error-prone
• Infrastructure was mostly static with single points of failure; maintenance was intense
• Release management became problematic
• Data store became a bottleneck to ETL and aggregation
• Always-on MongoDB infrastructure became expensive
• Spend misaligned with client usage
Redesign Goals
Analytics
• Improve Performance
• Increase Scale
• Reduce Cost
Data Layer
• Increase Performance
• Reduce Cost
Reduce support footprint
Designing with EMR
• Separate raw data and user visible data
• EMR with Amazon S3 instead of MongoDB
• AWS services
• On-demand infrastructure
• Store user-visible data (low latency) on SSD drives with TTLs for easy cleanup
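As one way to picture the on-demand infrastructure bullet, the sketch below builds the request for a transient cluster that reads from and writes to Amazon S3 and shuts down when its steps finish. It is a minimal sketch under stated assumptions, not the configuration used in this project: the function name, the step name, the instance count, and every argument value are hypothetical. The dictionary keys follow the parameters of the boto3 EMR `run_job_flow` call.

```python
def build_transient_cluster_request(name, release_label, instance_type,
                                    core_nodes, step_args):
    """Build keyword arguments for a transient (self-terminating) EMR cluster.

    All argument values are supplied by the caller; nothing here is taken
    from the presentation. `step_args` is the command line for one step,
    typically pointing at S3 input and output locations.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_nodes},
            ],
            # False makes the cluster transient: it terminates after the
            # last step, so nothing keeps running between jobs.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "analytics-pass",  # hypothetical step name
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": "command-runner.jar", "Args": list(step_args)},
        }],
        # Default role names created by the EMR console and CLI.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

With AWS credentials configured, a request like this could be passed as `boto3.client("emr").run_job_flow(**request)`. Building it as plain data first keeps the cluster shape reviewable and testable without touching AWS.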
Learnings
Resource alignment: Choosing wide transient clusters over static clusters reduced our cost significantly and allowed massive scale
                Static Hadoop           EMR on Demand
                m4.4xlarge instance     Multiple jobs a day
Usage           Min 20% (~4.8 hours)    ~4 hours a day
Monthly Cost    $800 a month            $170 a month
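The arithmetic behind the comparison above can be checked in a few lines. This is a minimal sketch that uses only the figures quoted on the slide (20% minimum utilization, $800 and $170 a month); the function names are hypothetical.

```python
HOURS_PER_DAY = 24


def busy_hours_per_day(utilization):
    """Convert a utilization fraction into busy hours per day."""
    return utilization * HOURS_PER_DAY


def savings_fraction(static_cost, transient_cost):
    """Fraction of the static cluster's cost saved by the transient setup."""
    return 1 - transient_cost / static_cost


# Figures quoted on the slide.
static_hours = busy_hours_per_day(0.20)   # roughly 4.8 hours a day
savings = savings_fraction(800.0, 170.0)  # a little under 80%
print(f"Static cluster busy about {static_hours:.1f} hours a day")
print(f"Transient clusters cost about {savings:.0%} less per month")
```

The static cluster is billed around the clock although it is busy for under five hours a day, while the transient clusters are billed only for the hours in which jobs run.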
Learnings: Right Tool for the Job
With Amazon Elastic MapReduce we were able to process 90% of the analytics in a single pass through the whole dataset
40% performance improvement
EMR for performance
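The single-pass idea can be illustrated on a small scale: instead of scanning the dataset once per report, several aggregates are updated while each record is read once. This is a minimal sketch in plain Python, not the actual EMR job; the record fields (`client`, `service`, `cost`) are hypothetical.

```python
from collections import defaultdict


def single_pass_metrics(records):
    """Compute several aggregates while reading each record exactly once."""
    cost_by_client = defaultdict(float)
    count_by_service = defaultdict(int)
    most_expensive = None
    for rec in records:
        cost_by_client[rec["client"]] += rec["cost"]
        count_by_service[rec["service"]] += 1
        if most_expensive is None or rec["cost"] > most_expensive["cost"]:
            most_expensive = rec
    return dict(cost_by_client), dict(count_by_service), most_expensive


# Hypothetical sample records.
sample = [
    {"client": "a", "service": "ec2", "cost": 1.5},
    {"client": "b", "service": "s3", "cost": 4.0},
    {"client": "a", "service": "ec2", "cost": 2.5},
]
by_client, by_service, top = single_pass_metrics(sample)
```

In a MapReduce setting the same effect comes from emitting several keyed values per input record in one map phase, so the raw data is read from storage only once.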
Learnings: Data Management
• Use Hadoop merges instead of lookups
• Use a Hadoop pipeline to normalize data before processing
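To show why a merge can replace per-record lookups, the sketch below joins two inputs that are already sorted by key, walking both exactly once. It is an illustrative sort-merge join in plain Python, assuming keys are unique on the right-hand side; it is not the Hadoop job itself, and the sample keys and values are hypothetical.

```python
def merge_join(left, right):
    """Join two (key, value) sequences that are both sorted by key.

    Assumes each key appears at most once in `right`; `left` may repeat keys.
    Yields (key, left_value, right_value) for every matching left record.
    """
    left_iter, right_iter = iter(left), iter(right)
    l = next(left_iter, None)
    r = next(right_iter, None)
    while l is not None and r is not None:
        if l[0] == r[0]:
            yield (l[0], l[1], r[1])
            l = next(left_iter, None)  # keep r: more left rows may match it
        elif l[0] < r[0]:
            l = next(left_iter, None)  # left key has no partner, skip it
        else:
            r = next(right_iter, None)  # advance right to catch up


# Hypothetical usage records and instance metadata, both sorted by key.
usage = [("i-1", 10), ("i-1", 5), ("i-2", 7), ("i-4", 1)]
metadata = [("i-1", "web"), ("i-2", "db"), ("i-3", "cache")]
joined = list(merge_join(usage, metadata))
```

Each input is read sequentially once, so the cost grows with the size of the data rather than with the number of round trips to a data store.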
Testing in a Cloud Hadoop World
Approach
• Old method of testing is incomplete
• Full data size needed to validate complex analytics
Best practices emerged
• Scripts and tooling developed to rapidly create environments
• Different strategies and approaches to validate changes
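One possible validation strategy, offered only as an illustration since the slides do not name the strategies used, is to run the current and the changed job on the same full-size input and compare the aggregated outputs key by key. The function name, the key format, and the tolerance below are hypothetical.

```python
import math


def diff_outputs(baseline, candidate, rel_tol=1e-9):
    """Compare two {key: numeric value} outputs from the same input data."""
    shared = baseline.keys() & candidate.keys()
    return {
        # Keys the changed job dropped.
        "missing": sorted(baseline.keys() - candidate.keys()),
        # Keys the changed job produced that the baseline did not.
        "unexpected": sorted(candidate.keys() - baseline.keys()),
        # Keys present in both whose values differ beyond the tolerance.
        "changed": sorted(
            k for k in shared
            if not math.isclose(baseline[k], candidate[k], rel_tol=rel_tol)
        ),
    }


# Hypothetical aggregated outputs from two job versions.
report = diff_outputs(
    {"client-a": 100.0, "client-b": 50.0, "client-c": 7.0},
    {"client-a": 100.0, "client-b": 55.0, "client-d": 1.0},
)
```

An empty report on full-size data gives more confidence than a passing unit test on a small sample, which matches the point that full data size is needed to validate complex analytics.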
Organizational Learnings
• Architecture drives adoption
• Early pioneers lead the charge
• Adoption is more complex than traditional stacks
• Ramp-up of teams is much slower
• Amazon EMR is very effective for rapid prototyping
Where Do We See Analytics Going?
Ecosystem alignment
• The Hadoop world is in a tug of war between vendors and tools
• BI vendors and platform providers will level to a small few, while open source competitors and startups explode
Key areas of pain will be resolved
• Current challenges will get better:
  • Job management
  • Data reporting
  • Log consolidation
Managed service providers
• Experience in handling broad sets of data from a large client base will continue to enable MSPs such as Datapipe to build expertise and evolve consulting capabilities