Challenges of running Hadoop on AWS
June 12, 2014 @ AdvancedAWS Meetup - Citizen Space
Andrei Savu - @andreisavu
Software Engineer, Cloud Automation Team
Overview
● Introduction
● Context
● Challenges
● Questions
Andrei Savu
Software Engineer
Cloud Automation Team @ Cloudera
Previously: founder of Axemblr, Apache Whirr PMC, contributor to jclouds, Cloudsoft, Facebook etc. (see LinkedIn)
Cloud Automation Team @ Cloudera
Focused on:
● building tools to automate deployment and ongoing management of Hadoop clusters on cloud infrastructure
● improving Hadoop cloud compatibility (e.g. S3 integration, Swift, managed databases, custom network topologies etc.)
We are hiring!
Context
● Hadoop
● Types of Deployments
● Cluster Topology
● AWS
Context: Hadoop
Hadoop is a broad, coherent stack of products for data storage and processing.
“Hadoop” is more than HDFS & MapReduce. It can do: multiple storage systems, different query engines, batch and real-time etc.
Usually runs on bare metal, but is now moving toward cloud infrastructure.
Types of Deployments
Long running:
- store data for analytics jobs with MapReduce, Impala, Spark
- online data serving with HBase
On-demand:
- analytical workloads, fetch data on-demand
- triggered by workflows (1:1)
- disconnected lifecycle (Netflix Genie)
Cluster Topology #1
Simple:
● EC2-Classic (being phased out)
● VPC: single subnet, security group with an optional VPN
Complex:
● VPC: multiple subnets & security groups
● DirectConnect
● highly available with disaster recovery
● multiple users & security
Cluster Topology #2
● Cloudera Reference Architecture for AWS Deployments:
http://www.cloudera.com/content/cloudera/en/resources/library/whitepaper/cloudera-enterprise-reference-architecture-for-aws-deployments.html
● Best Practices for Deploying Cloudera Enterprise on AWS:
http://blog.cloudera.com/blog/2014/02/best-practices-for-deploying-cloudera-enterprise-on-amazon-web-services/
Amazon Web Services
Paradigm shift in how we work with infrastructure.
Key concept: software-defined infrastructure, controlled through APIs.
Has most of the things we need for storage and high-performance data processing (placement groups, large instances, high storage density, SSDs, many vCPUs etc.)
Enterprise-ready: IAM, VPC, VPN / DirectConnect, Support etc.
Challenges
● Instance Provisioning & Health
● Ensuring Idempotency
● Networking & Performance
● AMIs & Bootstrap Speed
● Data Durability
● S3 Integration
What makes it more difficult?
… versus a typical stateless web application in an auto-scaling group monitoring request latency or OS load averages
● statefulness (think databases)
● each cluster has multiple processes playing different roles
● topology & configuration changes require orchestration
● knowledge of service inter-dependencies is required
Instance Provisioning & Health
Questions:
● How do you define your cluster size to deal with lack of capacity?
● How do you define health? Is it stable during setup?
● Is health a binary property, or a threshold that needs to be continuously evaluated?
Potential answers:
● match AWS semantics: define size as a range
● make simplifying assumptions (e.g. healthy during setup)
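Both answers can be sketched in a few lines (hypothetical names; a real implementation would poll the EC2 API for instance state):

```python
def cluster_ok(running, min_size, max_size):
    """Cluster size defined as a range: the cluster is acceptable as long
    as the number of healthy running instances stays within [min, max],
    even if full capacity could not be provisioned."""
    return min_size <= running <= max_size

def healthy_fraction(statuses):
    """Health as a continuously evaluated threshold rather than a binary
    flag. statuses: list of per-instance booleans from periodic checks."""
    return sum(statuses) / len(statuses) if statuses else 0.0

# e.g. ask for 10 nodes but accept anything from 7 up
assert cluster_ok(running=8, min_size=7, max_size=10)
assert healthy_fraction([True, True, False, True]) == 0.75
```

A deployment tool would then alert (or repair) when `healthy_fraction` drops below a configured threshold for some sustained period, rather than reacting to a single failed check.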
Ensuring Idempotency
Questions:
● How do you safely retry expensive calls?
● How do you build reliable workflows?
Potential answers:
● use client tokens for idempotent requests (see the AWS User Guide)
● discuss: convergence vs. single-step retries
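The client-token pattern can be sketched with an in-memory stand-in for the API (EC2's `RunInstances` accepts a `ClientToken` parameter that behaves the same way):

```python
import uuid

class FakeProvisioner:
    """Stand-in for a cloud API that deduplicates requests by client token."""
    def __init__(self):
        self._by_token = {}

    def run_instances(self, client_token, count):
        # A retried call with the same token returns the original result
        # instead of launching a second set of instances.
        if client_token not in self._by_token:
            self._by_token[client_token] = [
                "i-%s" % uuid.uuid4().hex[:8] for _ in range(count)]
        return self._by_token[client_token]

api = FakeProvisioner()
token = uuid.uuid4().hex
first = api.run_instances(token, count=3)
retry = api.run_instances(token, count=3)  # e.g. after a client-side timeout
assert first == retry                      # safe to retry: no duplicate instances
```

The caller generates the token once per logical request and reuses it on every retry; this is what makes expensive, non-idempotent calls safe to wrap in generic retry loops.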
Networking & Performance
Questions:
● What’s the ideal setup that’s both usable and secure?
● How do you get consistent intra-cluster performance?
Potential answers:
● VPC with VPN or DirectConnect; placement groups help
● Security model: initially it was just perimeter security; now it can do a lot more (disk encryption, SSL, Kerberos)
Images & Bootstrap Speed
Questions:
● Do you allow custom AMIs or force your own choices?
● If using custom AMIs, how can you reduce bootstrap time?
Potential answers:
● Custom AMIs are common - integrated with existing infra
● Fast bootstrap by baking on top with custom bits
Data durability
Questions:
● How do you place replicas? Datacenter topology?
● How are instances distributed in different failure domains?
Potential answers:
● ignore, or go with large instances that map 1:1 to hosts
● would be nice to have: a way to influence host-to-instance allocation or to get datacenter topology data
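One simple way to spread replicas across failure domains is round-robin assignment of nodes to availability zones (a hypothetical sketch; stock HDFS has no built-in AZ awareness, so this mapping would feed its rack-topology script):

```python
from itertools import cycle

def assign_zones(instances, zones):
    """Round-robin instances across availability zones so that losing one
    failure domain never takes out every replica of a block."""
    zone_iter = cycle(zones)
    return {inst: next(zone_iter) for inst in instances}

placement = assign_zones(["dn-1", "dn-2", "dn-3", "dn-4"],
                         ["us-east-1a", "us-east-1b"])
# Each zone ends up holding half of the datanodes.
assert placement["dn-1"] == "us-east-1a"
assert placement["dn-2"] == "us-east-1b"
```

Treating each zone as a "rack" in Hadoop's topology then lets the existing replica-placement policy keep copies in separate failure domains.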
S3 integration
Questions:
● How do you reconcile differences in semantics with HDFS? (strong consistency vs. eventual consistency)
● How do you get most out of it in terms of performance?
Potential answers:
● we’ve done a fair amount of work improving S3 support in open source (features, stability improvements, security etc.)
● performance is network bound
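Reconciling eventual consistency often comes down to retrying reads until a freshly written key becomes visible. A minimal sketch, using an in-memory stand-in for S3 (at the time, S3 listings and overwrites were eventually consistent):

```python
import time

def read_with_retry(fetch, key, attempts=5, delay=0.0):
    """Retry a read until the key is visible, papering over the window
    where a freshly written S3 object is not yet readable."""
    for _ in range(attempts):
        value = fetch(key)
        if value is not None:
            return value
        time.sleep(delay)
    raise KeyError(key)

# In-memory stand-in: the key only becomes visible on the third read.
calls = {"n": 0}
def flaky_fetch(key):
    calls["n"] += 1
    return b"data" if calls["n"] >= 3 else None

assert read_with_retry(flaky_fetch, "logs/part-0000") == b"data"
assert calls["n"] == 3
```

The same idea shows up in real connectors as bounded retries around listing and open operations; the retry budget trades job latency against spurious "file not found" failures.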
Thanks! Questions?
Andrei Savu - asavu@cloudera.com
Twitter: @andreisavu
Join us to take Hadoop to the clouds!
https://hire.jobvite.com/Jobvite/job.aspx?j=orafYfwy&b=nqlg3nwW