Architecting for the Cloud: Cloud Providers

© Matthew Bass 2013

Architecting for the Cloud

Len and Matt Bass

Cloud Providers


IaaS Providers

• There are several primary providers

– Amazon: Amazon Web Services (AWS)

– Microsoft: Azure

– Google: Google Compute Engine

– …

• Each of these is set up a bit differently, with slightly different internal design decisions and associated services


Goals

• The goal of this talk is not to give you a definitive how-to for each provider

• It’s meant to give you just an introduction

• The idea is that you’ll see how the concepts that we talked about in the course map to specific providers

• We’ll look primarily at Amazon (with some details from others thrown in)

• We’ll go through both the overall structure and look at specific services


Amazon Elastic Compute Cloud

• Amazon EC2 provides compute capacity in the cloud

• You can select the machine image with a given OS and specified capability

• You can resize the capacity as needed

• Takes minutes to spin up a new VM

• You can specify multiple instances and select where they will run – Region & availability zones

• You pay per hour of usage, at a rate that depends on the capability of the instance and on whether it’s a reserved instance (dedicated)


Regions

• Amazon has divided their cloud offerings into multiple regions

• Each region should be thought of as a separate cloud

– I.e. there is no automatic copying of data from one region to another


Current AWS Regions

• North America

– US East (5 availability zones)

– US West Oregon (3 availability zones)

– US West Northern California (3 availability zones)

– US GovCloud (2 availability zones)

• South America

– Sao Paulo (2 availability zones)

• Europe

– Ireland (2 availability zones)

• Asia Pacific

– Sydney (2 availability zones)

– Singapore (2 availability zones)

– China (1 availability zone)

– Tokyo (3 availability zones)


AWS and Services

• Amazon Web Services offers a number of services

• These services are things like:

– Storage

– Database

– Network capabilities

– Monitoring

– …

• Not all services are available in all regions

– https://aws.amazon.com/about-aws/globalinfrastructure/regional-product-services/


Amazon Availability Zones

• Amazon has a notion of availability zones

• Engineered to be insulated from failures in other availability zones

• Availability zones are locations within a region

• Amazon has not announced the details of an availability zone, but presumably the zones:

– Are physically separate data centers

– Have independent networks

– Have independent power delivery

– …


Amazon Service Level Agreement

• Amazon guarantees 99.95% availability for each region

• IaaS consumers are free to deploy their applications:

– Within an availability zone

– Across availability zones but within a region

– Across regions

• Amazon does not make any claim about the availability of their availability zones (that I could find)
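
To put the 99.95% figure in perspective, here is a small worked calculation. It is only a sketch: the function names are made up for illustration, and the combined figure assumes that the redundant deployments fail independently, which the SLA itself does not promise.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability):
    """Hours per year a service may be down while still meeting `availability`."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(availability, copies):
    """Availability of `copies` redundant deployments.

    ASSUMPTION: the copies fail independently of each other.
    """
    return 1.0 - (1.0 - availability) ** copies


# 99.95% availability allows a little over four hours of downtime per year.
print(f"{downtime_hours_per_year(0.9995):.2f} hours/year")
# Two independent deployments, each at 99.95%.
print(f"{combined_availability(0.9995, 2):.8f}")
```

Under the independence assumption, deploying across two regions pushes the theoretical availability well beyond what a single region offers, which is why the choice of deployment scope matters.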


All-in-one Single Server


Basic 4-server Setup


Multiple Availability Zones


Multiple Regions


Elastic Compute Cloud (EC2) & Redundancy

• EC2 supports different levels of redundancy

– It is up to the customer to determine how much redundancy they wish to have and how much they wish to pay for it

• Redundant elements can be:

– Within an availability zone

– Across availability zones

– Across regions


Microsoft Azure Regions

• North America

– US Central (Iowa)

– US East (Virginia)

– US East 2 (Virginia)

– US North Central (Illinois)

– US South Central (Texas)

– US West (California)

• Europe

– Europe North (Ireland)

– Europe West (Netherlands)

• Asia Pacific

– East (Hong Kong)

– Southeast (Singapore)

• Japan

– Japan East (Saitama)

– Japan West (Osaka)

• Brazil

– Sao Paulo


Fault Domains in Azure

• In Azure there is the concept of Fault Domains

• A Fault Domain is essentially a rack in a given datacenter

• A consumer is not able to define which fault domains the application is distributed to

– Unlike an availability zone

• As a result the fault domain is really an internal structure


Upgrade Domains in Azure

• An upgrade domain is similar to a fault domain

• Essentially, all the nodes in an upgrade domain are upgraded at the same time

– When Microsoft upgrades their internal infrastructure they do so one domain at a time

• To guard against both failures within a fault domain and outages during upgrades, you need to replicate across both fault and upgrade domains

• This is called an availability set
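
The replication rule above can be checked with a few lines of code. This is a sketch only: the `(fault_domain, upgrade_domain)` tuple representation and the function name are assumptions made for illustration, not part of any Azure API, and the check covers the loss of one domain at a time.

```python
def survives_single_domain_loss(placements):
    """Return True if at least one instance survives the loss of any ONE domain.

    `placements` is a list of (fault_domain, upgrade_domain) pairs, one per
    instance (a hypothetical representation for this sketch).
    """
    fault_domains = {fd for fd, _ in placements}
    upgrade_domains = {ud for _, ud in placements}
    # If instances span at least two fault domains, losing any single one
    # leaves an instance running elsewhere; the same holds for upgrade domains.
    return len(fault_domains) >= 2 and len(upgrade_domains) >= 2


print(survives_single_domain_loss([(0, 0), (1, 1)]))  # spread over both kinds of domain
print(survives_single_domain_loss([(0, 0), (0, 1)]))  # both instances share a rack
```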


Azure Availability Sets


Amazon Auto Scaling

• Auto Scaling works in conjunction with CloudWatch (Amazon’s monitoring service)

• The idea is that the monitoring service monitors the metrics

– CPU utilization

– Latency

– Memory consumption

• The Auto Scaling solution establishes the rules

– Add instances when utilization exceeds 70%

– Remove instances when utilization falls below 10%

• You can specify things like a “cooling off” period

– Where no action is taken until the system has a chance to stabilize
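
The rules and the cooling off period can be sketched as a small simulation. This is not the Auto Scaling API: the class names, the 300 second cooldown, and the instance limits are assumptions chosen for illustration. Only the 70% and 10% thresholds come from the slide.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    scale_out_above: float = 70.0   # add an instance above this utilization (%)
    scale_in_below: float = 10.0    # remove an instance below this utilization (%)
    cooldown_seconds: int = 300     # assumed "cooling off" period
    min_instances: int = 1          # assumed lower bound
    max_instances: int = 10         # assumed upper bound


class AutoScaler:
    def __init__(self, policy, instances=1):
        self.policy = policy
        self.instances = instances
        self._last_action_at = None

    def observe(self, cpu_percent, now):
        """Apply the rules to one metric sample taken at time `now` (seconds)."""
        p = self.policy
        in_cooldown = (
            self._last_action_at is not None
            and now - self._last_action_at < p.cooldown_seconds
        )
        if in_cooldown:
            return self.instances  # let the system stabilize first
        if cpu_percent > p.scale_out_above and self.instances < p.max_instances:
            self.instances += 1
            self._last_action_at = now
        elif cpu_percent < p.scale_in_below and self.instances > p.min_instances:
            self.instances -= 1
            self._last_action_at = now
        return self.instances


scaler = AutoScaler(ScalingPolicy(), instances=2)
for timestamp, cpu in [(0, 85), (60, 90), (400, 90), (800, 5)]:
    print(timestamp, cpu, scaler.observe(cpu, timestamp))
```

The second sample arrives inside the cooldown window, so it is ignored even though utilization is still above the threshold.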


Amazon Elastic Load Balancer

• This is Amazon’s load balancing solution

– Recall the push/pull architecture discussion

• It tracks the status and location of instances

• Routes requests to healthy instances based on criteria that you establish

• Can be used in conjunction with Auto Scaling

– When instances are added or removed they are registered with or deregistered from the ELB

• Can be used in conjunction with Amazon’s DNS (Route 53)

– You can use DNS failover to move from one region to another

– The DNS will route traffic to the ELB in the target region
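
The core idea (track instances, route only to healthy ones) can be sketched as follows. This is a toy model, not the ELB API: the class, its method names, and the round-robin criterion are assumptions for illustration.

```python
class LoadBalancer:
    """Toy load balancer: tracks instance health and routes round-robin."""

    def __init__(self):
        self._healthy = {}   # instance id -> is the instance healthy?
        self._counter = 0

    def register(self, instance_id):
        # Called when an instance is added, e.g. by a scale-out action.
        self._healthy[instance_id] = True

    def deregister(self, instance_id):
        # Called when an instance is removed, e.g. by a scale-in action.
        self._healthy.pop(instance_id, None)

    def report_health(self, instance_id, healthy):
        if instance_id in self._healthy:
            self._healthy[instance_id] = healthy

    def route(self):
        """Pick the next healthy instance in round-robin order."""
        candidates = [i for i, ok in self._healthy.items() if ok]
        if not candidates:
            raise RuntimeError("no healthy instances available")
        choice = candidates[self._counter % len(candidates)]
        self._counter += 1
        return choice


lb = LoadBalancer()
for name in ("a", "b", "c"):
    lb.register(name)
lb.report_health("b", False)          # "b" fails its health check
print([lb.route() for _ in range(4)])  # requests skip "b"
```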


Amazon Simple Queue Service

• SQS is Amazon’s queuing service

– Again recall the push/pull architecture discussion

• It’s a service that supports message queues

• Recall it can be used in conjunction with Auto Scaling to manage the elasticity of your application

• Pricing is per million requests handled
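
One way a queue can drive elasticity is to size the worker pool from the queue depth. The sketch below shows that arithmetic only; the function name, the 100 messages per worker target, and the worker limits are assumptions for illustration, not SQS or Auto Scaling settings.

```python
import math


def desired_workers(queue_depth, messages_per_worker=100,
                    min_workers=1, max_workers=20):
    """Number of workers needed to keep the backlog per worker near the target."""
    needed = math.ceil(queue_depth / messages_per_worker)
    # Clamp to the configured bounds so the pool never vanishes or explodes.
    return max(min_workers, min(max_workers, needed))


for depth in (0, 950, 10_000):
    print(depth, desired_workers(depth))
```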


Amazon Storage Solutions

• Amazon has several storage solutions

– Elastic Block Store (EBS)

– Simple Storage Service (S3)

– Glacier

• These provide raw unmanaged storage

• This is useful for:

– Disaster recovery

– Backup

– Archiving

– Persistence for your own database solution


Amazon Elastic Block Store

Amazon Elastic Block Store (EBS) is Amazon’s block storage for EC2 instances. Some of its features are:

• Data is persisted independently from instances

• EBS data is placed in a specific availability zone and can be attached to instances in the same availability zone

• EBS data is automatically replicated within its availability zone

• There are two networks that connect EBS instances

– A high speed network to provide coordination among instances and move data between instances

– A lower speed network used as backup for coordination

• $0.05 per million I/O requests


Amazon Simple Storage Service (S3)

• S3 is a scalable storage solution

• Good for content storage and distribution

• Good for backup, archiving, and disaster recovery

• Costs $0.03 per GB of data per month

• More expensive but faster than Glacier

• Not as fast for I/O as EBS


Amazon Glacier

• Low cost storage solution

• Good for off site archival of Enterprise data

• Good for backup and data archiving

• Good for large volumes of data

• Costs $0.01 per GB of data per month
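
To compare the S3 and Glacier prices quoted in these slides, a quick calculation helps. It assumes the per-GB prices are monthly storage charges and ignores request and retrieval fees; the function name is made up for illustration, and the prices are the ones from the slides, not current ones.

```python
# Prices per GB as quoted in the slides (ASSUMED to be per month).
PRICE_PER_GB = {"S3": 0.03, "Glacier": 0.01}


def monthly_storage_cost(gigabytes, service):
    """Storage-only cost for one month; requests and retrievals are ignored."""
    return gigabytes * PRICE_PER_GB[service]


archive_gb = 5000
for service in PRICE_PER_GB:
    print(service, round(monthly_storage_cost(archive_gb, service), 2))
```

At these prices the same archive costs three times as much in S3 as in Glacier, which is the trade-off against Glacier's slower access.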


Amazon Database Solutions

• Amazon has a number of fully managed database solutions

• These are built on top of one of Amazon’s storage solutions

• They include:

– DynamoDB

– Relational Database Service (RDS)

– Redshift

– ElastiCache


DynamoDB

• Key-value data store

• Uses a throughput-oriented pricing model (rather than a storage-oriented model)

• Uses solid state drives

• Guarantees single-digit millisecond read latencies

• You pay a flat hourly rate based on capacity that you reserve

– Costs $0.0065 per hour for every 10 units of write capacity

– Costs $0.0065 per hour for every 10 units of read capacity
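
The throughput-oriented pricing can be worked through with the rates quoted above. This is a sketch: the function names are made up, rounding up to whole blocks of 10 units is an assumption, and the 720 hour month is a convenience figure.

```python
import math

# Rate quoted in the slide, applied to reads and writes alike.
PRICE_PER_10_UNITS_PER_HOUR = 0.0065
HOURS_PER_MONTH = 720  # assumed 30-day month


def hourly_cost(read_units, write_units):
    """Hourly cost of reserved capacity, ASSUMING billing in blocks of 10 units."""
    read_blocks = math.ceil(read_units / 10)
    write_blocks = math.ceil(write_units / 10)
    return (read_blocks + write_blocks) * PRICE_PER_10_UNITS_PER_HOUR


def monthly_cost(read_units, write_units):
    return hourly_cost(read_units, write_units) * HOURS_PER_MONTH


# 200 read units and 100 write units reserved around the clock.
print(round(hourly_cost(200, 100), 4))
print(round(monthly_cost(200, 100), 2))
```

Note that the cost depends only on the capacity reserved, not on how much data is stored or how many requests are actually made.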


Relational Database Service (RDS)

• A web service that provides a managed relational database for use in applications

• It provides access to MySQL, Oracle, SQL Server, or PostgreSQL

• It simplifies installation, patching, and backup related issues

• Priced per hour according to database type, instance size, and number of instances


Redshift

• Redshift is Amazon’s data warehousing solution

• Integrates with other storage solutions

• Priced from $0.25 per hour at the low end

• Or about $1000 per terabyte per year


ElastiCache

• A web service that provides an in-memory data cache

• Supports:

– Memcached

– Redis

• Improves latency and throughput for read-heavy applications

• Prices are per Cache node/hour
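
Why a cache helps read-heavy applications can be seen in a small sketch. A plain dictionary stands in for Memcached or Redis here, and the class and the fake database function are assumptions for illustration, not the ElastiCache API.

```python
class CachedReader:
    """Serve reads from an in-memory cache, falling back to the database."""

    def __init__(self, load_from_database):
        self._load = load_from_database
        self._cache = {}   # stand-in for a Memcached/Redis node
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # Cache miss: pay the database round trip once, then remember it.
        self.misses += 1
        value = self._load(key)
        self._cache[key] = value
        return value


def slow_database_lookup(key):
    # Hypothetical stand-in for a real query.
    return f"row-for-{key}"


reader = CachedReader(slow_database_lookup)
for key in ["user:1", "user:1", "user:2", "user:1"]:
    reader.get(key)
print(reader.hits, reader.misses)
```

Only the first read of each key reaches the database; repeated reads are served from memory.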


Amazon CloudFront

• Amazon’s content delivery network

• Provides edge services

– Competes with companies such as Akamai

• This service will allow you to locate content closer to users

– Reduces latency

• You specify the edge location and point it to the origin

• You can route DNS to the edge location if you want


Amazon Elastic IP Addressing

• Amazon provides elastic IP addressing

• The IP address is associated with your account – not with an instance

• You can programmatically map the elastic IP to any instance in your account

• In this way you make the deployment configuration transparent to the user/application

– Remember the virtual network discussion?
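
The remapping idea can be sketched with a simple lookup table. This is a toy model, not the EC2 API: the class, the instance ids, and the address (taken from a documentation-only IP range) are all hypothetical.

```python
class ElasticIpTable:
    """Toy model: addresses belong to the account and point at an instance."""

    def __init__(self):
        self._mapping = {}   # elastic IP -> instance id

    def associate(self, elastic_ip, instance_id):
        # Re-associating an address silently replaces the old target.
        self._mapping[elastic_ip] = instance_id

    def resolve(self, elastic_ip):
        return self._mapping[elastic_ip]


table = ElasticIpTable()
table.associate("203.0.113.10", "instance-old")
print(table.resolve("203.0.113.10"))

# The old instance fails; clients keep using the same address.
table.associate("203.0.113.10", "instance-new")
print(table.resolve("203.0.113.10"))
```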


Many Other Services Available

• Authentication services

• Analytics

• Elastic Map Reduce

• Real time data streaming and processing

• Business process automation services

• Email services

• Notification services

• …


Comparison to Other Providers

• Other major providers (Google, Microsoft, Rackspace) offer similar services

• Google doesn’t have as many services but has a different pricing model

– Charges in 10-minute increments rather than one-hour increments

• Microsoft has similar services

• Rackspace also provides comparable options
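
The effect of the billing increment mentioned for Google can be shown with a short calculation. It is a sketch under the assumption that usage is rounded up to the next whole increment; the function name is made up, and the increments reflect the slide, not current pricing.

```python
import math


def billed_minutes(runtime_minutes, increment_minutes):
    """Round usage up to the next whole billing increment (ASSUMED policy)."""
    return math.ceil(runtime_minutes / increment_minutes) * increment_minutes


runtime = 70  # a job that runs for 1 hour 10 minutes
print(billed_minutes(runtime, 60))  # billed under one-hour increments
print(billed_minutes(runtime, 10))  # billed under 10-minute increments
```

For short or bursty workloads the smaller increment can matter more than the headline hourly rate.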


Outages

• In Amazon (and others) there are some kinds of outages that are specific to the structure of the provider

• We will now look at some of these outages


Zone Failure

• All of the IaaS providers have some notion of an “availability zone”

• An availability zone (or fault domain in Azure) has its own switch, router, and rack

• These availability zones are isolated from each other in a way that nodes within an availability zone are not


Zone Failure Modes

• A zone can fail in different ways

[Figure: a Region containing Zone 1, Zone 2, and Zone 3]


Complete Failure

• If for example you have a power outage you’ll have a complete failure

• If you try to route traffic to any of these machines you’ll get a “no route to host” error

– This happens quickly – fast fail

• You’ll know the zone is out

• You can then spin up replacement instances in another zone


Zone Failure Modes

• You could have a network failure




Network Failure

• If you have a network failure it’s typically not a complete failure

• The machines are still working but the network is having trouble

• There is often still a route to host but your data isn’t reaching the host

• As a result you don’t get a fast fail

– You’ll get long timeouts


Network Failure

• With the long timeouts your system will start to back up

• It’s difficult to tell the difference between this issue and other issues that result in latency lags

• This problem can be intermittent as some of the routers might be down but not all
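
The difference between a fast fail and a long timeout can be made visible with a small probe. This is a sketch: the `probe` function and its outcome labels are assumptions for illustration, and the `connect` parameter exists only so the behaviour can be demonstrated without a real network.

```python
import socket
import time


def probe(host, port, timeout=2.0, connect=socket.create_connection):
    """Classify one connection attempt as "ok", "fast-fail", or "timeout"."""
    started = time.monotonic()
    try:
        connection = connect((host, port), timeout)
    except (socket.timeout, TimeoutError):
        # Packets are going somewhere but nothing answers: the slow case.
        outcome = "timeout"
    except OSError:
        # E.g. "no route to host" or "connection refused": fails quickly.
        outcome = "fast-fail"
    else:
        connection.close()
        outcome = "ok"
    return outcome, time.monotonic() - started


# Simulated failures (hypothetical stand-ins for real network behaviour).
def unreachable(address, timeout):
    raise OSError("No route to host")


def black_hole(address, timeout):
    raise socket.timeout("timed out")


print(probe("zone-a.example", 80, connect=unreachable)[0])
print(probe("zone-b.example", 80, connect=black_hole)[0])
```

A monitoring loop built on such a probe should choose a short timeout, otherwise the probes themselves back up in the way described above.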


Zone Failure Modes

• You could have a failure of some zone service




Zone Service Failure

• This is when a service that the zone depends on fails

– It could be something that is part of the platform as a service (e.g. EBS)

– It could also be a central service in your application

• This causes cascading failures

• Difficult to figure out what is going on


Region Failure

• It’s rare but a Region can fail as well

• Both complete and partial failures have happened

• Typically this starts with isolated issues that cascade

• There might be an issue with a few nodes or with a single availability zone

• Other zones become impacted (often due to additional traffic) and fail

– It can be difficult to determine the scope of the issue while it’s occurring


Regional Failure Modes

• You could lose network access to a region




Regional Outage

• This is often caused by

– A DNS issue

– Router issues

– Network capacity overload

• Causes you to lose access to a region


Regional Failure Modes

• Local failures can cause a control plane overload




Data Store Failure

• As with the other portions of the system the data store can become unresponsive

• The remedy for this is typically to mark this node as bad and attempt to bring a new node online

• If the issue is more pervasive it can result in:

– Disrupted availability

– Loss of persistent data


Backup Failure

• Systems will often have a backup data mechanism

• This is often a key component in disaster recovery

• This can also fail

– It can become temporarily or permanently unavailable


Upgrades

• Cloud providers need to upgrade their software as well

• When they do this the nodes that are being upgraded experience an outage

• If your software is running on these nodes you might experience an outage as well


Utilizing AWS

• You can utilize AWS in many ways

– You can host your entire application in the cloud

– You can host a specific portion of your application in the cloud

– You can use the cloud for a specialized need


Hosting Your Application

• You can have a system that is fully deployed in the cloud

• You’ll need to figure out how to structure the application to achieve both functional and quality attribute needs

• You’ll want to first consider quality attribute concerns such as:

– Scalability

– Availability

– Security

– …

• Utilize the techniques we talked about to determine the needs

– Fault modeling (considering the cloud specific faults)

– Threat modeling

– Understanding the anticipated load and desired throughput and latency

• Come up with a gross structure that achieves your objectives

– Think about partitioning of the system to support testing, degraded modes of operation and independent deployment


Partial Hosting

• You might want to leverage the cloud for a specific portion of your system e.g.

– Supporting mobile applications

– Databases

– Analytics

– Delivery of particular content

– Hosting your front end

– …

• This is typically going to be driven by cost and quality attribute needs (e.g. scalability)


Backup and Recovery

• Many organizations utilize the cloud for bulk storage, archiving, or backup and recovery

• In the past external services were used for such needs

– They often stored data on tape in separate physical locations

• It can be cheaper and more convenient to utilize cloud services

• As a result many organizations use the cloud for such storage needs


Summary

• Many services are available in the cloud

– Storage

– Network

– Compute related services

– …

• These services provide different levels of service at different pricing levels

• Utilizing the cloud appropriately and efficiently takes an explicit understanding of both your needs and the services available
