Architecting for the Cloud: Cloud Providers
© Matthew Bass 2013
Architecting for the Cloud
Len and Matt Bass
Cloud Providers
IaaS Providers
• There are several primary providers
– Amazon: Amazon Web Services (AWS)
– Microsoft: Azure
– Google: Google Compute Engine
– …
• Each of these is set up a bit differently, with slightly different internal design decisions and associated services
Goals
• The goal of this talk is not to give you a definitive how-to for each provider
• It’s meant only as an introduction
• The idea is that you’ll see how the concepts that we talked about in the course map to specific providers
• We’ll look primarily at Amazon (with some details from others thrown in)
• We’ll go through both the overall structure and look at specific services
Amazon Elastic Compute Cloud
• Amazon EC2 provides compute capacity in the cloud
• You can select the machine image with a given OS and specified capability
• You can resize the capacity as needed
• It takes minutes to spin up a new VM
• You can specify multiple instances and select where they will run
– Region and availability zones
• You pay per hour of usage; the rate depends on the capability of the instance and on whether it is a reserved instance (capacity committed in advance at a lower rate)
Regions
• Amazon has divided its cloud offerings into multiple regions
• Each region should be thought of as a separate cloud
– I.e. there is no automatic copying of data from one region to another
Current AWS Regions
• North America
– US East (5 availability zones)
– US West Oregon (3 availability zones)
– US West Northern California (3 availability zones)
– US GovCloud (2 availability zones)
• South America
– Sao Paulo (2 availability zones)
• Europe
– Ireland (2 availability zones)
• Asia Pacific
– Sydney (2 availability zones)
– Singapore (2 availability zones)
– China (1 availability zone)
– Tokyo (3 availability zones)
AWS and Services
• Amazon Web Services offers a number of services
• These services are things like:
– Storage
– Database
– Network capabilities
– Monitoring
– …
• Not all services are available in all regions
– https://aws.amazon.com/about-aws/globalinfrastructure/regional-product-services/
Amazon Availability Zones
• Amazon has a notion of availability zones
• Engineered to be insulated from failures in other availability zones
• Availability zones are locations within a region
• Amazon has not announced the details of an availability zone, but presumably each one:
– Is a physically separate data center
– Has an independent network
– Has independent power delivery
– …
Amazon Service Level Agreement
• Amazon guarantees 99.95% availability for each region
• IaaS consumers are free to deploy their applications:
– Within an availability zone
– Across availability zones but within a region
– Across regions
• Amazon does not make any claim about the availability of their availability zones (that I could find)
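To make the 99.95% figure concrete, the sketch below converts an availability percentage into the downtime it permits. The function name and the 365-day year are assumptions made for illustration; the actual SLA defines its own measurement period and exclusions.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime permitted by an availability percentage."""
    return (1 - availability_pct / 100) * period_hours


# 99.95% availability leaves roughly 4.38 hours of downtime per year.
print(f"{allowed_downtime_hours(99.95):.2f} hours per year")
```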
All-in-one Single Server
Basic 4-server Setup
Multiple Availability Zones
Multiple Regions
Elastic Compute Cloud (EC2) & Redundancy
• EC2 supports different levels of redundancy
– It is up to the customer to determine how much redundancy they wish to have and how much they wish to pay for it
• Redundant elements can be:
– Within an availability zone
– Across availability zones
– Across regions
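A rough way to reason about how much redundancy buys is sketched below. It assumes that any one copy is enough to serve traffic and that copies fail independently, which is optimistic because a zone or region outage can take several copies down together. The 0.99 figure is a hypothetical input, not a number from any provider.

```python
def combined_availability(single, copies):
    """Availability of `copies` redundant elements when any one suffices.

    Assumes independent failures; correlated failures (for example a
    whole zone going down) make the real figure lower.
    """
    return 1 - (1 - single) ** copies


for copies in (1, 2, 3):
    print(copies, f"{combined_availability(0.99, copies):.6f}")
```

Each extra copy costs more to run, so the trade-off between redundancy and price remains the customer's decision.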
Microsoft Azure Regions
• North America
– US Central (Iowa)
– US East (Virginia)
– US East 2 (Virginia)
– US North Central (Illinois)
– US South Central (Texas)
– US West (California)
• Europe
– Europe North (Ireland)
– Europe West (Netherlands)
• Asia Pacific
– East (Hong Kong)
– Southeast (Singapore)
• Japan
– Japan East (Saitama)
– Japan West (Osaka)
• Brazil
– Sao Paulo
Fault Domains in Azure
• In Azure there is the concept of Fault Domains
• A Fault Domain is essentially a rack in a given datacenter
• A consumer is not able to choose which fault domains the application is distributed to
– Unlike an availability zone
• As a result the fault domain is really an internal structure
Upgrade Domains in Azure
• An upgrade domain is similar to a fault domain
• Essentially, all of the nodes in an upgrade domain are upgraded at the same time
– When Microsoft upgrades their internal infrastructure they do so a domain at a time
• To guard against both failures within a fault domain and outages during upgrades, you need to replicate across both fault and upgrade domains
• This is called an availability set
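The idea of spreading replicas across both kinds of domain can be sketched as follows. The round-robin placement and the domain counts (2 fault domains, 5 upgrade domains) are assumptions for illustration only; they are not taken from the slides and do not describe Azure's actual placement algorithm.

```python
def place_instances(count, fault_domains=2, upgrade_domains=5):
    """Assign each instance a (fault domain, upgrade domain) pair round-robin."""
    return [(i % fault_domains, i % upgrade_domains) for i in range(count)]


def survives_single_domain_loss(placement):
    """True if at least one instance remains after losing any one fault
    domain, or while any one upgrade domain is being upgraded."""
    if not placement:
        return False
    for axis in (0, 1):  # 0 = fault domain, 1 = upgrade domain
        for domain in {p[axis] for p in placement}:
            if all(p[axis] == domain for p in placement):
                return False  # every instance sits in this one domain
    return True


print(place_instances(3))
print(survives_single_domain_loss(place_instances(1)))  # a single instance is exposed
print(survives_single_domain_loss(place_instances(2)))
```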
Azure Availability Sets
Amazon Auto Scaling
• Auto Scaling works in conjunction with CloudWatch (Amazon’s monitoring service)
• The idea is that the monitoring service monitors metrics such as
– CPU utilization
– Latency
– Memory consumption
• The Auto Scaling solution establishes the rules, for example
– Add instances when utilization exceeds 70%
– Remove instances when utilization falls below 10%
• You can specify things like a “cooling off” period
– Where no action is taken until the system has a chance to stabilize
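The rule-plus-cooldown behavior described above can be sketched as a small simulation. This is not the Auto Scaling API: the function names, the 300-second cooldown, and the sample data are all hypothetical; only the 70% and 10% thresholds come from the slide.

```python
def scaling_decision(cpu_pct, now, last_action_at, cooldown_s=300,
                     scale_out_above=70, scale_in_below=10):
    """Return +1 (add an instance), -1 (remove one), or 0 (do nothing)."""
    if last_action_at is not None and now - last_action_at < cooldown_s:
        return 0  # still cooling off: let the system stabilize
    if cpu_pct > scale_out_above:
        return 1
    if cpu_pct < scale_in_below:
        return -1
    return 0


def simulate(samples, instances=2, min_instances=1):
    """Apply the rule to (timestamp_s, cpu_pct) samples; return instance counts."""
    last_action_at = None
    history = []
    for now, cpu in samples:
        delta = scaling_decision(cpu, now, last_action_at)
        if delta and instances + delta >= min_instances:
            instances += delta
            last_action_at = now
        history.append(instances)
    return history


# The 90% sample at t=60 is ignored because it falls inside the cooldown.
print(simulate([(0, 85), (60, 90), (400, 50), (800, 5)]))  # [3, 3, 3, 2]
```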
Amazon Elastic Load Balancer
• This is Amazon’s load balancing solution
– Recall the push/pull architecture discussion
• It tracks the status and location of instances
• Routes requests to healthy instances based on criteria that you establish
• Can be used in conjunction with Auto Scaling
– When instances are added or removed they are registered with or deregistered from the ELB
• Can be used in conjunction with Amazon’s DNS (Route 53)
– You can use DNS failover to move from one region to another
– The DNS will route traffic to the ELB in the target region
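As an illustration of tracking instance status and routing only to healthy instances, here is a toy balancer. It is a minimal sketch with hypothetical class, method, and instance names, and it uses simple round-robin; it does not model the real ELB's routing criteria or health checks.

```python
from itertools import cycle


class LoadBalancer:
    """Toy round-robin balancer that skips instances marked unhealthy."""

    def __init__(self):
        self.healthy = {}  # instance id -> True/False

    def register(self, instance_id):
        self.healthy[instance_id] = True

    def deregister(self, instance_id):
        self.healthy.pop(instance_id, None)

    def mark(self, instance_id, is_healthy):
        if instance_id in self.healthy:
            self.healthy[instance_id] = is_healthy

    def route(self, request_count):
        """Return the instance chosen for each of `request_count` requests."""
        targets = [i for i, ok in self.healthy.items() if ok]
        if not targets:
            raise RuntimeError("no healthy instances")
        rotation = cycle(targets)
        return [next(rotation) for _ in range(request_count)]


lb = LoadBalancer()
for name in ("i-a", "i-b", "i-c"):
    lb.register(name)
lb.mark("i-b", False)  # a failed health check takes i-b out of rotation
print(lb.route(4))     # ['i-a', 'i-c', 'i-a', 'i-c']
```

Registering and deregistering through the same interface is what lets an auto scaler add or remove capacity without clients noticing.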
Amazon Simple Queue Service
• SQS is Amazon’s queuing service
– Again recall the push/pull architecture discussion
• It’s a service that supports message queues
• Recall it can be used in conjunction with Auto Scaling to manage the elasticity of your application
• Pricing is per million requests handled
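One way a queue can drive elasticity is to size the worker pool from the backlog, as in the sketch below. The in-memory deque stands in for a real queue, and the function name, the 100-messages-per-worker target, and the pool limits are hypothetical values chosen for illustration.

```python
import math
from collections import deque


def desired_workers(queue_depth, msgs_per_worker=100,
                    min_workers=1, max_workers=10):
    """Size the worker pool from the backlog, clamped to [min, max]."""
    needed = math.ceil(queue_depth / msgs_per_worker)
    return max(min_workers, min(max_workers, needed))


queue = deque(f"msg-{n}" for n in range(250))  # stand-in for a message queue
print(desired_workers(len(queue)))  # 3
```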
Amazon Storage Solutions
• Amazon has several storage solutions
– Elastic Block Store (EBS)
– Simple Storage Service (S3)
– Glacier
• These provide raw unmanaged storage
• This is useful for:
– Disaster recovery
– Backup
– Archiving
– Persistence for your own database solution
Amazon Elastic Block Store
Amazon Elastic Block Store (EBS) is Amazon’s block storage service: volumes that behave like disks attached to an instance. Some of its features are:
• Data is persisted independently from instances
• EBS data is placed in a specific availability zone and can be attached to instances in the same availability zone
• EBS data is automatically replicated within its availability zone
• There are two networks that connect EBS instances
– A high speed network to provide coordination among instances and move data between instances
– A lower speed network used as backup for coordination
• Costs $0.05 per million I/O requests
Amazon Simple Storage Service (S3)
• S3 is a scalable storage solution
• Good for content storage and distribution
• Good for backup, archiving, and disaster recovery
• Costs $0.03 per GB of data per month
• More expensive but faster than Glacier
• Not as fast for I/O as EBS
Amazon Glacier
• Low cost storage solution
• Good for off-site archival of enterprise data
• Good for backup and data archiving
• Good for large volumes of data
• Costs $0.01 per GB of data per month
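To compare the two storage prices quoted on these slides, the sketch below computes the cost of holding the same data in each service. Treating the prices as per GB per month is an assumption, the 5000 GB volume is hypothetical, and the figures are the ones on the slides, so they are likely out of date. Retrieval and request charges are ignored.

```python
PRICE_PER_GB = {"S3": 0.03, "Glacier": 0.01}  # figures quoted on the slides


def storage_cost(gigabytes, service):
    """Cost of storing `gigabytes` for one billing period (assumed: one month)."""
    return gigabytes * PRICE_PER_GB[service]


for service in PRICE_PER_GB:
    print(service, f"${storage_cost(5000, service):.2f}")
```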
Amazon Database Solutions
• Amazon has a number of fully managed database solutions
• These are built on top of one of Amazon’s storage solutions
• They include:
– DynamoDB
– Relational Database Service (RDS)
– Redshift
– ElastiCache
DynamoDB
• Key-value data store
• Uses a throughput-oriented pricing model (rather than a storage-oriented model)
• Uses solid state drives
• Advertises single-digit millisecond read latencies
• You pay a flat hourly rate based on capacity that you reserve
– Costs $0.0065 per hour for every 10 units of write capacity
– Costs $0.0065 per hour for every 10 units of read capacity
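The throughput-oriented pricing can be worked through with a short calculation. The rate is the one quoted on the slide; billing in whole blocks of 10 units, the 100/200 capacity figures, and the 30-day month are assumptions made for the example.

```python
import math

RATE_PER_BLOCK_HOUR = 0.0065  # per 10 capacity units, as quoted on the slide
UNITS_PER_BLOCK = 10


def hourly_cost(write_units, read_units):
    """Hourly cost of reserved capacity, billed in blocks of 10 units.

    Rounding partial blocks up is an assumption, not a documented rule.
    """
    blocks = (math.ceil(write_units / UNITS_PER_BLOCK)
              + math.ceil(read_units / UNITS_PER_BLOCK))
    return blocks * RATE_PER_BLOCK_HOUR


per_hour = hourly_cost(write_units=100, read_units=200)
print(f"${per_hour:.3f} per hour, ${per_hour * 24 * 30:.2f} per 30-day month")
```

Note that the cost depends only on reserved throughput, not on how much data is stored or how many requests are actually made.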
Relational Database Service (RDS)
• A web service that provides a managed relational database for use in applications
• It provides access to MySQL, Oracle, SQL Server, or PostgreSQL
• It simplifies installation, patching, and backup related issues
• Priced per hour according to database engine, instance size, and number of instances
Redshift
• Redshift is Amazon’s data warehousing solution
• Integrates with other storage solutions
• Priced either per hour (from $0.25 on the low end)
• Or at $1000 per terabyte per year
ElastiCache
• A web service that provides an in-memory data cache
• Supports:
– Memcached
– Redis
• Improves latency and throughput for read-heavy applications
• Priced per cache node per hour
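Why a cache helps read-heavy applications can be seen in a cache-aside sketch. Plain dictionaries stand in for the database and for Memcached or Redis, and the class and key names are hypothetical; a real deployment would use a client library for the cache and would also need an expiry policy.

```python
class CachedStore:
    """Cache-aside reads in front of a slower backing store."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # stand-in for a database
        self.cache = {}                     # stand-in for Memcached or Redis
        self.backend_reads = 0

    def get(self, key):
        if key in self.cache:
            return self.cache[key]          # hit: no trip to the database
        self.backend_reads += 1             # miss: read through and remember
        value = self.backing_store[key]
        self.cache[key] = value
        return value

    def put(self, key, value):
        self.backing_store[key] = value
        self.cache.pop(key, None)           # invalidate the stale cached copy


store = CachedStore({"user:1": "Ada"})
for _ in range(3):
    store.get("user:1")
print(store.backend_reads)  # 1: only the first read reached the backing store
```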
Amazon CloudFront
• Amazon’s content delivery network
• Provides edge services
– Competes with companies such as Akamai
• This service will allow you to locate content closer to users
– Reduces latency
• You specify the edge location and point it to the origin
• You can route DNS to the edge location if you want
Amazon Elastic IP Addressing
• Amazon provides elastic IP addressing
• The IP address is associated with your account – not with an instance
• You can programmatically map the elastic IP to any instance in your account
• In this way you make the deployment configuration transparent to the user/application
– Remember the virtual network discussion?
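The remapping idea can be sketched with a simple lookup table. This is not the AWS API: the class, the instance names, and the address (taken from a documentation-only IP range) are hypothetical, and the table only illustrates that clients keep using one address while the instance behind it changes.

```python
class ElasticIpTable:
    """Account-level mapping from a fixed public address to an instance."""

    def __init__(self):
        self.mapping = {}

    def associate(self, elastic_ip, instance_id):
        self.mapping[elastic_ip] = instance_id  # replaces any earlier mapping

    def resolve(self, elastic_ip):
        return self.mapping[elastic_ip]


table = ElasticIpTable()
table.associate("203.0.113.10", "i-primary")
print(table.resolve("203.0.113.10"))

# The primary fails: remap the same address to a standby instance.
table.associate("203.0.113.10", "i-standby")
print(table.resolve("203.0.113.10"))
```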
Many Other Services Available
• Authentication services
• Analytics
• Elastic MapReduce
• Real time data streaming and processing
• Business process automation services
• Email services
• Notification services
• …
Comparison to Other Providers
• Other major providers (Google, Microsoft, Rackspace) offer similar services
• Google doesn’t have as many services but has a different pricing model
– Charges in 10-minute increments rather than one-hour increments
• Microsoft has similar services
• Rackspace also provides comparable options
Outages
• In Amazon (and others) there are some kinds of outages that are specific to the structure of the provider
• We will now look at some of these outages
Zone Failure
• All of the IaaS providers have some notion of an “availability zone”
• An availability zone (or fault domain in Azure) has its own switch, router, and rack
• These availability zones are isolated from each other in a way that nodes within an availability zone are not
Zone Failure Modes
• A zone can fail in different ways
[Figure: a region containing Zone 1, Zone 2, and Zone 3]
Complete Failure
• If, for example, you have a power outage you’ll have a complete failure
• If you try to route traffic to any of these machines you’ll get a “no route to host” error
– This happens quickly – fast fail
• You’ll know the zone is out
• You can then spin up replacement instances in another zone
Zone Failure Modes
• You could have a network failure
[Figure: a region containing Zone 1, Zone 2, and Zone 3]
Network Failure
• If you have a network failure it’s typically not a complete failure
• The machines are still working but the network is having trouble
• There is often still a route to host but your data isn’t reaching the host
• As a result you don’t get a fast fail
– You’ll get long timeouts
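The difference between a fast fail and a long timeout shows up in how a connection attempt ends. The sketch below classifies the outcome of a single TCP connection attempt; the function names, the outcome labels, and the 2-second timeout are assumptions made for illustration.

```python
import socket
import time


def classify_failure(exc):
    """Map a connection error to the failure mode it suggests."""
    if isinstance(exc, socket.timeout):
        return "timeout"    # host looks routable but nothing answers in time
    return "fast_fail"      # e.g. "no route to host" or "connection refused"


def probe(host, port, timeout_s=2.0):
    """Try one TCP connection; return (outcome, seconds taken)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            outcome = "reachable"
    except OSError as exc:  # socket.timeout is a subclass of OSError
        outcome = classify_failure(exc)
    return outcome, time.monotonic() - start
```

A fast fail returns almost immediately, while a timeout always costs the full `timeout_s`, which is why requests pile up behind a partially failed network.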
Network Failure
• With the long timeouts your system will start to back up
• It’s difficult to tell the difference between this issue and other issues that result in increased latency
• This problem can be intermittent as some of the routers might be down but not all
Zone Failure Modes
• You could have a failure of some zone service
[Figure: a region containing Zone 1, Zone 2, and Zone 3]
Zone Service Failure
• This is when a service that the zone depends on fails
– It could be something that is part of the provider’s platform (e.g. EBS)
– It could also be a central service in your application
• This causes cascading failures
• Difficult to figure out what is going on
Region Failure
• It’s rare but a Region can fail as well
• Both complete and partial failures have happened
• Typically this starts with isolated issues that cascade
• There might be an issue with a few nodes or with a single availability zone
• Other zones become impacted (often due to additional traffic) and fail
– It can be difficult to determine the scope of the issue while it’s occurring
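The way additional traffic pushes surviving zones over the edge can be shown with simple arithmetic. The sketch assumes zones of equal capacity and an even redistribution of load, and the utilization figures are hypothetical.

```python
def utilization_after_failures(zones, utilization_pct, failed):
    """Per-zone utilization once the load of `failed` zones is spread
    evenly across the survivors (assumes equal capacity per zone)."""
    survivors = zones - failed
    if survivors <= 0:
        raise ValueError("no surviving zones")
    return utilization_pct * zones / survivors


print(utilization_after_failures(3, 60, 1))  # 90.0: survivors are nearly full
print(utilization_after_failures(3, 70, 1))  # 105.0: survivors are overloaded
```

Running each zone at 70% looks comfortable until one zone is lost, at which point the remaining two cannot absorb the load and may fail in turn.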
Regional Failure Modes
• You could lose network access to a region
[Figure: a region containing Zone 1, Zone 2, and Zone 3]
Regional Outage
• This is often caused by
– A DNS issue
– Router issues
– Network capacity overload
• Causes you to lose access to a region
Regional Failure Modes
• Local failures can cause a control plane overload
[Figure: a region containing Zone 1, Zone 2, and Zone 3]
Data Store Failure
• As with the other portions of the system the data store can become unresponsive
• The remedy for this is typically to mark this node as bad and attempt to bring a new node online
• If the issue is more pervasive it can result in:
– Disrupted availability
– Loss of persistent data
Backup Failure
• Systems will often have a backup data mechanism
• This is often a key component in disaster recovery
• This can also fail
– It can become temporarily or permanently unavailable
Upgrades
• Cloud providers need to upgrade their software as well
• When they do this the nodes that are being upgraded experience an outage
• If your software is running on these nodes you might experience an outage as well
Utilizing AWS
• You can utilize AWS in many ways
– You can host your entire application in the cloud
– You can host a specific portion of your application in the cloud
– You can use the cloud for a specialized need
Hosting Your Application
• You can have a system that is fully deployed in the cloud
• You’ll need to figure out how to structure the application to achieve both functional and quality attribute needs
• You’ll want to first consider quality attribute concerns such as:
– Scalability
– Availability
– Security
– …
• Utilize the techniques we talked about to determine the needs
– Fault modeling (considering the cloud-specific faults)
– Threat modeling
– Understanding the anticipated load and desired throughput and latency
• Come up with a gross structure that achieves your objectives
– Think about partitioning of the system to support testing, degraded modes of operation, and independent deployment
Partial Hosting
• You might want to leverage the cloud for a specific portion of your system, e.g.
– Supporting mobile applications
– Databases
– Analytics
– Delivery of particular content
– Hosting your front end
– …
• This is typically going to be driven by cost and quality attribute needs (e.g. scalability)
Backup and Recovery
• Many organizations utilize the cloud for bulk storage, archiving, or backup and recovery
• In the past external services were used for such needs
– They often stored data on tape in separate physical locations
• It can be cheaper and more convenient to utilize cloud services
• As a result many organizations use the cloud for such storage needs
Summary
• Many services are available in the cloud
– Storage
– Network
– Compute related services
– …
• These services provide different levels of service at different pricing levels
• Utilizing the cloud appropriately and efficiently takes an explicit understanding of both your needs and the services available