Post on 10-Jul-2015
Ianni Vamvadelis, Solution Architect
Architecting for high availability
What is High Availability (HA)?
• Percentage of time an application operates
• Loss of availability is known as an outage or downtime
– Planned and unplanned
– App is offline, unreachable, or partially available
– App is unresponsive
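As a worked example of "percentage of time an application operates", the sketch below converts an availability target into a yearly downtime budget and back. The function names and the 365-day year are assumptions made for illustration, not part of the talk.

```python
# Convert between an availability percentage and the downtime it allows.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored (assumption)


def downtime_minutes(availability_pct: float,
                     period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime budget implied by an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_minutes * (1 - availability_pct / 100)


def availability_pct(uptime_minutes: float, total_minutes: float) -> float:
    """Return availability as the share of the period the app was operating."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100 * uptime_minutes / total_minutes


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = downtime_minutes(target)
        print(f"{target}% available -> {budget:.1f} minutes of downtime per year")
```

Planned and unplanned outages both count against the same budget, which is why the definition above lists them together.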
HA is related to …
• Scalability
– Slow is often indistinguishable from unavailable
• Fault Tolerance
– Apps continue functioning when components fail
• Disaster Recovery
– Restoring service after a catastrophic event
HA and DR
• A continuum
• Part of a business continuity plan
• Not an all-or-nothing proposition
In the face of internal or external events, how do you…
– Keep your applications running 24x7
– Make sure your data is safe
– Get an application recovered after a major disaster
High Availability Disaster Recovery
How does AWS help with High Availability?
• US-WEST (Oregon)
• US-WEST (N. California)
• US-EAST (Virginia)
• AWS GovCloud (US)
• SOUTH AMERICA (Sao Paulo)
• EU-WEST (Ireland)
• ASIA PAC (Tokyo)
• ASIA PAC (Singapore)
• ASIA PAC (Sydney)
Automation
AWS SERVICES
Inherently Highly Available and Fault Tolerant Services
• Amazon S3
• Amazon DynamoDB
• Amazon CloudFront
• Amazon Route 53
• Elastic Load Balancing
• Amazon SQS
• Amazon SNS
• Amazon SES
• Amazon SWF
• …
Highly Available with the right architecture
• Amazon EC2
• Amazon EBS
• Amazon RDS
• Amazon VPC
AWS
Principles for HA
1. DESIGN FOR FAILURE
2. MULTIPLE AVAILABILITY ZONES
3. SCALING
4. SELF-HEALING
5. LOOSE COUPLING
LET’S BUILD A
HIGHLY AVAILABLE SYSTEM
#1 DESIGN FOR FAILURE
●○○○○
« Everything fails all the time »
Werner Vogels
CTO of Amazon
AVOID SINGLE POINTS OF FAILURE
ASSUME EVERYTHING FAILS,
AND WORK BACKWARDS
YOUR GOAL
Applications should continue to function
AMAZON EBS ELASTIC BLOCK STORE
AMAZON ELB ELASTIC LOAD BALANCING
HEALTH CHECKS
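A minimal sketch of how a load balancer style health check might decide which targets stay in service. It assumes a threshold scheme in which a target changes state only after several consecutive contradicting results; the `Target` class, the `in_service` helper, and the threshold values are hypothetical and are not the ELB API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Target:
    name: str
    healthy: bool = True
    streak: int = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool,
               healthy_threshold: int, unhealthy_threshold: int) -> None:
        """Flip state only after enough consecutive contradicting results."""
        if check_passed == self.healthy:
            self.streak = 0  # result agrees with current state; reset
            return
        self.streak += 1
        needed = healthy_threshold if check_passed else unhealthy_threshold
        if self.streak >= needed:
            self.healthy = check_passed
            self.streak = 0


def in_service(targets: List[Target]) -> List[Target]:
    """Only healthy targets should receive traffic."""
    return [t for t in targets if t.healthy]


if __name__ == "__main__":
    web = Target("web-1")
    for result in (False, False):  # two failed checks in a row
        web.record(result, healthy_threshold=3, unhealthy_threshold=2)
    print(web.name, "healthy:", web.healthy)
```

Requiring a streak rather than a single result keeps one slow response from pulling an otherwise working instance out of rotation.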
#2 MULTIPLE
AVAILABILITY ZONES ●●○○○
AMAZON RDS
MULTI-AZ
AMAZON ELB AND
MULTIPLE AZs
#3 SCALING
●●●○○
AUTO SCALING SCALE UP/DOWN EC2 CAPACITY
#4 SELF-HEALING
●●●●○
HEALTH CHECKS + AUTO SCALING = SELF-HEALING
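The combination of health checks and Auto Scaling can be pictured as a reconciliation loop: drop whatever fails its check, then launch replacements until the desired count is met. The sketch below is a toy model under that assumption; `reconcile`, the instance IDs, and the `launch` callback are hypothetical names, not AWS calls.

```python
import itertools
from typing import Callable, List


def reconcile(instances: List[str],
              is_healthy: Callable[[str], bool],
              desired: int,
              launch: Callable[[], str]) -> List[str]:
    """Remove unhealthy instances, then launch replacements up to `desired`."""
    survivors = [i for i in instances if is_healthy(i)]
    while len(survivors) < desired:
        survivors.append(launch())
    return survivors


if __name__ == "__main__":
    ids = itertools.count(4)  # next hypothetical instance number
    fleet = ["i-1", "i-2", "i-3"]
    failed = {"i-2"}  # pretend this instance stopped answering health checks
    fleet = reconcile(fleet,
                      is_healthy=lambda i: i not in failed,
                      desired=3,
                      launch=lambda: f"i-{next(ids)}")
    print(fleet)  # ['i-1', 'i-3', 'i-4']
```

No operator is involved in the loop, which is the point of the principle: the fleet returns to its desired size on its own.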
DEGRADED MODE
AMAZON S3 STATIC WEBSITE
+ AMAZON ROUTE 53
WEIGHTED RESOLUTION
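One way to picture weighted resolution for a degraded mode: each record is chosen with probability proportional to its weight, so shifting all weight to the static site sends every request there. This is a simplified sketch, not the Route 53 API; the endpoint names are hypothetical, and the sketch simply rejects the case where no record has a positive weight.

```python
import random
from typing import Dict, Optional


def resolve(weights: Dict[str, int],
            rng: Optional[random.Random] = None) -> str:
    """Pick an endpoint with probability weight / sum(weights)."""
    picker = rng or random.Random()
    endpoints = [name for name, weight in weights.items() if weight > 0]
    if not endpoints:
        raise ValueError("at least one record needs a positive weight")
    chosen = picker.choices(endpoints,
                            weights=[weights[name] for name in endpoints],
                            k=1)
    return chosen[0]


if __name__ == "__main__":
    normal = {"app-load-balancer": 100, "s3-static-site": 0}
    degraded = {"app-load-balancer": 0, "s3-static-site": 100}
    print("normal  ->", resolve(normal))
    print("degraded ->", resolve(degraded))
```

Switching between the two weight sets is a DNS change only, so the fallback does not depend on the application tier being reachable.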
#5 LOOSE
COUPLING ●●●●●
BUILD LOOSELY COUPLED SYSTEMS
The looser they are coupled, the bigger they scale,
the more fault tolerant they get…
AMAZON SQS SIMPLE QUEUE SERVICE
PUBLISH & NOTIFY
RECEIVE TRANSCODE
VISIBILITY TIMEOUT
BUFFERING
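The visibility timeout and buffering ideas can be illustrated with a small in-memory queue. This is a toy stand-in, not the SQS API: `ToyQueue` and its methods are hypothetical, and time is passed in explicitly so the behavior is easy to follow. A received message is hidden for the timeout period; if the worker never deletes it (for example because it crashed mid-transcode), it becomes visible again for another worker.

```python
from typing import Dict, Optional, Tuple


class ToyQueue:
    """In-memory queue with a visibility timeout (illustration only)."""

    def __init__(self, visibility_timeout: float) -> None:
        self.visibility_timeout = visibility_timeout
        self._messages: Dict[int, str] = {}
        self._visible_at: Dict[int, float] = {}
        self._next_id = 0

    def send(self, body: str) -> int:
        """Buffer a message; it stays queued until a worker deletes it."""
        msg_id = self._next_id
        self._next_id += 1
        self._messages[msg_id] = body
        self._visible_at[msg_id] = 0.0
        return msg_id

    def receive(self, now: float) -> Optional[Tuple[int, str]]:
        """Return the first visible message and hide it for the timeout."""
        for msg_id, body in self._messages.items():
            if self._visible_at[msg_id] <= now:
                self._visible_at[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None

    def delete(self, msg_id: int) -> None:
        """Acknowledge successful processing."""
        self._messages.pop(msg_id, None)
        self._visible_at.pop(msg_id, None)


if __name__ == "__main__":
    queue = ToyQueue(visibility_timeout=30)
    queue.send("transcode video-1")
    print(queue.receive(now=0))   # first worker takes the job
    print(queue.receive(now=10))  # hidden while in flight -> None
    print(queue.receive(now=31))  # first worker never deleted it -> redelivered
```

Because the producer only writes to the queue, it keeps working while consumers are slow or absent, and the backlog is processed once they return.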
CLOUDWATCH METRICS FOR AMAZON SQS
+ AUTO SCALING
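A sketch of how a queue-depth metric could drive worker capacity: size the fleet so that each worker has at most a fixed backlog, clamped to a minimum and maximum. The function name, the backlog target, and the bounds are assumptions for illustration; in practice the input would come from a CloudWatch metric such as SQS's `ApproximateNumberOfMessagesVisible`.

```python
import math


def desired_workers(visible_messages: int,
                    messages_per_worker: int,
                    min_workers: int = 1,
                    max_workers: int = 10) -> int:
    """Return a worker count that keeps per-worker backlog under the target."""
    if messages_per_worker <= 0:
        raise ValueError("messages_per_worker must be positive")
    needed = math.ceil(visible_messages / messages_per_worker)
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    for depth in (0, 250, 5000):
        print(f"{depth} messages -> {desired_workers(depth, 100)} workers")
```

Scaling on backlog rather than CPU lets the consumers grow and shrink independently of the producers, which is what the loose coupling makes possible.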
1. DESIGN FOR FAILURE
2. MULTIPLE AVAILABILITY ZONES
3. SCALING
4. SELF-HEALING
5. LOOSE COUPLING
YOUR GOAL
Applications should continue to function
IT’S ALL ABOUT CHOICE
BALANCE COST & HIGH AVAILABILITY
Summary
Leverage AWS Services
Apply 5 principles for HA
Automate
Test your HA implementation
aws.amazon.com/architecture
JUST EAT HIGH AVAILABILITY WITH AWS
JUST EAT
13 countries
34,000+ restaurants
8m+ members
Over 50m orders
16,000+ restaurants in UK, 8m visits a month
PLATFORM
Devices in restaurants
Consumer Website
Public API
Order API Ratings API Search API …
Restaurant Services
SQL Server Networking Monitoring
Customer Care Tools
Emails
Common Infrastructure
…
Apps and External Services
APIs
DESIGN FOR FAILURE
Diagram: Devices in restaurants, Device Service, Web Service, and JCT Service; Auto Scaling groups spanning eu-west-1a, eu-west-1b, and eu-west-1c; Orders queue; Orders data
SCALING – PROACTIVE
Chart: Web servers in data center, Web EC2 instances
SCALING – REACTIVE
Chart: Web servers in data center, Web EC2 instances
EVERYTHING MULTI AZ – CONSUMER WEBSITE
Auto scaling Group
eu-west-1a eu-west-1b eu-west-1c
Monitor to keep resource usage at a maximum of 66% of capacity in each AZ when everything is available.
All three AZs available: 66% 66% 66%
One AZ lost: 99% 99%
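The 66% figure follows from simple arithmetic: if one of N equally sized AZs is lost, its load spreads over the remaining N − 1, so each AZ can run at no more than (N − 1)/N of capacity beforehand. The sketch below works that through; the function names are illustrative and it assumes load is spread evenly.

```python
def max_safe_utilization(az_count: int) -> float:
    """Highest per-AZ utilization (0-1) that still fits after losing one AZ."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive the loss of one")
    return (az_count - 1) / az_count


def utilization_after_az_loss(current: float, az_count: int) -> float:
    """Per-AZ utilization once one AZ's load is spread over the survivors."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive the loss of one")
    return current * az_count / (az_count - 1)


if __name__ == "__main__":
    print(f"3 AZs, safe ceiling: {max_safe_utilization(3):.0%}")
    print(f"66% before loss -> {utilization_after_az_loss(0.66, 3):.0%} after")
    print(f"80% before loss -> {utilization_after_az_loss(0.80, 3):.0%} after")
```

A result above 100% means the surviving AZs cannot absorb the load without launching more instances or degrading.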
EVERYTHING MULTI AZ – INTERNAL APIS
Auto scaling Group
eu-west-1a eu-west-1b eu-west-1c
Alarms tell us that performance has degraded, but the platform will self-heal as new instances are launched.
Applications assume that internal APIs will fail or run slowly, so they can cope with the loss of an AZ or of instances and will simply degrade gracefully.
All three AZs available: 80% 80% 80%
One AZ lost: 100% 100%
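"Assume that internal APIs will fail or run slowly" can be sketched as a call wrapper with a timeout and a fallback value. This is one possible approach, not JUST EAT's implementation; `call_with_fallback` and `slow_ratings_api` are hypothetical names, and the thread pool is only a convenient way to enforce a deadline in plain Python.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(fn: Callable[[], T], fallback: T, timeout_s: float) -> T:
    """Return fn() if it succeeds within timeout_s, otherwise the fallback."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort; an already running call is not interrupted
        return fallback
    except Exception:
        return fallback  # the dependency failed outright


def slow_ratings_api() -> list:
    """Hypothetical internal API that has become slow."""
    time.sleep(0.5)
    return [5, 4, 5]


if __name__ == "__main__":
    ratings = call_with_fallback(slow_ratings_api, fallback=[], timeout_s=0.05)
    # The page still renders, just without ratings.
    print("ratings shown:", ratings)
```

The caller treats a missing result as a normal case, so a slow or lost dependency reduces functionality instead of taking the page down.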
EVERYTHING MULTI AZ – SQL SERVER 2012
eu-west-1a eu-west-1b eu-west-1c
Connection strings simply contain both primary and secondary servers –
no code changes required.
Primary, Witness, Secondary
Alarms tell us that failover has occurred, but it happens without manual intervention.
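A connection string that names both servers might look like the output of the sketch below. The slides do not say which failover mechanism is used; this assumes a database mirroring style setup and the SqlClient `Failover Partner` keyword. The host names, database name, and helper function are hypothetical.

```python
def mirrored_connection_string(primary: str, secondary: str, database: str) -> str:
    """Build a SqlClient-style connection string naming both servers."""
    parts = {
        "Server": primary,              # tried first
        "Failover Partner": secondary,  # assumed mirroring keyword; used if the primary is unavailable
        "Database": database,
        "Integrated Security": "True",
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())


if __name__ == "__main__":
    print(mirrored_connection_string(
        primary="sql-primary.example.internal",      # hypothetical host
        secondary="sql-secondary.example.internal",  # hypothetical host
        database="Orders",                           # hypothetical database
    ))
```

Because the client library handles the switch, application code does not need to know which server is currently active.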
DANIEL RICHARDSON
DIRECTOR OF ENGINEERING, JUST EAT
daniel.richardson@just-eat.com
www.just-eat.com/jobs
twitter.com/JustEatUK
www.facebook.com/justeat