Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection...

62
Storms in the Cloud Designing and using a fault injection system Michalis Zervos @mzervos

Transcript of Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection...

Page 1: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Storms in the CloudDesigning and using a fault injection system

Michalis Zervos@mzervos

Page 2: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

http://venturebeat.com/ http://www.pcworld.com/

Google News - http://thenextweb.com/https://twitter.com/netflixhelps

Google News - http://mashable.com/

http://www.itnews.com.auGoogle News - http://www.itnews.com.au

Page 3: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust
Page 4: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Service Resilience

• Not a solved problem

• Goal is: • 100% uptime

• No degradation

• Responsive

Page 5: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Traditional testing

• Unit tests

• Functional / Integration

• End to end

Page 6: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Cloud services – Testing challenges

• Continuous evolution

• Multiple dependencies

• Global distribution

• Traffic fluctuation

Page 7: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Cloud services – Fundamentals

• Auto-scaling

• Redundancy

• Monitoring and detection systems

• Auto-mitigation / Failover mechanisms

• Staged deployments

• Data replication

Page 8: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

The extra mile

• Embrace failure

• Break the system

• Adjust the engineering process

Page 9: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Storms in the Cloud

Page 10: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Fault Injection System

Support diverse services

Easy to use

Verify resilience and behavior

Simulate complex failures / real-life incidents

Page 11: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Agenda

Designing a Fault Injection System

Usage patterns

Page 12: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Faults

• Resource pressure

• Network

• Processes

• Virtual machine

• Platform

• Application specific

• Hardware

Page 13: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Faults

• Resource pressure

• Network

• Processes

• Virtual machine

• Platform

• Application specific

• Hardware

Page 14: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Resource Pressure Faults

• CPU

• Memory• Physical

• Virtual

• Hard disk• Capacity

• Read

• Write

Available tools

• consume.exe (Windows SDK)

• stress (Unix)

• Sysinternals tools

Page 15: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Faults

• Resource pressure

• Network

• Processes

• Virtual machine

• Platform

• Application specific

• Hardware

Page 16: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Network faults

• Layers• Transport (TCP/UDP)• Application layer (HTTP)

• Types• Disconnect• Latency• Alter response codes (HTTP)• Packet reorder / loss (TCP/UDP)

• Filters• Domain / IP / Subnet• URL path• Port / Protocol

Available tools

• Network Emulator Toolkit (NEWT)

• Fiddler core

Page 17: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Faults

• Resource pressure

• Network

• Processes

• Virtual machine

• Platform

• Application specific

• Hardware

Page 18: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Process faults

• Stop / Kill

• Restart

• Stop service

• Start

• Crash

• Hang

Available tools

• OS commands

• Sysinternals tools

Page 19: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Faults

• Resource pressure

• Network

• Processes

• Virtual machine

• Platform

• Application specific

• Hardware

Page 20: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Virtual Machine / OS faults

• Stop

• Restart

• BSOD / Kernel panic

• Change date

• Re-image

Available tools

• Cloud Management APIs

• OS commands

• Sysinternals tools

Page 21: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Faults

• Resource pressure

• Network

• Processes

• Virtual machine

• Platform

• Application specific

• Hardware

Page 22: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Distributed platform faults

• Quorum loss

• Data loss

• Move primary node

• Remove replica

Available tools – Platform specific

• Service Fabric testability APIs

Page 23: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Faults

• Resource pressure

• Network

• Processes

• Virtual machine

• Platform

• Application specific

• Hardware

Page 24: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Application specific faults

• Hooks• Instrument service code

• Intercept / Re-route calls• No access to service code

Available tools

• MSR Detours

• TestApi – Managed Fault Injection

Page 25: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Application specific faults

• Hooks• Instrument service code

• Intercept / Re-route calls• No access to service code

Available tools

• MSR Detours

• TestApi – Managed Fault Injection

Page 26: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Faults

• Resource pressure

• Network

• Processes

• Virtual machine

• Platform

• Application specific

• Hardware

Page 27: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Hardware faults

• Machine

• Network devices

• Rack

• UPS

• Datacenter

Page 28: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Faults

• Resource pressure

• Network

• Processes

• Virtual machine

• Platform

• Application specific

• Hardware

Page 29: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Injection mechanism

• VM External

• VM Internal – Service code external Agent

• VM Internal – Service code internal Hooks

Page 30: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Injection mechanism

• VM External

• VM Internal – Service code external Agent

• VM Internal – Service code internal Hooks

Page 31: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

External injection

• VM / Region Stop

• VM / Region Restart

• Re-image

Target VMTarget VM

Cloud

Management Service

Cloud

Management Service

Page 32: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Injection mechanism

• VM External

• VM Internal – Service code external Agent

• VM Internal – Service code internal Hooks

Page 33: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

VM internal injection - Agent

• Resource pressure

• Network

• Processes

• OS

• Detours

• …Target Service VM

Target Application

Virtual Machine

Operating System

Fault Agent

Page 34: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Injection mechanism

• VM External

• VM Internal – Service code external Agent

• VM Internal – Service code internal Hooks

Page 35: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

VM internal injection - Hooks

• Application behavior

• Flexibility

• Service specificTarget Application

Page 36: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

VM internal injection - Hooks

• Application behavior

• Flexibility

• Service specificTarget Application

Page 37: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Hooks

Fault Agent

VM External

Page 38: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

System Architecture

Target Service VMs

Fault

Management

Service

Fault Agent

Fault AgentCloud

Management Service

Cloud

Management Service

Page 39: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

System components

Auditing Automation

Verification Reporting

Security

Page 40: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

System components

Auditing Automation

Verification Reporting

Security

Page 41: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Security and Safety

• AuthN / AuthZ

• Fault agents

• Kill switch

• Safety nets

Page 42: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Security and Safety

• AuthN / AuthZ

• Fault agents

• Kill switch

• Safety nets

• Integrate with Identity ProviderAzure Active Directory

• Multi-Factor Authentication

• Least-privilege principle

• Granular access levels

Page 43: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Security and Safety

• AuthN / AuthZ

• Fault agents

• Kill switch

• Safety nets

• Secure communication – TLS/SSL

• Code signing

• Execution permissions

Page 44: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Security and Safety

• AuthN / AuthZ

• Fault agents

• Kill switch

• Safety nets

Page 45: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Security and Safety

• AuthN / AuthZ

• Fault agents

• Kill switch

• Safety nets

Auto fault removal

• Agents – Service connectivity lossAgent-side detection

• Service malfunctioningAuto-monitoring module

• Unusual behavior Anomaly detection

Page 46: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

System components

Auditing Automation

Verification Reporting

Security

Page 47: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

System components

Auditing Automation

Verification Reporting

Security

Page 48: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Auditing

• Faults

• Fault agents

• Management service

• Clients

Page 49: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

System components

Auditing Automation

Verification Reporting

Security

Page 50: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

System components

Auditing Automation

Verification Reporting

Security

Page 51: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Automation

• Scheduling

• Zero - configuration

• Dependencies auto-discovery

Page 52: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

System components

Auditing Automation

Verification Reporting

Security

Page 53: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

System components

Auditing Automation

Verification Reporting

Security

Page 54: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

System components

Auditing Automation

Verification Reporting

Security

Page 55: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust
Page 56: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust
Page 57: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Usage scenarios

• Resilience verification

• Test new features

• Training

• Verify staged deployments

• Test detection, alerting, mitigation systems

• Repro incidents

Page 58: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Injection environment

Test Canary Production

Page 59: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Recovery Games

Page 60: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Recovery Games

Attacker

• Inject faults

• Provide hints

Defender

• Assess

• Analyze

• Mitigate

Page 61: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Recovery Games - Goals

• Familiarize with monitoring tools

• Recognize outage patterns

• Train on assessing the impact

• Root-cause / mitigation mindset

• Practice log analysis

Page 62: Storms in the Cloud - qconnewyork.com · •Auto-scaling •Redundancy •Monitoring and detection systems •Auto-mitigation / Failover mechanisms ... •Break the system •Adjust

Invest in Fault Injection Testing

Resilience verification

Test new features

Training

Engineering process & culture

Michalis Zervos@mzervos