April 14, 2014 Anthony D. Joseph inst.eecs.berkeley/~cs162
description
Transcript of April 14, 2014 Anthony D. Joseph inst.eecs.berkeley/~cs162
CS162Operating Systems andSystems Programming
Lecture 20
Why Systems Fail and What We Can Do About It
April 14, 2014
Anthony D. Joseph
http://inst.eecs.berkeley.edu/~cs162
Lec 20.24/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Review: 2PC Algorithm
• One coordinator
• N workers (replicas)
• High level algorithm description– Coordinator asks all workers if they can commit
– If all workers reply “VOTE-COMMIT”, then coordinator broadcasts “GLOBAL-COMMIT”,
Otherwise coordinator broadcasts “GLOBAL-ABORT”
– Workers obey the GLOBAL messages
• Ensure atomicity and durability: a transaction is committed/aborted either by all replicas or by none of them
Lec 20.34/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Review: Dealing with Worker Failures
• How to deal with worker failures?– Failure only affects states in which the node is waiting for
messages
– Coordinator only waits for votes in “WAIT” state
– In WAIT, if doesn’t receive
N votes, it times out and sends
GLOBAL-ABORT
INIT
WAIT
ABORT COMMIT
Recv: STARTSend: VOTE-REQ
Recv: VOTE-ABORTSend: GLOBAL-ABORT
Recv: VOTE-COMMITSend: GLOBAL-COMMIT
Lec 20.44/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Review: Dealing with Coordinator Failure
• How to deal with coordinator failures?– worker waits for VOTE-REQ in INIT
» Worker can time out and abort (coordinator handles it)
– worker waits for GLOBAL-* message in READY» If coordinator fails, workers must
BLOCK waiting for coordinator
to recover and send
GLOBAL_* message
INIT
READY
ABORT COMMIT
Recv: VOTE-REQSend: VOTE-ABORT
Recv: VOTE-REQSend: VOTE-COMMIT
Recv: GLOBAL-ABORT Recv: GLOBAL-COMMIT
Lec 20.54/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Review: Example of Coordinator Failure #1
coordinator
worker 1
VOTE-REQ
VOTE-ABORT
timeout
INIT
READY
ABORT COMM
timeout
timeout
worker 2
worker 3
Lec 20.64/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Review: Example of Coordinator Failure #2
VOTE-REQ
VOTE-COMMIT
INIT
READY
ABORT COMM
Block waiting for coordinator
restarted
GLOBAL-ABORT
coordinator
worker 1
worker 2
worker 3
GLOBAL-COMMIT
Lec 20.74/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Remembering Where We Were (Durability)
• All nodes use stable storage* to store which state they are in
• Upon recovery, it can restore state and resume:– Coordinator aborts in INIT, WAIT, or ABORT
– Coordinator commits in COMMIT
– Worker aborts in INIT, ABORT
– Worker commits in COMMIT
– Worker asks Coordinator in READY
* - stable storage is non-volatile storage (e.g. backed by disk) that guarantees atomic writes.
Lec 20.84/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Blocking for Coordinator to Recover
• A worker waiting for global decision can ask fellow workers about their state
– If another worker is in ABORT or COMMIT state then coordinator must have sent GLOBAL-*
– Thus, worker can safely abort or commit, respectively
– If another worker is still in INIT state
then both workers can decide to abort
– If all workers are in ready, need to BLOCK (don’t know if coordinator wanted to abort or commit)
INIT
READY
ABORT COMMIT
Recv: VOTE-REQSend: VOTE-ABORT
Recv: VOTE-REQSend: VOTE-COMMIT
Recv: GLOBAL-ABORT Recv: GLOBAL-COMMIT
Lec 20.94/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Goals for Today
• Definitions for Fault Tolerance
• Causes of system failures
• Fault Tolerance approaches– HW- and SW-based Fault Tolerance, Datacenters,
Cloud, Geographic diversity
Note: Some slides and/or pictures in the following are adapted from slides from a talk given by Jim Gray at UC Berkeley on November 9, 2000.
“You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done.” —LESLIE LAMPORT
Lec 20.104/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Dependability: The 3 ITIES
• Reliability / Integrity: does the right thing.
(Need large MTBF)
• Availability: does it now. (Need small MTTR
MTBF+MTTR)
• System Availability:if 90% of terminals up & 99% of DB up?
(=> 89% of transactions are serviced on time)
SecurityIntegrityReliability
Availability
MTBF or MTTF = Mean Time Between (To) FailureMTTR = Mean Time To Repair
Lec 20.114/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Mean Time to Recovery• Critical time as further failures can occur during recovery
• Total Outage duration (MTTR) =Time to Detect (need good monitoring)
+ Time to Diagnose (need good docs/ops, best practices)
+ Time to Decide (need good org/leader, best practices)
+ Time to Act (need good execution!)
Lec 20.124/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Fault Tolerance versus Disaster Tolerance
• Fault-Tolerance: mask local faults– Redundant HW or SW
– RAID disks
– Uninterruptible Power Supplies
– Cluster Failover
• Disaster Tolerance: masks site failures– Protects against fire, flood, sabotage,..
– Redundant system and service at remote site(s)
– Use design diversity
Lec 20.134/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
High Availability System Classes
GOAL: Class 6Gmail, Hosted Exchange target 3 nines (unscheduled)2010: Gmail (99.984), Exchange (>99.9)
UnAvailability ~ MTTR/MTBFCan cut it by reducing MTTR or increasing MTBF
Availability % Downtime per year Downtime per month Downtime per week
90% (“one nine”) 36.5 days 72 hours 16.8 hours
99% (“two nines”) 3.65 days 7.20 hours 1.68 hours
99.9% (“three nines”) 8.76 hours 43.2 minutes 10.1 minutes
99.99% (“four nines”) 52.56 minutes 4.32 minutes 1.01 minutes
99.999% (“five nines”) 5.26 minutes 25.9 seconds 6.05 seconds
99.9999% (“six nines”) 31.5 seconds 2.59 seconds 0.605 seconds
Lec 20.154/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Causal Factors for UnavailabilityLack of best practices for:
•Change control
•Monitoring of the relevant components
•Requirements and procurement
•Operations
•Avoidance of network failures, internal application failures, and external services that fail
•Physical environment, and network redundancy
•Technical solution of backup, and process solution of backup
•Physical location, infrastructure redundancy
•Storage architecture redundancy
Ulrik Franke et al: Availability of enterprise IT systems - an expert-based Bayesian model
Lec 20.164/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Case Studies - Tandem NonStop Systems Reported MTBF by Component – System goal 10yr MTBF
0
50
100
150
200
250
300
350
400
450
1985 1987 1989
software
hardware
maintenance
operations
environment
total
Mean Time to System Failure (years) by Cause
1985 1987 1990SOFTWARE 2 53 33 YearsHARDWARE 29 91 310 YearsMAINTENANCE 45 162 409 YearsOPERATIONS 99 171 136 YearsENVIRONMENT 142 214 346 YearsSYSTEM 8 20 21 YearsProblem: Systematic Under-reporting
Lec 20.174/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Operations Failures
RAID Drive 1 failed!Replace immediately
What went wrong??
Lec 20.184/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Operations FailuresRAID Drive 1 failed!Replace immediately
Lec 20.194/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Windows XP – End of Support 4/8/14• History
– Released to manufacturing on August 21, 2001
– General availability on October 25, 2001
– OEM sales ended on June 30, 2008
• Current users:– 6,000 Web servers, 95% of ATMs, 40%(!) of companies
– 420K ATMs in US alone (those using WinXP Embedded get support until 2016)
– IRS: 110K PCs, 53% running Win XP, paying MS for extended support
• Why?– Min (recommended) config: 233 (300) Mhz Pentium,
64 (128) MB RAM, 1.5 (4.8) GB HD, 800x600 display
– Win 8: 1 Ghz, 1 (4) GB RAM, 16 GB HD, 1024x768 (1366x768) display
Lec 20.204/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
2min Breakhttp://xkcd.com/1354/
Lec 20.214/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Fault Model
• Assume failures are independent*So, single fault tolerance is a big win
• Hardware fails fast (blue-screen, panic, …)
• Software fails-fast (or stops responding/hangs)
• Software often repaired by reboot:– Heisenbugs – Works On Retry
– (Bohrbugs – Faults Again On Retry)
• Operations tasks: major source of outage– Utility operations – UPS/generator maintenance
– Software upgrades, configuration changes
Lec 20.224/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Hardware Failure Rate
https://en.wikipedia.org/wiki/Bathtub_curve
Lec 20.234/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Consumer Disk Drive Failure Rates
http://blog.backblaze.com/2013/11/12/how-long-do-disk-drives-last/
78% survive 4 years
Implies 6 year median
lifetime
Lec 20.244/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
36-Month Survival Rate by Brand
• Results somewhat skewed by Seagate 1.5TB Barracuda Green drives that were warranty replacement drives, and by Seagate 1.5TBBarracuda 7200 drives
Lec 20.254/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Traditional Fault Tolerance Techniques
• Fail fast modules: work or stop
• Spare modules: yield instant repair time
• Process/Server pairs: Mask HW and SW faults
• Transactions: yields ACID semantics (simple fault model)
Lec 20.264/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Fail-Fast is Good, but Repair is Needed
Improving either MTTR or MTBF gives benefit
Simple redundancy does not help much (can actually hurt!)
Lifecycle of a modulefail-fast gives short fault latency
High Availability
is low UN-Availability
Unavailability ~ MTTR MTBF
X XFault Detect
RepairReturn
Lec 20.274/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Hardware Reliability/Availability
(how to make HW fail fast)Duplex module with output comparator:
Fail-Fast: fail if either fails (e.g., duplexed CPUs)
2 work
1 work
0 work
MTBF/2 MTBF/1
MTBF/2
Fail-Soft: fail if both fail (e.g., disc, network,...)
2 work
1 work
0 work
MTBF/2 MTBF/1
1.5 * MTBF
Worse MBTF from simple
redundancy!
The Airplane Rule:
A two-engine airplane has twice as many engine problems as a one engine plane
Lec 20.284/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Add Repair: Gain 104 Improvement
2 work
1 work
0 work
MTBF/2 MTBF/1
104 * MTBF
MTTR MTTR
1 Year MTBF modules12-hour MTTR
Adding repair puts module back into service after a failure
Duplex Fail Soft + Repair Equation: MTBF2/(2 * MTTR) Yields > 104 year MTBF
Lec 20.294/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Software Techniques: Learning from Hardware
• Fault avoidance starts with a good and correct design
• After that – Software Fault Tolerance Techniques:
Modularity (isolation, fault containment)
Programming for Failures: Programming paradigms that assume failures are common and hide them
Defensive Programming: Check parameters and data
N-Version Programming: N-different implementations
Auditors: Check data structures in background
Transactions: to clean up state after a failure
Lec 20.304/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Try&Catch Alone isn’t Fault Tolerance!
String filename = "/nosuchdir/myfilename";
try { // Create the file new File(filename).createNewFile();
} catch (IOException e) {
// Print out the exception that occurred
System.out.println("Unable to create file ("+filename+”): "+e.getMessage()); }
• Fail-Fast, but is this the desired behavior?
• Alternative behavior: (re)-create missing directory?
Lec 20.314/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Fail-Fast and High-Availability ExecutionProcess Pairs: Instant repair
Use Defensive programming to make a process fail-fastHave separate backup process ready to “take over” if primary faults
• SW fault is a Bohrbug no repair“wait for the next release” or “get an emergency bug fix” or“get a new vendor”
• SW fault is a Heisenbug restart process“reboot and retry”
• Yields millisecond repair times
• Tolerates some HW faults SESSIONPRIMARYPROCESS
BACKUPPROCESS
STATEINFORMATION
LOGICAL PROCESS = PROCESS PAIR
Lec 20.324/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Server System Pairs for High Availability
• Programs, Data, Processes Replicated at 2+ sites
– Logical System Pair looks like a single system– Backup receives transaction log
• If primary fails or operator switches, backup offers service
• What about workloads requiring more than one server?
Primary BackupLocal or wide area
Lec 20.334/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Apache ZooKeeper
• Multiple servers require coordination– Leader Election, Group Membership, Work Queues, Data
Sharding, Event Notifications, Config, and Cluster Mgmt
• Highly available, scalable, distributed coordination kernel– Ordered updates and strong persistence guarantees
– Conditional updates (version), Watches for data changes
ServerServer ServerServerServerServer
Leader
Client ClientClientClientClient ClientClient
Lec 20.344/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Datacenter is new “server”
• What about even larger scale?
• “Program” == Web search, email, map/GIS, …
• “Computer” == 1000’s computers, storage, network
• Warehouse-sized facilities and workloads
• Built from less reliable components than traditional datacenters
photos: Sun Microsystems & datacenterknowledge.com
Lec 20.354/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Datacenter Architectures
• Major engineering design challenges in building datacenters
– One of Google’s biggest secrets and challenges
– Facebook creating open solution: http://www.facebook.com/note.php?note_id=10150148003778920
– Open source design: http://www.opencompute.org/
– Very hard to get everything correct!
• Example: AT&T Miami, Florida Tier 1 datacenter– Minimum N+1 redundancy factor on all critical
infrastructure systems (power, comms, cooling, …)
– Redundant communications: dual uplinks to AT&T global backbone
– What about power?
Lec 20.364/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Typical Tier-2 One Megawatt Datacenter
Transformer
Main Supply
ATSSwitchBoard
UPS UPS
STSPDU
STSPDU
Pan
el
Pan
el
Generator
…
1000 kW
200 kW
50 kW
Rack
Circuit
2.5 kWX. Fan, W-D Weber, L. Barroso, “Power Provisioning for a Warehouse-sized Computer,” ISCA’07, San Diego, (June 2007).
• Reliable power: Mains + Generator, Dual UPS
• Units of Aggregation– Rack (10-80 nodes) →
PDU (20-60 racks) → Facility/Datacenter
• Power is 40% of DC costs• Over 20% of entire DC costs is
in power redundancy
Lec 20.374/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
2 commercial feeds, Each at 13,800VACLocated near Substation supplied from 2 different grids
Commercial Power Feeds
AT&T Internet Data Center Power
AT&T Enterprise Hosting Services briefing 10/29/2008
Lec 20.384/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Emergency Power Switchover
AT&T Internet Data Center Power
• Paralleling switch gear
• Automatically powers up all generators when Commercial power is interrupted for more than 7 seconds
– Generators are shed to cover load as needed
– Typical transition takes less than 60 seconds
• Manual override available to ensure continuity if automatic start-up should fail
AT&T Enterprise Hosting Services briefing 10/29/2008
Lec 20.394/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Back-up Power – Generators and Diesel Fuel• Four (4) 2,500 kw Diesel Generators Providing Standby Power, capable of producing 10 MW of power
• Two (2) 33,000 Gallon Aboveground Diesel Fuel Storage Tanks
AT&T Internet Data Center Power
AT&T Enterprise Hosting Services briefing 10/29/2008
Lec 20.404/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
• UPS consisting of four battery strings
• Battery strings contain flooded cell Lead Acid batteries
• A minimum of 15 minutes of battery backup available at Full load
• Remote status monitoring of battery strings
AT&T Internet Data Center Power
UPS Batteries
AT&T Enterprise Hosting Services briefing 10/29/2008
Lec 20.414/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
• Four UPS Modules connected in a Ring Bus configuration• Each Module rated at 1000kVA• Rotary Type UPS by Piller
• Motor-Generator pairs clean power
• Eliminate spikes, sags, surges, transients, and all other Over/Under voltage And frequency conditions
Uninterruptible Power Supply (UPS)
AT&T Internet Data Center Power
AT&T Enterprise Hosting Services briefing 10/29/2008
Lec 20.424/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Alternate Power Distribution Model
• Google and Facebook replaced central UPS with individual batteries*
277 VAC FB
48VDC FB
Industry 21-27%
loss
Facebook7.5% loss
Lec 20.434/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Administrivia
• Project 3 code due Thursday 4/17 by 11:59PM
• Project 4 design due date– Thursday 4/24 by 11:59PM
• Midterm II is April 28th 4-5:30pm in 245 Li Ka Shing and 100 GPB
– Covers Lectures #13-24
– Closed book and notes, no calculators
– One double-sides handwritten page of notes allowed
– Review session: Fri Apr 25 4-6pm in 245 Li Ka Shing
Lec 20.444/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
2min Break http://xkcd.com/1354/
Lec 20.454/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Many Commercial Alternatives
• The rise of Cloud Computing!
• “Inexpensive” virtual machine-based computing resources
– Instantaneous (minutes) provisioning of replacement computing resources
– Also highly scalable
• Competition driving down prices
• Easiest way to build a startup!– Scale resources/costs as company
grows
Lec 20.464/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
MapReduce: Programming for Failure
• First widely popular programming model for data-intensive apps on commodity clusters
• Published by Google in 2004– Processes 20 PB of data / day
• Popularized by open-source Hadoop project– 40,000 nodes at Yahoo!, 70 PB at Facebook
• Programming model– Data type: key-value records
» Map function: (Kin, Vin) list(Kinter, Vinter)
» Reduce function: (Kinter, list(Vinter)) list(Kout, Vout)
Lec 20.484/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Fault Tolerance in MapReduce
1. If a task crashes:– Retry on another node
» OK for a map because it had no dependencies
» OK for reduce because map outputs are on disk
– If the same task repeatedly fails, fail the job
2. If a node crashes:– Relaunch its current tasks on other nodes
– Relaunch any maps the node previously ran» Necessary because their output files are lost
Tasks must be deterministic and side-effect-free
Lec 20.494/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Fault Tolerance in MapReduce
3. If a task is going slowly (straggler):– Launch second copy of task on another node
– Take output of whichever copy finishes first
•Critical for performance in large clusters
•What about other distributed applications?– Web applications, distributed services, …
– Often complex with many, many moving parts
– Interdependencies often hidden and/or unclear
Lec 20.504/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014From: http://analysiscasestudy.blogspot.com/
Cloud Computing Outages Q4 2012
Lec 20.514/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Introduce Controlled Chaos• Best way to avoid failure is to fail constantly!
– John Ciancutti, Netflix
• Inject random failures into cloud by killing VMs– Most times, nothing happens, occasional surprises
• April, 2011: EC2 failure brought down Reddit, Foursquare, Quora (and many others)
– Netflix was unaffected thanks to Chaos Monkey and replication
• Also apply to network and storage systems
• Great until December 24, 2012 EC2 outage– Failure of single Amazon availability zone
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
Lec 20.524/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Add Geographic Diversity to Reduce Single Points of Failure*
Netflix Chaos Gorilla
Lec 20.534/14/2014 Anthony D. Joseph CS162 ©UCB Spring 2014
Summary• Focus on Reliability/Integrity and Availability
– Also, Security (see next two lectures)
• Use HW/SW FT to increase MTBF and reduce MTTR– Build reliable systems from unreliable components
– Assume the unlikely is likely
– Leverage Chaos Monkey
• Make operations bulletproof: configuration changes, upgrades, new feature deployment, …
• Apply replication at all levels (including globally)