Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

37
(C) 2002 Daniel Sorin Wisconsin Multifacet Project SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department University of Wisconsin—Madison

description

SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery. Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department University of Wisconsin—Madison. Overview. Hardware fault frequencies are increasing - PowerPoint PPT Presentation

Transcript of Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

Page 1: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

(C) 2002 Daniel Sorin Wisconsin Multifacet Project

SafetyNet: Improving the Availability of

Shared Memory Multiprocessors with Global Checkpoint/Recovery

Daniel J. Sorin, Milo M. K. Martin,Mark D. Hill, and David A. Wood

Computer Sciences DepartmentUniversity of Wisconsin—Madison

Page 2: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 2

Overview

• Hardware fault frequencies are increasing

• Hardware checkpoint/recovery for multiprocessors– Transparent to software

• SafetyNet Innovations– Efficient coordination of checkpoint creation– Optimized logging of checkpoint state– Checkpoint validation off critical path

• SafetyNet achieves 3 goals, existing systems get 2– High availability– High performance– Low cost

Page 3: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 3

Outline

• Availability– Motivation– Example targeted faults– Differences between SafetyNet and existing approaches

• SafetyNet: Key Features• A SafetyNet Implementation• Evaluation• Conclusions

Page 4: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 4

Availability Motivation

• Fault frequencies are increasing

1. Technological reasons– Smaller transistors– Denser wires

2. Architectural reasons– More components– More aggressive designs

• Marketing trends demand more availability

Need architectural solution to improve availability

Page 5: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 5

Which Faults Do We Target?

• Hardware faults in shared memory multiprocessors– Mostly transient, some permanent, not chipkill

• We focus on faults outside of processor cores– Why? Good techniques for processors (e.g., DIVA)

• Interconnection network– Example: dead switch– Detect with timeout

• Cache coherence protocols– Example: lost coherence message– Detect with timeout

CPU CPU CPU

Interconnection

Network

Page 6: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 6

System Hardware Design Space

Backward Error Recovery(Tandem NonStop)

Forward Error Recovery (IBM mainframes)

Servers and PCs

Existing systems get only 2 out of 3 features

Page 7: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 7

Outline

• Availability• SafetyNet: Key Features

– System abstraction– Innovations

• A SafetyNet Implementation• Evaluation• Conclusions

Page 8: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 8

SafetyNet Abstraction

Processor

Processor

CurrentMemory

Checkpoint

CurrentMemory

checkpoint

CurrentMemoryVersion

Active(Architectural)

State ofSystem

Most Recently Validated Checkpoint

Recovery Point

Checkpoints Awaiting Validation

Page 9: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 9

SafetyNet Execution Model

CP1

time

CP2

CP3

CP4

CP5

recovery pt

validating

active

Create CP3

recovery pt

validating

active

validating

Validate CP2

recovery pt

validating

active

Create CP4

active

validating

validating

recovery pt

Recovery

recovery pt

active

Page 10: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 10

SafetyNet Goal and Innovations

• Goal: Recover to consistent checkpoint if fault

• Inefficient but correct solution – Periodically quiesce entire system to take checkpoint– Checkpoints include all system state– Stop system to validate checkpoints as fault free

• SafetyNet innovations:1. Efficient coordination of checkpoint creation across system2. Optimized checkpointing of system state3. Pipelined validation of checkpoints in background

Page 11: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 11

• Checkpoints must reflect consistent system state– Nodes must agree on memory values and coherence

• Coordinate checkpoints in logical time– Logical time is time base that respects causality

• Each node maintains its own logical clock– Create checkpoint every K logical cycles

We need logical time base that helps coordination

Key #1Coordinating Checkpoint Creation

Page 12: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 12

Logical Time Base

• Many logical time bases exist– Depends on coherence protocol

• Broadcast snooping systems– Increment clock for every coherence request processed– Nodes can be at different logical times– All nodes can agree when coherence transaction happens

• Directory protocol systems– Based on loosely synchronized physical clock (10 kHz)– More complicated explanation refer to paper for details

Page 13: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 13

Key #2Optimized Checkpointing of System State

• Checkpoint all state needed to resume execution– Processor registers– Memory state (including cache state)– Cache coherence state

• Processors save register state at each checkpoint– Copy registers into shadow registers

• Logically, cache/memory log old data every time:– Store overwrites an old checkpoint of block– Block’s coherence ownership is transferred

How can we reduce the amount of logged state?

Page 14: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 14

Optimized Logging

• Insight: only recover at checkpoint granularity• Intervals between checkpoints group writes/transfers

– E.g., checkpoint every 100,000 cycles (100 μsec at 1GHz)

• Only log first store/transfer per block per interval

• Optimization at cache:– Label cache blocks with checkpoint numbers (CNs)– If write/transfer is from same checkpoint, no logging needed

Large benefit due to locality of references

Page 15: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 15

Key #3Checkpoint Validation in Background

• Only validate when all agree checkpoint is fault-free– Example: no outstanding coherence requests in checkpoint

• Nodes perform fault detection, then coordinate

• Can be in background and pipelined– Reason why we have checkpoints awaiting validation

• Can hide long fault detection latencies– Number of outstanding checkpoints x checkpoint length– Design tolerance to be longer than longest detection latency

Don’t slow down execution to validate checkpoints

Page 16: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 16

Outline

• Availability • SafetyNet: Key Features• A SafetyNet Implementation• Evaluation• Conclusions

Page 17: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 17

System Model

• Checkpoint Log Buffer (CLB) at cache and memory• Just FIFO log of block writes/transfers

CPU

cache(s) CLB CLBmemory

network interface

NS halfswitch

EW halfswitch

reg CPs

I/O bridge

Page 18: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 18

Example of SafetyNet Operation

Recovery point is checkpoint 2. Most recent checkpoint is 3.Active checkpoint is 4. Processor 1 owns block B (validated).

CLB

Cache

P1

B M 2000

P2

Interconnection network

Regs: CP2Regs: CP3 Regs: CP2Regs: CP3

Addr State CN data

Addr State data Addr State data

Addr State CN data

CLB

Cache

Page 19: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 19

Example of SafetyNet Operation

P1 stores 3000 to block B between checkpoints 3 and 4. Logs old data.

P1

B M 4 3000

P2

B M 2000

Addr State CN data Addr State CN data

Addr State data Addr State data

Regs: CP2Regs: CP3 Regs: CP2Regs: CP3

CLB

Cache

CLB

Cache

Interconnection network

Page 20: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 20

Example of SafetyNet Operation

P1 loads from block B. SafetyNet uninvolved.

P1

B M 4 3000

P2

B M 2000

Addr State CN data Addr State CN data

Addr State data Addr State data

Regs: CP2Regs: CP3 Regs: CP2Regs: CP3

CLB

Cache

CLB

Cache

Interconnection network

Page 21: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 21

Example of SafetyNet Operation

Coordinated creation of checkpoint 4. Active checkpoint is 5.Save register state at beginning of checkpoint 4.

P1

B M 4 3000

P2

B M 2000

Regs: CP2Regs: CP3Regs: CP4Regs: CP2Regs: CP3Regs: CP4

Addr State CN data Addr State CN data

Addr State data Addr State data

CLB

Cache

CLB

Cache

Interconnection network

Page 22: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 22

Example of SafetyNet Operation

P2 requests ownership of block B. P1 logs old data and sends copy to P2. P1 invalidates cache entry.

P1 P2

B M 5 3000

B M 2000B M 3000

Addr State CN data Addr State CN data

Addr State data Addr State data

Regs: CP2Regs: CP3Regs: CP4Regs: CP2Regs: CP3Regs: CP4

CLB

Cache

CLB

Cache

Interconnection network

Page 23: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 23

Example of SafetyNet Operation

Validation of checkpoint 3. Discard checkpoint 2 registers.Recovery point is now beginning of checkpoint 3.

P1 P2

B M 5 3000

B M 3000

Regs: CP3Regs: CP4 Regs: CP3Regs: CP4

Addr State CN data Addr State CN data

Addr State data Addr State data

B M 2000CLB

Cache

CLB

Cache

Interconnection network

Page 24: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 24

Example of SafetyNet Operation

Recovery (to checkpoint 3). Restore CP3 registers. Restore ownership of B to P1. Invalidate B at P2. Now restart system!

P1

B M

P2

2000

Regs: CP3 Regs: CP3

Addr State CN data Addr State CN data

Addr State data Addr State data

CLB

Cache

CLB

Cache

Interconnection network

Page 25: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 25

System Recovery and Restart

• Any component can trigger recovery– E.g., processor times out on coherence request

• All in-progress transactions are dropped– By definition, these transactions are not validated

• After recovery, resume execution– May have to reconfigure (e.g., route around dead link)– Must replay work that was lost

Page 26: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 26

I/O and the Outside World

• Output commit problem – Can’t send uncommitted data beyond sphere of recoverability

• SafetyNet includes processors, memory, coherence• Doesn’t include network, disks, printer, etc.

• Standard solution: wait to communicate with I/O • Only send validated data to outside world

• Input commit problem – Input can’t be recovered– Standard solution: log input

Page 27: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 27

Outline

• Availability• SafetyNet: Key Features• A SafetyNet Implementation• Evaluation

– Methodology– Runtime performance

• Conclusions

Page 28: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 28

Methodology: Simulation & Workloads

• Simulation– Simics full-system simulation of 16-proc SPARC system– Detailed timing simulation of memory system

• MOSI directory cache coherence protocol– Simple, in-order processor model– 128KB L1I/D, 4MB L2, 512KB CLB

• Workloads (commercial and scientific)– Online transaction processing (OLTP): IBM’s DB2– Static web server: Apache driven by SURGE– Dynamic web server: Slashcode– Java server: SpecJBB– Scientific: barnes-hut from SPLASH2

Page 29: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 29

Runtime Performance

Normalize results to unprotected system

Page 30: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 30

Runtime Performance

Unprotected system crashes if fault occurs

Page 31: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 31

Runtime Performance

SafetyNet has same fault-free performance as unprotected

Error bars = +/- one standard deviation

Page 32: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 32

Runtime Performance

SafetyNet avoids crashes in presence of lost messages

Page 33: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 33

Runtime Performance

SafetyNet avoids crashes in presence of dead half-switch

Page 34: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 34

High-Level Comparison to ReVive

ReVive SafetyNet

Backward error recovery scheme

Yes Yes

Fault model Transient & permanent

Transient & some permanent

Processor modification

No Yes

Software modification Minor None

Fault-free performance

6-10% loss No loss

Output commit latency At least 100 milliseconds

No more than 0.4 milliseconds

Page 35: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 35

Conclusions

• SafetyNet: global, consistent checkpointing– Low cost and high performance– Efficient logical time checkpoint coordination– Optimized checkpointing of state– Pipelined, in-background checkpoint validation

• Improved availability– Avoid crash in case of fault– Same fault-free performance

Page 36: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 36

Performance vs. CLB Size

Caveats • Scaled workloads• 100,000 cycle intervals

Page 37: Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood Computer Sciences Department

SafetyNet – Daniel Sorinslide 37

Traditional Availability

• Forward Error Recovery (FER)– Use redundant hardware to mask faults– E.g., triple modular redundancy with voter or pair&spare– Systems: IBM mainframes, Intel 432, Stratus – Sacrifices cost to achieve availability

• Backward Error Recovery (BER)– If fault detected, recover system to pre-fault state– Periodically stop system and save state or log changes– Fault? Restore pre-fault checkpoint or unroll log– Systems: Sequoia, Synapse N+1, Tandem NonStop– Sacrifices performance to achieve availability