Stabilization Enabling Technology


Page 1: Stabilization  Enabling Technology

Stabilization Enabling Technology

Shlomi Dolev

Page 2: Stabilization  Enabling Technology

Trustworthy Systems: Why is it So Hard?

• Corbató’91: "It almost goes without saying that ambitious systems never quite work as expected" http://larch-www.lcs.mit.edu:8001/~corbato/turing91/

• "You must pay extreme attention to detail here. One wrong bit will make things fail…" http://my.execpc.com/~geezer/os/pm.htm

• From the Pentium manual: "… if the ESP or SP register is 1 when the PUSH instruction is executed, the processor shuts down due to a lack of stack space. No exception is generated to indicate this condition"

Page 3: Stabilization  Enabling Technology

Mars Rover - Spirit

• …The Spirit rover has a radiation-hardened R6000 CPU from Lockheed-Martin Federal Systems…The operating system is Wind River Systems' VxWorks…

• …attempted to allocate more files than the RAM-based directory structure could accommodate. That caused an exception, which caused the task that had attempted the allocation to be suspended…

• …Spirit fell silent, alone on the emptiness of Mars, trying and trying to reboot

http://www.eetimes.com/sys/news/OEG20040220S0046

Page 4: Stabilization  Enabling Technology

Linux and Windows do not Stabilize

Page 5: Stabilization  Enabling Technology

Self-Stabilization

• Self-healing, Self-managing, Self-*

• Recovery Oriented Computing [Berkeley, Stanford]

• Autonomic Computing [IBM]

• Self-Stabilization

• Self-Stabilizing algorithm for mutual exclusion in a ring topology [Dijkstra’74]

Page 6: Stabilization  Enabling Technology

Well Established Theory !

Page 7: Stabilization  Enabling Technology

Self-Stabilization

• The combination and type of faults cannot be totally anticipated in on-going systems

• Any on-going system must be self-stabilizing (or manually monitored)

Page 8: Stabilization  Enabling Technology

First Self-Stabilizing Algorithm: Token Passing [Dij74]

Page 9: Stabilization  Enabling Technology

Token Passing

P1: do forever
    if x1 = xn then
        x1 := (x1 + 1) mod (n + 1)

Pi (i ≠ 1): do forever
    if xi ≠ xi-1 then
        xi := xi-1

Page 10: Stabilization  Enabling Technology

Token Passing Cont.

• Surely works when we start in x1 = x2 = … = xn = 0.

• One processor may change a state at a time.

{0; 0; 0; 0; 0};

{1; 0; 0; 0; 0};

{1; 1; 0; 0; 0};

{1; 1; 1; 0; 0};

{1; 1; 1; 1; 0};

{1; 1; 1; 1; 1};

{2; 1; 1; 1; 1};

{2; 2; 1; 1; 1};

{2; 2; 2; 1; 1};

{2; 2; 2; 2; 1};

{2; 2; 2; 2; 2}
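The trace above can be reproduced with a few lines of Python. This is a minimal sketch, not code from the talk: the function names are made up for illustration, processors are indexed from 0 (so index 0 is p1), and the scheduler simply picks the lowest-indexed processor that is allowed to move.

```python
def enabled(x, i):
    """True if processor i (0-based) may change its state in configuration x."""
    n = len(x)
    if i == 0:
        return x[0] == x[n - 1]      # p1 moves when x1 = xn
    return x[i] != x[i - 1]          # pi moves when xi differs from xi-1


def step(x, i):
    """Return the configuration after processor i makes its move."""
    n = len(x)
    y = list(x)
    if i == 0:
        y[0] = (x[0] + 1) % (n + 1)  # p1 increments modulo n + 1
    else:
        y[i] = x[i - 1]              # pi copies its left neighbour
    return y


def run(x, steps):
    """Let the lowest-indexed enabled processor move, `steps` times."""
    trace = [list(x)]
    for _ in range(steps):
        i = next(i for i in range(len(x)) if enabled(x, i))
        x = step(x, i)
        trace.append(x)
    return trace


trace = run([0, 0, 0, 0, 0], 10)
for cfg in trace:
    print(cfg)
```

Starting from all zeros, exactly one processor is enabled in every configuration, so the choice of scheduler does not matter for this particular run.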

Page 11: Stabilization  Enabling Technology

Token Passing: Faults

• Transient fault, soft errors, wrong CRC, unexpected temporal severe conditions, etc.

• Assigns each processor with an arbitrary state (in the range of its state space).

• For example {3; 4; 4; 1; 0}.

• p2; p4; and p5 have tokens!

• Will the system ever recover?

Page 12: Stabilization  Enabling Technology

Token Passing: Automatic Recovery

• p1 changes state infinitely often.

• Otherwise, let s1 be the fixed state of p1,

• p2 eventually copies s1 from p1, then

• p3 eventually copies s1 from p2, then

• ...

• pn eventually copies s1 from pn-1, then

• p1 changes state.

• p1 changes state in the order 4; 5; 0; 1; 2; 3; 4; 5; 0; ...

Page 13: Stabilization  Enabling Technology

Token Passing: Automatic Recovery Cont.

• In any initial state at least one of the n+1 possible values is missing; e.g., in {4; 4; 1; 0; 2}, the values 3 and 5 are missing.

• Once p1 reaches the missing state e.g., 5, all the processors must copy 5, before p1 reads 5 from pn and changes state to 0.
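The recovery argument can be checked experimentally for n = 5. The sketch below is an illustration under stated assumptions, not the talk's code: names are hypothetical, indices are 0-based, "holding a token" means being allowed to move, and the scheduler picks a random token holder at each step. It runs the algorithm from every one of the 6^5 possible configurations and records how many moves pass before exactly one token remains.

```python
import itertools
import random


def tokens(x):
    """0-based indices of the processors that hold a token in configuration x."""
    n = len(x)
    return [i for i in range(n)
            if (x[i] == x[n - 1] if i == 0 else x[i] != x[i - 1])]


def move(x, i):
    """Configuration after token holder i executes its step."""
    n = len(x)
    y = list(x)
    y[i] = (x[0] + 1) % (n + 1) if i == 0 else x[i - 1]
    return y


def steps_to_single_token(x, rng, limit=10_000):
    """Number of moves until exactly one token remains."""
    for count in range(limit):
        holders = tokens(x)
        if len(holders) == 1:
            return count
        x = move(x, rng.choice(holders))   # arbitrary scheduler choice
    raise RuntimeError("did not stabilize within the step limit")


rng = random.Random(0)
print(tokens([3, 4, 4, 1, 0]))   # [1, 3, 4], i.e. p2, p4 and p5

worst = max(steps_to_single_token(list(c), rng)
            for c in itertools.product(range(6), repeat=5))
print("worst number of moves observed:", worst)
```

A random scheduler only samples the possible executions; it does not replace the proof sketched on the slides.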

Page 14: Stabilization  Enabling Technology

Will It Stabilize With mod (n - 2)?

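Mod 3 (n = 5); each line shows a configuration and the processor that moves next:

{0,0,2,1,0} p1

{1,0,2,1,0} p5

{1,0,2,1,1} p4

{1,0,2,2,1} p3

{1,0,0,2,1} p2

{1,1,0,2,1}

The last configuration is the first one +1 mod 3!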
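The schedule on this slide can be replayed mechanically. The sketch below (hypothetical names, 0-based indices) runs the same algorithm with modulus 3 instead of n + 1, applies the moves p1, p5, p4, p3, p2, and checks that the result is the starting configuration with every value increased by 1 mod 3. Because the token rules only compare values, the same schedule can then be repeated forever without the number of tokens ever dropping to one.

```python
K = 3  # modulus under test; the stabilizing version uses n + 1


def holds_token(x, i):
    n = len(x)
    return x[i] == x[n - 1] if i == 0 else x[i] != x[i - 1]


def move(x, i, k=K):
    """Apply processor i's step; it must currently hold a token."""
    assert holds_token(x, i), "scheduled a processor that cannot move"
    y = list(x)
    y[i] = (x[0] + 1) % k if i == 0 else x[i - 1]
    return y


def token_count(x):
    return sum(holds_token(x, i) for i in range(len(x)))


start = [0, 0, 2, 1, 0]
schedule = [0, 4, 3, 2, 1]   # p1, p5, p4, p3, p2

x = start
for i in schedule:
    x = move(x, i)
    print(x, "tokens:", token_count(x))

shifted = [(v + 1) % K for v in start]
print("same as start + 1 mod 3:", x == shifted)
```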

Page 15: Stabilization  Enabling Technology

Is Self-Stabilization a Toy?

Page 16: Stabilization  Enabling Technology

Stabilization Stack

• Self Stabilizing Microprocessor [DH04,DH06]

• Self Stabilizing Operating System [DY04]

• Self-Stabilization Preserving Compiler [DH05]

• Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03]

• Recovery Oriented Programming [BD05]

Page 17: Stabilization  Enabling Technology

Implementation Bottleneck

• Ask Intel, AMD, IBM to design a self-stabilizing microprocessor…

• Technology for converting off-the shelf processor to be self-stabilizing [DH06]

• Ask Microsoft, IBM, Red Hat, to convert existing code of OS to be self-stabilizing…

• Stabilizing Virtual Machine [DY07]

Page 18: Stabilization  Enabling Technology

Enforcing stabilization by resetting

• Processors behave correctly after reset

• Periodic reset ensures correct behavior

• But damages closure…

• Need careful solutions

Page 19: Stabilization  Enabling Technology

Periodic Reset Monitor

• Find a location P in OS code reached at least every T time

• At P:
  – Save necessary information to RAM
  – Request a reset and loop forever.

• Stabilizing watchdog accepts request and resets processor

• Upon reset: restore information and continue

• Stabilizing watchdog verifies that a reset is performed at least every T + epsilon time
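The watchdog side of this scheme can be sketched as a bounded counter. This is an illustrative model only, assuming discrete time ticks and a hypothetical `limit` that plays the role of T + epsilon; the actual hardware interface is not described on the slides. The point of the range check is that a corrupted counter value cannot postpone the reset.

```python
class Watchdog:
    """Toy model of a stabilizing watchdog (names and units are assumptions)."""

    def __init__(self, limit, counter=0):
        self.limit = limit        # ticks allowed between resets (T + epsilon)
        self.counter = counter    # may start in an arbitrary, corrupted state

    def tick(self, reset_requested):
        """Advance one time unit; return True if the processor is reset now."""
        out_of_range = not (0 <= self.counter < self.limit)
        if reset_requested or out_of_range:
            self.counter = 0
            return True
        self.counter += 1
        return False


def ticks_until_reset(wd):
    """Ticks until a reset when the OS never reaches location P."""
    ticks = 1
    while not wd.tick(reset_requested=False):
        ticks += 1
    return ticks


print(ticks_until_reset(Watchdog(limit=5)))              # 6
print(ticks_until_reset(Watchdog(limit=5, counter=99)))  # 1
```

Whatever value the counter starts with, a reset happens within `limit + 1` ticks, even if the OS never requests one.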

Page 20: Stabilization  Enabling Technology

Implementation using Intel XScale core

• Used in numerous processors: Network, I/O, Handheld, Cellular etc.

• RISC architecture (ARMv5 compatible)

• Debug interface
  – Allows interaction between WD and OS
  – External debug break used for notifying the upcoming reset

Page 21: Stabilization  Enabling Technology

Up to now

• Virtual Self-stabilizing processor on top of commercial quality processor

• Towards repeating the concept in OSs and VMMs (enforcing configuration and protecting critical operations)

Page 22: Stabilization  Enabling Technology

Toward Self-Stabilizing Operating System (SOS)

Shlomi Dolev and Reuven Yagel, SAACS’04 Workshop, Zaragoza

Page 23: Stabilization  Enabling Technology

Basic Directions

• Black-box
  – Take existing OS (Unix, Windows, RTOS)
  – Add stabilization layer

• Carefully tailoring a tiny kernel
  – Processor scheduling
  – Memory management
  – Device driver
  – Hosting Byzantine processes

Page 24: Stabilization  Enabling Technology

Assumptions

• Every configuration (processor/memory) is possible

• At least some program code is hardwired (in ROM) and is correct – Harvard Model

• Processor: Instruction manual (e.g. x86/IA-32) defines a transition function. Self-stabilizing [DH04]

Page 25: Stabilization  Enabling Technology

Black Box

• Periodic Reset, Re-install and Execute
  – Watchdog timer (self-stabilizing)
  – Periodic processor reset
  – During bootstrap, OS reinstall from ROM

• Weak self-stabilization
  – E = (ci, ai, ci+1, …, RRE, c1, a1, c2, a2, …, ci, ai, ci+1, …, RRE, c1, a1, c2, a2, …)
  – Is it always acceptable?

• Alternative: Periodic re-install of code only, add consistency check and enforcement

Page 26: Stabilization  Enabling Technology

Tailored Kernel

• Tiny Scheduler

• Tiny Memory Manager

• Requirements:
  – Self-stabilizing
  – Fair
  – Process stabilization preserving (e.g. validity of P.C. value)

Page 27: Stabilization  Enabling Technology

Tiny SOS Scheduler

; increase task
mov word ax, [currentProc]
and ax, PROC_MASK
...
; load task state
...
; restore ip
mov ax, [bx+4]
; validate ip
and ax, IP_MASK
mov word [ss:STACK_TOP], ax
; restore general registers
mov cx, word [bx+12]
mov dx, word [bx+14]
mov si, word [bx+16]
mov di, word [bx+18]

• ~70 lines of a real machine assembly code

• 16bit Real mode & 32bit Protected mode.

• Standard build and emulation tools (Nasm, ld, Bochs)

• Detailed proof of requirement preservation
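The `and ax, PROC_MASK` and `and ax, IP_MASK` instructions in the excerpt suggest the underlying idea: instead of trusting stored values, the scheduler forces them into a valid range on every use. The Python sketch below illustrates that idea only. The constants are assumptions chosen for the example (the slides do not give the real mask values), and a mask yields a valid index only when the number of processes is a power of two.

```python
NUM_PROCS = 4                  # assumed; must be a power of two for masking
PROC_MASK = NUM_PROCS - 1      # 0b11
IP_MASK = 0x0FFF               # assumed: task code lives in a 4 KiB window


def next_process(current_proc):
    """Round-robin step; masking maps any corrupted value to a valid index."""
    return (current_proc + 1) & PROC_MASK


def validate_ip(saved_ip):
    """Force a saved instruction pointer back into the task's code window."""
    return saved_ip & IP_MASK


# Even from a corrupted process index, every process is scheduled again
# within NUM_PROCS steps.
proc = 0xBEEF
visited = []
for _ in range(NUM_PROCS):
    proc = next_process(proc)
    visited.append(proc)
print(visited)
print(hex(validate_ip(0xFFFF)))
```

Masking does not restore the value a process had before the fault; it only guarantees that the scheduler keeps running and keeps being fair, which is what the listed requirements ask of it.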

Page 29: Stabilization  Enabling Technology

Sketch of Proof

• In every execution E, the scheduler code starts executing, and runs from its first instruction to its last instruction, infinitely often

• In every execution E of the scheduler, each process is executed infinitely often

• The self-stabilizing scheduler preserves stabilization of processes.

Page 30: Stabilization  Enabling Technology

Talk Outline

• Self Stabilizing Microprocessor [DH06]

• Self Stabilizing Operating System [DY04]

• Self-Stabilization Preserving Compiler [DH05]

• Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03]

• Recovery Oriented Programming [BD05]

Page 31: Stabilization  Enabling Technology

Self-Stabilization Preserving Compiler

Shlomi Dolev, Yinnon A. Haviv, Department of Computer Science, Ben-Gurion University, Israel

Mooly Sagiv, Department of Computer Science, Tel Aviv University, Israel

Page 32: Stabilization  Enabling Technology

The Gap.

• Need a transformation between:
  – Input program P written in a high abstraction language, e.g., (D)ASM.
  – Output program Q in a machine language, say, JVM.

• Existing compilers?
  – P and Q behave the same when started in the initial state.
  – What if Q reaches an unexpected state due to a soft error experienced by the microprocessor?

Page 33: Stabilization  Enabling Technology

Trivial Example

• A statement of the form: For each i in {0..9} do f(i)

• May be compiled to:

        mov ax, 10
        mov cx, 0
    loop1:
        push cx
        call f
        inc cx
        cmp cx, ax
        jne loop1

• Start with cx=12 inside the loop…

• Moreover: Any runtime mechanism can get stuck / inconsistent.
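The effect of starting inside the loop with cx=12 can be made concrete with a small model. The sketch below is an illustration, not the compiler's output: it assumes cx is a 16-bit register that wraps around on increment, and it compares the exit test from the slide (leave only when cx equals 10) with a hypothetical range-checking loop of the kind a stabilization preserving compiler would have to emit.

```python
WORD = 1 << 16   # cx is assumed to be a 16-bit register, so inc wraps around


def calls_with_jne(cx, bound=10):
    """Count calls to f when the loop exits only on cx == bound."""
    calls = 0
    while True:
        calls += 1               # push cx / call f
        cx = (cx + 1) % WORD     # inc cx
        if cx == bound:          # cmp cx, ax / jne loop1 falls through
            return calls


def calls_with_range_check(cx, bound=10):
    """Count calls to f when every iteration first checks 0 <= cx < bound."""
    calls = 0
    while 0 <= cx < bound:
        calls += 1
        cx += 1
    return calls


print(calls_with_jne(0))            # 10
print(calls_with_jne(12))           # 65534
print(calls_with_range_check(12))   # 0
```

From the intended initial state both loops call f ten times. From the corrupted state the first one keeps calling f until the counter wraps all the way around, while the second one leaves immediately.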

Page 34: Stabilization  Enabling Technology

Stabilization Preserving Compiler – a closer look

State space of P

Ensuring that Q eventually behaves as P:

State space of Q

Page 35: Stabilization  Enabling Technology

The Transformation

Variable declarations

upon <condition_1> do <statement_1>

…

upon <condition_n> do <statement_n>

[Figure labels: Scheduler; Enforce invariants; condition_1 … condition_n; Statement_1 … Statement_n]

Page 36: Stabilization  Enabling Technology

Self-Stabilization Preserving Compiler: Summary

• Front end of compiler for ASM.

• Self Stabilization preserving compiler.
  – Language with clear semantics from any state.
  – New demands for a compiler.

Page 37: Stabilization  Enabling Technology

Talk Outline

• Self Stabilizing Microprocessor [DH04]

• Self Stabilizing Operating System [DY04]

• Self-Stabilization Preserving Compiler [DH05]

• Self-Stabilizing Automatic Recoverer For Eventual Byzantine Software [BDK03]

• Recovery Oriented Programming [BD05]

Page 38: Stabilization  Enabling Technology

Self-Stabilization and Evolving Systems

• Real world systems cannot be verified exhaustively…

• We enforce safety and liveness specifications

• Contract between the client, project manager and programmers, that is checked online!

• Make sure that the additional (thin) monitoring and recovering layer is self-stabilizing

• A change can be made to the implementation/specification to support evolving environments

Page 39: Stabilization  Enabling Technology

Self-Stabilizing Recoverer for Eventual Byzantine

Software

Olga Brukman, Shlomi Dolev, Department of Computer Science, Ben-Gurion University, Israel

Hillel Kolodner, Haifa Research Labs, IBM, Israel

Page 40: Stabilization  Enabling Technology

Software Contains Bugs

• Heisenbugs, corrupt states, leaked resources are common…

• Correct and faultless SW is hard
  – Long-lived running programs, e.g., OS

• Usually software is tested when starting from initial state and considering limited time scenarios.

Page 41: Stabilization  Enabling Technology

Fault Model Reflecting Reality

• Software packages can be trusted to work as required after restart.

• Eventual Byzantine software.

• System administrators and users use reboot to deal with faults.

Page 42: Stabilization  Enabling Technology

Middleware Architecture

[Figure labels: OS; Kernel; OMR; <Preds,RActs>1, <Preds,RActs>2, …, <Preds,RActs>n]

Page 43: Stabilization  Enabling Technology

Monitor-Restarter for Process and Subsystem

<Pred,RActs>1

<Pred,RActs>2

Page 44: Stabilization  Enabling Technology

Restart Actions – Mature Approach

• Subsystem waits for completion of a restart of its components.

• Restart action may vary, depending on component internal state.
  – Reschedule
  – Roll-back
  – Kill & Restart
  – Few restart attempts with more drastic restart actions.
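The idea of a few attempts with increasingly drastic actions can be sketched as a simple escalation loop. Everything here is illustrative: the ordering of the actions, the number of attempts per action, and the `is_healthy_after` callback are assumptions, not part of the described middleware.

```python
ACTIONS = ["reschedule", "roll back", "kill & restart"]   # mildest first


def recover(is_healthy_after, attempts_per_action=2):
    """Escalate through ACTIONS until the component is healthy again.

    `is_healthy_after(action)` is a hypothetical callback that applies the
    action and reports whether the component now satisfies its predicate.
    Returns the list of actions that were tried, in order.
    """
    tried = []
    for action in ACTIONS:
        for _ in range(attempts_per_action):
            tried.append(action)
            if is_healthy_after(action):
                return tried
    return tried   # nothing helped; the caller must escalate further


def only_restart_helps(action):
    """Stand-in for a component that recovers only after a full restart."""
    return action == "kill & restart"


print(recover(only_restart_helps))
```

In this example the two milder actions are each tried twice before the restart succeeds, so five actions are logged in total.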

Page 45: Stabilization  Enabling Technology

Computational Model: rsf-execution

• An execution E is rsf (restart supporting fair)-execution iff E is a fair execution in which every subsystem subi that is initialised during E respects its specification function ssi.

Requirement: Every rsf-execution E has a suffix in which the system respects its specification function ss.

Page 46: Stabilization  Enabling Technology

Tools for Implementation – Black Box Approach

• Software package is a black box.

• Package is monitored by recording its I/O (e.g., strace in Linux).

• Monitors are independent of specific implementation

Page 47: Stabilization  Enabling Technology

Tools for Implementation – Transparent Box Approach

• Software package implementation tool is known.

• Run-Time Reflection tools are used to monitor and restart the package.

• Possible in Java, C++, CORBA, COM.

Page 48: Stabilization  Enabling Technology

Practical Experience: Printers Problem

• Corrupted pdf, doc or ps file sent to printing server.

• Printer can’t print the file.
  – Causes retries by printing server
  – Printer is “stuck” on one job.

• Predicate for printing server:
  – Restrict number of retries, try format conversions, send error message to user.

Page 49: Stabilization  Enabling Technology

Recovery Oriented Programming

Olga Brukman and Shlomi Dolev, Department of Computer Science, Ben-Gurion University, Israel

Page 50: Stabilization  Enabling Technology

Towards Robust Software

• Programming
  – Structured programming, OOD, Design Patterns…

• Testing and debugging
  – Unit testing [JUnit, CppUnit]…
  – Design By Contract (Eiffel)…

• Formal specification languages
  – ASM, IO Automata, NURPL

• Model checking

• Online recovery
  – ROC [PBB02].
  – Self-Stabilizing Autonomic Recoverer for Eventual Byzantine Software [BDK03] - black box software packages.

Page 51: Stabilization  Enabling Technology

Our Contribution

• Program invariants derived from design specifications.
  – Checked every time invariant variables are updated.

• Automatic code generation for invariant verification and recovery upon invariant violation.

• Invariants are verified during runtime.
  – Change of invariant variable is pre-checked in sand-box.
  – Violation is prevented and replaced with a recovery action.
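A rough picture of "pre-check in a sand-box, then commit or recover" is given below. This is a hand-written sketch under simple assumptions: the state is a dictionary, the sandbox is a deep copy of it, and the class and parameter names are invented for the example. The work described on the slides generates such checks automatically; this code does not attempt that.

```python
import copy


class GuardedState:
    """Hypothetical wrapper: every update is tried on a copy first."""

    def __init__(self, state, invariant, recover):
        self.state = state
        self.invariant = invariant   # predicate over the whole state
        self.recover = recover       # returns a state satisfying the invariant

    def update(self, name, value):
        trial = copy.deepcopy(self.state)        # the "sand-box"
        trial[name] = value
        if self.invariant(trial):
            self.state = trial                   # commit the checked update
            return True
        self.state = self.recover(self.state)    # violation prevented
        return False


buf = GuardedState(
    {"used": 0, "capacity": 8},
    invariant=lambda s: 0 <= s["used"] <= s["capacity"],
    recover=lambda s: {"used": 0, "capacity": s["capacity"]},
)
print(buf.update("used", 5), buf.state)    # True {'used': 5, 'capacity': 8}
print(buf.update("used", 99), buf.state)   # False {'used': 0, 'capacity': 8}
```

The recovery action here simply empties the buffer; choosing among several actions based on state and history is a separate concern.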

Page 52: Stabilization  Enabling Technology

Our Contribution Cont.

• Recovery action is chosen depending on the current state and history.
  – Roll back & resume.
  – Wait.
  – Reschedule.
  – Kill & restart.

Page 53: Stabilization  Enabling Technology

External Monitoring

• Monitoring the whole task to catch transient faults after which invariant variables are not changed (and so no invariant checks are done)
  – Liveness problem: monitor over time

Page 54: Stabilization  Enabling Technology

Talk Conclusions

• Self-Stabilization as an effective paradigm for creating robust systems.

• Rigorous approach for designing basic system components
  – Microprocessor
  – Virtual machine monitor
  – Operating system
  – Compiler
  – Evolving and Recovery Oriented

Page 57: Stabilization  Enabling Technology

Self-Stabilizing Virtual Machine Hypervisor Architecture for Resilient Cloud

Alexander Binun, Shlomi Dolev, Reuven Yagel
{binun,dolev,yagel}@cs.bgu.ac.il

Martin Kahil, Mark Bloch, Boaz Menuhin
{kahilm,mbloch}@post.bgu.ac.il, [email protected]

Ben-Gurion University of the Negev

Marc Lacoste, Thierry Coupaye, Aurelien Wailly
{marc.lacoste,thierry.coupaye,aurelien.wailly}@orange.com

Orange Labs, Paris

Page 58: Stabilization  Enabling Technology

INTRODUCTION

Page 59: Stabilization  Enabling Technology

Virtualization

Virtual machines (VM)
– Guest OS runs apps

Hypervisor (HV)
– Hardware for VMs
– Assume that HV is in host OS (e.g. KVM)

Malware (e.g. rootkits)
– Get into HV from VM

Page 60: Stabilization  Enabling Technology

Terminology

Transient failures (TFs): yield arbitrary changes of the system state
– SEU (Single Event Upset), limitations of error detection algorithms

We do not want to say exactly what causes them, as we risk forgetting a scenario

We better assume a resulting arbitrary state

Page 61: Stabilization  Enabling Technology

Terminology, Cont.

Admissible execution: minimal requirements for a system to be recoverable
– e.g., less than one third of the processors are under attack
– Possible limited faults during the automatic recovery from the unanticipated events in the past:
  • CPU failures
  • Message losses

Page 62: Stabilization  Enabling Technology

Self-Stabilization

Legal execution: the desired behavior

Safe state: every execution starting from it is legal

Self-stabilizing algorithm reaches a safe state in some finite time
– In every admissible execution
– From any arbitrary system state
– Without external intervention

Page 63: Stabilization  Enabling Technology

Self-Stabilization, Cont.

The program is stored in a ROM
– Not subject to changes
– Can be periodically reloaded

The system state can undergo unpredicted changes…
– … following which the system should converge to a safe state

Page 64: Stabilization  Enabling Technology

Example: Token Passing Algorithm

P1: do forever
    X1 = XN => X1 := (X1+1) mod (N+1)

Pi, i ≠ 1: do forever
    Xi ≠ Xi-1 => Xi := Xi-1

Each guarded command is an atomic step.

[Figure: processors P1, P2, P3, P4 with variables X1, X2, X3, X4. Code: write-protected. Variables: may be corrupted.]

• Legal execution: exactly one Pi changes Xi, in infinitely many states

• A safe state: X1 = X2 = … = XN

• The only possible execution is: exactly one Pi changes Xi

Page 65: Stabilization  Enabling Technology

Token Passing: Self Stabilization

Normal operation (columns x1 x2 x3 … xN, round number increasing downward):

{0; 0; 0; 0; 0};

{1; 0; 0; 0; 0};

{1; 1; 1; 1; 1};

{2; 1; 1; 1; 1};

{2; 2; 2; 2; 2}

Failure: start from arbitrary values (columns x1 x2 x3 … xN, round number increasing downward):

{3; 7; 2; 3; 0};

{4; …… };

{M; …… };

{M; M; M; M; M}; (safe)

• P1 eventually increments x1
  – In N rounds x1 propagates along the chain, reaches PN, then increases

• In a state S, x1 gets a unique (not yet encountered) value M after incrementing x1 several times

• Then M is propagated to other processors, reaching a safe state

Page 66: Stabilization  Enabling Technology

OUR APPROACH – MAIN PRINCIPLE

Page 67: Stabilization  Enabling Technology

Rootkit Activity – Current State of Art


Page 72: Stabilization  Enabling Technology

Our approach: Software watchdog brings the system into a safe state

[Figure labels: periodic (frequent) "I’m Alive"; Reboot]

Every software component (host, guests) can be corrupted – and the watchdog as well…

Page 73: Stabilization  Enabling Technology

Reload system from ROM upon a hardware timer signal

[Figure labels: Hardware interrupt; SMM; ROM; Consistency check: Tampered => Reboot]

• Validator is write-protected by hardware means:
  – It is the one that guarantees self-stabilization
  – Runs rarely as it is very time-consuming

• Watchdog quickly detects small problems
  – Runs frequently and efficiently
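The validator's consistency check can be pictured as comparing the running image against the write-protected copy and reloading on any mismatch. The sketch below is only a model of that idea: byte strings stand in for the ROM and RAM images, SHA-256 from Python's `hashlib` stands in for whatever comparison the real validator performs, and copying the ROM image back stands in for "reboot and reload". None of these choices come from the slides.

```python
import hashlib


def digest(image):
    """SHA-256 digest of an image given as bytes."""
    return hashlib.sha256(image).hexdigest()


def validate(ram_image, rom_image):
    """Return True if RAM matches ROM; otherwise reload RAM from ROM."""
    if digest(bytes(ram_image)) == digest(rom_image):
        return True
    ram_image[:] = rom_image     # stands in for "Tampered => Reboot"
    return False


rom = bytes(range(16))           # write-protected reference copy
ram = bytearray(rom)             # running copy, may be corrupted

print(validate(ram, rom))        # True: nothing was tampered with
ram[3] ^= 0xFF                   # simulate a rootkit or a bit flip
print(validate(ram, rom))        # False: mismatch detected, image reloaded
print(bytes(ram) == rom)         # True: RAM matches ROM again
```

Hashing a full image is slow compared with a liveness check, which matches the division of labour on the slide: the validator runs rarely, the watchdog runs often.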

Page 74: Stabilization  Enabling Technology

Existing Infrastructure

[Figure: User; VM Manager (create / delete VM); VM1 and VM2, each with User Apps on a Guest OS; T1 (VM1) and T2 (VM2).]

Page 75: Stabilization  Enabling Technology

Existing Infrastructure

[Figure: User; VM1, VM2; Hypervisor with VM Table (VMi, Statei), OS Scheduler, Pool (T1, T2), I/O drivers, Inter-VM traffic; Hardware (CPU).]

1. Schedule VM

2. Activate VM

3. Run CPU during some time

4. Saving CPU state & stop

Page 76: Stabilization  Enabling Technology

Our Architecture: Watchdog

[Figure: Hardware (CPU); Hypervisor with VM Table (VMi, Statei), OS Scheduler, Pool (T1, T2), I/O drivers; Stabilization Manager with Watchdog; Periodic Interrupt.]

• Watchdog: check scheduler state, check VM state, check traffic state

• Safe state? => I’m Alive

• Not alive during a while => reboot

Page 77: Stabilization  Enabling Technology

Our Architecture: Hardware-facilitated integrity checking

[Figure: Hardware (CPU, Timer); Hypervisor with VM Table (VMi, Statei), OS Scheduler, Pool (T1, T2), I/O drivers; Stabilization Manager with Watchdog and Integrity checker.]

• Watchdog (periodic interrupt, every second): check scheduler state, check VM state, check traffic state

• Safe state? => I’m Alive

• Not alive during a while => reboot

• Integrity checker (interrupt, every day): integrity failure => reboot

Page 78: Stabilization  Enabling Technology

Implementation: Employ external tools for examining VMs

[Figure labels: VM; Benign output; I am Alive]

Page 79: Stabilization  Enabling Technology

Implementation: Employ external tools for examining VMs (cont)

[Figure labels: VM; Report malware; Alarm; Kill, suspend etc]

Page 80: Stabilization  Enabling Technology

Future Work

• Test the prototype on real malware collections
  – e.g. TechGainer [1]

• Intelligent safety enforcement
  – If the situation is not severely dangerous: restart only malfunctioning fragments
    • Reset malfunctioning printer instead of rebooting the computer
  – Guarded Commands ([2]) as a basis for the specification language {(guard, action), …}
    • Guard: safety check; action: enforcement
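One possible reading of the {(guard, action), …} proposal is a list of rules in which each guard is a safety check and each action repairs only the fragment whose check failed. The sketch below illustrates that reading with invented names and thresholds. The rule names are added only for reporting, and the real specification language is future work that this code does not try to anticipate.

```python
def enforce(state, rules):
    """Run the action of every rule whose guard (safety check) fails.

    Returns the names of the rules that fired.
    """
    fired = []
    for name, guard, action in rules:
        if not guard(state):
            action(state)
            fired.append(name)
    return fired


def reset_printer(state):
    state["printer_retries"] = 0     # reset the printer, not the computer


def restart_vm(state):
    state["vm_alive"] = True


# Hypothetical rules: (name, guard, action)
RULES = [
    ("printer", lambda s: s["printer_retries"] <= 3, reset_printer),
    ("vm", lambda s: s["vm_alive"], restart_vm),
]

state = {"printer_retries": 7, "vm_alive": True}
print(enforce(state, RULES))   # ['printer']
print(enforce(state, RULES))   # []
```

Only the printer rule fires in the first pass, and nothing fires in the second, since the state already satisfies every guard.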

Page 81: Stabilization  Enabling Technology

Future Work

• Make the architecture support distributed cloud infrastructures
  – E.g. OpenStack [3]

• Are there competitors?
  – Azure [5] – recovery through replication
  – Replicas synchronization algorithm may suffer from transient faults too
