Software Fault Tolerance (SWFT) SWIFI in OSs

Post on 11-Jan-2016


1

Software Fault Tolerance (SWFT)
SWIFI in OSs

Dependable Embedded Systems & SW Group www.deeds.informatik.tu-darmstadt.de

Prof. Neeraj Suri

Constantin Sârbu

Dept. of Computer Science
TU Darmstadt, Germany

2

Fault Removal: Software Testing

So far: Verification & Validation
• Testing techniques
  • Static vs. dynamic
  • Black-box vs. white-box

Last time: Testing of dependable systems
• Modeling
• Fault injection (FI / SWIFI)
• Some existing tools for fault injection

Today: Testing (SWIFI) of operating systems
• WHERE: Error propagation in OSs [Johansson’05]
• WHAT: Error selection for testing [Johansson’07]
• WHEN: Injection trigger selection [Johansson’07]

Next (last before mid-term exam!): Profiling the OS extensions (state change @ runtime)

3

Reminder: SWIFI

General SW
• Manipulate bits in memory locations, registers, buses, etc.
  • Emulation of HW faults
• Change the text segment of processes
  • Emulation of SW faults (bugs, defects)
  • Dynamic: e.g., op-code switch during operation
  • Static: change the source code and recompile (a.k.a. mutation)

What is different in OSs?
• The OS acts as a mediator between HW and user SW applications
• Kernel mode: low accessibility
• A failure of the OS often means failure of the whole system
• Source code is often not available
• Add-on kernel extensions are written by parties other than the OS producer -> lack of experience
• Etc.

4

OS Robustness Testing Efforts at DEEDS

Our research topics presented today:

Error propagation profiling• How errors propagate through OS to the user space• “Error Propagation Profiling of Operating Systems” (DSN’05)

Error selection• How an OS reacts to various types of injected errors• “On the Selection of Error Model(s) for OS Robustness

Evaluation” (DSN’07) Error trigger

• How to choose the injection instant?• “On the Impact of Injection Triggers for OS Robustness

Evaluation” (ISSRE’07)

Slides are the ones presented at each conference!http://www.deeds.informatik.tu-darmstadt.de/aja/

5

Error Propagation Profiling of Operating Systems

Andréas Johansson & Neeraj Suri

Department of Computer ScienceTechnische Universität Darmstadt, Germany

Presented at DSN 2005

6

Motivation

Paper Objectives
• Investigate experimental error propagation profiling of OS interfaces/services
• Quantitative and metrics!
• Dynamism & operational profiles
• Black box with no internal access

[Figure: system stack with the layers Applications, Libraries, Operating System and HW/Drivers.]

7

Profiling

[Figure: components A to F shown unordered, then ranked along an “increasingly bad” axis after profiling.]

8

Profiling

• Experimental technique to ascertain “vulnerabilities”
  • Identify (potential) sources, error propagation & hot spots, etc.
  • Estimate their “effects” on applications
• Component enhancement with “wrappers”
  • if (X > 100 && Y < 30) then Exception();
  • Location of wrappers
• Aspects
  • Metrics for error propagation profiles
  • Experimental analysis
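
The wrapper idea above can be sketched in C. This is a minimal illustration, not the authors' implementation: `driver_service`, `wrapped_service` and the `WRAPPER_REJECTED` code are hypothetical names, and only the condition `X > 100 && Y < 30` comes from the slide.

```c
#include <stdbool.h>

#define WRAPPER_REJECTED (-1)   /* hypothetical error code returned by the wrapper */

/* Hypothetical stand-in for a service on the OS-driver interface. */
static int driver_service(int x, int y) {
    return x + y;
}

/* Executable assertion from the slide: X > 100 together with Y < 30 is
   treated as an erroneous input combination. */
static bool inputs_erroneous(int x, int y) {
    return x > 100 && y < 30;
}

/* Wrapper: checks the inputs first and only forwards valid calls. */
int wrapped_service(int x, int y, int *result) {
    if (inputs_erroneous(x, y)) {
        return WRAPPER_REJECTED;   /* plays the role of Exception() */
    }
    *result = driver_service(x, y);
    return 0;
}
```

Where such a wrapper is placed is exactly what the profiling is meant to decide: it pays off most on the services through which errors propagate.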

9

System Model

Applications

Operating System

Drivers

?

10

Device Driver

Model the interfaces (defined in C)
• Export (functions provided by the driver)
• Import (functions used by the driver)

[Figure: Driver X on top of the hardware, with exported services ds_x.1 … ds_x.m and imported services os_x.1 … os_x.n.]

11

Error Model

• Data-level errors in the OS-Driver interface
  • Wrong values
  • Based on the C type
    • Boundary
    • Special values
    • Offsets
• Transient
• First occurrence

12

Metrics

Three metrics for profiling
1. Propagation - how errors flow through the OS
2. Exposure - which OS services are affected
3. Diffusion - which drivers are the sources

Impact analysis

• Metrics
• Case study (WinCE)
• Results

13

Service Error Permeability

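1. Service Error Permeability:
• Measures one driver’s influence on one OS service
• Used to study service-driver relations

For a driver $D_x$ with imported services $os_{x.z}$ and exported services $ds_{x.y}$, and an OS service $s_i$:

$$POS_i^{x.z} = \Pr(\text{error in } s_i \mid \text{error in } os_{x.z})$$

$$PDS_i^{x.y} = \Pr(\text{error in } s_i \mid \text{error in } ds_{x.y})$$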
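
A permeability value is a conditional probability, so one natural way to estimate it from experiments is as a relative frequency. The sketch below assumes exactly that (the estimator and the function name are assumptions, since the slide only gives the definition): it divides the number of injections whose error was observed at OS service s_i by the number of injections performed.

```c
/* Relative-frequency estimate of one permeability value:
   (#injections whose error was observed at the OS service) / (#injections). */
double estimate_permeability(unsigned propagated, unsigned injected) {
    if (injected == 0) {
        return 0.0;   /* no experiments performed: report no evidence */
    }
    return (double)propagated / (double)injected;
}
```

For instance, 3 propagated errors out of 12 injections would give an estimate of 0.25.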

14

OS Service Error Exposure

2. OS Service Error Exposure:
• An application uses certain services
• How are these services influenced by driver errors?
• Used to compare services

$$E_i = \sum_{D_x} \sum_{os_{x.j}} POS_i^{x.j} + \sum_{D_x} \sum_{ds_{x.j}} PDS_i^{x.j}$$

15

Driver Error Diffusion

3. Driver Error Diffusion:
• Which driver affects the system the most?
• Used to compare drivers

$$D^x = \sum_{s_i} \sum_{os_{x.j}} POS_i^{x.j} + \sum_{s_i} \sum_{ds_{x.j}} PDS_i^{x.j}$$
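
Exposure and diffusion are both sums of permeability values, which a short C sketch can make concrete. It assumes the permeabilities of a single driver have already been estimated and stored in row-major arrays (one row per OS service); the function names and the array layout are illustrative and not taken from the talk.

```c
#include <stddef.h>

/* Contribution of one driver to the exposure of one OS service s_i:
   the POS values of its imported services plus the PDS values of its
   exported services. */
double service_exposure(const double *pos_i, size_t n_imported,
                        const double *pds_i, size_t n_exported) {
    double e = 0.0;
    for (size_t j = 0; j < n_imported; ++j) e += pos_i[j];
    for (size_t j = 0; j < n_exported; ++j) e += pds_i[j];
    return e;
}

/* Diffusion of the driver: the exposure contributions summed over all
   OS services. pos has n_services rows of n_imported values, pds has
   n_services rows of n_exported values. */
double driver_diffusion(const double *pos, const double *pds,
                        size_t n_services,
                        size_t n_imported, size_t n_exported) {
    double d = 0.0;
    for (size_t i = 0; i < n_services; ++i) {
        d += service_exposure(pos + i * n_imported, n_imported,
                              pds + i * n_exported, n_exported);
    }
    return d;
}
```

Summing `service_exposure` over several drivers instead of over services would give the total exposure of one service, which is what makes services comparable.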

16

Impact Analysis

Impact ascertained via failure mode analysis

Failure classes:
• Class NF: No visible effect
• Class 1: Error, no violation
• Class 2: Error, violation
• Class 3: OS crash/hang
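
The failure classes form a severity ordering, so classifying one experiment amounts to checking the most severe outcome first. A minimal sketch, assuming a hypothetical `observation` record filled in by the test applications (the record and its field names are not from the talk):

```c
#include <stdbool.h>

typedef enum { CLASS_NF = 0, CLASS_1 = 1, CLASS_2 = 2, CLASS_3 = 3 } failure_class;

/* Hypothetical record of what was observed after one injection. */
typedef struct {
    bool os_crashed_or_hung;   /* the OS stopped responding or crashed */
    bool spec_violated;        /* an OS service violated its specification */
    bool error_visible;        /* an error was visible, but within the specification */
} observation;

/* Returns the most severe class that matches the observation. */
failure_class classify(observation o) {
    if (o.os_crashed_or_hung) return CLASS_3;
    if (o.spec_violated)      return CLASS_2;
    if (o.error_visible)      return CLASS_1;
    return CLASS_NF;
}
```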

17

Case Study: Windows CE

• Targeted drivers
  • Serial
  • Ethernet
• FI at interface
  • Data-level errors
• Effects on OS services
  • 4 test applications

[Figure: experimental setup with the labels Host, Manager, Test App, OS, Interceptor, Target Driver and Drivers.]

18

Error Model

Data type   C type            #Cases
Integers    int               7
            unsigned int      5
            long              7
            unsigned long     5
            short             7
            unsigned short    5
            LARGE_INTEGER     7
Void        void *            3
Chars       char              7
            unsigned char     5
            wchar_t           5
Boolean     bool              1
Enums       multiple          #identifiers
Structs     multiple          1

Case #   New value
1        previous - 1
2        previous + 1
3        1
4        0
5        -1
6        INT_MIN
7        INT_MAX

LONG RegQueryValueEx([in] HKEY hKey,
                     [in] LPCWSTR lpValueName,
                     [in] LPDWORD lpReserved,
                     [out] LPDWORD lpType,
                     [out] LPBYTE lpData,
                     [in/out] LPDWORD lpcbData);

19

Service Error Permeability

• Ethernet driver
  • 42 imported services
  • 12 exported services
• Most Class 1
• 3 crashes (Class 3)

20

OS Service Error Exposure

• Serial driver
  • 50 imported services
  • 10 exported services
• Clustering of failures

21

Driver Error Diffusion

• Higher diffusion for Ethernet
• Most Class NF
• Failures at boot-up

                       Ethernet     Serial
#Experiments           414          411
#Injections            228          187
#Class NF              330 (80%)    377 (92%)
#Class 1               80 (19%)     25 (7%)
#Class 2               1            9
#Class 3               3            0
Diffusion (Class 1)    0.616        0.460
Diffusion (Class 2)    0.002        0.022
Diffusion (Class 3)    0.007        0

22

On the Selection of Error Model(s) for OS Robustness Evaluation

Andréas Johansson, Neeraj Suri
TU Darmstadt, Germany

Brendan Murphy
Microsoft Research, Cambridge, UK

Presented at DSN 2007

23

Objectives: “What to Inject?”

FI’s effectiveness depends on the chosen error model being (a) representative of actual errors, and (b) effective at triggering “vulnerabilities”.

Comparative evaluation of “effectiveness” of different error models: Fewest injections? Most failures? Best “coverage”?

Propose a composite error model for enhancing FI effectiveness

24

Error Models Focus

• Target: errors arising in device drivers
  • Main source of OS failures [1, 2]
  • Developed by HW vendors
  • Continually evolving
• Considered error models
  • Data-type
  • Bit-flips
  • Fuzzing

[1] Ganapathi et al., LISA’06
[2] Chou et al., SOSP’01

25

System Model

Applications

Operating System

Drivers

OS-App services

OS-Driver services

26

Injection Methodology

[Figure: the Interceptor placed between the Operating System and the Device Driver.]

• Intercepts function calls between OS and driver
• Driver binary modified to use the Interceptor
• OS reconfigured to use the Interceptor
• Implemented for Windows CE .Net
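
The interception step can be pictured as a thin forwarding layer that optionally corrupts one parameter before passing the call on. The C sketch below only illustrates that idea under simplifying assumptions: every service has the hypothetical signature `int (int, int)`, and `interceptor`, `intercepted_call` and `sample_service` are made-up names, not the Windows CE implementation.

```c
/* Signature shared by the intercepted services (an assumption). */
typedef int (*service_fn)(int a, int b);

typedef struct {
    service_fn target;   /* the real service the call is forwarded to */
    int armed;           /* non-zero: corrupt the next call */
    int param_index;     /* 0 corrupts a, anything else corrupts b */
    int new_value;       /* value chosen by the error model */
} interceptor;

/* Hypothetical service, used only for illustration. */
int sample_service(int a, int b) {
    return a - b;
}

/* Forwards the call; if armed, replaces one parameter value first. */
int intercepted_call(interceptor *ic, int a, int b) {
    if (ic->armed) {
        if (ic->param_index == 0) a = ic->new_value;
        else                      b = ic->new_value;
        ic->armed = 0;   /* transient error: inject once, then pass through */
    }
    return ic->target(a, b);
}
```

With `armed` cleared the layer is transparent, which is why the same setup can be used to record calls without injecting anything.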

27

Chosen Drivers & Error Models

Error models: Data-type (DT), Bit-flips (BF), Fuzzing (FZ)

                                 #Injection cases
Driver          Description      DT     BF      FZ
cerfio_serial   Serial port      397    2362    1410
91C111          Ethernet         255    1722    1050
atadisk         CompactFlash     294    1658    1035

28

Error Models – Data-Type (DT) Errors

int foo(int a, int b) {…}

int ret = foo(0x45a209f1, 0x00000000);

29

Error Models – Data-Type (DT) Errors

Case   New value
1      Previous - 1
2      Previous + 1
3      1
4      0
5      -1
6      INT_MIN
7      INT_MAX

Example: INT_MIN = 0x80000000

int foo(int a, int b) {…}

int ret = foo(0x45a209f1, 0x00000000);

30

Error Models – Data-Type (DT) Errors

• Varied #cases depending on the data type
• Requires tracking of the types for correct injection
• Complex implementation but scales well

int foo(int a, int b) {…}

int ret = foo(0x80000000, 0x00000000);
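
For a signed `int` parameter, the seven data-type cases can be generated as in the following sketch. The function name `dt_int_cases` is hypothetical. Wrapping around at `INT_MIN`/`INT_MAX` for the "previous ± 1" cases is an assumption made here to avoid signed overflow, which is undefined behavior in C; the slides do not say how the boundary is handled.

```c
#include <limits.h>

#define DT_INT_CASES 7

/* Fills out[] with the replacement values for one int parameter,
   in the order of the case table (cases 1 to 7). */
void dt_int_cases(int previous, int out[DT_INT_CASES]) {
    /* Assumption: wrap around at the type boundaries instead of overflowing. */
    out[0] = (previous == INT_MIN) ? INT_MAX : previous - 1;
    out[1] = (previous == INT_MAX) ? INT_MIN : previous + 1;
    out[2] = 1;
    out[3] = 0;
    out[4] = -1;
    out[5] = INT_MIN;
    out[6] = INT_MAX;
}
```

Calling `foo(out[5], 0)` then corresponds to the injected call shown on the slide, where the first argument has been replaced by `INT_MIN`.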

31

Error Models – Data-Type (DT) Errors

Data type    C type            #Cases
Integers     int               7
             unsigned int      5
             long              7
             unsigned long     5
             short             7
             unsigned short    5
             LARGE_INTEGER     7
Misc.        void *            3
             HKEY              6
             struct {…}        multiple
             Strings           4
Characters   char              7
             unsigned char     5
             wchar_t           5
Boolean      bool              1
Enums        multiple cases

32

Error Models – Bit-Flip (BF) Errors

int foo(int a, int b) {…}

int ret = foo(0x45a209f1, 0x00000000);

33

Error Models – Bit-Flip (BF) Errors

int foo(int a, int b) {…}

int ret = foo(0x45a209f1, 0x00000000);

1000101101000100000100111110001

34

Error Models – Bit-Flip (BF) Errors

int foo(int a, int b) {…}

int ret = foo(0x45a209f1, 0x00000000);

1000101101000101000100111110001

1000101101000100000100111110001

35

Error Models – Bit-Flip (BF) Errors

int foo(int a, int b) {…}

int ret = foo(0x45a289f1, 0x00000000);

• Typically 32 cases per parameter
• Easy to implement

1000101101000101000100111110001
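
The bit-flip model is a single XOR per case. A minimal sketch for 32-bit parameters (the names `flip_bit` and `bf_cases` are illustrative): flipping bit 15 of 0x45a209f1 yields 0x45a289f1, the value used in the injected call above.

```c
#include <stdint.h>

#define BF_CASES 32

/* Flips one bit; bit must be in the range 0..31 (0 = least significant). */
uint32_t flip_bit(uint32_t value, unsigned bit) {
    return value ^ (UINT32_C(1) << bit);
}

/* Generates all 32 single-bit-flip cases for one parameter value. */
void bf_cases(uint32_t value, uint32_t out[BF_CASES]) {
    for (unsigned bit = 0; bit < BF_CASES; ++bit) {
        out[bit] = flip_bit(value, bit);
    }
}
```

No type information is needed, which is what makes this model easy to implement, at the price of 32 experiments per parameter.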

36

Error Models – Fuzzing (FZ) Errors

int foo(int a, int b) {…}

int ret = foo(0x45a209f1, 0x00000000);

37

Error Models – Fuzzing (FZ) Errors

int foo(int a, int b) {…}

int ret = foo(0x45a209f1, 0x00000000);

0x17af34c2

38

Error Models – Fuzzing (FZ) Errors

int foo(int a, int b) {…}

int ret = foo(0x17af34c2, 0x00000000);

• Selective #cases
• Simple implementation
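
Fuzzing replaces the parameter with a random value, and the number of cases per parameter is a free choice. The sketch below uses Marsaglia's xorshift32 generator as a stand-in so that runs are reproducible; the slides do not say which random source was used, and `fz_cases` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* xorshift32 pseudo-random generator; the state must be non-zero. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fills out[] with n pseudo-random replacement values for one parameter.
   A seed of 0 is mapped to 1 because xorshift32 cannot leave state 0. */
void fz_cases(uint32_t seed, uint32_t *out, size_t n) {
    uint32_t state = seed ? seed : 1u;
    for (size_t i = 0; i < n; ++i) {
        out[i] = xorshift32(&state);
    }
}
```

A fixed seed keeps an experiment repeatable, which matters when a Class 3 failure has to be reproduced.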

39

Comparison

Compare error models on:
• Number of failures
• Effectiveness
• Experimentation time
• Identifying services
• Error propagation

40

Failure Classes & Driver Diffusion

Failure class   Description
No Failure      No observable effect
Class 1         Error propagated, but still satisfied the OS service specification
Class 2         Error propagated and violated the service specification
Class 3         The OS hung or crashed

41

Failure Classes & Driver Diffusion

Failure class   Description
No Failure      No observable effect
Class 1         Error propagated, but still satisfied the OS service specification
Class 2         Error propagated and violated the service specification
Class 3         The OS hung or crashed

Driver Diffusion [3]: a measure of a driver’s ability to spread errors:

$$D^x = \sum_{s_i} \sum_{ds_{x.y}} PDS_i^{x.y}$$

[3] Johansson, Suri, DSN’05

42

Number of Failures (Class 3)

[Figure: bar chart of the number of Class 3 failures (y-axis “#C3 Failures”, 0 to 80) for each error model (DT, BF, FZ) and driver (cerfio_serial, 91C111, atadisk).]

43

Failure Classes & Driver Diffusion

Driver Diffusion (Class 3)

Drivers         DT     BF     FZ
cerfio_serial   1.50   1.05   1.56
91C111          0.73   0.98   0.69
atadisk         0.63   1.86   0.29

[Figure: stacked bars (0% to 100%) showing the share of No failure, Class 1, Class 2 and Class 3 outcomes for each error model (DT, BF, FZ) and driver (cerfio_serial, 91C111, atadisk).]

44

Experimentation Time

                              Exec. time
Driver          Error model   h     min
cerfio_serial   DT            5     15
                BF            38    14
                FZ            20    44
91C111          DT            1     56
                BF            17    20
                FZ            7     48
atadisk         DT            2     56
                BF            20    51
                FZ            11    55

45

Identifying Services (Class 3)

Which OS services can cause Class 3 failures?

Which error model identifies most services (coverage)?

Is some model consistently better/worse?

Can we combine models?

Service DT BF FZ

1 X

2 X X

3 X

4 X X

5 X

6 X X

7 X X

8 X X

9 X X X

10 X X X

11 X X X

12 X

13 X

14 X X X

15 X

16 X X X

17 X

18 X

46

Identifying Services (Class 3 + 2)

Which OS services can cause Class 3 failures?

Which error model identifies most services (coverage)?

Is some model consistently better/worse?

Can we combine models?

Service DT BF FZ

1 O X O

2 X X O

3 X O

4 X X

5 X

6 X X

7 X X O

8 X X

9 X X X

10 X X X

11 X X X

12 O X

13 X

14 X X X

15 X

16 X X X

17 X

18 X

47

Bit-Flips: Sensitivity to Bit Position?

[Figure: number of services identified (y-axis “#Services”, 0 to 10) per flipped bit position (x-axis “Bit position”, from LSB to MSB).]

48

Bit-Flips: Bit Position Profile

[Figure: cumulative number of services identified (y-axis “#Services”, 0 to 18) versus bit position (x-axis “Bit position”, 0 to 30).]

49

Fuzzing – Number of injections?

[Figure: driver diffusion (y-axis “Diffusion”, 0.2 to 2.0) versus number of fuzzing injections (x-axis “#Injections”, 1 to 15) for 91C111, cerfio_serial and atadisk.]

50

Composite Error Model

Let’s take the best of bit-flips and fuzzing
• Bit-flips: bits 0-9 and 31
• Fuzzing: 10 cases

• ~50% fewer injections
• Identifies the same service set

[Figure: number of injections (y-axis “#Injections”, 500 to 3500) per driver (cerfio_serial, 91C111, atadisk), comparing “All BF & FZ” with “Composite”.]
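
Per parameter, the composite model amounts to 11 bit-flip cases (bits 0 to 9 and bit 31) followed by 10 fuzzing cases. A minimal sketch for 32-bit parameters, with hypothetical names (`cm_cases`, `CM_CASES`) and an xorshift32 generator standing in for whatever random source the experiments actually used:

```c
#include <stddef.h>
#include <stdint.h>

#define CM_BF_CASES 11   /* bits 0..9 and bit 31 */
#define CM_FZ_CASES 10
#define CM_CASES    (CM_BF_CASES + CM_FZ_CASES)

/* xorshift32 pseudo-random generator; the state must be non-zero. */
static uint32_t next_random(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fills out[] with the composite cases for one parameter value:
   first the selected bit-flips, then the fuzzing values. */
void cm_cases(uint32_t value, uint32_t seed, uint32_t out[CM_CASES]) {
    size_t n = 0;
    for (unsigned bit = 0; bit <= 9; ++bit) {
        out[n++] = value ^ (UINT32_C(1) << bit);
    }
    out[n++] = value ^ (UINT32_C(1) << 31);

    uint32_t state = seed ? seed : 1u;   /* seed 0 is not valid for xorshift32 */
    for (size_t i = 0; i < CM_FZ_CASES; ++i) {
        out[n++] = next_random(&state);
    }
}
```

That is 21 cases per parameter instead of the 32 bit-flips plus the full set of fuzzing cases.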

51

Composite Error Model – Results

[Figure: stacked bars (0% to 100%) showing the share of No failure, Class 1, Class 2 and Class 3 outcomes for DT, BF, FZ and the composite model (CM), per driver (atadisk, 91C111, cerfio_serial).]

52

Summary

Comparison across three well-established error models (Data-type, Bit-flips, Fuzzing) plus the composite model (CM), on implementation, coverage and execution:

• DT: requires tracking data types; requires few experiments
• BF: found the most Class 3 failures; requires many experiments
• FZ: finds additional services
• CM: profiling gives combined BF & FZ with high coverage

Outlook:
• When to do the injection?
• More drivers, OS’s, models?

58

On the Impact of Injection Triggers for OS Robustness Evaluation

Andréas Johansson, Neeraj Suri
Department of Computer Science
Technische Universität Darmstadt
Germany

Brendan Murphy
Microsoft Research
Cambridge
UK

DEEDS: Dependable Embedded Systems & SW Group www.deeds.informatik.tu-darmstadt.de

Presented at ISSRE 2007

59

Operating System Robustness

Operating System
• Key operational element
• Used in virtually all environments -> robustness!
• Drivers are a major source of failures [1] [2]

[1] Ganapathi et al., LISA’06
[2] Chou et al., SOSP’01

60

Operating System Robustness

• External faults
  • Robustness
  • Drivers
  • Interfaces
• Experimental
  • Fault injection
  • Run-time
• Interface
  • OS-Driver
  • No source code
• Goal
  • Identify services with robustness issues
  • Identify drivers spreading errors

[Figure: layered system model with Applications, OS and Drivers.]

61

Operating System Robustness

The issues behind FI-based OS robustness
• Where to inject? [3]
• What to inject? [4]
• When to inject? [today]

Outline
• Problem definition
• Call strings and call blocks
• System and error model
• Experimental setup and method
• Results

[3] Johansson et al., DSN’05
[4] Johansson et al., DSN’07

62

Fault Injection

• Target: interface OS-Driver
• Each call is a potential injection
• Problem: too many calls
  • First-occurrence
  • Sample (uniform?)

[Figure: timeline of service invocations.]

63

Fault Injection

• Observation: calls are not made randomly
  • Repeating sequences of calls
• Idea: select calls based on “operations”
  • Identify subsequences, select services

64

Call Strings & Call Blocks

• Call string
  • List of tokens (invocations) to a specific driver
• Call block
  • Subsequence of a call string
  • May be repeating
  • Corresponds to a higher level “operation”
  • Used as trigger for injection
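
A repeating call block shows up in a call string as the same token subsequence occurring back to back, which is what the `(747){23}` notation in the serial driver's call string expresses. The sketch below counts such repetitions. It assumes each invocation is encoded as one character, and the function name is hypothetical; it illustrates the concept and is not the extraction method used in the paper.

```c
#include <stddef.h>
#include <string.h>

/* Counts how many times the block of block_len tokens that starts at
   call_string[start] repeats back to back from that position.
   Returns 0 if the block does not fit into the call string. */
size_t count_block_repeats(const char *call_string,
                           size_t start, size_t block_len) {
    size_t len = strlen(call_string);
    if (block_len == 0 || start > len || block_len > len - start) {
        return 0;
    }
    size_t repeats = 1;
    size_t pos = start + block_len;
    while (block_len <= len - pos &&
           memcmp(call_string + start, call_string + pos, block_len) == 0) {
        ++repeats;
        pos += block_len;
    }
    return repeats;
}
```

For the call string `D02775747747747732`, the block of length 3 starting at index 6 (`747`) repeats three times.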

65

System and Error Model

• Error model: bit-flips
  • Shown to be effective
  • Simple to implement
• Injection
  • Function parameter values

66

Experimental Process

• Execute workload
  • Record call string
• Extract call blocks
  • Select service targets (1 per call block)
• Define triggers
  • Based on tracking call blocks
• Perform injections
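
A trigger "based on tracking call blocks" can be pictured as a small matcher that follows the stream of invocations and fires when the selected service is reached inside the chosen call block. The sketch below is one possible reading, not the mechanism from the paper: invocations are single-character tokens, matching restarts naively on a mismatch, and `block_trigger` and `trigger_on_call` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *block;   /* tokens of the call block to track */
    size_t target;       /* index within the block of the service to inject into */
    size_t matched;      /* number of block tokens matched so far */
    bool fired;          /* transient model: fire only once */
} block_trigger;

/* Feeds one observed invocation to the trigger. Returns true exactly
   once: when the target service is reached inside the tracked block. */
bool trigger_on_call(block_trigger *t, char token) {
    size_t len = strlen(t->block);
    if (t->fired || t->target >= len) {
        return false;
    }
    if (token != t->block[t->matched]) {
        t->matched = 0;          /* mismatch: restart (naive, no backtracking) */
    }
    if (token != t->block[t->matched]) {
        return false;            /* token does not start the block either */
    }
    bool fire = (t->matched == t->target);
    t->matched = (t->matched + 1) % len;
    if (fire) {
        t->fired = true;
    }
    return fire;
}
```

With the block `747` and target index 1, the trigger stays silent during the first six invocations of `D02775747…` and fires on the `4` of the first `747`.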

67

Injection Setup

• Target OS: Windows CE .Net
• Target HW: XScale 255

68

Failure Classes

Failure class   Description
No Failure      No observable effect
Class 1         Error propagated, but still satisfied the OS service specification
Class 2         Error propagated and violated the service specification
Class 3         The OS hung or crashed

69

Selected Drivers

• Serial port driver
• Ethernet card driver

Workload/driver phases:

70

Serial Driver Call String and Call Blocks

Call string:

D02775(747){23}732775(747){23}23

Phases: Init, Working, Clean up

71

Ethernet Driver Call String and Call Blocks

72

Driver Profiles

• Driver invocation patterns differ
• Impact of call block injection efficiency

[Figure: invocation profiles of the Serial and Ethernet drivers.]

74

Serial Driver Results

75

Serial Driver Service Identification

FO δ α β1 γ1 ω1 β2 γ2 ω2

CreateThread x x x

DisableThreadLibraryCalls x x

EventModify x x

FreeLibrary x x

HalTranslateBusAddress x

InitializeCriticalSection x

InterlockedDecrement x

LoadLibrary x x

LocalAlloc x x

memcpy x x x

memset x x x

SetProcPermissions x x x

TransBusAddrToStatic x

76

Ethernet Driver Results

              Serial               Ethernet
Trigger       #Injections   #C3    #Injections   #C3
First Occ.    2436          8      1820          12
Call Blocks   8408          13     2356          12

77

Summary

• Where, What & When?
• New timing model for interface fault injection
  • Faults in device drivers
  • Based on call strings & call blocks
• Results
  • Significant difference
  • More services
  • Driver dependent
  • Driver profiling
  • More injections (2436 vs. 8408)
  • Focus on init/clean up?

78

Discussion & Outlook

• Call block identification
  • Scalability?
  • New data structures (suffix trees)
• Call block selection
  • Working phase vs. initial/clean up
• Determinism & concurrency
• Workload selection
• Error models