Experimental Evaluation of System-Level Supervisory Approach for SEFIs Mitigation

20
Experimental Evaluation of System- Level Supervisory Approach for SEFIs Mitigation Mrs. Shazia Maqbool and Dr. Craig I Underwood Maqbool 1 MAPLD 2005/P181

description

Experimental Evaluation of System-Level Supervisory Approach for SEFIs Mitigation. Mrs. Shazia Maqbool and Dr. Craig I Underwood. 1. Maqbool. MAPLD 2005/P181. Overview. Context Motivation Mitigation Scheme Top Level Description Protocol Gets Defined Test Bed - PowerPoint PPT Presentation

Transcript of Experimental Evaluation of System-Level Supervisory Approach for SEFIs Mitigation

Page 1: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

Experimental Evaluation of System-Level Supervisory

Approach for SEFIs Mitigation

Mrs. Shazia Maqbool and Dr. Craig I Underwood

Maqbool 1 MAPLD 2005/P181

Page 2: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

Overview

• Context

• Motivation

• Mitigation Scheme Top Level Description Protocol Gets Defined Test BedExperimental Results

• Conclusions

Maqbool 2 MAPLD 2005/P181

Page 3: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

Single Event Functional Interrupts (SEFIs)

• A type of anomaly in microcircuits caused by a single ion strike

• Occurs in sensitive cross-section of the device• User doesn’t have direct access to fault location• Signatures

An upset rate higher than expected Non responding device In a communication network SEFI is an event, which stops

communication Variations in device current consumption

• During a SEFI, device is unavailable to the system • Device is potentially recoverable

Recovery involves resetting or power cycling System recovery requires restoring the device functionality

followed by its state recovery

Maqbool 3 MAPLD 2005/P181

Page 4: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

Mitigation Levels

Using Radiation Hardening Processes

Built in Fault Tolerance Features

Incorporating Redundancy within Device

Error Detection And Corrections (EDACs)

Redundancy Techniques, e.g. Voting, Lockstep etc.

Configuration Scrubbing

Data Handling Networks

Device Level

Unit Level

System/Architectural Level

Maqbool 4 MAPLD 2005/P181

Page 5: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

Motivation• Space applications demand:

more…

Computational power

Standardization

Reusability

but less…

Mass, volume and power budget

Cost

Development time

• Candidate architectures are heavily based on state-of-the-art COTS technology

Reliability

Availability

• SEFIs and single event transients are becoming dominant radiation hazards

• A unit level approach has usually been considered for SEFI mitigation

Maqbool 5 MAPLD 2005/P181

Page 6: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

System Architecture

P ayloadC om puter

S uperv isor

O B C

M assM em ory U n it

S paceW ireR outer

C ode S tore

P ow er S ubsystemC ontro lle r

• A fast data network interlinks all units

Scalable

Distributed

Reusable

•A system level SEFI mitigation

•A diagnosis and recovery (DAR) packet from each unit acts as an indicator of health status for the unit

•The supervisor intervenes when a packet does not arrive or it does not match expectation

Maqbool 6 MAPLD 2005/P181

Page 7: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

Why a System-Level Approach

• Cost-effective

• Adaptable

• Reusable

• Power cycling requirements associated with SEFIs, demands for an external entity to hold state data and to initiate a recovery procedure

• In case of a permanent failure, it can be switched off

• Supervisory functions, network and configuration management can be combined

Maqbool 7 MAPLD 2005/P181

Page 8: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

On-Board ComputerPossible sources of fault Processor Memory

Network interface

Processor

EDAC

Memory

Interface FPGA

OPC

The OBC Subsystem

Am plifie r

C om aprator

Vset

O BCM odule

AD C

InterfaceFPG A

D AC

I

Over-Current Protection Circuitry (OPC)

Required underlying mitigations

EDAC

OPC

Maqbool 8 MAPLD 2005/P181

Page 9: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

SEFI SignaturesD evice Type S ignatures

M icroprocessor

ScreechN on responsive

C alcu la tion errorsC urrent Varia tions

M em ory U pset ra te h igher than expected

N etw ork in terface

Inva lid packetN on responsive

Link estab lishm ent tim e-outTransm it and rece ive tim e-out

Maqbool 9 MAPLD 2005/P181

Page 10: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

Supervisory Protocol• Two Types of packets

Screech Packet Diagnosis And Recovery (DAR) Packet

• DAR task Perform Testing of the Processor Collects error count of the memory unit Updates state data Current consumption of the OBC module will be monitored

System ID

Length Flags Diagnostic health data/Screech data

Maqbool 10 MAPLD 2005/P181

Page 11: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

Diagnosis And Recovery (DAR) Packet Flow

SODARP MarkerStart Sampling Current

Perform Test  

DAR Packet ReceivedCompare with Stored Values  

Enable InterruptsCollect SEU CountSend DAR Packet

Waiting for Supervisor Response

Command to Update Program Memory Update Memory

OBC-Processor OBC- Interface FPGA Supervisor Code Store

DAR Process StartsDisable Interrupts

   

Collect Current Value

Maqbool 11 MAPLD 2005/P181

Page 12: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

Recovery MethodFault Type Recovery Procedure

Screech Reload program memory

Packet time-out (Network problems) Next Slide

Packet time_out (Processor Problem)

Next Slide

Current consumption variations Power cycle and reload memory

SEU count exceeding threshold Reload memory

Test task result mismatch Reset and reload memory

Maqbool 12 MAPLD 2005/P181

Page 13: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

Recovery Method (2)Polling

request tothe O BC -in terface

Linkconnection

tim e-out

R eceivepacket tim e-

out

T ransm itpacket tim e-

out

“I am okay”packet

rece ived

N o

N o

N o

O BCinterfaceblocked

Supervisorin terfaceblocked

Supervisorin terfaceblocked

Yes

Yes

Yes

In terruptrequest

IN TR requesttim e-out

N M I requesttim e-out

R esetrequest

Yes

Yes

“I am okay”packet

rece ived

“I am okay”packet

rece ived

N o

N o

• In case of a processor reset and power cycle, the OBC should be allowed sufficient time for reinitialization

• The supervisor needs to keep a record of recoveries applied

• Consecutive recovery cycles needs to be avoided

Maqbool 13 MAPLD 2005/P181

Page 14: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

Test Bed• Demonstration of the

synchronization protocol PC1 executes the OBC

program PC2 executes the

supervisor program

PC 1 PC 2

H ub

C elox icaR C 203

Parallel P

ort

E thernet

Maqbool 14 MAPLD 2005/P181

Supervisor po lls the O BC every 1s

O BC is required to send back a packet

Supervisor ho lds tim e-out check

If a tim e-out occurs, the O BC program ism anually reset

T im e-out = m axim um tim e required by theboth supervisor and O BC packets to trave l

in netw ork + m arg in

O BC sends a packet to the supervisor every1s

Supervisor ho lds tim e-out check

T im e-out = Invocation period of O BC packet+ m ax. tim e to travel + m arg in

If a tim e-out occurs, the O BC program ism anually reset

Synchronization Scheme 1 Synchronization Scheme 2

Page 15: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

Synchronization Scheme 1

OBC program receives a packet, checks source, if it is from the supervisor program, it sends a packet to the FPGA

FPGA sends the packet to the supervisor program on Ethernet

UDP/IP Packet from the supervisor

Ethernet RC 203 board passes it as it is to the OBC-program

Parallel Port

Parallel Port

Packet on Ethernet

Configuration 1

UDP/IP Packet from the supervisor

Ethernet RC 203 passes only data bytes

OBC program receives data, it sends data bytes to the FPGA

Parallel Port

Parallel Port

FPGA encodes received data into UDP/IP packet

Packet on Ethernet

Configuration 2

Maqbool 15 MAPLD 2005/P181

Page 16: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

Time Measurement Method

Maqbool 16 MAPLD 2005/P181

• The ethereal graphical user interface (GUI) network protocol analyzer was used

It displays time when a packet was capturedIt also displays IP source and destination, protocol type source and destination port for all captured packets. Selecting a packet from the list of captured packets shows total bytes captured on the network medium, Ethernet source and destination addresses, and number of data bytes in the packet.

• Time was measured from the moment it captures packet sent by the supervisor to the point when it captures a return packet from the OBC for synchronization scheme 1.

• For synchronization scheme 2, time was measured between two consecutive packets from the OBC.

Page 17: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

Results (1)

Maqbool 17 MAPLD 2005/P181

 

Data bytes 

Average time between two packets in a pair (Supervisor packet and OBC packet in response)

(ms)

Time required for 1 byte to travel through the systemTime measured in n run with N bytes – time measured in n+1 run with N+K bytes divided by K

(s)

18 101.053 101.264-101.053/18 = 11.7

36 101.264 101.822-101.264/36 = 15.5

72 101.822 102.229-101.822/28 = 14.53

100 102.229 103.677-102.229/400 = 14.85

500 103.677   

Page 18: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

Results (2)

Maqbool 18 MAPLD 2005/P181

Data bytes Experiment Time measured between a supervisor packet and a response packet from the OBC

(s)

18 Configuration 1 with

RC200GetBlockStall function

101053

18 Configuration 1 with

RC200GetBlock function

1201

18 Configuration 2 with

RC200GetBlockStall function

101066

18 Ping program 270

Page 19: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

Synchronization Scheme 2

Maqbool 19 MAPLD 2005/P181

Configuration for Synchronization Scheme 2

OBC sends data bytes to the interface FPGA

Interface FPGA encodes data into UDP/IP packet and writes it on Ethernet

Parallel PortEthernet

Experiment Time measured

Synchronization scheme 2: Time measured between two consecutive packets from OBC (18 data bytes)

261 s

Synchronization scheme 1:

OBC program crashed and reinitialized manuallyFPGA cleared using FTU facility and OBC program reinitialized manually

Time between last OBC packet prior to fault and first packet after recovery12 s, 299 ms and 644 s15 s, 339ms and 501 s

Synchronization scheme 2:

OBC program crashed and reinitialized manuallyFPGA cleared using FTU facility and OBC program reinitialized manually

Time between last OBC packet prior to fault and first packet after recovery11 s, 316 ms and 272 s14 s, 971ms and 559 s

Page 20: Experimental Evaluation of System-Level Supervisory Approach  for SEFIs  Mitigation

ConclusionsA system-level approach has been presented to mitigate SEFIs in data

handling architectures Upset detection is not straightforward, limits effectiveness of currently

available mitigation techniques Increasing SEFI susceptibility in all major data handling device technologies A system level intelligent supervisor allows monitoring of a wide range of

devices with minimal overhead Synchronization is straightforward Two synchronization schemes have been demonstrated Few simple experiments were performed to establish a time-out period for a

packet from the OBC. Once this information was achieved, the system behaved as expected and a

synchronized packet communication was established between the OBC and the supervisor programs

In event of a SEFI, the supervisor program needs to wait until the OBC program is up again. Time-out for this wait period will depend on the recovery latency

Maqbool 20 MAPLD 2005/P181