
An error reporting system
for the upgraded
CDF data acquisition system

Peter J. Musgrave
Department of Physics, McGill University
Montréal, Québec, Canada

November, 1993

A Thesis submitted to the Faculty of Graduate Studies and Research
in partial fulfillment of the requirements for the degree of Master of Science

©Peter J. Musgrave, 1993


Abstract

This thesis describes the data acquisition error monitoring system developed for the 1993-94 physics run of the Collider Detector at Fermilab (CDF). It presents an overview of the CDF data acquisition system, indicating the role that the error monitoring system plays in the experiment. It then describes the custom software and software packages used to meet the error monitoring requirements of the CDF data acquisition system.


Résumé

This thesis describes the new monitoring system for the data acquisition (DAQ) system of the CDF experiment, recently upgraded for the 1994 data-taking period. After an overview of the DAQ, the role of the monitoring system and the requirements it must meet are explained. Finally, the software developed to meet these requirements is described.


Acknowledgements

Paradise
is exactly like where you are right now,
only much,
much,
better.

Laurie Anderson

I would like to thank Ken Ragan for supervising this work. His sense of humor and flexibility made this project fun. Thanks also to Klaus Strahl for his help with Murmur and his willingness to discuss this project at great length. I am also grateful to Kurt Biery for his patience with my questions about the DAQ.

The members of the online software group at Fermilab developed an error reporting system (Murmur) which allowed this project to be completed in a timely manner and saved us a huge amount of work. Thanks!

The software package DAQERI was developed by me. The remaining bugs should be interpreted as design intent.

I enjoyed the company and conversation of my fellow CDF graduate students from McGill and Toronto during my stays at Fermilab. Happy histogramming, folks!

My parents continue to be a perpetual source of support regardless of the direction in which my aspirations and delusions propel me. Many thanks.

Finally, I thank Kathryn Adeney for being my partner on the journey up the mountain.


Contents

Abstract

Résumé

Acknowledgements

1 Introduction

2 The CDF Data Acquisition System
   2.1 The Run 1A System
   2.2 The Run 1B System

3 Error Reporting Objectives
   3.1 Requirements
      3.1.1 Summary of Requirements
   3.2 Desirable Features

4 The CDF Error Reporting System
   4.1 Overview
   4.2 Murmur
      4.2.1 Error Generation
      4.2.2 Error Monitoring
   4.3 DAQERI

5 DAQERI
   5.1 DAQKER
      5.1.1 The Murmur Interface
      5.1.2 DAQGUI Interface
      5.1.3 The DAQKER Data Structure
   5.2 DAQGUI
   5.3 Browse
   5.4 How an Error is Handled

6 Conclusions

Bibliography

List of Figures

2.1 Run 1A data acquisition system
2.2 Run 1B data acquisition system
4.1 CDF error reporting system
4.2 Murmur error display window
4.3 DAQERI main window
4.4 DAQERI node window
4.5 DAQERI node error window
4.6 Logfile browser
5.1 DAQERI kernel data structure

Chapter 1

Introduction

Our present understanding of matter, as composed of the particle families of quarks and leptons, is called the standard model. This model is a result of decades of theoretical and experimental analysis. The standard model has survived all of the experimental tests to which it has been subjected. It predicts that six flavors of quarks exist and five of these have been verified experimentally. The search for the sixth quark, known as the top quark, is underway at the Fermilab Tevatron. The Tevatron is presently the only operational collider which can collide particles at high enough energy to search for the creation of a top-antitop pair.

The Tevatron maintains bunches of protons and anti-protons in a circular orbit via superconducting magnets. It currently operates with six bunches of counter-rotating protons and anti-protons. At several points in the Tevatron ring the bunches of protons and antiprotons are focused to a common interaction point where collisions occur at a center of mass energy of 1.8 TeV at a frequency of 285 kHz.

The Tevatron has two detectors which examine the proton-antiproton collisions and extract signatures of interesting physics events. This thesis describes work relating to one of these, the Collider Detector at Fermilab (CDF). The CDF detector is a large cylindrically symmetric detector. It is described in detail elsewhere [1]. Briefly, it consists of (from the interaction point outward) a silicon vertex detector, vertex time projection chamber, drift chamber, and electromagnetic and hadronic calorimeters surrounded by an array of muon detectors. A superconducting solenoid magnet surrounds the vertex and drift chambers and creates a magnetic field of 1.4 Tesla. Each of these detector elements provides the data it measures in the form of electronic signals. A complete record of an event in the detector consists of approximately 250,000 bytes. It is not feasible to record this amount of data for each beam crossing, and since only comparatively few beam crossings produce interesting physics (of order a few Hertz) much of the recorded data would be discarded. A trigger system is used to examine events in real time and select those which are "interesting". Events selected by the trigger are then read by the data acquisition system (DAQ). These events are buffered and then passed on to further trigger processors which examine the event in more detail and determine if it should be retained.

Events are read from the detector by a variety of modules operating in parallel, so that the time to read out the detector is minimized. This is crucial because while the detector is holding the event data it cannot collect data from ongoing collisions. This is one source of deadtime: time during which collisions are occurring but cannot be recorded. The readout modules are part of the DAQ, which takes event data and passes it through various buffering, filtering and formatting processors. Included in the DAQ is another level of triggering, which can take advantage of the lower data rate from the detector to spend more time examining each event. The DAQ system is analogous to a computer network, with a variety of processors all handling separate parts of some task.

Correct operation of the DAQ system is clearly vital to the operation of the CDF detector. Monitoring the performance of the DAQ system is a key activity while collecting data. The failure or performance degradation of a DAQ node requires immediate attention from the operators. In the physics run which ended in June of 1993 (termed run 1A) the error monitoring of the DAQ nodes was distributed: nodes reported problems in different ways to physically distinct places in the CDF control room. In the 1993-94 physics run (run 1B) of the Tevatron the CDF DAQ will be upgraded and new components will be introduced. Error monitoring for these new components must be provided. It was suggested that the error monitoring system developed for the run 1B components could also handle error reports from the rest of the DAQ.

This thesis describes the development and implementation of this central DAQ error monitoring system. It is structured as follows. We first present an overview of the data acquisition system which was previously employed and the enhancements for run 1B. We then discuss the requirements and objectives for a monitoring system.

Following that we describe the solution which was implemented in general terms. We then discuss the graphical interface used in the CDF control room in more detail. Finally we compare the solution to the stated objectives and summarize the current status of the Murmur/DAQERI system.


Chapter 2

The CDF Data Acquisition System

In this chapter we seek to provide an overview of the CDF data acquisition system, primarily to identify those elements which are sources of error and diagnostic information. Details of the CDF DAQ can be found in the references provided in [1]; a description of the upgraded system is found in [2].

The goal of the DAQ is to take the signals from the detector elements, encode them, apply further trigger criteria and write the accepted events to a storage medium. In order to maximize the chance of recording interesting events the DAQ system must have as little deadtime as possible. Towards this end improvements will be made to the detector DAQ system prior to run 1B. We first describe briefly the system used for run 1A; then we indicate the changes which will be implemented for run 1B.

2.1 The Run 1A System

Data collection necessarily begins with the signals from the various detector components (see Figure 2.1). These signals must be digitized and collected together in a buffer so that the detector and pre-readout trigger can go back to searching for interesting events. In the run 1A system this task is handled in two ways. The calorimetry and central muon chambers are read out by a Redundant Analog Bus-Based Information Transfer (RABBIT) analog interface in conjunction with an MX scanner and Multiple Event Port (MEP) interface to FASTBUS [3]. Tracking detectors are read out by a FASTBUS based time-to-digital conversion (TDC) module in conjunction with a SLAC Scanner Processor (SSP) [4]. Both the MX and SSP scanners have buffers to hold four events each. Event data from the scanners is passed to the event builder over FASTBUS. FASTBUS crates are inter-connected via Segment Interconnect (SI) modules. In the run 1A system there are approximately 60 MX scanners, 25 SSP scanners and 53 FASTBUS crates.

Data acquisition begins when an event passes the pre-readout trigger criteria. When this occurs digitization begins and the MX and SSP scanners are instructed to begin moving data from the front end electronics to one of their data buffers. Once all the scanners have completed this readout the detector can return to the task of searching for another event. The event stored in the scanners is next transferred over four parallel FASTBUS segments to an event builder. The event builder has the task of collecting a complete event description from the buffers of all the scanners and reformatting the data for the subsequent level 3 trigger. Once this task has been completed the event is sent over FASTBUS to a VME interface to one of the processors in the level 3 processor farm. This processor evaluates the complete event and applies selection criteria. If the event passes then it is sent to a "consumer" VAX. Software on this VAX sends the event to a tape drive and may also send it to one or more processes which monitor data quality during the run.

The allocation of buffers in the scanners, event builder and level 3 nodes is handled by a buffer manager process running on a MicroVAX. This process allocates buffers to events when the detector is read out and then sends instructions to move the event into the event builder and on to level 3 as buffers in those elements become available. The buffer manager directs the operation of the DAQ elements via FASTBUS interrupts.

Control and configuration of the DAQ elements is via a VAX process called run control. The detector operators use run control to bring the DAQ elements online. The detector can operate in a partitioned mode with several run control processes using different parts of the detector DAQ (this is particularly useful during testing and calibration). The run control VAXen each have a FASTBUS interface over which control messages and software downloading of FASTBUS components is done. Note that both data and control messages require FASTBUS.

Figure 2.1: Run 1A data acquisition system. A portion of the CDF DAQ for run 1A. Data flow is from top to bottom, starting with the detector calorimetry and tracking chambers and ending when the event is written to tape.

Error messages from run control are handled within the run control terminal display. Error messages from the event builders are displayed on two other screens. These screens are controlled by direct RS-232 links to the event builders. MX errors are reported to the event builder over FASTBUS and included in the event builder error display. The level 3 system has a separate error monitor which provides a graphical overview of the buffer status in the level 3 nodes.

2.2 The Run 1B System

Experience with the DAQ system has shown that there are a number of performance bottlenecks. Each event builder has only one link to level 3 and this restricts the maximum data rate of one event builder into level 3 to of order 20 Hz [2], and with two event builders, 30 Hz. Another bottleneck is due to the buffer manager's use of FASTBUS interrupts via a VAX-FASTBUS interface to direct the data flow. Limitations of VAX message queues cause these instructions to get backed up, and the fact that such control information is carried on the same medium on which data is passed results in data transfers being interrupted by control messages. Performance limitations also arise in the consumer VAXen due to the fact that data from level 3 is carried by FASTBUS, which is limited by the VAX/FASTBUS interface bandwidth of 350 KB/sec.

Enhancements to the Tevatron will increase the luminosity and hence the collision rate for run 1B. In run 2 (1997) accelerator improvements will allow the production of more antiprotons, allowing a greater number of antiproton bunches, in turn allowing the time between beam collisions to be reduced to 400 ns. These improvements will increase the rate of interesting physics events, and a DAQ with a higher throughput will be required. The run 1B system is a step in this direction.

One of the goals of the run 1B DAQ (Figure 2.2) is to eliminate the event builder bottleneck. This is done by moving the event building function to level 3 and introducing parallel data paths from the scanners to the level 3 processors. On the scanner side a new module, the FASTBUS readout controller (FRC), will be used to read out the MX scanners. These will connect via a scanner bus to VME based scanner CPUs (SCPUs) which will reformat and forward FRC data fragments to level 3 processors via an ULTRANET hub. The use of multiple SCPUs and a cross-connect ULTRANET hub allows the transmission of data to be highly parallelized. Events will then be built in the level 3 nodes, where trigger algorithms will be applied as before.

This process will be coordinated by several new elements, in place of the buffer manager. A scanner manager (SM) will direct the data from the FRCs to scanner CPUs and into the level 3 processes. Control of event building will be handled by processes running in the level 3 processor farm. These control elements will no longer use FASTBUS for control and error information. The scanner manager will be implemented on a commercial VME processor. To allow for communication between the scanners, scanner manager and level 3 farm (without using FASTBUS) a reflective memory board will be present in each VME crate containing an SCPU or SM and in each level 3 box. Communication will occur via this shared memory. These modules will also have Ethernet connections to allow for software downloading and error reporting. The VAX run control process will communicate with the SM via a User Control Interface (itself a VME board) which will then send control messages to SCPUs in other crates via reflective memory.

To summarize, the new DAQ system will require the introduction of the following components:

• FASTBUS Readout Controller (FRC)
• Scanner CPU (SCPU)
• Scanner Manager (SM)
• User Control Interface (UCI)

These DAQ elements require a mechanism to report errors. Before we discuss the specific solution which was chosen, we first discuss the general requirements for an error reporting system for the DAQ.

Figure 2.2: Run 1B data acquisition system. Only a portion of the DAQ is shown. Data flows from the calorimetry and tracking detectors at the top of the figure to the tape drive and consumer VAXen at the bottom.


Chapter 3

Error Reporting Objectives

3.1 Requirements

There are a number of objectives which an error reporting system for the CDF detector should meet. The new error reporting system is being developed for two main reasons. Firstly, the upgraded DAQ to be used in run 1B contains new DAQ elements and we require a mechanism to communicate their error and status messages to the detector operators. Secondly, this opens the door to making an effort to centralize the error reporting from existing DAQ elements. Implicit in the requirement for a central error reporting system is the need for the message sending component of the system to be highly portable. A variety of software environments are used in the DAQ system and message generation routines must be available for each environment.

Monitoring the DAQ health is an important activity and information about problems in the DAQ should be easily accessible to the detector operators. This is best handled by a graphical display which provides an "at a glance" summary of the current status of the entire DAQ. When errors do arise it should be possible to access the specific error information quickly. If a large number of errors occur in a short time span the operator should not be overwhelmed with details unless they are specifically requested. In those cases where a DAQ element is recognized as faulty the operator should be able to disable error reports from the node.

Obviously we do not wish the presence of the error reporting system to in any way impede the normal operation of the DAQ. It must be possible to shut down and restart the error reporting system without affecting normal DAQ operation. Conversely, the purpose of the error reporting system is to monitor the DAQ and record the errors which have occurred. It must therefore continue to function normally as elements of the DAQ start, reset or encounter any "reasonable" failure.

Ideally, error reports should be sent directly from a DAQ element to the central monitor. This eliminates the possibility that error reports may be lost because an intermediate node in the error reporting chain has failed. In other words, we wish the network topology to be that of a "star".

The error monitoring system should also provide a log of the complete error history of the detector for a given run. The complete error history is essential for diagnosing faults in the DAQ. This error log should be centralized so that all DAQ errors are in a single file in the sequence in which they occurred.

Finally, errors must be brought to the operators' attention as they occur: the monitoring system must operate in real time.

As described in chapter 2, the CDF detector can be operated in a partitioned mode with control of the different elements handled by separate run control processes. In such cases it is valuable to have an error display window available for each run control while maintaining a central log so that the complete system health can be easily monitored.

3.1.1 Summary of Requirements

We provide a summary of the requirements here for later reference. The error reporting system must be able to:

• handle error messages from the new DAQ elements
• centralize error reports from the remaining DAQ elements where possible
• provide an "at a glance" summary of the state of the DAQ
• allow specific error reports to be accessed quickly and easily
• permit disabling of error reports from nodes

• never interfere in data taking
• be impervious to "standard" DAQ failures
• use a "star" topology
• maintain a central log file
• allow multiple error displays
• perform real-time monitoring

3.2 Desirable Features

In this section we present a number of features (in no particular order) which are desirable in an error reporting system but are not required.

(i) The meaning of error messages from the DAQ may not be clear to the operators of the detector. The ability to access help text for a specific error message would be valuable.

(ii) For some errors the corrective action required will be known. In such cases it is desirable to allow the receipt of the error message to trigger the corrective action.

(iii) Troubleshooting faults in DAQ elements frequently involves soliciting the opinions of people who have detailed knowledge of a specific DAQ element. If the error monitoring system supported monitoring of the DAQ from remote displays then these experts could provide assistance to the operators in a faster and more convenient fashion. Adding and deleting such displays should not affect the displays in the control room in any way.

(iv) In some cases it is not the error message, but rather the frequency with which it occurs, that is an indication of an element's status. The capability to alert an operator only when an error message exceeds a frequency threshold may be useful.


Chapter 4

The CDF Error Reporting System

4.1 Overview

To provide a centralized error reporting monitor for the diverse elements of the CDF DAQ it was decided that each DAQ element would use Ethernet to report errors to a central process. This produces a logical star configuration, although it is of course a physical bus by virtue of the Ethernet standard. This also eliminates the need to send error messages on any of the buses used for moving event data through the system. The use of Ethernet allows the error reporting system to be developed and run on any processor with an Ethernet connection. CDF elected to use a SUN workstation for this purpose.

The CDF error reporting system for run 1B is composed of two software packages: Murmur and DAQERI. Murmur [5, 6] is a software package developed by the on-line software group at Fermilab. DAQERI [7, 8] is an extension to Murmur written by the CDF collaboration. Murmur handles the generation of error reports and transmission over Ethernet, and provides a central server for collecting all the error reports. It allows for logging and display of the error messages, but does not provide any summary of the current status of the DAQ. This task is performed by DAQERI. DAQERI interprets the error messages from Murmur and uses them to keep a record of the status of the DAQ and provide a graphical summary. The inter-relationship of the software components of the error reporting system is illustrated in figure 4.1.

In this chapter we describe these software packages in general terms, discussing what portions of the error reporting problem they solve.

Figure 4.1: CDF error reporting system. DAQ elements report over Ethernet to the Murmur server; the Murmur server, DAQERI kernel, Murmur X clients and DAQERI GUIs run on a SUN workstation.

4.2 Murmur

Murmur is a collection of software routines and tools to handle the generation and recording of error messages over Ethernet. The error sources (termed Murmur clients) make use of Murmur subroutines to package their errors in a form the central logging entity (the Murmur server) will recognize. We now describe the capabilities and operation of Murmur in more detail, beginning with the specification and generation of errors.

4.2.1 Error Generation

A DAQ element sends error messages to the Murmur server via calls to one of a variety of Murmur client subroutines. Subroutine libraries exist for all the software environments found in the CDF DAQ (VAX VMS, VxWorks and assorted unix flavors). Prior to using these messaging subroutines the client programmer must predefine the error messages in a Murmur message file. This file is used by Murmur to produce unique 32 bit error codes. Each error code carries information about the logical type of the client (facility number), the error number and the severity of the error. These message files are used to produce files to be included in the client code as well as records for the central error database. Each error in the message file may have associated with it several additional pieces of information. The programmers provide a one line descriptive text string to indicate the meaning of the error. Optionally they may elect to provide a reference to a help file for the error, or a script file containing commands to be executed when the error is received. This information is appended to the database used by the Murmur server. This approach relieves the client of the task of sending error text with each message. It suffices for the client to send the error code and, optionally, parameters to be inserted in the error message text.
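As an illustration of this division of labor, a client report might look like the sketch below. The routine name murmur_report and the header file are hypothetical stand-ins; the actual Murmur client library calls are documented in [5, 6].

    /* Hypothetical sketch of client-side error reporting.  The real
     * Murmur routine names and signatures differ; FRC_E_BUFOVFL stands
     * for a 32-bit error code generated from the Murmur message file. */
    #include "murmur_client.h"          /* assumed client library header */

    void report_buffer_overflow(int buffer_id)
    {
        /* Only the error code and the parameters to be substituted into
         * the predefined message text are sent; the text itself lives
         * in the central Murmur database. */
        murmur_report(FRC_E_BUFOVFL, "%d", buffer_id);
    }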

Each error described in a message file contains a field to indicate the severity of the message. Murmur provides four severity types:

• informational
• success
• warning
• severe
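For illustration, an error code of this kind can be packed into bit fields. The field widths below are assumptions chosen for the sketch; the actual Murmur bit layout is defined by the message file tools.

    #include <stdint.h>

    /* Illustrative 32-bit error code layout: facility number,
     * error number and severity.  Field widths are assumed. */
    enum severity { SEV_INFO, SEV_SUCCESS, SEV_WARNING, SEV_SEVERE };

    static uint32_t pack_code(unsigned facility, unsigned errnum,
                              enum severity sev)
    {
        return ((uint32_t)facility << 20)   /* bits 20-31 */
             | ((uint32_t)errnum << 4)      /* bits  4-19 */
             | (uint32_t)sev;               /* bits  0-3  */
    }

    static unsigned code_facility(uint32_t code) { return code >> 20; }
    static unsigned code_errnum(uint32_t code)   { return (code >> 4) & 0xffff; }
    static unsigned code_severity(uint32_t code) { return code & 0xf; }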

When the error messages are displayed by Murmur, the display can be configured to show only messages of a certain type or to display the various types in different colors. Apart from these display options Murmur handles all error types in the same manner.

Murmur also provides for stacks of error messages. A client may wish to send a group of errors which are associated with a common cause. To ensure these are entered in the logfile as a contiguous group they are sent as a message stack. This ensures that errors from other nodes will not be interleaved with messages from the error stack.

4.2.2 Error Monitoring

The central error monitoring process, the Murmur server, performs a number of functions. It records all error messages in a central logfile and optionally may direct error messages to one of a number of windows based on preferences indicated by the user prior to starting the server. These preferences are specified by using a Murmur tool, murgui (Murmur Graphical User Interface). Murgui allows users to specify the number and location of display windows for error text. Routing of errors can also be specified; for example, one window can be dedicated to only severe errors and another to errors from only FRC nodes. The error information fields can also be configured by murgui, allowing a user to display only a portion of the error record. This display and routing information is then stored in the central error database. An example of a Murmur error display window is provided in Fig. 4.2. Displays can be added or altered while Murmur is running, but the changes do not take effect until all the existing displays are closed and the new display configuration is used to re-open them.

In addition to display configurations the user may also specify how the logfiles are to be handled. The current logfile always has the same name, and if Murmur is run for extended periods of time the file will become unmanageably large. Murmur therefore allows the logfile to be "recycled": the user can specify a time or size interval after which the contents of the existing logfile will be preserved in a separate file and the current logfile cleared. This also reduces the chance that an error in writing to the current logfile will destroy the complete error history of the system.

Figure 4.2: Murmur error display window

A typical Murmur logfile entry is of the form:

    FRC_S_SUCCESS Successful Execution
    NODE: tor02, APPNAME: FRC, CTIME: 14:07:26,
    CZONE: EDT, PID: 6049, STIME: 14:07:26,
    SDATE: Wed Sep 22, CODE: 80008009

Only the error code, time and parameters (to be inserted in the text string) are sent when a client reports an error; the error text is retrieved from the central error database. The Ethernet address, process ID and other information which is fixed for the client is sent in the initial communication with Murmur. This initial communication consists of a request from the client to connect to the Murmur server, combined with the client information which needs to be sent only once. The server responds by allocating a TCP/IP connection for the client. All subsequent communication takes place on this new connection.

Murmur plays an important role in the new error monitoring system for the CDF DAQ. It can be thought of as providing the "link layer" for the error reports as well as central error logging. Its one shortfall is that it provides only error text, so the status of the system must be deduced by reading back through the error history; it does not provide "status at a glance". In order to provide this, CDF developed an extension to Murmur: DAQERI.

4.3 DAQERI

DAQERI (Data Acquisition Error Reporting Interface) is a software package developed to provide an intuitive "at a glance" summary of the error state and status of the CDF DAQ. It relies on error information passed from Murmur. This required making custom modifications to Murmur so that the Murmur server would start DAQERI and then send error information through a unix pipe to the DAQERI kernel. This pipe also provides a means for sending supplemental information indicating when logfiles have been recycled and when client connect/disconnect events occur. DAQERI and Murmur run on the same platform to allow all this information to be passed via a pipe, instead of echoing all the error reports over Ethernet.

The DAQERI system has two components (see Figure 4.1): a kernel process which examines the error messages from Murmur and keeps track of the DAQ status, and Graphical User Interface (GUI) processes which provide this information to detector operators in a graphical format. The DAQERI kernel allows multiple GUI processes so that several operators can monitor the DAQ performance simultaneously.

The status of DAQ elements on the GUI display is extracted from the type of the Murmur message. Status of nodes is indicated by color. DAQERI supports the four Murmur error types presented above, and allows an additional error category, that of frequency checked errors. A frequency checked error has an associated time constant and thresholds for warning and error states. The errors which occur are integrated over the time period specified by the time constant and then compared to the error thresholds to determine the state of the node. If the number of errors exceeds the warning threshold then the node is treated as being in a warning state. If no further errors arrive after entering the warning state, then the previous errors will "age out" and the node status will revert to normal.

In the graphical display the following node states are distinguished (listed in order of increasing severity, with color indicated in parentheses):

• GHOST: (grey) the node has not yet reported to Murmur, and no information on it is available.

• DISABLED: (blue) a user has disabled DAQERI error reports for this node. The errors will still be processed by Murmur and written to the logfile.

• OK: (light green) the node is operating normally. Informational messages have been received or previous errors were cleared in DAQGUI.

• SUCCESS: (bright green) a Murmur success message has been received. This clears any previous errors or warnings.

• WARNING: (yellow) a Murmur warning message has been received, or a frequency checked error message has exceeded its warning threshold.

• ERROR: (red) a severe Murmur error has occurred or the error threshold of a frequency checked message has been exceeded.

If a given node has received several messages then the color of the node reflects the most serious of these.
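Because the states are ordered by severity, the display logic reduces to taking a maximum. A minimal sketch (the state names follow the list above; the actual DAQERI declarations are not reproduced):

    /* Node states in order of increasing severity, as listed above. */
    enum node_state {
        STATE_GHOST, STATE_DISABLED, STATE_OK,
        STATE_SUCCESS, STATE_WARNING, STATE_ERROR
    };

    /* The node color reflects the most serious active message. */
    static enum node_state node_color(const enum node_state *active, int n)
    {
        enum node_state worst = STATE_GHOST;
        int i;
        for (i = 0; i < n; i++)
            if (active[i] > worst)
                worst = active[i];
        return worst;
    }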

A warning or error state of a node in the DAQERI display can be cleared in one of two ways: the node recovers and sends a success message, or the operator, via DAQGUI, instructs DAQERI to clear the status of the node.

The DAQERI main window (Figure 4.3) contains a number of icons which represent all the nodes in the CDF DAQ. Below this a status window holds other information of interest to the operator (tape drive status, percent deadtime etc.). The colors of the icons in the main window reflect the error status of the underlying nodes. Clicking on an icon results in a new window, the node window (Figure 4.4). This window shows the nodes represented by the icon, and the color of the node buttons reflects their status.

Figure 4.3: DAQERI main window

Figure 4.4: DAQERI node window

Clicking on a node button within the node window produces another window providing more detailed information (Figure 4.5). The resulting node error window lists all the possible errors which could be generated by the node in question. These errors typically require several pages. The user can select a page of errors by clicking on the corresponding page button; the color of the page button reflects the most severe errors on that page. In addition to providing detailed information on the current error status of the node, this window provides buttons which allow for the disabling or clearing of errors either individually or collectively. This window also provides a button to launch a logfile browser.

Figure 4.5: DAQERI node error window

The logfile browser (Figure 4.6) is a general purpose utility for browsing through Murmur logfiles to extract and display errors satisfying selection cuts. For example, it can be used to examine only errors from FRC 20 after 8:00 a.m. When launched from the DAQGUI node error window, the browser selection criteria are automatically set to extract only those errors for the node for which the error window was opened. These cuts may be changed interactively to either narrow or widen the search for errors to be displayed. The browser displays the most recent 100 errors meeting the specified criteria.

Figure 4.6: Logfile browser

The DAQERI kernel supports multiple GUI processes. The error history of each of the GUIs is independent, in the sense that if one GUI clears or disables errors this does not affect the displays of any other GUIs. Requests to open a GUI session can be made while the DAQ monitoring system is in operation and (unlike Murmur) do not have to be configured in advance.


Chapter 5

DAQERI

DAQERI consists of three independent code modules: the kernel (DAQKER), the graphical user interface (DAQGUI) and the logfile browser. While DAQERI was developed specifically for the CDF DAQ, an effort was made to keep the program as general as possible. The description of the DAQ elements and the topology of the icon/node structure in the GUI is read into DAQERI from configuration files. The error information for the nodes within DAQERI is read from files derived from the Murmur error files. The code to handle status messages and update the tape drive displays is specific to CDF; to maintain generality this code can be excluded by setting an appropriate compile-time constant.

This chapter is intended to provide an overview of the implementation of the modules. We first provide an overview of the functionality of each module, and then discuss how the modules handle requests as a system.

5.1 DAQKER

The DAQERI kernel performs the error bookkeeping for all the DAQGUI processes. It receives error messages from the Murmur server, processes them and compares the resulting state of the node to the state presently displayed in each of the GUI sessions. If the error state of a node in a GUI is different from its previous value, then the kernel sends an update message to the GUI via a dedicated pipe. The kernel also listens to pipes from the GUIs and handles requests for specific error information, requests to clear or disable errors, etc.

The kernel interfaces via unix pipes to Murmur and to multiple DAQGUI processes. The information exchanged is detailed in the sections below.

5.1.1 The Murmur Interface

The error messages sent by Murmur are copies of those sent to the logfile, supplemented with a prefix and length code. The prefix code is used to indicate whether an error message is a standard Murmur message, part of a message stack, or some special information from the Murmur server. Special information messages are used to communicate events in the Murmur server which are of interest to DAQKER. At the present time the following prefix types are defined:

• standard error message or head of a stack of messages
• stacked error message
• stop request
• logfile recycled
• client connect
• client disconnect

Note that there is no communication from the DAQERI kernel to the Murmur server.
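The framing on this pipe can be pictured as in the sketch below. The byte layout and the prefix values are assumptions, chosen only to match the prefix-plus-length description above; short reads are ignored for brevity.

    #include <sys/types.h>
    #include <unistd.h>
    #include <stdint.h>

    /* Prefix types, as listed above; the numeric values are assumed. */
    enum murmur_prefix {
        PFX_STANDARD,   /* standard message or head of a stack */
        PFX_STACKED,    /* stacked error message               */
        PFX_STOP,       /* stop request                        */
        PFX_RECYCLED,   /* logfile recycled                    */
        PFX_CONNECT,    /* client connect                      */
        PFX_DISCONNECT  /* client disconnect                   */
    };

    /* Read one framed message from the Murmur pipe: a prefix byte,
     * a length, then the logfile-style message text. */
    static int read_murmur_message(int fd, unsigned char *prefix,
                                   char *buf, uint16_t *len)
    {
        if (read(fd, prefix, 1) != 1)                           return -1;
        if (read(fd, len, sizeof *len) != (ssize_t)sizeof *len) return -1;
        if (read(fd, buf, *len) != (ssize_t)*len)               return -1;
        return 0;
    }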

5.1.2 DAQGUI Interface

DAQGUI processes are started by the DAQERI kernel. Each DAQGUI is passed the names of two pipes as command line arguments. These pipes allow for two-way communication between the DAQKER process and a DAQGUI.

Communication from GUI to kernel consists of the following messages:

• REQ_NODE_INFO: request the complete error status of a particular node and enable immediate updates from the kernel if new errors for this node arrive.

• CANCEL_MONITOR: stop update messages for a given node.

• DISABLE_NODE: disable error recording for a node.

• ENABLE_NODE: enable error recording for a node.

• CLEAR_NODE: clear all errors for a node.

• RESET_NODE: reset a node to the GHOST state.

• CLEAR_ERROR: clear the status of a particular error for a given node.

• CHANGE_DISABLE: toggle the enable/disable state of a specific error of a node.

• BROWSE_NODE: request that a browser be started for a specified node.

• QUIT: terminate the GUI session in response to a quit request from the user.

In addition to these messages there exist messages to perform each of the node actions on collections of nodes, e.g. clear all nodes.

Kernel to GUI communication is via the following messages:

• NODE_STATUS: change the status of a node.

• NODE_RECORD: send a complete record of the status of each error for a node.

• NODE_UPDATE: update the error count and status of an error for a node which is being monitored by a node error window.

• KER_EXIT: the kernel is stopping; stop the GUI process.

• TAPE_UPDATE: change the status of a tape monitor.

The kernel sends a variety of update messages to the GUI in response to both changing error states and requests from the GUI. The GUI makes requests to the kernel in response to actions by the user, for example, clear a particular error.
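The message sets above transcribe directly into declarations. In this sketch the enum values mirror the lists; the message layout (type, logical address, one parameter) follows the description in section 5.2, and its exact form is an assumption.

    /* GUI-to-kernel requests, as listed above. */
    enum gui_request {
        REQ_NODE_INFO, CANCEL_MONITOR, DISABLE_NODE, ENABLE_NODE,
        CLEAR_NODE, RESET_NODE, CLEAR_ERROR, CHANGE_DISABLE,
        BROWSE_NODE, QUIT
    };

    /* Kernel-to-GUI updates, as listed above. */
    enum kernel_update {
        NODE_STATUS, NODE_RECORD, NODE_UPDATE, KER_EXIT, TAPE_UPDATE
    };

    /* A message carries its type, the logical address of the node
     * (facility and node number) and one parameter; some messages
     * are followed by additional data on the pipe. */
    struct daqeri_message {
        int type;
        int facility;
        int node;
        int param;      /* e.g. an error code or a new status */
    };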


5.1.3 The DAQKER Data Structure

In order to get a clear understanding of how the kernel operates, a knowledge of its central data structure is important. This structure is illustrated in Fig. 5.1. The major organizing principle of this data structure is that the status of a node varies from GUI to GUI, because users running separate GUIs may independently clear or disable nodes.

When an error arrives from Murmur we are faced with the problem of mapping it to a specific DAQ element. The Murmur message provides the error number (containing the facility number), the node's Ethernet address and its process id (PID), so these must form the basis for the search. Some DAQ elements are uniquely specified by their Ethernet address and facility number (if we know a priori that there will only be one client of that facility type at that address). In such cases we search using only the Ethernet address and facility number. If there are multiple instances of a facility at a given Ethernet address then we must distinguish them by PID. Since we have no way of knowing the PIDs ahead of time, we must expect some number of nodes with this (Ethernet address, facility number) pair and assign them logical node numbers as they send their first message.

The mapping from a Murmur message to a node record is handled by the physical node table. This table consists of pointers to node records ordered by facility number, Ethernet address and PID. Information on PIDs is not available at the time the table is initialized; in cases where the PID is required to distinguish between multiple instances of a facility, the PID is added when DAQERI receives the first message from the node and the physical node table is then reordered. The DAQ description in the configuration files may also specify that some number of a particular facility is expected but that the Ethernet addresses are not known ahead of time. In such cases the Ethernet address is also entered into the search table when the first message arrives.

Each node in the DAQ is also assigned a logical address which consists of the facility number and a unique node number. Requests from the DAQGUI processes indicate nodes by facility number and node number. The kernel must provide a mapping from this logical address to a node record; toward this end it maintains a table of pointers to node records ordered by logical address.
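As an illustration of the physical node table search, the sketch below keys entries on (facility, Ethernet address, PID) and performs a binary search over the sorted table. The type names are illustrative, not the actual DAQKER declarations.

    #include <stdlib.h>
    #include <string.h>

    struct node_record;                 /* defined elsewhere */

    struct phys_key {
        int facility;                   /* from the error code        */
        unsigned char eth[6];           /* Ethernet address           */
        long pid;                       /* -1 when the PID is unknown */
    };

    struct phys_entry { struct phys_key key; struct node_record *rec; };

    /* Order by facility, then Ethernet address, then PID.  A PID of
     * -1 matches any PID, for facilities known a priori to have a
     * single client per address. */
    static int phys_cmp(const void *a, const void *b)
    {
        const struct phys_entry *x = a, *y = b;
        int c;
        if (x->key.facility != y->key.facility)
            return x->key.facility - y->key.facility;
        c = memcmp(x->key.eth, y->key.eth, 6);
        if (c != 0)
            return c;
        if (x->key.pid == -1 || y->key.pid == -1)
            return 0;
        return (x->key.pid > y->key.pid) - (x->key.pid < y->key.pid);
    }

    static struct node_record *phys_lookup(struct phys_entry *table,
                                           size_t n, struct phys_key key)
    {
        struct phys_entry probe, *hit;
        probe.key = key;
        hit = bsearch(&probe, table, n, sizeof *table, phys_cmp);
        return hit ? hit->rec : NULL;
    }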

Figure 5.1: DAQERI kernel data structure

For a given node there is some information common to all GUIs. In particular, the GUI has a logical (instead of physical) view of the DAQ system and references nodes by facility and node number. The node record in the kernel holds the facility and node number for the (Ethernet address, facility, PID) triplet so that in communication with the GUI the kernel can send the logical address of the node. The node record also contains an array of pointers to GUI records, indexed by GUI number. Since each GUI can independently disable or clear error information from a node, the error status information must be kept independently for each GUI. The GUI record holds a summary of the node's status. It keeps track of the overall status and maintains counters of the number of distinct error messages which contribute to each of the node status levels. For example, it keeps a count of the number of distinct warning messages which have been received and remain active. As a particular error for a node is cleared, the appropriate counter is decremented and the overall status of the node is re-evaluated.

In addition to the overall status information, the kernel must retain information for each node on a per-error basis. Each node record points to a table of errors which has an entry for each error the node may generate. This table is indexed by Murmur error code. The entries in this table point to a node error record which holds pointers to records for each GUI and (for frequency based errors) an error chain. The GUI node error record holds the present status of the error, the number of errors received since it was last cleared and the time it was last cleared. The last cleared time is used for frequency checked errors.
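The records just described (and drawn in Figure 5.1) can be restated as C declarations. The names below are transcribed from the figure; the field types and the bound MAX_GUIS are assumptions.

    #define MAX_GUIS 8              /* illustrative bound               */
    #define MAX_STATUS_TYPES 6      /* one count per node state         */

    struct err_bin {                /* one "time bin" in an error chain */
        int count;                  /* errors in this interval          */
        long time;                  /* start of the interval            */
        struct err_bin *prev;       /* next older bin                   */
    };

    struct gui_err_record {         /* per-GUI view of one error        */
        int status;
        int num;                    /* errors since last cleared        */
        long last_cleared;
    };

    struct node_err_record {
        unsigned long error_code;   /* Murmur error code                */
        struct gui_err_record gui[MAX_GUIS];
        struct err_bin *head_error; /* newest bin (frequency errors)    */
        struct err_bin *tail_error; /* oldest bin                       */
    };

    struct gui_node_record {        /* per-GUI view of one node         */
        int status;
        int monitored;              /* node error window open?          */
        int status_cnt[MAX_STATUS_TYPES];
        int partition;
    };

    struct node_record {
        int facility, node;         /* logical address                  */
        unsigned char eth[6];
        long pid;
        struct gui_node_record *gui[MAX_GUIS];
        struct node_err_record **error_table;  /* indexed by error code */
        int size;                   /* entries in error_table           */
    };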

For frequency checked errors a linked list is used to hold the past error history. The elements of the linked list are allocated dynamically and there is only one error chain per frequency checked error; all GUIs refer to this error chain to determine their status. Each element in the list is a "time bin" holding errors which occurred within the time interval specified in the node's error configuration file. Frequency based errors arise when the number of errors in a given time interval exceeds the threshold for that error. This is determined on a per-GUI basis by starting from the tail of the error chain (the oldest errors), pruning off those errors which have "aged out" and then traversing the list until the bin time exceeds the last time the GUI cleared this error. Errors are accumulated from this point to the head of the list, and the total is then compared to the error thresholds. The common information for each error, such as frequency limits, is kept in a generic error table and is not duplicated for each node.
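A sketch of this per-GUI evaluation, reusing the illustrative types above and the node states from chapter 4: walking from the newest bin toward the oldest and stopping at aged-out bins yields the same total as pruning from the tail, so the actual pruning and bin allocation are omitted for brevity.

    /* Evaluate a frequency checked error for GUI number g.  'window'
     * is the integration time constant; bins whose time predates
     * (now - window) have aged out, and bins older than the GUI's
     * last clear are likewise excluded. */
    static int freq_status(const struct node_err_record *e, int g,
                           long now, long window,
                           int warn_thresh, int err_thresh)
    {
        const struct err_bin *b;
        long cutoff = now - window;
        long cleared = e->gui[g].last_cleared;
        int total = 0;

        for (b = e->head_error; b != NULL; b = b->prev) {
            if (b->time < cutoff || b->time < cleared)
                break;               /* aged out or cleared: stop */
            total += b->count;
        }
        if (total > err_thresh)  return STATE_ERROR;
        if (total > warn_thresh) return STATE_WARNING;
        return STATE_OK;
    }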

All updates in the data structure occur in response to an error from Murmur. This leads to a problem with frequency checked errors: they are expected to age out as time passes with no further errors, but the lists are only checked when a new error arrives. This is handled by having the DAQERI kernel generate errors at the appropriate (future) time. As a Murmur frequency checked error arrives, the time of the oldest time bin is noted, and the error handling routine requests that a "non-Murmur" error be scheduled to coincide with the time when this time bin should age out. The DAQKER main loop injects it at the appropriate time. The error is distinguished internally as DAQKER generated, so that the error counters are not incremented.

DAQERI necessarily interprets the error messages it receives to extract node and facility information. In addition to receiving error information, commands to the kernel can be sent via Murmur messages; commands to start new GUI sessions are handled in this manner.

5.2 DAQGUI

The DAQERI GUI is an interactive display which provides a graphical summary of the status information retained by DAQKER. It makes use of the Motif library of X window routines. The main loop of the DAQGUI process alternately handles X events (button clicks etc.) and polls the pipe from the kernel for status updates. The X events result either in more X events being created (e.g. a button click results in a new window being opened) or in the generation of a message to the kernel (e.g. clear a node). The DAQGUI changes node colors in response to updates from the kernel. This means that a request to disable an error results in a message to the kernel, followed by an update from the kernel to change the node status to disabled.
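Such an alternating loop might be sketched as follows with the Xt calls used by Motif applications; handle_kernel_update is a hypothetical dispatch routine, and the real DAQGUI loop is not reproduced here.

    #include <X11/Intrinsic.h>
    #include <sys/select.h>
    #include <sys/time.h>

    extern void handle_kernel_update(int fd);   /* hypothetical */

    static void gui_main_loop(XtAppContext app, int kernel_fd)
    {
        for (;;) {
            fd_set rd;
            struct timeval tv = { 0, 50000 };   /* 50 ms poll interval */

            /* Handle all pending X events (button clicks etc.). */
            while (XtAppPending(app))
                XtAppProcessEvent(app, XtIMAll);

            /* Poll the pipe from the kernel for status updates. */
            FD_ZERO(&rd);
            FD_SET(kernel_fd, &rd);
            if (select(kernel_fd + 1, &rd, NULL, NULL, &tv) > 0)
                handle_kernel_update(kernel_fd);
        }
    }

An equivalent structure can be obtained by registering the pipe with XtAppAddInput and letting Xt's own event loop dispatch both sources; the explicit alternation above simply mirrors the description in the text.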

The DAQGUI maintains a data structure similar in concept to that used by DAQKER, with one significant difference: the GUI only needs to be concerned with the status of its own nodes and does not need to keep multiple, independent copies of the status of nodes. Messages from the kernel contain a message type, facility and node numbers and a parameter. Depending on the message, subsequent information may also be sent. References in the DAQGUI data structure are based on facility and node number.

The GUI does not keep information on a per-error basis, unless the user has expanded a node into its error display window. Normally the only information a GUI retains is the overall status of the node. Each GUI node record also contains a pointer to an icon record, indicating the icon by which the node is represented. This icon record keeps count of how many nodes are in each error state, and this information is used to determine the icon color. Likewise, each icon has a pointer to the icon above it in the hierarchy to allow changes in a node's status to be propagated up through the hierarchy of icons to the top level.
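
Propagation of a node state change up the icon hierarchy could then be written as in this sketch, where the number of states, their ordering and the field names are our assumptions rather than the real DAQGUI definitions.

    #include <stddef.h>

    #define NUM_STATES 4          /* e.g. ok < warning < error < fatal */

    typedef struct icon_rec {
        int              n_in_state[NUM_STATES]; /* child-node counts   */
        int              color;                  /* current icon color  */
        struct icon_rec *parent;                 /* icon above, or NULL */
    } icon_rec;

    /* The icon shows the worst state any of its nodes is in. */
    static int worst_state(const icon_rec *ic)
    {
        int s;
        for (s = NUM_STATES - 1; s > 0; s--)
            if (ic->n_in_state[s] > 0)
                break;
        return s;
    }

    /* Walk from the node's icon to the top level, moving one node
     * from old_state to new_state at each level and recoloring.     */
    void node_state_changed(icon_rec *ic, int old_state, int new_state)
    {
        for (; ic != NULL; ic = ic->parent) {
            ic->n_in_state[old_state]--;
            ic->n_in_state[new_state]++;
            ic->color = worst_state(ic);   /* may or may not change */
        }
    }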

When a user expands a node into an error list, a request is sent to the kernel for full error information. The kernel responds by sending the error count and status of each error table entry for that node. In addition, the kernel records the fact that the node is being monitored, so that as new error information arrives the information in the GUI window is updated immediately. The GUI sends a message to disable this monitoring when the user closes the window.
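
On the kernel side the bookkeeping for this monitoring is essentially a per-GUI flag in the node record, as in the sketch below (types, sizes and helper names are assumed).

    #define MAX_GUIS 8                     /* assumed limit            */

    typedef struct {
        int count[MAX_GUIS];               /* per-GUI error count      */
        int status;                        /* cleared/active/disabled  */
    } err_entry;

    typedef struct {
        err_entry *errs;                   /* this node's error table  */
        int        nerrs;
        int        monitored[MAX_GUIS];    /* 1 while window is open   */
    } node_rec;

    extern void gui_send_entry(int gui, int index, const err_entry *e);

    /* Window opened (enable=1): dump the full table and start live
     * updates.  Window closed (enable=0): stop forwarding updates.    */
    void handle_monitor_msg(node_rec *nr, int gui, int enable)
    {
        nr->monitored[gui] = enable;
        if (enable) {
            int i;
            for (i = 0; i < nr->nerrs; i++)
                gui_send_entry(gui, i, &nr->errs[i]);
        }
    }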

In addition to the node error information, the GUI processes also provide status information. Information is extracted from Murmur messages by the kernel and updates are sent to the GUIs. Status information is kept in a separate status box below the DAQERI main window (see e.g. Figure 4.2).

5.3 Browse

The browser is a simple program which reads through the logfiles in reverse time order and compares the error entries to the user-specified selection cuts. The most recent 100 errors which pass the cuts are displayed in the main window. The user may specify which fields of the error messages are to be displayed by means of buttons along the bottom of the browser main window.
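
The selection logic amounts to a reverse scan with per-field cuts, as in the sketch below. The log record layout, the cut fields and the helper read_prev_entry (which steps backward through the file) are assumptions, not the actual browser code.

    #include <stdio.h>

    #define MAX_SHOWN 100

    typedef struct {
        int  node, pid, severity;
        char text[132];
    } log_entry;

    typedef struct {
        int node;        /* -1 = accept any node     */
        int pid;         /* -1 = accept any process  */
        int severity;    /* minimum severity, or 0   */
    } cuts;

    extern int read_prev_entry(FILE *log, log_entry *e);

    static int passes(const log_entry *e, const cuts *c)
    {
        if (c->node >= 0 && e->node != c->node) return 0;
        if (c->pid  >= 0 && e->pid  != c->pid)  return 0;
        return e->severity >= c->severity;
    }

    /* Scan backward through the logfile and keep the most recent
     * MAX_SHOWN entries that pass the cuts, newest first.           */
    int select_entries(FILE *log, const cuts *c, log_entry out[MAX_SHOWN])
    {
        log_entry e;
        int n = 0;
        while (n < MAX_SHOWN && read_prev_entry(log, &e))
            if (passes(&e, c))
                out[n++] = e;
        return n;
    }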

When launched from within DAQERI, the browser is started by the DAQERI kernel on behalf of a specific DAQERI GUI. The "Browse Logfile" button is located in a window for a specific DAQ node, and when the browser is started in response to a click on this button it is passed cuts for node number and PID, so that by default it will extract only logfile messages associated with that node. These cuts can be changed interactively within the browser to either narrow or widen the group of errors selected by the browser.

5.4 How an Error is Handled

We now describe how an error from a DAQ element is processed by the complete error handling system. We also use this opportunity to present some details about the operation of Murmur.

A DAQ client realizes that an error condition has arisen. It responds by sending an error message via a Murmur routine. If this is the first error message from this source then the error routine attempts to connect to the Murmur server via a TCP/IP connect socket. As part of the connect message the client sends its Ethernet address, process id and other fixed information, so that this does not have to be repeated with each subsequent error message. The Murmur server accepts the request and establishes a new socket which provides a dedicated TCP/IP connection from the client to the Murmur server. The client then sends all subsequent error messages to this socket.
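
In outline the client side of this handshake is ordinary BSD socket code, as sketched below. The connect-message layout (hello_msg) is an assumption; only the socket calls themselves are standard.

    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    /* Assumed layout of the fixed, once-only connect information.    */
    struct hello_msg {
        char ether_addr[6];     /* client's Ethernet address          */
        int  pid;               /* process id                         */
        /* ... other fixed, per-client information ...                */
    };

    /* Sketch of the client's first-error connection to the server.   */
    int murmur_connect(const char *server_ip, int port,
                       const struct hello_msg *hello)
    {
        struct sockaddr_in addr;
        int fd = socket(AF_INET, SOCK_STREAM, 0);   /* TCP/IP         */
        if (fd < 0)
            return -1;

        memset(&addr, 0, sizeof addr);
        addr.sin_family      = AF_INET;
        addr.sin_port        = htons(port);
        addr.sin_addr.s_addr = inet_addr(server_ip);

        if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
            close(fd);
            return -1;
        }
        /* Send the fixed information once; later error messages on
         * this dedicated socket need not repeat it.                   */
        write(fd, hello, sizeof *hello);
        return fd;      /* reused for all subsequent error messages    */
    }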

The Murmur server polls its receive sockets and finds the message from the client. This message is unpacked and the error code is looked up in the central Murmur database. This database provides the full error text and indicates if the message has any parameters, help text or action scripts associated with it. The expanded error is written to the Murmur logfile and sent to any Murmur display windows which have their routing set to accept it. The message is also sent down a pipe to the DAQERI kernel.
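
Polling a set of receive sockets is classically done with select(), as in the following sketch; unpack_and_expand, log_and_route and the daqeri_pipe descriptor are placeholders for the Murmur internals, which are not reproduced here.

    #include <sys/types.h>
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <unistd.h>

    extern const char *unpack_and_expand(const char *raw, ssize_t n);
    extern void        log_and_route(const char *expanded, int pipe_fd);

    /* Sketch of the server's polling loop over its receive sockets.  */
    void server_poll(int listen_fd, int client_fd[], int *nclients,
                     int daqeri_pipe)
    {
        fd_set rd;
        char   raw[512];

        for (;;) {
            int i, maxfd = listen_fd;
            FD_ZERO(&rd);
            FD_SET(listen_fd, &rd);
            for (i = 0; i < *nclients; i++) {
                FD_SET(client_fd[i], &rd);
                if (client_fd[i] > maxfd) maxfd = client_fd[i];
            }
            if (select(maxfd + 1, &rd, NULL, NULL, NULL) < 0)
                continue;

            /* A connect request: give the client a dedicated socket. */
            if (FD_ISSET(listen_fd, &rd))
                client_fd[(*nclients)++] = accept(listen_fd, NULL, NULL);

            for (i = 0; i < *nclients; i++)
                if (FD_ISSET(client_fd[i], &rd)) {
                    ssize_t n = read(client_fd[i], raw, sizeof raw);
                    if (n <= 0)
                        continue;
                    /* Expand via the central database, then log,
                     * route to displays and forward to the kernel.   */
                    log_and_route(unpack_and_expand(raw, n), daqeri_pipe);
                }
        }
    }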

The DAQERI kernel extracts the Ethernet address, facility number and PID from the error. It uses this information to get a pointer to a node record. The error table indicated by the node record is searched to get to the node error record. The error counts for all GUIs which have not disabled this error are then updated. Any GUI sessions which have a window open for this node are sent a message indicating that the error count (and perhaps status) have changed. For each GUI in the node record the overall node status is updated. If the node state in any of the GUI records changes as a result then a message is sent to the corresponding GUI indicating the new status.
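
Put together, the kernel's handling chain for one incoming error can be sketched as follows; all structures and lookup helpers are assumptions standing in for the DAQERI source.

    #define MAX_GUIS 8                 /* assumed limit               */

    typedef struct {
        int count[MAX_GUIS];           /* per-GUI error count         */
        int disabled[MAX_GUIS];        /* 1 = GUI disabled this error */
    } err_rec;

    typedef struct {
        int monitored[MAX_GUIS];       /* node window open in GUI     */
        int status[MAX_GUIS];          /* per-GUI overall node state  */
    } node_rec;

    extern node_rec *find_node(const char *ether, int facility, int pid);
    extern err_rec  *find_error(node_rec *nr, int code);
    extern void      gui_send_count(int gui, int code, int count);
    extern void      gui_send_status(int gui, int status);
    extern int       recompute_status(const node_rec *nr, int gui);

    void kernel_handle_error(const char *ether, int facility,
                             int pid, int code)
    {
        node_rec *nr = find_node(ether, facility, pid);
        err_rec  *er = find_error(nr, code);
        int g, s;

        for (g = 0; g < MAX_GUIS; g++) {
            if (er->disabled[g])
                continue;                /* GUI has disabled this error */
            er->count[g]++;
            if (nr->monitored[g])        /* node window open: update it */
                gui_send_count(g, code, er->count[g]);

            /* Recompute the overall node status and tell the GUI
             * only if the state it is displaying actually changed.    */
            s = recompute_status(nr, g);
            if (s != nr->status[g]) {
                nr->status[g] = s;
                gui_send_status(g, s);
            }
        }
    }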

The GUI receives a message indicating that the node status has changed, and it responds by changing the color of the node button. It then checks the icon which represents this node and determines if the change in the node state results in a change in the icon state. If this node message results in a new icon state then the colors of the icons above this node in the hierarchy are altered. This alerts the user that an error has occurred.

The user now clicks on the icon and gets a window of node buttons. Clicking on the node button results in a display indicating all the errors for the node, with those contributing to the error state distinguished by color. The user may clear or disable the error, resulting in a message to perform the action being sent to the kernel. Alternately, the user may request to browse the logfile. This results in a browse request being sent to the kernel, which then starts a browser. By selecting "update" in the browser the most recent error messages from this node are displayed and the full error text is available to the user.


Chapter 6

Conclusions

The error reporting system described here has been installed in the CDF control room and is being integrated into the system. It is anticipated that it will be used as part of run 1B. The system meets all the requirements set for it, but at the present time not all elements of the DAQ report errors via Murmur. This is due to an understandable reluctance to rewrite the working code in existing DAQ elements. The Level 3 system in particular continues to make use of its custom graphical error display, and it is our opinion that this display should continue to be used (although it could in time be augmented by error messages to Murmur).

The DAQERI/Murmur system fulfills all of the requirements outlined in section 3.1. Murmur alone satisfies many of these by providing a central error logging system and error sending routines which can run on all the new DAQ elements. The extensions provided by DAQERI provide for the graphical monitoring of the DAQ in an intuitive fashion.

As with any software system of this kind, the evolution of both Murmur and DAQERI is an ongoing process. As detector operators become familiar with the error reporting system, requests for changes in functionality and feature enhancements are inevitable. However, we believe our solution provides the essential elements for error monitoring of the detector and foresee this system being used in one form or another for the remainder of the CDF experiment.

