Dataflow Monitoring
date post
20-Jan-2016Category
Documents
view
27download
0
Embed Size (px)
description
Transcript of Dataflow Monitoring
PowerPoint Presentation
Dataflow MonitoringNicoletta GarelliALICE, ATLAS, CMS & LHCb joint workshop on DAQ@LHC12-14 March 2013Chteau de Bossey
What Dataflow Monitoring MeansMonitoring in real time the flow of data to ensure optimal data takingfrom detector readout to permanent storagetrigger & DAQ quantities (counters, data rate, buffer occupancy, etc.)Avoid dead-time and eventually allow to fix problemsEach experiment uses its own jargon to indicate the same thing12/03/2013N. Garelli (SLAC) - Dataflow Monitoring2RequirementsAccess any relevant information in real time to follow data takingOnline aggregation & data correlationOnline problem detection: dead-time, data losses, etc.Archive: access historical data for diagnostics, statistics, post-mortemUse monitoring data to trigger alarms and/or automatic actions to recover problems12/03/2013N. Garelli (SLAC) - Dataflow Monitoring3BasicsAdded later
EvolutionUsers: shifters & expertsLHC operations at the beginningscattered information and rudimentary toolsshifters: intense monitoring activity experts: high presence in control room + ringing on-call phone 12/03/2013N. Garelli (SLAC) - Dataflow Monitoring4LHC operations routinecoherent information and optimized toolsautomate as much as possible to reduce shifters taskssee Lucas talk of this morningmove from custom GUI to ubiquitous web based toolslets do all with a smartphone
The 4 architectures12/03/2013N. Garelli (SLAC) - Dataflow Monitoring5
MiddlewareEach experiment developed 4 different DAQ systems using different technologiesVariety reflected in dataflow monitoring middlewareLHCb & ALICE: Distributed Information Management (DIM)client/server paradigm, light weight ATLAS: Information Service (IS)custom library on top of CORBAclient-server communication model where information is stored in memory by so called IS serversCMS: Web ServiceCross-DAQ (XDAQ) framework12/03/2013N. Garelli (SLAC) - Dataflow Monitoring6ALICE DAQ12/03/2013N. Garelli (SLAC) - Dataflow Monitoring7
~300 processes on ~300 machines100k dataflow information published every 5 s ~3 GB/h NOTE:HLT monitoring & dedicated expert storage monitoringnot discussed here
ALICE Dataflow Monitoring ArchitectureLogbookmuch more than what you thinkPHP12/03/2013N. Garelli (SLAC) - Dataflow Monitoring8DAQ processesDIM / SMIStatus DisplayBackpressure MonitorLogbookMySQLBased on DIM/SMISMI: framework for designing and implementing distributed control systems developed by DELPHIMySQL:store system configuration LDC&GDC write run info Archive. Logbook as visualization2 monitoring applications Tcl/Tk8ALICE Visualization12/03/2013N. Garelli (SLAC) - Dataflow Monitoring9
ALICE & Android12/03/2013N. Garelli (SLAC) - Dataflow Monitoring10
ATLAS DAQO(20k) processes on ~2k machines1M dataflow information published every 5-10 s ~4 GB/h12/03/2013N. Garelli (SLAC) - Dataflow Monitoring11Custom HWProcessing UnitProcessing UnitProcessing UnitProcessing UnitProcessing NodeLevel 2Processing UnitProcessing UnitProcessing UnitProcessing UnitProcessing NodeEvent FilterOtherFEFEFECalo/ MuonRODRODRODDetector ReadoutProcessing UnitProcessing UnitReadOut SystemProcessing UnitProcessing UnitEvent BuilderProcessing UnitData Logger
DC
BELevel 1~150~100~5~8k~8k
ATLAS Dataflow Monitoring Architecture12/03/2013N. Garelli (SLAC) - Dataflow Monitoring12Information Service (IS)~100 serversMonitoring GUI applicationsWeb Server(Apache)Web IS (Python CGI)Web BrowserDAQ processesMirror Information Service Monitoring GUI applicationsCERN GPNWEB IS: IS gives information access on demand via HTTP protocolpython wrapper accepts HTTP requests & sends back dynamically formed XML text (value of IS obj pointed by given URL)Mirror IS: real-time copy from IS (ATCN) to mirror counterpart in CERN GPNArchive: None.information stored & accessed for ~2 month in RDD files each ~30 s via network monitoring systemWeb Server(Apache)Web IS (Python CGI)Web BrowserShifters Tools in 2012DAQ Panel: tool portal for shiftersDFSummarydynamically constructed web page which computes & displays most important dataflow parameters (~200 variables)~30 s update rateBusy Panel: Qt application for monitoring dead-timeShifter Assistantsee Lucas talk of this morning12/03/2013N. Garelli (SLAC) - Dataflow Monitoring13
CMS DAQ12/03/2013N. Garelli (SLAC) - Dataflow Monitoring14
O(20k) processes on ~2k machinesO(600k) dataflow information published every 1-5 s ~8 GB/h
XDAQ Monitoring & Alarming ServiceFully scalable distributed monitoring & alarming systemService-oriented architecture organized in 3-tier structured collection of communicating components:Sensor: report monitoring data Eventing: scalable publisher-subscriber service orchestrated by a load balancer application (Broker) Collector: build relational tablesLive Access Service: presentation of raw data (Web Service)12/03/2013N. Garelli (SLAC) - Dataflow Monitoring15
Archive: automatic persistency of collected tables in ORACLE according to configurationSubset of information stored: ~30 GB/yMonitoring as a ServiceXDAQ as a Service (XaaS): common infrastructure for both central DAQ & sub-detectors
12/03/2013N. Garelli (SLAC) - Dataflow Monitoring16...ecalhcalcsc
xaasXDAQ as a service
cdaqsentinelmonitorapplicationcentral xaasinteroperable services providing standard functionalities for use in XDAQ environment
All processes organized into searchable groups known as zoneszone defines scope of a distributed XDAQ applicationEach zone has its own monitoring data types (flashlists)CMS VisualizationLabView DAQMon12/03/2013N. Garelli (SLAC) - Dataflow Monitoring17
LHCb DAQO(40 k) processes on 2k machines4M dataflow information published every 5s ~ 11.5 GB/h12/03/2013N. Garelli (SLAC) - Dataflow Monitoring18
LHCb Dataflow Monitoring Architecture12/03/2013N. Garelli (SLAC) - Dataflow Monitoring19UIArchiveAlarmUIUIsUIDetailedUIsStatusCountersRatesAlarmsDetailedCountersControlSystemUIReadoutUnitsUITiming &TriggerCountersRatesCountersRatesInfrastrDataflowDQHistoHLTInfrastrDataflowDQHistoMonitoringInfrastrDataflowDQHistoReconstr.HistosHistosHistosHLT SubFarms X 56DIMInfrastrDataflowDQHistoInfrastrDataflowDQHistoHLT Nodes X ~30DIMDIMHLT SF CtrlInfrastrDataflowStorageEventDataStatus,CountersCountersAutomatic Actions(aggregation)(aggregation)Archive: PVSS ORACLE DBO(200k) values every 5s
19LHCb VisualizationPVSS GUI12/03/2013N. Garelli (SLAC) - Dataflow Monitoring20VT100 graphicsdetailed UI
conclusionS12/03/2013N. Garelli (SLAC) - Dataflow Monitoring21Satisfied?YES, it does the job
BUT 4 different solutions for the same problem sharing experience and maybe even future common solutions?
Lucianos talk on Thursday
12/03/2013N. Garelli (SLAC) - Dataflow Monitoring22