ALICE Data Challenge V
P. VANDE VYVRE – CERN/PH
LCG PEB - CERN March 2004
LCG PEB March 2004 2 P. VANDE VYVRE CERN-EP
Logical Model and Requirements

[Figure: logical model of the ALICE DAQ dataflow. Detector digitizers feed the front-end pipelines/buffers; trigger Levels 0/1 issue the first decision, Level 2 the next, and the High-Level Trigger the final one. Data travel over the Detector Data Link (DDL) at up to 25 GB/s into the readout buffers, then at 2.50 GB/s into the sub-event buffers of the Local Data Concentrators (LDC), across the Event-Building Network into the event buffers of the Global Data Collectors (GDC), and finally at 1.25 GB/s over the storage network from Transient Data Storage (TDS) to Permanent Data Storage (PDS). The chain downstream of the DDL is the part tested during the ADC.]
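The three bandwidth figures in the logical model imply a staged reduction of the data rate. A minimal sketch, using only the numbers quoted on the slide (the trigger-rejection mechanics behind each factor are not detailed here):

```python
# Bandwidth quoted at each stage of the ALICE DAQ dataflow (from the figure).
stages = [
    ("detector readout",  25.0),   # GB/s over the DDLs
    ("event building",     2.50),  # GB/s into the LDC/GDC chain
    ("permanent storage",  1.25),  # GB/s to PDS
]

# Implied reduction factor between consecutive stages.
for (name_a, bw_a), (name_b, bw_b) in zip(stages, stages[1:]):
    factor = bw_a / bw_b
    print(f"{name_a} -> {name_b}: {bw_a} GB/s -> {bw_b} GB/s (x{factor:g} reduction)")
```

This gives a factor of 10 between readout and event building and a further factor of 2 before storage.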
Architecture & Performance Goals (1)
• DAQ project:
  • System size and scalability:
    • Scale similar to ALICE year 1 (2007 pp and Pb-Pb runs)
    • 30% of final performance: scalability up to 150 nodes
  • System performance:
    • ALICE data traffic: verify optimal usage of computing resources
    • Verify load balancing
  • From DAQ to MSS:
    • Tape: 300 MB/s sustained over a week
    • Disk: 450 MB/s peak needed
  • Performance monitoring
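The tape milestone above fixes the total volume to be written. A quick back-of-the-envelope check (our arithmetic, not a figure from the slide):

```python
# Total data volume implied by 300 MB/s to tape sustained over one week
# (decimal units, as storage capacities are usually quoted).
rate_mb_s = 300
seconds_per_week = 7 * 24 * 3600
total_tb = rate_mb_s * seconds_per_week / 1e6  # MB -> TB
print(f"{total_tb:.0f} TB over one week")  # 181 TB
```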
Architecture & Performance Goals (2)
• Offline project:
  • Simulated raw data from several detectors (large and small data fragments)
    • Used during ADC V: TPC, ITS
    • Other detectors: dummy data of realistic size
  • Different trigger classes and detector sets with realistic multiplicity
  • Read data back
  • Improve alimdc (ROOT formatting program)/CASTOR performance
  • Algorithms from the HLT project used for data-monitoring purposes
  • Automatic registration of files in the AliEn catalogue for world-wide availability
Technology Goals
• CPU servers:
  • Mostly dual-CPU machines (LXSHARE)
  • SMP machines (HP Netservers) for DAQ services (ALICE)
  • IA-64 technology: test DATE code on Itanium
• Network:
  • New generation of NICs (Intel PRO/1000)
  • Trunking
  • 10 Gbit Ethernet backbone, including NICs
• Storage:
  • Disk servers: 23 new IDE-based disk servers (nominal performance: RFIO @ 90 MB/s)
  • Tapes: STK 9940B, ~30 MB/s, ~200 GB/volume
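The tape-drive specification above, combined with the 300 MB/s goal from the previous slide, fixes the scale of the tape installation. A minimal sketch of that arithmetic (our derivation; the slide only quotes the per-drive figures):

```python
# How many STK 9940B drives and cartridges does a 300 MB/s tape goal imply?
import math

target_mb_s = 300   # sustained goal to tape
drive_mb_s = 30     # ~ per-drive rate of an STK 9940B
vol_gb = 200        # ~ capacity per tape volume

drives = math.ceil(target_mb_s / drive_mb_s)      # drives writing in parallel
gb_per_day = target_mb_s * 86400 / 1000           # GB written per day
volumes_per_day = math.ceil(gb_per_day / vol_gb)  # cartridges consumed per day
print(drives, "drives,", volumes_per_day, "volumes/day")
```

Ten drives writing in parallel: consistent with the 10 tape servers in the hardware-architecture slide that follows.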
HW Architecture
[Figure: hardware architecture of the testbed.
• ~80 CPU servers (LDCs and GDCs): 2 x 2.4 GHz Xeon, 1 GB RAM, Intel 8254EM Gigabit in PCI-X 133 (Intel PRO/1000), CERN Linux 7.3.3, attached to 3COM 4900 switches (16 x Gbit each).
• 4 x 7 disk servers: 2 x 2.0 GHz Xeon, 1 GB RAM, Intel 82544GC.
• 32 IA-64 HP rx2600 servers: 2 x 1 GHz Itanium-2, 2 GB RAM, Broadcom NetXtreme BCM5701 (tg3), RedHat Advanced Workstation 2.1, 6.4 GB/s to memory, 4.0 GB/s to I/O; connected via 32 x GE.
• 10 tape servers.
• Backbone: Enterasys ER16 router (16 slots, 4/8 x Gbit or 1 x 10 Gbit per slot) and Enterasys E1 OAS (12 Gbit, 1 x 10 Gbit), interconnected with 10GE links; the edge 3COM 4900 switches uplink over 4 x GE trunks.]
System Setup: CPU servers
Date       | CPU servers requested (Cocotime) | CPU servers allocated & used | LCG Openlab | Comments
March 2003 | 30      |         |           | Not used by ALICE due to an internal review
April 2003 | 150     |         |           |
Jul. 2003  |         | ~ 80    | 5         | DAQ + network tests, addition of IA64 nodes, setup of perf. monitoring
Aug. 2003  |         | ~ 80    | 20        | New CPU SEIL servers; network problems
Sep. 2003  |         | ~ 80    | 20        | Broadcom NIC replaced by Intel; Enterasys ER16 replaced by N7
Oct. 2003  | 30      | ~ 80    | 20        |
Nov. 2003  | 150     | ~ 80    | 20        |
Dec. 2003  |         | 64 (70) | 20        |
Jan. 2004  |         | 64 (70) | 20        | Production
Feb. 2004  |         | 64 (70) | 20 (+ 15) | Production
System Setup: Storage
Date       | Number of disk servers | Requested bw to disk (MB/s) | Measured bw to disk (MB/s) | Requested bw to tape (MB/s) | Measured bw to tape (MB/s)
March 2003 |    | 100 |     |     |
April 2003 |    | 450 |     | 300 |
Oct. 2003  |    | 100 | 100 | 300 |
Nov. 2003  |    | 450 | 450 | 300 | 300
Dec. 2003  |    | 450 |     | 300 |
Jan. 2004  | 21 | 450 |     | 300 |
Feb. 2004  | 21 | 450 |     | 300 |
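Spreading the 450 MB/s disk target over the 21 disk servers of early 2004 gives a useful sanity check against the ~90 MB/s nominal RFIO figure from the Technology Goals slide. A minimal sketch (our arithmetic, not a number from the table):

```python
# Average per-server load if 450 MB/s is spread evenly over 21 disk servers,
# compared to the ~90 MB/s nominal RFIO throughput per server.
total_mb_s = 450
servers = 21
nominal_mb_s = 90

per_server = total_mb_s / servers
headroom = nominal_mb_s / per_server
print(f"{per_server:.1f} MB/s per server, ~{headroom:.1f}x below nominal")
```

The average load sits well under the nominal single-server figure, which suggests the difficulties reported later came from reliability and request distribution rather than raw per-server throughput.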
ALICE DC: Scalability
ALICE DC – DAQ Bw
[Chart: DAQ bandwidth, 1998–2006, in MBytes/s (scale 0–3000). Series: DAQ bw goal; DAQ bw with ALICE traffic; DAQ bw with equal traffic from all sources.]
Trunking ADC IV
[Chart: throughput (MB/s, 0–500) vs. number of LDCs (1–7) over a trunk of 3 x Gb Ethernet, with LDCs distributed across switches vs. on the same switch.]
Trunking ADC V
[Chart: throughput (MB/s, 0–600) vs. number of LDCs (1–9) over a trunk of 4 x Gb Ethernet, with LDCs distributed across switches vs. on the same switch.]
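Why trunking throughput grows with the number of LDCs rather than jumping straight to the aggregate: link aggregation typically pins each flow to one physical link by hashing its addresses, so a single LDC-to-GDC stream never exceeds one Gigabit link (~125 MB/s) and the trunk only approaches 4 x 125 MB/s when flows hash evenly. A minimal sketch of this idea; the hash below is illustrative, not the switch's actual distribution algorithm, and the node names are hypothetical:

```python
# Illustrative flow-to-link assignment for a 4-way Ethernet trunk.
import zlib

LINKS = 4
GB_ETH_MB_S = 125  # 1 Gbit/s expressed in MB/s, ignoring framing overhead

def link_for_flow(src: str, dst: str) -> int:
    """Pick a trunk member by hashing the endpoint pair (illustrative hash)."""
    return zlib.crc32(f"{src}-{dst}".encode()) % LINKS

# Nine hypothetical LDCs all sending to one GDC, as in the ADC V trunk test.
flows = [(f"ldc{i:02d}", "gdc01") for i in range(1, 10)]
loads = [0] * LINKS
for src, dst in flows:
    loads[link_for_flow(src, dst)] += 1
print("flows per link:", loads)  # distribution depends entirely on the hash
```

With few LDCs the hash can easily leave links idle, which is one plausible reason trunking "does not scale as expected" below the flow count needed to fill all members.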
ALICE DC – MSS Bw (1)
[Chart: MSS bandwidth, 1998–2006, in MBytes/s (scale 0–1400). Series: MSS bw initial goals; MSS bw achieved; tape bw LCG.]
alimdc/rootd/castor bw between 2 nodes: 30 MB/s
ALICE DC – MSS Bw (2)
ALICE DC – MSS Bw (3)
Achievements (1)
• System size
• System scalability (hardware and DATE software)
• Performance test with ALICE data traffic:
  • ALICE-like traffic: LDCs working in ALICE conditions, with a realistic ratio of event rates and sub-event sizes from one LDC to another
  • ALICE-like events using simulated data: realistic (sub-)event size on tape (ALICE year 1)
• DATE load balancing demonstrated and used
• Sustained bandwidth to tape not achieved:
  • Peak 350 MB/s
  • Production-quality level reached only in the last week of the test
  • Sustained 280 MB/s over 1 day, but with interventions
• IA-64 nodes from Openlab successfully integrated in ADC V
Achievements (2)
• Simulated raw data used for performance tests:
  • Several detectors
  • Several triggers
• Data read back from CASTOR:
  • Data read back and verified
  • Data fully reconstructed
• alimdc/CASTOR bandwidth: from 3 to 10 MB/s per data stream
• Algorithms from the HLT successfully integrated
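The 3–10 MB/s per-stream figure above sets how much parallelism the 300 MB/s tape goal requires. A quick estimate (the stream counts are our derivation, not a number from the slide):

```python
# Parallel alimdc/CASTOR streams needed to sustain 300 MB/s to tape,
# for the slow (3 MB/s) and fast (10 MB/s) per-stream rates observed.
import math

target_mb_s = 300
streams = {per: math.ceil(target_mb_s / per) for per in (3, 10)}
for per, n in streams.items():
    print(f"{per} MB/s per stream -> {n} parallel streams")
```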
Hardware components
• Network:
  • Between LDCs and GDCs: stable and scalable, including trunking
  • Between GDCs and disk servers: unreliable
    • Trunking not scaling as expected
    • Module broken and replaced twice in the Enterasys router; network either seriously degraded or completely unusable
  • 10 Gbit Ethernet backbone
  • New generation of NICs: Broadcom NICs unreliable, replaced by Intel PRO/1000
• CPU: several CPU servers unusable (~3 out of 70)
• Storage:
  • Hardware problems on the disk servers (unrecovered hard-disk failures)
  • Unfortunate reaction from CASTOR, concentrating requests on the faulty machine
  • Several last-minute workarounds needed (scripts for monitoring and reconfiguring)
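A minimal sketch of the kind of last-minute workaround script mentioned above: a watchdog that probes each disk server and drops unresponsive ones from the active pool, so new requests are not concentrated on a faulty machine. The probe, the pool, and the server names are hypothetical stand-ins, not the actual ADC V scripts:

```python
# Hypothetical watchdog for a disk-server pool (illustrative only).
def check_server(name: str, failed: set) -> bool:
    """Pretend health probe; in practice this could be a small RFIO test read."""
    return name not in failed

def reconfigure(pool: list, failed: set) -> list:
    """Return the pool with unhealthy servers removed."""
    return [s for s in pool if check_server(s, failed)]

# Hypothetical server names; one server is simulated as faulty.
pool = [f"diskserv{i:02d}" for i in range(1, 6)]
active = reconfigure(pool, failed={"diskserv03"})
print("active pool:", active)
```

The point of such a script is exactly the failure mode described above: without it, the stager kept steering requests to the broken machine.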
Open issues and future goals
• CASTOR:
  • Unsupervised recovery from a malfunctioning disk server
  • New stager
  • Special daemon should be folded back into the main development; it was used instead of the standard RFIO daemon to achieve adequate performance
  • New xrootd daemon from BaBar
• DAQ:
  • Increase performance
  • Improve the performance-monitoring package (AFFAIR)
• Offline:
  • Realistic data for more detectors
  • More remote sites accessing the raw data (monitoring and prompt reconstruction)
  • Data streaming per trigger or detector
  • Run HLT inline in alimdc rather than semi-real-time
• Network:
  • First generation of 10 Gbit cards from Enterasys unreliable, with no indication of hardware failure
  • Enterasys support took a long time to resolve the problem
ALICE DC – DAQ Bw revised
[Chart: revised DAQ bandwidth, 1998–2006, in MBytes/s (scale 0–6000). Series: DAQ bw; DAQ bw with ALICE traffic; DAQ bw with flat traffic; DAQ bw revised.]
Conclusions
• The Computing Data Challenge is still the best tool for exercising the fabric, demonstrating the software, and verifying interfaces
• ADC V:
  • Lots of achievements, but...
  • 1 major performance milestone missed
  • Trouble with the network due to the Enterasys equipment under beta test
• A lot of work and milestones ahead of us:
  • Next Computing ADC: 50% more on performance milestones
  • Simulated raw data from all major detectors
  • Preparatory work needed to test each component independently before integration
Postscript
• A lot of people from IT and ALICE have put considerable time and hard work into this DC
• DCs are and will remain manpower-intensive exercises, as will LHC computing
• Excellent collaboration between all groups and projects involved
• Regular meetings:
  • Very constructive attitude
  • Informal but extremely efficient atmosphere
• Thanks to all for an enthusiastic and effective contribution!