C-RORC PRR


Page 1: C-RORC PRR


C-RORC PRR

ALICE / ATLAS ROS team

14 April 2014

Page 2: C-RORC PRR

Agenda

• Introduction
• ALICE, by H. Engel
• ATLAS
• Concluding remarks

Page 3: C-RORC PRR

Introduction

• C-RORC: hardware design of ALICE
• Types of firmware
– Test firmware used during production: mainly developed by ALICE, with test procedures discussed between ALICE and ATLAS (loopback connector used for tests developed by ATLAS)
– ALICE specific
– ATLAS specific
  • RobinNP: C-RORC to become the Gen-III ROS ROBIN
  • “Dozolar”: data source for 12 S-links, for testing the S-link inputs of RobinNP
  • RoIBuilder: if the C-RORC replaces the VME-based RoI Builder, specific firmware may be needed; it is not excluded that the RobinNP firmware can be used

Page 4: C-RORC PRR

C-RORC

• C-RORC picture with some explanation

Picture annotation: could be removed to improve air circulation (will be discussed later)

Page 5: C-RORC PRR

This review

• Production readiness of the C-RORC hardware
• First prototypes produced by Cerntech (Hungary)
– PCBs from Exception PCB, UK
• After tendering, the production contract was awarded to Hapro (Norway)
– PCBs from Suntak, China
• 20 pre-production cards under test since mid-February

Page 6: C-RORC PRR

Hapro and Cerntech C-RORC

• PCB build differs: copper balancing on the Cerntech board (better spread of heat during manufacturing of the board)
• Cooler and fan differ: the Hapro board is within the PCIe height limit
• Hapro FPGA: commercial grade (0 – 85 °C); Cerntech FPGA: industrial grade (-40 – 100 °C)

Picture labels: Hapro, Cerntech

Page 7: C-RORC PRR

Pre-Series test at contractor’s site and tests by ALICE

Described in presentation by H. Engel

Page 8: C-RORC PRR

Tests performed by ATLAS

• Visual inspection of the pre-series cards
• Already mentioned by H. Engel: on one card 3 LEDs only soldered on one side
• Fixed by CERN SMD Workshop
• Card at Nikhef: some vias filled with solder

Picture labels: Hapro, Cerntech, Cerntech, Hapro

Page 9: C-RORC PRR

Tests performed by ATLAS I

With RobinNP firmware:
• robinnpbist program:
– Checks register contents
– Measures FPGA temperature
– Sets clock frequencies for S-Links
– On-board memory tests
– DMA speed tests
– Interrupt tests, including performance benchmarking
– Tests of speed and data integrity for page handling and transfers into buffer memory
• Temperature measurements, readout via PCIe or via JTAG using ChipScope

Page 10: C-RORC PRR

Tests performed by ATLAS II

• Standard data taking environment using ReadoutApplication:
– “Indexing” incoming data and managing buffer memory pages
– Receiving requests via network from the ROSTester program
– Forwarding requests for data to RobinNP
– Sending data via network to the ROSTester program
• Data generated by the internal test generator, by DOLARs or by MDT RODs.
• For short fragments (50 words), stable running has been seen over periods of 11 hours (limited by ROSTester).
• Fragments larger than ~180 words cause a lockup of the firmware after a short time (10 – 50 s) at a request fraction of 100%. The cause is a logic error in the internal arbitration in the FPGA for access to shared resources. There is no obvious dependence on features of the C-RORC hardware. A fix for the lockups has been found, consisting of minor (but clearly significant) changes to a couple of state transitions in the Memory Controller's Finite State Machines. The memory is being operated at 303 MHz DDR; it is likely that with more work this can be scaled up.

Page 11: C-RORC PRR

Test setups

Nikhef: Intel dual-CPU server, 2 C-RORCs, 2 dual-port 10 GbE NICs, 1 40 GbE NIC

CERN:
• 2 C-RORCs used as Dozolar
• 2 GEN-III candidate PCs with 2 RobinNPs each
• 1 PC with 3 DOLAR cards

RHUL: single-CPU server, 2 C-RORCs, 2 dual-port 10 GbE NICs

Page 12: C-RORC PRR

Observations

• Current for a few cards ~10% higher than for the other cards, but the cards do function normally
• Boards from Cerntech seem to be less sensitive to air flow
– With good air flow and a functioning fan, the temperature of the FPGA is not a problem (< ~65 °C)

Page 13: C-RORC PRR

C-RORC FPGA Core Temperatures

Site    Motherboard         C-RORC manufacturer  FPGA temperature (°C)  Lab air temperature (°C)
RHUL    Supermicro X9SRL-F  Cerntech             50                     25
RHUL    Supermicro X9SRL-F  Hapro                60                     25
CERN    Dell R720           Hapro                53                     15-20
CERN    Dell R720           Hapro                53                     15-20
CERN    Supermicro X9SRW-F  Hapro                61                     15-20
CERN    Supermicro X9SRW-F  Hapro                63                     15-20
Nikhef  Intel S2600CP       Cerntech             56 53                  ~20
Nikhef  Intel S2600CP       Hapro                48                     ~20

RHUL: 2xDDR3@606 Single Rank, 100 MHz oscillator
Measurements at RHUL for system without lid
Nikhef: 100 MHz oscillator

1 subROB configuration
2 subROBs: ~ +5 °C

Accuracy FPGA temperature sensor: ± 4 °C
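As a rough cross-check, the readings above can be compared with the FPGA grades quoted on the Hapro and Cerntech comparison slide (Hapro: commercial, up to 85 °C; Cerntech: industrial, up to 100 °C). The sketch below adds the ± 4 °C sensor accuracy to each reading and reports the remaining margin. It assumes that the grade limits apply to the temperature reported by the on-chip sensor, and it treats the two values in the Nikhef Cerntech row as separate readings; both are assumptions, not statements from the slides.

```python
# Readings transcribed from the table above: (site, manufacturer, FPGA temperature in °C).
# The two values in the Nikhef Cerntech row are entered as separate readings (assumption).
READINGS = [
    ("RHUL", "Cerntech", 50),
    ("RHUL", "Hapro", 60),
    ("CERN", "Hapro", 53),
    ("CERN", "Hapro", 53),
    ("CERN", "Hapro", 61),
    ("CERN", "Hapro", 63),
    ("Nikhef", "Cerntech", 56),
    ("Nikhef", "Cerntech", 53),
    ("Nikhef", "Hapro", 48),
]

# Upper limits of the FPGA grades quoted in the board comparison (°C).
GRADE_LIMIT_C = {"Hapro": 85, "Cerntech": 100}

# Quoted accuracy of the FPGA temperature sensor (°C).
SENSOR_ACCURACY_C = 4


def worst_case_margin(manufacturer, reading_c):
    """Margin to the grade limit if the sensor reads low by its full accuracy."""
    return GRADE_LIMIT_C[manufacturer] - (reading_c + SENSOR_ACCURACY_C)


margins = [(site, maker, temp, worst_case_margin(maker, temp))
           for site, maker, temp in READINGS]
smallest = min(margins, key=lambda row: row[3])

for site, maker, temp, margin in margins:
    print(f"{site:7s} {maker:9s} {temp:3d} °C  margin {margin:3d} °C")
print("smallest margin:", smallest)
```

Under these assumptions every reading leaves a positive margin; the smallest belongs to the Hapro card at CERN that read 63 °C.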

Page 14: C-RORC PRR

Infrared photos

Hapro C-RORC in machine with Supermicro MB at Nikhef, 4 U high machine with lid open

ALICE test firmware: FPGA sensor ~ 70 °C
RobinNP firmware: FPGA sensor ~ 64 °C

Page 15: C-RORC PRR

Temperature

• High data rates: no significant change of temperature of FPGA
• No relation with presence or absence of QSFPs
• Reporting and monitoring of fan failure & over temperature: via Icinga (Nagios), automatic flushing of FPGA configuration to reduce power dissipation.
– To be implemented
– Discuss common solution with ALICE
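Since the monitoring is still to be implemented, the following is only a minimal sketch of what a Nagios/Icinga-style check for fan failure and over temperature could look like. The exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and the "text | perfdata" output line follow the usual plugin convention; the thresholds, the function names and the flush policy are hypothetical, and how the temperature and fan state are read from the card is left out.

```python
# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(temp_c, fan_ok, warn_c=70.0, crit_c=80.0):
    """Map one card reading to (exit_code, status_line).

    warn_c and crit_c are placeholder thresholds, not values from the slides.
    """
    if temp_c is None or fan_ok is None:
        return UNKNOWN, "CRORC UNKNOWN - no reading from card"
    perf = f"temp={temp_c:.1f};{warn_c:.1f};{crit_c:.1f}"
    if not fan_ok:
        return CRITICAL, f"CRORC CRITICAL - fan failure, FPGA at {temp_c:.1f} C | {perf}"
    if temp_c >= crit_c:
        return CRITICAL, f"CRORC CRITICAL - FPGA at {temp_c:.1f} C | {perf}"
    if temp_c >= warn_c:
        return WARNING, f"CRORC WARNING - FPGA at {temp_c:.1f} C | {perf}"
    return OK, f"CRORC OK - FPGA at {temp_c:.1f} C | {perf}"


def should_flush_fpga(exit_code):
    """Hypothetical policy: unload the FPGA configuration only on CRITICAL."""
    return exit_code == CRITICAL


# Example with a reading in the range seen during the tests.
code, line = evaluate(63.0, True)
print(code, line)
```

A real plugin would obtain the temperature and fan state from the card, print the status line and exit with the returned code; the flush decision would have to be agreed between ATLAS and ALICE as noted above.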

Page 16: C-RORC PRR

Identification of cards

• DNA id of FPGA: unique number
• Hapro serial number printed on PCB of card
• ATLAS number
• No registration of all QSFPs / memory modules, but ROS team will keep a record of malfunctioning devices

Page 17: C-RORC PRR

Number of cards to be produced

• Total, including pre-production: 210 for ATLAS, 170 for ALICE
• ATLAS:
– Sub-detectors have been asked if they would like to purchase C-RORCs for test setups, deadline for requests: 15 April. Two requests received so far
– ATLAS with 210 C-RORCs: about 10% spares + ~10 cards for validation system at CERN and test systems at developer labs
  • Need for a (small) additional batch of C-RORCs, to be discussed
– Plan to have complete Gen-III ROS PCs available as spares (at least 4, depends on plans with pre-series)

Page 18: C-RORC PRR

Testing upon arrival

• Repeat Hapro test on small sample
• Subset of Hapro test for all cards (no loopback, no FMC)
• robinnpbist testing with RobinNP firmware
• Run a test partition with Dolars or Dozolar sending test data to the C-RORC under test, together with the ReadoutApplication and ROSTester programs
• After installation of the Gen-III ROS PCs, run again with the test partition and verify that loading new firmware is OK

Page 19: C-RORC PRR

Deployment environment

• USA-15
• 2 U high server PCs
• 2 C-RORCs + 2 dual-port 10 GbE NICs per PC
• Purchase contract for PCs not yet awarded, tendering closed, two candidate PCs under test in bldg. 4

Page 20: C-RORC PRR

S-Link tolerance test

QSFP related:
• Set up a ROL between a DOLAR and a RobinNP and measure with a variable attenuator at what attenuation the link starts to fail
• LC-MPO fan-out will be tested at the same time
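The attenuation measurement could be organised as a simple upward sweep, sketched below. Everything in it is an assumption for illustration: `link_ok` stands for whatever check decides that the ROL is still transferring data correctly at a given attenuator setting, the step size and maximum are placeholders, and `simulated_link` is a stand-in for real hardware.

```python
def find_failure_attenuation(link_ok, step_db=0.5, max_db=30.0):
    """Return the first attenuation (dB) at which link_ok() reports failure.

    Returns None if the link never fails up to max_db.
    link_ok is a caller-supplied function: attenuation in dB -> bool.
    """
    steps = int(round(max_db / step_db))
    for i in range(steps + 1):
        # Multiply rather than accumulate to avoid floating-point drift.
        attenuation_db = i * step_db
        if not link_ok(attenuation_db):
            return attenuation_db
    return None


def simulated_link(attenuation_db, budget_db=7.5):
    """Stand-in for the real link check: fails once a made-up budget is used up."""
    return attenuation_db < budget_db


print(find_failure_attenuation(simulated_link))
```

In practice each step would need to run long enough to catch intermittent errors, and the sweep would be repeated with the LC-MPO fan-out in the path.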

Page 21: C-RORC PRR

Schedule slippages

• There have been some significant slippages in the schedule. In particular:
– Delivery of the Pre-Series C-RORC cards was delayed, initially by a change of FPGA fan (to meet the PCIe thickness spec) and then more significantly by changes in the PCB build requested by the company (NB: without the efforts of Tivadar Kiss these probably could not have been solved).
– The RobinNP firmware has taken longer to produce than expected. Although it now all exists, there are still issues remaining. A fix has been found for the issue in the buffer handling for full-size fragments from multiple channels, but optimization and further checking of the firmware are needed.
– Procurement of the GEN-III ROS PCs has been delayed, mainly in getting the tender launched, so that tests of the candidate PCs are only just starting.

Page 22: C-RORC PRR

Effect on testing of schedule slippages

• Thus not yet able to start a long-duration stability test using pre-series cards in the final configuration
• But there is a growing body of evidence from tests by ALICE and ourselves at CERN, RHUL and Nikhef that the C-RORC H/W works reliably
• Thus we no longer plan to run a long-duration (6-week) stability test prior to the main C-RORC production: the risk of not running the test is small and is outweighed by the consequence of the extra delay it would cause

Page 23: C-RORC PRR

Support

• 5 years warranty by Hapro
• Test setup at CERN for first diagnosis, remote access by experts possible
• Test setups at RHUL and Nikhef for further investigations

Page 24: C-RORC PRR

Installation schedule

Boundary conditions:
• The ROS system has to be stable and tested by 1 February 2015
• In case of a major problem with the GEN-III, re-installing and re-testing the GEN-II H/W takes ~6 weeks

Step                                                      Completion date
Ordering C-RORCs, memory and QSFPs                        1 May
Ordering PCs                                              15 May
Delivery of C-RORCs, memory, QSFPs and PCs                15 August
Upgrade of the first ROS rack (rack yet to be nominated)  15 September (with some contingency)
Stability test with this rack                             1 October
Upgrade of the remaining racks                            15 November
Decision: Keep GEN-III or revert to GEN-II                1 December
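The 1 December decision date can be checked against the two boundary conditions with a few lines of date arithmetic. The sketch assumes that the decision falls in 2014 (the slides are dated April 2014 and the deadline is 1 February 2015) and takes the "~6 weeks" for reverting to GEN-II as exactly 42 days.

```python
from datetime import date, timedelta

DEADLINE = date(2015, 2, 1)     # ROS system stable and tested
DECISION = date(2014, 12, 1)    # keep GEN-III or revert (year assumed)
REVERT = timedelta(weeks=6)     # "~6 weeks" taken as exactly 6 weeks


def revert_margin(decision=DECISION, deadline=DEADLINE, revert=REVERT):
    """Return the date a revert to GEN-II would finish and the days left before the deadline."""
    finished = decision + revert
    return finished, (deadline - finished).days


finished, margin_days = revert_margin()
print(finished.isoformat(), margin_days)
```

Under these assumptions a revert decided on 1 December would finish on 12 January 2015, leaving about 20 days before the 1 February 2015 deadline.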

Page 25: C-RORC PRR

Concluding remarks

• RobinNP firmware not yet finalized, but to the best of our knowledge there are no hardware-related issues
• ALICE is happy with starting the production
• If we do not start production now, the deployment of the Gen-III ROS for 2015 is not likely to be possible

Page 26: C-RORC PRR

Backup

Page 27: C-RORC PRR


Page 28: C-RORC PRR


Page 29: C-RORC PRR

Picture labels: Hapro, Cerntech, Cerntech, Hapro

Page 30: C-RORC PRR

Test machine at Nikhef: Intel server with S2600CP motherboard

Page 31: C-RORC PRR

Test setup at Nikhef

Intel server with 2 C-RORCs
VME crate with 12 MRODs and SBC
Rack with Gen-I and Gen-II ROS PCs with Dolars and 10 GbE NICs, and with E5-1620 based machine with 40 GbE dual-port NIC and 10 GbE NICs

Page 32: C-RORC PRR

Test machine at RHUL with Supermicro X9SRL-F board

Page 33: C-RORC PRR

Machine with SuperMicro MB at CERN

Picture from Supermicro website, machine at CERN has 1 CPU

Fan may not be optimally positioned for max. air flow over PCIe cards

Page 34: C-RORC PRR

Test setup configuration (Nikhef)

Diagram labels:
• Gen II ROS PC running ROSTester (x2)
• ROS PC running ReadoutApplication
• 2 x E5-2690 CPU (only 1 CPU used)
• SLC6 64-bit PC with E5-1620 CPU
• Cerntech C-RORC
• Hapro C-RORC
• Intel 2-port 10 Gb/s NIC (x5)
• Gen-I ROS PC
• DOLAR (x3)
• 12 S-links

1 word = 4 Bytes

1 subROB
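With 1 word = 4 Bytes and 12 S-links, the fragment sizes quoted for the tests (for example 50, 150 or 200 words per fragment) translate into bytes per event as in the sketch below. The function names are made up for illustration, and the second function assumes that the readout fraction is the fraction of events for which all 12 fragments are requested, which the slides do not state explicitly.

```python
BYTES_PER_WORD = 4   # from the diagram: 1 word = 4 Bytes
S_LINKS = 12         # from the diagram: 12 S-links


def event_bytes(words_per_fragment, links=S_LINKS):
    """Bytes arriving per event when every link delivers one fragment."""
    return words_per_fragment * BYTES_PER_WORD * links


def requested_bytes_per_event(words_per_fragment, readout_percent, links=S_LINKS):
    """Average bytes requested per event for a readout fraction given in percent.

    Assumes the readout fraction is the fraction of events whose fragments
    are requested from all links.
    """
    return event_bytes(words_per_fragment, links) * readout_percent / 100


for words in (50, 150, 200):
    print(words, "words/fragment ->", event_bytes(words), "bytes/event")
print(requested_bytes_per_event(200, 55))
```

Multiplying these per-event figures by the event rate would give the data throughput; the event rate is not stated in this transcript.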

Page 35: C-RORC PRR

Test with fix for lock up

10% readout fraction, 12 x 150 word fragments

4 x 10**9 events generated: ROSTester stops

Page 36: C-RORC PRR

Test with fix for lock up

55% readout fraction, 12 x 350 word fragments

Page 37: C-RORC PRR

Test with fix for lock up

55% readout fraction, 12 x 250 word fragments

Page 38: C-RORC PRR

Test with fix for lock up

55% readout fraction, 12 x 200 word fragments

Page 39: C-RORC PRR

Test with fix for lock up

45% readout fraction, 12 x 200 word fragments

4 x 10**9 events

Page 40: C-RORC PRR

Test with fix for lock up

40% readout fraction, 12 x 200 word fragments

Page 41: C-RORC PRR

Test with fix for lock up

70%*) readout fraction, 12 x 200 word fragments

*) 1 ROSTester requesting 100% of fragments, other ROSTester requesting 40% of fragments

Slide corrected on 15 April

Page 42: C-RORC PRR

FPGA temperature for test of previous slide