TELL40 Data processing

LHCb upgrade Workshop, Oxford, 07.12.2010

Xavier Gremaud (EPFL, Switzerland)

TELL40 DATA PROCESSING

Data flow
Input data format
Time reordering
Clusterization
Output format
Conclusion


DATA FLOW

Split the GBT data in 2x40b

Time Reordering

Reconstruct the Super Pixel Packet (SPP) 80b wide

Linker 0, assemble data from 2 SPP data stream

Clusterization + ToT correction (subtraction) (maybe lookup-table-based calibration)

Data from two column processors


DATA FLOW (continued)

Linker 1, assemble data from 3 GBT, 64b->128b

Linker 2, assemble data from 2x3 GBT, 128b->256b

Linker 3, assemble data from 2x2x3 GBT, 256b->512b

Linker 4, assemble data from 2x2x2x3 GBT, 512b

MEP assembly (note: the average event is only 2 to 4 512-bit words long)

External memory 2x256b

Ethernet framer 512b


INPUT DATA FORMAT

For 1 link: 80b/25ns = 3.2 Gb/s

For 24 links: 77 Gb/s

The 80b wide GBT word is divided into two 40b data streams, which are filled by the column processor (fixed position in the 80b data word).
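The link rates above follow directly from the word width and the 25ns period. A minimal sketch of that arithmetic and of the 2x40b split is given below; the function names are illustrative, and which 40b half belongs to which column processor is an assumption not stated on the slide.

```python
GBT_WORD_BITS = 80   # width of one GBT word (from the slide)
BX_PERIOD_NS = 25    # one word per 25ns bunch crossing (from the slide)
N_LINKS = 24         # GBT links per board (from the slide)

def link_rate_gbps(word_bits=GBT_WORD_BITS, period_ns=BX_PERIOD_NS):
    """Bits per nanosecond, which is numerically the rate in Gb/s."""
    return word_bits / period_ns

def split_gbt_word(word):
    """Split an 80-bit word into two 40-bit streams.

    Returning (upper half, lower half) is an assumed ordering.
    """
    mask40 = (1 << 40) - 1
    return (word >> 40) & mask40, word & mask40

per_link = link_rate_gbps()
print(f"per link: {per_link:.1f} Gb/s, 24 links: {per_link * N_LINKS:.1f} Gb/s")
```

The 24-link total comes out at 76.8 Gb/s, which the slide rounds to 77 Gb/s.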



TIME REORDERING

The RAM space is divided into 512 equally sized memory blocks (space reserved for data arriving in random order).

The RAM location is defined by the LSBs of the BxID (BCNT).

Note: the total memory space required is the max. time delay allowed * the max. event size allowed (space for every event has to be reserved!).
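The addressing scheme can be modelled in a few lines of software. This is only a sketch: the class and method names are hypothetical, the 8-word block size is an assumed value for the maximum event size, and dropping words that overflow a block is one possible policy, not necessarily the one used in the firmware.

```python
N_BLOCKS = 512        # memory blocks (from the slide)
WORDS_PER_BLOCK = 8   # assumed maximum event size per link, in words

class TimeReorderBuffer:
    """Software model of a RAM indexed by the LSBs of the BxID."""

    def __init__(self, n_blocks=N_BLOCKS, words_per_block=WORDS_PER_BLOCK):
        if n_blocks & (n_blocks - 1):
            raise ValueError("n_blocks must be a power of two")
        self.n_blocks = n_blocks
        self.words_per_block = words_per_block
        self.blocks = [[] for _ in range(n_blocks)]

    def block_index(self, bxid):
        # Keep only the LSBs of the bunch-crossing ID.
        return bxid & (self.n_blocks - 1)

    def write(self, bxid, word):
        """Store a word in the block of its event; False if the block is full."""
        block = self.blocks[self.block_index(bxid)]
        if len(block) >= self.words_per_block:
            return False  # event larger than the reserved space
        block.append(word)
        return True

    def read_event(self, bxid):
        """Return and clear all words stored for one bunch crossing."""
        idx = self.block_index(bxid)
        words, self.blocks[idx] = self.blocks[idx], []
        return words

    def total_words(self):
        # Space is reserved for every event, whether it is filled or not.
        return self.n_blocks * self.words_per_block

buf = TimeReorderBuffer()
for bxid, word in [(5, "b"), (3, "a"), (5, "c")]:   # arrival in random order
    buf.write(bxid, word)
print(buf.read_event(3), buf.read_event(5))          # read back in time order
```

Because the block index uses only the LSBs, two events whose BxIDs differ by a multiple of 512 share a block, which is why the depth bounds the tolerable time delay.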



TIME REORDERING

In the current FPGA, the EP4SGX530 (the largest Altera Stratix IV device), «only» 64 memory blocks of 144 Kbit each are available.

A time reorder buffer 512 events deep with an event size of 8 words occupies 48 memory blocks (maximum size reached!).

Note: no other large memories are required for the other processing steps.

Conclusion:

  Each GBT link is restricted to 8 SPPs (Super Pixel Packets) smaller than 64 bit.
  For the total pixel chip, the maximum number of SPPs is 5x8=40 per event.
  Time reordering is possible for up to 512-16=496 events.
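The figure of 48 memory blocks can be reproduced with a short calculation. The sketch below rests on assumptions that are not spelled out on the slide: one buffer per GBT link for 24 links, 64-bit words, and each memory block configured as 4K words by 36 bits. Other combinations may give the same count.

```python
import math

BLOCK_DEPTH = 4096   # assumed block configuration: 4K words deep
BLOCK_WIDTH = 36     # assumed block configuration: 36 bits wide
BLOCKS_AVAILABLE = 64

def blocks_needed(depth_events=512, words_per_event=8, word_bits=64, n_links=24):
    """Memory blocks needed for one time reorder buffer per link."""
    words = depth_events * words_per_event        # words reserved per link
    deep = math.ceil(words / BLOCK_DEPTH)         # blocks stacked for depth
    wide = math.ceil(word_bits / BLOCK_WIDTH)     # blocks side by side for width
    return n_links * deep * wide

print(blocks_needed())                   # 48
print(blocks_needed(depth_events=1024))  # 96, more than the 64 available
```

Under these assumptions, doubling either the depth or the event size would exceed the 64 blocks of the device, consistent with the remark that the maximum size is reached.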


CLUSTERIZATION

Clusterization requires splitting up the SPP format (for example, two isolated pixels can be in the same SPP)!

The most obvious approach to clusterization is to use one seeding pixel and search for possible neighbours.

It is very difficult to form “perfect” clusters: the average time per cluster is limited to 25ns if done in a pipeline, otherwise to 25ns for the complete event!

The 16b seeding hit address is reconstructed from the 12b address, the 4b row header and the 4b hitmap.

An additional link source ID is required to identify data from the 24 different GBT links (+5bit).
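The 5 extra bits follow from the number of links: 24 link numbers do not fit in 4 bits. The sketch below shows the bit count and one possible way to attach the source ID to the 16b seed address. The layout (source ID in the bits above the address) and the function names are assumptions for illustration only.

```python
N_LINKS = 24
ADDR_BITS = 16                          # seeding hit address width (from the slide)
SRC_BITS = (N_LINKS - 1).bit_length()   # bits needed for link numbers 0..23

def tag_seed(link, seed_addr):
    """Place the link number above the 16-bit seed address (assumed layout)."""
    if not 0 <= link < N_LINKS:
        raise ValueError("link out of range")
    if not 0 <= seed_addr < (1 << ADDR_BITS):
        raise ValueError("address out of range")
    return (link << ADDR_BITS) | seed_addr

def untag_seed(word):
    """Recover (link, seed address) from a tagged word."""
    return word >> ADDR_BITS, word & ((1 << ADDR_BITS) - 1)

print(SRC_BITS)                           # 5
print(untag_seed(tag_seed(23, 0xBEEF)))   # (23, 48879)
```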



CLUSTERIZATION

The principal goal of the clusterization is data reduction; “perfect” clustering as in TELL1 is no longer possible.

Additional processing in a CPU is required to finish:
  Forming clusters over boundaries of GBT links
  Combining separated clusters
  Forming clusters for events with too high a pixel count (see illustration on the next slide)


The cluster shape depends on the seeding hit, which is the first hit. One “normal cluster” can be split into two clusters.


CLUSTERIZATION PIPELINED

To pipeline the cluster search, only one cluster per pipeline step is formed.

One pipeline step takes 25ns (200 to 300 MHz processing frequency).

On average, the hottest region has “only” 2 to 4 pixels per event and per GBT (10 to 20 pixels per chip)!

The cluster search is performed by searching for neighbours of the first hit in the data.

Each consecutive pipeline stage has the identical function.

The total number of clusters that can be formed is limited by the number of pipeline stages.
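The behaviour described above can be modelled in software for a single event. This is a sketch under assumptions: hits are (row, column) tuples, a neighbour is any of the 8 surrounding pixels, and only direct neighbours of the seed join its cluster. In the FPGA every stage would work on a different event at the same time; the loop here only follows one event through the stages.

```python
def is_neighbour(a, b):
    """8-neighbourhood test on (row, column) hits (an assumed definition)."""
    return abs(a[0] - b[0]) <= 1 and abs(a[1] - b[1]) <= 1

def pipeline_stage(hits):
    """One stage: cluster the first hit with its neighbours, pass the rest on."""
    seed, rest = hits[0], hits[1:]
    cluster = [seed] + [h for h in rest if is_neighbour(seed, h)]
    passed_on = [h for h in rest if not is_neighbour(seed, h)]
    return cluster, passed_on

def run_pipeline(hits, n_stages):
    """Send one event through n_stages identical stages."""
    clusters = []
    for _ in range(n_stages):
        if not hits:
            break
        cluster, hits = pipeline_stage(hits)
        clusters.append(cluster)
    return clusters, hits   # leftover hits leave the pipeline unclustered

event = [(1, 1), (1, 2), (5, 5), (9, 9), (9, 8)]
print(run_pipeline(event, n_stages=2))
print(run_pipeline(event, n_stages=3))
```

With two stages the last two hits are left over, while three stages are enough to cluster the whole event, which illustrates how the number of stages caps the number of clusters.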



CLUSTERIZATION DATA REDUCTION PERFORMANCE


The cluster size is restricted to a multiple of bytes! (Data processing on the FPGA, but also on the CPU, becomes very difficult otherwise.)

The expected data reduction from clustering, assuming 50% 1-hit and 50% 2-hit clusters, is of the order of 14%.

Q: Is it worthwhile doing “not perfect” clustering for a 14% data reduction?

Q: Does the CPU take advantage of such clusters?

Q: Does anybody know another feasible clustering approach?

Cluster size   With clusterization   Without clusterization   Data reduction
1 hit          29b => 32b            25b => 32b               0%
2 hits         36b => 40b            50b => 56b               28.5%
3 hits         43b => 48b            75b => 80b               40.0%
4 hits         50b => 56b            100b => 104b             46.1%
5 hits         57b => 64b            125b => 128b             50.0%
6 hits         68b => 72b            150b => 156b             53.8%
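The reduction column can be recomputed from the padded sizes in the table. The sketch below takes the padded bit counts as given and assumes that the 14% estimate is the average of the per-cluster reductions weighted by the cluster-size fractions; that weighting is an interpretation, not something stated on the slide.

```python
# (hits, padded bits with clusterization, padded bits without), from the table
TABLE = [(1, 32, 32), (2, 40, 56), (3, 48, 80),
         (4, 56, 104), (5, 64, 128), (6, 72, 156)]

def reduction(with_bits, without_bits):
    """Fraction of data saved by sending the cluster instead of single hits."""
    return 1.0 - with_bits / without_bits

def mixed_reduction(fractions):
    """Average reduction for a mix of cluster sizes, e.g. {1: 0.5, 2: 0.5}."""
    by_hits = {hits: reduction(w, wo) for hits, w, wo in TABLE}
    return sum(frac * by_hits[hits] for hits, frac in fractions.items())

for hits, w, wo in TABLE:
    print(f"{hits} hit(s): {100 * reduction(w, wo):.1f}%")
print(f"50% 1-hit, 50% 2-hit: {100 * mixed_reduction({1: 0.5, 2: 0.5}):.1f}%")
```

The 50/50 mix gives roughly 14%, in line with the estimate quoted above.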


OUTPUT FORMAT

After the 24 links are linked together, the data are put into the MEP format to reduce the data before the DDR3 SDRAM.

The BCNT appears only once per event, so a small data reduction can be expected (-12bit).



CONCLUSIONS (I)

The real challenge of the data processing is not to spend more than 25ns per event! Pipelining is required everywhere!

Time reordering for 512 events reaches the limit of the FPGA internal memory.

ToT calculation from BCNT and timestamp is no problem. Calibration per pixel is impossible!

No more real data reduction (zero suppression) like in TELL1:
  Small reduction from removing BCNT (-12 bit / SPP)
  Small increase from source ID (+5 bit / cluster)
  Small decrease from clustering (-14%)
  Largest reduction due to not fully loaded GBT links from the pixel chips furthest from the beam
  Long-time average reduction due to empty bunch crossings



CONCLUSIONS (II)

Very wide buses require large multiplexers for padding (e.g. byte padding on a 512-bit bus requires a multiplexer of 512x64, i.e. 32K connections). Maybe at some stage in the processing the padding has to be reduced to a 32-bit minimal size.

Is clusterization useful and fast enough? Some tests with real data and a distribution of the cluster sizes are needed.
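The multiplexer size quoted for byte padding is the bus width times the number of possible alignments. A small sketch of that count follows, with the 32-bit case added for comparison; the function name is illustrative and the count is only a rough measure of routing, not a resource estimate.

```python
def padding_mux_connections(bus_bits=512, granule_bits=8):
    """Bus width times the number of alignments at the given granularity."""
    positions = bus_bits // granule_bits   # possible start positions on the bus
    return bus_bits * positions

print(padding_mux_connections(512, 8))    # 32768, i.e. 32K connections
print(padding_mux_connections(512, 32))   # 8192
```

Moving from byte to 32-bit granularity cuts the count by a factor of four, at the price of more padding bits per item.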



OUTLOOK

Implementation of the processing, including clustering, in VHDL

Simulation of the processing with MC data

Place and route of the design to get a better idea of the possible processing frequency and resource management
