Prototyping particle transport towards GEANT5
A. Gheata, 27 November 2012
Fourth International Workshop for Future Challenges in Tracking and Trigger Concepts
Outline
• Why change particle transport?
• Requirements for the LHC upgrade
• Ideas embedded in a prototype
• Preliminary results and lessons
• Moving forward
Classical particle transport
For every particle, most of the data structures and code are traversed:
• Geometry navigation
• Material and cross-section tables
• Physics processes
This implies:
• Navigating very large data structures
• Irregular access patterns
• OO abused: very deep instruction stack
• Cache misses: not enough spatial and temporal locality
• Low IPC (0.6-0.8)
• Event-level parallelism will use resources better, but won't improve any of the above
This approach is hard to optimize in a fine-grained parallelism environment. The current state-of-the-art physics could profit from a transport and data model based on locality and vectorized computing.
Technology challenge
Talk by Intel: HEP applications are not doing great…
• CPI: 0.9361
• load and store instructions: 49.152%
• resource stalls (% of cycles): 43.779%
• branch instructions (approx.): 30.183%
• computational FP instructions: 0.026%
MC requirements for the LS2 upgrade (to be commissioned by ~2018)
• Event rate x50-x100 higher than today
• Ratio of events/MC the same as now
• We will need to be able to simulate more events, and much faster
– The increase in computing resources will not scale the same way
• We need a simulation framework performing much better than what we can do today!
– Exploiting data locality and parallelism at multiple levels
• We started R&D in the field in the form of a prototype
– Exploring many new ideas in a small group
• F. Carminati, J. Apostolakis, R. Brun, A. Gheata
– Trying to materialize some of these ideas in an O(2000) LOC implementation
Simple observation: HEP transport is mostly local!
ATLAS volumes sorted by transport time; the same behavior is observed for most HEP geometries. 50% of the time is spent in 50 of 7100 volumes.
Talk by René last year
A playground for new ideas
• A simulation prototype attempting to explore parallelism and efficiency issues
– Basic idea: simple physics, realistic geometry. Can we implement a parallel transport model on threads exploiting data locality and vectorisation?
– Clean re-design of data structures and steering to easily exploit parallel architectures and allow for collaborative work among resources
– Can we achieve a continuous data flow from generation to digitization and I/O without the need to block between different stages?
• Events and primary tracks are independent
– Transport together a vector of tracks coming from many events
– Study how scattering/gathering of vectors of tracks and hits impacts the simulation data flow
• Can we push useful vectors downstream?
• Start with toy physics to develop the appropriate data structures and steering code
– Keep in mind that the application should eventually be tuned based on realistic numbers
Volume-oriented transport model
• We implemented a model where all particles traversing a given geometry volume are transported together as a vector until the volume gets empty
– Same volume -> local (vs. global) geometry navigation, same material and same cross-sections
– Load balancing: distribute all particles from a volume type into smaller work units called baskets; give one basket to a transport thread at a time
– Steering methods work with vectors, possibly allowing for future auto-vectorisation
• Particles exiting a volume are distributed to the baskets of the neighboring volumes until they exit the setup or disappear
– Like a champagne cascade, but lower glasses can also fill top ones…
– Wait for the glass to fill before drinking the champagne…
– Independent processing once the work is distributed
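A minimal, single-threaded sketch of the basket scheduling described above (names such as `Track`, `Basket`, `Scheduler` and the threshold logic are illustrative assumptions, not the prototype's actual API):

```cpp
// Sketch of volume-based basket scheduling: tracks are grouped per
// logical volume, full baskets go to a FIFO work queue.
#include <cstddef>
#include <deque>
#include <unordered_map>
#include <vector>

struct Track {
    int event_id;
    int volume_id;   // logical volume the track currently occupies
};

// A basket groups tracks sitting in the same logical volume, so they
// share material, cross-sections and local geometry navigation.
struct Basket {
    int volume_id = -1;
    std::vector<Track> tracks;
};

class Scheduler {
public:
    explicit Scheduler(std::size_t threshold) : threshold_(threshold) {}

    // Route a track to the basket of its current volume; once the basket
    // reaches the threshold, push it to the work queue (the glass is full).
    void AddTrack(const Track& t) {
        Basket& b = baskets_[t.volume_id];
        b.volume_id = t.volume_id;
        b.tracks.push_back(t);
        if (b.tracks.size() >= threshold_) {
            work_queue_.push_back(std::move(b));
            b.tracks.clear();   // moved-from vector is valid; reuse it
        }
    }

    // Force partially filled baskets out ("drink the glass before it is
    // full") when work becomes sparse -- the garbage-collection regime.
    void Flush() {
        for (auto& entry : baskets_) {
            Basket& b = entry.second;
            if (!b.tracks.empty()) {
                work_queue_.push_back(std::move(b));
                b.tracks.clear();
            }
        }
    }

    // FIFO pick-up by a transport thread; returns false when idle.
    bool PopBasket(Basket& out) {
        if (work_queue_.empty()) return false;
        out = std::move(work_queue_.front());
        work_queue_.pop_front();
        return true;
    }

private:
    std::size_t threshold_;
    std::unordered_map<int, Basket> baskets_;  // one open basket per volume
    std::deque<Basket> work_queue_;            // transportable baskets, FIFO
};
```

In the real prototype the work queue is concurrent and baskets are handed to transport threads; here `Flush()` stands in for the garbage collection that characterizes the sparse and depletion regimes described later.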
Event injection
Realistic geometry + event generator.
Inject events in the volume containing the IP. More events make it easier to cut event tails (by merging tails) and to fill the pipeline better!
Transport model
[Figure: data flow from an event factory through track containers (any volume) into track baskets (tracks in a single volume type), feeding physics processes and geometry transport, steered by the scheduler]
• Inject events into a track container taken by a scheduler thread
• The scheduler holds a track "basket" for each logical volume; tracks go to the right basket
• Baskets are filled up to a threshold, then injected into a work queue
• Transport threads pick baskets first-in, first-out
• Physics processes and geometry transport work with vectors
• Tracks are transported to boundaries, then dumped into a track collection per thread
• Track collections are pushed to a queue and picked up by the scheduler
Processing phases – old lessons
• Initial events injection
• Optimal regime: constant basket content
• Sparse regime (below the garbage collection threshold): more and more frequent garbage collections (drink the glass empty…), fewer tracks per basket
• Depletion regime: continuous garbage collection; new events needed; force flushing of some events
Prioritized transport
[Figure: data flow of the prioritized transport scheme. The main scheduler injects/replaces transportable baskets in a deque; worker threads pick up baskets and run Stepping(tid, &tracks); crossing tracks (itrack, ivolume) are pushed into per-thread track collections (push/replace collection); full track collections go through a deque to a dispatch & garbage-collect thread, which loops over the tracks and pushes them to the per-volume baskets, recycling empty baskets and track collections; priority baskets can be injected to flush selected events; a digitize & I/O thread runs Generate(Nevents) and Digitize(iev), sending hits to disk]
Transporting the initial buffer of events (event groups 0-4, 5-9, …, 95-99)
• Depletion of baskets (no re-injection); events are flushed in a "priority" regime
• Excellent concurrency all over
• Vectors "suffer" in this regime
Buffered events & re-injection
• Re-use of event slots for re-injected events keeps memory under control
• The depletion regime with sparse tracks is just a few % of the job
• Concurrency still excellent
• Vectors preserved much better
Preliminary benchmarks
• Benchmarking 10+1 threads on a 12-core Xeon (HT mode)
• Excellent CPU usage
• Locks and waits: some overhead due to transitions coming from exchanging baskets via concurrent queues
• Event re-injection will improve the speed-up
GPU?
• Cannot ignore the TFLOPS delivered, nor the good FLOPS/Watt (or FLOPS/$) ratio
• Different computing patterns, transaction model, and requirements for the data and code structure
– Can our models fit these resources?
– Can we use GPUs in an opportunistic task model?
• Many CPU<->GPU workflows suffer from memory bus latency
– Investigate non-blocking work flows against blocking transactions, data flow machines, efficient scatter/gather technologies
• Will definitely be a challenge
User data and code
• Well-tuned performance can easily be spoiled by careless user code
– Hits and digits are large user data structures, usually scattered in memory: we aim to pool memory resources via factories
– The I/O has to be well controlled and scalable: the framework should provide at least the tools to queue I/O transactions in an efficient manner
• Callbacks to user code should execute asynchronously, in competition with the global task queue
– The information passed should be stateful and fully decoupled from the state of the transport machinery
User data factories
[Figure: a factory manager keeps, per user type, a Factory<T> serving contiguous Block<T> chunks (blocks in use vs. empty blocks) to threads 1…N; usage: GetFactory<MyHit>, then MyHit *hit = factory->NextHit()]
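The pooling idea behind the factories can be sketched as follows. Only `GetFactory<MyHit>` and `NextHit()` appear in the slides; the block layout, `kBlockSize` and `Recycle()` are assumptions for illustration:

```cpp
// Sketch of a block-pooling hit factory: hits are handed out from
// contiguous pre-allocated blocks instead of per-hit allocations.
#include <cstddef>
#include <memory>
#include <vector>

struct MyHit {
    double edep = 0.0;   // energy deposit; illustrative payload
};

template <typename T>
class Factory {
    static constexpr std::size_t kBlockSize = 1024;  // hits per block (assumed)
public:
    // Hand out the next slot in the current block; allocate a new
    // contiguous block only when the current one is exhausted.
    T* NextHit() {
        if (current_ == blocks_.size())
            blocks_.push_back(std::make_unique<T[]>(kBlockSize));
        T* hit = &blocks_[current_][used_++];
        if (used_ == kBlockSize) { ++current_; used_ = 0; }
        return hit;
    }

    // Reset for the next event: the blocks are kept and their slots are
    // reused, so no memory churn (note: slots are not re-initialized here).
    void Recycle() { current_ = 0; used_ = 0; }

private:
    std::vector<std::unique_ptr<T[]>> blocks_;
    std::size_t current_ = 0;   // block currently being filled
    std::size_t used_ = 0;      // slots used in the current block
};

// Per-type access point, mimicking the GetFactory<MyHit> call in the figure.
template <typename T>
Factory<T>& GetFactory() {
    static Factory<T> instance;   // one factory per user data type
    return instance;
}
```

A thread-safe version would hand each thread its own block, as the figure's threads 1…N suggest; this single-threaded sketch only shows the contiguity and re-use properties.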
Next steps
• Include hits, digitization and I/O in the prototype to test the full flow
– Factories allowing contiguous pre-allocation and re-use of user structures
• Introduce realistic EM physics
– Tune model parameters, better estimate memory requirements
• Re-design transport models from a "plug-in" perspective
– E.g. the ability to use fast simulation on a per-track basis
• Look further into auto-vectorization options
– New compiler features, Intel Cilk array notation, …
– Check the impact of vector-friendly data restructuring
• Vector of objects -> object with vectors
• Push vectors to lower levels
– Geometry and physics models as main candidates
• GPU is a great challenge for simulation code
– Localize hot-spots with high CPU usage and low data transfer requirements
– Test performance on data flow machines
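The "vector of objects -> object with vectors" restructuring above can be illustrated on a hypothetical track container (field names and the stepping function are illustrative, not taken from the prototype):

```cpp
// AoS vs. SoA track storage: the SoA form gives unit-stride field
// access, the pattern auto-vectorizers need.
#include <cstddef>
#include <vector>

// Array of structures: one heterogeneous object per track; a loop over
// one field strides over all the others, hurting vectorization.
struct TrackAoS {
    double x, y, z;     // position
    double px, py, pz;  // momentum
};

// Structure of arrays: each field is contiguous in memory.
struct TracksSoA {
    std::vector<double> x, y, z;
    std::vector<double> px, py, pz;

    void Push(const TrackAoS& t) {
        x.push_back(t.x);   y.push_back(t.y);   z.push_back(t.z);
        px.push_back(t.px); py.push_back(t.py); pz.push_back(t.pz);
    }

    // Toy propagation of all tracks along z: the loop touches only two
    // dense double arrays, a vectorization-friendly access pattern.
    void StepAlongZ(double step) {
        for (std::size_t i = 0; i < z.size(); ++i)
            z[i] += step * pz[i];
    }
};
```

The same data is stored either way; only the memory layout changes, which is why the slides list this as a restructuring rather than an algorithmic change.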
Outlook
• The classic simulation approach performs badly on the new computing infrastructures
– Projecting this into the future looks even worse…
– The technology trends motivate serious R&D in the field
• We have started looking at the problem from different angles
– Data locality and contiguity, fine-grain parallelism, data flow, vectorization
– Preliminary results confirm that the direction is good
• We need to understand what can be gained and how, what the impact on the existing code is, and what the challenges and effort to migrate to a new system are…
Thank you!