Parallel transport prototype
Andrei Gheata
Motivation
• Parallel architectures are evolving fast
  – Task parallelism in hybrid configurations
  – Instruction level parallelism (ILP) exploited more and more
    • 4 FLOP/cycle on modern processors
• HEP applications use these resources very inefficiently
  – Running on average at 0.6 instructions/cycle
  – Bad or inefficient usage of C++ leads to scattered data structures and cache misses (both instruction and data)
[Figure, from a recent talk by Intel: doing nothing will enlarge the gap]
AliRoot – is there hope ?
• Can we do more than running AliRoot in threads and call it AliRoot-MT ?
  – Even that would be a gain…
• Seriously, what could be potentially parallelized ?
  – Simulation – most CPU-intensive, best candidate
    • Event parallelism in the first approach, re-design I/O
  – Reconstruction – quite modular
    • Task-oriented parallelism, re-design I/O
  – Analysis – we have a modular structure of tasks
    • Pool of threads, pool of tasks, marry them…
    • Possibly exploiting lower grain parallelism for CPU-intensive track loops
“Somebody make my program parallel !!!”
• Parallelism is not a natural way of programming
  – Well, at least not for physicists…
• “Can somebody tell me which line I should change ?”
  – In fact yes, there are tools out there, watch this nice talk about Intel Parallel Advisor: http://indico.cern.ch/conferenceDisplay.py?confId=191117
• It is clear that we need a deep re-factoring of the code
  – Looking at the data structures and algorithms from a parallel perspective is a big step forward
  – Identify parallelizable “sites” in the code, focus on them
  – Assess what can be gained by parallelizing a given part
  – Step-by-step procedure
What about just start doing it ?
• In fact, we did, but not in a very systematic way
  – G3 event level parallelism (M.Tadel), general code inspection from a parallel perspective (S.Lohn)
• We started to look into parallelizing the simulation framework
  – Geometry parallelism first
  – Simple simulation prototype - a playground for ideas
  – Not a very steep learning curve, but we start understanding the “language”…
• We should extend this exercise in the offline group
  – It tastes bad, but you get used to it…
Why parallelize geometry ?
• Geometry is a key component in many applications
  – Simulation MC, reconstruction, event displays, geometry DB, …
• Some use geometry as a wrapper to extract 3D or material information
  – Like positions, sizes, matrices, alignment, densities, …
  – Optimizing the usage of this info is the application's responsibility
    • Like OpenGL, which internally decomposes objects into lower level representations that can be handled in parallel
• Some HEP applications use geometry functionality directly, namely navigation
  – Namely transport MC and tracking code
  – As these applications can and will be parallelized, geometry has to follow…
What kind of parallelism ?
• Geometry is a utility – it has to be thread safe
  – Covered in this presentation
• Navigation is iterative – the next step cannot start until the last one has finished
  – Query -> propagate -> query -> propagate
  – Has to support task-based parallelism at top level (e.g. different tracks to different threads)
• Navigation algorithms are hard to factorize
  – Tree-oriented queries (ups and downs in a hierarchy of volumes/nodes)
  – Answers are results of a minimization procedure, it is hard to work ahead
  – The state changes and has to be propagated all along the query
• Most natural low level factorization – solids
  – Main loops organized at volume/voxel level
  – 3D shapes are the local “computation objects” and contain the most CPU-expensive algorithms
  – Good candidates for GPU kernels, but communication of the state can be a limiting factor due to memory bus latency
• Vectorization – ideal for low level computation
  – Propagating several state vectors (position, direction) to the same solid type
    • If solids are vector-aware, can we assemble decent vectors to feed the same solid?
    • Long term development
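To make the vectorization bullet concrete, here is a minimal sketch of what "feeding several state vectors to the same solid" could look like. It is not ROOT code: the `TrackVector` container, the `BoxSafety` function and the choice of an origin-centred box are assumptions made for illustration only. Positions are stored as a structure of arrays so that the loop body is branch-free and a compiler can auto-vectorize it.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical structure-of-arrays container: one array per coordinate.
struct TrackVector {
   std::vector<double> x, y, z;
};

// Distance from each point to the nearest face of an origin-centred box
// with half-lengths (dx, dy, dz). The value is negative for points outside.
void BoxSafety(const TrackVector &t, double dx, double dy, double dz,
               std::vector<double> &safety)
{
   const std::size_t n = t.x.size();
   assert(t.y.size() == n && t.z.size() == n);
   safety.resize(n);
   // Same solid, same arithmetic for every track: no branches in the loop body.
   for (std::size_t i = 0; i < n; ++i) {
      const double sx = dx - std::fabs(t.x[i]);
      const double sy = dy - std::fabs(t.y[i]);
      const double sz = dz - std::fabs(t.z[i]);
      safety[i] = std::min(sx, std::min(sy, sz));
   }
}
```

The open question raised on the slide remains: the gain only materializes if enough tracks sit in the same solid type at the same time to fill such vectors.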
Geometry data structures
• ROOT geometry was NOT thread safe by design
  – In the attempt to maximize re-usage of cached geometry states or pre-computed values, state-related info was carried by many geometry data types
    • Voxel optimisation structures, divisions, assembly shapes, composite shapes, geometry manager
• Many methods, including simple getters, were not thread safe
• The stateful part of the geometry was not clearly separated from the const one
class TGeoPatternFinder : public TObject
{
…
   Double_t    fStep;        // division step length
   Double_t    fStart;       // starting point on divided axis
   Double_t    fEnd;         // ending point
   Int_t       fCurrent;     // current division element
   Int_t       fNdivisions;  // number of divisions
   Int_t       fDivIndex;    // index of first div. node
   TGeoMatrix *fMatrix;      // generic matrix
   TGeoVolume *fVolume;      // volume to which applies
   Int_t       fNextIndex;   //! index of next node
Re-design strategy
• The goal was to make geometry thread safe without sacrificing existing optimizations
• Step 1: Split out the navigation part from the geometry manager
  – Most data structures here depend on the state
  – Different calling threads will work with different navigators
• Step 2: Spot all thread unsafe data members and methods within the structural geometry objects and protect them
  – Shapes and optimization structures
  – Convert object->data into object->data[thread_id]
• Step 3: Rip out all stateful data from structural objects to keep a compact const access geometry core
  – Whenever possible, percolate the state in the calling sequence
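Step 2 (object->data becoming object->data[thread_id]) can be sketched as follows. This is an illustration under stated assumptions, not the ROOT implementation: `DivisionFinder` is a hypothetical stand-in for a structural geometry object, and the per-thread slots are stored by value here for brevity, whereas the slides show a vector of pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a structural geometry object (not a ROOT class).
class DivisionFinder {
public:
   // Mutable state, one copy per thread, kept apart from the const description.
   struct ThreadData_t {
      int fCurrent   = -1;  // current division element
      int fNextIndex = -1;  // index of next node
   };

   DivisionFinder(double start, double step, int ndiv)
      : fStart(start), fStep(step), fNdivisions(ndiv) {}

   // Pre-allocate one state slot per thread; call once, before spawning threads.
   void SetMaxThreads(std::size_t n) const { fThreadData.assign(n, ThreadData_t()); }

   ThreadData_t &GetThreadData(std::size_t tid) const
   {
      assert(tid < fThreadData.size());
      return fThreadData[tid];
   }

   // Const query: reads the shared description, writes only the caller's slot.
   int FindDivision(double x, std::size_t tid) const
   {
      int idiv = static_cast<int>((x - fStart) / fStep);
      if (x < fStart || idiv >= fNdivisions) idiv = -1;  // outside the divided range
      GetThreadData(tid).fCurrent = idiv;
      return idiv;
   }

private:
   const double fStart;       // starting point on divided axis
   const double fStep;        // division step length
   const int    fNdivisions;  // number of divisions
   mutable std::vector<ThreadData_t> fThreadData;  // indexed by thread id
};
```

Because each thread touches only its own slot, no lock is needed on the query path. Adjacent slots may still share a cache line, which is one possible source of false sharing.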
Problems along the way
• Separating navigation out of the manager was a tedious process
  – Keeping a large existing API functional
• Spotting the thread-unsafe objects was not obvious
  – Practically all work done by Matevz Tadel (thanks!)
• Changing calling patterns was sometimes impossible, resources needed to be locked
  – First approach suffered a lot from Amdahl's law
• Many calls to get the thread Id were needed, while there was no implementation of TLS in ROOT
  – __thread not supported everywhere
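For comparison, with a C++11 compiler the thread id lookup can be written portably with `thread_local`, which was not an option everywhere when `__thread` was the only choice. The sketch below is an assumption about how such a helper could look, not the code behind TGeoManager::ThreadId(); `ThreadId` and `CollectThreadIds` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: hands out small consecutive ids, one per calling thread.
inline int ThreadId()
{
   static std::atomic<int> nextId{0};
   // Initialised once per thread, on that thread's first call; later calls are a plain read.
   thread_local const int tid = nextId.fetch_add(1);
   return tid;
}

// Demo helper: runs nthreads workers and returns the id each one observed.
inline std::vector<int> CollectThreadIds(int nthreads)
{
   std::vector<int> ids(nthreads, -1);
   std::vector<std::thread> pool;
   for (int i = 0; i < nthreads; ++i)
      pool.emplace_back([&ids, i] { ids[i] = ThreadId(); });
   for (auto &t : pool) t.join();
   return ids;
}
```

The ids are dense, so they can be used directly as an index into per-thread data arrays.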
Implementation
• Thread data pre-allocated via TGeoManager::SetMaxThreads()
• User threads have to ask for a navigator via TGeoManager::CreateNavigator()
• Getting access to a stateful data member goes via:
  – statefulObject->GetThreadData(tid)->fData
  – For voxel structures, the stateful data is ripped out into the navigator and passed as arguments to methods
[Diagram: analysis manager with threads 0 to 3, each holding its own TGeoNavigator, on top of the stateless const data structures. TGeoManager::ThreadId() (static __thread) gives tid=0 … tid=3, one stateful slot per thread.]
struct ThreadData_t {stateful data members;}
mutable std::vector<ThreadData_t*> fThreadData
Usage
//_________________________________________________________
MyTransport::PropagateTracks(tracks)
{
// User transport method called by the main thread
   gGeoManager->SetMaxThreads(N); // mandatory
   SpawnNavigationThreads(N, my_navigation_thread, tracks);
   JoinNavigationThreads();
}

void *my_navigation_thread(void *arg)
{
// Navigation method to be spawned as thread
   TGeoNavigator *nav = gGeoManager->GetCurrentNavigator();
   if (!nav) nav = gGeoManager->AddNavigator();
   int tid = nav->GetThreadId(); // or TGeoManager::ThreadId()
   PropagateTracks(subset(tid,tracks));
   return 0;
}
Speed-up
• Good scalability with rather small Amdahl effects (~0.7 % sequential)
  – No lock on memory resources however !
  – Work balancing is not perfect (worsened by CPU throttling)
• Small overheads due to several hidden effects
  – Context switches, false cache sharing (?), pthread calls
  – May need to re-organize stateful data per thread rather than
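The ~0.7 % sequential fraction can be put in perspective with Amdahl's law. The estimate below assumes the quoted figure is the serial fraction s, and takes N = 12 threads purely as an illustration (the slide does not state the thread count):

```latex
S(N) = \frac{1}{s + \dfrac{1 - s}{N}}, \qquad
S(12)\Big|_{s = 0.007} = \frac{1}{0.007 + 0.993/12} \approx 11.1, \qquad
\lim_{N \to \infty} S(N) = \frac{1}{s} \approx 143
```

With such a small serial fraction the Amdahl limit is far away, so the work balancing and hidden overheads listed above matter more than the sequential part.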
Simulation prototype - a playground for new ideas
• Simple simulation prototype to help exploring parallelism and efficiency issues
  – Basic idea: minimal physics to start with, realistic HEP geometry: can we implement a parallel transport model on threads exploiting data locality and vectorization?
  – Clean re-design of data structures and steering to easily exploit parallel architectures
  – Can we make it sync-free from generation to digitization and I/O ?
• Events and primary tracks are independent
  – Work chunk: basket containing a vector of tracks
  – Mixing tracks from different events to avoid tails and have reasonably-sized vectors
  – Study how scattering/gathering impacts the simulation data flow
• Toy physics at first, more realistic EM & hadronic processes to continue with
  – The application should eventually be tuned based on realistic numbers
• New transport model more “detector element”-oriented, profiting from the cached data structures
  – Geometry and x-section wise
• Where to go from there
  – Re-design the particle stack and the I/O
  – Re-design transport models from a “plug-in” perspective
    • E.g. ability to use fast simulation on per track basis
  – Understand what can be gained and how, what is the impact on the existing code, what are the changes and effort to migrate to a new system…
Volume-oriented transport model
• We implemented a model where all particles traversing a given geometry volume are transported together as a vector until the volume gets empty
  – Same volume -> local (vs. global) geometry navigation, same material and same cross sections
  – Load balancing: distribute all particles from a volume type into smaller work units called baskets, give a basket to a transport thread at a time
• Particles exiting a volume are distributed to baskets of the neighbor volumes until exiting the setup or disappearing
  – Like a champagne cascade, but lower glasses can also fill top ones…
  – No direct communication between threads to avoid synchronization issues
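The scatter step of this model (tracks grouped per volume, then cut into baskets) can be sketched as below. This is a simplified illustration, not the prototype's code: `Track`, `Basket` and `FillBaskets` are hypothetical names, and partially filled baskets are flushed unconditionally here, whereas the prototype only transports baskets above some threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical minimal track: only the index of the volume it currently sits in.
struct Track {
   int volume;
};

using Basket = std::vector<int>;  // indices into the track array, all in one volume

// Groups tracks per volume into baskets of at most maxSize tracks each.
std::vector<Basket> FillBaskets(const std::vector<Track> &tracks, std::size_t maxSize)
{
   assert(maxSize > 0);
   std::map<int, Basket> open;   // one partially filled basket per volume
   std::vector<Basket> ready;    // work units to hand to transport threads
   for (std::size_t i = 0; i < tracks.size(); ++i) {
      Basket &b = open[tracks[i].volume];
      b.push_back(static_cast<int>(i));
      if (b.size() == maxSize) {  // full: becomes a work unit
         ready.push_back(b);
         b.clear();
      }
   }
   for (auto &kv : open)          // flush the partially filled remainder
      if (!kv.second.empty()) ready.push_back(kv.second);
   return ready;
}
```

Every basket holds tracks from a single volume, so a transport thread can use local navigation, one material and one set of cross sections for the whole vector.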
The beginning
Realistic geometry + event generator
Inject event in the volume containing the IP
More events are better, to cut event tails and better fill the pipeline !
A first approach
[Diagram: work queue feeding transport threads; physics processes and geometry transport each receive Particles(i0,…,in)]
Scatter all injected tracks to baskets. Only baskets above some threshold are transported.
Transport threads pick up baskets from the work queue
Physics processes and geometry transport are called with vectors of particles
Each thread transports its basket of tracks to the boundaries of the current volume, moves crossing tracks to a buffer, then picks up the next basket from the queue
First version required synchronization…
[Diagram: work queue (POP_CHUNK, QUEUE_EMPTY) and particle buffer (FLUSH)]
Generation = pop work chunks until the queue is empty
Synchronization point: flush the transported particle buffer and sort baskets according to content
Recompute work chunks and start transporting the next generation of baskets
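One generation of this first, synchronized version can be sketched with standard C++ threads. The code is an illustration under assumptions, not the prototype: `WorkQueue` and `RunGeneration` are hypothetical names, a mutex-protected std::queue stands in for the real concurrent queue, and the "transport" step is a placeholder that simply marks every track as crossing.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

using Basket = std::vector<int>;  // track indices sharing one volume

// Minimal mutex-protected work queue (stand-in for a real concurrent queue).
class WorkQueue {
public:
   void Push(Basket b)
   {
      std::lock_guard<std::mutex> lock(fMutex);
      fQueue.push(std::move(b));
   }
   // Returns false when the queue is empty (QUEUE_EMPTY in the diagram).
   bool Pop(Basket &out)
   {
      std::lock_guard<std::mutex> lock(fMutex);
      if (fQueue.empty()) return false;
      out = std::move(fQueue.front());
      fQueue.pop();
      return true;
   }
private:
   std::mutex fMutex;
   std::queue<Basket> fQueue;
};

// Transports one generation and returns the flushed buffer of crossing tracks.
std::vector<int> RunGeneration(WorkQueue &queue, std::size_t nthreads)
{
   std::vector<std::vector<int>> crossed(nthreads);  // private buffer per thread
   std::vector<std::thread> workers;
   for (std::size_t tid = 0; tid < nthreads; ++tid)
      workers.emplace_back([&queue, &crossed, tid] {
         Basket basket;
         while (queue.Pop(basket))              // POP_CHUNK until QUEUE_EMPTY
            for (int itrack : basket)
               crossed[tid].push_back(itrack);  // placeholder for the real stepping
      });
   for (auto &w : workers) w.join();            // synchronization point
   std::vector<int> buffer;                     // FLUSH: merge per-thread buffers
   for (auto &c : crossed) buffer.insert(buffer.end(), c.begin(), c.end());
   return buffer;
}
```

The join is the synchronization point named above: no thread can start the next generation until the slowest one has finished, which is why the later design moves dispatching to a dedicated thread.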
Processing phases
[Plot: basket content through the processing phases, with the garbage collection threshold and the ideal curve marked]
• Initial events injection
• Optimal regime
  – Constant basket content
• Sparse regime
  – More and more frequent garbage collections
  – Fewer tracks per basket
• Depletion regime
  – Continuous garbage collection
  – New events needed
  – Force flushing some events
Prototype implementation
[Diagram: prototype data flow. Components: worker threads (transport, Stepping(tid, &tracks)); dispatch & garbage collect thread; main scheduler with per-volume baskets 0 … n indexed by ivolume (loop tracks and push to baskets); digitize & I/O thread (Generate(Nevents), Hits, Digitize(iev), Disk). Queues (deque): transportable baskets, recycled baskets, full track collections, recycled track collections, priority baskets. Arrow labels: pick up baskets, crossing tracks (itrack, ivolume), push/replace collection, inject/replace baskets, inject priority baskets, recycle basket, generate, flush.]
Evolution of populations
[Plot: populations over time, with flush events marked; legend entries 0-4, 5-9, 95-99]
Preliminary benchmarks
HT mode
Excellent CPU usage
Benchmarking 10+1 threads on a 12 core Xeon
Locks and waits: some overhead due to transitions coming from exchanging baskets via concurrent queues
Event re-injection will improve the speed-up