ROME Workshop, August 23, 2016
DEALING WITH LAYERS OF OBFUSCATION IN PSEUDO-UNIFORM MEMORY
Robert Kuban, Mark Simon Schöps, Jörg Nolte, Randolf Rotta (rottaran@b-tu.de)
Research supported by German BMBF grant 01IH13003.
PROBLEM: MEMORY LATENCY ON INTEL XEON PHI KNC
Example: measured average times are unstable between restarts
Affects: micro-benchmarks, algorithm tuning, developers' sanity... also application performance?
⇒ Outline
1. Causes?
2. Solutions?
3. Is it worthwhile?
OUTLINE
1. Causes?
2. Solutions?
3. Is it worthwhile?
4. Conclusions
CAUSES: MULTIPLE PERFORMANCE BOTTLENECKS
1. compute bound
2. memory throughput:streaming, matrix alg.
3. memory latency:key-value stores, graphs
4. coherence latency:synchronisation variable
5. coherence throughput:many sync. variables
[Figure: cores with caches, coherence directory, and memory; bottleneck 1 at the cores, 2–3 at memory, 4–5 at the coherence directory]
HW SOLUTION: STRIPING TO MAXIMISE THROUGHPUT
1. striping over memory channels, banks, and coherence directories
2. past: NUMA throughput bottlenecks ⇒ mostly local striping
3. many-cores: no throughput bottlenecks, but larger network
[Figure: four core/L2-cache tiles connected to interleaved directories and memories]
INTEL XEON PHI KNC IN DETAIL
[Figure: KNC die: 57–61 cores (4 threads + L2 cache each), 64 tag directories, and four pairs of memory controllers on a bi-directional ring network]
memory striping by (PhysAddr/62) & 0xF¹
avg. remote L2 read ≈ 240 cycles, contention at >16 threads²
some lines near to memory, up to 28% application speedup possible³

¹ John McCalpin: https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/586138
² Ramos et al.: Modeling communication in cache-coherent SMP systems: A case-study with Xeon Phi.
³ Balazs Gerofi et al.: Exploiting Hidden Non-uniformity of Uniform Memory Access on Manycore CPUs
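Taking the slide's striping formula literally, the channel selection can be sketched as below; the unit of PhysAddr and the constants are read off the formula as reported, not a verified hardware description:

```c
#include <stdint.h>

/* Channel selection on KNC per the formula above (hypothetical
 * helper): divide the physical address by 62 and keep the low
 * four bits to pick one of the 16 channels. */
static unsigned knc_memory_channel(uint64_t phys_addr) {
    return (unsigned)((phys_addr / 62) & 0xF);
}
```

The odd divisor means consecutive addresses rotate over the channels with a period that does not align with power-of-two strides, which is one reason naive micro-benchmarks see unstable averages.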
REVERSE ENGINEERING KNC’S DIRECTORY STRIPING
measure: fetch a line currently owned by the neighbour's L2
two cores, two lines: one for measurement, the other for coordination
minimum RDTSC cycles, MyThOS kernel as bare-metal environment
[Figure: two cores fetch lines owned by each other's L2 cache; steps 1–3 route the request via the responsible tag directory]
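A minimal sketch of such a latency probe, assuming an x86 TSC and illustrative names (the original runs bare-metal on the MyThOS kernel):

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc */

/* Take the minimum RDTSC delta over many trials of reading a line
 * that a neighbouring core's L2 currently owns. The coordination
 * with the neighbour (via the second cache line) is omitted here. */
static uint64_t probe_line_latency(volatile uint64_t *line, int trials) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; ++i) {
        /* neighbour core would re-acquire ownership of *line here */
        uint64_t t0 = __rdtsc();
        (void)*line;               /* remote read under test */
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best;
}
```

Using the minimum rather than the mean filters out interrupts and ring-network contention, leaving the placement-dependent base latency.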
RESULTS: PSEUDO-RANDOMLY SCATTERED
≈140 cycles best case vs. ≈400 cycles worst case
[Plot: latency from core 0 to core 1 (100–400 cycles) over cache lines 0–1024]
RESULTS: RECONSTRUCTED MAPPING OF LINES TO DIRECTORIES
Enables quick initialisation without measurements
[Plot: latency from core 0 to core 1 (100–400 cycles) grouped by tag directory 0–64]
IMPLICATIONS
Support in the MyThOS kernel
per page: base address for line ↦ directory
per node: balanced mapping for directory ↦ nearby core
kernel objects can allocate local lines for sync. vars.
Application challenges
avoid >16 threads accessing same line
co-locate dependent tasks
squeeze synchronisation into cache lines
no easy migration after allocation
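The per-page and per-node mappings combine into a simple lookup; the sketch below is a hypothetical stand-in for the MyThOS internals, with the two tables passed in explicitly:

```c
#include <stddef.h>
#include <stdint.h>

#define LINES_PER_PAGE 64  /* 4 KiB page / 64 B cache lines */

/* Pick a cache line in `page` whose tag directory is handled near
 * `core`, for placing a synchronisation variable there.
 * `line_to_dir` (per page) and `dir_to_core` (per node) stand in
 * for the reconstructed mappings; returns NULL if no line fits. */
static void *local_sync_line(void *page, uint8_t core,
                             const uint8_t line_to_dir[LINES_PER_PAGE],
                             const uint8_t dir_to_core[64]) {
    for (size_t i = 0; i < LINES_PER_PAGE; ++i)
        if (dir_to_core[line_to_dir[i]] == core)
            return (uint8_t *)page + i * 64;
    return NULL;
}
```

Because the line-to-directory mapping is fixed by hardware, moving a sync variable later means copying it to a different line, which is why migration after allocation is hard.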
PING-PONG BENCHMARK: BUSY POLLING, THEN WRITE
[Plot: mean read latency [ns] (200–1600) over distance between the cores 1–30; worst vs. best placement]
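One side of the benchmark's round can be sketched with C11 atomics; this is an assumed shape of the busy-poll-then-write loop, not the original code:

```c
#include <stdatomic.h>

/* One side of a ping-pong round: busy-poll the shared flag until
 * its parity matches `me` (0 or 1), then hand it to the peer by
 * writing the next value. The flag sits alone on a cache line;
 * which tag directory owns that line determines the latency. */
typedef struct { _Alignas(64) atomic_uint flag; } padded_flag;

static void ping_pong(padded_flag *f, unsigned rounds, unsigned me) {
    for (unsigned r = 0; r < rounds; ++r) {
        unsigned v;
        while (((v = atomic_load_explicit(&f->flag,
                        memory_order_acquire)) & 1u) != me)
            ;  /* busy poll: repeated reads */
        atomic_store_explicit(&f->flag, v + 1u, memory_order_release);
    }
}
```

The separate read and write is exactly the pattern whose coherence traffic the next slide dissects.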
PING-PONG BENCHMARK: TIMES DON’T ADD UP
[Figure: coherence messages between two cores' L2 caches and the directory: 1 read, 2–3 copy, 4 RFO, 5 invalidation broadcast, 6 ack]
PING-PONG BENCHMARK: AVOID INVALIDATION BROADCASTS!
[Plot: mean latency [ns] (200–1600) over distance between the cores 1–30, using atomic fetch-and-add; worst vs. best placement]
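The contrast with the busy-poll-then-write variant can be illustrated as follows; both helpers are hypothetical, showing why a single read-modify-write avoids the invalidation broadcast that a separate read and write incurs:

```c
#include <stdatomic.h>

/* Read-then-write: the load first pulls a shared copy into the
 * local L2, so the subsequent store must invalidate all sharers
 * (the broadcast from the previous slide). Not atomic as a pair. */
static void bump_read_write(atomic_uint *v) {
    unsigned x = atomic_load_explicit(v, memory_order_acquire);
    atomic_store_explicit(v, x + 1u, memory_order_release);
}

/* Single RMW: the line is requested for exclusive ownership up
 * front, so no shared copies exist to invalidate afterwards. */
static void bump_fetch_add(atomic_uint *v) {
    atomic_fetch_add_explicit(v, 1u, memory_order_acq_rel);
}
```

This is consistent with the plot: the fetch-and-add variant largely removes the placement-dependent spread seen with the polling loop.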
INTEL XEON PHI KNL: DOES IT APPLY?
[Figure: KNL mesh with two DDR4 controllers and eight MCDRAM devices]

modes: all2all, quadrant, sub-numa; as memory or as L3 cache
benchmarks⁴: quadrant > all2all > sub-numa
⇒ memory + directory striping persists
smaller latency? overhead of Y-X crossing?
⁴ Carlos Rosales: A Comparative Study of Application Performance and Scalability on the Intel Knights Landing Processor
CONCLUSIONS
memory striping ≠ directory striping
good for throughput-bound computations, bad for latency- and synchronisation-bound computations
Intel KNC: pseudo-uniform
up to 3x synchronisation latency, but avoiding broadcasts and contention is equally important
benchmarks: average over multiple random allocations
Future. . .
MyThOS: evaluate impact on in-kernel synchronisation
Intel KNL: latency and contention benchmarks
HW: dedicated memory/network for synchronisation !?
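The benchmarking advice above (average over multiple random allocations) can be sketched as below; the per-line latency table stands in for an actual measurement routine:

```c
#include <stdlib.h>

/* Average a per-line measurement over randomly chosen cache
 * lines, so that a single lucky or unlucky placement in the
 * pseudo-random directory striping does not skew the result.
 * `lat_per_line` is a stand-in for calling a real probe. */
static double mean_over_random_lines(const double lat_per_line[],
                                     size_t lines, int samples) {
    double sum = 0.0;
    for (int i = 0; i < samples; ++i)
        sum += lat_per_line[(size_t)rand() % lines];  /* random placement */
    return sum / samples;
}
```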
OUTLINE
5. Appendix
RESULTS: UNEVEN MAPPING, DEPENDS ON ENABLED CORES!
[Plot: share of nearest cache lines (0–4%) per core 0–60]
READING FROM MEMORY: LATENCY FROM CORE 0
[Plot: latency from core 0 (100–400 cycles) over cache lines 0–8192]
READING FROM MEMORY: LATENCY FROM BEST CORE
[Plot: latency from the best core (100–400 cycles) over cache lines 0–8192]
“PSEUDO-UNIFORM” MEMORY ARCHITECTURES
Good for throughput-bound computations
HW maximises average throughput over large data sets; average latency hidden by prefetching & many threads
⇒ no need for data partitioning and placement; can focus on computation balance
Bad for latency- and synchronisation-bound computations
most synchronisation variables are very small; prefetching does not help
average latency does not apply; permanent overhead depending on placement