ROME Workshop, August 23, 2016
DEALING WITH LAYERS OF OBFUSCATION IN PSEUDO-UNIFORM MEMORY
Robert Kuban, Mark Simon Schöps, Jörg Nolte, Randolf Rotta (rottaran@b-tu.de)
Research supported by German BMBF grant 01IH13003.
PROBLEM: MEMORY LATENCY ON INTEL XEON PHI KNC
Example: measured average times are unstable between restarts
Affects: micro-benchmarks, algorithm tuning, developers' sanity... also application performance?
⇒ Outline
1. Causes?
2. Solutions?
3. Is it worthwhile?
OUTLINE
1. Causes?
2. Solutions?
3. Is it worthwhile?
4. Conclusions
CAUSES: MULTIPLE PERFORMANCE BOTTLENECKS
1. compute bound
2. memory throughput:streaming, matrix alg.
3. memory latency:key-value stores, graphs
4. coherence latency:synchronisation variable
5. coherence throughput:many sync. variables
[Figure: cores with caches, coherence directory, and memory; bottleneck 1 at the cores, 2–3 at memory, 4–5 at the coherence directory]
HW SOLUTION: STRIPING TO MAXIMISE THROUGHPUT
1. striping over memory channels, banks, and coherence directories
2. past: NUMA throughput bottlenecks ⇒ mostly local striping
3. many-cores: no throughput bottlenecks, but larger network
[Figure: four core/L2-cache tiles connected to interleaved directories and memories]
INTEL XEON PHI KNC IN DETAIL
[Figure: KNC die: 57–61 cores (4 threads + L2 cache each), 64 tag directories, and four pairs of memory controllers on a bi-directional ring network]
memory striping by (PhysAddr/62) & 0xF¹
avg. remote L2 read ≈ 240 cycles, contention at >16 threads²
some lines near to memory, up to 28% application speedup possible³

¹ John McCalpin: https://software.intel.com/en-us/forums/intel-many-integrated-core/topic/586138
² Ramos et al.: Modeling communication in cache-coherent SMP systems: A case-study with Xeon Phi.
³ Balazs Gerofi et al.: Exploiting Hidden Non-uniformity of Uniform Memory Access on Manycore CPUs
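Taking the slide's striping formula literally, the channel selection can be sketched as below; the unit of PhysAddr and the constants are read off the formula as reported, not a verified hardware description:

```c
#include <stdint.h>

/* Channel selection on KNC per the formula above (hypothetical
 * helper): divide the physical address by 62 and keep the low
 * four bits to pick one of the 16 channels. */
static unsigned knc_memory_channel(uint64_t phys_addr) {
    return (unsigned)((phys_addr / 62) & 0xF);
}
```

The odd divisor means consecutive addresses rotate over the channels with a period that does not align with power-of-two strides, which is one reason naive micro-benchmarks see unstable averages.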
REVERSE ENGINEERING KNC’S DIRECTORY STRIPING
measure: fetch a line currently owned by the neighbour's L2
two cores, two lines: one for measurement, the other for coordination
minimum RDTSC cycles, MyThOS kernel as bare-metal environment
[Figure: two cores fetch lines owned by each other's L2 cache; steps 1–3 route the request via the responsible tag directory]
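A minimal sketch of such a latency probe, assuming an x86 TSC and illustrative names (the original runs bare-metal on the MyThOS kernel):

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc */

/* Take the minimum RDTSC delta over many trials of reading a line
 * that a neighbouring core's L2 currently owns. The coordination
 * with the neighbour (via the second cache line) is omitted here. */
static uint64_t probe_line_latency(volatile uint64_t *line, int trials) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; ++i) {
        /* neighbour core would re-acquire ownership of *line here */
        uint64_t t0 = __rdtsc();
        (void)*line;               /* remote read under test */
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best;
}
```

Using the minimum rather than the mean filters out interrupts and ring-network contention, leaving the placement-dependent base latency.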
RESULTS: PSEUDO-RANDOMLY SCATTERED
≈140 cycles best case vs. ≈400 cycles worst case
[Plot: latency from core 0 to core 1 (100–400 cycles) over cache lines 0–1024]
RESULTS: RECONSTRUCTED MAPPING OF LINES TO DIRECTORIES
Enables quick initialisation without measurements
[Plot: latency from core 0 to core 1 (100–400 cycles) grouped by tag directory 0–64]
IMPLICATIONS
Support in the MyThOS kernel
per page: base address for line ↦ directory
per node: balanced mapping for directory ↦ nearby core
kernel objects can allocate local lines for sync. vars.
Application challenges
avoid >16 threads accessing same line
co-locate dependent tasks
squeeze synchronisation into cache lines
no easy migration after allocation
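The per-page and per-node mappings combine into a simple lookup; the sketch below is a hypothetical stand-in for the MyThOS internals, with the two tables passed in explicitly:

```c
#include <stddef.h>
#include <stdint.h>

#define LINES_PER_PAGE 64  /* 4 KiB page / 64 B cache lines */

/* Pick a cache line in `page` whose tag directory is handled near
 * `core`, for placing a synchronisation variable there.
 * `line_to_dir` (per page) and `dir_to_core` (per node) stand in
 * for the reconstructed mappings; returns NULL if no line fits. */
static void *local_sync_line(void *page, uint8_t core,
                             const uint8_t line_to_dir[LINES_PER_PAGE],
                             const uint8_t dir_to_core[64]) {
    for (size_t i = 0; i < LINES_PER_PAGE; ++i)
        if (dir_to_core[line_to_dir[i]] == core)
            return (uint8_t *)page + i * 64;
    return NULL;
}
```

Because the line-to-directory mapping is fixed by hardware, moving a sync variable later means copying it to a different line, which is why migration after allocation is hard.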
PING-PONG BENCHMARK: BUSY POLLING, THEN WRITE
[Plot: mean read latency [ns] (200–1600) over distance between the cores 1–30; worst vs. best placement]
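One side of the benchmark's round can be sketched with C11 atomics; this is an assumed shape of the busy-poll-then-write loop, not the original code:

```c
#include <stdatomic.h>

/* One side of a ping-pong round: busy-poll the shared flag until
 * its parity matches `me` (0 or 1), then hand it to the peer by
 * writing the next value. The flag sits alone on a cache line;
 * which tag directory owns that line determines the latency. */
typedef struct { _Alignas(64) atomic_uint flag; } padded_flag;

static void ping_pong(padded_flag *f, unsigned rounds, unsigned me) {
    for (unsigned r = 0; r < rounds; ++r) {
        unsigned v;
        while (((v = atomic_load_explicit(&f->flag,
                        memory_order_acquire)) & 1u) != me)
            ;  /* busy poll: repeated reads */
        atomic_store_explicit(&f->flag, v + 1u, memory_order_release);
    }
}
```

The separate read and write is exactly the pattern whose coherence traffic the next slide dissects.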
PING-PONG BENCHMARK: TIMES DON’T ADD UP
[Figure: coherence messages between two cores' L2 caches and the directory: 1 read, 2–3 copy, 4 RFO, 5 invalidation broadcast, 6 ack]
PING-PONG BENCHMARK: AVOID INVALIDATION BROADCASTS!
[Plot: mean latency [ns] (200–1600) over distance between the cores 1–30, using atomic fetch-and-add; worst vs. best placement]
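The contrast with the busy-poll-then-write variant can be illustrated as follows; both helpers are hypothetical, showing why a single read-modify-write avoids the invalidation broadcast that a separate read and write incurs:

```c
#include <stdatomic.h>

/* Read-then-write: the load first pulls a shared copy into the
 * local L2, so the subsequent store must invalidate all sharers
 * (the broadcast from the previous slide). Not atomic as a pair. */
static void bump_read_write(atomic_uint *v) {
    unsigned x = atomic_load_explicit(v, memory_order_acquire);
    atomic_store_explicit(v, x + 1u, memory_order_release);
}

/* Single RMW: the line is requested for exclusive ownership up
 * front, so no shared copies exist to invalidate afterwards. */
static void bump_fetch_add(atomic_uint *v) {
    atomic_fetch_add_explicit(v, 1u, memory_order_acq_rel);
}
```

This is consistent with the plot: the fetch-and-add variant largely removes the placement-dependent spread seen with the polling loop.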
INTEL XEON PHI KNL: DOES IT APPLY?
[Figure: KNL mesh with two DDR4 controllers and eight MCDRAM devices]

modes: all2all, quadrant, sub-numa; as memory or as L3 cache
benchmarks⁴: quadrant > all2all > sub-numa
⇒ memory + directory striping persists
smaller latency? overhead of Y-X crossing?
⁴ Carlos Rosales: A Comparative Study of Application Performance and Scalability on the Intel Knights Landing Processor
CONCLUSIONS
memory striping ≠ directory striping
good for throughput-bound computations, bad for latency- and synchronisation-bound computations
Intel KNC: pseudo-uniform
up to 3x synchronisation latency, but avoiding broadcasts and contention is equally important
benchmarks: average over multiple random allocations
Future. . .
MyThOS: evaluate impact on in-kernel synchronisation
Intel KNL: latency and contention benchmarks
HW: dedicated memory/network for synchronisation !?
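The benchmarking advice above (average over multiple random allocations) can be sketched as below; the per-line latency table stands in for an actual measurement routine:

```c
#include <stdlib.h>

/* Average a per-line measurement over randomly chosen cache
 * lines, so that a single lucky or unlucky placement in the
 * pseudo-random directory striping does not skew the result.
 * `lat_per_line` is a stand-in for calling a real probe. */
static double mean_over_random_lines(const double lat_per_line[],
                                     size_t lines, int samples) {
    double sum = 0.0;
    for (int i = 0; i < samples; ++i)
        sum += lat_per_line[(size_t)rand() % lines];  /* random placement */
    return sum / samples;
}
```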
OUTLINE
5. Appendix
RESULTS: UNEVEN MAPPING, DEPENDS ON ENABLED CORES!
[Plot: share of nearest cache lines (0–4%) per core 0–60]
READING FROM MEMORY: LATENCY FROM CORE 0
[Plot: latency from core 0 (100–400 cycles) over cache lines 0–8192]
READING FROM MEMORY: LATENCY FROM BEST CORE
[Plot: latency from the best core (100–400 cycles) over cache lines 0–8192]
“PSEUDO-UNIFORM” MEMORY ARCHITECTURES
Good for throughput-bound computations
HW maximises average throughput over large data sets; average latency hidden by prefetching & many threads
⇒ no need for data partitioning and placement; can focus on computation balance
Bad for latency- and synchronisation-bound computations
most synchronisation variables are very small; prefetching does not help
average latency does not apply; permanent overhead depending on placement