
More on Locks: Case Studies

Topics

• Case study of two architectures: Xeon and Opteron
• Detailed lock code and cache coherence


Putting it all together

• Background: architecture of the two testing machines
• A more detailed treatment of locks and cache coherence, with code examples and implications for parallel software design in the above context


Two case studies

• 48-core AMD Opteron
• 80-core Intel Xeon

48-core AMD Opteron

• Last-level cache (LLC) NOT shared
• Directory-based cache coherence

[Diagram: the motherboard carries 8 dies (…8x…) with 6 cores per die (…6x…); each core has a private L1 cache, each die has an LLC, and the dies connect to RAM and to one another over cross-socket links.]

80-core Intel Xeon

• LLC shared
• Snooping-based cache coherence

[Diagram: the motherboard carries 8 dies (…8x…) with 10 cores per die (…10x…); each core has a private L1 cache, the cores on a die share one Last Level Cache (LLC), and the dies connect to RAM and to one another over cross-socket links.]


Interconnect between sockets

Cross-socket communication can take two hops.


Performance of memory operations


Local caches and memory latencies

Memory access to a line cached locally (cycles):
• Best case: L1, < 10 cycles
• Worst case: RAM, 136–355 cycles

Latency of remote access: read (cycles)

[Table: read latency by the MESI state of the line in the remote cache, within the socket and across one and two hops.]

"State" is the MESI state of a cache line in a remote cache.

Cross-socket communication is expensive!
• Xeon: loading from the Shared state is 7.5 times more expensive over two hops than within the socket
• Opteron: cross-socket latency is even larger than RAM

Opteron: uniform latency regardless of the cache state
• Directory-based protocol (the directory is distributed across all LLCs)

Xeon: a load from the "Shared" state is much faster than from the "M" and "E" states
• A "Shared"-state read is served from the LLC instead of from the remote cache

Latency of remote access: write (cycles)

[Table: write latency by the MESI state of the line in the remote cache.]

"State" is the MESI state of a cache line in a remote cache.

Cross-socket communication is expensive!

Opteron: a store to a "Shared" cache line is much more expensive
• The directory-based protocol is incomplete: it does not keep track of the sharers, so a store is equivalent to a broadcast and must wait for all invalidations to complete

Xeon: store latency is similar regardless of the previous cache-line state
• Snooping-based coherence


Detailed treatment of lock-based synchronization


Synchronization implementation

Hardware support is required to implement synchronization primitives
• In the form of atomic instructions
• Common examples include: test-and-set, compare-and-swap, etc.
• Used to implement high-level synchronization primitives, e.g., lock/unlock, semaphores, barriers, condition variables, etc.

We will only discuss test-and-set here.
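Test-and-set is covered in detail below; for contrast, here is a minimal sketch of what the compare-and-swap instruction named above does (illustration only, written as ordinary C; the hardware executes it as one atomic instruction):

int compare_and_swap(int *addr, int expected, int new_val) {
    int old = *addr;              /* record the current value              */
    if (old == expected)
        *addr = new_val;          /* write only if it matched 'expected'   */
    return old;                   /* caller checks whether old == expected */
}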


Test-And-Set

The semantics of test-and-set are:
• Record the old value
• Set the value to TRUE (this is a write!)
• Return the old value

Hardware executes it atomically!
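A minimal sketch of those three steps as ordinary C (illustration only; written this way it is NOT atomic, which is exactly why the hardware instruction is needed):

int test_and_set(int *flag) {
    int old = *flag;    /* record the old value                     */
    *flag = 1;          /* set the value to TRUE (this is a write!) */
    return old;         /* return the old value                     */
}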


Test-And-Set

What one TAS does at the cache and memory level:
• Read-exclusive (invalidations)
• Modify (change state)
• Memory barrier:
  • completes all the memory operations before this TAS
  • cancels all the memory operations after this TAS

All of this happens atomically!

(Courtesy of Ding Yuan)
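In portable C, the same instruction, together with its ordering, is usually reached through a compiler intrinsic or C11 atomics rather than hand-written assembly. A minimal sketch assuming C11 <stdatomic.h> (not code from the slides):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag flag = ATOMIC_FLAG_INIT;

bool tas(void) {
    /* One atomic read-modify-write: gains exclusive ownership of the
     * cache line (read-exclusive + invalidations), sets the flag
     * (modify), and with sequentially consistent ordering also provides
     * the memory-barrier behaviour described above. Returns the old value. */
    return atomic_flag_test_and_set_explicit(&flag, memory_order_seq_cst);
}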

Using Test-And-Set

Here is our lock implementation with test-and-set:

struct lock {
    int held;                      /* 0 = free, 1 = held */
};

void acquire(struct lock *l) {
    while (test_and_set(&l->held))
        ;                          /* spin until TAS returns 0 (the lock was free) */
}

void release(struct lock *l) {
    l->held = 0;
}
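A runnable counterpart, as a sketch assuming C11 <stdatomic.h> (the names spinlock, spin_acquire, and spin_release are mine, not the deck's); atomic_exchange plays the role of test-and-set and, with its default sequentially consistent ordering, also supplies the barrier discussed on the previous slide:

#include <stdatomic.h>

struct spinlock {
    atomic_int held;                   /* 0 = free, 1 = held */
};

void spin_acquire(struct spinlock *l) {
    /* test-and-set: atomically store 1 and return the previous value */
    while (atomic_exchange(&l->held, 1))
        ;                              /* spin while it was already held */
}

void spin_release(struct spinlock *l) {
    atomic_store(&l->held, 0);
}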

TAS and cache coherence

[Diagram: shared memory holds the lock variable (held = 0); Thread A and Thread B each run on a processor with a private cache, both caches initially empty. One thread calls acq(lock), and its TAS issues a Read-Exclusive request for the lock's cache line.]

TAS and cache coherence

[Diagram: the line is filled into the acquiring thread's cache and the TAS completes: that cache now holds held = 1 in the Dirty state, while shared memory still reads held = 0.]

TAS and cache coherence

[Diagram: the other thread now calls acq(lock); its TAS issues a Read-Exclusive request, which sends an invalidation for the copy held Dirty in the first thread's cache.]

TAS and cache coherence

[Diagram: the first thread's copy is invalidated and the dirty value is written back, updating shared memory to held = 1.]

TAS and cache coherence

[Diagram: the line is then filled into the second thread's cache, which now holds held = 1 in the Dirty state. Its TAS returns the old value 1, so its acquire loop keeps spinning.]

What if there are contentions?

[Diagram: the lock is held (held = 1 in shared memory) and Thread A and Thread B both spin in while(TAS(l)); . Every TAS is a write that needs exclusive ownership of the line, so the line bounces between the two caches on every iteration.]


How bad can it be?

Recall: TAS is essentially a Store + a Memory Barrier.

[Chart: latency of TAS compared with a plain store.]

How to optimize?

When the lock is being held, a contending acquire keeps modifying the lock variable to 1. Not necessary!

void test_and_test_and_set(struct lock *l) {
    do {
        while (l->held == 1)
            ;                          /* spin on an ordinary read */
    } while (test_and_set(&l->held));
}

void release(struct lock *l) {
    l->held = 0;
}
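The same test-and-test-and-set idea in runnable form, as a sketch assuming C11 <stdatomic.h> (my names, not the deck's). The inner loop spins on an ordinary atomic load, which is served from the locally cached Shared copy; only when the lock looks free does it fall through to the expensive exclusive write:

#include <stdatomic.h>

struct spinlock {
    atomic_int held;                   /* 0 = free, 1 = held */
};

void spin_acquire_tatas(struct spinlock *l) {
    do {
        /* "test": a relaxed load; repeated reads hit the local Shared
         * copy, so spinning here causes no cache-coherence traffic    */
        while (atomic_load_explicit(&l->held, memory_order_relaxed) == 1)
            ;
        /* "test-and-set": only now attempt the exclusive atomic write */
    } while (atomic_exchange(&l->held, 1));
}

void spin_release_tatas(struct spinlock *l) {
    atomic_store(&l->held, 0);
}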

What if there are contentions?

[Diagram: one thread holds the lock; its cache has the line in the Dirty state with held = 1, while shared memory still reads held = 0. A second thread spins in while(held == 1), which issues an ordinary Read request; a third processor has no copy of the line yet.]

What if there are contentions?

[Diagram: the read causes the dirty value to be written back, so shared memory is updated to held = 1 and the line becomes Shared, with held = 1, in both the holder's and the spinning thread's caches; Thread C still has no copy.]

What if there are contentions?

[Diagram: the third thread also spins in while(held == 1); after its read, all three caches hold the line in the Shared state with held = 1.]

Repeated reads of a "Shared" cache line: no cache-coherence traffic!

Let's put everything together

[Chart: latency of TAS, load, and write operations versus local access.]

Implications for programmers

Cache coherence is expensive (more than you thought)
• Avoid unnecessary sharing (e.g., false sharing; see the sketch below)
• Avoid unnecessary coherence (e.g., TAS -> TATAS)
• Keep a clear understanding of the performance
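As one way to avoid unnecessary sharing, per-thread data can be padded or aligned so that independent items never land on the same cache line. A minimal sketch, assuming C11 alignas and a 64-byte line (the names padded_counter and NUM_THREADS are illustrative):

#include <stdalign.h>

#define NUM_THREADS 8                      /* illustrative */

/* alignas(64) gives every element its own 64-byte cache line, so
 * counters updated by different threads never falsely share (and
 * ping-pong) a single line. */
struct padded_counter {
    alignas(64) long value;
};

static struct padded_counter counters[NUM_THREADS];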

Crossing sockets is a killer
• Can be slower than running the same program on a single core!
• pthread provides a CPU affinity mask: pin cooperative threads on cores within the same die (see the sketch below)
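A minimal sketch of pinning the calling thread to one core via the Linux/glibc pthread_setaffinity_np (the helper name and core id are just examples; core numbering is machine dependent):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core, so that cooperating
 * threads can be kept on cores of the same die. Returns 0 on success. */
static int pin_to_core(int core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}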

Loads and stores can be as expensive as atomic operations

Programming gurus understand the hardware
• So do you now! Have fun hacking!

More details in "Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask", David et al., SOSP '13.