RH Locks Uppsala University Information Technology Department of Computer Systems Uppsala...

RH Locks

Uppsala University, Information Technology

Department of Computer Systems, Uppsala Architecture Research Team [UART]

RH Lock: A Scalable Hierarchical Spin Lock

Zoran Radovic and Erik Hagersten, {zoranr, eh}@it.uu.se

2nd ANNUAL WORKSHOP ON MEMORY PERFORMANCE ISSUES (WMPI 2002), May 25, 2002, Anchorage, Alaska

WMPI 2002, Alaska Uppsala Architecture Research Team (UART) RH Locks

Synchronization History

Spin-Locks

• test_and_set (TAS), e.g., IBM System/360, ’64

• Rudolph and Segall, ISCA’84: test_and_test_and_set (TATAS)

• TATAS with exponential backoff (TATAS_EXP), ’90 – ’91

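[Figure: processors P1, P2, P3, …, Pn, each with a cache ($), share one memory that holds the lock word. The lock word changes from FREE to BUSY when a processor (P3 in the animation) acquires it; the other processors busy-wait or back off until it is FREE again.]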
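The TATAS_EXP idea on this slide can be sketched with C11 atomics. This is a minimal illustration, not the code used in the talk: the type `tatas_lock`, the function names, and the backoff constants (start at 1, cap at 1024 loop iterations) are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Lock word: false = FREE, true = BUSY. */
typedef struct {
    atomic_bool busy;
} tatas_lock;

static void tatas_init(tatas_lock *l)
{
    atomic_init(&l->busy, false);
}

/* One test_and_test_and_set attempt; returns true if the lock was taken. */
static bool tatas_try_acquire(tatas_lock *l)
{
    /* "Test": a plain read, so waiters spin on their cached copy. */
    if (atomic_load_explicit(&l->busy, memory_order_relaxed))
        return false;
    /* "Test_and_set": the exchange returns the previous value. */
    return !atomic_exchange_explicit(&l->busy, true, memory_order_acquire);
}

/* TATAS_EXP: after each failed attempt, wait for a doubling, capped delay. */
static void tatas_exp_acquire(tatas_lock *l)
{
    unsigned delay = 1;                 /* assumed starting delay */
    const unsigned max_delay = 1024;    /* assumed cap */

    while (!tatas_try_acquire(l)) {
        for (volatile unsigned i = 0; i < delay; i++) { }
        if (delay < max_delay)
            delay *= 2;
    }
}

static void tatas_release(tatas_lock *l)
{
    assert(atomic_load(&l->busy));      /* releasing a FREE lock is a bug */
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

Dropping the backoff loop from `tatas_exp_acquire` gives plain TATAS; dropping the initial read in `tatas_try_acquire` as well gives plain TAS.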


Performance, 12 years ago …

Traditional microbenchmark

for (i = 0; i < iterations; i++) {
    ACQUIRE(lock);
    // Null Critical Section (CS)
    RELEASE(lock);
}

Thanks: Michael L. Scott

IF (more contention) THEN less efficient CS …



Making it Scalable: Queues …

Spin on your predecessor’s flag

First-come first-served order

Queue-Based Locks

• QOLB/QOSB ’89

• MCS ’91

• CLH ’93
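"Spin on your predecessor's flag" can be illustrated with a CLH-style queue lock. The sketch below uses C11 atomics; the type and function names (`clh_node`, `clh_thread`, and so on) are assumptions for this example, and each thread is assumed to own one `clh_thread` record.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct clh_node {
    atomic_bool locked;              /* true while its owner holds or waits */
} clh_node;

typedef struct {
    _Atomic(clh_node *) tail;        /* last node in the implicit queue */
} clh_lock;

typedef struct {                     /* per-thread state */
    clh_node *mine;                  /* node this thread enqueues */
    clh_node *pred;                  /* predecessor's node, spun on */
} clh_thread;

/* The lock starts with an unlocked dummy node as its tail. */
static void clh_init(clh_lock *l, clh_node *dummy)
{
    atomic_init(&dummy->locked, false);
    atomic_init(&l->tail, dummy);
}

static void clh_thread_init(clh_thread *t, clh_node *node)
{
    atomic_init(&node->locked, false);
    t->mine = node;
    t->pred = NULL;
}

static void clh_acquire(clh_lock *l, clh_thread *t)
{
    assert(t->mine != NULL);
    atomic_store(&t->mine->locked, true);
    /* Enqueue: the swap order defines the first-come first-served order. */
    t->pred = atomic_exchange(&l->tail, t->mine);
    /* Spin on the predecessor's flag only. */
    while (atomic_load_explicit(&t->pred->locked, memory_order_acquire)) { }
}

static void clh_release(clh_thread *t)
{
    clh_node *released = t->mine;
    atomic_store_explicit(&released->locked, false, memory_order_release);
    t->mine = t->pred;               /* recycle the predecessor's node */
}
```

Because each waiter spins on a different flag, a release wakes exactly one successor, which is what makes the lock scale on a single bus but also what forces strict queue order across nodes.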


Performance, May 2002

Traditional microbenchmark

[Figure: Time/Processors [seconds] (0.00 to 0.25) versus Processors (0 to 16) for TATAS, TATAS_EXP, MCS, and CLH on a 16-processor Sun Enterprise E6000 SMP.]


Synchronization Today

Commercial applications use spin-locks (!), usually TATAS & TATAS_EXP with timeout for

• recovery from transaction deadlock

• recovery from preemption of the lock holder

POSIX threads:

• pthread_mutex_lock

• pthread_mutex_unlock

HPC: runtime systems, OpenMP, …
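A spin lock "with timeout" can be sketched as a TATAS_EXP acquire that gives up after a bounded number of attempts, so the caller can start its recovery path. The names `spin_lock` and `spin_acquire_timeout`, the attempt budget (used here instead of a wall-clock deadline), and the backoff cap are assumptions for this example, not details taken from any commercial system.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool busy;                /* false = FREE, true = BUSY */
} spin_lock;

static void spin_init(spin_lock *l)
{
    atomic_init(&l->busy, false);
}

/* Returns true on success, false after max_attempts failed attempts. */
static bool spin_acquire_timeout(spin_lock *l, unsigned long max_attempts)
{
    unsigned delay = 1;

    for (unsigned long attempt = 0; attempt < max_attempts; attempt++) {
        if (!atomic_load_explicit(&l->busy, memory_order_relaxed) &&
            !atomic_exchange_explicit(&l->busy, true, memory_order_acquire))
            return true;
        /* Exponential backoff with an assumed cap. */
        for (volatile unsigned i = 0; i < delay; i++) { }
        if (delay < 1024u)
            delay *= 2;
    }
    return false;                    /* timed out: caller runs recovery */
}

static void spin_release(spin_lock *l)
{
    assert(atomic_load(&l->busy));   /* releasing a FREE lock is a bug */
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

A waiter can abandon this kind of lock at any time without telling anyone, which is harder to arrange in a queue-based lock where successors depend on each queued node.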


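Non-Uniform Memory Architecture (NUMA)

NUMA optimizations

• Page migration

• Page replication

[Figure: two nodes connected by a switch. Each node has processors P1 … Pn with caches ($) and its own memory. Relative access cost: 1 for local memory, 2 – 10 for remote memory.]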


Non-Uniform Communication Architecture (NUCA)

NUCA examples (NUCA ratios):

• 1992: Stanford DASH (~ 4.5)

• 1996: Sequent NUMA-Q (~ 10)

• 1999: Sun WildFire (~ 6)

• 2000: Compaq DS-320 (~ 3.5)

• Future: CMP, SMT (~ 10)

[Figure: two nodes connected by a switch. Each node has processors P1 … Pn with caches ($) and its own memory. NUCA ratio: relative communication cost of 1 within a node versus 2 – 10 between nodes.]

Our NUCA …


Our NUCA: Sun WildFire

Two E6000 connected through a hardware-coherent interface with a raw bandwidth of 800 MB/s in each direction

16 UltraSPARC II (250 MHz) CPUs per node

8 GB memory

NUCA ratio 6


Performance on our NUCA

[Figure: Time/Processors [seconds] (0.00 to 0.50) versus Processors (0 to 32) for TATAS, TATAS_EXP, MCS, and CLH.]

[Figure: Node-handoffs [%] (0 to 100) versus Processors (0 to 32) for the same locks, on the 2-node WildFire with 16 CPUs per node.]


Our Goals

Demonstrate that the first-come first-served nature of queue-based locks is unwanted for NUCAs

• new microbenchmark: “more realistic” behavior, and

• real application study

Design a scalable spin lock that exploits the NUCAs

• creating a controlled unfairness (stable lock), and

• reducing the traffic compared with the test&set locks


Outline

• History & Background

• NUMA vs. NUCA

• Experimentation Environment

• The RH Lock

• Performance Results

• Application Performance

• Conclusions


Key Ideas Behind RH Lock

Minimizing global traffic at lock-handover

• Only one thread per node will try to acquire a “remote” lock

Maximizing node locality of NUCAs

• Hand over the lock to a neighbor in the same node

• Creates locality for the critical section (CS) data as well

• Especially good for large CS and high contention

RH lock in a nutshell:

• Double TATAS_EXP: one node-local lock + one “global”


The RH Lock Algorithm

[Figure: two cabinets. Cabinet 1 holds P1 … P16 with caches ($) and memory; Cabinet 2 holds P17 … P32 with caches ($) and memory. Each cabinet's memory holds one copy of the lock (Lock1 and Lock2). In the animation a lock copy holds FREE, L_FREE, REMOTE, or the thread ID (TID) of a local thread, e.g., 2 for P2 or 19 for P19.]

Acquire:

• SWAP(my_TID, Lock)

• If (FREE or L_FREE): You’ve got it!

• if “REMOTE”: Spin remotely, CAS(FREE, REMOTE) until FREE (w/ exp backoff)

• else: TATAS(my_TID, Lock) until FREE or L_FREE

Release:

• CAS(my_TID, FREE), else L_FREE

IF (more contention) THEN more efficient CS
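The acquire and release steps on this slide can be sketched for the two-node case with C11 atomics. This is a reconstruction from the slide's pseudocode under stated assumptions, not the authors' implementation: the struct layout, the numeric encoding of FREE, L_FREE, and REMOTE, passing `node` and `tid` as arguments, the backoff constants, and treating a local copy that turns REMOTE while a thread waits as "become the remote spinner" are all choices made for this example. The `be_fare` argument mirrors the flag of the same name in the talk's fairness code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Lock-word values; anything >= RH_FIRST_TID is a thread ID (assumed encoding). */
enum { RH_FREE = 0, RH_L_FREE = 1, RH_REMOTE = 2, RH_FIRST_TID = 3 };

typedef struct {
    _Atomic uintptr_t copy[2];       /* one lock copy per node (two nodes assumed) */
} rh_lock;

/* The lock starts FREE in its home node and REMOTE in the other node. */
static void rh_init(rh_lock *l, int home_node)
{
    atomic_init(&l->copy[home_node], (uintptr_t)RH_FREE);
    atomic_init(&l->copy[1 - home_node], (uintptr_t)RH_REMOTE);
}

/* Busy-wait, then double the delay up to an assumed cap. */
static void rh_backoff(unsigned *delay)
{
    for (volatile unsigned i = 0; i < *delay; i++) { }
    if (*delay < 1024u)
        *delay *= 2;
}

static void rh_acquire(rh_lock *l, int node, uintptr_t tid)
{
    assert(node == 0 || node == 1);
    assert(tid >= RH_FIRST_TID);
    _Atomic uintptr_t *local = &l->copy[node];
    _Atomic uintptr_t *remote = &l->copy[1 - node];
    unsigned delay = 1;

    uintptr_t old = atomic_exchange(local, tid);        /* SWAP(my_TID, Lock) */
    while (old >= RH_FIRST_TID) {                       /* a neighbor is ahead */
        while (atomic_load(local) >= RH_FIRST_TID)      /* local "test" phase */
            rh_backoff(&delay);
        old = atomic_exchange(local, tid);              /* "test_and_set" phase */
    }
    if (old == RH_FREE || old == RH_L_FREE)
        return;                                         /* you've got it */

    /* old == RH_REMOTE: this thread alone spins on the other node's copy. */
    delay = 1;
    uintptr_t expected = RH_FREE;
    while (!atomic_compare_exchange_strong(remote, &expected,
                                           (uintptr_t)RH_REMOTE)) {
        expected = RH_FREE;                             /* CAS(FREE, REMOTE) */
        rh_backoff(&delay);
    }
}

static void rh_release(rh_lock *l, int node, uintptr_t tid, bool be_fare)
{
    _Atomic uintptr_t *local = &l->copy[node];
    uintptr_t expected = tid;

    if (be_fare)
        atomic_store(local, (uintptr_t)RH_FREE);        /* other node may win */
    else if (!atomic_compare_exchange_strong(local, &expected,
                                             (uintptr_t)RH_FREE))
        atomic_store(local, (uintptr_t)RH_L_FREE);      /* neighbor waiting */
}
```

In this sketch a failed release CAS means some neighbor has swapped its own TID into the local copy, so writing L_FREE keeps the lock inside the node; the remote spinner only succeeds once the copy reads FREE.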


Performance Results

Traditional microbenchmark, 2-node Sun WildFire

[Figure: Node-handoffs [%] (0 to 100) versus Processors (0 to 32) for TATAS, TATAS_EXP, MCS, CLH, and RH with Fair_factor = 1, 50, and 100.]

[Figure: Time/Processors [seconds] (0.00 to 0.50) versus Processors (0 to 32) for TATAS, TATAS_EXP, MCS, CLH, and RH.]


Controlling Unfairness …

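[Figure: Cabinet 1 with processors P1 … Pn, caches ($), and memory holding a lock copy whose value changes between FREE, a thread ID (TID), and L_FREE.]

void rh_acquire_slowpath(rh_lock *L)
{
    ...
    /* About one slow-path acquire in FAIR_FACTOR ends with a fair release. */
    if ((random() % FAIR_FACTOR) == 0)
        be_fare = TRUE;
    else
        be_fare = FALSE;
    ...
}

void rh_release(rh_lock *L)
{
    if (be_fare)
        *L = FREE;      /* fair: a remote node may now take the lock */
    else if (cas(L, my_tid, FREE) != my_tid)
        *L = L_FREE;    /* a neighbor is waiting: keep the lock in the node */
}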


Node-handoffs

Traditional microbenchmark, 2-node Sun WildFire

[Figure: Node-handoffs [%] (0 to 100) versus Processors (0 to 32) for TATAS, TATAS_EXP, MCS, CLH, and RH with Fair_factor = 1, 50, and 100.]
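One way to compute a node-handoff percentage from a trace of lock owners is sketched below. The definition used here is an assumption for illustration: a handoff counts as a node-handoff when consecutive lock holders run on CPUs in different nodes, with CPU IDs numbered from 0 and `cpus_per_node` consecutive IDs per node. The function name and trace format are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Percentage of consecutive lock handovers that cross a node boundary. */
static double node_handoff_percent(const int *owner_cpu, size_t n,
                                   int cpus_per_node)
{
    assert(cpus_per_node > 0);
    if (n < 2)
        return 0.0;                  /* no handover happened */

    size_t crossings = 0;
    for (size_t i = 1; i < n; i++) {
        int prev_node = owner_cpu[i - 1] / cpus_per_node;
        int next_node = owner_cpu[i] / cpus_per_node;
        if (prev_node != next_node)
            crossings++;
    }
    return 100.0 * (double)crossings / (double)(n - 1);
}
```

For the owner trace 0, 1, 17, 18, 2 with 16 CPUs per node, two of the four handovers cross nodes, so the function returns 50.0. With uniformly random handovers between two equally loaded nodes the value would sit near 50%, while a node-local handover policy pushes it toward 0%.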


New Microbenchmark

for (i = 0; i < iterations; i++) {
    ACQUIRE(lock);
    // Critical Section (CS) work
    RELEASE(lock);
    // Non-CS work STATIC part +
    // Non-CS work RANDOM part
}

• More realistic node-handoffs for queue-based locks

• Constant number of processors

• The amount of Critical Section (CS) work can be increased: we can control the “amount of contention”
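The loop above can be fleshed out as follows. Everything beyond the loop's structure is an assumption made for this sketch: the CS work is modeled as incrementing every element of an array (suggested by the "Critical Work [array size]" axis in the results), the non-CS work is a busy loop of `static_units` plus a pseudo-random number of units below `random_units`, and the lock is a stand-in test_and_set lock so that the example is self-contained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in lock (plain test_and_set) so the sketch is self-contained. */
#define ACQUIRE(lock) do { while (atomic_flag_test_and_set(lock)) { } } while (0)
#define RELEASE(lock) atomic_flag_clear(lock)

/* xorshift32 pseudo-random generator; the state must be non-zero. */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Busy loop standing in for non-CS computation. */
static void delay_units(uint32_t units)
{
    for (volatile uint32_t i = 0; i < units; i++) { }
}

/* Runs one worker's loop; returns the total non-CS delay units spent. */
static uint64_t run_worker(atomic_flag *lock, int *cs_array, size_t cs_size,
                           uint32_t static_units, uint32_t random_units,
                           unsigned iterations, uint32_t seed)
{
    uint64_t total = 0;
    assert(seed != 0);

    for (unsigned i = 0; i < iterations; i++) {
        ACQUIRE(lock);
        for (size_t k = 0; k < cs_size; k++)    /* CS work */
            cs_array[k]++;
        RELEASE(lock);

        uint32_t units = static_units;          /* non-CS work, STATIC part */
        if (random_units > 0)                   /* non-CS work, RANDOM part */
            units += next_random(&seed) % random_units;
        delay_units(units);
        total += units;
    }
    return total;
}
```

Growing `cs_size` while holding the thread count and the non-CS parameters fixed raises the share of time spent inside the critical section, which is how contention is varied without changing the number of processors.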


Performance Results

New microbenchmark, 2-node Sun WildFire, 28 CPUs

[Figure: Time [seconds] (0 to 30) versus Critical Work [array size] (0 to 2000) for TATAS, TATAS_EXP, MCS, CLH, and RH.]

[Figure: Node-handoffs [%] (0 to 60) versus Critical Work [array size] (0 to 2000) for the same locks, on WildFire (WF) with 14 CPUs per node.]


Application Performance (1)

Methodology

The SPLASH-2 programs

• 14 apps

We study only applications with more than 10,000 acquire/release operations

• Barnes, Cholesky, FMM, Radiosity, Raytrace, Volrend, and Water-Nsq

Synchronization algorithms

• TATAS, TATAS_EXP, MCS, CLH, and RH

2-node Sun WildFire

Program Lock Acquires

Barnes 69,193

Cholesky 74,284

FFT 32

FMM 80,528

LU-c & LU-nc 32

Ocean-c 6,304

Ocean-nc 6,656

Radiosity 295,627

Radix 32

Raytrace 366,450

Volrend 38,456

Water-Nsq 112,415

Water-Sp 510


Application Performance (2)

Raytrace Speedup

[Figure: Speedup (0 to 8) versus Number of Processors (0 to 28) for TATAS, TATAS_EXP, MCS, CLH, and RH on WildFire (WF).]


Single-Processor Results

Traditional microbenchmark, null CS

TATAS 97 ns

TATAS_EXP 97 ns

MCS 202 ns

CLH 137 ns

RH 121 ns

for (i = 0; i < iterations; i++) {
    ACQUIRE(lock);
    RELEASE(lock);
}


Performance Results

Traditional microbenchmark, single-node E6000

Bind all threads to only one of the E6000 nodes

[Figure: Time/Processors [seconds] (0.00 to 0.25) versus Processors (0 to 16) for TATAS, TATAS_EXP, MCS, CLH, and RH.]

As expected: RH lock ≈ TATAS_EXP


Conclusions

First-come first-served not desirable for NUCAs

The RH lock exploits NUCAs by

• creating locality through controlled unfairness (stable lock)

• reducing traffic compared with the test&set locks

The only lock that performs better under contention

A critical section (CS) guarded by the RH lock takes less than half the time to execute compared with the same CS guarded by any other lock

Raytrace on 30 CPUs: 1.83 – 5.70 “better”

Works best for NUCA with a few large “nodes”


http://www.it.uu.se/research/group/uart

UART’s Home Page
