Transcript of lecture slides: "Concurrent programming: From theory to practice", Concurrent Algorithms 2016, Tudor David.

Page 1:

Concurrent programming: From theory to practice

Concurrent Algorithms 2016, Tudor David

Page 2:

From theory to practice

Theoretical (design) → Practical (design) → Practical (implementation)

Page 3:

From theory to practice

Theoretical (design) → Practical (design) → Practical (implementation)

Theoretical (design):
- Impossibilities
- Upper/lower bounds
- Techniques
- System models
- Correctness proofs
→ Correctness
→ Design (pseudo-code)

Page 4:

From theory to practice

Theoretical (design) → Practical (design) → Practical (implementation)

Theoretical (design):
- Impossibilities
- Upper/lower bounds
- Techniques
- System models
- Correctness proofs
→ Correctness
→ Design (pseudo-code)

Practical (design):
- System models: shared memory, message passing
- Finite memory
- Practicality issues: re-usable objects
- Performance
→ Design (pseudo-code, prototype)

Page 5:

From theory to practice

Theoretical (design) → Practical (design) → Practical (implementation)

Theoretical (design):
- Impossibilities
- Upper/lower bounds
- Techniques
- System models
- Correctness proofs
→ Correctness
→ Design (pseudo-code)

Practical (design):
- System models: shared memory, message passing
- Finite memory
- Practicality issues: re-usable objects
- Performance
→ Design (pseudo-code, prototype)

Practical (implementation):
- Hardware: which atomic ops, memory consistency, cache coherence, locality
- Performance
- Scalability
→ Implementation (code)

Page 6:

Example: linked list implementations

[Chart: throughput (Mop/s, 0-12) vs. number of cores (1 to 40); legend: "bad" linked list, "good" linked list, pessimistic.]

Page 7:

Outline

- CPU caches
- Cache coherence
- Placement of data
- Hardware synchronization instructions
- Correctness: memory model & compiler
- Performance: programming techniques

Page 8:

Outline

- CPU caches
- Cache coherence
- Placement of data
- Hardware synchronization instructions
- Correctness: memory model & compiler
- Performance: programming techniques

Page 9:

Why do we use caching?

- Core frequency: 2 GHz = 0.5 ns/instruction
- Core → Disk: ~ms

Page 10:

Why do we use caching?

- Core frequency: 2 GHz = 0.5 ns/instruction
- Core → Disk: ~ms
- Core → Memory: ~100 ns

Page 11:

Why do we use caching?

- Core frequency: 2 GHz = 0.5 ns/instruction
- Core → Disk: ~ms
- Core → Memory: ~100 ns
- Cache: large = slow, medium = medium, small = fast

Page 12:

Why do we use caching?

- Core frequency: 2 GHz = 0.5 ns/instruction
- Core → Disk: ~ms
- Core → Memory: ~100 ns
- Cache:
  - Core → L3: ~20 ns
  - Core → L2: ~7 ns
  - Core → L1: ~1 ns

Page 13:

Typical server configurations

Intel Xeon:
- 12 cores @ 2.4 GHz
- L1: 32 KB
- L2: 256 KB
- L3: 24 MB
- Memory: 256 GB

AMD Opteron:
- 8 cores @ 2.4 GHz
- L1: 64 KB
- L2: 512 KB
- L3: 12 MB
- Memory: 256 GB

Page 14:

Experiment: measure the throughput of accessing some memory, depending on the memory size.
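A minimal sketch of such an experiment (an assumed setup, not the lecture's code): stride through buffers of increasing size at cache-line granularity and time the accesses; throughput drops each time the working set outgrows a cache level.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ACCESSES (1L << 26)

/* Walk a buffer of `size` bytes with a 64-byte (cache-line) stride
 * and return the achieved throughput in accesses per second. */
static double measure(size_t size) {
    volatile char *buf = malloc(size);
    struct timespec start, end;
    size_t idx = 0;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < ACCESSES; i++) {
        buf[idx] += 1;              /* touch one byte per cache line */
        idx = (idx + 64) % size;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double sec = (end.tv_sec - start.tv_sec) +
                 (end.tv_nsec - start.tv_nsec) / 1e9;
    free((void *)buf);
    return ACCESSES / sec;
}

int main(void) {
    /* Working sets from 16 KB (fits in L1) to 64 MB (spills to memory). */
    for (size_t size = 16 * 1024; size <= 64 * 1024 * 1024; size *= 2)
        printf("%8zu KB: %.1f M accesses/s\n", size / 1024,
               measure(size) / 1e6);
    return 0;
}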

Page 15:

Outline

- CPU caches
- Cache coherence
- Placement of data
- Hardware synchronization instructions
- Correctness: memory model & compiler
- Performance: programming techniques

Page 16:

Until ~2004: single-cores

- Core frequency: 3+ GHz
- Core → Disk
- Core → Memory
- Cache: Core → L3, Core → L2, Core → L1

[Diagram: a single core with its private caches, memory, and disk.]

Page 17:

After ~2004: multi-cores

- Core frequency: ~2 GHz
- Core → Disk
- Core → Memory
- Cache: Core → shared L3, Core → L2, Core → L1

[Diagram: two cores, each with private L1 and L2 caches, sharing an L3, plus memory and disk.]

Page 18:

Multi-cores with private caches

Private = multiple copies of the same data may exist in different cores' caches.

[Diagram: two cores with private L1/L2 caches, shared L3, memory, disk.]

Page 19:

Cache coherence for consistency

Core 0 has X in its cache, and Core 1:
- wants to write X
- wants to read X
- did Core 0 write or read X?

Page 20:

Cache-coherence principles

- To perform a write: invalidate all readers, or the previous writer
- To perform a read: find the latest copy

Page 21:

Cache coherence with MESI

A state diagram, with a state per cache line:
- Modified: the only dirty copy
- Exclusive: the only clean copy
- Shared: a clean copy
- Invalid: useless data
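A short worked example of the transitions (standard MESI behavior): Core 0 reads X from memory and caches it as Exclusive. Core 1 then reads X, so both copies become Shared. Core 1 writes X: Core 0's copy is invalidated (Invalid) and Core 1's becomes Modified. If Core 0 reads X again, the dirty data is supplied by Core 1 and both copies end up Shared.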

Page 22:

The ultimate goal for scalability

Possible states:
- Modified: the only dirty copy
- Exclusive: the only clean copy
- Shared: a clean copy
- Invalid: useless data

Which state is our "favorite"?

Page 23:

The ultimate goal for scalability

Possible states:
- Modified: the only dirty copy
- Exclusive: the only clean copy
- Shared: a clean copy = threads can keep the data close (L1 cache) = faster
- Invalid: useless data

Page 24:

Experiment: the effects of false sharing.
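A minimal sketch of such an experiment (an assumed setup, not the lecture's code): two threads increment two independent counters; in one layout the counters share a cache line, in the other each is padded to its own 64-byte line. The padded version is typically several times faster, even though the threads never touch each other's data.

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

struct shared_line { volatile long a, b; };                 /* same cache line */
struct padded { volatile long v; char pad[64 - sizeof(long)]; };
struct own_lines { struct padded a, b; };                   /* separate lines */

static struct shared_line s;
static struct own_lines p;

static void *inc_shared_a(void *x) { for (long i = 0; i < ITERS; i++) s.a++;   return NULL; }
static void *inc_shared_b(void *x) { for (long i = 0; i < ITERS; i++) s.b++;   return NULL; }
static void *inc_padded_a(void *x) { for (long i = 0; i < ITERS; i++) p.a.v++; return NULL; }
static void *inc_padded_b(void *x) { for (long i = 0; i < ITERS; i++) p.b.v++; return NULL; }

static void run(void *(*fa)(void *), void *(*fb)(void *), const char *name) {
    struct timespec t0, t1;
    pthread_t ta, tb;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&ta, NULL, fa, NULL);
    pthread_create(&tb, NULL, fb, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%-14s %.2f s\n", name,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
}

int main(void) {
    run(inc_shared_a, inc_shared_b, "false sharing");
    run(inc_padded_a, inc_padded_b, "padded");
    return 0;
}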

Page 25:

Outline

- CPU caches
- Cache coherence
- Placement of data
- Hardware synchronization instructions
- Correctness: memory model & compiler
- Performance: programming techniques

Page 26:

Uniformity vs. non-uniformity

- Typical desktop machine: all cores reach one memory through the same caches = uniform
- Typical server machine: several sockets, each with its own cores, caches, and memory = non-uniform (NUMA)

Pages 27-35:

Latency (ns) to access data

[Diagram, built up one value per slide: two NUMA sockets, each with cores, private L1/L2 caches, a shared L3, and local memory. Latencies from nearest to farthest: L1 = 1 ns, L2 = 7 ns, L3 = 20 ns, then 40, 80, 90, and 130 ns for progressively more remote locations (other cores' caches, local memory, and the remote socket's caches and memory).]

Conclusion: we need to take care of locality.

Page 36:

Experiment: the effects of locality.

Page 37:

Outline

- CPU caches
- Cache coherence
- Placement of data
- Hardware synchronization instructions
- Correctness: memory model & compiler
- Performance: programming techniques

Page 38:

The Programmer's Toolbox: hardware synchronization instructions

- Depends on the processor
- CAS generally provided
- TAS and atomic increment not always provided
- x86 processors (Intel, AMD):
  - atomic exchange, increment, and decrement provided
  - memory barrier also available
- Intel, as of 2014, provides transactional memory

Page 39:

Example: atomic ops in GCC

type __sync_fetch_and_OP(type *ptr, type value);
type __sync_OP_and_fetch(type *ptr, type value);
// OP in {add, sub, or, and, xor, nand}

type __sync_val_compare_and_swap(type *ptr, type oldval, type newval);
bool __sync_bool_compare_and_swap(type *ptr, type oldval, type newval);

__sync_synchronize(); // memory barrier
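For instance, a shared counter can be updated safely with these builtins (a minimal sketch; the __sync builtins are the legacy GCC interface shown above, newer code would use the __atomic family):

#include <pthread.h>
#include <stdio.h>

static int counter = 0;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++)
        __sync_fetch_and_add(&counter, 1);  /* atomic increment */
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("%d\n", counter);  /* always 4000000, unlike a plain counter++ */
    return 0;
}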

Page 40:

Intel's transactional synchronization extensions (TSX)

1. Hardware lock elision (HLE)
- Instruction prefixes: XACQUIRE, XRELEASE
- Example (GCC): __hle_{acquire,release}_compare_exchange_n{1,2,4,8}
- Try to execute critical sections without acquiring/releasing the lock
- If a conflict is detected, abort and acquire the lock before re-doing the work
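A sketch of what lock elision looks like in practice (an assumption, not the lecture's code: GCC 4.8+ exposes the XACQUIRE/XRELEASE prefixes through __ATOMIC_HLE_ACQUIRE/__ATOMIC_HLE_RELEASE flags on the __atomic builtins; requires a TSX-capable CPU):

#include <immintrin.h>  /* _mm_pause */

void hle_lock(int *lock) {
    /* XACQUIRE-prefixed exchange: the hardware may elide the write
     * and run the critical section as a transaction. */
    while (__atomic_exchange_n(lock, 1,
               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE)) {
        while (__atomic_load_n(lock, __ATOMIC_RELAXED))
            _mm_pause();  /* wait without aborting other elisions */
    }
}

void hle_unlock(int *lock) {
    /* XRELEASE-prefixed store: commits the elided critical section. */
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}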

Page 41:

Intel's transactional synchronization extensions (TSX)

2. Restricted Transactional Memory (RTM): _xbegin(); _xabort(); _xtest(); _xend();

Limitations:
- Not starvation-free
- Transactions can be aborted for various reasons
- Should have a non-transactional back-up
- Limited transaction size

Page 42:

Intel's transactional synchronization extensions (TSX)

2. Restricted Transactional Memory (RTM)

Example:

if (_xbegin() == _XBEGIN_STARTED) {
    counter = counter + 1;
    _xend();
} else {
    __sync_fetch_and_add(&counter, 1);
}
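Because a transaction can abort for transient reasons, real code typically retries a few times before taking the fallback; a sketch under that assumption (using the immintrin.h RTM intrinsics; compile with -mrtm):

#include <immintrin.h>

void atomic_inc(volatile long *counter) {
    for (int attempt = 0; attempt < 3; attempt++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            (*counter)++;
            _xend();                       /* commit */
            return;
        }
        if (!(status & _XABORT_RETRY))     /* abort looks permanent */
            break;
    }
    __sync_fetch_and_add(counter, 1);      /* non-transactional back-up */
}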

Page 43:

Outline

- CPU caches
- Cache coherence
- Placement of data
- Hardware synchronization instructions
- Correctness: memory model & compiler
- Performance: programming techniques

Page 44:

Concurrent algorithm correctness

Designing correct concurrent algorithms involves:
1. a theoretical part
2. a practical part → the implementation

The processor and the compiler optimize assuming no concurrency!

Page 45:

The memory consistency model

// A, B: shared variables, initially 0
// r1, r2: local variables

P1:          P2:
A = 1;       B = 1;
r1 = B;      r2 = A;

What values can (r1, r2) take? (Assume an x86 processor.)

Answer: (0,1), (1,0), (1,1), and (0,0)
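The (0,0) outcome is real: each core's write can sit in its store buffer while the subsequent read executes. A minimal litmus-test harness to observe it (an assumed setup, not the lecture's code):

#include <pthread.h>
#include <stdio.h>

volatile int A, B, r1, r2;
pthread_barrier_t start;

static void *p1(void *arg) {
    pthread_barrier_wait(&start);
    A = 1;
    r1 = B;   /* the store to A may still be in the store buffer here */
    return NULL;
}

static void *p2(void *arg) {
    pthread_barrier_wait(&start);
    B = 1;
    r2 = A;
    return NULL;
}

int main(void) {
    int zeros = 0, iters = 100000;
    pthread_barrier_init(&start, NULL, 2);
    for (int i = 0; i < iters; i++) {
        pthread_t t1, t2;
        A = B = 0;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        if (r1 == 0 && r2 == 0) zeros++;
    }
    printf("(0,0) observed %d times out of %d\n", zeros, iters);
    return 0;
}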

Page 46:

The memory consistency model

→ The order in which memory instructions appear to execute.

What would the programmer like to see? Sequential consistency:
- all operations executed in some sequential order
- the memory operations of each thread in program order
- intuitive, but limits performance

Page 47:

The memory consistency model

How can the processor reorder instructions to different memory addresses?

x86 (Intel, AMD): a TSO variant
- Reads not reordered w.r.t. reads
- Writes not reordered w.r.t. writes
- Writes not reordered w.r.t. reads
- Reads may be reordered w.r.t. writes to different memory addresses

// A, B, C: globals
...
int x, y, z;
x = A;
y = B;
B = 3;
A = 2;
y = A;
C = 4;
z = B;
...

Page 48:

The memory consistency model

- Single thread: reorderings are transparent
- To avoid reorderings: memory barriers
  - x86: implicit in atomic ops
  - "volatile" in Java
  - expensive, use only when really necessary
- Different processors have different memory models
  - e.g., ARM has a relaxed memory model (anything goes!)
- VMs (e.g., JVM, CLR) have their own memory models
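Reusing the litmus test from page 45, a full barrier between each write and the following read rules out the (0,0) outcome on x86 (a sketch, assuming the GCC builtin):

static void *p1_fenced(void *arg) {
    A = 1;
    __sync_synchronize();  /* full barrier (mfence on x86): the store to A
                              becomes globally visible before B is read */
    r1 = B;
    return NULL;
}

static void *p2_fenced(void *arg) {
    B = 1;
    __sync_synchronize();
    r2 = A;
    return NULL;
}
/* With both fences in place, (r1, r2) = (0, 0) can no longer occur. */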

Page 49:

Beware of the compiler

The compiler can:
- reorder instructions
- remove instructions
- not write values to memory

volatile int the_lock = 0;

/* CAS: e.g., __sync_val_compare_and_swap in GCC */
void lock(int *some_lock) {
    while (CAS(some_lock, 0, 1) != 0) {}
    asm volatile("" ::: "memory");  /* compiler barrier */
}

void unlock(int *some_lock) {
    asm volatile("" ::: "memory");  /* compiler barrier */
    *some_lock = 0;
}

/* Usage: */
lock(&the_lock);
...
unlock(&the_lock);

Note: C "volatile" != Java "volatile"

Page 50:

Outline

- CPU caches
- Cache coherence
- Placement of data
- Hardware synchronization instructions
- Correctness: memory model & compiler
- Performance: programming techniques

Page 51:

Concurrent programming techniques

- What techniques can we use to speed up our concurrent application?
- Main idea: minimize contention on cache lines
- Use case: locks
  - acquire() = lock()
  - release() = unlock()

Page 52:

TAS: the simplest lock

Test-and-Set lock:

typedef volatile uint lock_t;

/* TAS: atomically set the lock to 1 and return its previous value
   (e.g., __sync_lock_test_and_set(some_lock, 1) in GCC) */
void acquire(lock_t *some_lock) {
    while (TAS(some_lock) != 0) {}
    asm volatile("" ::: "memory");
}

void release(lock_t *some_lock) {
    asm volatile("" ::: "memory");
    *some_lock = 0;
}
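A sketch of how this lock is exercised (an assumed harness; TAS mapped to GCC's __sync_lock_test_and_set, which atomically writes 1 and returns the previous value):

#include <pthread.h>
#include <stdio.h>

typedef volatile unsigned int lock_t;
#define TAS(l) __sync_lock_test_and_set((l), 1)

void acquire(lock_t *some_lock) {
    while (TAS(some_lock) != 0) {}
    asm volatile("" ::: "memory");
}
void release(lock_t *some_lock) {
    asm volatile("" ::: "memory");
    *some_lock = 0;
}

lock_t the_lock = 0;
long shared_counter = 0;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        acquire(&the_lock);
        shared_counter++;           /* critical section */
        release(&the_lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("%ld\n", shared_counter);  /* 4000000 */
    return 0;
}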

Page 53:

How good is this lock?

A simple benchmark:
- have 48 threads continuously acquire the lock, update some shared data, and unlock
- measure how many operations we can do in a second

Test-and-Set lock: 190K operations/second

Page 54:

How can we improve things?

Avoid cache-line ping-pong: the Test-and-Test-and-Set lock

void acquire(lock_t *some_lock) {
    while (1) {
        while (*some_lock != 0) {}   /* read (no invalidations) until free */
        if (TAS(some_lock) == 0) {
            asm volatile("" ::: "memory");
            return;
        }
    }
}

void release(lock_t *some_lock) {
    asm volatile("" ::: "memory");
    *some_lock = 0;
}

Page 55:

Performance comparison

[Bar chart: ops/second (thousands, 0-400) for Test-and-Set vs. Test-and-Test-and-Set.]

Page 56:

But we can do even better

Avoid the thundering herd: Test-and-Test-and-Set with back-off

void acquire(lock_t *some_lock) {
    uint backoff = INITIAL_BACKOFF;
    while (1) {
        while (*some_lock != 0) {}
        if (TAS(some_lock) == 0) {
            asm volatile("" ::: "memory");
            return;
        } else {
            lock_sleep(backoff);
            backoff = min(backoff * 2, MAXIMUM_BACKOFF);
        }
    }
}

void release(lock_t *some_lock) {
    asm volatile("" ::: "memory");
    *some_lock = 0;
}
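The slide leaves lock_sleep and the back-off constants undefined; one plausible definition (an assumption, not the lecture's code) busy-waits with pause hints so the waiting thread stays off the lock's cache line:

#include <immintrin.h>  /* _mm_pause */

#define INITIAL_BACKOFF 64
#define MAXIMUM_BACKOFF 8192
#define min(a, b) ((a) < (b) ? (a) : (b))

/* Wait roughly `cycles` iterations without touching shared memory. */
static void lock_sleep(unsigned int cycles) {
    for (unsigned int i = 0; i < cycles; i++)
        _mm_pause();  /* tell the CPU we are spinning */
}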

Page 57:

Performance comparison

[Bar chart: ops/second (thousands, 0-800) for Test-and-Set, Test-and-Test-and-Set, and Test-and-Test-and-Set with back-off.]

Page 58:

Are these locks fair?

[Chart: processed requests per thread (threads 1-48, up to ~35,000 requests) under the Test-and-Set lock; the requests are distributed very unevenly across threads.]

Page 59:

What if we want fairness?

Use a FIFO mechanism: Ticket locks

typedef struct {
    volatile uint head;
    volatile uint tail;
} ticket_lock_t;

void acquire(ticket_lock_t *a_lock) {
    /* fetch_and_inc: e.g., __sync_fetch_and_add(..., 1) in GCC */
    uint my_ticket = fetch_and_inc(&(a_lock->tail));
    while (a_lock->head != my_ticket) {}
    asm volatile("" ::: "memory");
}

void release(ticket_lock_t *a_lock) {
    asm volatile("" ::: "memory");
    a_lock->head++;
}

Page 60:

What if we want fairness?

[Chart: processed requests per thread (threads 1-48) under ticket locks; the requests are spread evenly across threads.]

Page 61:

Performance comparison

[Bar chart: ops/second (thousands, 0-800), now including the ticket lock.]

Page 62:

Can we back-off here as well?

Yes, we can: proportional back-off

void acquire(ticket_lock_t *a_lock) {
    uint my_ticket = fetch_and_inc(&(a_lock->tail));
    uint distance, current_ticket;
    while (1) {
        current_ticket = a_lock->head;
        if (current_ticket == my_ticket) break;
        distance = my_ticket - current_ticket;
        if (distance > 1)
            lock_sleep(distance * BASE_SLEEP);  /* sleep longer when farther back in the queue */
    }
    asm volatile("" ::: "memory");
}

void release(ticket_lock_t *a_lock) {
    asm volatile("" ::: "memory");
    a_lock->head++;
}

Page 63:

Performance comparison

[Bar chart: ops/second (thousands, 0-1600), now including the ticket lock with proportional back-off.]

Page 64:

Still, everyone is spinning on the same variable...

Use a different address for each thread: queue locks (a code sketch follows below)

[Diagram: a queue of lock requests; thread 1 runs while threads 2 and 3 spin on their own nodes and thread 4 is arriving; when thread 1 leaves, thread 2 runs.]

Use with care:
1. storage overheads
2. complexity
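The lecture gives no code for queue locks; below is a minimal sketch of the classic MCS variant (an assumed implementation, not the lecture's): each thread spins on a flag in its own queue node, so waiters never share a cache line. Barriers beyond what x86 and the GCC builtins provide are omitted.

typedef struct mcs_node {
    struct mcs_node *volatile next;
    volatile int locked;
} mcs_node_t;

typedef mcs_node_t *volatile mcs_lock_t;

void mcs_acquire(mcs_lock_t *lock, mcs_node_t *me) {
    me->next = NULL;
    me->locked = 1;
    /* Atomically append ourselves at the tail of the queue. */
    mcs_node_t *prev = __sync_lock_test_and_set(lock, me);
    if (prev != NULL) {
        prev->next = me;            /* link behind the previous waiter */
        while (me->locked) {}       /* spin on our own cache line */
    }
}

void mcs_release(mcs_lock_t *lock, mcs_node_t *me) {
    if (me->next == NULL) {
        /* No visible successor: try to swing the tail back to empty. */
        if (__sync_bool_compare_and_swap(lock, me, NULL))
            return;
        while (me->next == NULL) {} /* a successor is still linking in */
    }
    me->next->locked = 0;           /* hand the lock to the next waiter */
}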

Page 65:

Performance comparison

[Bar chart: ops/second (thousands, 0-2000) for all lock variants, now including the queue lock.]

Page 66:

To summarize on locks

1. Reading before trying to write
2. Pausing when it's not our turn
3. Ensuring fairness (does not always improve performance)
4. Accessing disjoint addresses (cache lines)

More than 10x performance gain!

Page 67:

Conclusion

- Concurrent algorithm design:
  - theoretical design
  - practical design (may be just as important)
  - implementation
- You need to know your hardware:
  - for correctness
  - for performance