Computer Laboratory

Practical non-blocking data structures

Tim Harris

Overview

Introduction
    Lock-free data structures
    Correctness requirements

Linked lists using CAS

Multi-word CAS

Conclusions

Introduction

class Counter {
    int next = 0;

    int getNumber () {
        int t;
        t = next;
        next = t + 1;
        return t;
    }
}

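What can go wrong here?

Starting from next = 0, two threads can interleave as follows:

Thread 1: getNumber() reads t = 0
Thread 2: getNumber() reads t = 0
Thread 1: writes next = 1, result = 0
Thread 2: writes next = 1, result = 0

Both calls return 0, so one update is lost.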

Introduction (2)


class Counter {
    int next = 0;

    synchronized int getNumber () {
        int t;
        t = next;
        next = t + 1;
        return t;
    }
}

What about now? Starting from next = 0:

Thread 1: getNumber() acquires the lock, reads t = 0, sets next = 1, result = 0, lock released
Thread 2: getNumber() lock acquired, sets next = 2, result = 1

Introduction (3)


class Counter {
    int next = 0;

    synchronized int getNumber () {
        int t;
        t = next;
        next = t + 1;
        return t;
    }
}

Now the problem is liveness. Thread 1 and Thread 2 both call getNumber(), and one must wait for the other to release the lock.

Priority inversion: thread 1 is low priority and thread 2 is high priority, but some other thread 3 (of medium priority) prevents thread 1 from making any progress

Sharing: suppose that these operations may be invoked both in ordinary code and in interrupt handlers…

Failure: what if thread 1 fails while holding the lock? The lock is still held and the state may be inconsistent

Introduction (4)


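In this case a non-blocking design is easy:

class Counter {
    int next = 0;

    int getNumber () {
        int t;
        do {
            t = next;
        } while (CAS (&next, t, t + 1) != t);
        return t;
    }
}

CAS (location, expected value, new value) is an atomic compare and swap. As used here it returns the value it found at the location, so the loop retries until that value equals t.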
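The same retry loop can be sketched in Java using java.util.concurrent.atomic.AtomicInteger. The class name CasCounter is made up for this example. Note that compareAndSet returns a boolean rather than the value it found, so the loop condition is written differently from the slide.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical name; a minimal sketch of the CAS-based counter.
class CasCounter {
    private final AtomicInteger next = new AtomicInteger(0);

    int getNumber() {
        int t;
        do {
            t = next.get();                        // read the current value
        } while (!next.compareAndSet(t, t + 1));   // retry if another thread got in first
        return t;
    }
}
```

No thread ever holds a lock here: a failed compareAndSet means some other thread's update succeeded, so the system as a whole still makes progress.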

Correctness

Safety: we usually want a ‘linearizable’ implementation (Herlihy 1990)
    The data structure is only accessed through a well-defined interface
    Operations on the data structure appear to occur atomically at some point between invocation and response

Liveness: usually one of two requirements
    A ‘wait free’ implementation guarantees per-thread progress
    A ‘non-blocking’ implementation guarantees only system-wide progress

Overview

Introduction

Linked lists using CAS
    Basic list operations
    Alternative implementations
    Extensions

Multi-word CAS

Conclusions

Lists using CAS

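Insert 20:

[Figure: the list H → 10 → 30 → T. A new node 20 is created with its next pointer set to 30, then a single CAS changes the next pointer of 10 from 30 to 20.]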

Lists using CAS (2)

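Insert 20 (while another thread inserts 25):

[Figure: the list H → 10 → 30 → T, with new nodes 20 and 25 both pointing to 30. Two CASes are attempted on the next pointer of 10, one from 30 to 20 and one from 30 to 25. Only one of them can succeed; the other thread must retry.]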
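The insertion step can be sketched in Java as follows. This is an illustration only: the class InsertOnlyList, its sentinel nodes and the restriction to int keys are assumptions, and deletion is deliberately left out.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of a sorted list supporting only CAS-based insertion.
// Keys must lie strictly between Integer.MIN_VALUE and Integer.MAX_VALUE,
// which are used for the head (H) and tail (T) sentinels.
class InsertOnlyList {
    static final class Node {
        final int key;
        final AtomicReference<Node> next;
        Node(int key, Node next) {
            this.key = key;
            this.next = new AtomicReference<>(next);
        }
    }

    private final Node tail = new Node(Integer.MAX_VALUE, null);
    private final Node head = new Node(Integer.MIN_VALUE, tail);

    boolean insert(int key) {
        while (true) {
            // Find adjacent nodes prev and curr with prev.key < key <= curr.key.
            Node prev = head;
            Node curr = prev.next.get();
            while (curr.key < key) {
                prev = curr;
                curr = curr.next.get();
            }
            if (curr.key == key) return false;    // already present
            Node node = new Node(key, curr);      // new node points at its successor
            // Single CAS on the predecessor: succeeds only if prev still points at curr.
            if (prev.next.compareAndSet(curr, node)) return true;
            // Another insert changed prev.next first: search again and retry.
        }
    }

    boolean contains(int key) {
        Node curr = head.next.get();
        while (curr.key < key) curr = curr.next.get();
        return curr.key == key;
    }
}
```

With two threads inserting 20 and 25 after node 10, both prepare a node pointing at 30, but only one compareAndSet on the next pointer of 10 succeeds. The loser searches again and finds its new position.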

Lists using CAS (3)

Delete 10:

[Figure: the list H → 10 → 30 → T. A single CAS changes the next pointer of H from 10 to 30, unlinking node 10.]

Lists using CAS (4)

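Delete 10 & insert 20:

[Figure: the list H → 10 → 30 → T. One thread deletes 10 with a CAS on the next pointer of H (10 to 30) while another inserts 20 with a CAS on the next pointer of 10 (30 to 20). Both CASes can succeed, leaving 20 reachable only from the unlinked node 10, so the insert is lost.]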

Logical vs physical deletion

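Use a ‘spare’ bit to indicate logically deleted nodes:

[Figure: the list H → 10 → 30 → T. To delete 10, a first CAS marks its next pointer (30 to 30X), deleting the node logically. A second CAS on the next pointer of H (10 to 30) then unlinks it physically. A concurrent insert of 20 after 10 uses a CAS that expects the unmarked pointer (30 to 20), so it fails once the mark is set.]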
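A minimal Java sketch of mark-then-unlink deletion, assuming java.util.concurrent.atomic.AtomicMarkableReference as a stand-in for the spare bit in the pointer. The class MarkedList, its sentinels and the helper search are illustrative names, not taken from the slides.

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

// Hypothetical sketch: the mark on node.next plays the role of the 'spare' bit.
// Keys must lie strictly between the two sentinel keys.
class MarkedList {
    static final class Node {
        final int key;
        final AtomicMarkableReference<Node> next;
        Node(int key, Node next) {
            this.key = key;
            this.next = new AtomicMarkableReference<>(next, false);
        }
    }

    private final Node tail = new Node(Integer.MAX_VALUE, null);
    private final Node head = new Node(Integer.MIN_VALUE, tail);

    // Returns {prev, curr} with prev.key < key <= curr.key,
    // physically unlinking any marked nodes met on the way.
    private Node[] search(int key) {
        retry:
        while (true) {
            Node prev = head;
            Node curr = prev.next.getReference();
            while (true) {
                boolean[] marked = {false};
                Node succ = curr.next.get(marked);
                while (marked[0]) {
                    // curr is logically deleted: try to unlink it, restart on failure.
                    if (!prev.next.compareAndSet(curr, succ, false, false)) continue retry;
                    curr = succ;
                    succ = curr.next.get(marked);
                }
                if (curr.key >= key) return new Node[] {prev, curr};
                prev = curr;
                curr = succ;
            }
        }
    }

    boolean insert(int key) {
        while (true) {
            Node[] w = search(key);
            if (w[1].key == key) return false;
            Node node = new Node(key, w[1]);
            // Fails if w[0] has been marked or no longer points at w[1].
            if (w[0].next.compareAndSet(w[1], node, false, false)) return true;
        }
    }

    boolean delete(int key) {
        while (true) {
            Node[] w = search(key);
            if (w[1].key != key) return false;
            Node succ = w[1].next.getReference();
            // Logical deletion: set the mark without changing the pointer.
            if (!w[1].next.compareAndSet(succ, succ, false, true)) continue;
            // Physical deletion is best effort; a later search finishes it if this fails.
            w[0].next.compareAndSet(w[1], succ, false, false);
            return true;
        }
    }

    boolean contains(int key) {
        Node curr = head.next.getReference();
        while (curr.key < key) curr = curr.next.getReference();
        return curr.key == key && !curr.next.isMarked();
    }
}
```

Because an insert after node 10 expects an unmarked next pointer, it can no longer succeed once 10 has been logically deleted, which closes the lost-insert race between a delete and a concurrent insert.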

Implementation problems

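Also need to consider visibility of updates

[Figure: inserting 20 between 10 and 30 in the list H → 10 → 30 → T. A write barrier is needed between initialising the new node and the CAS (30 to 20) that makes it reachable.]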

Implementation problems (2)

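…and the ordering of reads too

[Figure: a reader traverses the list H → 10 → 30 → T while node 20 is being inserted after 10.]

while (val < seek) {
    p = p->next;
    val = p->val;
}

val = ??? (unless the two reads are ordered, the value read from p->val is not guaranteed to be the one written before the node was linked in)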

Overview

Introduction

Linked lists using CAS

Multi-word CAS
    Design
    Results

Conclusions

Multi-word CAS

Atomic read-modify-write to a set of locations

A useful building block:
    Many existing designs (queues, stacks, etc) use CAS2 directly (e.g. Detlefs ’00)
    More generally it can be used to move a structure between consistent states

We’d like it to be non-blocking, disjoint-access parallel, linearizable, and efficient with natural data

Previous work

Lots of designs…

Design          Parallel   Requires        Reserved bits
Anderson ’95    Yes        Strong LL/SC    p(w+l)+l, where l = log2 p + log2 a
I+R ’95         Yes        CAS             p + log2 p
Herlihy ’93     No         CAS             0
(unlabelled)    Yes        CAS             0 or 2
Moir ’97        Yes        Strong LL/SC    log2 p + log2 n
I+R ’95         Yes        Strong LL/SC    log2 p

p processors, word size w, max n locations, max a addresses

…none of them practicable

Design

[Figure: memory holding the list H → 10 → 20 → T. Each node occupies two words, a value and a next pointer: H at 0x100/0x104, 10 at 0x108/0x10C, 20 at 0x110/0x114, T at 0x118/0x11C.]

Descriptor:
    status=UNDECIDED
    locations=2
    a1=0x10C  o1=0x110  n1=0x118
    a2=0x114  o2=0x118  n2=<null>

Build descriptor

Acquire locations:
    DCSS (&status, UNDECIDED, 0x10C, 0x110, &descriptor)
    DCSS (&status, UNDECIDED, 0x114, 0x118, &descriptor)

Decide outcome:
    CAS (&status, UNDECIDED, SUCCEEDED)
    status=SUCCEEDED

Release locations:
    CAS (0x10C, &descriptor, 0x118)
    CAS (0x114, &descriptor, null)

Reading

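[Figure: memory holding the list H → 10 → 20 → T (H at 0x100/0x104, 10 at 0x108/0x10C, 20 at 0x110/0x114, T at 0x118/0x11C), together with a descriptor.]

Descriptor:
    status=UNDECIDED
    locations=2
    a1=0x10C  o1=0x110  n1=0x118
    a2=0x114  o2=0x118  n2=<null>

word_t read (addr_t a) {
    word_t val = *a;
    if (!isDescriptor(val))
        return val;
    else {
        /* val is a pointer to a descriptor */
        status == SUCCEEDED => return new value;
        otherwise return old value;
    }
}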
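The read routine above can be sketched in Java. The types Entry and Descriptor, and the use of AtomicReference<Object> for a memory word, are assumptions made for this example. Java's instanceof stands in for the reserved bits that identify a descriptor pointer, and the sketch only deals with descriptors of the multi-word CAS itself.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of reading a word that may currently hold a descriptor.
class DescriptorRead {
    enum Status { UNDECIDED, SUCCEEDED, FAILED }

    static final class Entry {
        final AtomicReference<Object> addr;
        final Object oldVal, newVal;
        Entry(AtomicReference<Object> addr, Object oldVal, Object newVal) {
            this.addr = addr; this.oldVal = oldVal; this.newVal = newVal;
        }
    }

    static final class Descriptor {
        volatile Status status = Status.UNDECIDED;
        final List<Entry> entries;
        Descriptor(List<Entry> entries) { this.entries = entries; }
    }

    static Object read(AtomicReference<Object> a) {
        Object val = a.get();
        if (!(val instanceof Descriptor)) return val;   // ordinary value
        Descriptor d = (Descriptor) val;
        boolean succeeded = d.status == Status.SUCCEEDED;
        for (Entry e : d.entries) {
            // The logical contents depend on whether the operation has succeeded.
            if (e.addr == a) return succeeded ? e.newVal : e.oldVal;
        }
        throw new IllegalStateException("descriptor does not cover this location");
    }
}
```

While the status is UNDECIDED or FAILED the location logically still holds its old value, so a reader never has to wait for the owner of the descriptor.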

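Whither DCSS?

Now we need DCSS from CAS:
    Easier than full CAS2: the locations used for ‘control’ and ‘update’ addresses must not overlap, only the ‘update’ address may be changed, and we don’t need the result

[Figure: node 10 in memory, with its value at 0x108 and its next pointer at 0x10C, together with a DCSS descriptor.]

DCSS descriptor:
    ac=0x200  oc=0
    au=0x10C  ou=0x110  nu=0x200

In this example the control address 0x200 is the status word, the expected control value 0 stands for UNDECIDED, and the new value 0x200 is the pointer to the descriptor.

DCSS (&status, UNDECIDED, 0x10C, 0x110, &descriptor):

    CAS (0x10C, 0x110, &DCSSDescriptor)

    if (*0x200 == 0)
        CAS (0x10C, &DCSSDescriptor, 0x200)
    else
        CAS (0x10C, &DCSSDescriptor, 0x110);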
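The acquire, decide and release phases, together with DCSS built from CAS, can be pieced together into a rough Java sketch. Everything here is an assumption layered on the slides: the class and method names are invented, memory words are modelled as AtomicReference<Object>, values are compared by reference identity, and instanceof replaces the reserved bits. It needs Java 9 or later for AtomicReference.compareAndExchange, and it has not been tuned or stress-tested.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of CASN built from DCSS, itself built from single-word CAS.
// Callers should list entries in one agreed order so that helping cannot cycle.
class CasnSketch {
    enum Status { UNDECIDED, SUCCEEDED, FAILED }

    static final class Entry {                    // (address, old value, new value)
        final AtomicReference<Object> addr;
        final Object oldVal, newVal;
        Entry(AtomicReference<Object> addr, Object oldVal, Object newVal) {
            this.addr = addr; this.oldVal = oldVal; this.newVal = newVal;
        }
    }

    static final class Descriptor {               // CASN descriptor
        final AtomicReference<Status> status = new AtomicReference<>(Status.UNDECIDED);
        final Entry[] entries;
        Descriptor(Entry... entries) { this.entries = entries; }
    }

    static final class Dcss {                     // DCSS descriptor: control = cd.status
        final Descriptor cd; final Entry e;
        Dcss(Descriptor cd, Entry e) { this.cd = cd; this.e = e; }
    }

    // Second step of DCSS; any thread that finds d in memory may run it.
    static void completeDcss(Dcss d) {
        Object next = (d.cd.status.get() == Status.UNDECIDED) ? d.cd : d.e.oldVal;
        d.e.addr.compareAndSet(d, next);
    }

    // Tries to install cd at e.addr while cd is UNDECIDED; returns the value seen there.
    static Object dcss(Descriptor cd, Entry e) {
        Dcss d = new Dcss(cd, e);
        while (true) {
            Object seen = e.addr.compareAndExchange(e.oldVal, d);
            if (seen instanceof Dcss) { completeDcss((Dcss) seen); continue; }
            if (seen == e.oldVal) completeDcss(d);    // our descriptor went in
            return seen;
        }
    }

    static boolean casn(Descriptor cd) {
        if (cd.status.get() == Status.UNDECIDED) {
            Status outcome = Status.SUCCEEDED;
            acquire:
            for (Entry e : cd.entries) {              // acquire locations
                while (true) {
                    Object seen = dcss(cd, e);
                    if (seen == cd) break;            // a helper already acquired it
                    if (seen instanceof Descriptor) { // owned by another CASN: help it
                        casn((Descriptor) seen);
                        continue;
                    }
                    if (seen != e.oldVal) { outcome = Status.FAILED; break acquire; }
                    break;
                }
            }
            cd.status.compareAndSet(Status.UNDECIDED, outcome);   // decide outcome
        }
        boolean ok = cd.status.get() == Status.SUCCEEDED;
        for (Entry e : cd.entries) {                  // release locations
            e.addr.compareAndSet(cd, ok ? e.newVal : e.oldVal);
        }
        return ok;
    }

    // Helping read: finishes whatever operation owns the word, then re-reads it.
    static Object read(AtomicReference<Object> addr) {
        while (true) {
            Object v = addr.get();
            if (v instanceof Dcss) { completeDcss((Dcss) v); continue; }
            if (v instanceof Descriptor) { casn((Descriptor) v); continue; }
            return v;
        }
    }
}
```

The DCSS step is what stops a slow helper from acquiring a location after the outcome has been decided: the descriptor pointer is only left in place if the status word is still UNDECIDED. Unlike the read routine shown earlier in the deck, the read here helps the owning operation to finish rather than looking up the old or new value in the descriptor.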

Evaluation: method

Attempt to permute elements in a vector. Can control:
    Level of concurrency
    Length of the vector
    Number of elements being permuted
    Padding between elements
    Management of descriptors

[Figure: a vector of numbered elements.]

Evaluation: small systems

gargantubrain.cl: 4-processor IA-64 (Itanium)
Vector=1024, Width=2-64, No padding
s per successful update

                  CASn width (words permuted per update)
Algorithm used    2      4      8      16     32     64
HF                1.6    2.8    6.0    17     71     280
HF-RC             1.5    2.6    5.6    16     68     270
IR                3.4    4.4    7.9    19     76     300
MCS               5.6    8.2    13     24     46     92
MCS-FG            1.4    2.8    6.0    14     42     130

Evaluation: large systems

[Figure: ms per successful update (y-axis, 0 to 350) against number of processors (x-axis, 1 to 32) for HF-RC, IR and MCS.]

hodgkin.hpcf: 64-processor Origin-2000, MIPS R12000
Vector=1024, Width=2
One element per cache line

Overview

Introduction

Linked lists using CAS

Multi-word CAS

Conclusions

Conclusions

Some general techniques

The descriptor pointers serve two purposes:
    They allow ‘helpers’ to find out the information needed to complete their work
    They indicate ownership of locations

Correctness seems clearest when thinking about the state of the shared memory, not the state of individual threads

Unlike previous work we need only a small and constant number of reserved bits (e.g. 2 to identify descriptor pointers if there’s no type information available at run time)

Conclusions (2)

Our scheme is the first practical one:
    Can operate on general pointer-based data structures
    Competitive with lock-based schemes
    Can operate on highly parallel systems
    Disjoint-access parallel, non-blocking, linearizable

http://www.cl.cam.ac.uk/~tlh20/papers/hfp-casn-submitted.pdf
