Practical non-blocking data structures
Tim Harris
Computer Laboratory
Overview
Introduction
  Lock-free data structures
  Correctness requirements
Linked lists using CAS
Multi-word CAS
Conclusions
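Introduction
class Counter {
    int next = 0;
    int getNumber () {
        int t;
        t = next;
        next = t + 1;
        return t;
    }
}
What can go wrong here? Starting from next = 0, two calls can interleave:
Thread 1: getNumber() reads t = 0
Thread 2: getNumber() reads t = 0
Thread 1: writes next = 1, result=0
Thread 2: writes next = 1, result=0
Both callers are handed the same number.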
Introduction (2)
class Counter {
    int next = 0;
    synchronized int getNumber () {
        int t;
        t = next;
        next = t + 1;
        return t;
    }
}
What about now? Starting from next = 0:
Thread 1: getNumber() acquires the lock, reads t = 0, writes next = 1, result=0, lock released
Thread 2: getNumber() waits, lock acquired, writes next = 2, result=1
Introduction (3)
class Counter {
    int next = 0;
    synchronized int getNumber () {
        int t;
        t = next;
        next = t + 1;
        return t;
    }
}
Now the problem is liveness. Thread 1 and thread 2 both call getNumber():
Priority inversion: 1 is low priority and holds the lock, 2 is high priority and waits for it, but some other thread 3 (of medium priority) prevents 1 making any progress
Sharing: suppose that these operations may be invoked both in ordinary code and in interrupt handlers…
Failure: what if thread 1 fails while holding the lock? The lock’s still held and the state may be inconsistent
Introduction (4)
In this case a non-blocking design is easy:
class Counter {
    int next = 0;
    int getNumber () {
        int t;
        do {
            t = next;
        } while (CAS (&next, t, t + 1) != t);
        return t;
    }
}
CAS is an atomic compare and swap. Its arguments are the location, the expected value and the new value. It returns the value it found at the location, so the loop retries until that value equals t.
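The CAS loop on this slide can be sketched in runnable Java. This is only an illustration: it assumes `java.util.concurrent.atomic.AtomicInteger` as a stand-in for the slide's `CAS` primitive, and the class name `CasCounter` is hypothetical. `compareAndSet` reports success as a boolean instead of returning the value it found, so the loop condition is written as "retry while the CAS failed".

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical name; a sketch of the slide's counter, not the author's code.
class CasCounter {
    private final AtomicInteger next = new AtomicInteger(0);

    int getNumber() {
        int t;
        do {
            t = next.get();                        // read the current value
        } while (!next.compareAndSet(t, t + 1));   // retry if another thread got in first
        return t;
    }
}
```

No thread ever holds a lock here, so a stalled or failed thread cannot stop others from obtaining numbers.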
Correctness
Safety: we usually want a ‘linearizable’ implementation (Herlihy 1990)
  The data structure is only accessed through a well-defined interface
  Operations on the data structure appear to occur atomically at some point between invocation and response
Liveness: usually one of two requirements
  A ‘wait free’ implementation guarantees per-thread progress
  A ‘non-blocking’ implementation guarantees only system-wide progress
Overview
Introduction
Linked lists using CAS
  Basic list operations
  Alternative implementations
  Extensions
Multi-word CAS
Conclusions
Lists using CAS
Insert 20:
[Figure: list H -> 10 -> 30 -> T. The new node 20 is created pointing at 30, then a CAS on 10's next field swings it from 30 to 20.]
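Lists using CAS (2)
Insert 20:
[Figure: list H -> 10 -> 30 -> T. Concurrent inserts of 20 and 25 both attempt a CAS on 10's next field, one from 30 to 20 and the other from 30 to 25. Only one CAS can succeed; the other insert must retry.]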
Lists using CAS (3)
Delete 10:
[Figure: list H -> 10 -> 30 -> T. A CAS on H's next field swings it from 10 to 30.]
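Lists using CAS (4)
Delete 10 & insert 20:
[Figure: list H -> 10 -> 30 -> T. The delete CASes H's next field from 10 to 30 while the insert CASes 10's next field from 30 to 20. Both CASes succeed, but 20 hangs off the removed node 10 and is lost.]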
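Logical vs physical deletion
Use a ‘spare’ bit to indicate logically deleted nodes:
[Figure: list H -> 10 -> 30 -> T. Deleting 10 first CASes 10's next field from 30 to 30X (30 with the spare bit set), then CASes H's next field from 10 to 30. A concurrent insert of 20 that tries to CAS 10's next field from 30 to 20 fails once the bit is set.]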
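The two-step deletion above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the author's implementation: `java.util.concurrent.atomic.AtomicMarkableReference` stands in for a pointer with a spare bit, the list holds distinct ints strictly between `Integer.MIN_VALUE` and `Integer.MAX_VALUE`, and the names `MarkedList`, `Node` and `search` are hypothetical.

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

// Hypothetical sketch of a sorted lock-free list with logical deletion marks.
class MarkedList {
    static final class Node {
        final int val;
        final AtomicMarkableReference<Node> next;   // mark == "this node is deleted"
        Node(int val, Node succ) {
            this.val = val;
            this.next = new AtomicMarkableReference<>(succ, false);
        }
    }

    private final Node tail = new Node(Integer.MAX_VALUE, null);
    private final Node head = new Node(Integer.MIN_VALUE, tail);

    // Returns {pred, curr} with pred.val < key <= curr.val,
    // physically unlinking any marked nodes met on the way.
    private Node[] search(int key) {
        retry:
        while (true) {
            Node pred = head;
            Node curr = pred.next.getReference();
            while (true) {
                boolean[] marked = {false};
                Node succ = curr.next.get(marked);
                while (marked[0]) {
                    // curr is logically deleted: try to unlink it
                    if (!pred.next.compareAndSet(curr, succ, false, false)) continue retry;
                    curr = succ;
                    succ = curr.next.get(marked);
                }
                if (curr.val >= key) return new Node[] {pred, curr};
                pred = curr;
                curr = succ;
            }
        }
    }

    boolean insert(int key) {
        while (true) {
            Node[] w = search(key);
            Node pred = w[0], curr = w[1];
            if (curr.val == key) return false;
            Node node = new Node(key, curr);
            // Fails if pred was marked or its next field changed; then retry.
            if (pred.next.compareAndSet(curr, node, false, false)) return true;
        }
    }

    boolean delete(int key) {
        while (true) {
            Node[] w = search(key);
            Node pred = w[0], curr = w[1];
            if (curr.val != key) return false;
            Node succ = curr.next.getReference();
            // Logical deletion: set the mark on curr's next field.
            if (!curr.next.compareAndSet(succ, succ, false, true)) continue;
            // Physical deletion: best effort; search() cleans up if this fails.
            pred.next.compareAndSet(curr, succ, false, false);
            return true;
        }
    }

    boolean contains(int key) {
        Node curr = head;
        while (curr.val < key) curr = curr.next.getReference();
        return curr.val == key && !curr.next.isMarked();
    }
}
```

Because the mark and the successor pointer change in one CAS, an insert after a node that is being deleted sees its own CAS fail and retries from a fresh search.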
Implementation problems
Also need to consider visibility of updates
[Figure: inserting 20 into H -> 10 -> 30 -> T. A write barrier separates initialising node 20 from the CAS (30 to 20) that links it into the list.]
Implementation problems (2)
…and the ordering of reads too
[Figure: list H -> 10 -> 30 -> T with node 20 being inserted after 10.]
while (val < seek) {
    p = p->next;
    val = p->val;
}
Without ordering between the two reads, val = ???
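One way to see both ordering requirements together is a small insert-only list in Java. This is a sketch under assumptions rather than the slides' code: the names `OrderedList` and `Node` are hypothetical, and the barriers come from the Java memory model (`AtomicReference.get` and `compareAndSet` have the memory effects of volatile reads and writes, and `final` fields are visible once the object is reachable) instead of explicit barrier instructions.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: a sorted, insert-only list of distinct ints.
class OrderedList {
    static final class Node {
        final int val;                      // set once, before the node is published
        final AtomicReference<Node> next;   // read and written with volatile semantics
        Node(int val, Node succ) {
            this.val = val;
            this.next = new AtomicReference<>(succ);
        }
    }

    private final Node tail = new Node(Integer.MAX_VALUE, null);
    private final Node head = new Node(Integer.MIN_VALUE, tail);

    boolean insert(int seek) {
        while (true) {
            Node p = head;
            Node q = p.next.get();
            while (q.val < seek) {
                p = q;
                q = q.next.get();
            }
            if (q.val == seek) return false;
            Node n = new Node(seek, q);                   // initialise the node fully...
            if (p.next.compareAndSet(q, n)) return true;  // ...then publish it with the CAS
        }
    }

    // Same shape as the slide's loop: follow next, then read val.
    boolean contains(int seek) {
        Node p = head;
        int val = p.val;
        while (val < seek) {
            p = p.next.get();   // ordered read of the link
            val = p.val;        // sees the initialised value of the node just reached
        }
        return val == seek;
    }
}
```

In a language with weaker defaults, the same two points would need an explicit write barrier before the CAS and ordered loads in the traversal.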
Overview
Introduction
Linked lists using CAS
Multi-word CAS
  Design
  Results
Conclusions
Multi-word CAS
Atomic read-modify-write to a set of locations
A useful building block:
  Many existing designs (queues, stacks, etc) use CAS2 directly (e.g. Detlefs ’00)
  More generally it can be used to move a structure between consistent states
We’d like it to be non-blocking, disjoint-access parallel, linearizable, and efficient with natural data
Previous work
Lots of designs…
Anderson ’95 Yes Strong LL/SC p(w+l)+l l=log2p+log2a
I+R ’95 Yes CAS p + log2p
Herlihy ’93 No CAS 0
Yes CAS 0 or 2
Moir ’97 Yes Strong LL/SC log2p+log2nI+R ’95 Yes Strong LL/SC log2p
…none of them practicable
p processors, word size w, max n locations, max a addresses
Parallel Requires Reserved bits
Design
[Figure: list H -> 10 -> 20 -> T. The nodes sit at 0x100, 0x108, 0x110 and 0x118, with their next fields at 0x104, 0x10C, 0x114 and 0x11C.]
Descriptor:
  status=UNDECIDED
  locations=2
  a1=0x10C o1=0x110 n1=0x118
  a2=0x114 o2=0x118 n2=<null>
Build descriptor
Acquire locations
  DCSS (&status, UNDECIDED, 0x10C, 0x110, &descriptor)
  DCSS (&status, UNDECIDED, 0x114, 0x118, &descriptor)
Decide outcome
  CAS (&status, UNDECIDED, SUCCEEDED)
  status=SUCCEEDED
Release locations
  CAS (0x10C, &descriptor, 0x118)
  CAS (0x114, &descriptor, null)
Reading
[Figure: list H -> 10 -> 20 -> T. The nodes sit at 0x100, 0x108, 0x110 and 0x118, with their next fields at 0x104, 0x10C, 0x114 and 0x11C.]
Descriptor:
  status=UNDECIDED
  locations=2
  a1=0x10C o1=0x110 n1=0x118
  a2=0x114 o2=0x118 n2=<null>
word_t read (addr_t a) {
    word_t val = *a;
    if (!isDescriptor(val))
        return val;
    else {
        /* val points to a descriptor that owns location a */
        SUCCEEDED => return new value;
        return old value;
    }
}
Whither DCSS?
Now we need DCSS from CAS. This is easier than full CAS2:
  the locations used for ‘control’ and ‘update’ addresses must not overlap
  only the ‘update’ address may be changed
  we don’t need the result
[Figure: node 10 at 0x108 with its next field at 0x10C.]
DCSS descriptor:
  ac=0x200 oc=0
  au=0x10C ou=0x110 nu=0x200
DCSS (&status, UNDECIDED, 0x10C, 0x110, &descriptor):
    CAS (0x10C, 0x110, &DCSSDescriptor);
    if (*0x200 == 0)
        CAS (0x10C, &DCSSDescriptor, 0x200);
    else
        CAS (0x10C, &DCSSDescriptor, 0x110);
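The descriptor protocol (build, acquire with DCSS, decide, release), the descriptor-aware read and the construction of DCSS from CAS can be put together in one Java sketch. It is an illustration under several assumptions, not the author's code: memory words are modelled as `AtomicReference<Object>` cells, `instanceof` replaces reserved descriptor bits, values are compared by reference, callers are assumed to list locations in one consistent global order so that helping cannot cycle, and the names `Casn`, `Dcss`, `run` and `read` are hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; all names are illustrative. Words hold object references
// compared with ==, and user values must never be Casn or Dcss instances.
final class Casn {
    enum Status { UNDECIDED, SUCCEEDED, FAILED }

    // DCSS descriptor: put `owner` into `addr` only while owner.status is UNDECIDED.
    private static final class Dcss {
        final Casn owner;
        final AtomicReference<Object> addr;
        final Object oldV;

        Dcss(Casn owner, AtomicReference<Object> addr, Object oldV) {
            this.owner = owner;
            this.addr = addr;
            this.oldV = oldV;
        }

        // Check the control word, then either install the owner or undo.
        void complete() {
            boolean undecided = owner.status.get() == Status.UNDECIDED;
            addr.compareAndSet(this, undecided ? owner : oldV);
        }
    }

    final AtomicReference<Status> status = new AtomicReference<>(Status.UNDECIDED);
    final List<AtomicReference<Object>> addrs;
    final List<Object> olds;
    final List<Object> news;

    // Build descriptor.
    Casn(List<AtomicReference<Object>> addrs, List<Object> olds, List<Object> news) {
        this.addrs = addrs;
        this.olds = olds;
        this.news = news;
    }

    // DCSS built from CAS; returns the value seen at d.addr.
    private static Object dcss(Dcss d) {
        while (true) {
            Object seen = d.addr.get();
            if (seen instanceof Dcss) {          // help a pending DCSS, then retry
                ((Dcss) seen).complete();
            } else if (seen != d.oldV) {
                return seen;                     // a mismatch, or a CASN descriptor
            } else if (d.addr.compareAndSet(seen, d)) {
                d.complete();
                return seen;
            }
        }
    }

    // Acquire locations, decide outcome, release locations. Helpers call this too.
    boolean run() {
        if (status.get() == Status.UNDECIDED) {
            Status outcome = Status.SUCCEEDED;
            for (int i = 0; i < addrs.size() && outcome == Status.SUCCEEDED; i++) {
                Object seen;
                do {
                    seen = dcss(new Dcss(this, addrs.get(i), olds.get(i)));
                    if (seen instanceof Casn && seen != this) ((Casn) seen).run(); // help
                } while (seen instanceof Casn && seen != this);
                if (seen != this && seen != olds.get(i)) outcome = Status.FAILED;
            }
            status.compareAndSet(Status.UNDECIDED, outcome);
        }
        boolean ok = status.get() == Status.SUCCEEDED;
        for (int i = 0; i < addrs.size(); i++) {
            addrs.get(i).compareAndSet(this, ok ? news.get(i) : olds.get(i));
        }
        return ok;
    }

    // A descriptor found in a word stands for that word's old or new value.
    static Object read(AtomicReference<Object> a) {
        Object v = a.get();
        if (v instanceof Dcss) return ((Dcss) v).oldV;
        if (!(v instanceof Casn)) return v;
        Casn c = (Casn) v;
        int i = c.addrs.indexOf(a);
        return c.status.get() == Status.SUCCEEDED ? c.news.get(i) : c.olds.get(i);
    }
}
```

A caller builds a `Casn` over the cells it wants to update and calls `run()`; any thread that meets the descriptor in a cell can call `run()` on it as a helper, which is what keeps the operation non-blocking. Relying on garbage collection sidesteps the descriptor management that the evaluation treats as a tunable.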
Evaluation: method
Attempt to permute elements in a vector. Can control:
  Level of concurrency
  Length of the vector
  Number of elements being permuted
  Padding between elements
  Management of descriptors
[Figure: example vector of numbered elements]
Evaluation: small systems
gargantubrain.cl: 4-processor IA-64 (Itanium)
Vector=1024, Width=2-64, No padding
µs per successful update, by CASn width (words permuted per update)
Algorithm used  2    4    8    16  32  64
HF              1.6  2.8  6.0  17  71  280
HF-RC           1.5  2.6  5.6  16  68  270
IR              3.4  4.4  7.9  19  76  300
MCS             5.6  8.2  13   24  46  92
MCS-FG          1.4  2.8  6.0  14  42  130
Evaluation: large systems
hodgkin.hpcf: 64-processor Origin-2000, MIPS R12000
Vector=1024, Width=2, One element per cache line
[Figure: ms per successful update (axis 0 to 350) against number of processors (1 to 32) for HF-RC, IR and MCS]
Overview
Introduction
Linked lists using CAS
Multi-word CAS
Conclusions
Conclusions
Some general techniques
The descriptor pointers serve two purposes:
  They allow ‘helpers’ to find out the information needed to complete their work.
  They indicate ownership of locations
Correctness seems clearest when thinking about the state of the shared memory, not the state of individual threads
Unlike previous work we need only a small and constant number of reserved bits (e.g. 2 to identify descriptor pointers if there’s no type information available at run time)
Conclusions (2)
Our scheme is the first practical one:
  Can operate on general pointer-based data structures
  Competitive with lock-based schemes
  Can operate on highly parallel systems
  Disjoint-access parallel, non-blocking, linearizable
http://www.cl.cam.ac.uk/~tlh20/papers/hfp-casn-submitted.pdf