CONCURRENT PROGRAMMING
Introduction to Locks and Lock-free data structures
Agenda
• Concurrency and Mutual Exclusion
• Mutual Exclusion without hardware primitives
• Mutual Exclusion using locks and critical sections
• Lock-based Stack
• Lock freedom
• Reasoning about concurrency: Linearizability
• Disadvantages of lock-based data structures
• A lock-free stack using CAS
• The ABA problem in the stack we just implemented
• Fix
• Other problems with CAS
• We need better hardware primitives
• Transactional memory
www.themegallery.com
Mutual Exclusion
Mutual exclusion aims to avoid the simultaneous use of a common resource, e.g. global variables, databases, etc.
Solutions:
Software: Peterson’s algorithm, Dekker’s algorithm, the Bakery algorithm, etc.
Hardware: atomic test and set, compare and set, LL/SC, etc.
Using the hardware instruction Test and Set
Test and Set (from here on, TS) on a boolean variable flag:
#atomic // the lines below execute one after the other without interruption
    old = flag; flag = true; return old;
#end atomic
bool lock = false; // shared lock variable
// Process i
Init i;
while (true) {
    while (TS(lock)) {}; // entry protocol: spin until TS returns false, i.e. the lock was free and is now ours
    Critical section # i;
    lock = false; // exit protocol
    // Remainder of code;
}
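The entry and exit protocol can be tried out in Java, where AtomicBoolean.getAndSet(true) behaves like a test and set: it stores true and returns the previous value. This is a minimal sketch, not a production lock. The names TasLockDemo, TasLock and run are hypothetical and chosen only for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class TasLockDemo {
    // Hypothetical spin lock built on test and set.
    public static class TasLock {
        private final AtomicBoolean locked = new AtomicBoolean(false);

        public void lock() {
            // getAndSet(true) atomically stores true and returns the old value.
            // An old value of true means another thread holds the lock: keep spinning.
            while (locked.getAndSet(true)) {
                // busy wait
            }
        }

        public void unlock() {
            locked.set(false); // exit protocol
        }
    }

    private static int counter = 0; // shared data protected by the lock

    // Starts `threads` threads that each increment the counter `perThread` times.
    public static int run(int threads, int perThread) {
        counter = 0;
        final TasLock lock = new TasLock();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    lock.lock();
                    try {
                        counter++; // critical section
                    } finally {
                        lock.unlock();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10000));
    }
}
```

If the lock provides mutual exclusion, no increment is lost and run returns threads * perThread.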
Software solution: Peterson’s Algorithm
One of the purely software solutions to the mutual exclusion problem, based on shared memory.
A simple solution for two processes P0 and P1 that would like to share the use of a single resource R.
More rigorously, P1 shouldn’t have access to R while P0 is modifying or reading R, and vice versa.
[Diagram: processes P0 and P1 sharing a single resource R]
Peterson’s Algorithm: Two processor version
Requires one global int variable (turn) and one bool variable (flag) per process.
turn is shared by both processors; each processor has its own signal variable flag.
flag[0] = true is processor P0’s signal that it wants to enter the critical section.
turn = 0 says that it is processor P0’s turn to enter the critical section.
Can be extended to N processors.
How to think about it
Imagine you are in a hallway that is only wide enough for one person to walk.
However, you see a guy walking toward you from the opposite direction.
Once you approach him, you have two options:
Be a gentleman and step to the side so that he may walk first, and continue after he passes (Peterson’s algorithm)
Beat him up and walk over him (critical section violation)
The algorithm in code
// Shared variables
bool flag[2] = {false, false};
int turn = 0;

// Process 0
init;
while (true) {
    // entry protocol
    flag[0] = true;
    turn = 1;
    while (flag[1] && turn == 1) {};
    critical section #0;
    // exit protocol
    flag[0] = false;
    // remainder code
}

// Process 1
init;
while (true) {
    // entry protocol
    flag[1] = true;
    turn = 0;
    while (flag[0] && turn == 0) {};
    critical section #1;
    // exit protocol
    flag[1] = false;
    // remainder code
}
Requirements for Peterson’s
Reads and writes have to be atomic.
No reordering of instructions or memory accesses.
In-order processors sometimes reorder memory accesses even if they don’t reorder instructions. In that case one needs to use memory barrier instructions.
Visibility: any change to a variable has to take immediate effect so that every processor sees it. In Java, the keyword volatile provides this.
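To make these requirements concrete, here is a hedged sketch of the two-thread algorithm in Java. It assumes that volatile semantics are enough to rule out the reordering described above: turn is declared volatile, and the two flags live in an AtomicIntegerArray because the elements of a plain Java array cannot be declared volatile. The names PetersonDemo, PetersonLock and run are hypothetical.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

public class PetersonDemo {
    // Hypothetical two-thread lock; thread ids must be 0 and 1.
    public static class PetersonLock {
        // flag.get(i) == 1 means thread i wants to enter the critical section.
        private final AtomicIntegerArray flag = new AtomicIntegerArray(2);
        private volatile int turn = 0;

        public void lock(int me) {
            int other = 1 - me;
            flag.set(me, 1);   // signal interest
            turn = other;      // give priority to the other thread
            while (flag.get(other) == 1 && turn == other) {
                // busy wait
            }
        }

        public void unlock(int me) {
            flag.set(me, 0);   // exit protocol
        }
    }

    private static int counter = 0; // shared data protected by the lock

    // Runs two threads that each increment the counter `perThread` times.
    public static int run(int perThread) {
        counter = 0;
        final PetersonLock lock = new PetersonLock();
        Thread[] workers = new Thread[2];
        for (int i = 0; i < 2; i++) {
            final int me = i;
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    lock.lock(me);
                    try {
                        counter++; // critical section
                    } finally {
                        lock.unlock(me);
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(50000));
    }
}
```

Removing volatile from turn, or replacing the AtomicIntegerArray with a plain boolean array, would break the visibility and ordering assumptions, and the counter could then lose updates.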
So why don’t people use Peterson’s?
Notice the while loop in the algorithm
If process 0 has to wait a long time to enter the critical section, it continually checks flag and turn to see whether it can enter, while not doing any useful work.
This is termed busy waiting, and locking mechanisms like Peterson’s have a major disadvantage in that regard.
Locks that continuously check a flag in this way are called spin locks.
Spin locks are a good choice when you know that the wait will be short.
while (flag[1] && turn == 1) {};
Properties of Peterson’s algorithm
Mutual exclusion
Absence of livelocks and deadlocks: a livelock is similar to a deadlock, except that the competing processes continually change state while none of them makes any progress.
Eventual entry: guaranteed even if the scheduling policy is only weakly fair. A weakly fair scheduling policy guarantees that if a process requests to enter its critical section (and does not withdraw the request), the process will eventually enter its critical section.
Comparison with Test and Set
Property | Test and Set | Peterson’s algorithm
Mutual exclusion | Yes | Yes
Absence of deadlocks | Yes | Yes
Absence of unnecessary delay | Yes | Yes
Eventual entry | Strongly fair scheduling policy | Weakly fair scheduling policy
Practical issues | Special instructions; easy to implement for any number of processors | Standard instructions; > 2 processes becomes complex but doable
Putting it all together: a lock based Stack
Stack: a list- or array-based data structure that enforces last-in-first-out ordering of elements.
Operations:
void push(T data): pushes the variable data onto the stack
T pop(): removes and returns the last item that was pushed onto the stack. Throws a stackEmptyException if the stack is empty
int size(): returns the size of the stack
All operations are synchronized using one common lock object.
Code: Java
import java.util.ArrayList;
import java.util.concurrent.locks.ReentrantLock;

class Stack<T> {
    ArrayList<T> _container = new ArrayList<T>();
    ReentrantLock _lock = new ReentrantLock();

    public void push(T data) {
        _lock.lock();
        try {
            _container.add(data);
        } finally {
            _lock.unlock(); // always release, even if add() throws
        }
    }

    public int size() {
        _lock.lock();
        try {
            return _container.size();
        } finally {
            _lock.unlock();
        }
    }

    public T pop() throws Exception {
        _lock.lock();
        try {
            if (_container.isEmpty()) {
                throw new Exception("Stack Empty");
            }
            // remove and return the most recently pushed element
            return _container.remove(_container.size() - 1);
        } finally {
            _lock.unlock();
        }
    }
}
Problems with locks
A stack is simple enough: there is only one lock, and the overhead isn’t that much. But there are data structures that could have multiple locks.
Problems with locking:
Deadlock
Priority inversion
Convoying
Kill-tolerant availability
Preemption tolerance
Overall performance
Problems with locking 2
Priority inversion: assume two threads:
T1 with very low priority
T2 with very high priority
Both need to access a shared resource R, but T1 holds the lock to R.
T1 takes longer to complete the operation, leaving the higher priority thread T2 waiting; by extension, T2 has effectively been given a lower priority.
Possible solution: priority inheritance
Problems with Locking 3
Deadlock: processes can’t proceed because each of them is waiting for the other to release a needed resource.
Scenario:
There are two locks, A and B.
Process 1 needs A and B, in that order, to safely execute.
Process 2 needs B and A, in that order, to safely execute.
Process 1 acquires A and Process 2 acquires B.
Now Process 1 is waiting for Process 2 to release B, and Process 2 is waiting for Process 1 to release A.
Problems with Locking 4
Convoying: all the processes need a lock A to proceed; however, a lower priority process acquires A first. Then all the other processes slow down to the speed of the lower priority process.
Think of a freeway: you are driving an Aston Martin, but you are stuck behind a beat-up old pickup truck that is moving very slowly, and there is no way to overtake it.
Problems with Locking 5
Kill tolerance
What happens when a process holding a lock is killed?
Everybody else waiting for the lock may never end up getting it and would wait forever.
‘Async-signal safety’
Signal handlers can’t use lock-based primitives. Why?
Suppose a thread receives a signal while holding a user-level lock in the memory allocator.
The signal handler executes, calls malloc, and wants the lock. The interrupted thread cannot release the lock until the handler returns, so the handler waits forever.
Problems with Locking 6
Overall performance
Arguable: efficient lock-based algorithms exist
Constant struggle between simplicity and efficiency
Example: a thread-safe linked list with lots of nodes
Lock the whole list for every operation?
Reader/writer locks?
Allow locking individual elements of the list?
A Possible solution
Lock-free Programming
Lock-free data structures
A data structure in which no explicit locks are used for synchronization between multiple threads, and the progress of one thread doesn’t block or impede the progress of another.
Doesn’t imply starvation freedom (meaning one thread could potentially wait forever), but nobody starves in practice.
Advantages: you don’t run into all the problems that you would with using locks
Disadvantages: to be discussed later
Lock-free Programming
Think in terms of Algorithms + Data Structures = Programs
Thread-safe access to shared data without the use of locks, mutexes, etc.
Possible, but not practical or feasible in the absence of hardware support
So what do we need?
A compare and set primitive from the hardware guys, abbreviated CAS (to be discussed in the next slide)
Interesting tidbit: lots of music sharing and streaming applications use lock-free data structures, e.g. PortAudio, PortMidi, and SuperCollider
Lock-free Programming
Compare and Set primitive
boolean cas(int *valueToChange, int valueToSetTo, int valueToCompareTo)
Semantics: the pseudocode below executes atomically, without interruption
if (*valueToChange == valueToCompareTo) {
    *valueToChange = valueToSetTo;
    return true;
} else {
    return false;
}
This function is exposed in Java through the java.util.concurrent.atomic package; in C++, depending on the OS and architecture, you will find libraries that provide it.
CAS is all you need for lock-free queues, stacks, linked lists, and sets.
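As a small illustration of these semantics in Java, AtomicInteger.compareAndSet can be used to build a lock-free increment. Note that Java orders the arguments as (expected value, new value), the reverse of the cas(...) pseudocode on this slide. The names CasDemo and incrementWithCas are hypothetical; this is a sketch of the retry pattern, not a replacement for the built-in incrementAndGet.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasDemo {
    // Lock-free increment: read the current value, compute the new one,
    // and publish it with CAS. If another thread changed the value in
    // between, the CAS fails and the loop retries with a fresh read.
    public static int incrementWithCas(AtomicInteger value) {
        while (true) {
            int expected = value.get();
            int updated = expected + 1;
            if (value.compareAndSet(expected, updated)) {
                return updated;
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger v = new AtomicInteger(5);
        // Succeeds because v currently holds the expected value 5.
        System.out.println(v.compareAndSet(5, 6));
        // Fails because v no longer holds 5; v is left unchanged.
        System.out.println(v.compareAndSet(5, 7));
        System.out.println(incrementWithCas(v));
    }
}
```

The same read, compute, CAS, retry loop is the building block behind the lock-free structures mentioned on this slide.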
Trick to building lock-free data structures
Limit the scope of changes to a single atomic variable
Stack: head
Queue: head or tail, depending on enqueue or dequeue
A simple lock-free example
A lock-free stack
Adapted from Geoff Langdale at CMU
Intended to illustrate the design of lock-free data structures and problems with lock-free synchronization
There is a primitive operation we need: CAS, or Compare and Set
Available on most modern machines
x86 assembly: cmpxchg (with the lock prefix)
PowerPC assembly: LL (Load Linked), SC (Store Conditional)
Lock-free Stack with Ints in C
A stack based on a singly linked list. Not particularly good design!
typedef struct NodeEle {
    int data;
    struct NodeEle *next;
} Node;
Node* head; // The head of the list
Now that we have the nodes, let us proceed to the meat of the stack.
Lock-free Stack Push
void push(int t) {
    Node* node = malloc(sizeof(Node));
    node->data = t;
    do {
        node->next = head;
    } while (!cas(&head, node, node->next));
}
Let us see how this works!
Push in Action
Currently Head points to the Node containing data 6
[Diagram: Head -> 6 -> 10]
Push in Action
Two threads, T1 and T2, come along wanting to push 7 and 8 respectively, by calling the push function.
T1: push(7);
T2: push(8);
[Diagram: Head -> 6 -> 10]
Push in Action
Two new node structs will be created on the heap in parallel after the execution of the code shown.
T1:
Node* node = malloc(sizeof(Node));
node->data = 7;
T2:
Node* node = malloc(sizeof(Node));
node->data = 8;
[Diagram: Head -> 6 -> 10]
Push in Action
do {
    node->next = head;
} while (!cas(&head, node, node->next));
The code above means: set the newly created node’s next to head; if head still points to 6, then change the head pointer to point to the new node.
Both threads try to execute this portion of the code. But only one will succeed.
[Diagram: T1 with new node 7 and T2 with new node 8; Head -> 6 -> 10]
Push in Action
Let us assume T1 succeeds; T1 therefore exits the while loop and, consequently, push().
T2’s cas failed. Why? Hint: look at the picture.
T2 has no choice but to try again.
[Diagram: T1 with node 7 and T2 with node 8; stack nodes 6 and 10; Head]
do {
    node->next = head;
} while (!cas(&head, node, node->next));
Push in Action
Assume T2 succeeds this time because no one else is trying to push.
[Diagram: Head -> 8 -> 7 -> 6 -> 10]
Pop()
bool pop(int& t) {
    Node* current = head;
    while (current) {
        if (cas(&head, current->next, current)) {
            t = current->data; // problem?
            return true;
        }
        current = head;
    }
    return false;
}
There is something wrong with this code. It is very subtle. Can you figure it out? Most of the time this piece of code will work.
It is called the ABA problem
While a thread tries to modify A, what happens if A gets changed to B and then back to A?
malloc recycles addresses. It has to, eventually.
Now imagine this scenario. Curly braces contain the address of each node.
[Diagram: Head -> 6 {0x90} -> 10 {0x89}]
ABA problem illustration Step 1
Assume two threads T1 and T2.
T1 calls pop() to delete the node at 0x90. It has already read current (0x90) and current->next (0x89), but before it has a chance to execute the CAS, there is a context switch and T1 goes to sleep.
[Diagram: Head -> 6 {0x90} -> 10 {0x89}]
bool pop(int& t) {
    Node* current = head;
    while (current) {
        if (cas(&head, current->next, current)) {
            t = current->data; // problem?
            free(current);     // release the popped node (it came from malloc)
            return true;
        }
        current = head;
    }
    return false;
}
ABA problem illustration Step 2
The following happens while T1 is asleep
[Diagram: nodes 10 {0x89} and 6 {0x90}; Head]
ABA problem illustration Step 3
The following happens while T1 is asleep
T2 calls Pop(), Node at 0x90 is deleted
[Diagram: nodes 10 {0x89} and 6 {0x90}; Head]
ABA problem illustration Step 4
The following happens while T1 is asleep
T2 calls Pop(), Node at 0x90 is deleted
T2 calls Pop(), Node at 0x89 is deleted
[Diagram: nodes 10 {0x89} and 6 {0x90}; Head]
ABA problem illustration Step 5
The following happens while T1 is asleep
T2 calls Pop(), Node at 0x90 is deleted
T2 calls Pop(), Node at 0x89 is deleted
T2 calls push(11) but malloc has recycled the memory 0x90 while allocating space for the new Node
[Diagram: nodes 10 {0x89} and 11 {0x90}; Head]
ABA problem illustration Step 6
The following happens while T1 is asleep
T2 calls Pop(), Node at 0x90 is deleted
T2 calls Pop(), Node at 0x89 is deleted
T2 calls push(11) but malloc has recycled the memory 0x90 while allocating space for the new Node
T1 now wakes up and the CAS operation succeeds
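[Diagram: nodes 10 {0x89} and 11 {0x90}; Head]
Head is now pointing to illegal memory! T1’s CAS found 0x90 at the head, as expected, and installed its stale next pointer 0x89, which has already been freed.
Replace 10 and 6 with B and A.
Now you know where the name (ABA) comes from.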
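For comparison, the same push and pop retry loops can be sketched in Java with AtomicReference. This is only a sketch under assumptions: the names LockFreeStackDemo, LockFreeStack and run are hypothetical, and pop returns null on an empty stack instead of a bool. Because Java is garbage collected, a node's memory cannot be recycled while some thread still holds a reference to it, so the address-reuse scenario illustrated in the steps above does not arise for freshly allocated nodes in this version.

```java
import java.util.concurrent.atomic.AtomicReference;

public class LockFreeStackDemo {
    public static class LockFreeStack {
        private static final class Node {
            final int data;
            Node next;
            Node(int data) { this.data = data; }
        }

        // The single atomic variable that every operation changes.
        private final AtomicReference<Node> head = new AtomicReference<>();

        public void push(int t) {
            Node node = new Node(t);
            Node oldHead;
            do {
                oldHead = head.get();
                node.next = oldHead;
                // Succeeds only if head is still oldHead; otherwise retry.
            } while (!head.compareAndSet(oldHead, node));
        }

        // Returns the popped value, or null if the stack is empty.
        public Integer pop() {
            Node current;
            do {
                current = head.get();
                if (current == null) {
                    return null;
                }
            } while (!head.compareAndSet(current, current.next));
            return current.data;
        }
    }

    // Pushes from several threads, then counts how many elements can be popped.
    public static int run(int threads, int perThread) {
        final LockFreeStack stack = new LockFreeStack();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    stack.push(j);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        int popped = 0;
        while (stack.pop() != null) {
            popped++;
        }
        return popped;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 1000));
    }
}
```

If no push is lost to a race, run returns threads * perThread.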
Solutions:
Double-word compare and set:
One 32-bit word for the address
One 32-bit word for the update count, which is incremented every time a node is updated
Compare and set iff both of the above match
Java provides AtomicStampedReference
Use the lower address bits of the pointer (if the memory is 4- or 8-byte aligned) to keep an update counter
But the probability of a false positive is still greater than with double-word compare and set, because instead of 2^32 choices for the counter you have only 2^2 or 2^3
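The update-count idea can be seen in a single-threaded replay of the A, B, A sequence using Java's AtomicStampedReference. The scenario is simulated in one thread purely for illustration, and the names StampedDemo and staleCasSucceeds are hypothetical.

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class StampedDemo {
    // Replays A -> B -> A and reports whether a CAS that carries the
    // original (stale) stamp still succeeds.
    public static boolean staleCasSucceeds() {
        String a = "A";
        String b = "B";
        AtomicStampedReference<String> ref = new AtomicStampedReference<>(a, 0);

        // "T1" reads the reference and its stamp, then is put to sleep.
        int[] stampHolder = new int[1];
        String seen = ref.get(stampHolder);
        int seenStamp = stampHolder[0];

        // Meanwhile "T2" changes A to B and back to A, bumping the stamp each time.
        ref.compareAndSet(a, b, 0, 1);
        ref.compareAndSet(b, a, 1, 2);

        // "T1" wakes up. The reference is A again, but the stamp is 2, not 0,
        // so the compare fails and the stale update is rejected.
        return ref.compareAndSet(seen, "C", seenStamp, seenStamp + 1);
    }

    public static void main(String[] args) {
        System.out.println(staleCasSucceeds());
    }
}
```

With a plain reference CAS the last step would succeed, which is exactly the ABA hazard; the stamp plays the role of the second word in the double-word scheme.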
Disadvantages of lock-free data structures
Current hardware limits the number of bits available in a CAS operation to 32 or 64.
Implementing data structures like BSTs poses a problem:
When you need to balance a tree, you need to update several nodes all at once.
A way to get around it: transactional memory based systems
Language Support
C++0x (the new C++, later standardized as C++11): std::atomic_compare_exchange_weak() and std::atomic_compare_exchange_strong()
Current C++: pthreads library
GCC: type __sync_val_compare_and_swap (type *ptr, type oldval, type newval)
More info: http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html
Java package: java.util.concurrent.atomic.*
AtomicInteger, AtomicBoolean, etc.: atomic access to a single int, boolean, etc.
AtomicStampedReference: associates an int with a reference
AtomicMarkableReference: associates a boolean with a reference
Your own CAS: write an inline assembly function that uses CMPXCHG or LL/SC, depending on the hardware platform
Performance of Lock-based vs. Lock-free under moderate contention
Performance of Lock-based vs. Lock-free under very high contention (almost unrealistic)
Ensuring correctness of concurrent objects
To ensure two properties of concurrent objects (e.g. FIFO queues):
Safety: object behavior is correct as per the specification
Behavior of a FIFO queue:
If two enqueues of x and y happen in parallel (assume the queue is initially empty), then the next dequeue should return either x or y, and never some other value z
If enqueues of y and z happen one after the other in real time, then dequeue should return y first and z second
Overall progress: conditions under which at least one thread will progress
Ensuring correctness in concurrent implementations
Linearizability:
Each method call (enq and deq in the case of a queue) should appear to take place instantaneously, sometime between the start and the end of the method call
Meaning no other thread can see the change to the data structure in a step-by-step fashion
In English: if the concurrent execution can be mapped to a valid (meaning correct) sequential execution on the object, then we assume that it is correct.
Moreover, this can be used as an intuitive way to reason about concurrent objects.
You already know it, because you use it unknowingly
Think of a shared single-lock FIFO queue
Linearizability: Intuitively
Consider the deq() method for a queue
Uses a single shared lock for mutual exclusion
public T deq() throws EmptyException {
    lock.lock();
    try {
        if (tail == head)
            throw new EmptyException();
        T x = items[head % items.length];
        head++;
        return x;
    } finally {
        lock.unlock();
    }
}
All modifications of the queue are done in mutual exclusion, and therefore essentially happen in sequence.
Art of Multiprocessor Programming by Maurice Herlihy
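To see the deq() method in context, here is a hedged sketch of a complete single-lock bounded queue. Only deq() comes from the slide. The enq() method, the capacity parameter, the use of IllegalStateException in place of EmptyException, and the names LockBasedQueueDemo and LockBasedQueue are assumptions made for illustration.

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockBasedQueueDemo {
    public static class LockBasedQueue<T> {
        private final ReentrantLock lock = new ReentrantLock();
        private final T[] items;
        private int head = 0; // index of the next element to dequeue
        private int tail = 0; // index of the next free slot

        @SuppressWarnings("unchecked")
        public LockBasedQueue(int capacity) {
            items = (T[]) new Object[capacity];
        }

        public void enq(T x) {
            lock.lock();
            try {
                if (tail - head == items.length) {
                    throw new IllegalStateException("queue full");
                }
                items[tail % items.length] = x;
                tail++;
            } finally {
                lock.unlock();
            }
        }

        public T deq() {
            lock.lock();
            try {
                if (tail == head) {
                    throw new IllegalStateException("queue empty");
                }
                T x = items[head % items.length];
                head++;
                return x;
            } finally {
                lock.unlock();
            }
        }
    }

    public static void main(String[] args) {
        LockBasedQueue<String> q = new LockBasedQueue<>(2);
        q.enq("y");
        q.enq("z");
        System.out.println(q.deq());
        System.out.println(q.deq());
    }
}
```

Because every method body runs entirely while holding the one lock, each call takes effect at a single point between lock() and unlock(), which is what makes the behavior easy to reason about sequentially.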
Linearizability: Intuitively for the single lock queue
[Diagram: timeline with q.enq(x) and q.deq(x), each bracketed by lock() and unlock(), with the linearization points marked. Behavior is “sequential”. Correct behavior for q: enq(x) precedes deq(x).]
Art of Multiprocessor Programming by Maurice Herlihy
Linearizability: Intuitively
Each method of the object should “take effect”
Instantaneously
Between invocation and response of the method call
The object is correct if this “sequential” behavior is correct
Generalization: it can happen with or without mutual exclusion (this is an implementation detail)
Any such concurrent object is linearizable