Memory model
![Page 1: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/1.jpg)
Memory Model
Mingdong Liao
![Page 2: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/2.jpg)
Overview
• Simple definition of memory model.
• Optimizations.
• HW: SC & TSO & RC: strong and weak.
• SW: ordering & the C++11 memory model.
• Further reading.
![Page 3: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/3.jpg)
Lock-free programming
“Juggling razor blades.” -- Herb Sutter
Just don’t do it; use a lock!
![Page 4: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/4.jpg)
Memory model (consistency model)
• “the memory model specifies the allowed behavior of multithreaded programs executing with shared memory.”[1]
• “consistency(memory model) provide rules about loads and stores and how they act upon memory.”[1]
A contract between software and hardware.
![Page 5: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/5.jpg)
Does the computer execute the program you wrote?
NO!
Source code -> compiler -> processor -> caches -> execution
![Page 6: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/6.jpg)
Real-world example
g_a == 0, g_b == 42
g_a == 24, g_b == 0
g_a == 0, g_b == 0 ??
g_a == 24, g_b == 42
![Page 7: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/7.jpg)
How dare they change my code!
• The program you wrote is not exactly what you want executed.
• The compiler and the processor transform it to get better performance.
• Any transformation is allowed as long as it has the same observable effects.
![Page 8: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/8.jpg)
Optimizations
Optimizations are ubiquitous: the compiler and the processor will do whatever they see fit to your code to improve performance.

Reordering:
    X = 1; Y = 2; Z = 3;    // use X, Y, Z
  may become
    Z = 3; Y = 2; X = 1;    // use X, Y, Z

Dead store elimination:
    X = 1; Y = 2; X = 3;    // use X and Y
  may become
    Y = 2; X = 3;           // use X and Y

Loop interchange (row-major array, so element (j, i) lives at a[j*cols + i]):
    for (i = 0; i < cols; ++i) for (j = 0; j < rows; ++j) a[j*cols + i] += 42;
  may become
    for (j = 0; j < rows; ++j) for (i = 0; i < cols; ++i) a[j*cols + i] += 42;
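The loop interchange above can be checked with a small sketch. The function names `add42_column_first` and `add42_row_first` are hypothetical, and the matrix is assumed to be stored row-major in a flat `std::vector`. Both traversals produce the same result, which is why a compiler is free to pick the cache-friendly one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: walks the row-major matrix column by column,
// so consecutive accesses are `cols` elements apart (cache-unfriendly).
void add42_column_first(std::vector<int>& a, std::size_t rows, std::size_t cols) {
    for (std::size_t i = 0; i < cols; ++i)
        for (std::size_t j = 0; j < rows; ++j)
            a[j * cols + i] += 42;
}

// Hypothetical helper: same effect, but walks the matrix row by row,
// so consecutive accesses are adjacent in memory.
void add42_row_first(std::vector<int>& a, std::size_t rows, std::size_t cols) {
    for (std::size_t j = 0; j < rows; ++j)
        for (std::size_t i = 0; i < cols; ++i)
            a[j * cols + i] += 42;
}
```

Calling either function on the same input leaves the vector in the same final state; only the order of the memory accesses differs.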
![Page 9: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/9.jpg)
Memory model from HW’s perspective
Shared-memory support in multicore computer systems is the source of all these difficulties.
![Page 10: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/10.jpg)
Memory architecture
• The effect of a memory operation.
[Figure: a single core attached to memory, next to three cores (Core1, Core2, Core3) sharing one memory.]
Accesses to memory are serialized.
![Page 11: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/11.jpg)
Cache (and store buffer)
[Figure: Core 1 and Core 2, each with its own store buffer and cache, in front of a shared memory.]
Core 1:
    S1: store data = d1
    S2: store flag = d2
Core 2:
    L1: load r1 = flag
    B1: if (r1 != d2) goto L1
    L2: load r2 = data
Key point: writes are not automatically visible to other cores. Reads and writes are not necessarily performed in order.
Two issues arise:
a. Coherence (invisible to software).
b. Consistency: how are stores and loads to memory ordered?
![Page 12: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/12.jpg)
Program order & memory order
• Program order: the order of operations as written in the program. This is what the programmer wants.
• Memory order: the order in which the corresponding operations take effect in memory. This is the observed order.
![Page 13: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/13.jpg)
Sequential consistency
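Memory order respects the program order of every single thread (<p is program order, <m is memory order):
    If L(a) <p L(b) => L(a) <m L(b)    (load -> load)
    If L(a) <p S(b) => L(a) <m S(b)    (load -> store)
    If S(a) <p S(b) => S(a) <m S(b)    (store -> store)
    If S(a) <p L(b) => S(a) <m L(b)    (store -> load)
Every load gets its value from the last store to the same address before it in memory order.
[Figure: the stores and loads of core1 and core2 interleaved into a single memory order.]
• Simple and easy to program with.
• Performance optimizations are constrained.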
![Page 14: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/14.jpg)
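Total store order (TSO)
    If L(a) <p L(b) => L(a) <m L(b)    (load -> load)
    If L(a) <p S(b) => L(a) <m S(b)    (load -> store)
    If S(a) <p S(b) => S(a) <m S(b)    (store -> store)
    S(a) <p L(b) does NOT imply S(a) <m L(b)    (store -> load is relaxed: a store may sit in the store buffer while a later load completes)
Every load gets its value from the last store to the same address that precedes it in memory order or in program order (a core can read its own buffered store).
A fence is needed to get SC behavior.
Closely related to (and often loosely called) “processor consistency”; used in x86/64, SPARC, etc.
[Figure: the stores and loads of core1 and core2, with a load passing an earlier store of the same core.]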
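The store -> load relaxation is usually illustrated with the "store buffering" litmus test. The sketch below is only an illustration in C++11 atomics; the names `both_loads_saw_zero` and `count_sc_violations` are hypothetical. Each thread stores to one variable and then loads the other. With `memory_order_seq_cst` the outcome r1 == 0 && r2 == 0 is forbidden; with `memory_order_relaxed` it is allowed and may show up on TSO hardware.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: runs one round of the store-buffering test and
// reports whether both threads read the initial value 0.
bool both_loads_saw_zero(std::memory_order store_order,
                         std::memory_order load_order) {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;  // written by one thread each, read after join()

    std::thread t1([&] {
        x.store(1, store_order);
        r1 = y.load(load_order);
    });
    std::thread t2([&] {
        y.store(1, store_order);
        r2 = x.load(load_order);
    });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Hypothetical helper: counts how many of `runs` rounds produced the
// outcome that sequential consistency forbids.
int count_sc_violations(int runs, std::memory_order store_order,
                        std::memory_order load_order) {
    int violations = 0;
    for (int i = 0; i < runs; ++i)
        if (both_loads_saw_zero(store_order, load_order)) ++violations;
    return violations;
}
```

Passing `std::memory_order_relaxed` for both arguments permits a non-zero count, although thread start-up cost makes the window small, so a zero count in that mode proves nothing.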
![Page 15: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/15.jpg)
Memory fence
• Independent memory operations are effectively performed in random order.
• We need a way to instruct the compiler and the processor to restrict the order.
A memory fence is a per-CPU intervention.
• Fences are not guaranteed to have any effect on other CPUs.
• Fences do not guarantee the order in which other CPUs will see the effects.
![Page 16: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/16.jpg)
Release consistency
• Provides two types of operation (fence):
  a) Acquire operation: later memory operations are not allowed to move up across it.
  b) Release operation: earlier memory operations are not allowed to move down across it.
Key observations:
• An acquire operation indicates the start of a critical section.
• A release operation indicates the end of a critical section.
![Page 17: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/17.jpg)
Memory model from SW’s perspective
The other part of the contract for SW to obey
[Figure: one software memory model layered on top of the x86/64, PowerPC and ARM hardware memory models.]
![Page 18: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/18.jpg)
Ordering
Down to earth: it is all about the side effects of your program's execution with respect to memory.
a) The program order of memory operations is not necessarily the same as their memory order.
b) Use fences to prevent unwanted reordering.
![Page 19: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/19.jpg)
How does ordering matter
1: load(g_y)
2: load(g_x)
3: store(g_x)
4: store(g_y)
Non-deterministic reordering makes a program nearly impossible to reason about.
![Page 20: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/20.jpg)
How does ordering matter?
• One more try, Peterson’s algorithm on x86/64.
int g_victim;
bool g_flag[2];

void lock1() {
    g_flag[0] = true;
    g_victim = 0;
    while (g_flag[1] && g_victim == 0);
    // lock acquired.
}
void unlock1() { g_flag[0] = false; }

void lock2() {
    g_flag[1] = true;
    g_victim = 1;
    while (g_flag[0] && g_victim == 1);
    // lock acquired.
}
void unlock2() { g_flag[1] = false; }

Thread 0: Store(g_flag[0]), Store(g_victim), Load(g_flag[1]), Load(g_victim)
Thread 1: Store(g_flag[1]), Store(g_victim), Load(g_flag[0]), Load(g_victim)
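With plain variables the stores above may be reordered after the loads, and the lock breaks. One way to repair it, shown here only as a sketch, is to make every access a sequentially consistent atomic operation (the default for `std::atomic`). The names `PetersonLock` and `run_peterson_demo` are hypothetical, and the `yield()` in the spin loop is an addition to keep the demo polite on machines with few cores.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical two-thread lock; `self` must be 0 or 1.
struct PetersonLock {
    std::atomic<bool> flag[2];
    std::atomic<int> victim;

    PetersonLock() : victim(0) {
        flag[0] = false;
        flag[1] = false;
    }

    void lock(int self) {
        const int other = 1 - self;
        flag[self].store(true);  // memory_order_seq_cst by default
        victim.store(self);      // seq_cst: cannot move below the loads that follow
        while (flag[other].load() && victim.load() == self) {
            std::this_thread::yield();
        }
    }

    void unlock(int self) { flag[self].store(false); }
};

// Hypothetical driver: two threads increment a plain counter under the lock.
long run_peterson_demo(int iterations) {
    PetersonLock lock;
    long counter = 0;  // not atomic: protected only by the lock
    auto work = [&](int self) {
        for (int i = 0; i < iterations; ++i) {
            lock.lock(self);
            ++counter;
            lock.unlock(self);
        }
    };
    std::thread t0(work, 0);
    std::thread t1(work, 1);
    t0.join();
    t1.join();
    return counter;
}
```

If mutual exclusion holds, `run_peterson_demo(n)` returns exactly `2 * n`.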
![Page 21: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/21.jpg)
Is reordering that bad?
Yes. No.
It depends.
As long as we don’t see the reordering, it does not matter what it is!
Hardware loves to reorder in order to optimize performance.
Software, however, needs SC to ensure correct code.
![Page 22: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/22.jpg)
SC-DRF
• Full sequential consistency: the ideal world. The machine executes the code you wrote; this is what most programmers expect.
• SC-DRF: sequential consistency for data-race-free programs; the reality.
Compromise between software and hardware!
As long as you don’t write code with data races, the HW (together with the compiler) guarantees you the illusion of full sequential consistency.
![Page 23: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/23.jpg)
Race condition
• A memory location is accessed simultaneously by two or more threads, and at least one of them is a writer.
• Key point: transaction.
  1) Atomicity: no torn reads or torn writes.
  2) Visibility: side effects propagate from thread to thread.
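As a small illustration of the atomicity requirement, the sketch below has several threads bump one shared counter. With a plain `long` this would be a data race. With `std::atomic<long>` each increment is an indivisible read-modify-write. The function name `count_with_atomic` is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: `threads` workers each add 1 to a shared counter
// `per_thread` times and the final value is returned.
long count_with_atomic(int threads, int per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                // Relaxed is enough here: only the atomicity of the
                // increment matters, and join() publishes the final value.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& worker : pool) worker.join();
    return counter.load();
}
```

No increment is lost, so the result is always `threads * per_thread`.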
![Page 24: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/24.jpg)
Critical section
• A race condition can occur only when threads manipulate shared variables.
• Create a critical region to serialize the accesses: a way to implement a transaction.
![Page 25: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/25.jpg)
Critical section
Good fences make good neighbors.
[Figure: the accesses to shared variables enclosed between an acquire fence (top) and a release fence (bottom).]
Reordering within the critical section? Fine, as long as the operations don’t move out of the section.
A full fence will work, but acquire and release operations are better.
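A minimal sketch of a critical section built from exactly one acquire and one release operation is a spin lock on `std::atomic_flag`. The names `SpinLock` and `run_spinlock_demo` are hypothetical, and the `yield()` is an addition that is not required for correctness.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical spin lock: acquire on entry, release on exit.
class SpinLock {
    std::atomic_flag locked_ = ATOMIC_FLAG_INIT;

public:
    void lock() {
        // Acquire: nothing in the critical section may move above this.
        while (locked_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }

    void unlock() {
        // Release: nothing in the critical section may move below this.
        locked_.clear(std::memory_order_release);
    }
};

// Hypothetical driver: two threads increment a plain counter under the lock.
long run_spinlock_demo(int iterations) {
    SpinLock lock;
    long counter = 0;  // not atomic: protected only by the lock
    auto work = [&] {
        for (int i = 0; i < iterations; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    return counter;
}
```

Operations inside the section may still be reordered among themselves; the two operations only keep them from leaking out.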
![Page 26: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/26.jpg)
C++11 atomic
• Operations on atomic types are performed atomically; they are also known as synchronization operations.
• The user can specify the memory ordering for every load and store.
template <class T> struct atomic {
    bool is_lock_free() const noexcept;
    void store(T, memory_order = memory_order_seq_cst) noexcept;
    T load(memory_order = memory_order_seq_cst) const noexcept;
    T exchange(T, memory_order = memory_order_seq_cst) noexcept;
    bool compare_exchange_weak(T&, T, memory_order, memory_order) noexcept;
    bool compare_exchange_strong(T&, T, memory_order, memory_order) noexcept;
    bool compare_exchange_weak(T&, T, memory_order = memory_order_seq_cst) noexcept;
    bool compare_exchange_strong(T&, T, memory_order = memory_order_seq_cst) noexcept;
};
Synchronization operations specify how assignments in one thread become visible to another. [C++ standard: 1.10.5]
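To show how the two-order overload of `compare_exchange_weak` is typically used, here is a sketch of an "atomic maximum" built from a CAS loop. The helper names `atomic_store_max` and `concurrent_max` are hypothetical; the orderings chosen (acq_rel on success, relaxed on failure) are one reasonable choice, not the only one.

```cpp
#include <atomic>
#include <cassert>
#include <limits>
#include <thread>
#include <vector>

// Hypothetical helper: raises `target` to `candidate` if candidate is larger.
void atomic_store_max(std::atomic<int>& target, int candidate) {
    int observed = target.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak reloads `observed` with the current
    // value, so the loop re-checks whether an update is still needed.
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
    }
}

// Hypothetical driver: one thread per value, all racing on the same atomic.
int concurrent_max(const std::vector<int>& values) {
    std::atomic<int> best{std::numeric_limits<int>::min()};
    std::vector<std::thread> pool;
    for (int v : values) {
        pool.emplace_back([&best, v] { atomic_store_max(best, v); });
    }
    for (auto& worker : pool) worker.join();
    return best.load();
}
```

The weak form may fail spuriously, which is harmless here because the call already sits in a retry loop.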
![Page 27: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/27.jpg)
C++11 memory order
namespace std {
    typedef enum memory_order {
        memory_order_relaxed,  // no ordering constraint.
        memory_order_consume,  // a weaker version of acquire semantics.
        memory_order_acquire,  // a load using this order is an acquire operation.
        memory_order_release,  // a store using this order is a release operation.
        memory_order_acq_rel,  // both; for RMW operations, e.g. exchange().
        memory_order_seq_cst   // sequential consistency: like memory_order_acq_rel,
                               // plus a single total order on all memory_order_seq_cst operations.
    } memory_order;
}
Note: synchronization is established only between a write and a read performed on the same memory location (the same atomic object).
![Page 28: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/28.jpg)
Acquire/release and Consume/release
Acquire/release:
atomic<int> guard(0);
int pay_load = 0;

// thread 0
pay_load = 1;
guard.store(1, memory_order_release);

// thread 1
int pay;
int g = guard.load(memory_order_acquire);
if (g) pay = pay_load;

Consume/release:
atomic<int*> guard(nullptr);
int pay_load = 0;

// thread 0
pay_load = 1;
guard.store(&pay_load, memory_order_release);

// thread 1
int pay;
int* g = guard.load(memory_order_consume);
if (g) pay = *g;   // g must carry a data dependency into pay = *g

On most weakly ordered architectures, memory ordering between data-dependent instructions is preserved; in that case an explicit memory fence is not necessary.[7]
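The acquire/release pattern can be wrapped into a runnable sketch. The function name `receive_payload` is hypothetical, and unlike the slide's single `if (g)` check, the consumer here spins until the guard is set so that the result is deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: publishes a plain int through an atomic guard and
// returns what the consumer thread observed.
int receive_payload() {
    std::atomic<int> guard{0};
    int pay_load = 0;  // plain data, published via guard
    int seen = -1;

    std::thread producer([&] {
        pay_load = 1;                                // ordinary write
        guard.store(1, std::memory_order_release);   // publish
    });
    std::thread consumer([&] {
        while (guard.load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();               // wait for the publication
        }
        seen = pay_load;  // guaranteed to read 1: the release store
                          // synchronizes with the acquire load
    });
    producer.join();
    consumer.join();
    return seen;
}
```

Replacing both orderings with `memory_order_relaxed` would turn the read of `pay_load` into a data race.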
![Page 29: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/29.jpg)
memory_order_seq_cst
• Orders memory operations the same way as release and acquire.
• Establishes a single total order on all memory_order_seq_cst operations.
Suppose x and y are atomic variables initialized to 0.[6]
Thread 1: x = 1
Thread 2: y = 1
Thread 3: if (y == 1 && x == 0) cout << "y first";
Thread 4: if (x == 1 && y == 0) cout << "x first";
It must be impossible for both messages to be printed. (Each observer reads the variable it expects to be set first; if thread 4 read y before x, both messages could be printed even under SC.)
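The four-thread example can be turned into a repeatable experiment. This is only a sketch: `Observation`, `observe_once` and `count_disagreements` are hypothetical names, and each observer reads the variable it expects to be set before the one it expects to be unset. With the default `memory_order_seq_cst` the two observers can never disagree on which store came first.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical record of what the two observer threads concluded.
struct Observation {
    bool y_first;
    bool x_first;
};

// Hypothetical helper: one round with two writers and two observers.
Observation observe_once() {
    std::atomic<int> x{0}, y{0};
    Observation seen = {false, false};

    std::thread t1([&] { x.store(1); });  // seq_cst by default
    std::thread t2([&] { y.store(1); });
    std::thread t3([&] {
        if (y.load() == 1 && x.load() == 0) seen.y_first = true;
    });
    std::thread t4([&] {
        if (x.load() == 1 && y.load() == 0) seen.x_first = true;
    });
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    return seen;
}

// Hypothetical helper: counts rounds in which both observers "won".
int count_disagreements(int runs) {
    int disagreements = 0;
    for (int i = 0; i < runs; ++i) {
        Observation seen = observe_once();
        if (seen.y_first && seen.x_first) ++disagreements;
    }
    return disagreements;
}
```

With acquire/release orderings instead of seq_cst, the C++ model would allow a non-zero count, although it rarely shows up on x86.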
![Page 30: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/30.jpg)
C++11 memory fence
• It is different from what you might expect of a traditional fence.
• It is more like a way to do synchronization.
extern "C" void atomic_thread_fence(memory_order order) noexcept;

// other memory operations preceding the fence.
std::atomic_thread_fence(std::memory_order_release);
flag.store(1, std::memory_order_relaxed);
A release fence prevents all preceding memory operations from being reordered past all subsequent writes.

flag.load(std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_acquire);
// other memory operations.
An acquire fence prevents all subsequent memory operations from being reordered before all preceding reads.

data.store(3, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);
flag.store(1, std::memory_order_relaxed);
flag2.store(2, std::memory_order_relaxed);
The code above is NOT equivalent to the following:
data.store(3, std::memory_order_relaxed);
flag.store(1, std::memory_order_release);
flag2.store(2, std::memory_order_relaxed);
In the second version, flag2.store() is allowed to be reordered before data.store().
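Put together, a release fence on the writer side and an acquire fence on the reader side synchronize two threads even though every atomic access is relaxed. The sketch below assumes a hypothetical function name `receive_with_fences` and spins in the reader so that the result is deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: passes a plain int between threads using only
// relaxed atomic accesses plus a pair of fences.
int receive_with_fences() {
    std::atomic<int> flag{0};
    int data = 0;  // plain variable, ordered by the fences
    int seen = -1;

    std::thread writer([&] {
        data = 3;
        std::atomic_thread_fence(std::memory_order_release);
        flag.store(1, std::memory_order_relaxed);
    });
    std::thread reader([&] {
        while (flag.load(std::memory_order_relaxed) == 0) {
            std::this_thread::yield();
        }
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = data;  // ordered after the writer's data = 3
    });
    writer.join();
    reader.join();
    return seen;
}
```

The fences only synchronize because the reader's relaxed load actually observes the value written after the writer's release fence.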
![Page 31: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/31.jpg)
Quiz
What ordering is needed to prevent the reordering?
Hint:
a. We need an acquire before the load of g_y in foo1().
b. We need an acquire before the load of g_x in foo2().
Can we accomplish that? Acquire and release are pairwise operations.
![Page 32: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/32.jpg)
Quiz: Peterson’s algo again.
Question:
atomic<int> g_victim;
atomic<bool> g_flag[2];

void lock1() {
    g_flag[0].store(true, ?);
    g_victim.store(0, ?);
    while (g_flag[1].load(?) && g_victim.load(?) == 0);
    // lock acquired.
}
void unlock1() { g_flag[0].store(false, ?); }

Thread 0: Store(g_flag[0]), Store(g_victim), Load(g_flag[1]), Load(g_victim)
Thread 1: Store(g_flag[1]), Store(g_victim), Load(g_flag[0]), Load(g_victim)

Answer:
atomic<int> g_victim;
atomic<bool> g_flag[2];

void lock1() {
    g_flag[0].store(true, memory_order_relaxed);
    g_victim.exchange(0, memory_order_acq_rel);
    while (g_flag[1].load(memory_order_acquire) && g_victim.load(memory_order_relaxed) == 0);
    // lock acquired.
}
void unlock1() { g_flag[0].store(false, memory_order_release); }

“Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation.” [standard §29.3.12]
![Page 33: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/33.jpg)
A few terms: synchronize with
• An operation A synchronizes-with an operation B if:
  1) A is a store to some atomic variable m, with an ordering of std::memory_order_release or std::memory_order_seq_cst,
  2) B is a load from the same variable m, with an ordering of std::memory_order_acquire or std::memory_order_seq_cst,
  3) and B reads the value stored by A.
Thread 1:
    Data = 42
    Flag = 1
Thread 2:
    R1 = Flag
    if (R1 == 1) use Data
![Page 34: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/34.jpg)
A few terms: dependency-ordered before
An operation A is dependency-ordered before an operation B if:
  1) A is a store to some atomic variable m, with an ordering of std::memory_order_release or std::memory_order_seq_cst,
  2) B is a load from the same variable m, with an ordering of std::memory_order_consume,
  3) and B reads a value stored by the “release sequence” headed by A.
Thread 1:
    Data = 42
    Flag = &Data
Thread 2:
    R1 = Flag
    if (R1) use *R1
![Page 35: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/35.jpg)
A few terms: happens before
A happens before B if A is:
• Sequenced before B: the order of evaluations within a single thread.
• Or synchronizes with B.
• Or dependency-ordered before B.
• Or related to B by a concatenation of the above three relationships, with two exceptions.[standard 1.10.11]
Happens-before indicates visibility.
![Page 36: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/36.jpg)
volatile
• A compiler-level semantic.
    The compiler guarantees that accesses to the volatile variable are not optimized away and are not reordered with respect to other volatile accesses.
    Other threads may not see this guarantee.
    It has nothing to do with inter-thread synchronization.
• Not an atomic operation.
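For a flag shared between threads, the portable tool in C++11 is therefore `std::atomic`, not `volatile`. The sketch below uses a hypothetical function name `spin_until_stopped` and an arbitrary 10 ms delay; declaring `stop` as `volatile bool` instead would make the concurrent accesses a data race.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper: a worker spins until the main thread raises an
// atomic stop flag; returns how many iterations the worker completed.
long spin_until_stopped() {
    std::atomic<bool> stop{false};
    long iterations = 0;  // written only by the worker, read after join()

    std::thread worker([&] {
        while (!stop.load(std::memory_order_acquire)) {
            ++iterations;
        }
    });

    // Arbitrary delay, chosen only so the worker gets a chance to run.
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    stop.store(true, std::memory_order_release);
    worker.join();
    return iterations;
}
```

The atomic load cannot be hoisted out of the loop, and the store is guaranteed to become visible to the worker, so the function always returns.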
![Page 37: Memory model](https://reader036.fdocuments.net/reader036/viewer/2022081513/55b6cbdbbb61ebfd468b471e/html5/thumbnails/37.jpg)
Further reading
• [1] https://class.stanford.edu/c4x/Engineering/CS316/asset/A_Primer_on_Memory_Consistency_and_Coherence.pdf
• [2] http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=F2BAAAED623D54B73C5FF41DF14D5864?doi=10.1.1.17.8112&rep=rep1&type=pdf
• [3] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2075.pdf
• [4] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n1942.html
• [5] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2664.htm
• [5] http://open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2427.html
• [6] http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2
• [7] https://www.kernel.org/doc/Documentation/memory-barriers.txt
• [8] www.preshing.com