How shit works: the CPU
Tomer Gabel
BuildStuff 2016 Lithuania
Image: Telecarlos (CC BY-SA 3.0)
Full Disclosure
Bullshit ahead!
• I’m not an expert
• Explanations may be:
– Simplified
– Inaccurate
– Wrong :-)
• We’ll barely scratch the surface
Image: Public Domain
Are you ready for…
A CONUNDRUM?
Image: Louis Reed (CC BY-SA 4.0)
Setting the Stage
// Generate a bunch of bytes
byte[] data = new byte[32768];
new Random().nextBytes(data);
Arrays.sort(data);
// Sum positive elements
long sum = 0;
for (int i = 0; i < data.length; i++)
    if (data[i] >= 0)
        sum += data[i];
1. Which is faster: with the sort, or without it?
2. By how much?
3. And crucially… why?!
Surprise, Terror and Ruthless Efficiency
# Run complete. Total time: 00:00:32
Benchmark      Mode  Cnt    Score   Error  Units
Baseline.sum   avgt    6  115.666 ± 3.137  us/op
Presorted.sum  avgt    6   13.741 ± 0.524  us/op
* Ignoring setup cost
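For readers who want to try the comparison without a benchmark harness, here is a minimal sketch. The class name SortedSumDemo, the fixed seed and the nanoTime-based timing loop are assumptions made for illustration; this is not the benchmark that produced the numbers above, and a loop like this gives much rougher results than a proper harness.

```java
import java.util.Arrays;
import java.util.Random;

final class SortedSumDemo {
    // Sum the elements that pass the data[i] >= 0 test, as on the slide.
    static long sum(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++)
            if (data[i] >= 0)
                sum += data[i];
        return sum;
    }

    // Crude average time per call in nanoseconds (an assumption, not a real harness).
    static double nanosPerCall(byte[] data, int iterations) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++)
            sink += sum(data);
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.println(); // use the result so the work is not discarded
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        byte[] baseline = new byte[32768];
        new Random(0).nextBytes(baseline);    // fixed seed is an assumption, for repeatability
        byte[] presorted = baseline.clone();
        Arrays.sort(presorted);               // setup cost is deliberately not timed

        nanosPerCall(baseline, 20_000);       // warm-up passes
        nanosPerCall(presorted, 20_000);
        System.out.printf("baseline:  %.1f ns/op%n", nanosPerCall(baseline, 20_000));
        System.out.printf("presorted: %.1f ns/op%n", nanosPerCall(presorted, 20_000));
    }
}
```

Depending on the JVM and CPU, the JIT compiler may turn the branch into a conditional move, in which case the gap can shrink or disappear.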
CPUS ARE COMPLEX BEASTS.
Image: Pauli Rautakorpi (CC BY 3.0)
It Is Known
• Your high-level code…
long sum = 0;
for (i = 0; i < length; i++)
    if (data[i] >= 0)
        sum += data[i];
• Gets compiled down to…
movsx eax,BYTE PTR [rax+rdx*1+0x10]
cmp eax,0x0
movabs rdx,0x11f3a9f60
movabs rcx,0x128
jl 0x000000010679e077
movabs rcx,0x138
mov r8,QWORD PTR [rdx+rcx*1]
lea r8,[r8+0x1]
mov QWORD PTR [rdx+rcx*1],r8
jl 0x000000010679e092
movsxd rax,eax
add rax,rbx
mov rbx,rax
inc edi
It Is Less Known
• What happens then?
• The instruction goes through phases…
Instruction stream → Fetch → Decode → Execute → Memory Access → Write-back
CPU Architecture 101
Image: Appaloosa (CC BY-SA 3.0)
CPU Architecture 101
• What does a CPU do?
– Reads the program
– Figures it out
– Executes it
– Talks to memory
– Performs I/O
• Immense complexity!
Execution Units
• Arithmetic-Logic Unit (ALU)
– Boolean algebra
– Arithmetic
– Memory accesses
– Flow control
• Floating Point Unit (FPU)
• Memory Management Unit (MMU)
– Memory mapping
– Paging
– Access control
Images: ALU by Dirk Oppelt (CC BY-SA 3.0), FPU by Konstantin Lanzet (CC BY-SA 3.0), MMU from unknown source
DESIGN CONSIDERATIONS
Image: William M. Plate Jr. (Public Domain)
Pipelining
Sequential Execution: I0, I1 and I2 each pass through Fetch, Decode, Execute, Memory Access and Write-back before the next instruction starts.
Latency = 5 cycles
Throughput = 0.2 ops / cycle
Pipelined Execution: I0, I1 and I2 overlap, each one stage behind the previous instruction.
Latency = 5 cycles
Throughput = 1 ops / cycle
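The latency and throughput figures for a five-stage pipeline can be checked with a little arithmetic. The sketch below is a toy model with hypothetical names (PipelineModel, sequentialCycles, pipelinedCycles); it assumes one cycle per stage and no stalls, which real CPUs do not guarantee.

```java
final class PipelineModel {
    static final int STAGES = 5; // Fetch, Decode, Execute, Memory Access, Write-back

    // Sequential: an instruction finishes all stages before the next one starts.
    static long sequentialCycles(long instructions) {
        return instructions * STAGES;
    }

    // Pipelined, no stalls: the first instruction takes STAGES cycles,
    // and every later instruction completes one cycle after its predecessor.
    static long pipelinedCycles(long instructions) {
        return instructions == 0 ? 0 : STAGES + (instructions - 1);
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.printf("sequential: %.3f ops/cycle%n", (double) n / sequentialCycles(n));
        System.out.printf("pipelined:  %.3f ops/cycle%n", (double) n / pipelinedCycles(n));
    }
}
```

Latency per instruction is five cycles in both cases; only the number of instructions in flight changes, which is why pipelined throughput approaches one op per cycle as the instruction count grows.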
Pipelining
• A pipeline can stall
• This happens with:
– Branches
    if (i < 0) i++; else i--;
    F D E M W   Memory Load
    F D E M     Test
    F D E       Conditional Jump
    ? ? ? ? ?
– Dependent Instructions
    i++; x = i + 1;
    F D E M W   Increment memory address: load from memory, add +1, store in memory
    F D E M
    F D Stall
    F D
• A.K.A. pipeline bubbling
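One common way to give the pipeline more independent work is to split a single chain of dependent additions into several chains. The sketch below uses hypothetical names (DependencyChainDemo, sumSingle, sumFourAccumulators) and is only an illustration of the idea; a JIT compiler may already unroll or vectorize the simple loop, so any speedup is an assumption to measure, not a given.

```java
final class DependencyChainDemo {
    // Every addition depends on the result of the previous one.
    static long sumSingle(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++)
            sum += data[i];
        return sum;
    }

    // Four independent accumulators: additions into a, b, c and d
    // do not depend on each other and can overlap in the pipeline.
    static long sumFourAccumulators(int[] data) {
        long a = 0, b = 0, c = 0, d = 0;
        int i = 0;
        for (; i + 3 < data.length; i += 4) {
            a += data[i];
            b += data[i + 1];
            c += data[i + 2];
            d += data[i + 3];
        }
        for (; i < data.length; i++) // leftover elements
            a += data[i];
        return a + b + c + d;
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_003];
        for (int i = 0; i < data.length; i++) data[i] = i % 17;
        System.out.println(sumSingle(data) == sumFourAccumulators(data));
    }
}
```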
PRACTICAL RAMIFICATIONS
Image: Hangsna (CC BY-SA 3.0)
1. Memory is Slow
• RAM access is ~60ns
• Random access on a 4GHz, 64-bit CPU:
– ~250 cycles / memory access
– ~130MB / second bandwidth
• Surely we can do better!
Image: Noah Wieder (Public Domain)
Source: 7-cpu.com
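The figures on this slide follow from simple arithmetic, sketched below. The class and method names are hypothetical, and the model assumes one 8-byte word per access with no overlap between accesses; with a 60ns latency it gives 240 cycles and about 133MB per second, which the slide rounds.

```java
final class MemoryLatencyMath {
    // Cycles spent waiting for one access: latency in ns times cycles per ns.
    static double cyclesPerAccess(double latencyNanos, double clockGHz) {
        return latencyNanos * clockGHz;
    }

    // Bytes per second if every access fetches one word and accesses never overlap.
    static double bytesPerSecond(double latencyNanos, int wordBytes) {
        return wordBytes / (latencyNanos * 1e-9);
    }

    public static void main(String[] args) {
        double latencyNanos = 60.0; // from the slide
        double clockGHz = 4.0;      // from the slide
        int wordBytes = 8;          // 64-bit word
        System.out.printf("cycles per access: %.0f%n", cyclesPerAccess(latencyNanos, clockGHz));
        System.out.printf("bandwidth: %.0f MB/s%n", bytesPerSecond(latencyNanos, wordBytes) / 1e6);
    }
}
```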
Enter: CPU Cache
Level        Size         Latency
L1           32KB + 32KB  1ns
L2           256KB        3ns
L3           4MB          11ns
Main Memory               62ns
Intel i7-6700 “Skylake” at 4 GHz
Image: Ferry24.Milan (CC BY-SA 3.0)
Source: 7-cpu.com
Enter: CPU Cache
• A unit of work is called a cache line
– 64 bytes on x86
– LRU eviction policy
• Why is sequential access fast?
– Cache prefetching
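To see the effect of access order, the sketch below sums the same array twice: once in index order and once hopping a whole cache line at a time. StrideDemo and its methods are hypothetical names, the timing is crude, and the 16-ints-per-line figure assumes 64-byte lines and 4-byte ints. Hardware prefetchers often recognize constant strides, so the size of the gap depends heavily on the machine.

```java
final class StrideDemo {
    static final int INTS_PER_LINE = 64 / Integer.BYTES; // 16 ints in a 64-byte line

    // Visit every element once, in index order.
    static long sumSequential(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++)
            sum += data[i];
        return sum;
    }

    // Visit every element once, but jump `stride` elements between accesses,
    // so consecutive accesses tend to land on different cache lines.
    static long sumStrided(int[] data, int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++)
            for (int i = start; i < data.length; i += stride)
                sum += data[i];
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[1 << 22]; // 16MB, larger than the 4MB L3 on the slide
        for (int i = 0; i < data.length; i++) data[i] = i & 0xFF;

        for (int round = 0; round < 5; round++) { // early rounds serve as warm-up
            long t0 = System.nanoTime();
            long a = sumSequential(data);
            long t1 = System.nanoTime();
            long b = sumStrided(data, INTS_PER_LINE);
            long t2 = System.nanoTime();
            System.out.printf("sequential %d us, strided %d us, equal=%b%n",
                    (t1 - t0) / 1000, (t2 - t1) / 1000, a == b);
        }
    }
}
```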
In Real Life
• Let’s rotate an image!
for (int y = 0; y < height; y++)
    for (int x = 0; x < width; x++) {
        int from = y * width + x;
        int to = x * height + y;
        target[to] = source[from];
    }
Image: EgoAltere (CC0 Public Domain)
In Real Life
• This is not efficient
• Reads are sequential
• Writes aren’t, though
• Different strides
– Worst case wins :-(
      0   1   2   3   …   9
 0    0   1   2   3   …   9
 1   10
 2   20
 3   30
 …    …
 9   90
Cache-Friendly Algorithms
• Use blocking or tiling
// Assumes width and height are multiples of the block size
for (int y = 0; y < height; y += blockHeight)
    for (int x = 0; x < width; x += blockWidth)
        for (int by = 0; by < blockHeight; by++)
            for (int bx = 0; bx < blockWidth; bx++) {
                int from = (y + by) * width + (x + bx);
                int to = (x + bx) * height + (y + by);
                target[to] = source[from];
            }
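The loop on the slide can be made self-contained and safe for sizes that are not a multiple of the block size by clamping each tile at the image edge. This is a minimal sketch under assumptions: the class TiledTranspose, the single square block parameter and the Math.min clamping are illustrative additions, not the code behind the benchmark.

```java
import java.util.Arrays;

final class TiledTranspose {
    // Straightforward version: sequential reads, strided writes.
    static void transposeNaive(int[] source, int[] target, int width, int height) {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                target[x * height + y] = source[y * width + x];
    }

    // Tiled version: process block x block tiles so that the reads and the
    // writes of one tile both stay within a small set of cache lines.
    static void transposeTiled(int[] source, int[] target, int width, int height, int block) {
        for (int y = 0; y < height; y += block) {
            int yEnd = Math.min(y + block, height);     // clamp the tile at the bottom edge
            for (int x = 0; x < width; x += block) {
                int xEnd = Math.min(x + block, width);  // clamp the tile at the right edge
                for (int by = y; by < yEnd; by++)
                    for (int bx = x; bx < xEnd; bx++)
                        target[bx * height + by] = source[by * width + bx];
            }
        }
    }

    public static void main(String[] args) {
        int width = 1000, height = 700;
        int[] source = new int[width * height];
        for (int i = 0; i < source.length; i++) source[i] = i;

        int[] naive = new int[source.length];
        int[] tiled = new int[source.length];
        transposeNaive(source, naive, width, height);
        transposeTiled(source, tiled, width, height, 16);
        System.out.println("same result: " + Arrays.equals(naive, tiled));
    }
}
```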
Cache-Friendly Algorithms
• The results?
Benchmark                            Mode  Cnt   Score   Error  Units
CachingShowcase.transposeNaive       avgt   10  43.851 ± 6.000  ms/op
CachingShowcase.transposeTiled8x8    avgt   10  20.641 ± 1.646  ms/op
CachingShowcase.transposeTiled16x16  avgt   10  18.515 ± 1.833  ms/op
CachingShowcase.transposeTiled48x48  avgt   10  21.941 ± 1.954  ms/op
x2.37 speedup!
2. Those Pesky Branches
• Do I go left or right?
• Need input!
• … but can’t wait for it
• Maybe...
– Take a guess?
– Based on historic trends?
• Sounds speculative
Image: Michael Dolan (CC BY 2.0)
Those Pesky Branches
• Enter: Branch Prediction
• Concurrently:
– Speculate branch
– Evaluate condition
• It’s now a tradeoff
– Commit is fast
– Rollback is slow
Image: Alejandro C. (CC BY-NC 2.0)
Back to Our Conundrum
// Generate a bunch of bytes
byte[] data = new byte[32768];
new Random().nextBytes(data);
Arrays.sort(data);
// Sum positive elements
long sum = 0;
for (int i = 0; i < data.length; i++)
    if (data[i] >= 0)
        sum += data[i];
• Can you guess?
– 3…
– 2…
– 1…
• Here it is!
Catharsis
Original data array:
54 10 -4 -2 15 41 -37 13 0 -9 14 25 -61 40
After sorting:
-61 -37 -9 -4 -2 | 0 10 13 14 15 25 40 41 54
Before the 0: data[i] >= 0 is always false!
From the 0 onwards: data[i] >= 0 is always true!
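To make the "historic trends" idea concrete, the sketch below simulates a two-bit saturating counter, a textbook prediction scheme, on the branch data[i] >= 0. It is a toy model with hypothetical names (BranchPredictorSim, mispredictions); predictors in real CPUs are considerably more elaborate, so the counts only illustrate the trend.

```java
import java.util.Arrays;
import java.util.Random;

final class BranchPredictorSim {
    // Counts how often a two-bit saturating counter mispredicts `b >= 0`.
    // Counter values 0 and 1 predict "not taken"; 2 and 3 predict "taken".
    static int mispredictions(byte[] data) {
        int counter = 0; // initial state is an arbitrary assumption
        int misses = 0;
        for (byte b : data) {
            boolean taken = b >= 0;
            boolean predictedTaken = counter >= 2;
            if (predictedTaken != taken) misses++;
            // Move one step toward the actual outcome, saturating at 0 and 3.
            counter = taken ? Math.min(3, counter + 1) : Math.max(0, counter - 1);
        }
        return misses;
    }

    public static void main(String[] args) {
        byte[] random = new byte[32768];
        new Random(0).nextBytes(random);   // fixed seed is an assumption, for repeatability
        byte[] sorted = random.clone();
        Arrays.sort(sorted);

        System.out.println("unsorted mispredictions: " + mispredictions(random));
        System.out.println("sorted mispredictions:   " + mispredictions(sorted));
    }
}
```

On sorted input the model only mispredicts around the point where the values cross zero; on random input it is wrong about as often as a coin flip.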
QUESTIONS?
Thank you for listening
@tomerg
http://engineering.wix.com
Sources and Examples:
https://goo.gl/f7NfGT
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0
International License.
Further Reading
• Jason Robert Carey Patterson – Modern Microprocessors, a 90-Minute Guide
• Igor Ostrovsky – Gallery of Processor Cache Effects
• Piyush Kumar – Cache Oblivious Algorithms