How shit works: the CPU

How shit works: the CPU Tomer Gabel BuildStuff 2016 Lithuania Image : Telecarlos (CC BY-SA 3.0)


Page 1: How shit works: the CPU

How shit works: the CPU

Tomer Gabel

BuildStuff 2016 Lithuania

Image: Telecarlos (CC BY-SA 3.0)

Page 2: How shit works: the CPU

Full Disclosure

Bullshit ahead!

• I’m not an expert

• Explanations may be:

– Simplified

– Inaccurate

– Wrong :-)

• We’ll barely scratch the surface

Image: Public Domain

Page 3: How shit works: the CPU

Are you ready for…

A CONUNDRUM?

Image: Louis Reed (CC BY-SA 4.0)

Page 4: How shit works: the CPU

Setting the Stage

// Generate a bunch of bytes
byte[] data = new byte[32768];
new Random().nextBytes(data);
Arrays.sort(data);    // Presorted only; Baseline skips this line

// Sum positive elements
long sum = 0;
for (int i = 0; i < data.length; i++)
    if (data[i] >= 0)
        sum += data[i];

1. Which is faster: with or without the sort?

2. By how much?

3. And crucially… why?!

Page 5: How shit works: the CPU

Surprise, Terror and Ruthless Efficiency

# Run complete. Total time: 00:00:32

Benchmark      Mode  Cnt    Score   Error  Units
Baseline.sum   avgt    6  115.666 ± 3.137  us/op
Presorted.sum  avgt    6   13.741 ± 0.524  us/op

* Ignoring setup cost

Page 6: How shit works: the CPU

CPUS ARE COMPLEX BEASTS.

Image: Pauli Rautakorpi (CC BY 3.0)

Page 7: How shit works: the CPU

It Is Known

• Your high-level code…

long sum = 0;
for (i = 0; i < length; i++)
    if (data[i] >= 0)
        sum += data[i];

• Gets compiled down to…

movsx eax,BYTE PTR [rax+rdx*1+0x10]

cmp eax,0x0

movabs rdx,0x11f3a9f60

movabs rcx,0x128

jl 0x000000010679e077

movabs rcx,0x138

mov r8,QWORD PTR [rdx+rcx*1]

lea r8,[r8+0x1]

mov QWORD PTR [rdx+rcx*1],r8

jl 0x000000010679e092

movsxd rax,eax

add rax,rbx

mov rbx,rax

inc edi

Page 8: How shit works: the CPU

It Is Less Known

• What happens then?

• The instruction goes through phases…

Instruction Stream → Fetch → Decode → Execute → Memory Access → Write-back

Page 9: How shit works: the CPU

CPU Architecture 101

Image: Appaloosa (CC BY-SA 3.0)

Page 15: How shit works: the CPU

CPU Architecture 101

• What does a CPU do?

– Reads the program

– Figures it out

– Executes it

– Talks to memory

– Performs I/O

• Immense complexity!

Page 16: How shit works: the CPU

Execution Units

• Arithmetic-Logic Unit (ALU)

– Boolean algebra

– Arithmetic

– Memory accesses

– Flow control

• Floating Point Unit (FPU)

• Memory Management Unit (MMU)

– Memory mapping

– Paging

– Access control

Images: ALU by Dirk Oppelt (CC BY-SA 3.0), FPU by Konstantin Lanzet (CC BY-SA 3.0), MMU from unknown source

Page 17: How shit works: the CPU

DESIGN CONSIDERATIONS

Image: William M. Plate Jr. (Public Domain)

Page 19: How shit works: the CPU

Pipelining

Sequential Execution (each instruction starts after the previous one completes)

I0  F D E M W
I1            F D E M W
I2                      F D E M W

Latency = 5 cycles
Throughput = 0.2 ops / cycle

Pipelined Execution (a new instruction enters the pipeline every cycle)

I0  F D E M W
I1    F D E M W
I2      F D E M W

Latency = 5 cycles
Throughput = 1 op / cycle

F = Fetch, D = Decode, E = Execute, M = Memory Access, W = Write-back
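The two throughput figures follow from simple counting. Here is a minimal sketch, assuming an idealized five-stage pipeline that never stalls; the class `PipelineMath` and its method names are made up for illustration and do not come from the talk.

```java
// Idealized cycle counts for sequential vs. pipelined execution.
// Assumption: no stalls, and one new instruction can start every cycle.
final class PipelineMath {
    // Sequential: every instruction runs through all stages before the next starts.
    static long sequentialCycles(long instructions, int stages) {
        return instructions * stages;
    }

    // Pipelined: the first instruction needs `stages` cycles to complete,
    // after that one more instruction completes on every following cycle.
    static long pipelinedCycles(long instructions, int stages) {
        if (instructions == 0) return 0;
        return stages + (instructions - 1);
    }

    // Completed instructions per cycle.
    static double throughput(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    public static void main(String[] args) {
        int stages = 5; // Fetch, Decode, Execute, Memory Access, Write-back
        for (long n : new long[] {3, 1_000, 1_000_000}) {
            long seq = sequentialCycles(n, stages);
            long pipe = pipelinedCycles(n, stages);
            System.out.printf(
                "n=%d sequential=%d cycles (%.3f ops/cycle), pipelined=%d cycles (%.3f ops/cycle)%n",
                n, seq, throughput(n, seq), pipe, throughput(n, pipe));
        }
    }
}
```

For the three instructions in the diagram this gives 15 cycles sequentially and 7 cycles pipelined. As the instruction count grows, the pipelined throughput approaches 1 op per cycle while the latency of each instruction stays at 5 cycles.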

Page 20: How shit works: the CPU

Pipelining

• A pipeline can stall

• This happens with:

– Branches

if (i < 0) i++; else i--;

Memory Load        F D E M W
Test                 F D E M
Conditional Jump       F D E
?                        ????

Page 21: How shit works: the CPU

Pipelining

• A pipeline can stall

• This happens with:

– Branches

– Dependent Instructions

• A.K.A. pipeline bubbling

i++; x = i + 1;

Increment memory address:

Load from memory
Add +1
Store in memory

F D E M W
F D E M
F D Stall
F D Stall

Page 23: How shit works: the CPU

1. Memory is Slow

• RAM access is ~60ns

• Random access on a 4GHz, 64-bit CPU:

– 250 cycles / memory access

– 130MB / second bandwidth

• Surely we can do better!

Image: Noah Wieder (Public Domain)

Source: 7-cpu.com
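The slide's round figures can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes that each random access fetches a single 64-bit (8-byte) word and that accesses never overlap; both are simplifying assumptions of this example, not statements from the talk. The class and method names are illustrative, and the 62 ns value is the main-memory latency from the cache table later in the deck.

```java
// Back-of-the-envelope cost of uncached random memory access.
// Assumptions: one 8-byte word per access, no overlapping accesses.
final class MemoryLatencyMath {
    // Cycles the CPU could have executed while waiting for one access.
    // Nanoseconds times GHz (cycles per nanosecond) gives cycles.
    static double cyclesPerAccess(double latencyNanos, double clockGHz) {
        return latencyNanos * clockGHz;
    }

    // Effective bandwidth in MB/s (1 MB = 10^6 bytes).
    static double bandwidthMBPerSecond(double latencyNanos, int bytesPerAccess) {
        double accessesPerSecond = 1e9 / latencyNanos;
        return accessesPerSecond * bytesPerAccess / 1e6;
    }

    public static void main(String[] args) {
        double clockGHz = 4.0;
        int wordBytes = 8; // one 64-bit word
        for (double latencyNanos : new double[] {60.0, 62.0}) {
            System.out.printf("%.0f ns: %.0f cycles per access, %.1f MB/s%n",
                latencyNanos,
                cyclesPerAccess(latencyNanos, clockGHz),
                bandwidthMBPerSecond(latencyNanos, wordBytes));
        }
    }
}
```

With 60 ns this works out to 240 cycles and roughly 133 MB/s; with 62 ns, 248 cycles and roughly 129 MB/s. Both round to the slide's ~250 cycles and ~130 MB/s.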

Page 24: How shit works: the CPU

Enter: CPU Cache

Level        Size         Latency
L1           32KB + 32KB  1ns
L2           256KB        3ns
L3           4MB          11ns
Main Memory               62ns

Intel i7-6700 “Skylake” at 4 GHz

Image: Ferry24.Milan (CC BY-SA 3.0)

Source: 7-cpu.com

Page 25: How shit works: the CPU

Enter: CPU Cache

• The unit of work is called a cache line

– 64 bytes on x86

– LRU eviction policy

• Why is sequential access fast?

– Cache prefetching
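One rough way to observe cache lines from Java is to walk a large array twice: once touching every element, once touching only one element per 64-byte line. The sketch below is an illustration under assumptions: the class name `CacheLineStride`, the array size and the timing approach are all made up for this example, the 64-byte figure is taken from the slide, and the JVM does not promise that arrays start on a cache-line boundary. `System.nanoTime` timing without controlled warm-up is crude, so the printed numbers are only indicative.

```java
// Walks an int array with different strides to compare memory traffic.
final class CacheLineStride {
    static final int CACHE_LINE_BYTES = 64; // x86 figure from the slide
    static final int INTS_PER_LINE = CACHE_LINE_BYTES / Integer.BYTES; // 16 ints

    // Increments every `stride`-th element; returns how many elements were touched.
    static int touch(int[] data, int stride) {
        int touched = 0;
        for (int i = 0; i < data.length; i += stride) {
            data[i]++;
            touched++;
        }
        return touched;
    }

    public static void main(String[] args) {
        // 4M ints = 16 MB, chosen to exceed the 4 MB L3 from the cache table.
        int[] data = new int[4 * 1024 * 1024];
        for (int round = 0; round < 3; round++) {
            for (int stride : new int[] {1, INTS_PER_LINE}) {
                long start = System.nanoTime();
                int touched = touch(data, stride);
                long elapsedMicros = (System.nanoTime() - start) / 1_000;
                System.out.printf("stride=%d touched=%d elements in %d us%n",
                    stride, touched, elapsedMicros);
            }
        }
    }
}
```

The stride of 16 does one sixteenth of the increments but still visits every cache line of the array, so when memory traffic dominates the two timings tend to be closer than the element counts alone would suggest.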

Page 26: How shit works: the CPU

In Real Life

• Let’s rotate an image! (Strictly, the loop below transposes it.)

for (y = 0; y < height; y++)
    for (x = 0; x < width; x++) {
        int from = y * width + x;
        int to = x * height + y;
        target[to] = source[from];
    }

Image: EgoAltere (CC0 Public Domain)

Page 29: How shit works: the CPU

In Real Life

• This is not efficient

• Reads are sequential

• Writes aren’t, though

• Different strides

– Worst case wins :-(

      0   1   2   3   …   9
 0    0   1   2   3   …   9
 1   10
 2   20
 3   30
 …    …
 9   90

Page 30: How shit works: the CPU

Cache-Friendly Algorithms

• Use blocking or tiling

// Assumes width and height are multiples of blockWidth and blockHeight
for (y = 0; y < height; y += blockHeight)
    for (x = 0; x < width; x += blockWidth)
        for (by = 0; by < blockHeight; by++)
            for (bx = 0; bx < blockWidth; bx++) {
                int from = (y + by) * width + (x + bx);
                int to = (x + bx) * height + (y + by);
                target[to] = source[from];
            }
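For experimenting with this, here is a self-contained sketch of both variants. It is not the talk's benchmark code: the class `TransposeSketch`, the method names and the single square `block` parameter are assumptions made for this example. Unlike the slide's loop, the tiled version clamps each tile to the image bounds, so the dimensions need not be multiples of the block size.

```java
import java.util.Arrays;

// Naive and tiled transposition of a row-major width x height int image.
final class TransposeSketch {
    // Sequential reads, strided writes.
    static int[] transposeNaive(int[] source, int width, int height) {
        int[] target = new int[source.length];
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                target[x * height + y] = source[y * width + x];
        return target;
    }

    // Processes the image in block x block tiles so that the reads and writes
    // of one tile stay within a small set of cache lines.
    static int[] transposeTiled(int[] source, int width, int height, int block) {
        int[] target = new int[source.length];
        for (int y = 0; y < height; y += block) {
            int yEnd = Math.min(y + block, height); // clamp the last row of tiles
            for (int x = 0; x < width; x += block) {
                int xEnd = Math.min(x + block, width); // clamp the last column of tiles
                for (int by = y; by < yEnd; by++)
                    for (int bx = x; bx < xEnd; bx++)
                        target[bx * height + by] = source[by * width + bx];
            }
        }
        return target;
    }

    public static void main(String[] args) {
        int width = 1000, height = 700; // deliberately not multiples of 16
        int[] source = new int[width * height];
        for (int i = 0; i < source.length; i++) source[i] = i;

        int[] naive = transposeNaive(source, width, height);
        int[] tiled = transposeTiled(source, width, height, 16);
        System.out.println("results match: " + Arrays.equals(naive, tiled));
    }
}
```

Both methods produce the same output; only the order in which memory is touched differs, which is where any speed difference would come from.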

Page 31: How shit works: the CPU

Cache-Friendly Algorithms

• The results?

Benchmark                            Mode  Cnt   Score   Error  Units
CachingShowcase.transposeNaive       avgt   10  43.851 ± 6.000  ms/op
CachingShowcase.transposeTiled8x8    avgt   10  20.641 ± 1.646  ms/op
CachingShowcase.transposeTiled16x16  avgt   10  18.515 ± 1.833  ms/op
CachingShowcase.transposeTiled48x48  avgt   10  21.941 ± 1.954  ms/op

x2.37 speedup (naive vs. 16x16 tiles)!

Page 32: How shit works: the CPU

2. Those Pesky Branches

• Do I go left or right?

• Need input!

• … but can’t wait for it

• Maybe...

– Take a guess?

– Based on historic trends?

• Sounds speculative

Image: Michael Dolan (CC BY 2.0)

Page 34: How shit works: the CPU

Back to Our Conundrum

• Can you guess?

– 3…

– 2…

– 1…

• Here it is!

// Generate a bunch of bytes
byte[] data = new byte[32768];
new Random().nextBytes(data);
Arrays.sort(data);

// Sum positive elements
long sum = 0;
for (int i = 0; i < data.length; i++)
    if (data[i] >= 0)
        sum += data[i];

Page 35: How shit works: the CPU

Catharsis

Original data array:

54 10 -4 -2 15 41 -37 13 0 -9 14 25 -61 40

Page 36: How shit works: the CPU

Catharsis

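After sorting:

-61 -37 -9 -4 -2 | 0 10 13 14 15 25 40 41 54

Before the boundary: data[i] >= 0 is always false!

From the boundary on: data[i] >= 0 is always true!

The branch outcome changes only once, so a guess based on recent history is almost always right.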
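To try the conundrum locally, here is a rough sketch. It is not the talk's benchmark: the class `SortedSumDemo`, the fixed seed, the repetition counts and the hand-rolled `System.nanoTime` timing are all assumptions of this example, and a proper benchmark harness would give far more trustworthy numbers. Depending on the JVM and CPU, the JIT may also compile the `if` into branch-free code, in which case the gap can shrink or vanish.

```java
import java.util.Arrays;
import java.util.Random;

// Same summing loop over the same bytes, unsorted vs. sorted.
final class SortedSumDemo {
    static volatile long sink; // written after timing so the work is not dead code

    // The loop from the slide: adds up every element that is >= 0.
    static long sumNonNegative(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++)
            if (data[i] >= 0)
                sum += data[i];
        return sum;
    }

    // Crude timing helper: average nanoseconds per call.
    static double averageNanos(byte[] data, int repetitions) {
        long total = 0;
        long start = System.nanoTime();
        for (int r = 0; r < repetitions; r++)
            total += sumNonNegative(data);
        long elapsed = System.nanoTime() - start;
        sink = total;
        return (double) elapsed / repetitions;
    }

    public static void main(String[] args) {
        byte[] unsorted = new byte[32768];
        new Random(42).nextBytes(unsorted); // fixed seed, only for repeatability
        byte[] sorted = unsorted.clone();
        Arrays.sort(sorted);

        // Warm-up so the loop is likely JIT-compiled before measuring.
        averageNanos(unsorted, 2_000);
        averageNanos(sorted, 2_000);

        System.out.printf("unsorted: %.0f ns/op%n", averageNanos(unsorted, 5_000));
        System.out.printf("sorted:   %.0f ns/op%n", averageNanos(sorted, 5_000));
        System.out.println("same sum: "
            + (sumNonNegative(unsorted) == sumNonNegative(sorted)));
    }
}
```

Sorting changes the order of the elements but not the sum, so any timing difference comes from how the loop executes rather than from how much work it does.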

Page 37: How shit works: the CPU

QUESTIONS?

Thank you for listening

[email protected]

@tomerg

http://engineering.wix.com

Sources and Examples:

https://goo.gl/f7NfGT

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Page 38: How shit works: the CPU

Further Reading

• Jason Robert Carey Patterson – Modern Microprocessors, a 90-Minute Guide

• Igor Ostrovsky – Gallery of Processor Cache Effects

• Piyush Kumar – Cache Oblivious Algorithms