How shit works: the CPU

Tomer Gabel

BuildStuff 2016 Lithuania

Image: Telecarlos (CC BY-SA 3.0)

https://commons.wikimedia.org/wiki/File:Sincalir-QL-68000-Processor.jpg#/media/File:Sincalir-QL-68000-Processor.jpg

Full Disclosure

Bullshit ahead!

• I’m not an expert

• Explanations may be:

– Simplified

– Inaccurate

– Wrong :-)

• We’ll barely scratch the surface

Image: Public Domain

https://commons.wikimedia.org/wiki/File:MUTCD_W11-4.svg#/media/File:MUTCD_W11-4.svg

Are you ready for…

A CONUNDRUM?

Image: Louis Reed (CC BY-SA 4.0)

https://commons.wikimedia.org/wiki/File:The_Riddler_Revenge_Pendulum_Ride_Logo.jpg

Setting the Stage

// Generate a bunch of bytes
byte[] data = new byte[32768];
new Random().nextBytes(data);
Arrays.sort(data);              // only in the "presorted" variant

// Sum positive elements
long sum = 0;
for (int i = 0; i < data.length; i++)
    if (data[i] >= 0)
        sum += data[i];

1. Which is faster?

2. By how much?

3. And crucially… why?!

Surprise, Terror and Ruthless Efficiency

# Run complete. Total time: 00:00:32

Benchmark      Mode  Cnt    Score   Error  Units
Baseline.sum   avgt    6  115.666 ± 3.137  us/op
Presorted.sum  avgt    6   13.741 ± 0.524  us/op

* Ignoring setup cost
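For anyone who wants to poke at the conundrum themselves, here is a minimal, self-contained sketch of the two variants. The class name `Conundrum`, the fixed seed and the crude `System.nanoTime` timing loop are assumptions made for illustration; the numbers on the slide come from a proper benchmark harness, and this sketch will not reproduce them exactly.

```java
import java.util.Arrays;
import java.util.Random;

final class Conundrum {
    // Same loop as on the slide: add up every element that is >= 0.
    static long sum(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++)
            if (data[i] >= 0)
                sum += data[i];
        return sum;
    }

    // Crude wall-clock timing (assumption); a real measurement needs a benchmark harness.
    static double averageMicros(byte[] data, int rounds) {
        long sink = 0;
        for (int i = 0; i < rounds; i++) sink += sum(data);   // warm-up pass
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) sink += sum(data);   // measured pass
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.print("");                 // use the result so the work is not discarded
        return elapsed / 1000.0 / rounds;
    }

    public static void main(String[] args) {
        byte[] baseline = new byte[32768];
        new Random(1234).nextBytes(baseline);                 // fixed seed, chosen arbitrarily
        byte[] presorted = baseline.clone();
        Arrays.sort(presorted);                               // setup cost is not timed

        System.out.printf("baseline:  %.3f us/op%n", averageMicros(baseline, 2000));
        System.out.printf("presorted: %.3f us/op%n", averageMicros(presorted, 2000));
    }
}
```

Both variants compute the same sum; only the order of the data differs. Depending on the JVM, the JIT compiler may turn this branch into branch-free code, in which case the gap can shrink or disappear.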

CPUS ARE COMPLEX BEASTS.

Image: Pauli Rautakorpi (CC BY 3.0)

https://commons.wikimedia.org/wiki/File:Intel_80286_die.JPG

It Is Known

• Your high-level code…

long sum = 0;
for (i = 0; i < length; i++)
    if (data[i] >= 0)
        sum += data[i];

• Gets compiled down to…

movsx  eax,BYTE PTR [rax+rdx*1+0x10]
cmp    eax,0x0
movabs rdx,0x11f3a9f60
movabs rcx,0x128
jl     0x000000010679e077
movabs rcx,0x138
mov    r8,QWORD PTR [rdx+rcx*1]
lea    r8,[r8+0x1]
mov    QWORD PTR [rdx+rcx*1],r8
jl     0x000000010679e092
movsxd rax,eax
add    rax,rbx
mov    rbx,rax
inc    edi

It Is Less Known

• What happens then?

• The instruction goes through phases…

Instruction stream: Fetch → Decode → Execute → Memory Access → Write-back

CPU Architecture 101

Image: Appaloosa (CC BY-SA 3.0)

https://commons.wikimedia.org/wiki/File:Intel_i80286_arch.svg


• What does a CPU do?

– Reads the program

– Figures it out

– Executes it

– Talks to memory

– Performs I/O

• Immense complexity!

Execution Units

• Arithmetic-Logic Unit (ALU)

– Boolean algebra

– Arithmetic

– Memory accesses

– Flow control

• Floating Point Unit (FPU)

• Memory Management Unit (MMU)

– Memory mapping

– Paging

– Access control

Images: ALU by Dirk Oppelt (CC BY-SA 3.0), FPU by Konstantin Lanzet (CC BY-SA 3.0), MMU from unknown source

https://commons.wikimedia.org/wiki/File:National_NS32016D-6_S8520_top.jpg#/media/File:National_NS32016D-6_S8520_top.jpg

https://commons.wikimedia.org/wiki/File:KL_National_NS32081.jpg#/media/File:KL_National_NS32081.jpg

DESIGN CONSIDERATIONS

Image: William M. Plate Jr. (Public Domain)

https://en.wikipedia.org/wiki/Welding#/media/File:GMAW.welding.af.ncs.jpg


Pipelining

Sequential Execution: Latency = 5 cycles, Throughput = 0.2 ops / cycle

Pipelined Execution: Latency = 5 cycles, Throughput = 1 op / cycle

[Diagram: instructions I0, I1 and I2 moving through the Fetch, Decode, Execute, Memory Access and Write-back stages]
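The latency and throughput figures for a five-stage pipeline can be checked with a bit of arithmetic. The sketch below assumes an ideal pipeline (one cycle per stage, no stalls); the class and method names are made up for illustration.

```java
final class PipelineMath {
    static final int STAGES = 5; // Fetch, Decode, Execute, Memory Access, Write-back

    // Sequential: each instruction finishes all stages before the next one starts.
    static long sequentialCycles(long instructions) {
        return instructions * STAGES;
    }

    // Ideal pipeline: the first instruction takes STAGES cycles,
    // after that one instruction completes every cycle.
    static long pipelinedCycles(long instructions) {
        return instructions == 0 ? 0 : STAGES + (instructions - 1);
    }

    static double throughput(long instructions, long cycles) {
        return (double) instructions / cycles;
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.printf("sequential: %.3f ops/cycle%n", throughput(n, sequentialCycles(n)));
        System.out.printf("pipelined:  %.3f ops/cycle%n", throughput(n, pipelinedCycles(n)));
    }
}
```

Latency per instruction stays at 5 cycles in both cases; only the number of instructions in flight changes, which is why throughput approaches 1 op / cycle for long instruction streams.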

Pipelining

• A pipeline can stall

• This happens with:

– Branches

if (i < 0) i++; else i--;

[Pipeline diagram: Memory Load (F D E M W), Test (F D E M), Conditional Jump (F D E), then an unknown next instruction (? ????)]

– Dependent Instructions

i++; x = i + 1;

[Pipeline diagram: "Increment memory address" broken into Load from memory, Add +1 and Store in memory, with a Stall marked in the pipeline]

• A.K.A pipeline bubbling

PRACTICAL RAMIFICATIONS

Image: Hangsna (CC BY-SA 3.0)

https://commons.wikimedia.org/wiki/File:Rubiks_cube_inside.JPG#/media/File:Rubiks_cube_inside.JPG

1. Memory is Slow

• RAM access is ~60ns

• Random access on a 4GHz, 64-bit CPU:

– 250 cycles / memory access

– 130MB / second bandwidth

• Surely we can do better!
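The cycle and bandwidth figures above follow from simple arithmetic. The sketch below assumes the 62 ns main-memory latency quoted for Skylake elsewhere in this deck, one 8-byte word fetched per access, and no overlap between accesses; the class and method names are invented for illustration.

```java
final class MemoryMath {
    // Cycles the core could have spent working while one access is in flight.
    static double cyclesPerAccess(double latencyNs, double clockGhz) {
        return latencyNs * clockGhz;              // ns * (cycles per ns)
    }

    // Bandwidth if every access fetches one word and accesses never overlap.
    static double bandwidthMBps(double latencyNs, int wordBytes) {
        double accessesPerSecond = 1e9 / latencyNs;
        return accessesPerSecond * wordBytes / 1e6;
    }

    public static void main(String[] args) {
        double latencyNs = 62.0;                  // assumption: main-memory latency
        double clockGhz = 4.0;                    // 4GHz clock, as on the slide
        int wordBytes = 8;                        // 64-bit word

        System.out.printf("%.0f cycles per access%n", cyclesPerAccess(latencyNs, clockGhz));
        System.out.printf("%.0f MB/s%n", bandwidthMBps(latencyNs, wordBytes));
    }
}
```

With these inputs the result is roughly 248 cycles per access and about 129 MB per second, which the slide rounds to 250 and 130.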

Image: Noah Wieder (Public Domain)

Source: 7-cpu.com

https://pixabay.com/en/sign-turtle-back-road-rv-road-652815/

http://www.7-cpu.com/cpu/Skylake.html

Enter: CPU Cache

Level        Size         Latency
L1           32KB + 32KB  1ns
L2           256KB        3ns
L3           4MB          11ns
Main Memory  -            62ns

Intel i7-6700 “Skylake” at 4 GHz

Image: Ferry24.Milan (CC BY-SA 3.0)

Source: 7-cpu.com

https://en.wikipedia.org/wiki/Cache_memory#/media/File:L2_Shared_Cache.svg

http://www.7-cpu.com/cpu/Skylake.html

Enter: CPU Cache

• The unit of work is called a cache line

– 64 bytes on x86

– LRU eviction policy

• Why is sequential access fast?

– Cache prefetching
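To get a feel for why access patterns matter, here is a toy model that counts how many distinct 64-byte cache lines a series of `int` reads touches. It is only a sketch: it assumes the array starts on a cache-line boundary and ignores prefetching, eviction and everything else a real cache does. The names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class CacheLineModel {
    static final int LINE_BYTES = 64; // cache line size on x86, per the slide

    // Counts distinct cache lines touched when reading `count` ints that are
    // `strideElems` elements apart, starting at a line-aligned address (assumption).
    static int linesTouched(int count, int strideElems) {
        Set<Long> lines = new HashSet<>();
        for (int i = 0; i < count; i++) {
            long byteOffset = (long) i * strideElems * Integer.BYTES;
            lines.add(byteOffset / LINE_BYTES);
        }
        return lines.size();
    }

    public static void main(String[] args) {
        System.out.println("sequential: " + linesTouched(1024, 1) + " lines");
        System.out.println("stride 16:  " + linesTouched(1024, 16) + " lines");
    }
}
```

Reading 1024 consecutive ints touches 64 lines, since 16 ints share each line. Reading 1024 ints that are 16 elements apart touches 1024 lines, one per read, so every access pays for a full line and uses only 4 bytes of it.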

In Real Life

• Let’s rotate an image!

for (y = 0; y < height; y++)
    for (x = 0; x < width; x++) {
        int from = y * width + x;
        int to = x * height + y;
        target[to] = source[from];
    }

Image: EgoAltere (CC0 Public Domain)

https://pixabay.com/en/cat-animal-eyes-grey-view-views-351926/

In Real Life

• This is not efficient

• Reads are sequential

• Writes aren’t, though

• Different strides

– Worst case wins :-(

[Diagram: a 10 × 10 grid; reads walk the source at indices 0, 1, 2, 3, … 9 along a row, while writes land in the target at indices 0, 10, 20, 30, … 90 down a column]

Cache-Friendly Algorithms

• Use blocking or tiling

// Assumes width and height are exact multiples of the block size
for (y = 0; y < height; y += blockHeight)
    for (x = 0; x < width; x += blockWidth)
        for (by = 0; by < blockHeight; by++)
            for (bx = 0; bx < blockWidth; bx++) {
                int from = (y + by) * width + (x + bx);
                int to = (x + bx) * height + (y + by);
                target[to] = source[from];
            }
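Here is a runnable sketch of both loops side by side, useful for checking that tiling changes only the order of the work and not the result. Unlike the slide version, the tiled loop clamps each block with `Math.min`, so it also copes with dimensions that are not multiples of the block size. The class name, the square `block` parameter and the image size in `main` are assumptions for illustration.

```java
import java.util.Arrays;

final class Transpose {
    // Row-by-row copy: sequential reads, strided writes.
    static void naive(int[] source, int[] target, int width, int height) {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                target[x * height + y] = source[y * width + x];
    }

    // Same copy, processed one block x block tile at a time.
    static void tiled(int[] source, int[] target, int width, int height, int block) {
        for (int y = 0; y < height; y += block) {
            int yEnd = Math.min(y + block, height);   // clamp the last row of tiles
            for (int x = 0; x < width; x += block) {
                int xEnd = Math.min(x + block, width); // clamp the last column of tiles
                for (int by = y; by < yEnd; by++)
                    for (int bx = x; bx < xEnd; bx++)
                        target[bx * height + by] = source[by * width + bx];
            }
        }
    }

    public static void main(String[] args) {
        int width = 1000, height = 700;               // arbitrary, not multiples of 8
        int[] source = new int[width * height];
        for (int i = 0; i < source.length; i++) source[i] = i;

        int[] a = new int[source.length];
        int[] b = new int[source.length];
        naive(source, a, width, height);
        tiled(source, b, width, height, 8);
        System.out.println("same result: " + Arrays.equals(a, b));
    }
}
```

The idea behind tiling is that a small tile of both source and target fits in cache at once, so the strided writes land on cache lines that are still resident. The best block size depends on the element size and the cache, so treat the 8 used here as a starting point to measure against.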

Cache-Friendly Algorithms

• The results?

Benchmark                          Mode  Cnt   Score   Error  Units
CachingShowcase.transposeNaive     avgt   10  43.851 ± 6.000  ms/op
CachingShowcase.transposeTiled8x8  avgt   10  20.641 ± 1.646  ms/op







x2.37 speedup!

2. Those Pesky Branches

• Do I go left or right?

• Need input!

• … but can’t wait for it

• Maybe...

– Take a guess?

– Based on historic trends?

• Sounds speculative

Image: Michael Dolan (CC BY 2.0)

https://www.flickr.com/photos/emilyrides/4568437675

Those Pesky Branches

• Enter: Branch Prediction

• Concurrently:

– Speculate branch

– Evaluate condition

• It’s now a tradeoff

– Commit is fast

– Rollback is slow

Image: Alejandro C. (CC BY-NC 2.0)

https://www.flickr.com/photos/zoso_tc/4668432540/


Back to Our Conundrum

• Can you guess?

– 3…

– 2...

– 1...

• Here it is!

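// Generate a bunch of bytes
byte[] data = new byte[32768];
new Random().nextBytes(data);
Arrays.sort(data);

// Sum positive elements
long sum = 0;
for (int i = 0; i < data.length; i++)
    if (data[i] >= 0)
        sum += data[i];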

Catharsis

Original data array:

54 10 -4 -2 15 41 -37 13 0 -9 14 25 -61 40

Catharsis

After sorting:

-61 -37 -9 -4 -2 0 10 13 14 15 25 40 41 54

Before the 0: data[i] >= 0 is always false!

From the 0 onward: data[i] >= 0 is always true!
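To see why "always false, then always true" is such good news for the CPU, here is a toy simulation of one classic prediction scheme, a 2-bit saturating counter. This is an illustrative stand-in: real branch predictors are far more elaborate, and nothing here claims to match what any particular CPU does. The class name, the starting counter state and the seed are all assumptions.

```java
import java.util.Arrays;
import java.util.Random;

final class BranchPredictorSim {
    // Counts how often a single 2-bit saturating counter guesses the outcome of
    // `data[i] >= 0` wrong. States 0 and 1 predict false, states 2 and 3 predict true.
    static int mispredictions(byte[] data) {
        int counter = 1;                           // arbitrary starting state
        int misses = 0;
        for (byte b : data) {
            boolean actual = b >= 0;
            boolean predicted = counter >= 2;
            if (predicted != actual) misses++;
            // Nudge the counter toward the observed outcome, saturating at 0 and 3.
            counter = actual ? Math.min(3, counter + 1) : Math.max(0, counter - 1);
        }
        return misses;
    }

    public static void main(String[] args) {
        byte[] unsorted = new byte[32768];
        new Random(1234).nextBytes(unsorted);      // fixed seed, chosen arbitrarily
        byte[] sorted = unsorted.clone();
        Arrays.sort(sorted);

        System.out.println("unsorted mispredictions: " + mispredictions(unsorted));
        System.out.println("sorted mispredictions:   " + mispredictions(sorted));
    }
}
```

On sorted data this toy predictor only guesses wrong around the single point where the values cross zero. On random data the outcome is close to a coin flip, so it is wrong about half the time, and each wrong guess stands for a slow rollback.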

QUESTIONS?

Thank you for listening

[email protected]

@tomerg

http://engineering.wix.com

Sources and Examples:

https://goo.gl/f7NfGT

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


http://creativecommons.org/licenses/by-sa/4.0/

Further Reading

• Jason Robert Carey Patterson - Modern Microprocessors, a 90-Minute Guide

http://www.lighterra.com/papers/modernmicroprocessors/

• Igor Ostrovsky - Gallery of Processor Cache Effects

http://igoro.com/archive/gallery-of-processor-cache-effects/

• Piyush Kumar - Cache Oblivious Algorithms

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.150.5426&rep=rep1&type=pdf
