Samsung Exynos M1 Processor

Brad Burgess, VP, Chief CPU Architect

Samsung Austin R&D Center – SARC
Hot Chips 2016

M1 Processor
• ISA: ARMv8.0, 64-bit/32-bit compliant
– AArch64 (A64), AArch32 (A32 + T32, formerly ARM + Thumb) instruction sets
• Samsung’s first in-house, from-scratch design
• 3-year design cycle, requirements to tapeout
• Best-in-class quad-core design
• 2.6 GHz, sub-3 W per core
• 14nm FinFET

Microarchitecture Overview
• Advanced Branch Prediction
• Quad instruction decode
– Most instructions map directly to a single uop
• Quad uop dispatch and retire
• Full Out-of-Order instruction execution
– Full OoO loads and stores
• Multistride / multistream prefetcher
• Low latency / low power caches

Samsung M1 Micro-Architecture

[Figure: core block diagram. Blocks shown: Branch Prediction, Addr Q, Icache, Inst Q, Decode, Rename, Disp Q, eight schedulers (Sch), Branch, ALU, ALU, ALUC, iMUL, iDIV, LdAddr, StAddr, StData, FMAC, FAdd, iSIMD (x2), FDiv, iMUL, Fst, Fcvt, Crypt, Dcache, dTAG, Align]

Branch Prediction:
• Neural Net based predictor
• Two branches/cycle
• Fetch up to 24 bytes/cycle
• 64-entry microBTB
• 4k-entry mainBTB
• 64-entry Call/Return Stack
• Indirect Predictor
• Loop Predictor
• Decoupled AddrQ
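The slide says only that the predictor is "Neural Net based" and gives no further detail. As a rough illustration of the general idea, the sketch below implements a textbook perceptron branch predictor. The table size, history length, class name, and training threshold are all assumptions chosen for the example, and nothing here should be read as the M1's actual design.

```python
class PerceptronPredictor:
    """Toy perceptron branch predictor (illustrative only, not the M1 design)."""

    def __init__(self, table_size=64, history_len=8):
        self.history_len = history_len
        # Commonly cited training threshold for perceptron predictors (assumed here).
        self.threshold = int(1.93 * history_len + 14)
        # One weight vector per table entry; index 0 is the bias weight.
        self.weights = [[0] * (history_len + 1) for _ in range(table_size)]
        # Global history, oldest first: +1 = taken, -1 = not taken.
        self.history = [-1] * history_len

    def _output(self, pc):
        w = self.weights[pc % len(self.weights)]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        """Predict taken when the weighted sum is non-negative."""
        return self._output(pc) >= 0

    def update(self, pc, taken):
        """Train on the resolved outcome, then shift it into the history."""
        y = self._output(pc)
        t = 1 if taken else -1
        w = self.weights[pc % len(self.weights)]
        # Train on a misprediction or when the output is not yet confident.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            w[0] += t
            for i, hi in enumerate(self.history):
                w[i + 1] += t * hi
        self.history = self.history[1:] + [t]


# Usage: a branch that alternates taken / not taken is learned quickly.
predictor = PerceptronPredictor()
for i in range(50):
    predictor.update(0x40, taken=(i % 2 == 0))
```

A perceptron keeps one small weight per history bit, so the history length can grow without the table size growing exponentially, which is the usual argument for this family of predictors.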

Instruction Cache:
• 64kB/4-way
• 128-byte line size
• Read 24 bytes/cycle
• Parity Protected
• Decoupled InstQ
• 256-entry iTLB
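The size, associativity, and line size above determine the number of sets and the address bit split. The short calculation below works that out. The function name is hypothetical, and the derived numbers are arithmetic from the slide's parameters, not figures stated in the presentation. It assumes power-of-two sizes.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits, way_bytes) for a set-associative cache.

    Assumes size, associativity, and line size are all powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2(line_bytes)
    index_bits = sets.bit_length() - 1         # log2(sets)
    way_bytes = size_bytes // ways
    return sets, offset_bits, index_bits, way_bytes


# Instruction cache parameters from the slide: 64kB, 4-way, 128-byte lines.
print(cache_geometry(64 * 1024, 4, 128))  # (128, 7, 7, 16384)
```

Under these assumptions the instruction cache has 128 sets, with 7 offset bits and 7 index bits, and each way covers 16kB.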

Decode / Rename / Retire:
• Decode 4 inst/cycle
• AArch64, AArch32
• Sequencer for multi-uop
• Rename 4 uops/cycle
• Special renaming for FP
• Fast map recovery
• Retire 4 uops/cycle
• 96-entry ROB
• Dispatch 4 uops/cycle
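The slide lists renaming and "Fast map recovery" without describing the mechanism. One common way to recover a rename map after a mispredicted branch is to checkpoint it, and the sketch below shows that general technique. The class, its method names, and the register counts are assumptions for illustration, and the model is deliberately simplified: it does not free registers at retire and it restores the free list from the checkpoint.

```python
class RenameMap:
    """Toy register rename map with checkpoint recovery (illustrative only)."""

    def __init__(self, arch_regs=32, phys_regs=96):
        # Initially each architectural register maps to the same-numbered physical one.
        self.map = list(range(arch_regs))
        self.free = list(range(arch_regs, phys_regs))
        self.checkpoints = {}

    def rename(self, dst, srcs):
        """Rename one uop: read source mappings, then allocate a new destination."""
        phys_srcs = [self.map[s] for s in srcs]
        new_dst = self.free.pop(0)
        self.map[dst] = new_dst
        return new_dst, phys_srcs

    def checkpoint(self, tag):
        """Snapshot the map, for example at a predicted branch."""
        self.checkpoints[tag] = (list(self.map), list(self.free))

    def recover(self, tag):
        """Restore the snapshot after a misprediction."""
        self.map, self.free = self.checkpoints.pop(tag)


# Usage: rename, checkpoint at a branch, rename speculatively, then recover.
rm = RenameMap()
rm.rename(dst=1, srcs=[2, 3])
rm.checkpoint("branch0")
rm.rename(dst=1, srcs=[1])
rm.recover("branch0")
```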

Integer Execution:
• Issue up to 7 uops/cycle
• 96-entry integer PRF
• 58-entry distributed scheduler
• Branch resolution
• ALUC: three-source uops
• ALU: two-source uops
• Load Address Adder
• Store Address Adder
• Store Data

[Figure: integer portion of the block diagram, with seven scheduler-fed units: Branch, ALUC, ALU, ALU, LdAddr, StAddr, StData]

Floating Point Execution:
• 32-entry scheduler
• 96-entry FP PRF
• FMAC: 5-cycle MAC, 4-cycle Mul
• FADD: 3-cycle
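The latencies above matter most for dependent chains such as a dot-product accumulation. The estimate below shows how the 5-cycle MAC latency bounds a serial chain and how splitting the sum across independent accumulators hides it. The function name is hypothetical, and the model assumes a fully pipelined FMAC accepting one uop per cycle, which the slide does not state.

```python
import math

FMAC_LATENCY = 5  # cycles, from the slide
FMUL_LATENCY = 4  # cycles, from the slide
FADD_LATENCY = 3  # cycles, from the slide


def mac_chain_cycles(n_ops, accumulators=1, latency=FMAC_LATENCY):
    """Rough cycle estimate for n_ops multiply-accumulates.

    Assumes one fully pipelined FMAC unit issuing one uop per cycle
    (an assumption, not a figure from the presentation). Each accumulator
    forms its own dependent chain; the unit can never beat one op per cycle.
    """
    chain_length = math.ceil(n_ops / accumulators)
    return max(chain_length * latency, n_ops)


print(mac_chain_cycles(100, accumulators=1))  # 500
print(mac_chain_cycles(100, accumulators=5))  # 100
```

Under this model, five independent accumulators are enough to cover the 5-cycle MAC latency; a single accumulator leaves the unit idle four cycles out of five.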

Load/Store:
• 32kB/8-way Data Cache
• 64-byte line size
• ECC protected
• Out-of-Order loads and stores
• 1 load/cycle, 4-cycle latency
• 1 store/cycle
• 8 outstanding misses
• 32-entry dTLB
• 1024-entry L2TLB
• Multi-stride Prefetcher
• Stream/Copy Optimizations
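The slide names a multi-stride prefetcher but does not describe how it works. To illustrate the basic idea of stride detection, here is a minimal per-instruction stride table of the kind found in textbooks. The class name, the prefetch degree, and the two-matching-strides confidence rule are assumptions, and the real M1 prefetcher is presumably more elaborate.

```python
class StridePrefetcher:
    """Toy per-PC stride prefetcher (illustrative only, not the M1 design)."""

    def __init__(self, degree=2):
        self.degree = degree  # how many lines ahead to prefetch
        self.table = {}       # load PC -> (last address, last stride)

    def access(self, pc, addr):
        """Record a load and return the addresses to prefetch, if any."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, 0)
            return []
        last_addr, last_stride = entry
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        # Prefetch only once the same non-zero stride is seen twice in a row.
        if stride == 0 or stride != last_stride:
            return []
        return [addr + stride * k for k in range(1, self.degree + 1)]


# Usage: a load walking an array with a 64-byte stride.
pf = StridePrefetcher()
for address in (0x1000, 0x1040, 0x1080):
    targets = pf.access(pc=0x400, addr=address)
print([hex(t) for t in targets])  # ['0x10c0', '0x1100']
```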

L2/BIU:
• 2MB, 16-way
• Four banks
• Inclusive w/ filtering
• 22-cycle latency
• 16 bytes/cycle/CPU

[Figure: L2 organization with four banks (L2 Bank0 to L2 Bank3), four cores (CPU0 to CPU3), L2 Ctrl, and BIU]
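To put the L2 latency and bandwidth figures in absolute terms, the calculation below converts them using the 2.6 GHz clock quoted earlier in the deck. It assumes the L2 interface runs at the core clock, which the slides do not state, so the results are an estimate and not published numbers.

```python
CORE_CLOCK_HZ = 2.6e9    # 2.6 GHz, from the M1 Processor slide
L2_LATENCY_CYCLES = 22   # from the L2/BIU slide
L2_BYTES_PER_CYCLE = 16  # per CPU, from the L2/BIU slide
DCACHE_LINE_BYTES = 64   # from the Load/Store slide


def latency_ns(cycles, clock_hz=CORE_CLOCK_HZ):
    """Convert a latency in cycles to nanoseconds."""
    return cycles / clock_hz * 1e9


def peak_bandwidth_gb_s(bytes_per_cycle, clock_hz=CORE_CLOCK_HZ):
    """Peak bandwidth in GB/s (10^9 bytes per second)."""
    return bytes_per_cycle * clock_hz / 1e9


def line_transfer_cycles(line_bytes, bytes_per_cycle):
    """Cycles needed to move one cache line across the interface."""
    return line_bytes // bytes_per_cycle


print(f"{latency_ns(L2_LATENCY_CYCLES):.2f} ns")                     # 8.46 ns
print(f"{peak_bandwidth_gb_s(L2_BYTES_PER_CYCLE):.1f} GB/s")         # 41.6 GB/s
print(line_transfer_cycles(DCACHE_LINE_BYTES, L2_BYTES_PER_CYCLE))   # 4
```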

Samsung M1 Basic Pipeline

[Figure: pipeline stage diagram. Stage labels as extracted, by row:
B0 F3 F4 D1 D2 D3 RN1 RN2 Disp
Sch RegRd EX WB
Sch RegRd EA Tag Data Align
Arb1 Arb2 L2Tag L2Data Arb2
B1 B1 B1 B1 F2
EX]

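[Figure: annotated layout of the core. Labeled blocks: Rename, Retire; FP PRF; Instruction Decoders; ROB; 64kB Icache; 4K-entry mBTB; Fill Buffers; 32kB Dcache; Integer Execution; 1024-entry L2TLB; Integer PRF; 64-entry uBTB]

[Figure: annotated layout of the L2. Labeled blocks: L2 Tags; L2 Data; Bank Logic; Centralized Arb/Muxing]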

Samsung M1 Single-core Performance

[Figure: bar chart comparing 2.6 GHz M1 and 2.1 GHz A57. Axis range 0 to 4000. Extracted data labels: 8996, 6426]

Samsung M1 Multi-core Performance

[Figure: bar chart comparing 2.3 GHz M1 and 2.1 GHz A57. Axis range 0 to 12000. Extracted data labels: 27767, 27134]

Samsung M1 Single-core Power and Efficiency

[Figure: per-benchmark chart of M1/A57 (M1 @2.3 GHz, [email protected]). Series: Efficiency (perf/W), Power (M1/A57). Axis ranges: -0.4 to 2.1 and 0.0 to 3.0. Benchmarks: AES, Twofish, SHA1, SHA2, BZIP2C, BZIP2D, JPEGC, JPEGD, PNGC, PNGD, Sobel, Lua, Dijkstra, Blackscholes, Mandelbrot, Sharpen Filter, Blur Filter, SGEMM, DGEMM, SFFT, DFFT, Nbody, Ray trace, Str copy, Str scale, Str add, Str triad, Integer, Floating, Memory, Overall]

Conclusion
• Samsung’s first from-scratch CPU design
• Very aggressive, 3-year design schedule
• Cleanly into production on time
• Performance and efficiency gains over prior generation
• Best-in-class quad-core mobile design
• More to come…