Samsung Exynos M1 Processor
Brad Burgess VP, Chief CPU Architect
Samsung Austin R&D Center – SARC HotChips 2016
M1 Processor
• ISA - ARM v8.0, 64-bit/32-bit compliant.
– AArch64 (A64), AArch32 (A32 + T32, formerly ARM+Thumb) instruction sets
• Samsung’s first in-house, from-scratch design.
• 3-year design cycle - Reqs to Tapeout.
• Best in class, quad-core design.
• 2.6 GHz, sub 3-watt/core
• 14nm FinFET
Microarchitecture Overview
• Advanced Branch Prediction
• Quad instruction decode
– Most instructions map directly to a single uop
• Quad uop dispatch and retire
• Full Out-of-Order instruction execution
– Full OoO loads and stores
• Multistride / multistream prefetcher
• Low latency / low power caches
Samsung M1 Micro-Architecture
[Block diagram. Labeled units: Branch Prediction, Addr Q, Icache, Inst Q, Decode, Rename, Disp Q, Sch (x8), Branch, ALU (x2), ALUC, iMUL (x2), iDIV, LdAddr, StAddr, StData, FMAC, FAdd, iSIMD (x2), FDiv, Fst, Fcvt, Crypt, dTAG, Dcache, Align]
Branch Prediction:
• Neural Net based predictor
• Two branches/cycle
• Fetch up to 24-bytes/cycle
• 64-entry microBTB
• 4k-entry mainBTB
• 64-entry Call/Return Stack
• Indirect Predictor
• Loop Predictor
• Decoupled AddrQ
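The slide says only "Neural Net based predictor" and gives no internals. As an illustration of what that term usually refers to, here is a minimal sketch of the textbook perceptron branch predictor from the research literature. It is not confirmed to be the M1 design: the class name, table size, history length, and indexing below are all hypothetical choices made for the example.

```python
class PerceptronPredictor:
    """Textbook perceptron branch predictor (illustrative, not the M1 design)."""

    def __init__(self, num_rows=64, history_len=16):
        # num_rows and history_len are hypothetical; the slides give no sizes.
        self.history_len = history_len
        # Training threshold commonly used for perceptron predictors.
        self.threshold = int(1.93 * history_len + 14)
        # One weight vector per row; index 0 is the bias weight.
        self.weights = [[0] * (history_len + 1) for _ in range(num_rows)]
        # Global history as +1 (taken) / -1 (not taken), oldest first.
        self.history = [-1] * history_len

    def _row(self, pc):
        # A64 instructions are 4 bytes, so drop the two low PC bits.
        return self.weights[(pc >> 2) % len(self.weights)]

    def _output(self, pc):
        w = self._row(pc)
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        """Predict taken when the dot product is non-negative."""
        return self._output(pc) >= 0

    def update(self, pc, taken):
        """Train on a mispredict or a low-confidence output, then shift history."""
        y = self._output(pc)
        t = 1 if taken else -1
        if (y >= 0) != taken or abs(y) <= self.threshold:
            w = self._row(pc)
            w[0] += t
            for i, hi in enumerate(self.history, start=1):
                w[i] += t * hi
        self.history = self.history[1:] + [t]


# Usage: a branch that alternates taken / not taken is learned from history.
predictor = PerceptronPredictor()
correct = 0
for i in range(200):
    outcome = (i % 2 == 0)
    if i >= 100 and predictor.predict(0x4000) == outcome:
        correct += 1
    predictor.update(0x4000, outcome)
print(f"correct predictions in the last 100 branches: {correct}")
```

The point of the sketch is the structure: a small table of signed weights, a dot product against global history, and training only when the prediction is wrong or weak.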
Instruction Cache:
• 64kB/4-way
• 128-byte line size
• Read 24-bytes/cycle
• Parity Protected
• Decoupled InstQ
• 256 entry iTLB
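To put the 24-bytes/cycle read width in context, the arithmetic below compares it with the quad decode width stated on the overview slide. This is a rough upper-bound sketch: it ignores fetch alignment, taken branches, and cache misses, and the helper name `insns_per_fetch` is made up for the example.

```python
# Values taken from the slides.
FETCH_BYTES_PER_CYCLE = 24   # "Read 24-bytes/cycle"
DECODE_WIDTH = 4             # "Quad instruction decode"

# Architectural encoding sizes: A64 and A32 use fixed 4-byte encodings;
# T32 mixes 2-byte and 4-byte encodings.
A64_INSN_BYTES = 4
T32_MIN_INSN_BYTES = 2


def insns_per_fetch(insn_bytes, fetch_bytes=FETCH_BYTES_PER_CYCLE):
    """Upper bound on whole instructions delivered by one fetch."""
    return fetch_bytes // insn_bytes


a64_per_fetch = insns_per_fetch(A64_INSN_BYTES)
fetch_surplus = a64_per_fetch - DECODE_WIDTH
print(f"A64: up to {a64_per_fetch} insts fetched/cycle, "
      f"{fetch_surplus} more than decode consumes")
print(f"T32: up to {insns_per_fetch(T32_MIN_INSN_BYTES)} insts fetched/cycle "
      f"(if all encodings are 16-bit)")
```

Under these assumptions, fetch can outrun decode on A64 code, which is consistent with a decoupled InstQ sitting between the two stages to absorb the difference.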
Decode / Rename / Retire:
• Decode 4 inst/cycle
• AArch64, AArch32
• Sequencer for multi-uop
• Rename 4-uops/cycle
• Special renaming for FP
• Fast map recovery
• Retire 4-uops/cycle
• 96-entry ROB
• Dispatch 4-uops/cycle
Integer Execution:
• Issue up to 7 uops/cycle
• 96-entry integer PRF
• 58-entry distributed sched.
• Branch resolution
• ALUC – three source uops
• ALU – two source uops
• Load Address Adder
• Store Address Adder
• Store Data
[Diagram highlight: Branch, ALUC, ALU, ALU, LdAddr, StAddr, StData]
Floating Point Execution:
• 32-entry scheduler
• 96-entry FP PRF
• FMAC: 5-cycle MAC, 4-cycle Mul
• FADD: 3-cycle
Load/Store:
• 32kB/8-way Data Cache
• 64-byte line size
• ECC protected
• Out-of-Order loads and stores
• 1-Load/cycle, 4-cycle latency
• 1-Store/cycle
• 8-outstanding misses
• 32-entry dTLB
• 1024-entry L2TLB
• Multi-stride Prefetcher
• Stream/Copy Optimizations
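The size, associativity, and line-size figures determine each L1 cache's set count and address split. The sketch below works that out for the data cache on this slide and, for comparison, the instruction cache described earlier. It assumes conventional power-of-two set-associative indexing, which the slides do not confirm; `cache_geometry` is a hypothetical helper.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Derive set count and address-bit split for a set-associative cache.

    Assumes power-of-two sizes and plain modulo indexing.
    """
    sets = size_bytes // (ways * line_bytes)
    return {
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2(line size)
        "index_bits": sets.bit_length() - 1,         # log2(number of sets)
        "way_bytes": size_bytes // ways,
    }


dcache = cache_geometry(32 * 1024, ways=8, line_bytes=64)
icache = cache_geometry(64 * 1024, ways=4, line_bytes=128)

for name, geom in (("Dcache", dcache), ("Icache", icache)):
    print(name, geom)
```

With these inputs each data-cache way covers 4 KB, the size of the smallest ARMv8 translation granule, so the index and offset bits would fall entirely within the page offset under the indexing assumed here. Each instruction-cache way covers 16 KB.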
L2/BIU:
• 2MB, 16-way
• Four banks
• Inclusive w/ filtering
• 22 Cycle latency
• 16-bytes/cycle/CPU
[Cluster diagram. Labeled blocks: CPU0, CPU1, CPU2, CPU3, L2 Bank0, L2 Bank1, L2 Bank2, L2 Bank3, L2 Ctrl, BIU]
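The per-cycle figures translate into absolute bandwidth and latency once a clock is fixed. The sketch below assumes the L2 interface runs at the 2.6 GHz core clock quoted on the first slide and that all four CPUs can stream at once. Neither assumption is stated in the slides, so treat the results as upper bounds.

```python
CORE_HZ = 2.6e9                  # core clock from the first slide (assumed to apply to L2)
L2_BYTES_PER_CYCLE_PER_CPU = 16  # "16-bytes/cycle/CPU"
L2_LATENCY_CYCLES = 22           # "22 Cycle latency"
NUM_CPUS = 4                     # quad-core cluster

# Peak L2 bandwidth per CPU and for the whole cluster, in GB/s (1 GB = 1e9 bytes).
per_cpu_gbps = L2_BYTES_PER_CYCLE_PER_CPU * CORE_HZ / 1e9
cluster_gbps = per_cpu_gbps * NUM_CPUS

# L2 hit latency converted from cycles to nanoseconds.
latency_ns = L2_LATENCY_CYCLES / CORE_HZ * 1e9

print(f"peak L2 bandwidth per CPU: {per_cpu_gbps:.1f} GB/s")
print(f"peak L2 bandwidth, 4 CPUs: {cluster_gbps:.1f} GB/s")
print(f"L2 latency: {latency_ns:.2f} ns")
```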
Samsung M1 Basic Pipeline
[Pipeline diagram. Stage labels as extracted: B0, B1 (x4), F2, F3, F4, D1, D2, D3, RN1, RN2, Disp; Sch, RegRd, EX, WB; Sch, RegRd, EA, Tag, Data, Align; Arb1, Arb2, L2Tag, L2Data, Arb2]
[Core layout figure. Labeled regions: Rename, Retire; FP PRF; Instruction Decoders; ROB; 64kB Icache; 4K-entry mBTB; Fill Buffers; 32kB Dcache; Integer Execution; 1024-entry L2TLB; Integer PRF; 64-entry uBTB]
Samsung M1 Single-core Performance
[Bar chart. Bar labels as extracted:]
• 2.6 GHz M1: 8996
• 2.1 GHz A57: 6426
Samsung M1 Multi-core Performance
[Bar chart. Bar labels as extracted:]
• 2.3 GHz M1: 27767
• 2.1 GHz A57: 27134
Samsung M1 Single-core Power and Efficiency
[Chart of M1/A57 ratios per workload (M1 @ 2.3 GHz, [email protected]). Series: Efficiency (perf/W), Power (M1/A57)]
• Workloads: AES, Twofish, SHA1, SHA2, BZIP2C, BZIP2D, JPEGC, JPEGD, PNGC, PNGD, Sobel, Lua, Dijkstra, Blackscholes, Mandelbrot, Sharpen Filter, Blur Filter, SGEMM, DGEMM, SFFT, DFFT, Nbody, Ray trace, Str copy, Str scale, Str add, Str triad
• Summary groups: Integer, Floating, Memory, Overall