The AMD K8 Processor Architecture
December 14th 2006
K7 vs K8
K7: 3 x86 decoding units, 3 integer units (ALU), 3 floating point units (FPU), 128KB L1 cache
K8:
  3 decoders (16 bytes of instructions per clock cycle)
  x86 instructions are decoded into fixed-length micro-operations (µOPs)
  Complex instructions are decoded into two or more µOPs
  FastPath: certain µOPs are packed together
  µOPs are then dispatched to the execution units
  3 Address Generation Units (AGU) for loads and stores
  3 integer units (ALU): most µOPs execute in one cycle; multiplication has a 3-cycle latency in 32 bits and a 5-cycle latency in 64 bits
  3 floating point units (FPU) that handle x87, MMX, 3DNow!, SSE and SSE2 instructions
  Load/Store stage: the L1 is dual-ported, meaning it can handle two 64-bit reads or writes each clock cycle
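As a rough illustration of what the dual-ported L1 means in practice, the sketch below turns the per-cycle figures from the slide into a peak bandwidth number. The 2.0 GHz clock is an assumption chosen for the example, not a value from the slides, and the function names are hypothetical.

```python
# Figures taken from the slide text.
L1_PORTS = 2             # the L1 is dual-ported
L1_PORT_WIDTH_BITS = 64  # each port handles one 64-bit read or write per cycle


def l1_peak_bytes_per_cycle(ports=L1_PORTS, width_bits=L1_PORT_WIDTH_BITS):
    """Upper bound on bytes moved between the L1 and the core per clock cycle."""
    return ports * width_bits // 8


def peak_bandwidth_gb_per_s(bytes_per_cycle, clock_ghz):
    """Bytes per cycle times cycles per nanosecond gives GB/s."""
    return bytes_per_cycle * clock_ghz


if __name__ == "__main__":
    clock_ghz = 2.0  # ASSUMPTION: example clock, not stated in the slides
    per_cycle = l1_peak_bytes_per_cycle()
    print(f"{per_cycle} bytes/cycle, "
          f"{peak_bandwidth_gb_per_s(per_cycle, clock_ghz)} GB/s at {clock_ghz} GHz")
```

This is a theoretical ceiling only: it ignores misses, bank conflicts and any limit imposed by the rest of the pipeline.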
K8 Hammer Microarchitecture
K7 vs K8 Pipelines
K8 L1 and L2 Cache
The L1 cache
CPU              K8               Athlon XP        Pentium 4 Northwood  Pentium 4 Prescott
Size             code: 64KB       code: 64KB       TC: 12Kµops          TC: 12Kµops
                 data: 64KB       data: 64KB       data: 8KB            data: 16KB
Associativity    code: 2 way      code: 2 way      TC: 8 way            TC: 8 way
                 data: 2 way      data: 2 way      data: 4 way          data: 8 way
Cache line size  code: 64 bytes   code: 64 bytes   TC: n.a.             TC: n.a.
                 data: 64 bytes   data: 64 bytes   data: 64 bytes       data: 64 bytes
Write policy     Write Back       Write Back       Write Through        Write Through
Latency          3 cycles         3 cycles         2 cycles             4 cycles
(TC: trace cache)
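The size, associativity and line size in the table determine how many sets each cache has and how an address splits into offset and index bits. The sketch below works this out for the four L1 data caches. The function name and dictionary are hypothetical, and it assumes all three parameters are powers of two.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return set count and address-bit split for a set-associative cache.

    Assumes size, associativity and line size are all powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    return {
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2(line size)
        "index_bits": sets.bit_length() - 1,         # log2(number of sets)
    }


KB = 1024
# (size, ways, line size) of the L1 data caches, as listed in the table.
L1_DATA = {
    "K8": (64 * KB, 2, 64),
    "Athlon XP": (64 * KB, 2, 64),
    "Pentium 4 Northwood": (8 * KB, 4, 64),
    "Pentium 4 Prescott": (16 * KB, 8, 64),
}

if __name__ == "__main__":
    for cpu, params in L1_DATA.items():
        print(cpu, cache_geometry(*params))
```

With these figures the K8 data cache comes out at 512 sets (9 index bits), while both Pentium 4 data caches come out at 32 sets (5 index bits).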
The L2 cache
CPU                              K8                 Athlon XP      Pentium 4 Northwood  Pentium 4 Prescott
Size                             512KB (Newcastle)  256 and 512KB  512KB                1024KB
                                 1024KB (Hammer)
Associativity                    16 way             16 way         8 way                8 way
Cache line size                  64 bytes           64 bytes       64 bytes             64 bytes
Latency (given by manufacturer)  ?                  8 cycles       7 cycles             11 cycles
Bus width                        128 bits           64 bits        256 bits             256 bits
L1 relationship                  exclusive          exclusive      inclusive            inclusive
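One way to read the bus width row is as the number of bus transfers needed to move a full 64-byte line between L2 and L1. The sketch below does that arithmetic. It assumes each transfer carries exactly one bus width of data and says nothing about how many clock cycles a transfer takes; the names are hypothetical.

```python
LINE_BYTES = 64  # cache line size from the table (same for all four CPUs)

# L2 bus width in bits, as listed in the table.
L2_BUS_BITS = {
    "K8": 128,
    "Athlon XP": 64,
    "Pentium 4 Northwood": 256,
    "Pentium 4 Prescott": 256,
}


def transfers_per_line(bus_bits, line_bytes=LINE_BYTES):
    """Number of bus-width transfers needed to move one cache line."""
    return (line_bytes * 8) // bus_bits


if __name__ == "__main__":
    for cpu, bits in L2_BUS_BITS.items():
        print(f"{cpu}: {transfers_per_line(bits)} transfers per 64-byte line")
```

Under that assumption, doubling the bus from 64 bits (Athlon XP) to 128 bits (K8) halves the transfers per line from 8 to 4, and the 256-bit Pentium 4 bus needs 2.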
Exclusive vs Inclusive Cache
Exclusive L1-L2: a cache line (instructions or data) held in L1 is not duplicated in L2.
  Positive: no constraint on the L2 size (it can be small); total cache size is the sum of the sub-level sizes.
  Negative: L2 performance impaired (latency); a victim buffer is needed.
Inclusive L1-L2: the content of the L1 cache is duplicated in the L2 cache.
  Positive: L2 performance improved.
  Negative: constraint on the L1/L2 size ratio (relatively large L2); total cache size may be smaller.
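The exclusive relationship can be illustrated with a toy model. The sketch below is a deliberately simplified assumption, not a description of the K8 hardware: both levels are fully associative with LRU replacement, sizes are counted in lines, and the class and method names are hypothetical. It shows only the two properties listed above: a line lives in L1 or L2 but never both, and the usable capacity is the sum of the two levels. The victim buffer itself is not modeled; the eviction path from L1 to L2 stands in for it.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """Toy exclusive L1/L2 pair: fully associative, LRU, line-granular."""

    def __init__(self, l1_lines, l2_lines):
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines
        self.l1 = OrderedDict()  # keys are line addresses, least recent first
        self.l2 = OrderedDict()

    def access(self, line):
        if line in self.l1:
            self.l1.move_to_end(line)  # refresh LRU position
            return "L1 hit"
        if line in self.l2:
            del self.l2[line]          # exclusive: the line leaves L2 when promoted
            result = "L2 hit"
        else:
            result = "miss"
        self._fill_l1(line)
        return result

    def _fill_l1(self, line):
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)  # evict the LRU line from L1
            if len(self.l2) >= self.l2_lines:
                self.l2.popitem(last=False)          # make room in L2
            self.l2[victim] = True                   # the victim moves down to L2
        self.l1[line] = True


if __name__ == "__main__":
    h = ExclusiveHierarchy(l1_lines=2, l2_lines=4)
    for n in range(6):
        h.access(n)
    print("L1:", list(h.l1), "L2:", list(h.l2))
    print("re-access line 0:", h.access(0))
```

After touching six distinct lines, the 2-line L1 and 4-line L2 together hold all six, with no line present in both. An inclusive pair of the same sizes would hold at most four distinct lines, because every L1 line is duplicated in L2.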
K8 Athlon 64
Athlon 64 Operating Modes
Opteron vs Xeon