RAMP Gold: ParLab InfiniCore Model Krste Asanovic UC Berkeley RAMP Retreat, January 16, 2008.
Slide 1: RAMP Gold: ParLab InfiniCore Model
Krste Asanovic, UC Berkeley
RAMP Retreat, January 16, 2008
Slide 2: Outline
- UCB Parallel Computing Laboratory (ParLab) overview
- InfiniCore: UCB's manycore prototype architecture
- RAMP Gold: a RAMP model for InfiniCore
Slide 3: UCB ParLab Overview
Goal: easy to write correct software that runs efficiently on manycore.
[Diagram: the ParLab software stack. Applications (Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser) and Motifs/Dwarfs sit above a Productivity Layer (Composition & Coordination Language (C&CL), Parallel Libraries, Parallel Frameworks, Sketching) and an Efficiency Layer (C&CL Compiler/Interpreter, Efficiency Languages, Efficiency Language Compilers, Type Systems, Autotuners, Schedulers, Communication & Synchronization Primitives, Legacy Code, OS Libraries+Services, Hypervisor, Legacy OS). Correctness tools span the layers: Static Verification, Dynamic Checking, Debugging with Replay, Directed Testing. Hardware targets: Multicore/GPGPU and InfiniCore/RAMP Gold.]
Slide 4: "Manycore" covers a huge design space
[Diagram: an example manycore chip combining "fat" cores, "thin" cores, and special-purpose cores/HW accelerators, each with an L1 cache; an L2 interconnect to multiple on-chip L2 $/RAM banks; a memory & I/O interconnect to multiple off-chip DRAM/Flash channels and fast serial I/O ports; many alternative memory hierarchies.]
Slide 5: Narrowing our search space
- Laptops/handhelds => single-socket systems
  - Don't expect >1 manycore chip per platform
  - Servers/HPC will probably use multiple single-socket blades
- Homogeneous, general-purpose cores
  - Presents most of the interesting design challenges
  - Resulting designs can later be specialized for improved efficiency
- "Simple" in-order cores
  - Want a low energy/op floor
  - Want a high performance/area ceiling
  - More predictable performance
- A "tiled" physical design
  - Reduces logical/physical design verification costs
  - Enables design reuse across a large family of parts
  - Provides natural locality to reduce latency and energy/op
  - Natural redundancy for yield enhancement & surviving failures
Slide 6: InfiniCore
- ParLab "strawman" manycore architecture
- A playground (punching bag?) for trying out architecture ideas
- Highlights:
  - Flexible hardware partitioning & protected communication
  - Latency-tolerant CPUs
  - Fast and flexible synchronization primitives
  - Configurable memory hierarchy and user-level DMA
  - Pervasive QoS and performance counters
Slide 7: InfiniCore Architecture Overview
- Four separate on-chip network types:
  - Control networks combine 1-bit signals in a combinational tree for interrupts & barriers
  - Active message networks carry register-register messages between cores
  - The L2/coherence network connects L1 caches to L2 slices and indirectly to memory
  - The memory network connects L2 slices to memory controllers
- I/O and accelerators potentially attach to all network types
- Flash replaces rotating disks; the only high-speed I/O is network & display
[Diagram: cores with L1 I$/D$ on the active message, control/barrier, and L2/coherence networks; L2 slices (L2 tags, L2 RAM, L2 control) between the coherence and memory networks; memory controllers (MEMC) to DRAM and Flash; accelerators and/or I/O interfaces attached to the networks; I/O pins.]
Slide 8: Physical View of Tiled Architecture
[Diagram: a grid of identical tiles, each containing a core, L1 I$/D$, an L2$ slice, and an interconnect switch; DRAM and Flash channels around the edge of the die; I/O on the periphery.]
Slide 9: Core Internals
- RISC-style 64-bit instruction set; SPARC V9 used for pragmatic reasons
- In-order pipeline with a decoupled single-lane (64-bit) vector unit (VU)
  - The integer control unit generates/checks addresses in order, giving precise exceptions on vector loads/stores
  - The VU runs behind, executing queued instructions on queued load data
  - The VU executes both scalar & vector code, and can mix them (e.g., vector load plus scalar ALU)
  - Each VU cycle: 2 ALU ops, 1 load, 1 store (all 64b) => 2x64b FLOPS/clock
- Vector regfile configurable to trade reduced I-fetch for fewer register spills
  - 256 total registers (e.g., 32 regs. x 8 elements, or 8 regs. x 32 elements)
- Decoupling is a cheap way to tolerate memory latency within a thread (scalar & vector)
- Vectors increase performance, reduce energy/op, and increase the effective decoupling queue size
[Diagram: a 1-3 issue control processor (Int 64b) with GPRs, TLB/PLB, and L1 I$/D$ feeds a command queue and load data queues (store queues not shown) into the vector unit (Int/FP 64b) with its VRegs and own TLB/PLB; virtual addresses go to the outer levels of the memory hierarchy.]
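The decoupling idea on this slide (control processor runs ahead; the vector unit drains queued instructions and queued load data) can be sketched in a few lines. The queue names and instruction mnemonics below are illustrative assumptions, not the actual design:

```python
# Sketch of access/execute decoupling as described on this slide: the
# control processor issues work in order and runs ahead; the vector unit
# runs behind on queued instructions and queued load data, hiding memory
# latency inside the thread. Purely illustrative names.
from collections import deque

class DecoupledVU:
    def __init__(self):
        self.cmd_q = deque()    # queued vector instructions
        self.data_q = deque()   # queued load data

    def control_issue(self, instr, load_data=None):
        """Control processor checks the address in order (so exceptions
        stay precise), then queues the work and keeps running ahead."""
        self.cmd_q.append(instr)
        if load_data is not None:
            self.data_q.append(load_data)

    def vu_cycle(self):
        """Vector unit drains the queues whenever work is available."""
        if not self.cmd_q:
            return None
        instr = self.cmd_q.popleft()
        if instr == "vload":
            return self.data_q.popleft()  # consume queued load data
        return instr
```

The key property is that `control_issue` never waits on `vu_cycle`; the queues absorb the latency between them.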
10
Cache CoherenceL1 cache coherence tracked at L1 cache coherence tracked at L2 memory managers (set of L2 memory managers (set of readers)readers)• All cases except write to currently All cases except write to currently
read shared line handled in pure read shared line handled in pure hardwarehardware
• Writer gets trap on memory Writer gets trap on memory response, invokes handlerresponse, invokes handler
• Same process used for Same process used for transactional memory (TM)transactional memory (TM)
• Cache tags visible to user-level Cache tags visible to user-level software in partition, useful for TM software in partition, useful for TM swappingswapping
Active Message NetworkActive Message Network
Control/Barrier NetworkControl/Barrier Network
L2/Coherence NetworkL2/Coherence Network
Memory NetworkMemory Network
CoreCore
L1D$L1D$
L1I$L1I$
L2L2RAMRAM
L2L2TagsTags
L2 Cntl.L2 Cntl.
CoreCore
L1D$L1D$
L1I$L1I$
Acc
ele
rato
rs a
nd/o
r I/O
A
ccele
rato
rs a
nd/o
r I/O
in
terf
ace
sin
terf
ace
s
MEMCMEMC
DRAMDRAM
I/O I/O PinsPinsL2L2
RAMRAML2L2
TagsTags
L2 Cntl.L2 Cntl.
MEMCMEMC
DRAMDRAM
MEMCMEMC
FlashFlash
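As a rough illustration of the policy above (not the actual InfiniCore hardware), the L2-side bookkeeping might look like the following; all class and method names are assumptions:

```python
# Sketch of the L2-side coherence policy described on this slide: the L2
# manager tracks the set of L1 readers per line, handles everything in
# "hardware" except a write to a line currently read by other cores,
# which traps the writer so a software handler can invalidate sharers.
class L2LineManager:
    def __init__(self):
        self.readers = {}  # line address -> set of reader core IDs

    def read(self, core, addr):
        """Pure-hardware path: record the new sharer and return data."""
        self.readers.setdefault(addr, set()).add(core)
        return "data"

    def write(self, core, addr):
        """Writes to unshared (or self-only) lines stay in hardware; a
        write to a read-shared line traps the writer on the memory
        response, handing the sharer set to a software handler."""
        sharers = self.readers.get(addr, set()) - {core}
        if sharers:
            return ("trap", sharers)
        self.readers[addr] = {core}
        return ("ok", set())
```

The same trap path is what the slide reuses for transactional memory: the handler sees exactly which caches hold the conflicting line.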
Slide 11: RAMP Gold: A Model of the ParLab InfiniCore Target
- Target is a single-socket tiled manycore system
  - Based on the SPARC ISA (v8 -> v9)
  - Distributed coherent caches
  - Multiple on-chip networks (barrier, active message, coherence, memory)
  - Multiple DRAM channels
- Split timing/functional models, both in hardware
- Host multithreading of both timing and functional models
- Expect to model up to 1024 64-bit cores in a system (8 BEE3 boards)
- Predict peak performance around 1-10 GIPS with full timing models
Slide 12: Host Multithreading (Zhangxi Tan (UCB), Chung (CMU))
- A multithreading emulation engine reduces FPGA resource use and improves emulator throughput
- Hides emulation latencies (e.g., communicating across FPGAs)
[Diagram: four target CPUs mapped onto one multithreaded host emulation engine on the FPGA: a single hardware pipeline (PC, I$, IR, GPRs, execute stages, D$) with multiple copies of CPU state.]
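The host-multithreading idea (one physical pipeline, many copies of target CPU state, one thread serviced per host cycle) can be sketched as follows; all class and field names are hypothetical:

```python
# Sketch of host multithreading as described on this slide: one emulation
# pipeline is shared round-robin by many target CPUs, each of which keeps
# only its own architectural state (PC, register file). Illustrative only.
class TargetCPU:
    """Per-target architectural state: a PC and a register-file copy."""
    def __init__(self):
        self.pc = 0
        self.regs = [0] * 32

class HostEngine:
    """One physical pipeline multiplexed over n target CPUs."""
    def __init__(self, n_targets):
        self.targets = [TargetCPU() for _ in range(n_targets)]
        self.next_thread = 0

    def host_cycle(self):
        """Each host cycle services the next thread in round-robin
        order, so a long-latency step for one target (e.g., a hop to
        another FPGA) is overlapped with work for the others."""
        t = self.targets[self.next_thread]
        t.pc += 4                      # stand-in for fetch/execute
        issued = self.next_thread
        self.next_thread = (self.next_thread + 1) % len(self.targets)
        return issued
```

With four targets, eight host cycles advance each target twice, which is exactly how the single pipeline on the slide emulates CPU1-CPU4.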
Slide 13: Split Functional/Timing Models (HAsim, Emer (MIT/Intel); FAST, Chiou (UT Austin))
- The functional model executes the CPU ISA correctly, with no timing information
  - Only need to develop the functional model once per ISA
- The timing model captures pipeline timing details and does not need to execute code
  - Much easier to change the timing model for architectural experimentation
  - Without an RTL design, cannot be 100% certain that timing is accurate
- Many possible splits between the timing and functional models
[Diagram: functional model coupled to timing model.]
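A minimal sketch of the split, assuming a toy one-operation ISA and fixed latencies (none of this is the real RAMP Gold interface):

```python
# Sketch of the functional/timing split described on this slide: the
# functional model alone knows how to execute instructions; the timing
# model only decides *when* each one completes (here, a fixed latency
# per instruction class). Hypothetical, minimal names.
def functional_step(state, instr):
    """Execute one instruction correctly, with no notion of time."""
    op, dst, a, b = instr
    if op == "add":
        state[dst] = state[a] + state[b]
    return state

def timing_step(cycle, instr, latency={"add": 1, "load": 3}):
    """Charge cycles for the instruction; never touches arch state."""
    return cycle + latency[instr[0]]

# Driving both together: the functional result is identical no matter
# how the latency table is changed, which is why only the timing side
# needs rewriting for architectural experiments.
state, cycle = {"r1": 2, "r2": 3}, 0
for instr in [("add", "r3", "r1", "r2"), ("add", "r4", "r3", "r3")]:
    state = functional_step(state, instr)
    cycle = timing_step(cycle, instr)
```

Swapping in a different `timing_step` (say, modeling a deeper pipeline) changes `cycle` but can never change `state`, mirroring the slide's point that timing experiments do not risk functional correctness.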
Slide 14: RAMP Gold Approach
- Split (and decoupled) functional and timing models
- Host multithreading of both functional and timing models
Slide 15: Multithreaded Functional & Timing Models
- An MT-Unit multiplexes multiple target units on a single host engine
- An MT-Channel multiplexes multiple target channels over a single host link
[Diagram: a functional model pipeline holding architectural state and a timing model pipeline holding timing state, each an MT-Unit, connected by MT-Channels.]
Slide 16: RAMP Gold CPU Model (v0.1)
[Diagram: a PC/fetch functional stage and an ALU functional stage, each with per-thread PC and GPR copies, connected to decode/issue, execute, and commit timing stages (with per-thread timing state) via queues of instructions, immediates, PC values, and status; fetch, execute, and memory commands flow from the timing stages; instruction and data memory interfaces carry addresses, load data, and store data.]
Slide 17: RAMP Gold Memory Model (v0.1)
[Diagram: multithreaded CPU models connect through the memory model (duplicate paths for the instruction and data interfaces) to a host DRAM cache backed by BEE DRAM.]
Slide 18: Matching physical resources to utilization
- Only implement sufficient functional units to match expected utilization, e.g.:
  - For a single-issue core, expected IPC ~0.6
  - Regfile read ports (1.2 operands/instruction): 0.6 * 1.2 = 0.72 per timing model
  - Regfile write ports (0.8 operands/instruction): 0.6 * 0.8 = 0.48 per timing model
- Instruction mix: Mem 0.3, FPU 0.1, Int 0.5, Branch 0.1
- Therefore only need (per timing model):
  - 0.6 * 0.3 = 0.18 memory ports
  - 0.6 * 0.1 = 0.06 FPUs
  - 0.6 * 0.5 = 0.30 integer execution units
  - 0.6 * 0.1 = 0.06 branch execution units
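The slide's arithmetic, written out as a small script. The IPC and instruction-mix numbers are the slide's own; the function names are ours:

```python
# Per-timing-model functional-unit demand is IPC * instruction-class
# fraction (the slide's calculation); a shared pool serving n timing
# models is sized by scaling up and rounding to whole units.
import math

IPC = 0.6
mix = {"mem": 0.3, "fpu": 0.1, "int": 0.5, "branch": 0.1}

def units_per_timing_model():
    """Fractional demand per timing model, as on the slide."""
    return {kind: IPC * frac for kind, frac in mix.items()}

def pool_size(n_timing_models):
    """Whole functional units needed to serve n timing models."""
    return {kind: math.ceil(IPC * frac * n_timing_models)
            for kind, frac in mix.items()}
```

For example, sixteen timing models would need only about three memory ports, five integer units, and a single shared FPU and branch unit, which is the motivation for the shared pool on the next slide.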
Slide 19: Balancing Resource Utilization
[Diagram: many timing models share a small pool of functional units (an FPU, a memory port, several integer units, a branch unit) and regfiles through an operand interconnect.]
Slide 20: RAMP Gold Capacity Estimates
- For a SPARC v8 (32-bit) pipeline: purely functional, no timing model, integer only
- For BEE3, predict 64 CPUs/engine and 8 engines/FPGA (LX110), i.e., 512 CPUs/FPGA
- Throughput of 150 MHz * 8 engines = 1200 MIPS/FPGA
- 8 BEE3 boards * 4 FPGAs/board = 32 FPGAs => ~38 GIPS/system
- Perhaps a 4x reduction in capacity with v9, the FPU, and timing models
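The capacity arithmetic above, spelled out (variable names are ours; the inputs and the 4x factor are the slide's own estimates):

```python
# Capacity estimate from this slide: CPUs per FPGA, MIPS per FPGA, and
# system-level GIPS for 8 BEE3 boards, then the predicted 4x reduction
# once v9 support, the FPU, and timing models are added.
clock_mhz = 150
engines_per_fpga = 8
cpus_per_engine = 64
fpgas_per_board = 4
boards = 8

cpus_per_fpga = cpus_per_engine * engines_per_fpga              # 512
mips_per_fpga = clock_mhz * engines_per_fpga                    # 1200 MIPS
system_gips = mips_per_fpga * fpgas_per_board * boards / 1000   # ~38 GIPS
timed_gips = system_gips / 4    # with v9, FPU, and timing models
```

So the "~38 GIPS" headline is 1200 MIPS/FPGA across 32 FPGAs, and the full-timing estimate lands near 10 GIPS, consistent with the 1-10 GIPS prediction on Slide 11.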