
RAMP Gold: Architecture and Timing Model

Andrew Waterman, Zhangxi Tan, Rimas Avizienis, Yunsup Lee, David Patterson, Krste Asanović

Parallel Computing Laboratory, University of California, Berkeley

RAMP Gold Overview

• Tiled CMP simulator
• ISA: SPARC V8
  – (ARM/Thumb-2 later?)
• Split timing and function (both on FPGA)
• Host-multithreaded
• Runs on V5 LX110T (XUP)

[Figure: Par Lab InfiniCore target; RAMP Gold host organization split into a Functional Model Pipeline holding the Arch State and a Timing Model Pipeline holding the Timing State]

RAMP Gold Target Machine

[Figure: target machine of 64 SPARC V8 cores, each with private I$ and D$, connected through a shared L2$ / interconnect to DRAM]

RAMP Gold v1 Target Features

• 64 single-issue, in-order SPARC V8 processors
  – Simple, 5-stage pipeline
  – FPU
• Cache timing model
  – Configurable size, line size, associativity, miss penalty, shared/private
  – Change parameters without resynthesis

RAMP Gold Architecture

• Mapping the target machine directly to an FPGA is inefficient
• Solution: split timing and functionality + multithreading
  – The timing logic decides how many target cycles an instruction sequence should take
  – Simulating the functionality of an instruction might take multiple host cycles

Function/Timing Split Advantages

• Flexibility
  – Can configure target at runtime
  – Synthesize design once, change target model parameters at will
• Efficient FPGA resource usage
  – Example 1: model a 2-cycle FPU in 10 host cycles
  – Example 2: model a 16MB L2$ using only 256KB of host BRAM to store tags/metadata (rough arithmetic sketched below)
• Enables multithreading
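A back-of-the-envelope check of Example 2 (a sketch only; the 64B line size and the one-byte tag/state budget per line are assumptions, not RAMP Gold's actual encoding):

```python
# Rough tag-storage estimate for Example 2 (line size and per-line tag/state
# budget are assumptions, not RAMP Gold's actual encoding).
target_l2_bytes = 16 * 2**20                      # 16 MB target L2$
line_bytes = 64                                   # assumed line size
num_lines = target_l2_bytes // line_bytes         # 262,144 target cache lines
tag_state_bytes_per_line = 1                      # assumed tag + state budget
host_bram_bytes = num_lines * tag_state_bytes_per_line
print(host_bram_bytes // 2**10, "KB of host BRAM")  # -> 256 KB of host BRAM
```

The data itself never needs to live in BRAM, because the functional model already holds the target memory contents; the timing model only needs tags and metadata.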

Split Timing and Function

• Functional model executes the ISA correctly
• Timing model determines how long a program takes to run

[Figure: Target Machine (CPU, L1 D$, MEM) = Functional Model (CPU FM, L1 D$ FM, MEM FM) + Timing Model (CPU TM, L1 D$ TM, MEM TM)]
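To make the split concrete, here is a minimal software sketch of the idea under a simple per-instruction handshake; the class and method names (FunctionalModel, TimingModel, stall_cycles) are illustrative assumptions, not the RAMP Gold hardware interface.

```python
# Minimal software sketch of the timing/function split. The real models are
# FPGA hardware; the names and interfaces below are illustrative assumptions.

class FunctionalModel:
    """Executes each target instruction correctly, with no notion of time."""
    def execute(self, inst):
        # ...update architectural state here...
        return inst.mem_request           # None for non-memory instructions

class TimingModel:
    """Decides how many target cycles each instruction should take."""
    def __init__(self, dcache_tm):
        self.dcache_tm = dcache_tm
        self.target_cycles = 0

    def account(self, inst, mem_request):
        cycles = 1                        # base cost in a single-issue, in-order pipeline
        if mem_request is not None:       # let the cache TM add any stall
            cycles += self.dcache_tm.stall_cycles(mem_request.addr)
        self.target_cycles += cycles

def simulate(program, fm, tm):
    for inst in program:
        mem_request = fm.execute(inst)    # functional model: what happens
        tm.account(inst, mem_request)     # timing model: how long it takes
    return tm.target_cycles
```

The key property is that the functional model alone defines what the program computes; swapping in a differently configured timing model changes only the reported target cycle count.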


TM + FM from 30,000 ft

[Figure: CPU Timing Model, L1 D$ Timing Model, and Memory Timing Model coordinating with the CPU and Memory Functional Models; signals shown: instruction, ld/st address, store data, stall, load data, instruction complete]

TM + FM from 3,000 ft

[Figure: same organization with the CPU TM expanded into IF, CTRL, DEC, EX, MEM, WB stages, the L1 D$ TM into TM1/TM2 stages, plus the CPU FM and the memory TM/FM]

Example: Target Load Miss

[Figure: the 3,000 ft TM/FM diagram annotated with numbered steps 1-7 tracing a target load miss through the CPU TM stages, CPU FM, L1 D$ TM, and memory TM/FM, using the instruction, ld/st address, stall, load data, and instruction-complete signals]

Timing-Driven Host Pipeline

[Figure: host pipeline stages TS, IF, DE, EX, MEM1, MEM2, WB; CPU/D$ timing model stages TM1, TM2, TM3 with the L1 D$ TM; a store buffer and a load result buffer; and the target memory TM/FM. Instructions and addresses carry a thread ID ({TID, INST}, {TID, ADDR}), and instructions from threads T0, T1, T2 (ADD, LD, ST) are interleaved through the host pipeline]
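As a rough software analogue of the host-multithreaded pipeline above (the real host pipeline is FPGA hardware, so every name below is an assumption), one target thread is issued per host cycle, with its thread ID carried alongside the instruction and address:

```python
# Software analogue of host multithreading (illustrative only). Each host
# cycle issues one instruction from the next target thread, tagged by TID.
from collections import deque

def run_host(threads, fms, tms, host_cycles):
    rr = deque(range(len(threads)))            # round-robin over thread IDs
    for _ in range(host_cycles):
        tid = rr[0]
        rr.rotate(-1)                          # a different TID each host cycle
        inst = threads[tid].fetch()            # {TID, INST}
        mem_request = fms[tid].execute(inst)   # functional work for this TID
        tms[tid].account(inst, mem_request)    # timing accounting for this TID
```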

Cache Modeling

• The cache model maintains tag, state, protocol bits internally

• Whenever the functional model issues a memory operation, the cache model determines how many target cycles to stall

[Figure: target address split into tag, index, offset; the per-set tag/state entries are compared across the associativity to produce hit/miss]
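A minimal software sketch of such a cache timing model, assuming an LRU policy and the tag/index/offset split shown above; it stores only tags and state (never data) and takes every parameter at run time. The names and the replacement policy are assumptions, not the RAMP Gold implementation.

```python
# Sketch of a runtime-configurable cache timing model: only tags and state
# are kept, and parameters are set at run time rather than at synthesis.

class CacheTimingModel:
    def __init__(self, size, line_size, associativity, miss_penalty):
        self.line_size = line_size
        self.associativity = associativity
        self.miss_penalty = miss_penalty
        self.num_sets = size // (line_size * associativity)
        self.tags = [[] for _ in range(self.num_sets)]   # per-set tags, MRU first

    def stall_cycles(self, addr):
        """Return the extra target cycles this memory operation should stall."""
        line = addr // self.line_size
        index = line % self.num_sets
        tag = line // self.num_sets
        ways = self.tags[index]
        if tag in ways:                    # hit: no stall
            ways.remove(tag)
            ways.insert(0, tag)            # refresh MRU position
            return 0
        ways.insert(0, tag)                # miss: allocate, evict LRU if full
        if len(ways) > self.associativity:
            ways.pop()
        return self.miss_penalty

# Example: the L1 D$ from the validation slide (32 KB, 2-way, 64 B lines);
# the miss penalty here is arbitrary, chosen only for illustration.
l1d_tm = CacheTimingModel(size=32 * 1024, line_size=64,
                          associativity=2, miss_penalty=20)
```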

Multithreaded, Pipelined Cache TM

[Figure: pipelined, multithreaded cache TM datapath; each thread's address index selects the per-set tag/state entries, which are compared against the tag to produce hit?]

Quick & Dirty Validation

• 32KB, 2-way L1 D$, 64B lines
• 256KB, 4-way L2$, 64B lines

Status

• Functional + simple timing model work in HW
  – Running real programs (e.g., SPLASH-2)
• Near-term future work
  – Move from the current "functional-first + stall" configuration to the timing-driven one described here
  – More interesting memory system timing model
  – Functional potpourri (FDIV, MMU, …)

DEMO

• Run OCEAN with different L1 D$ parameters

Questions?

Thank you!