
2007-11-14 1

Overview of SimpleScalar

Mafijul Islam

Department of Computer Science and Engineering


Acknowledgement

• SimpleScalar Tutorial available at www.simplescalar.com

• http://hpc5.cs.tamu.edu/docs/SimplescalarOverview2006.ppt

SimpleScalar Tutorial, Page 4

A Computer Architecture Simulator Primer

• What is an architectural simulator?
  – a tool that reproduces the behavior of a computing device
• Why use a simulator?
  – leverage faster, more flexible S/W development cycle
  – permits more design space exploration
  – facilitates validation before H/W becomes available
  – level of abstraction can be throttled to design task
  – possible to increase/improve system instrumentation

[Figure: System Inputs → Device Simulator → System Outputs and System Metrics]


Simulators

• Almost 40 simulators listed at http://www.cs.wisc.edu/arch/www/tools.html

• First simulator on the list

– SimpleScalar (uni-processor, superscalar)

– Developed by Todd Austin (now at U Michigan) while at the University of Wisconsin-Madison

– Still widely used in academia and industry

A Taxonomy of Simulation Tools

• shaded tools are part of SimpleScalar

[Figure: taxonomy tree; shading not preserved in this transcript]

Architectural simulators
  Functional
    Trace-driven
    Exec-driven
      Interpreters
      Direct execution
  Performance
    Inst. schedulers
    Cycle timers


Functional vs. performance simulators

• Functional simulators implement the architecture

- Perform the actual execution

- Implement what programmers see

• Performance (or timing) simulators implement the microarch.

- Model system resources/internals

- Measure time

- Implement what programmers do not see

Trace Driven vs. Execution Driven Simulators

• Trace-Driven
  – Simulator reads a ‘trace’ of the instructions captured during a previous execution
  – Easy to implement
  – No functional components necessary
  – No feedback to trace (e.g., mis-prediction)
• Execution-Driven
  – Simulator runs the program (trace-on-the-fly)
  – Hard to implement
  – Advantages
    - Faster than tracing
    - No need to store traces
    - Register and memory values are available (they usually are not in a trace)
    - Supports mis-speculation cost modeling


Instruction Schedulers vs. Cycle Timers

• Instruction Schedulers

– Simulator schedules instruction when resources are available

– Instructions are processed one at a time

– Simpler, but less detailed

• Cycle Timers

– Simulator tracks microarchitecture state each cycle

– Simulator state == microarchitecture state

– Perfect for microarchitecture simulation

The SimpleScalar Tool Set

• computer architecture research test bed
  – compilers, assembler, linker, libraries, and simulators
  – targeted to the virtual SimpleScalar PISA architecture
  – hosted on most any Unix-like machine
• developed during Austin's dissertation work at UW-Madison
  – third generation simulation tool (Sohi → Franklin → SimpleScalar)
  – in development since ’94
  – first public release (1.0) in July ’96
  – second public release (2.0) testing completed in January ’97
• available with source and docs from SimpleScalar LLC: http://www.simplescalar.com

SimpleScalar Tool Set Overview

• compiler chain is GNU tools ported to SimpleScalar
• Fortran codes are compiled with AT&T’s f2c
• libraries are GLIBC ported to SimpleScalar

[Figure: tool flow: Fortran code → F2C → C code → GCC → assembly code → GAS → object files → GLD (with libf77.a, libm.a, libc.a) → executables → simulators; Bin Utils also shown]

Advantages of SimpleScalar

• Extensible

- source for compiler, libraries, simulators

- user-extensible instruction format

• Portable

- runs on NT and most UNIX platforms

- target can support multiple ISAs

• Detailed

- Interfaces support simulators of arbitrary detail

- Multiple simulators included with distribution

• Fast (millions of instructions per second)

The Zen of Simulator Design

• design goals will drive which aspects are optimized
• the SimpleScalar Tool Set
  – optimizes performance and flexibility
  – in addition, provides portability and varied detail

[Figure: triangle with corners Performance, Detail, Flexibility; caption “Pick Two”]

Performance: speeds design cycle
Flexibility: maximizes design scope
Detail: minimizes risk

Simulation Suite Overview

[Figure: simulators arranged along an axis from Performance (left) to Detail (right)]

Sim-Fast: 420 lines, functional, 4+ MIPS
Sim-Safe: 350 lines, functional w/ checks
Sim-Cache / Sim-Cheetah / Sim-BPred: < 1000 lines, functional, cache stats, pred stats
Sim-Profile: 900 lines, functional, lot of stats
Sim-Outorder: 3900 lines, performance, OoO issue, branch pred., mis-spec., ALUs, cache, TLB, 200+ KIPS


Sim-Fast

• Functional simulation

• Optimized for speed

• Assumes no cache

• Assumes no instruction checking

• Does not support Dlite!

• Does not allow command line arguments

• <300 lines of code


Sim-Safe

• Functional simulation

• Checks for instruction errors

• Optimized for speed

• Assumes no cache

• Supports Dlite!

• Does not allow command line arguments


Sim-Cache

• Cache simulation

• Ideal for fast simulation of caches (if the effect of cache performance on execution time is not necessary)

• Accepts command line arguments for:

– level 1 & 2 instruction and data caches

– TLB configuration (data and instruction)

– Flush and compress

– and more

• Ideal for performing high-level cache studies that don’t take access time of the caches into account


Sim-Bpred

• Simulate different branch prediction mechanisms

• Generate prediction hit and miss rate reports

• Does not simulate the effect of branch prediction on total execution time

• Predictor types:
    nottaken  always predict not taken
    taken     always predict taken
    perfect   perfect predictor
    bimod     bimodal predictor
    2lev      2-level adaptive predictor
    comb      combined predictor (bimodal and 2-level)


Sim-Profile

● Program Profiler

● Generates detailed profiles, by symbol and by address

● Keeps track of and reports
  – Dynamic instruction counts
  – Instruction class counts
  – Branch class counts
  – Usage of address modes
  – Profiles of the text & data segment


Sim-Outorder

• Most complicated and detailed simulator

• Supports out-of-order issue and execution

• Provides reports

– branch prediction

– cache

– external memory

– various configuration

Generating SimpleScalar Binaries

• compiling a C program, e.g.,
    ssbig-na-sstrix-gcc -g -O -o foo foo.c -lm
• compiling a Fortran program, e.g.,
    ssbig-na-sstrix-f77 -g -O -o foo foo.f -lm
• compiling a SimpleScalar assembly program, e.g.,
    ssbig-na-sstrix-gcc -g -O -o foo foo.s -lm
• running a program, e.g.,
    sim-safe [-sim opts] program [-program opts]
• disassembling a program, e.g.,
    ssbig-na-sstrix-objdump -x -d -l foo
• building a library, use:
    ssbig-na-sstrix-{ar,ranlib}

Global Simulator Options

• supported on all simulators:
    -h                  print simulator help message
    -d                  enable debug message
    -i                  start up in DLite! debugger
    -q                  quit immediately (use w/ -dumpconfig)
    -config <file>      read config parameters from <file>
    -dumpconfig <file>  save config parameters into <file>
• configuration files:
  – to generate a configuration file:
    - specify non-default options on command line
    - and, include “-dumpconfig <file>” to generate configuration file
  – comments allowed in configuration files, all after “#” ignored
  – reload configuration files using “-config <file>”

The SimpleScalar Instruction Set

• clean and simple instruction set architecture:
  – MIPS/DLX, plus more addressing modes, minus delay slots
• bi-endian instruction set definition
  – facilitates portability, build to match host endian
• 64-bit inst encoding facilitates instruction set research
  – 16-bit space for hints, new insts, and annotations
  – four operand instruction format, up to 256 registers

[Figure: 64-bit instruction word with bit boundaries at 63, 48, 32, 24, 16, 8, 0: 16-bit annote, 16-bit opcode, and four 8-bit register fields (rs, rt, rd, ru); a 16-bit imm field overlays the low-order bits]
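The encoding above can be illustrated with a small field extractor. This is only a sketch: the transcript does not preserve the left-to-right order of the four register fields, so the order used here (rs, rt, rd, ru from bit 31 down to bit 0) is an assumption, and `ss_fields` and `ss_decode` are hypothetical names rather than part of the SimpleScalar sources.

```c
#include <stdint.h>

/* Assumed layout, for illustration only:
 *   bits 63..48 annote, bits 47..32 opcode,
 *   bits 31..24 rs, 23..16 rt, 15..8 rd, 7..0 ru,
 *   and a 16-bit immediate overlaying bits 15..0. */
typedef struct {
    uint16_t annote;
    uint16_t opcode;
    uint8_t  rs, rt, rd, ru;
    int16_t  imm;
} ss_fields;

static ss_fields ss_decode(uint64_t inst)
{
    ss_fields f;
    f.annote = (uint16_t)(inst >> 48);
    f.opcode = (uint16_t)(inst >> 32);
    f.rs     = (uint8_t)(inst >> 24);
    f.rt     = (uint8_t)(inst >> 16);
    f.rd     = (uint8_t)(inst >> 8);
    f.ru     = (uint8_t)inst;
    /* signed view of the low 16 bits (two's complement host assumed) */
    f.imm    = (int16_t)(uint16_t)inst;
    return f;
}
```

With 8-bit register specifiers each field can name one of 256 registers, which matches the "up to 256 registers" bullet.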

SimpleScalar Instructions

Control:
    j     - jump
    jal   - jump and link
    jr    - jump register
    jalr  - jump and link register
    beq   - branch == 0
    bne   - branch != 0
    blez  - branch <= 0
    bgtz  - branch > 0
    bltz  - branch < 0
    bgez  - branch >= 0
    bct   - branch FCC TRUE
    bcf   - branch FCC FALSE

Load/Store:
    lb    - load byte
    lbu   - load byte unsigned
    lh    - load half (short)
    lhu   - load half (short) unsigned
    lw    - load word
    dlw   - load double word
    l.s   - load single-precision FP
    l.d   - load double-precision FP
    sb    - store byte
    sbu   - store byte unsigned
    sh    - store half (short)
    shu   - store half (short) unsigned
    sw    - store word
    dsw   - store double word
    s.s   - store single-precision FP
    s.d   - store double-precision FP

    addressing modes:
        (C)
        (reg + C) (w/ pre/post inc/dec)
        (reg + reg) (w/ pre/post inc/dec)

Integer Arithmetic:
    add   - integer add
    addu  - integer add unsigned
    sub   - integer subtract
    subu  - integer subtract unsigned
    mult  - integer multiply
    multu - integer multiply unsigned
    div   - integer divide
    divu  - integer divide unsigned
    and   - logical AND
    or    - logical OR
    xor   - logical XOR
    nor   - logical NOR
    sll   - shift left logical
    srl   - shift right logical
    sra   - shift right arithmetic
    slt   - set less than
    sltu  - set less than unsigned

SimpleScalar Instructions (cont.)

Floating Point Arithmetic:
    add.s  - single-precision add
    add.d  - double-precision add
    sub.s  - single-precision subtract
    sub.d  - double-precision subtract
    mult.s - single-precision multiply
    mult.d - double-precision multiply
    div.s  - single-precision divide
    div.d  - double-precision divide
    abs.s  - single-precision absolute value
    abs.d  - double-precision absolute value
    neg.s  - single-precision negation
    neg.d  - double-precision negation
    sqrt.s - single-precision square root
    sqrt.d - double-precision square root
    cvt    - integer, single, double conversion
    c.s    - single-precision compare
    c.d    - double-precision compare

Miscellaneous:
    nop     - no operation
    syscall - system call
    break   - declare program error

SimpleScalar Architected State

Integer Reg File: r0 (0 source/sink), r1, r2, ..., r30, r31 (32 bits each)

FP Reg File (SP and DP views): f0, f1, f2, ..., f30, f31 (32 bits each in the SP view); the DP view is drawn over the same registers (f1, f3, ..., f31 appear as the second halves)

Other registers: PC, HI, LO, FCC

Virtual Memory:
    0x00000000  Unused
    0x00400000  Text (code)
    0x10000000  Data (init), (bss)
    0x7fffc000  Stack; Args & Env
    0x7fffffff  top of the address space

Simulator I/O

• a useful simulator must implement some form of I/O
  – I/O implemented via SYSCALL instruction
  – supports a subset of Ultrix system calls, proxied out to host
• basic algorithm (implemented in syscall.c):
  – decode system call
  – copy arguments (if any) into simulator memory
  – perform system call on host
  – copy results (if any) into simulated program memory

[Figure: the simulated program calls write(fd, p, 4); the simulator copies the args in, performs sys_write(fd, p, 4) on the host, and copies the results out]
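The proxy idea can be sketched for the write() case shown in the figure. This is a simplified illustration and not the code in syscall.c: `sim_mem`, `SIM_MEM_SIZE` and `proxy_write` are hypothetical names, simulated memory is modeled as a flat byte array, and the host is assumed to be POSIX.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define SIM_MEM_SIZE 4096u   /* toy simulated memory size, assumed */

typedef struct {
    uint8_t bytes[SIM_MEM_SIZE];
} sim_mem;

/* Proxy a simulated write(fd, p, n): copy the buffer out of simulated
 * memory ("args in"), run the real write() on the host, and return the
 * host result so the caller can hand it back to the simulated program. */
static long proxy_write(const sim_mem *mem, int host_fd,
                        uint32_t sim_addr, uint32_t n)
{
    uint8_t buf[SIM_MEM_SIZE];

    if (sim_addr >= SIM_MEM_SIZE || n > SIM_MEM_SIZE - sim_addr)
        return -1;                           /* outside simulated memory */
    memcpy(buf, mem->bytes + sim_addr, n);   /* copy argument data */
    return (long)write(host_fd, buf, n);     /* perform the call on the host */
}
```

For write() the only result is the return value; a call such as read() would additionally copy the host buffer back into simulated memory, which is the "results out" arrow.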

Simulator S/W Architecture

• interface programming style
  – all “.c” files have an accompanying “.h” file with same base
  – “.h” files define public interfaces “exported” by module
    - mostly stable, documented with comments; study these files
  – “.c” files implement the exported interfaces
    - not as stable, study these if you need to hack the functionality
• simulator modules
  – sim-*.c files, each implements a complete simulator core
• reusable S/W components facilitate “rolling your own”
  – system components
  – simulation components
  – “really useful” components

Simulator S/W Architecture (cont.)

• most of performance core is optional
• most projects will enhance the “simulator core”

[Figure: layered module diagram. User Programs: SimpleScalar Program Binary. Prog/Sim Interface: SimpleScalar ISA, POSIX System Calls. Functional Core: Machine Definition, Proxy Syscall Handler. Performance Core: BPred, Simulator Core, Dlite!, Cache, Memory, Regs, Loader, Resource, Stats]

SIM-OUTORDER: H/W Architecture

• implemented in sim-outorder.c and components

[Figure: pipeline Fetch → Dispatch → Scheduler (Register Scheduler, Memory Scheduler) → Exec / Mem → Writeback → Commit, backed by I-Cache (IL1), I-Cache (IL2), D-Cache (DL1), D-Cache (DL2), I-TLB, D-TLB, and Virtual Memory]

Sim-Outorder HW Architecture

[Figure: the same pipeline annotated with the implementing functions: Fetch (ruu_fetch), Dispatch (ruu_dispatch), Register Scheduler, Memory Scheduler and Exe / Mem (ruu_issue, lsq_refresh), Writeback (ruu_writeback), Commit (ruu_commit); memory side: I-Cache, D-Cache, I-TLB, D-TLB, Virtual Memory]

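Sim-Outorder (Main Loop)

• sim_main() in sim-outorder.c

    ruu_init();
    for (;;) {
        ruu_commit();      /* retire completed instructions in order */
        ruu_writeback();   /* process finished instructions, wake up dependents */
        lsq_refresh();     /* memory scheduler */
        ruu_issue();       /* register scheduler and execute */
        ruu_dispatch();    /* decode, rename, allocate RUU/LSQ entries */
        ruu_fetch();       /* fetch into the IFQ */
    }

• Executed once for each simulated machine cycle
• Walks pipeline from Commit to Fetch
  – Reverse traversal handles inter-stage latch synchronization in a single pass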
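Why the reverse walk needs only one pass can be seen in a toy model. The sketch below is not taken from sim-outorder.c: `toy_pipe`, `toy_init` and `toy_cycle` are hypothetical, and each stage is reduced to a single latch holding an instruction id. Because the downstream stage drains its latch before the upstream stage writes into it, every instruction advances exactly one stage per call, with no second "next state" copy of the latches.

```c
#define NSTAGES 4   /* toy pipeline depth, assumed for the sketch */

typedef struct {
    int latch[NSTAGES];   /* latch[i]: inst id held in stage i, 0 = empty */
    int next_id;          /* id of the next instruction to fetch */
    int retired;          /* instructions retired so far */
} toy_pipe;

static void toy_init(toy_pipe *p)
{
    for (int i = 0; i < NSTAGES; i++)
        p->latch[i] = 0;
    p->next_id = 1;
    p->retired = 0;
}

/* One simulated cycle, walking from the last stage back to the first. */
static void toy_cycle(toy_pipe *p)
{
    /* last stage: retire whatever it holds */
    if (p->latch[NSTAGES - 1] != 0) {
        p->retired++;
        p->latch[NSTAGES - 1] = 0;
    }
    /* remaining stages, downstream first: pull from the upstream latch */
    for (int i = NSTAGES - 1; i > 0; i--) {
        if (p->latch[i] == 0 && p->latch[i - 1] != 0) {
            p->latch[i] = p->latch[i - 1];
            p->latch[i - 1] = 0;
        }
    }
    /* first stage: fetch a new instruction once its latch is free */
    if (p->latch[0] == 0)
        p->latch[0] = p->next_id++;
}
```

If the stages were visited front to back instead, a newly fetched instruction could be moved again by every later stage in the same call and would cross the whole pipeline in one simulated cycle.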


Sim-Outorder (RUU/LSQ)

• RUU (Register Update Unit)

– Handles register synchronization/communication

– Serves as reorder buffer and reservation stations

• LSQ (Load/Store Queue)

– Handles memory synchronization/communication

– Contains all loads and stores in program order

• Relationship between RUU and LSQ

– Memory dependencies are resolved by LSQ

– Load/Store effective address calculated in RUU

Fetch Stage Implementation

[Figure: Fetch stage; input: misprediction (from Writeback); output: to instruction fetch queue (IFQ)]

• models machine fetch bandwidth
• implemented in ruu_fetch()
• inputs:
  – program counter
  – predictor state (see bpred.[hc])
  – misprediction detection from branch execution unit(s)
• outputs:
  – fetched instructions sent to instruction fetch queue (IFQ)
• procedure (once per cycle):
  – fetch instructions from one I-cache line, block until I-cache or I-TLB misses are resolved
  – queue fetched instructions to instruction fetch queue (IFQ)
  – probe branch predictor for cache line to access in next cycle

Dispatch Stage Implementation

[Figure: Dispatch stage; input: instructions from IFQ; output: to RUU or LSQ]

• models machine decode, rename, RUU/LSQ allocation bandwidth, implements register renaming
• implemented in ruu_dispatch()
• inputs:
  – instructions from IFQ, from Fetch stage
  – RUU/LSQ occupancy
  – rename table (create_vector)
  – architected machine state (for execution)
• outputs:
  – updated RUU/LSQ, rename table, machine state
• procedure (once per cycle):
  – fetch insts from IFQ
  – decode and execute instructions
    - permits early detection of branch mis-predicts
    - facilitates simulation of “oracle” studies
  – if branch misprediction occurs:
    - start copy-on-write of architected state to speculative state buffers
  – enter instructions into RUU and LSQ (load/store queue)
    - link to sourcing instruction(s) using RS_LINK structure
  – loads/stores are split into two insts: ADD + Load/Store
    - improves performance of memory dependence checking

Scheduler Stage Implementation

[Figure: Register Scheduler and Memory Scheduler; input: RUU, LSQ; output: to functional units]

• models instruction wakeup, selection, and issue
  – separate schedulers track register and memory dependencies
• implemented in ruu_issue() and lsq_refresh()
• inputs:
  – RUU/LSQ
• outputs:
  – updated RUU/LSQ
  – updated functional unit state
• procedure (once per cycle):
  – locate instructions with all register inputs ready
    - in ready queue, inserted when dependent insts enter Writeback
  – locate loads with all memory inputs ready
    - determined by walking the load/store queue
    - if load addr unknown, then stall issue (and poll again next cycle)
    - if earlier store w/ unknown addr, then stall issue (and poll again)
    - if earlier store w/ matching addr, then forward store data
    - else, access D-cache
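The four load rules above can be written as a single walk over the queue. This is a minimal sketch under simplifying assumptions, not the logic of lsq_refresh() itself: `lsq_entry`, `load_action` and `lsq_check_load` are hypothetical, the queue is a plain array in program order, and addresses match only when exactly equal (access sizes are ignored).

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { LOAD_STALL, LOAD_FORWARD, LOAD_ACCESS_DCACHE } load_action;

typedef struct {
    bool     is_store;     /* store if true, load otherwise */
    bool     addr_known;   /* effective address already computed? */
    uint32_t addr;
} lsq_entry;   /* hypothetical, far simpler than a real LSQ entry */

/* lsq[0..load_idx] is in program order, oldest first; lsq[load_idx] is
 * the load being considered for issue this cycle. */
static load_action lsq_check_load(const lsq_entry *lsq, int load_idx)
{
    bool match = false;

    if (!lsq[load_idx].addr_known)
        return LOAD_STALL;              /* load addr unknown: poll again */

    for (int i = 0; i < load_idx; i++) {
        if (!lsq[i].is_store)
            continue;
        if (!lsq[i].addr_known)
            return LOAD_STALL;          /* earlier store w/ unknown addr */
        if (lsq[i].addr == lsq[load_idx].addr)
            match = true;               /* earlier store w/ matching addr */
    }
    return match ? LOAD_FORWARD : LOAD_ACCESS_DCACHE;
}
```

A caller would invoke this once per cycle for each waiting load and simply retry on `LOAD_STALL`, which corresponds to the "poll again next cycle" wording.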

Execute Stage Implementation

[Figure: Exec and Mem units; input: insts issued by Scheduler; outputs: completed insts to Writeback, requests to memory hierarchy]

• models functional units and D-cache
  – access port bandwidths, issue and execute latencies
• implemented in ruu_issue()
• inputs:
  – instructions ready to execute, issued by Scheduler stage
  – functional unit and D-cache state
• outputs:
  – updated functional unit and D-cache state, Writeback events
• procedure (once per cycle):
  – get ready instructions (as many as supported by issue B/W)
  – find free functional unit and access port
  – reserve unit for entire issue latency
  – schedule writeback event using operation latency of functional unit
    - for loads satisfied in D-cache, probe D-cache for access latency
    - also probe D-TLB, stall future issue on a miss
    - D-TLB misses serviced in Commit with fixed latency
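The distinction between issue latency (how long the unit stays reserved) and operation latency (when the writeback event fires) can be sketched as follows. The names `fu_state` and `fu_issue` are hypothetical and the model ignores access ports; it is meant only to illustrate the two latencies, not to mirror the resource code in the distribution.

```c
typedef struct {
    long busy_until;   /* first cycle at which the unit accepts a new op */
} fu_state;

/* Try to issue one operation at cycle `now` on any of `nunits` units.
 * Returns the cycle at which the writeback event should fire, or -1 if
 * every unit is still reserved. */
static long fu_issue(fu_state *units, int nunits, long now,
                     int issue_lat, int op_lat)
{
    for (int i = 0; i < nunits; i++) {
        if (units[i].busy_until <= now) {
            units[i].busy_until = now + issue_lat;  /* reserve the unit */
            return now + op_lat;                    /* result available */
        }
    }
    return -1;   /* structural hazard: retry next cycle */
}
```

A fully pipelined unit has an issue latency of 1 and a longer operation latency, so a new operation can start every cycle; an unpipelined unit has equal issue and operation latencies and blocks until the result is done.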

Writeback Stage Implementation

[Figure: Writeback stage; input: finished insts from Execute; outputs: insts ready to Commit, detected mispredictions to Fetch]

• models writeback bandwidth, wakes up ready insts, detects mispredictions, initiates misprediction recovery
• implemented in ruu_writeback()
• inputs:
  – completed instructions as indicated by event queue
  – RUU/LSQ state (for wakeup walks)
• outputs:
  – updated event queue, RUU/LSQ, ready queue
  – branch misprediction recovery updates
• procedure (once per cycle):
  – get finished instructions (specified by event queue)
  – if mispredicted branch, recover state:
    - recover RUU
        walk from newest instruction back to mispredicted branch
        unlink instructions from output dependence chains (tag increment)
    - recover architected state
        roll back to checkpoint (copy-on-write bits reset, spec mem freed)
  – wakeup walk: walk output dependence chains of finished insts
    - mark dependent instruction’s input as now ready
    - if deps satisfied, wake up inst (memory checked in lsq_refresh())

Commit Stage Implementation

[Figure: Commit stage; input: insts ready to Commit]

• models in-order retirement of instructions, store commits to the D-cache, and D-TLB miss handling
• implemented in ruu_commit()
• inputs:
  – completed instructions in RUU/LSQ that are ready to retire
  – D-cache state (for store commits)
• outputs:
  – updated RUU, LSQ, D-cache state
• procedure (once per cycle):
  – while head of RUU/LSQ is ready to commit (in-order retirement)
    - if D-TLB miss, then service it
    - if store, attempt to retire store into D-cache, stall commit otherwise
    - commit instruction result to the architected register file, update rename table to point to architected register file
    - reclaim RUU/LSQ resources (adjust head pointer)
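In-order retirement from the head of a circular queue can be sketched as below. This is an illustration only: `toy_ruu`, `ruu_entry`, `RUU_SIZE` and `toy_commit` are hypothetical, the commit width and the number of D-cache store ports are passed in as plain parameters, and D-TLB handling and the register file update are left out.

```c
#include <stdbool.h>

#define RUU_SIZE 8   /* toy capacity, assumed for the sketch */

typedef struct {
    bool completed;   /* result written back, ready to retire */
    bool is_store;    /* needs a D-cache port at commit */
} ruu_entry;

typedef struct {
    ruu_entry e[RUU_SIZE];
    int head;    /* index of the oldest instruction */
    int count;   /* number of occupied entries */
} toy_ruu;

/* Retire up to `width` instructions in program order. Each store needs
 * one of `store_ports` D-cache ports this cycle. Returns the number of
 * instructions retired. */
static int toy_commit(toy_ruu *r, int width, int store_ports)
{
    int retired = 0;

    while (r->count > 0 && retired < width) {
        ruu_entry *h = &r->e[r->head];
        if (!h->completed)
            break;                    /* head not ready: nothing younger may retire */
        if (h->is_store) {
            if (store_ports == 0)
                break;                /* store cannot retire: stall commit */
            store_ports--;
        }
        r->head = (r->head + 1) % RUU_SIZE;   /* reclaim the entry */
        r->count--;
        retired++;
    }
    return retired;
}
```

The loop stops at the first entry that cannot retire, which is what keeps retirement in program order even when younger instructions have already completed.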


Sim-Outorder parameters

• Instruction fetch queue size, decode and issue bandwidth

• Capacity of RUU and LSQ

• Branch mis-prediction latency

• Number of functional units

– integer ALU, integer multipliers/dividers

– FP ALU, FP multipliers/dividers

• Latency of I-cache/D-cache, memory and TLB

• Record statistics by text address

Specifying Cache Configurations

• all caches and TLB configurations specified with same format:
    <name>:<nsets>:<bsize>:<assoc>:<repl>
• where:
    <name>   cache name (make this unique)
    <nsets>  number of sets
    <bsize>  block size (for a TLB, the page size)
    <assoc>  associativity (number of “ways”)
    <repl>   set replacement policy: l for LRU, f for FIFO, r for RANDOM
• examples:
    il1:1024:32:2:l    2-way set-assoc 64k-byte cache, LRU
    dtlb:1:4096:64:r   64-entry fully assoc TLB w/ 4k pages, random replacement
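The format is easy to check by parsing it and multiplying the fields out. The sketch below is a hypothetical helper, not the option parser shipped with the simulators: `cache_cfg`, `cache_cfg_parse` and `cache_cfg_bytes` are illustrative names, and the 15-character limit on the name is an arbitrary choice.

```c
#include <stdio.h>

typedef struct {
    char name[16];
    int  nsets;
    int  bsize;
    int  assoc;
    char repl;     /* 'l' (LRU), 'f' (FIFO) or 'r' (random) */
} cache_cfg;

/* Parse "<name>:<nsets>:<bsize>:<assoc>:<repl>". Returns 1 on success. */
static int cache_cfg_parse(const char *s, cache_cfg *c)
{
    if (sscanf(s, "%15[^:]:%d:%d:%d:%c",
               c->name, &c->nsets, &c->bsize, &c->assoc, &c->repl) != 5)
        return 0;
    if (c->nsets <= 0 || c->bsize <= 0 || c->assoc <= 0)
        return 0;
    return c->repl == 'l' || c->repl == 'f' || c->repl == 'r';
}

/* Total capacity in bytes: sets x block size x ways. */
static long cache_cfg_bytes(const cache_cfg *c)
{
    return (long)c->nsets * c->bsize * c->assoc;
}
```

For il1:1024:32:2:l this gives 1024 × 32 × 2 = 65536 bytes, the 64k-byte cache of the example; for dtlb:1:4096:64:r the single set with 64 ways is what makes the TLB fully associative with 64 entries.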

Specifying Cache Hierarchies

• specify all cache parameters if no unified levels exist, e.g.,
    -cache:il1 il1:128:64:1:l -cache:il2 il2:128:64:4:l
    -cache:dl1 dl1:256:32:1:l -cache:dl2 dl2:1024:64:2:l
  [Figure: two separate hierarchies, il1 → il2 and dl1 → dl2]
• to unify any level of the hierarchy, “point” an I-cache level into the data cache hierarchy:
    -cache:il1 il1:128:64:1:l -cache:il2 dl2
    -cache:dl1 dl1:256:32:1:l -cache:dl2 ul2:1024:64:2:l
  [Figure: il1 and dl1 both backed by a unified ul2]

Specifying the Branch Predictor

• specifying the branch predictor type:
    -bpred <type>
  the supported predictor types are:
    nottaken  always predict not taken
    taken     always predict taken
    perfect   perfect predictor
    bimod     bimodal predictor (BTB w/ 2 bit counters)
    2lev      2-level adaptive predictor
• configuring the bimodal predictor (only useful when “-bpred bimod” is specified):
    -bpred:bimod <size>   size of direct-mapped BTB
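A bimodal predictor is a table of 2-bit saturating counters indexed by the branch address. The sketch below illustrates that idea only; it is not the code in bpred.c. `bimod`, `BIMOD_SIZE` and the helper functions are hypothetical, the table size is fixed at compile time rather than taken from `-bpred:bimod <size>`, and the index drops three low address bits on the assumption of 8-byte (64-bit) instructions.

```c
#include <stdbool.h>
#include <stdint.h>

#define BIMOD_SIZE 2048u   /* table entries, power of two (assumed) */

typedef struct {
    uint8_t ctr[BIMOD_SIZE];   /* 2-bit counters: 0,1 = not taken; 2,3 = taken */
} bimod;

static unsigned bimod_index(uint32_t pc)
{
    return (pc >> 3) & (BIMOD_SIZE - 1);   /* direct-mapped lookup */
}

static bool bimod_predict(const bimod *b, uint32_t pc)
{
    return b->ctr[bimod_index(pc)] >= 2;
}

static void bimod_update(bimod *b, uint32_t pc, bool taken)
{
    uint8_t *c = &b->ctr[bimod_index(pc)];

    if (taken) {
        if (*c < 3) (*c)++;    /* saturate at 3 */
    } else {
        if (*c > 0) (*c)--;    /* saturate at 0 */
    }
}
```

The two-bit width gives hysteresis: once a counter is saturated at 3, a single not-taken outcome does not flip the prediction.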

Specifying the Branch Predictor (cont.)

• configuring the 2-level adaptive predictor (only useful when “-bpred 2lev” is specified):
    -bpred:2lev <l1size> <l2size> <hist_size>
  where:
    <l1size>     size of the first level table
    <l2size>     size of the second level table
    <hist_size>  history (pattern) width

[Figure: the branch address selects one of l1size pattern history entries, each hist_size bits wide; the pattern history selects one of l2size 2-bit predictors, which gives the branch prediction]
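The roles of the three parameters can be sketched with a small two-level predictor. Again this is an illustration under assumptions, not the implementation in bpred.c: `twolev` and its helpers are hypothetical, the three sizes are compile-time constants instead of command-line values, the second-level index is simply the history pattern, and branch addresses are shifted by three bits on the assumption of 8-byte instructions.

```c
#include <stdbool.h>
#include <stdint.h>

#define L1SIZE    4u     /* first-level history registers (power of two) */
#define L2SIZE    256u   /* second-level 2-bit counters (power of two) */
#define HIST_SIZE 8u     /* history bits kept per register */

typedef struct {
    uint32_t hist[L1SIZE];   /* first level: branch outcome histories */
    uint8_t  ctr[L2SIZE];    /* second level: 2-bit saturating counters */
} twolev;

static unsigned twolev_l1_index(uint32_t pc)
{
    return (pc >> 3) & (L1SIZE - 1);
}

static unsigned twolev_l2_index(const twolev *p, uint32_t pc)
{
    return p->hist[twolev_l1_index(pc)] & (L2SIZE - 1);
}

static bool twolev_predict(const twolev *p, uint32_t pc)
{
    return p->ctr[twolev_l2_index(p, pc)] >= 2;
}

static void twolev_update(twolev *p, uint32_t pc, bool taken)
{
    uint8_t  *c = &p->ctr[twolev_l2_index(p, pc)];
    uint32_t *h = &p->hist[twolev_l1_index(pc)];

    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
    /* shift the outcome into the history, keeping HIST_SIZE bits */
    *h = ((*h << 1) | (taken ? 1u : 0u)) & ((1u << HIST_SIZE) - 1u);
}
```

Unlike a bimodal table, this structure can learn a strictly alternating taken/not-taken branch, because the two history patterns that precede each outcome train two different counters.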