Transcript of MICROPROCESSOR ARCHITECTURE (ECE519), Prof. Lynn Choi. Presented by Sam.

Page 1:

Forwardflow: A Scalable Core for Power-Constrained CMPs

Dan Gibson and David A. Wood, University of Wisconsin—Madison

ISCA, June 2010

MICROPROCESSOR ARCHITECTURE (ECE519)

Prof. Lynn Choi
Presented by Sam

Page 2:

Outline

• Main idea
• Motivation
• Challenges
• Core design
  • A method for representing inter-instruction data dependences
  • Forwardflow – Dataflow Queue (DQ)
  • Forwardflow architecture
• Related Work
• Conclusion & problems

Page 3:

Motivation and Challenges

Consider this vision: microarchitects hope to improve applications’ overall efficiency by focusing on thread-level parallelism (TLP), rather than instruction-level parallelism (ILP) within a single thread.

Page 4:

Challenges

Parallel speedup is limited by the parallel fraction, i.e., only ~10x speedup at N = 512, f = 90%:

$\text{Speedup} = \left[(1 - f) + \frac{f}{N}\right]^{-1}$

Two fundamental problems:
• Amdahl’s Law
• Can all cores simultaneously operate at full speed?

Adding more cores yields only a small additional speedup.
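To make the ~10x figure concrete, here is a minimal Python sketch (illustrative only, not part of the original slides) that evaluates the Amdahl's Law expression above:

```python
# Minimal sketch: Amdahl's Law, Speedup = [(1 - f) + f/N]^(-1).
def amdahl_speedup(f, n):
    """Speedup for parallel fraction f on n cores."""
    return 1.0 / ((1.0 - f) + f / n)

print(round(amdahl_speedup(0.90, 512), 1))  # ~9.8x at N=512, f=90%
print(round(amdahl_speedup(0.90, 8), 1))    # already ~4.7x at N=8
```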

Page 5:

Challenges

Simultaneously Active Fraction (SAF): “the fraction of the entire chip resources that can be active simultaneously.”

Power delivery and heat dissipation impose physical limits. In the long term, to maintain fixed power and area budgets as technology scales, the fraction of active transistors must decrease with each technology generation.

Two fundamental problems:
• Amdahl’s Law
• Can all cores simultaneously operate at full speed?
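A back-of-the-envelope sketch of why SAF shrinks (the scaling factors below are illustrative assumptions, not numbers from the slides): if transistor count doubles each generation while per-transistor switching power improves more slowly, a fixed power budget covers a smaller and smaller active fraction.

```python
# Illustrative assumptions: 2x transistors per generation, but only ~1.4x
# improvement in per-transistor switching power, under a fixed power budget.
TRANSISTOR_GROWTH = 2.0
POWER_IMPROVEMENT = 1.4

saf = 1.0  # start with the whole chip active
for gen in range(1, 5):
    saf *= POWER_IMPROVEMENT / TRANSISTOR_GROWTH
    print(f"+{gen} generations: SAF ~ {saf:.2f}")
# prints roughly 0.70, 0.49, 0.34, 0.24: a growing fraction must stay dark
```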

Page 6:

Motivation

• For single-thread performance: exploit ILP

• For multiple threads: save power, exploit TLP

Page 7:

CMPs will need Scalable Cores

Scale UP for performance: use more resources for more performance (allowing single-threaded applications to aggressively exploit ILP and MLP to the limits of available power).

Resources: e.g., cores, caches, hardware accelerators, etc.

Scale DOWN

Motivation

A scalable core is a processor capable of operating in several different configurations, each offering a different power/performance point.

Page 8:

CMPs will need Scalable Cores: Scale DOWN for energy conservation

Exploit TLP with many small cores; when power is constrained, scalable cores can scale down to conserve per-core energy.

Motivation

Page 9:

Scalable Cores

A CMP equipped with scalable cores: scaled up to run a few threads quickly (left), and scaled down to run many threads in parallel (right).

Scalable cores have the potential to adapt their behavior to best match their current workload and operating conditions.

Page 10:

Ideas: Forwardflow, a new scalable core μArch

• Uses pointers
• Distributes values
• Scales to large instruction window sizes
• Full-window scheduler
• Scales dynamically: variable-sized instruction window

Page 11:

Forwardflow architecture

Problem: in a scalable core, resource allocation changes over time. Designers of scalable cores should avoid structures that are difficult to scale, like centralized register files and bypassing networks. This work focuses on scaling the window size.

Page 12:

Serialized Successor Representation (SSR)

• A method for representing inter-instruction data dependences, called Serialized Successor Representation (SSR).

• Instead of maintaining value names, SSR describes values’ relationships to operands of other instructions.

• Instructions in SSR are represented as three-operand tuples: SOURCE1 (S1), SOURCE2 (S2), and DESTINATION (D).

• Each operand consists of a value and a successor pointer.

• Operand pointers are used to represent data dependences.

Page 13:

Serialized Successor Representation (SSR)

(DQ entry operand fields: D, S1, S2)

• The pointer field of the producing instruction’s D-operand designates the first successor operand, usually the S1- or S2-operand of a later instruction.

• If a second successor exists, the pointer field of the first successor operand designates the location of the second successor operand.

• The locations of subsequent operands are encoded in a linked-list fashion, relying on the pointer at successor i to designate the location of successor i+1.

Page 14:

Serialized Successor Representation (SSR)

(Figure: data dependences form distributed chains of pointers; each chain ends with a NULL pointer.)

Pros:
• Does not rely on renaming
• Never requires a search or broadcast operation to locate a successor for any dynamic value
• Can be built from simple SRAMs
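A minimal software sketch of SSR (assumed Python field names standing in for SRAM fields, not the paper's hardware): each entry holds three operands, and each operand carries a value slot plus a successor pointer, so a register's consumers form a linked list.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

# A pointer names (DQ index, operand field), e.g. (2, "S1"); None plays the
# role of the NULL pointer that terminates a chain.
Ptr = Optional[Tuple[int, str]]

@dataclass
class Operand:
    value: Optional[int] = None  # value slot, filled at dispatch or on wakeup
    succ: Ptr = None             # pointer to the next successor operand

@dataclass
class SSREntry:
    op: str                                       # opcode, for readability
    S1: Operand = field(default_factory=Operand)
    S2: Operand = field(default_factory=Operand)
    D: Operand = field(default_factory=Operand)

def append_successor(dq, tail: Ptr, new: Ptr) -> Ptr:
    """Link a new consumer behind the current chain tail; return the new tail."""
    idx, fld = tail
    getattr(dq[idx], fld).succ = new
    return new

# Tiny demo: start an R3 chain at the ld's destination, append the add's S1.
dq = [SSREntry("ld"), SSREntry("add")]
tail = (0, "D")
tail = append_successor(dq, tail, (1, "S1"))
print(dq[0].D.succ)  # (1, 'S1'): ld's destination points at its first successor
```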

Page 15:

Forwardflow – Dataflow Queue (DQ)

• Instructions, values, and data dependences reside in a distributed Dataflow Queue (DQ)

• The DQ is composed of independent banks and pipelines, which can be activated or deactivated by system software to scale a core’s execution resources.
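A toy sketch of that scaling knob (the bank count and entries per bank below are assumptions for illustration, not figures quoted on this slide): the effective window size is simply the number of active banks times the entries per bank.

```python
class DataflowQueue:
    """Toy model: independent DQ banks that system software can turn on or off."""

    def __init__(self, total_banks=8, entries_per_bank=32):  # assumed sizes
        self.total_banks = total_banks
        self.entries_per_bank = entries_per_bank
        self.active_banks = total_banks

    def scale(self, active_banks):
        """Activate or deactivate banks to scale the core up or down."""
        assert 1 <= active_banks <= self.total_banks
        self.active_banks = active_banks

    @property
    def window_size(self):
        return self.active_banks * self.entries_per_bank

dq = DataflowQueue()
dq.scale(2)              # scale down when power is constrained
print(dq.window_size)    # 64-entry window with 2 of 8 banks active
```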

Page 16:

Forwardflow architecture

Page 17:

Fetch

• Read instructions from the L1-I cache
• Predict branches
• Pass on to the Decode stage

Fetch proceeds no differently than in other high-performance microarchitectures.

Page 18:

Decode

Determine to which pointer chains, if any, each instruction belongs. It does this using the Register Consumer Table (the RCT resembles a traditional rename table).The RCT is implemented as an SRAM-based table. The RCT also identifies registers last written by a committed instructionDecode detects and handles potential

data dependences, analogous to traditional renaming.
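A minimal sketch of that lookup (a hypothetical Python dictionary standing in for the SRAM-based RCT): for each register it records either that the value is already committed to the ARF, or which in-flight DQ operand last referenced it, i.e. the current tail of that register's pointer chain.

```python
# Hypothetical RCT model: reg -> ("ARF", None) when last written by a committed
# instruction, or ("DQ", (index, field)) naming the chain tail in the window.
rct = {"R1": ("ARF", None)}   # registers absent from the table default to ARF

def decode_source(reg, dq_index, fld):
    """Decide where a source operand gets its value or chain link; if the
    producer is in flight, this operand becomes the register's new chain tail."""
    where, ptr = rct.get(reg, ("ARF", None))
    if where == "DQ":
        rct[reg] = ("DQ", (dq_index, fld))
        return ("append after", ptr)
    return ("read ARF", reg)

def decode_dest(reg, dq_index):
    """A new producer's destination becomes the head (and tail) of reg's chain."""
    rct[reg] = ("DQ", (dq_index, "D"))

decode_dest("R3", 0)                 # ld at DQ[0] produces R3
print(decode_source("R1", 0, "S1"))  # ('read ARF', 'R1'): ready at dispatch
print(decode_source("R3", 1, "S1"))  # ('append after', (0, 'D')): link behind ld
```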

Page 19:

Dispatch

Dispatch inserts instructions into the Dataflow Queue (DQ); instructions issue when their operands become available.

Page 20:

Dispatched/Executing

• The ld instruction is ready to issue because both source operands are available in the ARF.

• Decode updates the RCT to indicate that the ld produces R3.

Dispatch reads the ARF to obtain R1’s value, writes both operands into the DQ, and issues the ld.

Page 21:

When the add is decoded, it consults the RCT and finds that R3’s previous use was as the ld’s destination field.

• Dispatch updates the pointer in the ld’s destination to point to the add’s first source operand.

• The add’s immediate operand (55) is written into the DQ at dispatch.

Page 22:

The mult’s decode consults the RCT and discovers that both operands, R3 and R4, are not yet available and were last referenced by the add’s source 1 operand and the add’s destination operand, respectively.

Dispatch of the mult therefore checks for available results in both the add’s source 1 value array and destination value array, and appends the mult to R3’s and R4’s pointer chains.

Page 23:

The sub appends itself to the R3 pointer chain, and writes its dispatch-time-ready operand (66) into the DQ.

Page 24:

Wakeup, Selection, and Issue

• On completion of the ld, the memory value (99) is written into the DQ.

• The ld’s destination pointer is followed to the first successor.

Page 25:

Wakeup, Selection, and Issue

• The add’s metadata and source 2 value are read and, coupled with the arriving value of 99, the add can now be issued.

• The update hardware reads the add’s source 1 pointer, discovering the mult as the next successor.

Page 26:

Wakeup, Selection, and Issue

• The mult’s metadata, other source operand, and next pointer field are read.

• The source 1 operand is still unavailable, so the mult will issue at a later time.

Page 27:

Wakeup, Selection, and Issue

Finally, following the mult’s source 2 pointer to the sub delivers 99 to the sub’s first operand, enabling the sub to issue.
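Putting pages 20 through 27 together, here is a hedged, self-contained Python model of the walkthrough (field names, the RCT encoding, the ld's assumed immediate offset, and the mult's and sub's destination registers R5/R6 are assumptions for illustration; the real design uses SRAM banks and pipelined chain traversal). It rebuilds the ld/add/mult/sub pointer chains at dispatch and then walks the R3 chain when the ld completes:

```python
# Toy Forwardflow-style window: dispatch builds SSR pointer chains via an RCT,
# wakeup walks a chain delivering the produced value to each successor.
DQ  = []          # list of entries; index = DQ slot
RCT = {}          # reg -> ("ARF", None) or ("DQ", (index, field)) = chain tail
ARF = {"R1": 7}   # assumed committed value for the ld's address register

def entry(op):
    return {"op": op,
            "S1": {"value": None, "succ": None},
            "S2": {"value": None, "succ": None},
            "D":  {"value": None, "succ": None}}

def dispatch(op, dest, srcs):
    """srcs maps an operand field to a register name or ('imm', value).
    Simplification: a source is either ready in the ARF or appended to a chain."""
    idx = len(DQ)
    e = entry(op)
    DQ.append(e)
    for fld, src in srcs.items():
        if isinstance(src, tuple):                  # immediate: ready at dispatch
            e[fld]["value"] = src[1]
        else:
            where, tail = RCT.get(src, ("ARF", None))
            if where == "ARF":
                e[fld]["value"] = ARF[src]          # read the ARF at dispatch
            else:                                   # producer in flight: append
                ti, tf = tail
                DQ[ti][tf]["succ"] = (idx, fld)
                RCT[src] = ("DQ", (idx, fld))       # this operand is the new tail
    if dest:
        RCT[dest] = ("DQ", (idx, "D"))              # destination heads a new chain
    return idx

def wakeup(idx, value):
    """The producer at DQ[idx] completed: walk its chain, delivering the value."""
    DQ[idx]["D"]["value"] = value
    ptr = DQ[idx]["D"]["succ"]
    while ptr is not None:                          # NULL pointer ends the walk
        i, fld = ptr
        DQ[i][fld]["value"] = value
        other = "S2" if fld == "S1" else "S1"
        ready = DQ[i][other]["value"] is not None
        print(f"{DQ[i]['op']}: receives {value} in {fld} and "
              f"{'issues' if ready else 'waits for its other source'}")
        ptr = DQ[i][fld]["succ"]                    # follow to the next successor

# Build the window from the slides (R5/R6 destinations are hypothetical).
ld   = dispatch("ld",   "R3", {"S1": "R1", "S2": ("imm", 0)})
add  = dispatch("add",  "R4", {"S1": "R3", "S2": ("imm", 55)})
mult = dispatch("mult", "R5", {"S1": "R4", "S2": "R3"})
sub  = dispatch("sub",  "R6", {"S1": "R3", "S2": ("imm", 66)})

wakeup(ld, 99)   # the add and sub issue as 99 arrives; the mult waits for R4
```

As on the slides, the R3 chain runs ld.D to add.S1 to mult.S2 to sub.S1, so the add and sub issue when 99 arrives while the mult keeps waiting for R4.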

Page 28:

Methodology: target machine

On each tile resides a single core, a private L1-I cache (32 KB), a private write-through write-invalidate L1-D cache (32 KB), a private L2 cache (1 MB) which manages coherence in the L1-D via inclusion, and one bank of a shared L3 cache. It is assumed that cores and private caches can be powered off without affecting the shared L3; the L3 operates in its own voltage domain.

Page 29:

Related Work

Scalable Schedulers:
• Direct Instruction Wakeup [Ramirez04]: the scheduler has a pointer to the first successor; a secondary table holds a matrix of successors.
• Hybrid Wakeup [Huang02]: the scheduler has a pointer to the first successor; each entry has a broadcast bit for multiple successors.
• Half Price [Kim02]: slices the scheduler in half; the second operand is often unneeded.

Page 30:

Related Work

Dataflow & Distributed Machines:
• Tagged-Token [Arvind90]: values (tokens) flow to successors.
• TRIPS [Sankaralingam03]: discrete execution tiles (X, RF, $, etc.); EDGE ISA.
• Clustered Designs [e.g., Palacharla97]: independent execution queues.

Page 31:

Conclusion and problems

Conclusion: Forwardflow allows the system to trade off power and performance.

Problems:
• What happens if the number of DQ banks is larger (>8) or smaller (<8)?
• We have no idea how software must change to accommodate concurrency.