Transcript of MICROPROCESSOR ARCHITECTURE (ECE519), Prof. Lynn Choi. Presented by Sam.

Page 1:

Forwardflow: A Scalable Core for Power-Constrained CMPs

Dan Gibson and David A. Wood, University of Wisconsin—Madison

ISCA, June 2010

MICROPROCESSOR ARCHITECTURE (ECE519)

Prof. Lynn Choi
Presented by Sam

Page 2:

Outline

• Main idea
• Motivation
• Challenges
• Core design
  • A method for representing inter-instruction data dependences
  • Forwardflow – Dataflow Queue (DQ)
  • Forwardflow architecture
• Related Work
• Conclusion & problems

Page 3:

Motivation and Challenges

Consider this vision: microarchitects hope to improve applications’ overall efficiency by focusing on thread-level parallelism (TLP), rather than instruction-level parallelism (ILP) within a single thread.

Page 4:

Challenges

Parallel speedup is limited by the parallel fraction, i.e., only ~10x speedup at N = 512, f = 90%:

$\text{Speedup} = \left[(1 - f) + \frac{f}{N}\right]^{-1}$

Two fundamental problems:
• Amdahl’s Law
• Can all cores simultaneously operate at full speed?

Adding more cores yields only a small additional speedup.
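To make the ~10x figure concrete, here is a minimal Python sketch (illustrative only, not part of the original slides) that evaluates the Amdahl's Law expression above:

```python
# Minimal sketch: Amdahl's Law, Speedup = [(1 - f) + f/N]^(-1).
def amdahl_speedup(f, n):
    """Speedup for parallel fraction f on n cores."""
    return 1.0 / ((1.0 - f) + f / n)

print(round(amdahl_speedup(0.90, 512), 1))  # ~9.8x at N=512, f=90%
print(round(amdahl_speedup(0.90, 8), 1))    # already ~4.7x at N=8
```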

Page 5:

Challenges

Simultaneously Active Fraction (SAF): “the fraction of the entire chip resources that can be active simultaneously.”

Power delivery and heat dissipation impose physical limits. In the long term, to maintain fixed power and area budgets as technology scales, the fraction of active transistors must decrease with each technology generation.

Two fundamental problems:
• Amdahl’s Law
• Can all cores simultaneously operate at full speed?
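A back-of-the-envelope sketch of why SAF shrinks (the scaling factors below are illustrative assumptions, not numbers from the slides): if transistor count doubles each generation while per-transistor switching power improves more slowly, a fixed power budget covers a smaller and smaller active fraction.

```python
# Illustrative assumptions: 2x transistors per generation, but only ~1.4x
# improvement in per-transistor switching power, under a fixed power budget.
TRANSISTOR_GROWTH = 2.0
POWER_IMPROVEMENT = 1.4

saf = 1.0  # start with the whole chip active
for gen in range(1, 5):
    saf *= POWER_IMPROVEMENT / TRANSISTOR_GROWTH
    print(f"+{gen} generations: SAF ~ {saf:.2f}")
# prints roughly 0.70, 0.49, 0.34, 0.24: a growing fraction must stay dark
```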

Page 6:

Motivation

• For single-thread performance: exploit ILP

• For multiple threads: save power, exploit TLP

Page 7:

CMPs will need Scalable Cores

Scale UP for performance: use more resources for more performance (allowing single-threaded applications to aggressively exploit ILP and MLP to the limits of available power).

Resources: e.g., cores, caches, hardware accelerators, etc.

Scale DOWN

Motivation

A scalable core is a processor capable of operating in several different configurations, each offering a different power/performance point.

Page 8:

CMPs will need Scalable Cores: Scale DOWN for energy conservation

Exploit TLP with many small cores; when power is constrained, scalable cores can scale down to conserve per-core energy.

Motivation

Page 9:

Scalable Cores

A CMP equipped with scalable cores: scaled up to run a few threads quickly (left), and scaled down to run many threads in parallel (right).

Scalable cores have the potential to adapt their behavior to best match their current workload and operating conditions.

Page 10:

Ideas: Forwardflow, a new scalable core μArch

• Uses pointers
• Distributes values
• Scales to large instruction window sizes
• Full-window scheduler
• Scales dynamically: variable-sized instruction window

Page 11:

Forwardflow architecture

Problem: in a scalable core, resource allocation changes over time. Designers of scalable cores should avoid structures that are difficult to scale, like centralized register files and bypassing networks. This work focuses on scaling the window size.

Page 12:

Serialized Successor Representation (SSR)

• A method for representing inter-instruction data dependences, called Serialized Successor Representation (SSR).

• Instead of maintaining value names, SSR describes values’ relationships to operands of other instructions.

• Instructions in SSR are represented as three-operand tuples: SOURCE1 (S1), SOURCE2 (S2), and DESTINATION (D).

• Each operand consists of a value and a successor pointer.

• Operand pointers are used to represent data dependences.

Page 13:

Serialized Successor Representation (SSR)

(DQ entry operand fields: D, S1, S2)

• The pointer field of the producing instruction’s D-operand designates the first successor operand, usually the S1- or S2-operand of a later instruction.

• If a second successor exists, the pointer field of the first successor operand designates the location of the second successor operand.

• The locations of subsequent operands are encoded in a linked-list fashion, relying on the pointer at successor i to designate the location of successor i+1.

Page 14:

Serialized Successor Representation (SSR)

(Figure: data dependences form distributed chains of pointers; each chain ends with a NULL pointer.)

Pros:
• Does not rely on renaming
• Never requires a search or broadcast operation to locate a successor for any dynamic value
• Can be built from simple SRAMs
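A minimal software sketch of SSR (assumed Python field names standing in for SRAM fields, not the paper's hardware): each entry holds three operands, and each operand carries a value slot plus a successor pointer, so a register's consumers form a linked list.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

# A pointer names (DQ index, operand field), e.g. (2, "S1"); None plays the
# role of the NULL pointer that terminates a chain.
Ptr = Optional[Tuple[int, str]]

@dataclass
class Operand:
    value: Optional[int] = None  # value slot, filled at dispatch or on wakeup
    succ: Ptr = None             # pointer to the next successor operand

@dataclass
class SSREntry:
    op: str                                       # opcode, for readability
    S1: Operand = field(default_factory=Operand)
    S2: Operand = field(default_factory=Operand)
    D: Operand = field(default_factory=Operand)

def append_successor(dq, tail: Ptr, new: Ptr) -> Ptr:
    """Link a new consumer behind the current chain tail; return the new tail."""
    idx, fld = tail
    getattr(dq[idx], fld).succ = new
    return new

# Tiny demo: start an R3 chain at the ld's destination, append the add's S1.
dq = [SSREntry("ld"), SSREntry("add")]
tail = (0, "D")
tail = append_successor(dq, tail, (1, "S1"))
print(dq[0].D.succ)  # (1, 'S1'): ld's destination points at its first successor
```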

Page 15:

Forwardflow – Dataflow Queue (DQ)

• Instructions, values, and data dependences reside in a distributed Dataflow Queue (DQ)

• The DQ is composed of independent banks and pipelines, which can be activated or deactivated by system software to scale a core’s execution resources.
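A toy sketch of that scaling knob (the bank count and entries per bank below are assumptions for illustration, not figures quoted on this slide): the effective window size is simply the number of active banks times the entries per bank.

```python
class DataflowQueue:
    """Toy model: independent DQ banks that system software can turn on or off."""

    def __init__(self, total_banks=8, entries_per_bank=32):  # assumed sizes
        self.total_banks = total_banks
        self.entries_per_bank = entries_per_bank
        self.active_banks = total_banks

    def scale(self, active_banks):
        """Activate or deactivate banks to scale the core up or down."""
        assert 1 <= active_banks <= self.total_banks
        self.active_banks = active_banks

    @property
    def window_size(self):
        return self.active_banks * self.entries_per_bank

dq = DataflowQueue()
dq.scale(2)              # scale down when power is constrained
print(dq.window_size)    # 64-entry window with 2 of 8 banks active
```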

Page 16:

Forwardflow architecture

Page 17:

Fetch

• Read instructions from the L1-I cache
• Predict branches
• Pass on to the Decode stage

Fetch proceeds no differently than in other high-performance microarchitectures.

Page 18:

Decode

Determine to which pointer chains, if any, each instruction belongs. It does this using the Register Consumer Table (the RCT resembles a traditional rename table).The RCT is implemented as an SRAM-based table. The RCT also identifies registers last written by a committed instructionDecode detects and handles potential

data dependences, analogous to traditional renaming.
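A minimal sketch of that lookup (a hypothetical Python dictionary standing in for the SRAM-based RCT): for each register it records either that the value is already committed to the ARF, or which in-flight DQ operand last referenced it, i.e. the current tail of that register's pointer chain.

```python
# Hypothetical RCT model: reg -> ("ARF", None) when last written by a committed
# instruction, or ("DQ", (index, field)) naming the chain tail in the window.
rct = {"R1": ("ARF", None)}   # registers absent from the table default to ARF

def decode_source(reg, dq_index, fld):
    """Decide where a source operand gets its value or chain link; if the
    producer is in flight, this operand becomes the register's new chain tail."""
    where, ptr = rct.get(reg, ("ARF", None))
    if where == "DQ":
        rct[reg] = ("DQ", (dq_index, fld))
        return ("append after", ptr)
    return ("read ARF", reg)

def decode_dest(reg, dq_index):
    """A new producer's destination becomes the head (and tail) of reg's chain."""
    rct[reg] = ("DQ", (dq_index, "D"))

decode_dest("R3", 0)                 # ld at DQ[0] produces R3
print(decode_source("R1", 0, "S1"))  # ('read ARF', 'R1'): ready at dispatch
print(decode_source("R3", 1, "S1"))  # ('append after', (0, 'D')): link behind ld
```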

Page 19:

Dispatch

Dispatch inserts instructions into the Dataflow Queue (DQ); instructions issue when their operands become available.

Page 20:

Dispatched/Executing

• The ld instruction is ready to issue because both source operands are available in the ARF.

• Decode updates the RCT to indicate that the ld produces R3.

Dispatch reads the ARF to obtain R1’s value, writes both operands into the DQ, and issues the ld.

Page 21:

When the add is decoded, it consults the RCT and finds that R3’s previous use was as the ld’s destination field.

• Dispatch updates the pointer in the ld’s destination to point to the add’s first source operand.

• The add’s immediate operand (55) is written into the DQ at dispatch.

Page 22:

The mult’s decode consults the RCT and discovers that both operands, R3 and R4, are not yet available and were last referenced by the add’s source 1 operand and the add’s destination operand, respectively.

Dispatch of the mult therefore checks for available results in both the add’s source 1 value array and destination value array, and appends the mult to R3’s and R4’s pointer chains.

Page 23:

The sub appends itself to the R3 pointer chain, and writes its dispatch-time-ready operand (66) into the DQ.

Page 24:

Wakeup, Selection, and Issue

• On completion of the ld, the memory value (99) is written into the DQ.

• The ld’s destination pointer is followed to the first successor.

Page 25:

Wakeup, Selection, and Issue

• The add’s metadata and source 2 value are read and, coupled with the arriving value of 99, the add can now be issued.

• The update hardware reads the add’s source 1 pointer, discovering the mult as the next successor.

Page 26:

Wakeup, Selection, and Issue

• The mult’s metadata, other source operand, and next pointer field are read.

• The source 1 operand is still unavailable, so the mult will issue at a later time.

Page 27:

Wakeup, Selection, and Issue

Finally, following the mult’s source 2 pointer to the sub delivers 99 to the sub’s first operand, enabling the sub to issue.
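Putting pages 20 through 27 together, here is a hedged, self-contained Python model of the walkthrough (field names, the RCT encoding, the ld's assumed immediate offset, and the mult's and sub's destination registers R5/R6 are assumptions for illustration; the real design uses SRAM banks and pipelined chain traversal). It rebuilds the ld/add/mult/sub pointer chains at dispatch and then walks the R3 chain when the ld completes:

```python
# Toy Forwardflow-style window: dispatch builds SSR pointer chains via an RCT,
# wakeup walks a chain delivering the produced value to each successor.
DQ  = []          # list of entries; index = DQ slot
RCT = {}          # reg -> ("ARF", None) or ("DQ", (index, field)) = chain tail
ARF = {"R1": 7}   # assumed committed value for the ld's address register

def entry(op):
    return {"op": op,
            "S1": {"value": None, "succ": None},
            "S2": {"value": None, "succ": None},
            "D":  {"value": None, "succ": None}}

def dispatch(op, dest, srcs):
    """srcs maps an operand field to a register name or ('imm', value).
    Simplification: a source is either ready in the ARF or appended to a chain."""
    idx = len(DQ)
    e = entry(op)
    DQ.append(e)
    for fld, src in srcs.items():
        if isinstance(src, tuple):                  # immediate: ready at dispatch
            e[fld]["value"] = src[1]
        else:
            where, tail = RCT.get(src, ("ARF", None))
            if where == "ARF":
                e[fld]["value"] = ARF[src]          # read the ARF at dispatch
            else:                                   # producer in flight: append
                ti, tf = tail
                DQ[ti][tf]["succ"] = (idx, fld)
                RCT[src] = ("DQ", (idx, fld))       # this operand is the new tail
    if dest:
        RCT[dest] = ("DQ", (idx, "D"))              # destination heads a new chain
    return idx

def wakeup(idx, value):
    """The producer at DQ[idx] completed: walk its chain, delivering the value."""
    DQ[idx]["D"]["value"] = value
    ptr = DQ[idx]["D"]["succ"]
    while ptr is not None:                          # NULL pointer ends the walk
        i, fld = ptr
        DQ[i][fld]["value"] = value
        other = "S2" if fld == "S1" else "S1"
        ready = DQ[i][other]["value"] is not None
        print(f"{DQ[i]['op']}: receives {value} in {fld} and "
              f"{'issues' if ready else 'waits for its other source'}")
        ptr = DQ[i][fld]["succ"]                    # follow to the next successor

# Build the window from the slides (R5/R6 destinations are hypothetical).
ld   = dispatch("ld",   "R3", {"S1": "R1", "S2": ("imm", 0)})
add  = dispatch("add",  "R4", {"S1": "R3", "S2": ("imm", 55)})
mult = dispatch("mult", "R5", {"S1": "R4", "S2": "R3"})
sub  = dispatch("sub",  "R6", {"S1": "R3", "S2": ("imm", 66)})

wakeup(ld, 99)   # the add and sub issue as 99 arrives; the mult waits for R4
```

As on the slides, the R3 chain runs ld.D to add.S1 to mult.S2 to sub.S1, so the add and sub issue when 99 arrives while the mult keeps waiting for R4.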

Page 28:

Methodology: target machine

On each tile resides a single core, a private L1-I cache (32 KB), a private write-through write-invalidate L1-D cache (32 KB), a private L2 cache (1 MB) which manages coherence in the L1-D via inclusion, and one bank of a shared L3 cache. It is assumed that cores and private caches can be powered off without affecting the shared L3; the L3 operates in its own voltage domain.

Page 29:

Related Work

Scalable Schedulers:
• Direct Instruction Wakeup [Ramirez04]: the scheduler has a pointer to the first successor; a secondary table holds a matrix of successors.
• Hybrid Wakeup [Huang02]: the scheduler has a pointer to the first successor; each entry has a broadcast bit for multiple successors.
• Half Price [Kim02]: slices the scheduler in half; the second operand is often unneeded.

Page 30:

Related Work

Dataflow & Distributed Machines:
• Tagged-Token [Arvind90]: values (tokens) flow to successors.
• TRIPS [Sankaralingam03]: discrete execution tiles (X, RF, $, etc.); EDGE ISA.
• Clustered Designs [e.g., Palacharla97]: independent execution queues.

Page 31:

Conclusion and problems

Conclusion: Forwardflow allows the system to trade off power and performance.

Problems:
• What happens if the number of DQ banks is larger (>8) or smaller (<8)?
• We have no idea how software must change to accommodate concurrency.