Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures
Transcript of Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures
Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures

Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal
Laboratory for Computer Science, Massachusetts Institute of Technology
Motivation
As a thought experiment, let's examine the Itanium II, published in last year's ISSCC:
- 6-way issue integer unit: < 2% of die area
- Cache logic: > 50% of die area
Hypothetical Modification
Why not replace a small portion of the cache with additional issue units?

A "30-way" issue micro!
- Integer units still occupy less than 10% of the die area
- > 42% of the die is still cache logic
Can monolithic structures like this be attained at high frequency?
The 6-way integer unit in Itanium II already spends 50% of its critical path in bypassing.
[ISSCC 2002 – 25.6]
Even if dynamic logic or logarithmic circuits could be used to flatten the number of logic levels of these huge structures –
...wire delay is inescapable
[Figure: chip area reachable in 1 cycle, 180 nm vs. 45 nm]

Ultimately, wire delay limits the scalability of un-pipelined, high-frequency, centralized structures.
One solution: Chip multiprocessors
e.g., IBM’s two-core Power4
Research and commercial multiprocessors have been designed to scale to 1000s of ALUs.

These multiprocessors scale because they don't have any centralized resources.
Multiprocessors: Not Quite Appropriate for ILP
- High cost of inter-node operand routing
- Vast difference between local and remote communication costs (~30x): 10s to 100s of cycles to transfer the output of one instruction to the input of an instruction on another node
- ... which forces programmers and compilers to use entirely different algorithms at the two levels
An alternative to a CMP: a distributed microprocessor design
Such a microprocessor would distribute resources to varying degrees:
- Partitioned register files
- Partitioned ALU clusters
- Banked caches
- Multiple independent compute pipelines
- ... even multiple program counters
Some distributed microprocessor designs
Conventional
Alpha 21264 – integer clusters
Radical Proposals
UT Austin's Grid, Wisconsin's ILDP and Multiscalar, MIT's Raw and Scale, Dynamic Dataflow, TTA, Stanford Smart Memories, ...
Some distributed microprocessor designs

Interesting Secondary Development:

The centralized bypass network is being replaced by a more general, distributed, interconnection network!
Artist’s View
[Figure: four tiles, each with I$, RF, and D$, joined by a sophisticated interconnect; the distributed resources execute a dataflow graph: ld a, ld b, +, >> 3, *, st b]
How are these networks different from existing networks?
- Route scalar values, not multi-word packets
- Designed to join operands and operations in space:
  - ultra-low latency
  - ultra-low occupancy
  - unstructured communication patterns
In this paper, we call these networks “scalar operand networks”, whether centralized or distributed.
What can we do to gain insight aboutthe scalar operand networks?
Looking at existing systems and proposals:
Try to figure out what’s hard about these networks
Find a way to classify them
Gain a quantitative understanding
5 Challenges for Scalar Operand Networks
Delay Scalability
- ability of a design to maintain high frequencies as that design scales
Challenge 1: Delay Scalability

Intra-component: structures that grow as the system scales become bottlenecked by both interconnect delay and logic depth
- Register files, memories, selection logic, wakeup logic, ...

Solution: Pipeline the structure
- Turn propagation delay into pipeline latency. Example: the Pentium 4 pipelines regfile access

Solution: Tile
Challenge 1: Delay Scalability

Inter-component: the problem of wire delay between components. Occurs because it can take many cycles for remote components to communicate.

Each component must operate with only partial knowledge, so assign a time cost to the transfer of non-local information.
- Examples of non-local information: ALU outputs, stall signals, branch mispredicts, exceptions, memory dependence info
- Examples: Pentium 4 wires, 21264 integer clusters

Solution: Decentralize
5 Challenges for Scalar Operand Networks

Delay Scalability
- ability of a design to scale while maintaining high frequencies

Bandwidth Scalability
- ability of a design to scale without inordinately increasing the relative percentage of resources dedicated to interconnect
Challenge 2: Bandwidth Scalability

Global broadcasts don't scale
- Examples: snoopy caches, superscalar result buses

Problem: each node has to process incoming data proportional to the total number of nodes in the system. The delay can be pipelined à la the Alpha 21264, but each node still has to process too many incoming requests each cycle.

Imagine a 30-way issue superscalar where each ALU has its own register file copy: 30 writes per cycle!
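The broadcast pressure above can be put in simple arithmetic: when every node observes every result, per-node traffic grows with total issue width, while point-to-point delivery tracks only a result's consumers. A minimal sketch (the function names and the average-consumer figure are ours, for illustration):

```python
def broadcast_writes_per_node(num_alus):
    # Broadcast: every node must absorb every result produced each cycle,
    # so per-node write traffic grows linearly with total issue width.
    return num_alus

def p2p_writes_per_node(avg_consumers_per_result):
    # Point-to-point: a result is delivered only to its consumers, so
    # per-node traffic is independent of how many ALUs the system has.
    return avg_consumers_per_result

# The 30-way example above: 30 register-file writes per node per cycle.
assert broadcast_writes_per_node(30) == 30
# A scalar value typically has only a couple of consumers.
assert p2p_writes_per_node(2) == 2
```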
Solution: Switch to a directory scheme
- Replace the bus with a point-to-point network
- Replace broadcast with unicast or multicast
- Decimate the bandwidth requirement
Challenge 2: Bandwidth Scalability

A directory scheme for ILP?!!! Isn't that expensive?

Directories store dependence information; in other words, the locations where an instruction should send its result.

Fixed-assignment architecture: assign each static instruction to an ALU at compile time. Compile the dependent ALU locations in with the instructions; the directory is "looked up" locally when the instruction is fetched.
Dynamic-assignment architecture: harder. Somehow we have to figure out which ALU owns the dynamic instruction that we are sending to. A true directory lookup may be too expensive.
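In the fixed-assignment case, the "directory" is just the destination list the compiler attaches to each static instruction. A minimal sketch (the encoding, instruction labels, and tile numbers are illustrative, not Raw's actual format):

```python
# Each static instruction is placed on a tile at compile time, and the
# compiler records the tiles that consume its result. No runtime lookup
# is needed: the "directory entry" is fetched along with the instruction.
program = {
    "i1: ld a": {"tile": 0, "dests": [2]},     # result routed to tile 2
    "i2: ld b": {"tile": 1, "dests": [2, 3]},  # multicast to tiles 2 and 3
    "i3: mul":  {"tile": 2, "dests": [3]},
}

def destinations(instr):
    # The local "directory lookup": read the compiled destination list.
    return program[instr]["dests"]

assert destinations("i2: ld b") == [2, 3]
```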
5 Challenges for Scalar Operand Networks

Delay Scalability
- ability of a design to scale while maintaining high frequencies

Bandwidth Scalability
- ability of a design to scale without inordinately increasing the relative percentage of resources dedicated to interconnect

Deadlock and Starvation
- distributed systems need to worry about over-committing internal buffering; example: dynamic dataflow machines' "throttling"

Exceptional Events
- interrupts, branch mispredictions, exceptions
5 Challenges for Scalar Operand Networks

Delay Scalability
Bandwidth Scalability
Deadlock and Starvation
Exceptional Events

Efficient Operation-Operand Matching
- gather operands and operations to meet at some point in space to perform a dataflow computation
Challenge 5: Efficient Operation-Operand Matching
The rest of this talk!
If operation-operand matching is too expensive, there's little point to scaling.

Since this is so important, let's try to come up with a figure of merit for a scalar operand network -
What can we do to gain insight aboutscalar operand networks?
Looking at existing systems and proposals:
Try to figure out what’s hard about these networks
Find a way to classify the networks
Gain a quantitative understanding
Defining a figure of merit for operation-operand matching
5-tuple <SO, SL, NHL, RL, RO>:
Send Occupancy
Send Latency
Network Hop Latency
Receive Latency
Receive Occupancy
Tip: the ordering follows the timing of a message from sender to receiver
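As a first-order figure of merit, the five components add along an operand's path. A minimal cost sketch (assuming the terms simply sum, with `hops` the sender-receiver distance; the function name is ours):

```python
def operand_transport_cost(so, sl, nhl, rl, ro, hops):
    """Cycles from 'value produced' to 'value usable' for one operand,
    under the simplifying assumption that the five components add."""
    return so + sl + hops * nhl + rl + ro

# Example: a <0,1,1,1,0> network over a 3-hop route costs 5 cycles.
assert operand_transport_cost(0, 1, 1, 1, 0, hops=3) == 5
```

Note that the occupancy terms (SO, RO) also consume issue slots on the endpoints, so they are generally more costly than equal amounts of pure latency.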
The interesting region
conventional distributed multiprocessor: <10, 30, 5, 30, 40>
Superscalar: <0, 0, 0, 0, 0> (not scalable)
Raw: Experimental Vehicle

- 16 instructions per cycle (fp, int, br, ld/st, alu, ...)
- no centralized resources
- ~250 operand routes / cycle
- Two applicable on-chip networks: message passing, and a dedicated scalar operand network
- Scalability story: tiles registered on input; just add more tiles
- Simulations are for 64 tiles; the prototype has 16
Two points in the interesting region
conventional distributed multiprocessor: <10, 30, 5, 30, 40>
Raw / msg passing: <3, 2, 1, 1, 7>
Raw / scalar: <0, 1, 1, 1, 0>
Superscalar: <0, 0, 0, 0, 0> (not scalable)
Message Passing 5-tuple <3,2,1,1,7>
(Using Raw's on-chip message passing network)

Send side (send message):
- compute value
- send header
- send sequence #
- send value

Three wasted cycles per send: Send Occupancy = 3
Two cycles for the message to exit the processor: Send Latency = 2 (assumes an early commit point)
Messages take one cycle per hop: per-hop latency = 1
One cycle for the message to enter the processor: Receive Latency = 1

Receive side (demultiplex message):
- branch if set
- get sequence #
- compare #
- load tag
- branch if not eq
- use the value

Seven wasted cycles for receive: Receive Occupancy = 7 (minimum)
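The receive-side cost comes from software demultiplexing: before the value can be used, the node must recover which operand the message carries. A sketch of the dispatch that the instruction sequence above implies (hypothetical structure; the real code is hand-written assembly):

```python
# Each step below corresponds to wasted receive-side work: loading the
# tag, checking the sequence number, and branching to the right consumer
# all occupy issue slots that could have held real computation.
def demultiplex(message, expected_seq, handlers):
    tag, seq, value = message            # load tag / get sequence #
    if seq != expected_seq[tag]:         # compare # / branch if not eq
        raise RuntimeError("out-of-order operand: buffer or stall")
    return handlers[tag](value)          # finally, use the value

handlers = {"b": lambda v: v * 2}        # consumer of operand "b"
assert demultiplex(("b", 0, 21), {"b": 0}, handlers) == 42
```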
Raw's 5-tuple <0,1,1,1,0>

- compute, send value (the send piggybacks on the compute instruction)
- use the value

Zero wasted cycles per send: Send Occupancy = 0
One cycle for the message to exit the processor: Send Latency = 1
Messages take one cycle per hop: per-hop latency = 1
One cycle for the message to enter the processor: Receive Latency = 1
No wasted cycles for receive: Receive Occupancy = 0
Superscalar 5-tuple <0,0,0,0,0>

- compute, send value
- use the value

Zero wasted cycles per send: Send Occupancy = 0
Zero cycles for all latencies: Send, Hop, and Receive Latencies = 0
No wasted cycles for receive: Receive Occupancy = 0
Superscalar 5-tuple, late wakeup

Wakeup and select: the wakeup signal will usually have to be sent ahead of time. If it's not, then the 5-tuple could be <0,0,0,1,0>.
5-tuples of several architectures

- Superscalar: <0, 0, 0, 0, 0>
- Message Passing: <3, 2+c, 1, 1, 7> and <3, 3+c, 1, 1, 12>
- Distributed Shared Memory (F/E bits): <1, 14+c, 2, 14, 1>
- Raw: <0, 1, 1, 1, 0>
- ILDP: <0, n, 0, 1, 0> (n = 0, 2)
- Grid: <0, 0, n/8, 0, 0> (n = 0..8)
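Summing each tuple gives a rough end-to-end operand cost per network; a sketch over an assumed 3-hop route, taking c = 0 and ignoring contention:

```python
# 5-tuples from the list above: (SO, SL, NHL, RL, RO), with c = 0.
networks = {
    "superscalar":     (0, 0, 0, 0, 0),
    "message passing": (3, 2, 1, 1, 7),
    "dsm (f/e bits)":  (1, 14, 2, 14, 1),
    "raw":             (0, 1, 1, 1, 0),
}

def total_cost(tup, hops=3):
    # Cycles from value produced to value usable, assuming the terms add.
    so, sl, nhl, rl, ro = tup
    return so + sl + hops * nhl + rl + ro

assert total_cost(networks["superscalar"]) == 0
assert total_cost(networks["raw"]) == 5
assert total_cost(networks["message passing"]) == 16
assert total_cost(networks["dsm (f/e bits)"]) == 36
```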
What can we do to gain insight aboutscalar operand networks?
Looking at existing systems and proposals:
Try to figure out what’s hard about these networks
Find a way to classify the systems
Gain a quantitative understanding
5-tuple Simulation Experiments

Raw's actual scalar operand network: <0,1,1,1,0> Raw

Raw + Magic parameterized scalar operand network:
- Allows us to vary latencies and measure contention
- Each tile has FIFOs connected to every other tile
- <0,1,1,1,0> Magic Network
- <1,14,2,14,0> Magic Network, Shared Memory Costs
- <3,3,1,1,12> Magic Network, Message Passing Costs
- ... and others; vary all 5 parameters
Speedup versus 1 Tile: Raw's Scalar Operand Network, i.e., <0,1,1,1,0>

| Benchmark \ Tiles | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|
| cholesky | 1.62 | 3.23 | 6.00 | 9.19 | 11.90 | 12.93 |
| vpenta | 1.71 | 3.11 | 6.09 | 12.13 | 24.17 | 44.87 |
| mxm | 1.93 | 3.73 | 6.21 | 8.90 | 14.84 | 20.47 |
| fpppp-kernel | 1.51 | 3.34 | 5.72 | 6.14 | 5.99 | 6.54 |
| sha | 1.12 | 1.96 | 1.98 | 2.32 | 2.54 | 2.52 |
| swim | 1.60 | 2.62 | 4.69 | 8.30 | 17.09 | 28.89 |
| jacobi | 1.43 | 2.76 | 4.95 | 9.30 | 15.88 | 22.76 |
| life | 1.81 | 3.37 | 6.44 | 12.05 | 21.08 | 36.10 |
[Plot: Impact of Receive Occupancy, 64 tiles, i.e., <0,1,1,1,n> — speedup vs. Raw (y-axis, 0 to 1.2) as receive occupancy varies from 0 to 16 cycles, for cholesky, vpenta, mxm, fpppp-kernel, sha, swim, jacobi, life]
[Plot: Impact of Receive Latency, 64 tiles, i.e., <0,0,0,n,0> magic — speedup vs. Raw (y-axis, 0 to 1.6) as receive latency varies from 0 to 64 cycles, for the same eight benchmarks]
[Plot: Impact of Contention, i.e., Magic <0,1,1,1,0> / Raw's <0,1,1,1,0> — speedup vs. Raw (y-axis, 0 to 1.4) as the number of tiles varies from 0 to 64, for the same eight benchmarks]
| | Raw | ILDP | Grid | Superscalar |
|---|---|---|---|---|
| Operand transport | Point to point | Broadcast | Point to point | Broadcast |
| Message demultiplex | Compile-time scheduling | F/E bits on distributed register files | Distributed associative instr window | Associative instr window |
| Intranode instr. order | Compile-time ordering | Compile-time ordering | Runtime ordering | Runtime ordering |
| Free intranode bypassing | yes | no | yes | yes |
| Instr distribution | Compiler assignment | Dynamic assignment | Compiler assignment | Dynamic assignment |
| 5-tuple | <0,1,1,1,0> | <0,N,0,1,0> | <0,1,N/8,1,0> | <0,0,0,0,0> |
Open Scalar Operand Network ?'s

- Can we beat <0,1,1,1,0>?
- How close can we get to <0,0,0,0,0>?
- How do we prove that our <0,0,0,0,0> scalar operand network would have a high frequency?
Open Scalar Operand Network ?'s

- What is the impact of run-time vs. compile-time routing on the 5-tuple?
- Can we build scalable dynamic-assignment architectures for load-balancing? Is there a penalty for the 5-tuple?
- What are the benefits of heterogeneous scalar operand networks? For instance, a <0,1,2,1,0> network of 2-way <0,0,0,0,0>'s.
- Can we generalize these networks to support other models of computation, like streams or SMT-style threads?
More Open ?'s

- How do we design low-energy scalar operand networks?
- How do we support speculation on a distributed scalar operand network?
- How do compilers need to change?
Summary

- 5 Challenges:
  - Delay Scalability
  - Bandwidth Scalability
  - Deadlock / Starvation
  - Exceptions
  - Efficient Operation-Operand Matching
- 5-tuple model
- Quantitative results
- Mentioned open questions