Performance, Cost, and Energy Evaluation of Fat H-Tree: A Cost-Efficient Tree-Based On-Chip Network
Hiroki Matsutani (Keio Univ, JAPAN)
Michihiro Koibuchi (NII, JAPAN)
Hideharu Amano (Keio Univ, JAPAN)
Introduction
• Networks-on-Chip (NoCs)
  – Tile architecture
  – On-chip routers
  – Packet switching
• Various NoC topologies
  – Mesh, Torus
  – H-Tree, Fat Trees
• Fat H-Tree (FHT)
• Evaluations of FHT
  – Performance
  – Area
  – Energy
[Figure: a mesh-based on-chip network of nine tiles, numbered 0 to 8 in a 3x3 grid; each tile is a RISC, DSP, RAM, or I/O block]
We proposed FHT as an alternative to Fat Trees
NoCs’ topologies: Mesh & Torus
• 2-D Mesh
• 2-D Torus
  – 2x bandwidth of mesh
RAW [Taylor, IEEE Micro’02]
[Figure legend: Router, Core]
Fat H-Tree is a tree-based topology, but it includes a torus structure
NoCs’ topologies: Fat Trees
• Fat Tree (p, q, c)
  – p: # of upward links
  – q: # of downward links
  – c: # of core ports
[Figure: Fat Tree (2,4,1) and Fat Tree (2,4,2), showing rank-1 and rank-2 routers; legend: Router, Core]
Trees are duplicated in both Fat Trees and Fat H-Tree, but the connection patterns of the trees are different!
Outline
• NoCs’ topologies
  – Mesh, Torus
  – H-Trees, Fat Trees
• Fat H-Tree (FHT)
  – Structure
  – 2-D layout
  – Routing algorithm (DTR)
• Evaluations of FHT
  – Network logic area
  – Energy consumption
  – Throughput
Fat H-Tree: Structure
• Fat H-Tree
  – Red Tree (H-Tree)
  – Black Tree (H-Tree)
[Yamada, EUC’04]
Combining two H-Trees (red & black)
[Figure legend: Router, Core]
The black tree is shifted toward the lower right of the red tree
Because the black tree is shifted, the connection pattern of the trees differs from that of the original Fat Trees
Fat H-Tree is formed on red & black trees
Rank-2 and upper routers are omitted in this figure
Each core is connected to both the red & black trees
A ring is formed with cores & rank-1 routers
Torus-level performance by combining only two H-Trees
Fat H-Tree: 2-D layout on VLSI
• Fat H-Tree
  – Torus structure
  – Folded in the same way as the folded layout of a 2-D Torus
[Figure: Fat H-Tree’s 2-D layout; legend: Router, Core]
Topologically equivalent to the layout with long feedback links across the chip
Fat H-Tree: Routing algorithm
• Paths on a single H-tree
  – Only red tree, or
  – Only black tree
• Paths across trees
  – Transit between trees
  – Minimum paths
[Figure: a path using only the red tree takes 6 hops; a path using only the black tree takes 6 hops; a path that first uses the red tree, transits, and then uses the black tree takes 4 hops in total (minimum)]
Exploiting such paths is key to improving performance
Fat H-Tree: Dual tree routing (DTR)
• Dual tree routing
  – Transit trees for minimum paths
  – Cycles across trees
• Deadlock avoidance
  – VC# is increased when a packet transits from red to black
[Figure: VC#0 is used on the red tree; after the transit, VC#1 is used on the black tree]
Only TWO VCs are sufficient in 64-node FHT
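The VC rule stated above can be sketched in a few lines. This is only an illustration of the rule as worded on the slide (the VC number goes up on a red-to-black transit); the function name `assign_vcs` and the per-hop list of tree colors are hypothetical, not taken from the authors' router.

```python
def assign_vcs(tree_per_hop):
    """Return the VC number used on each hop of a DTR path.

    tree_per_hop lists which tree ("red" or "black") carries each hop.
    Hypothetical helper: it only encodes the slide's rule that the VC
    number is increased when a packet transits from red to black.
    """
    vcs = []
    vc = 0          # packets start on VC#0
    prev = None
    for tree in tree_per_hop:
        if prev == "red" and tree == "black":
            vc += 1  # red -> black transit moves the packet to the next VC
        vcs.append(vc)
        prev = tree
    return vcs


# Red tree first, then black tree, as in the 4-hop example path.
print(assign_vcs(["red", "red", "black", "black"]))  # [0, 0, 1, 1]
```

A path that stays on one tree never leaves VC#0, and a path with a single red-to-black transit needs two VCs.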
Ideal throughput: Channel bisection
$N = 2^n \times 2^n$ cores
         N=16   N=64   N=256   Formula
HT          4      4       4   $4$
FT1         8     16      32   $2^{n+1}$
FT2        16     32      64   $2^{n+2}$
FHT        24     40      72   $2^{n+2} + 8$
Mesh        8     16      32   $2^{n+1}$
Torus      16     32      64   $2^{n+2}$
FT1: Fat Tree(2,4,1) FT2: Fat Tree(2,4,2)
For FHT, $2^{n+2}$ is due to the torus and $8$ is due to the two H-Trees
Bandwidth of FHT is much improved by the torus structure
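As a quick check, the closed forms for channel bisection can be evaluated for n = 2, 3, 4 and compared with the tabulated values. The function name `channel_bisection` is hypothetical; the formulas are the ones read off this slide.

```python
def channel_bisection(n):
    """Channel bisection per topology for N = 2^n x 2^n cores."""
    return {
        "HT": 4,
        "FT1": 2 ** (n + 1),
        "FT2": 2 ** (n + 2),
        "FHT": 2 ** (n + 2) + 8,   # torus part + two H-Trees
        "Mesh": 2 ** (n + 1),
        "Torus": 2 ** (n + 2),
    }


for n in (2, 3, 4):
    print(f"N={4 ** n}:", channel_bisection(n))
```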
Number of routers
$N = 2^n \times 2^n$ cores
         N=16   N=64   N=256   Formula
HT          5     21      85   $(4^n - 1)/3$
FT1         6     28     120   $(4^n - 2^n)/2$
FT2        12     56     240   $4^n - 2^n$
FHT        10     42     170   $2(4^n - 1)/3$
Mesh       16     64     256   $N$
Torus      16     64     256   $N$
FT1: Fat Tree(2,4,1) FT2: Fat Tree(2,4,2)
Router count of FHT is less than Fat Tree(2,4,2)
Note: the number of NIs is not considered.
FHT requires 2-port NIs for red & black
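The router-count formulas can be checked the same way. The sketch below uses a hypothetical helper `router_count`; all divisions are exact for integer n, so integer division is used.

```python
def router_count(n):
    """Number of routers per topology for N = 2^n x 2^n cores (NIs excluded)."""
    N = 4 ** n
    return {
        "HT": (4 ** n - 1) // 3,
        "FT1": (4 ** n - 2 ** n) // 2,
        "FT2": 4 ** n - 2 ** n,
        "FHT": 2 * (4 ** n - 1) // 3,   # two H-Trees
        "Mesh": N,
        "Torus": N,
    }


for n in (2, 3, 4):
    print(f"N={4 ** n}:", router_count(n))
```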
Network logic area (routers & NIs)
• Synthesis of NoC
  – 16-core, 64-core
  – Design Compiler
  – 0.18um CMOS
• Router architecture
  – 1-flit = 32-bit
  – 4-stage pipeline
  – Wormhole, 2VCs
• NI architecture
  – In: 2-flit FIFO
  – Out: 2-flit FIFO
[Figure: wormhole router; input ports with buffers (2 VCs each) feed a crossbar]
FHT’s NI is implemented as a “router” to forward packets between trees
Network logic area: 16/64-core
[Figures: synthesis result (16-core); synthesis result (64-core)]
Network logic area of FHT is smaller than Fat Tree(2,4,2)
FHT’s NI is larger than others
Total wire length of all links
• Total unit-length of links
  – Core-to-router links
  – Router-to-router links
1-unit = distance between neighboring cores
How many unit-links would FHT require?
$N = 2^n \times 2^n$ cores
         N=16   N=64   N=256   Formula
HT         24    112     480   $2^{n+1}(2^n - 1)$
FT1        32    192   1,024   $nN$
FT2        64    384   2,048   $2nN$
FHT        72    392   1,800   $8N - 8(2^{n+1} - 1)$
Mesh       24    112     480   $2(N - 2^n)$
Torus      48    224     960   $4(N - 2^n)$
FT1: Fat Tree(2,4,1) FT2: Fat Tree(2,4,2)
Wire length of FHT is almost the same as Fat Tree(2,4,2)
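The wire-length figures can be reproduced from closed forms in n. In the sketch below the helper name `total_wire_length` is hypothetical, and the HT and FHT expressions are one algebraic form that matches the tabulated values; the slide may have typeset them differently.

```python
def total_wire_length(n):
    """Total link length in unit-links for N = 2^n x 2^n cores."""
    N = 4 ** n
    return {
        "HT": 2 ** (n + 1) * (2 ** n - 1),
        "FT1": n * N,
        "FT2": 2 * n * N,
        "FHT": 8 * N - 8 * (2 ** (n + 1) - 1),
        "Mesh": 2 * (N - 2 ** n),
        "Torus": 4 * (N - 2 ** n),
    }


for n in (2, 3, 4):
    print(f"N={4 ** n}:", total_wire_length(n))
```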
Energy: NoC’s energy model
[Wang, DATE’05]
• Ave. flit energy $E_{flit}$
  – Send 1-flit to dest.
  – How much energy [J]?
• Parameters
  – 12mm square chip
  – 16/64-core
  – 0.18um CMOS
• Switching energy $E_{sw}$
  – 1-bit switching @ router
  – Gate-level sim
  – 1.88 [pJ / hop] for routers
  – 1.27 [pJ / hop] for NI
  – 1.45 [pJ / hop] for NI (FHT)
• Link energy $E_{link}$
  – 1-bit transfer @ link
  – 0.67 [pJ / mm]
$$E_{flit} = w \, H_{ave} \, (E_{sw} + E_{link})$$
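A minimal sketch of how the flit-energy model could be evaluated. It assumes that w is the flit width in bits, that every hop is charged the router switching energy, and that every link has the same length; the 3.0 mm link length and the hop count of 3.20 below are illustrative inputs only, so the printed number is not a result from the evaluation.

```python
def flit_energy_pj(w_bits, h_ave, e_sw_pj, e_link_pj):
    """E_flit = w * H_ave * (E_sw + E_link), all energies in pJ per bit."""
    return w_bits * h_ave * (e_sw_pj + e_link_pj)


W_BITS = 32              # 1-flit = 32-bit
E_SW_ROUTER = 1.88       # pJ per bit per hop, router value from the slide
E_LINK_PER_MM = 0.67     # pJ per bit per mm, from the slide
LINK_MM = 3.0            # assumption: uniform link length
H_AVE = 3.20             # assumption: illustrative average hop count

e_link = E_LINK_PER_MM * LINK_MM   # per-hop link energy under the assumption
print(flit_energy_pj(W_BITS, H_AVE, E_SW_ROUTER, e_link), "pJ per flit")
```

A fuller model would charge the NI switching energy on NI hops and use the actual length of each link on the path.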
Energy consumption: 16/64-core
[Figures: simulation result (16-core); simulation result (64-core)]
Energy consumption of FHT is less than Fat Tree(2,4,2)
Throughput: Simulation environment
• Flit-level simulation
  – Throughput / latency
  – 16/64-core
• Topology (routing)
  – Mesh, Torus (DOR)
  – Fat Trees (up/down)
  – Fat H-Tree (DTR)
• Traffic patterns
  – Uniform
  – NAS Parallel Benchmark: BT.W, SP.W, CG.W, MG.W, IS.W
Packet size   16-flit (1-flit header)
Buffer size   1-flit per channel
Switching     Wormhole
# of VCs      2
Latency       3-cycle per 1-hop
FHT vs. FTs: Uniform (16/64-core)
[Figures: Uniform (16-core); Uniform (64-core); curves: FHT (DTR), Fat Tree(2,4,2), Fat Tree(2,4,1)]
FHT outperforms FT2 in 16-core, but it doesn’t in 64-core
FHT(DTR) causes congestion around the roots of the trees
FHT vs. FTs: BT (16/64-core)
[Figures: BT traffic (16-core); BT traffic (64-core); curves: FHT (DTR), Fat Tree(2,4,2), Fat Tree(2,4,1)]
BT has neighboring communications, which is an advantage for FHT(DTR)
FHT(DTR) doesn’t cause congestion around roots
FHT vs. FTs: MG (16/64-core)
[Figures: MG traffic (16-core); MG traffic (64-core); curves: FHT (DTR), Fat Tree(2,4,2), Fat Tree(2,4,1)]
Performance is FHT(DTR) > FT2 > FT1
Summary: Evaluations of FHT
• Performance
  – FHT outperforms Fat Tree (FT2), except for uniform traffic
• Network logic area
  – FHT requires 20.5%-28.1% smaller area than FT2
• Energy consumption
  – FHT requires 6.7%-7.0% less energy than FT2
• Wire length
  – Wire length of FHT is almost the same as FT2
• Ongoing work
  – Evaluation in 90nm CMOS
  – 3-D layout of FHT for 3-D NoCs
[Figure: three stacked wafers (stacked ICs)]
Thank you for your attention
Feasibility of Fat H-Tree
• Total wire length
  – Slightly longer than Fat Trees
  – But a lot of wire resources are available on-chip
• Wire delay
  – Length of the longest wire is the same as Fat Trees
[Figure: Fat Tree (2,4,1) and Fat H-Tree]
If Fat Trees are feasible, Fat H-Tree can be implemented with smaller area but higher performance
Routings for FHT: Torus routing (TOR)
• Single tree (STR)
  – Select a single tree per packet
  – Can’t transit trees
• Dual tree (DTR)
  – Transit trees for minimal paths
  – VCs are needed
• Torus routing (TOR)
  – Use torus formed with rank-1 routers & cores
  – VCs are needed
[Figure: Fat H-Tree’s torus structure; rank-2 and upper routers can’t be used]
TOR avoids congestion around roots, but takes non-minimal paths
FHT vs. Torus: Uniform (16/64-core)
• FHT (DTR): minimum routing using links around roots
• FHT (TOR): using torus structure (can’t use links around roots)
• 2-D Torus
• 2-D Mesh
[Figures: Uniform (16-core); Uniform (64-core)]
FHT achieves torus-level throughput using only torus structure
Number of VCs in Dual Tree Routing
• # of VCs required is $H_{\max}/4 + 1$
  – $H_{\max}$ is the longest hop count in the network
• E.g.,
  – 16-core FHT requires 2VCs
  – 64-core FHT requires 2VCs
  – …
VC# is increased when a packet transits from red to black
Two VCs are not so costly…
NIs in Fat H-Tree
• Implemented as a “simplified router”
  – Connecting red & black trees
• Routing @ NI is simple
  – Forward packets to the other tree if the destination is not this core
[Figure: NI in Fat H-Tree; a crossbar connects the processing core, a port for the red tree, and a port for the black tree]
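The NI forwarding rule above is small enough to write out. The sketch below is an assumption-laden illustration: the function name `ni_forward`, its arguments, and the string port labels are hypothetical and only encode the slide's rule for a packet arriving from one of the two tree ports.

```python
def ni_forward(dst, my_id, arrived_on):
    """Choose the NI output port for a packet arriving from a tree port.

    arrived_on is "red" or "black"; the result is "core", "red", or "black".
    Hypothetical helper encoding: deliver if the packet is for this core,
    otherwise forward it to the other tree.
    """
    if dst == my_id:
        return "core"  # packet has reached its destination core
    # Not addressed to this core: hand the packet over to the other tree.
    return "black" if arrived_on == "red" else "red"


print(ni_forward(dst=5, my_id=5, arrived_on="red"))  # core
print(ni_forward(dst=9, my_id=5, arrived_on="red"))  # black
```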
Average hop count
• Fat H-Tree
  – Minimum routing (DTR)
$$H_{ave} = \frac{1}{N^2 - N} \sum_{x, y \in N} H(x, y)$$
         routing   N=16   N=64   N=256
FT       up/down   3.60   5.43    7.36
FHT      DTR       3.20   4.84    6.78
Mesh     DOR       2.67   5.33   10.67
Torus    DOR       2.13   4.06    8.03
FT: Fat Trees
FHT offers shorter average hop count than Fat Trees
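The definition of the average hop count can be checked by brute force for the mesh and torus rows, where H(x, y) is the minimal router-to-router distance under dimension-order routing. The helper `average_hops` is hypothetical; the fat-tree and FHT rows would need their own topology models and are not reproduced here.

```python
from itertools import product


def average_hops(k, torus=False):
    """Average hop count of a k x k mesh (or torus).

    Sums H(x, y) over all ordered node pairs and divides by N^2 - N,
    so pairs with x == y contribute zero hops and are not counted.
    """
    def dist_1d(a, b):
        d = abs(a - b)
        return min(d, k - d) if torus else d  # wrap-around links on a torus

    nodes = list(product(range(k), repeat=2))
    total = sum(
        dist_1d(x[0], y[0]) + dist_1d(x[1], y[1])
        for x in nodes
        for y in nodes
    )
    n = len(nodes)
    return total / (n * n - n)


for k in (4, 8, 16):
    mesh = round(average_hops(k), 2)
    torus = round(average_hops(k, torus=True), 2)
    print(f"N={k * k}: mesh={mesh}, torus={torus}")
```

The rounded values agree with the Mesh and Torus rows of the table for N = 16, 64, and 256.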
Wire length of links
• Case studies
  – 16-core (1-unit = 3.0mm)
  – 64-core (1-unit = 1.5mm)
Flit-width = 32-bit @ 12mm square chip
Utilization rate of wire resources in 2 metal layers (%)
         N=16    N=64
HT       1.6%    3.7%
FT1      2.1%    6.4%
FT2      4.3%   12.8%
FHT      4.8%   13.1%
Mesh     1.6%    3.7%
Torus    3.2%    7.5%
Wire length of FHT is almost the same as Fat Tree(2,4,2)
Routings for FHT: Single tree (STR)
• Single tree (STR)
  – Select a single tree per packet
  – Can’t transit trees
• Dual tree (DTR)
  – Transit trees for minimal paths
  – VCs are needed
• Torus routing (TOR)
  – Use torus formed with rank-1 routers & cores
  – VCs are needed
Case 1: red tree 6-hop
Case 2: black tree 4-hop
Routings for FHT: Dual tree (DTR)
• Single tree (STR)
  – Select a single tree per packet
  – Can’t transit trees
• Dual tree (DTR)
  – Transit trees for minimal paths
  – VCs are needed
• Torus routing (TOR)
  – Use torus formed with rank-1 routers & cores
  – VCs are needed
[Figure: the red tree is used first, then the black tree]
VC# is increased when a packet transits from red to black
Fat H-Tree: Structure
• Fat H-Tree
  – Red Tree (H-Tree)
  – Black Tree (H-Tree)
[Yamada, EUC’04]
Combining two H-Trees (red & black)
[Figure legend: Router, Core]
Both edges are connected (folded)
Because the black tree is shifted and folded, the connection pattern of the trees differs from that of the original Fat Trees