Post on 01-Feb-2018
Architectures for Networks‐on‐Chips with Emerging Interconnect Technologies
1
CMPE 750 –Project Presentation
Sagar Saxena
Content…
2
System on Chip: idea & overview
Need for multi-core chips
The NoC paradigm: traditional use & limitations
2D conventional NoC architectures
Wiring complexity
Emerging interconnect technologies
3D NoCs; Photonic NoCs
Corona architecture; Firefly architecture
On-chip RF/wireless interconnects; on-chip antennas used
Small-world networks: the Watts-Strogatz model
3
System on Chip: Technology Scaling
Transistor count: exponential growth, over 100 million transistors in today's SoCs
Prohibitive manufacturing costs: $10M-$100M for 0.13 µm designs
Reusability + Platform Based Design
4
System on Board vs. System on Chip
Conceptually, System on Chip refers to integrating the components of a board onto a single chip. This looks straightforward, but productivity levels are too low to make it a reality.
5
Examples of IP Blocks in Use Today
• RISC: ARM, MIPS, PowerPC, SPARC
• CISC: 680x0, x86
• Interfaces: USB, PCI, UART, Rambus
• Encryption: DES, AES
• Multimedia: JPEG coder, MPEG decoder
• Networking: ATM switch, Ethernet
• Microcontrollers: HC11, etc.
• DSP: Oak, TI, etc.
SoC is forcing companies to develop high-quality IP blocks to stay in business.
Overview of Design and Technology Trends
1.5GHz Itanium chip (Intel), 410M tx, 374mm2 , 130W@1.3V
1.1 GHz POWER4 (IBM), 170M tx, 115W@1.5V
if these trends continue, power will become unmanageable
150 MHz Sony Graphics Processor, 7.5M tx (logic) + 280M tx (memory) = 288M tx, 400mm2, 10W@1.8V
if trend continues, most designs in the future will have a high percentage of memory
Contd…
Interconnects are the biggest bottleneck
We need to look beyond the metal/dielectric‐based planar architectures
Optical, 3D integration and Wireless are the emerging alternatives
Single‐chip Bluetooth transceiver (Alcatel), 400mm2, 150mW@2.5V
required 30 designers over 2.5 years (75 person‐years)
if trend continues, it will be difficult to integrate larger systems on a single chip in a
reasonable time
Moore’s Law
• The number of transistors in a chip doubles approximately every two years
• This provides the computational power demanded so far
• Challenge: scaling up frequency causes elevated power dissipation
[Figure: original Moore's Law graph, 1965]
Source: Intel
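The doubling rule can be turned into a quick back-of-the-envelope projection. A hedged sketch; the starting transistor count and time horizon below are illustrative, not from the slides:

```python
def projected_transistors(start_count, years, doubling_period=2.0):
    """Transistor count projected forward under Moore's Law:
    the count doubles every `doubling_period` years."""
    return start_count * 2 ** (years / doubling_period)

# Illustrative: ~2,300 transistors (an Intel 4004-class chip, 1971)
# projected 30 years forward, i.e. 15 doublings
count = projected_transistors(2_300, 30)
```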
Power Dissipation
[Figure: power density (W/cm²) of Intel processors from the 4004 and 8008 through the 8080/8085, 8086, 286, 386, 486, and Pentium, rising past hot-plate levels toward nuclear reactor, rocket nozzle, and Sun's-surface levels]
Source: Intel
• Scaling up speed/frequency further is no longer feasible
Multi‐Core Systems Era
Challenge :
Keeping up with the demand for computational power when clock frequency cannot be scaled.
Solution:
Increase number of cores ‐ parallelism
• Intel, AMD dual‐core and quad‐core CPUs
• Custom Systems‐on‐Chip (SoCs)
• Number of cores will increase manifold in the next 5‐10 years
New challenge: interconnection of the cores!
[Figure: Intel Single-chip Cloud Computer processor]
NoC: The Network-on-Chip Paradigm
Packet-switched on-chip network
"Route packets, not wires" – Bill Dally, 2000.
Dedicated infrastructure for data transport
Decoupling of functionality from communication
NoC infrastructure
Multiple publications in IEEE ISSCC, 2010 from Intel, IBM, AMD, Renesas Tech. and Sun Microsystems show that multi-core NoC is a reality
Limitations of a Traditional NoC
• Multi-hop wireline communication
• High latency and energy dissipation
[Figure: multi-hop path from source to destination; legend: core, NoC interface, NoC switch]
80% of chip power will be from on-chip interconnects in the next 5 years –ITRS, 2007
Novel Interconnect Paradigms for Multicore Designs
• Wireless/RF interconnects
• Optical interconnects
• Three-dimensional integration
Goal: high bandwidth and low energy dissipation
14
The network-on-chip paradigm
Driven by
Increased levels of integration
Complexity of large SoCs
– New designs containing hundreds of IP blocks
Need for platform‐based design methodologies
DSM constraints (power, delay, time‐to‐market, etc…)
15
Some Common Architectures: (a) Mesh, (b) Folded Torus (FT), and (c) Butterfly Fat Tree (BFT)
[Figure: the three topologies; legend: functional IP, switch]
16
Data Transmission
• Packet-based communication: low memory requirement
• Packet switching
• Wormhole routing: packets are broken down into flow-control units (flits), which are then routed in a pipelined fashion
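The flit decomposition can be illustrated with a small sketch; the flit size and field names here are assumptions, not from the slides:

```python
def packetize(payload, flit_size=4):
    """Break a packet into flits for wormhole routing.

    The head flit carries the routing information and reserves the path;
    body flits carry payload; the tail flit releases the reserved channels.
    """
    chunks = [payload[i:i + flit_size] for i in range(0, len(payload), flit_size)]
    flits = []
    for i, chunk in enumerate(chunks):
        kind = "head" if i == 0 else ("tail" if i == len(chunks) - 1 else "body")
        flits.append({"type": kind, "data": chunk})
    return flits

# A 10-byte packet with 4-byte flits yields head, body, and tail flits
flits = packetize(b"0123456789", flit_size=4)
```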
17
Connecting Different IP Blocks Using Tree Architecture
18
2D Network on Chips
• Current SoCs contain several different microprocessor components, memory sub-systems, and I/Os together on a single chip.
• Fast-developing SoC market
• Development of multi-processor SoC platforms for embedded processors
• An MP-SoC contains numerous IPs that perform computation at different clock frequencies and communicate with each other.
19
Challenges & Need
• Integration of all the blocks on a single chip.
• Global wires are non-scalable: delay grows rapidly (roughly quadratically for unrepeated wires) with length.
• In ultra-deep sub-micron processes, 80% of delay is due to interconnects.
• FIFOs are used for data synchronization, but achieving cross-chip signaling in a single clock cycle becomes costly due to process variability and power dissipation.
• Hence the need for an on-chip interconnect architecture; the most frequently used is the shared-medium arbitrated bus.
• Networks-on-chip instead offer parallelism, modularity, and lower power consumption.
• Goals: high throughput, low latency, and low energy consumption with little area overhead.
20
SPIN (Scalable, Programmable Interconnect Network)
• Guerrier and Greiner proposed SPIN, a fat-tree topology.
• The parent is replicated four times at any level of the tree.
• The picture shows N = 16 nodes, so 16 IP blocks can be connected.
• The network grows as (N log N)/8.
• Switches are placed at the vertices; the number of switches is 3N/4.
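The stated counts can be checked directly; a minimal sketch using the formulas on this slide:

```python
import math

def spin_switches(n):
    """Number of switches in a SPIN network with n IP blocks: 3N/4."""
    return 3 * n // 4

def spin_growth(n):
    """Network growth rate stated on the slide: (N log2 N) / 8."""
    return n * math.log2(n) / 8

# For the pictured N = 16 network: 12 switches, growth value 8.0
switches = spin_switches(16)
growth = spin_growth(16)
```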
21
CLICHÉ (Chip-Level Integration of Communicating Heterogeneous Elements)
• m x n mesh of switches interconnecting IPs
• Every switch connects to its four neighboring switches.
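The mesh connectivity (with fewer neighbors at edges and corners) can be sketched in a few lines; the coordinate convention is an illustrative assumption:

```python
def mesh_neighbors(x, y, m, n):
    """Neighboring switches of switch (x, y) in an m x n CLICHÉ mesh.

    Interior switches have four neighbors; edge and corner switches
    have three and two, respectively.
    """
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(a, b) for a, b in candidates if 0 <= a < m and 0 <= b < n]

interior = mesh_neighbors(1, 1, 4, 4)  # four neighbors
corner = mesh_neighbors(0, 0, 4, 4)    # two neighbors
```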
2D Torus
Dally and Towles proposed 2D Torus
22
Folded Torus
23
• An enhanced version of the 2D Torus.
• Well suited to VLSI implementation.
• The length of each wire segment remains the same.
24
OCTAGON
• Karim et al. proposed the Octagon for MP-SoCs.
• The picture shows 8 nodes and 12 bidirectional links.
• Communication between any pair of nodes takes at most 2 hops within the basic octagon unit.
• For larger systems, the octagon unit is extended to multidimensional space.
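A quick breadth-first search confirms the 2-hop property of the basic unit. A sketch; the topology assumed here is the standard octagon unit (a ring of 8 nodes plus four cross links, matching the 12 links listed on the wiring slide):

```python
from collections import deque

# Ring links 0-1, 1-2, ..., 7-0 plus cross links 0-4, 1-5, 2-6, 3-7
EDGES = [(i, (i + 1) % 8) for i in range(8)] + [(0, 4), (1, 5), (2, 6), (3, 7)]

def hops(src, dst):
    """Shortest-path hop count between two nodes via breadth-first search."""
    adj = {n: set() for n in range(8)}
    for a, b in EDGES:
        adj[a].add(b)
        adj[b].add(a)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist[dst]

# Diameter of the octagon unit: every pair is at most 2 hops apart
diameter = max(hops(s, d) for s in range(8) for d in range(8))
```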
25
Butterfly Fat-Tree (BFT)
• IPs are placed at the leaves and switches at the vertices.
• Nodes are denoted by (l, p), where l denotes the level and p the position.
• A switch is denoted by S(l, p).
• Each switch has four down links (child ports) and two up links (parent ports); the number of switches at level j is N/2^(j+1), so at j = 1 it is N/4.
• Moving up the tree, the number of switches required at each level halves.
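The switch-count formula can be sanity-checked directly; a sketch based on the N/2^(j+1) expression above:

```python
def bft_switches(n, level):
    """Switches at a given level of a Butterfly Fat-Tree with n IP blocks:
    N / 2^(level+1); the count halves at each level up."""
    return n // 2 ** (level + 1)

# For a 64-IP BFT: 16 switches at level 1 (N/4), 8 at level 2, 4 at level 3
counts = [bft_switches(64, j) for j in (1, 2, 3)]
```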
26
Performance Metrics
• An architecture should exhibit high throughput, low latency, energy efficiency, and low area overhead.
• Message throughput → the rate at which messages are delivered, commonly normalized to flits per IP block per cycle.
• Transport latency → time from the occurrence of the message header at the source to the occurrence of the tail at the destination.
• Energy → energy consumed by the interconnects and switches.
• Area requirements → the area consumed by the switches and the interconnects; some interconnects need to be buffered, which further increases area.
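Throughput under the flits-per-core-per-cycle normalization (the unit used in the results slides) can be computed as follows. A sketch; the parameter names and example numbers are illustrative:

```python
def throughput(messages_completed, flits_per_message, num_ip_blocks, cycles):
    """Throughput in flits delivered per IP block per cycle:
    (messages completed x message length) / (IP blocks x total cycles)."""
    return messages_completed * flits_per_message / (num_ip_blocks * cycles)

# Illustrative: 16 cores delivering 20,000 eight-flit messages over 20,000 cycles
tp = throughput(20_000, 8, 16, 20_000)
```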
27
Evaluation Methodology
• Developed a flit-level, event-driven wormhole-routing simulator to study the characteristics of the interconnect architectures.
• Traffic injection follows two distributions: Poisson and self-similar.
• Each simulation runs for 1,000 cycles to let transient effects stabilize, then executes for 20,000 cycles.
• Throughput is calculated as the number of flits reaching each destination per unit time.
• Average latency and energy are calculated as shown on the previous slide.
• For area, the complete VHDL model was synthesized in a 0.13 µm technology.
• To determine interconnect energy, the wire capacitance was calculated for each topology.
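The Poisson injection process can be approximated per cycle as a Bernoulli trial. A simplified sketch; the actual simulator's injection mechanism is not shown in the slides:

```python
import random

def poisson_injection(rate, cycles, seed=0):
    """Per-cycle Bernoulli approximation of Poisson flit injection:
    in each cycle a source injects a flit with probability `rate`."""
    rng = random.Random(seed)
    return [1 if rng.random() < rate else 0 for _ in range(cycles)]

# Over a long run the injected load converges to the offered rate
trace = poisson_injection(rate=0.1, cycles=20_000)
load = sum(trace) / len(trace)
```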
28
No. of Virtual Channels & Throughput
[Figures: variation of throughput under spatially uniform traffic; variation of latency with the number of virtual channels; energy-dissipation profile for uniform traffic]
30
Chip Area Based on Topology
• Wire lengths between switches in the BFT and SPIN architectures vary with the switch level.
• To keep the inter-switch delay within one clock cycle, some wires need to be buffered.
• In CLICHÉ and the Folded Torus, all inter-switch wires are of equal length, and the delay is within one clock cycle.
• In the Octagon, the inter-switch wires connecting functional IP blocks in disjoint octagon units need to be buffered.
31
Area Overhead
• Silicon area overhead for all platforms under consideration across different technology nodes, assuming a 20 mm x 20 mm die and functional IP blocks of 100k gates (2-input NAND equivalent).
• SPIN and Octagon have higher silicon area overhead due to their higher degree of connectivity.
• The percentage of silicon area overhead increases slightly with technology scaling, but the relative area across platforms remains unchanged.
32
Wiring Complexity
• Wire lengths between switches in the BFT and SPIN depend on the level of the switches; the inter-switch wire length follows a level-dependent formula.
• In CLICHÉ and the Folded Torus, all wire segments are of the same length; in the Folded Torus the wire lengths are double those of CLICHÉ.
• In the Octagon, links are classified as long links (0-4 and 7-3), medium links (0-7, 3-4, 1-5, 2-6), and short links (0-1, 1-2, 2-3, 4-5, 5-6, 6-7).
33
Wiring Octagon
34
Wiring SPIN
35
Wiring BFT
36
Conclusion
• NoC architectures were examined with respect to topology: SPIN, Folded Torus, CLICHÉ, Octagon, and BFT were compared on power consumption, latency, area, wiring complexity, throughput, bandwidth, etc.
• Results show that some topologies achieve high data rates at the expense of high energy, while others provide lower data rates with low energy consumption.
• Locality is also an important aspect of NoC design, as exploiting it can reduce power consumption and increase throughput significantly.
• These are the key aspects for characterizing emerging network-on-chip architectures.
37
SO FAR…
NoC paradigm
Traditional use & limitations
2D conventional NoC architectures
Wiring complexity
NOW… emerging interconnect technologies
3D NoCs; Photonic NoCs
Corona architecture; Firefly architecture
On-chip RF/wireless interconnects; on-chip antennas used
38
Why Go 3D?
• Reduction in total power consumption
• Smaller footprint
• Shorter interconnects
• Circuit security, and many more…
Figure 1. Stacked IC [2]
Three-Dimensional Integrated Circuits
• Developing very fast
• Active devices stacked in many layers
• Reasons
– Limited floor planning choices
– Desire to integrate disparate technologies (GaAs, SOI, SiGe, BiCMOS) and disparate signals (analog, digital, RF)
– Interconnect bottleneck
39
[Figure: 2D IC vs. 3D IC cross-sections; as small as 20 µm]
40
3D NoC
• Pavlidis et al., “3-D topologies for Networks-on-Chip”, IEEE Transactions on Very Large Scale Integration (TVLSI), 2007.
• Stacking multiple active layers
• Manufacturability
– Mismatch between various layers
– Yield is an issue
• Temperature concerns
– Despite power advantages, reduced
footprint increases power density
41
Architectures
42
Photonic NoC
• High-bandwidth photonic links for high-payload transfers
• Challenges (ongoing research): on-chip integration of photonic components
• A. Shacham et al., “Photonic Network-on-Chip for Future Generations of Chip Multi-Processors”, IEEE Transactions on Computers, 2008.
43
Corona Architecture
• HP Labs
• Serpentine Waveguide
• 3D layer for photonic
devices
44
Firefly Architecture
*Yan Pan, Prabhat Kumar, John Kim, Gokhan Memik, Yu Zhang, and Alok Choudhary. 2009. Firefly: illuminating future network-on-chip with nanophotonics.
In Proceedings of the 36th annual international symposium on Computer architecture (ISCA '09). ACM, New York, NY, USA, 429-440.
45
On-Chip RF/Wireless Interconnects
• Replace long-distance wires
• Use waveguides out of the package, or IC structures like parallel metal wires
• Chang et al. demonstrated a transmission-line-based RF interconnect for on-chip communication (not truly wireless)
46
State of the Art of NoC: On-Chip Antennas
Several types:
• Metal
• Carbon nanotube (CNT): high bandwidth at light frequencies (IR, visible, UV)
• Small wavelength means small antennas and less area
CNTs as optical antennas:
• Directional radiation characteristics, agreeing with radio-antenna theory
• Laser excitation
47
• Kempa, et al., "Carbon Nanotubes as Optical Antennae," Advanced Materials, 2007. • Slepyan, et al., "Theory of optical scattering by achiral carbon nanotubes and their potential as optical nanoantennas," Physical Review B.
48
Reliability of CNT Antennas
• Various uncertainties in CNT growth: length, diameter, electronic properties (semiconducting vs. metallic)
• Rates of uncertainty tend to be high, much higher than in CMOS processes
• Can lead to poor performance and failure of wireless links
Design Constraints
• Limited wireless bandwidth
• Off-chip laser sources
• Design of metal antennas/transceivers
• On-chip wireless nodes have associated overhead: transceiver, modulator/demodulator, antennas
49
How to efficiently distribute the wireless resources?
50
Hybrid Nature of the WiNoC
• Augment wireline with wireless
• Not completely wireless
• Wireless links can be used only for long-distance communication (> 10 mm)
51
Architectural Solutions in Nature
• Natural complex networks:
– Brain: efficient function and memory retention
– Microbes: biological networks
– Social networks: Paul Erdős
52
Topology
• Small-world graphs: the Watts-Strogatz model
• Often found in nature
• Scales well: low average distance
• Few high-speed shortcuts: wireless
• Local, shorter links: wireline
[Figure: regular lattice (high L, high C), small-world (low L, high C), random graph (low L, low C), where L is the average path length and C the clustering coefficient]
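The Watts-Strogatz construction can be written directly. A minimal sketch; the parameter values n, k, and beta below are illustrative:

```python
import random

def watts_strogatz(n, k, beta, seed=0):
    """Build a Watts-Strogatz small-world graph: a ring lattice where each
    node links to its k nearest neighbors, with each link rewired to a
    random endpoint with probability beta."""
    rng = random.Random(seed)
    edges = []
    for i in range(n):
        for j in range(1, k // 2 + 1):
            edges.append((i, (i + j) % n))
    present = set(frozenset(e) for e in edges)
    result = []
    for a, b in edges:
        if rng.random() < beta:
            c = rng.randrange(n)
            # avoid self-loops and duplicate links when rewiring
            while c == a or frozenset((a, c)) in present:
                c = rng.randrange(n)
            present.discard(frozenset((a, b)))
            present.add(frozenset((a, c)))
            result.append((a, c))
        else:
            result.append((a, b))
    return result

# A few rewired shortcuts drastically cut the average hop count
g = watts_strogatz(n=64, k=4, beta=0.1)
```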
9 May 2016 53
Small-World and Scale-Free Network Models
• Stochastic links
• Power-tail probability distribution:
– Scale-free (SF): node degree
– Small-world (SW): link length
[Figure: probability P versus link length l or node degree d]
54
Fault-Tolerant Wireless NoC
• Small-world topology: inherently fault-tolerant
• Link insertion follows a probability distribution over node pairs:
P(i, j) = (l_ij · f_ij) / Σ_{i,j} (l_ij · f_ij)
where l_ij is the length of the link between switches i and j, and f_ij the communication frequency over it.
• Few high-speed shortcuts: wireless interconnects
• Local, shorter links: wireline
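The insertion rule can be sketched as follows, assuming P(i, j) is proportional to the product of link length l_ij and communication frequency f_ij; the pair sets and values are illustrative:

```python
def link_probabilities(lengths, frequencies):
    """Probability of inserting a shortcut between each candidate pair (i, j):
    P(i, j) = l_ij * f_ij / sum over all pairs of l_kl * f_kl."""
    weights = {pair: lengths[pair] * frequencies[pair] for pair in lengths}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

# Illustrative candidate pairs: long, frequently used pairs are favored
lengths = {(0, 5): 5.0, (1, 2): 1.0, (0, 7): 7.0}
frequencies = {(0, 5): 0.2, (1, 2): 0.5, (0, 7): 0.3}
p = link_probabilities(lengths, frequencies)
```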
55
Experimental Setup
• Topology: small-world, with wireline and wireless links
• System size: 64 & 256 cores, representative of current industry designs
• Wireless links: 24
B.G. Lee et al., "Ultrahigh-Bandwidth Silicon Photonic Nanowire Waveguides for On-Chip Networks," IEEE Photonics Technology Letters, 2008.
56
Experimental Results
• Throughput, measured in flits/core/cycle
[Figure: throughput for 64-core and 256-core systems; wireless link failures have minimal effect on performance]
A. Ganguly et al., "Complex Network Inspired Fault-tolerant NoC Architectures with Wireless Links," Proc. of IEEE/ACM International Symposium on Networks-on-Chip (NOCS), May 2011.
57
Comparison of Saturation Throughput
• Compared against a hierarchical wireless NoC (WiNoC) using the same CNT wireless links
[Figure: saturation throughput for 64-core and 256-core systems]
Better fault-tolerance than the hierarchical NoC
58
Packet Energy Dissipation
• Significantly less than in wireline NoCs
• The increase with wireless link failures is insignificant
Wiring & Area Overheads
59
[Figures: wiring requirements and area overheads for 64-core systems]
Insignificant real-estate overheads
60
Conclusion
• On-chip wireless communication with CNT antennas
• Efficient NoC architectures can be built
• But emerging interconnect technologies are defect-prone
• Inherent resilience of natural complex networks
• Wireless NoC topologies inspired by the small-world architecture
• Better power-performance trade-off
• Fault tolerance
• Security
61
REFERENCE PAPERS
• Networks-on-Chip in a Three-Dimensional Environment: A Performance Evaluation.
• Scalable Hybrid Wireless Network-on-Chip Architectures for Multi-Core Systems.
• Co-design of 3D Wireless Network-on-Chip Architectures with Microchannel-Based Cooling.
• Performance Evaluation and Design Trade-Offs for Network-on-Chip Interconnect Architectures.
• Corona: System Implications of Emerging Nanophotonic Technology.
• Firefly: Illuminating Future Network-on-Chip with Nanophotonics.
• "It's a Small World After All": NoC Performance Optimization via Long-Range Link Insertion.
62