GPUs and Future of Parallel Computing Authors: Stephen W. Keckler et al. in NVIDIA IEEE Micro, 2011...

GPUs and Future of Parallel Computing

Authors: Stephen W. Keckler et al., NVIDIA

IEEE Micro, 2011

Taewoo Lee, 2013.05.24


Three Challenges for Parallel-Computing Chips

• Limited power budget
• Bandwidth gap between computation and memory
• Parallel programmability

http://voice.korea.ac.kr

Computers have been constrained by power and energy rather than area

• Power budget is limited: about 150 W for desktops or 3 W for mobile devices (∵ leakage, cooling)

• Transistor count per chip has continuously increased (Moore's law)
– Total power consumption has also increased


Computers have been constrained by power and energy rather than area

• E.g. Supercomputer
– Power budget = 20 MW
– Target compute capability = 10^18 Flops/sec = 1 exaFlops/sec
– Energy/Flop = 20 MW ÷ 10^18 Flops/sec = 20×10^-12 J/Flop = 20 pJ/Flop

• However, modern CPUs (Intel's Westmere)
– 1700 pJ/Flop (double-precision, ∵ 130 W / 77 GFlops)

• GPU (Fermi architecture)
– 225 pJ/Flop (single-precision, ∵ 130 W / 665 GFlops)

• To reach 20 pJ/Flop, energy per Flop must improve by ×1/85 (CPU) and ×1/11 (GPU)
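The arithmetic behind the 20 pJ/Flop target and the improvement factors can be checked with a short Python sketch. It uses only the figures quoted above; the helper name `pj_per_flop` is an illustrative assumption, not something from the paper.

```python
# Energy-per-Flop arithmetic using only figures quoted on the slide.
PJ_PER_J = 1e12  # picojoules per joule


def pj_per_flop(power_watts: float, flops_per_sec: float) -> float:
    """Energy per floating-point operation, in picojoules (hypothetical helper)."""
    return power_watts / flops_per_sec * PJ_PER_J


# Exascale target: 20 MW budget for 10^18 Flops/sec.
target_pj = pj_per_flop(20e6, 1e18)

# Westmere CPU: 130 W at 77 GFlops (double precision), as quoted.
westmere_pj = pj_per_flop(130, 77e9)

# Required improvement, using the slide's rounded per-Flop energies.
cpu_gap = 1700 / target_pj
gpu_gap = 225 / target_pj

print(f"target:   {target_pj:.1f} pJ/Flop")
print(f"Westmere: {westmere_pj:.0f} pJ/Flop")
print(f"CPU must improve by about {cpu_gap:.0f}x, GPU by about {gpu_gap:.0f}x")
```

The Westmere result lands just under the slide's rounded 1700 pJ/Flop.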


Energy-efficiency will require reducing both instruction execution and data movement overheads


• Instruction overheads
– Modern CPUs were optimized for single-thread performance
  • E.g. branch prediction, out-of-order execution, and large primary instruction and data caches
  • So, energy is consumed in the overheads of data supply, instruction supply, and control
– To get higher throughput, future architectures must spend their energy on more useful work (i.e. computation)



• Today, a double-precision fused multiply-add (DFMA) consumes around 50 pJ

• Data movement energy is also large
– E.g. energy to read three 64-bit source operands and write one destination operand
  • to SRAM = 56 pJ (≈ DFMA; ∵ 14 pJ × 4 = 56 pJ)
  • to memory 10 mm farther away = 56 × 6 pJ
  • to external DRAM = 56 × 200 pJ

• Because communication dominates energy, both within the chip and across the external memory interface, energy-efficient architectures must decrease the amount of data movement by exploiting locality
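To see how quickly operand movement overtakes the arithmetic itself, the figures above can be tabulated in a few lines of Python. This is only a sketch of the slide's own numbers; the variable names are illustrative assumptions.

```python
# Operand-movement energy versus one DFMA, using the slide's figures.
DFMA_PJ = 50         # one double-precision fused multiply-add
SRAM_ACCESS_PJ = 14  # one 64-bit operand read or written in local SRAM
OPERANDS = 4         # three source reads + one destination write

sram_pj = SRAM_ACCESS_PJ * OPERANDS  # operands kept in local SRAM
wire_10mm_pj = sram_pj * 6           # operands moved an extra 10 mm on chip
dram_pj = sram_pj * 200              # operands fetched from external DRAM

for name, pj in [("SRAM", sram_pj), ("10 mm away", wire_10mm_pj), ("DRAM", dram_pj)]:
    print(f"{name:>10}: {pj:>6} pJ = {pj / DFMA_PJ:.1f}x one DFMA")
```

With these numbers, supplying operands from external DRAM costs more than two hundred times the energy of the DFMA that consumes them.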



• With the scaling projection to 10 nm, the ratios between DFMA, on-chip SRAM, and off-chip DRAM access energy stay relatively constant

• However, the relative energy cost of 10 mm global wires goes up to 23 times the DFMA energy (∵ wire capacitance (C) remains constant)
– Feature size ↓ → relative energy consumption of wires ↑

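[Figure: energy ratios today vs. at 10 nm. DFMA : 64-bit SRAM access stays at about 3.6 : 1, while DFMA : 10 mm global wire grows from 1 : 6.2 to 1 : 23.]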

Bandwidth gap between computation and memory is severe. Also, power consumption by data movement is significant

The bandwidth gap between computation and memory keeps growing
→ Narrowing this gap is very important


Despite the relatively narrow memory bandwidth, chip-to-chip power consumption is too large
(∵ DRAM max. BW 175 GB/sec × 8 bits/byte × 20 pJ/bit = 28 W; 28 W + 21 W for signaling = 49 W, which accounts for about 20% of total GPU TDP (thermal design power))
→ Again, reducing data movement is necessary
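The 49 W figure follows from the quoted bandwidth and per-bit energy once bytes are converted to bits. A minimal Python check, assuming 8 bits per byte and using illustrative variable names:

```python
# DRAM interface power from the slide's bandwidth and per-bit energy.
BANDWIDTH_BYTES_PER_S = 175e9  # DRAM max. bandwidth, 175 GB/sec
ENERGY_PER_BIT_J = 20e-12      # 20 pJ per bit transferred
SIGNALING_W = 21               # signaling power quoted on the slide

dram_access_w = BANDWIDTH_BYTES_PER_S * 8 * ENERGY_PER_BIT_J
total_w = dram_access_w + SIGNALING_W

# If 49 W is about 20% of TDP, the implied TDP is roughly 245 W.
# This is derived here for illustration; the slide does not state the TDP.
implied_tdp_w = total_w / 0.20

print(f"DRAM access: {dram_access_w:.0f} W, total: {total_w:.0f} W")
print(f"implied TDP: {implied_tdp_w:.0f} W")
```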

To cope with the bandwidth gap problem,

• Architects are trying
– Multichip modules (MCMs)
– DRAMs on-chip (to reduce latency)
– CPU + GPU on-chip (to reduce transfer overheads)
  • but sharing bandwidth between the CPU and GPU can worsen bandwidth utilization
– 3D chip stacking
– Deeper memory hierarchy

• Bandwidth utilization
– Coalescing
– Prefetch
– Data compression (more data per transaction)
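Coalescing improves bandwidth utilization because neighboring threads' accesses can be served by a single wide memory transaction. The Python sketch below counts how many aligned segments one group of threads touches. The 32-thread group, 4-byte elements, and 128-byte segment size are assumptions chosen for illustration and do not come from the slides.

```python
# Count memory transactions for coalesced vs. strided access patterns.
# All sizes below are illustrative assumptions, not figures from the slides.
THREADS = 32         # threads issuing one load each
ELEM_BYTES = 4       # size of each element loaded
SEGMENT_BYTES = 128  # width of one aligned memory transaction


def transactions(addresses, segment_bytes=SEGMENT_BYTES):
    """Number of distinct aligned segments touched by the given byte addresses."""
    return len({addr // segment_bytes for addr in addresses})


# Thread t reads element t: all addresses fall in one segment.
coalesced = [t * ELEM_BYTES for t in range(THREADS)]

# Thread t reads element 32*t: every address falls in a different segment.
strided = [t * 32 * ELEM_BYTES for t in range(THREADS)]

print("coalesced:", transactions(coalesced), "transaction(s)")
print("strided:  ", transactions(strided), "transaction(s)")
```

Under these assumptions the strided pattern moves 32 segments to deliver the same 128 bytes of useful data, which is the waste that coalescing avoids.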

http://www.extremetech.com/computing/95319-ibm-and-3m-to-stack-100-silicon-chips-together-using-glue


For Parallel Programmability, Programmers must be able to

• Represent data access patterns and data placement (∵ the memory model is no longer flat; coalesced access)

• Deal with thousands of threads
• Choose what kind of processing cores their tasks run on (∵ heterogeneity will increase)

• Also, coherence and consistency should be relaxed to facilitate greater memory-level parallelism
– ∵ the cost of coherence protocols is too high
– Sol) Give programmers selective coherence


To cope with these challenges




Echelon: A Research GPU Architecture

• Goals
– Double precision 16 TFlops/sec
– Memory bandwidth = 1.6 TB/sec
– Power budget ≤ 150 W
– 20 pJ/Flop


Echelon Block Diagram: Chip Level Architecture


- 64 tiles
- Each tile consists of 4 throughput-optimized cores (TOCs), i.e. GPU cores for throughput-oriented parallel tasks
- 16 DRAM memory controllers (MCs)
- 8 latency-optimized cores (LOCs), i.e. CPUs for the operating system and serial portions
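A quick back-of-the-envelope sketch in Python relates these unit counts to the Echelon goals of 16 TFlops/sec and 1.6 TB/sec. It assumes peak throughput and bandwidth are split evenly across TOCs and memory controllers, which is an assumption made here for illustration rather than a statement from the paper.

```python
# Chip-level arithmetic from the Echelon goals and block-diagram counts.
TILES = 64
TOCS_PER_TILE = 4
MEMORY_CONTROLLERS = 16
PEAK_FLOPS = 16e12        # double precision, Flops/sec (Echelon goal)
MEM_BANDWIDTH_BPS = 1.6e12  # bytes/sec (Echelon goal)

total_tocs = TILES * TOCS_PER_TILE

# Even split across units is an illustrative assumption.
flops_per_toc = PEAK_FLOPS / total_tocs
bandwidth_per_mc = MEM_BANDWIDTH_BPS / MEMORY_CONTROLLERS

# Off-chip bytes available per Flop at peak rates.
bytes_per_flop = MEM_BANDWIDTH_BPS / PEAK_FLOPS

print(f"TOCs: {total_tocs}")
print(f"per TOC: {flops_per_toc / 1e9:.1f} GFlops/sec")
print(f"per MC:  {bandwidth_per_mc / 1e9:.0f} GB/sec")
print(f"bytes per Flop: {bytes_per_flop:.2f}")
```

At about 0.1 byte of DRAM bandwidth per Flop, most operands have to come from on-chip storage, which is why the design emphasizes locality.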

Echelon Block Diagram: Throughput Tile Architecture


- 4 TOCs per tile.
- Each TOC has secondary on-chip storage.
- It may be on-chip DRAM.

Characteristics of a TOC: MIMD + SIMD, Configurable and Sharable SRAM, and LIW per lane


Temporal SIMT
- Divergent code → MIMD
- Non-divergent code → SIMT (more energy-efficient)

Two-level register files
- Operand register file (ORF) for producer-consumer relationships between subsequent instructions
- Main register file (MRF)

Multilevel scheduling
- 4 active and 60 on-deck sets (total 64 threads)


Malleable Memory System

• Selective SRAM
– H/W-controlled cache + scratchpads (S/W-controlled cache)
– The ratio can be determined by programmers
  • E.g. 16 KB/48 KB or 48 KB/16 KB (total 64 KB)
– Where to inherit from can be determined by programmers
  • (GMEMs, L2, ranges)


http://www.hardwarecanucks.com/reviews/processors/huma-amds-new-heterogeneous-unified-memory-architecture/

To make writing a parallel program as easy as writing a sequential program

• Unified memory addressing
– An address space spanning LOCs and TOCs, as well as multiple Echelon chips

• Selective memory coherence
– First, place data in a coherence domain; later, remove coherence to get better performance (energy, execution time)

• H/W fine-grained thread creation
– Automated fine-grained parallelization by H/W


This work is licensed under a Creative Commons Attribution 3.0 Unported License.


http://creativecommons.org/licenses/by/3.0/


