GPUs and Future of Parallel Computing Authors: Stephen W. Keckler et al. in NVIDIA IEEE Micro, 2011...
GPUs and Future of Parallel Computing
Authors: Stephen W. Keckler et al., NVIDIA
IEEE Micro, 2011
Taewoo Lee, 2013.05.24
Three Challenges for Parallel-Computing Chips
• Limited power budget
• Bandwidth gap between computation and memory
• Parallel programmability
http://voice.korea.ac.kr p.2
Computers have been constrained by power and energy rather than area
• Power budget is limited
– about 150 W for desktops or 3 W for mobile devices (∵ leakage, cooling)
• The number of transistors per chip has increased continuously (Moore's law)
– Total power consumption has increased as well
• E.g. Supercomputer
– Power budget = 20 MW
– Target compute capability = 10^18 Flops/sec = 1 exaFlops/sec
– Energy/Flop = 20 MW ÷ 10^18 Flops/sec = 20×10^-12 J/Flop = 20 pJ/Flop
• However, modern CPUs (Intel's Westmere)
– 1700 pJ/Flop (double-precision, ∵ 130 W / 77 GFlops)
• GPU (Fermi architecture)
– 225 pJ/Flop (single-precision, ∵ 130 W / 665 GFlops)
• Improvements of 85× (CPU) and 11× (GPU) are needed
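The energy budget above can be checked with a few lines of arithmetic. This is a minimal sketch in Python; the variable names are made up for illustration, and the CPU and GPU per-operation energies are taken as stated on the slide rather than re-derived.

```python
# Energy-per-operation budget for an exascale machine, using the slide's figures.
POWER_BUDGET_W = 20e6      # 20 MW power budget
TARGET_FLOPS = 1e18        # 1 exaFlops/sec

# Watts divided by Flops/sec gives joules per Flop; scale to picojoules.
target_pj = POWER_BUDGET_W / TARGET_FLOPS * 1e12

# Per-operation energies as quoted on the slide (not re-derived here).
CPU_PJ_PER_FLOP = 1700.0   # Intel Westmere, double precision
GPU_PJ_PER_FLOP = 225.0    # Fermi, single precision

# How many times more efficient each processor would have to become.
cpu_gap = CPU_PJ_PER_FLOP / target_pj
gpu_gap = GPU_PJ_PER_FLOP / target_pj

print(f"target: {target_pj:.0f} pJ/Flop")
print(f"CPU gap: about {cpu_gap:.0f}x, GPU gap: about {gpu_gap:.0f}x")
```

The two gaps correspond to the 85× and 11× figures on the slide.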
Energy-efficiency will require reducing both instruction execution and data movement overheads
• Instruction overheads
– Modern CPUs are optimized for single-thread performance
  • E.g. branch prediction, out-of-order execution, and large primary instruction and data caches
  • As a result, much energy is consumed in the overheads of data supply, instruction supply, and control
– To get higher throughput, future architectures must spend their energy on more useful work (i.e. computation)
• Today, a double-precision fused multiply-add (DFMA) consumes around 50 pJ
• Energy dissipated in data movement is also large
– E.g. energy to read three 64-bit source operands and to write one destination operand
  • to SRAM = 56 pJ (≈ DFMA; ∵ 14 pJ × 4 = 56 pJ)
  • to memory 10 mm farther away = 56 × 6 pJ
  • to external DRAM = 56 × 200 pJ
• Because communication dominates energy, both within the chip and across the external memory interface, energy-efficient architectures must decrease the amount of data movement by exploiting locality
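To make the comparison concrete, the figures above can be tabulated against the cost of the DFMA itself. A minimal sketch in Python; the constant names are illustrative, and the 6× and 200× multipliers are the slide's own.

```python
DFMA_PJ = 50.0           # one double-precision fused multiply-add
SRAM_ACCESS_PJ = 14.0    # one 64-bit operand to or from SRAM
OPERANDS = 4             # three source reads + one destination write

sram_pj = SRAM_ACCESS_PJ * OPERANDS   # operands kept in local SRAM
wire_pj = sram_pj * 6                 # operands 10 mm farther away on chip
dram_pj = sram_pj * 200               # operands in external DRAM

for name, pj in [("SRAM", sram_pj), ("10 mm on-chip", wire_pj), ("external DRAM", dram_pj)]:
    # Ratio shows how much more the data movement costs than the arithmetic.
    print(f"{name:<15}{pj:>8.0f} pJ  ({pj / DFMA_PJ:.1f}x DFMA)")
```

Even when every operand stays in SRAM, moving the data costs about as much as the arithmetic; from DRAM it costs over two hundred times as much.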
• With scaling projected to 10 nm, the ratios between DFMA, on-chip SRAM, and off-chip DRAM access energy stay relatively constant
• However, the relative energy cost of 10 mm global wires goes up to 23 times the DFMA energy (∵ wire capacitance C remains constant)
– Feature size ↓ → relative power consumption of wires ↑
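[Figure: energy ratios before and after scaling to 10 nm. Labels: 1 : 6.2 and 1 : 23 (DFMA : 10 mm wire); 3.6 : 1 and 3.6 : 1]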
The bandwidth gap between computation and memory is severe, and data movement consumes considerable power
The bandwidth gap between computation and memory keeps growing → narrowing this gap is very important
Despite the relatively narrow memory bandwidth, chip-to-chip power consumption is too high! (∵ DRAM max. BW 175 GB/sec × 20 pJ/bit = 28 W, + 21 W for signaling = 49 W; 49 W accounts for 20% of total GPU TDP (thermal design power)) → Again, reducing data movement is necessary
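The 28 W figure follows from converting bytes to bits before multiplying by the per-bit energy. A minimal sketch in Python; the names are illustrative, and the TDP at the end is only what the slide's "20%" implies, not a value stated in the slides.

```python
DRAM_BW_BYTES_PER_S = 175e9   # 175 GB/sec peak DRAM bandwidth
ENERGY_PER_BIT_J = 20e-12     # 20 pJ per bit transferred
SIGNALING_W = 21.0            # additional signaling power quoted on the slide

# bytes/sec * 8 bits/byte * J/bit = J/sec = W
dram_w = DRAM_BW_BYTES_PER_S * 8 * ENERGY_PER_BIT_J
total_w = dram_w + SIGNALING_W

# Assumption: "20% of TDP" is taken literally to back out the implied TDP.
implied_tdp_w = total_w / 0.20

print(f"DRAM transfer: {dram_w:.0f} W, with signaling: {total_w:.0f} W")
print(f"implied TDP: about {implied_tdp_w:.0f} W")
```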
To cope with the bandwidth gap problem,
• Architects are trying
  • Multichip modules (MCMs)
  • DRAMs on-chip (to reduce latency)
  • CPU + GPU on-chip (to reduce transfer overheads)
    • but sharing bandwidth between the CPU and GPU can also hurt bandwidth utilization
  • 3D chip stacking
  • Deeper memory hierarchy
• Bandwidth utilization
– Coalescing
– Prefetch
– Data compression (more data per transaction)
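Coalescing, listed above, is easiest to see by counting memory transactions. The following is a minimal sketch in Python, not a model of any specific GPU: the 128-byte segment size, the 32-thread group, and the helper name `count_transactions` are all assumptions chosen for illustration.

```python
def count_transactions(byte_addresses, segment_bytes=128):
    """Count the distinct aligned memory segments touched by one group of threads."""
    return len({addr // segment_bytes for addr in byte_addresses})

THREADS = 32   # assumed number of threads issuing a load together
WORD = 4       # assumed 4-byte element per thread

# Neighbouring threads read neighbouring words: all fall in one segment.
contiguous = [t * WORD for t in range(THREADS)]

# Each thread reads 32 words apart: every access lands in its own segment.
strided = [t * 32 * WORD for t in range(THREADS)]

print(count_transactions(contiguous))
print(count_transactions(strided))
```

Under these assumptions the contiguous pattern needs a single transaction while the strided pattern needs one per thread, so the same amount of useful data costs 32 times the memory traffic.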
http://www.extremetech.com/computing/95319-ibm-and-3m-to-stack-100-silicon-chips-together-using-glue
For parallel programmability, programmers must be able to
• Express data access patterns and data placement (∵ the memory model is no longer flat; coalesced access)
• Deal with thousands of threads
• Choose which kind of processing core their tasks run on (∵ heterogeneity will increase)
• Also, coherence and consistency should be relaxed to facilitate greater memory-level parallelism
– ∵ the cost of the coherence protocol is too high
– Solution: give programmers selective coherence
To cope with these challenges
• Limited power budget
• Bandwidth gap between computation and memory
• Parallel programmability
Echelon: A Research GPU Architecture
• Goals
– Double precision 16 TFlops/sec
– Memory bandwidth = 1.6 TB/sec
– Power budget ≤ 150 W
– 20 pJ/Flop
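Two ratios follow directly from these goals and help put them in context. A minimal sketch in Python with illustrative names; it simply divides the stated figures and makes no claim about how the 20 pJ/Flop goal itself is accounted for in the slides.

```python
PEAK_FLOPS = 16e12        # 16 TFlops/sec, double precision
MEM_BW_BYTES = 1.6e12     # 1.6 TB/sec memory bandwidth
POWER_BUDGET_W = 150.0    # upper bound on chip power

# Bytes of memory bandwidth available per floating-point operation at peak.
bytes_per_flop = MEM_BW_BYTES / PEAK_FLOPS

# Power budget divided by peak throughput, in picojoules per Flop.
chip_pj_per_flop = POWER_BUDGET_W / PEAK_FLOPS * 1e12

print(f"{bytes_per_flop:.2f} bytes/Flop, {chip_pj_per_flop:.3f} pJ/Flop at peak")
```

At roughly 0.1 bytes per Flop, a kernel must reuse each byte fetched from DRAM many times to approach peak throughput, which is why the slides stress locality.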
Echelon Block Diagram: Chip Level Architecture
- 64 tiles
- Each tile consists of 4 throughput-optimized cores (TOCs), i.e. GPU cores for throughput-oriented parallel tasks
- 16 DRAM memory controllers (MCs)
- 8 latency-optimized cores (LOCs), i.e. CPUs for the operating system and serial portions
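The chip-level counts above can be captured in a small configuration record. A minimal sketch in Python; the class name `EchelonChip` and its fields are hypothetical, and the per-controller bandwidth assumes the 1.6 TB/sec goal is divided evenly across the 16 memory controllers, which the slides do not state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EchelonChip:
    """Hypothetical record of the chip-level counts listed on the slide."""
    tiles: int = 64
    tocs_per_tile: int = 4
    memory_controllers: int = 16
    locs: int = 8

    @property
    def total_tocs(self) -> int:
        # Throughput-optimized cores across the whole chip.
        return self.tiles * self.tocs_per_tile

chip = EchelonChip()

# Assumption: the 1.6 TB/sec goal is shared evenly by all memory controllers.
bw_per_mc = 1.6e12 / chip.memory_controllers

print(chip.total_tocs, "TOCs,", bw_per_mc / 1e9, "GB/sec per MC")
```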
Echelon Block Diagram: Throughput Tile Architecture
- 4 TOCs per tile
- Each TOC has secondary on-chip storage
- It may be on-chip DRAM
Characteristics of a TOC: MIMD + SIMD, Configurable and Sharable SRAM, and LIW per lane
Temporal SIMT
- Divergent code → MIMD
- Non-divergent code → SIMT (more energy-efficient)
Two-level register files
- Operand register file (ORF) for producer-consumer relationships between subsequent instructions
- Main register file (MRF)
Multilevel scheduling
- 4 active and 60 on-deck sets (total 64 threads)
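The active/on-deck split can be illustrated with a toy scheduler. This is a sketch under stated assumptions, not the Echelon design: the class `TwoLevelScheduler`, the FIFO promotion order, and the idea that a stalled thread is demoted to the back of the on-deck queue are all hypothetical choices made for illustration.

```python
from collections import deque

class TwoLevelScheduler:
    """Toy model: a few active threads issue; the rest wait on deck."""

    def __init__(self, num_threads=64, num_active=4):
        self.active = list(range(num_active))
        self.on_deck = deque(range(num_active, num_threads))

    def stall(self, thread_id):
        """Demote a stalled active thread and promote the oldest on-deck thread."""
        slot = self.active.index(thread_id)
        self.on_deck.append(thread_id)               # stalled thread waits at the back
        self.active[slot] = self.on_deck.popleft()   # oldest waiting thread takes the slot
        return self.active[slot]

sched = TwoLevelScheduler()
promoted = sched.stall(2)   # e.g. thread 2 hits a long-latency memory access
print(promoted, sched.active, len(sched.on_deck))
```

Keeping the active set small means only a handful of threads need fast scheduling state at any moment, while the on-deck threads still hide long latencies.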
Malleable Memory System
• Selective SRAM
– H/W-controlled cache + scratchpads (S/W-controlled cache)
– The ratio can be determined by programmers
  • E.g. 16KB/48KB or 48KB/16KB (total 64KB)
– Where to inherit can be determined by programmers
  • (GMEMs, L2, ranges)
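The programmer-chosen ratio amounts to partitioning a fixed SRAM budget. A minimal sketch in Python; the helper `split_sram` is hypothetical, and treating the first number of each pair as the hardware-cache share is an assumption, since the slide does not say which side is which.

```python
TOTAL_SRAM_KB = 64   # total configurable SRAM, from the slide's example

def split_sram(cache_kb, total_kb=TOTAL_SRAM_KB):
    """Return (cache_kb, scratchpad_kb) for a requested hardware-cache size."""
    if not 0 <= cache_kb <= total_kb:
        raise ValueError(f"cache size must be between 0 and {total_kb} KB")
    # Whatever is not used as hardware-managed cache becomes scratchpad.
    return cache_kb, total_kb - cache_kb

print(split_sram(16))
print(split_sram(48))
```

A kernel with irregular reuse might favor the larger cache share, while one that stages data explicitly might favor the larger scratchpad.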
http://www.hardwarecanucks.com/reviews/processors/huma-amds-new-heterogeneous-unified-memory-architecture/
To make writing a parallel program as easy as writing a sequential program
• Unified memory addressing
– An address space spanning LOCs and TOCs, as well as multiple Echelon chips
• Selective memory coherence
– First, place data in a coherence domain; later, remove coherence to get better performance (energy, execution time)
• H/W fine-grained thread creation
– Automated fine-grained parallelization by H/W
This work is licensed under a Creative Commons Attribution 3.0 Unported License.