New Rules: Scaling Performance for Extreme Scale Computing
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
New Rules: Scaling Performance for Extreme Scale Computing
S. Yalamanchili
School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, GA 30332
Sponsors: National Science Foundation, Sandia National Laboratories, Semiconductor Research Corporation, and IBM
Scaling Computing
Cray Titan
Hybrid Memory Cube
Outline
Performance Scaling Rules & Constraints
Case Study: CPU-GPU Coordinated Power Management
Some New Projects
MIDAS – Cooling-Power-Microarchitecture Co-Design
Hardware/Software Reliability Enhancement
Adaptive Architectures
Moore’s Law
From wikipedia.org
• Performance scaled with number of transistors
• Dennard scaling: power scaled with feature size
Goal: Sustain Performance Scaling
Post Dennard Architecture Performance Scaling
W. J. Dally, Keynote IITC 2012
Data_movement_cost
Three operands x 64 bits/operand
Power Delivery Cooling
Scaling Performance: Cost of Data Movement
Embedded Platforms
Goal: 1-100 GOps/W
Big Science: To Exascale
Goal: 20 MW/Exaflop
• Sustain performance scaling through massive concurrency
• Data movement becomes more expensive than computation
Courtesy: R. Murphy, Sandia National Labs.
Cost of Data Movement
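The 20 MW/Exaflop goal can be converted into a per-operation energy budget with simple arithmetic. The sketch below assumes one exaflop means 10^18 operations per second; the function and variable names are illustrative and do not come from the talk.

```python
# Illustrative arithmetic only; all names here are hypothetical.
EXA_OPS_PER_S = 1e18   # assumption: one exaflop = 10^18 operations per second
MEGAWATT = 1e6         # watts


def joules_per_op(power_w: float, ops_per_s: float) -> float:
    """Average energy available per operation at a given power budget."""
    return power_w / ops_per_s


def gops_per_watt(power_w: float, ops_per_s: float) -> float:
    """Energy efficiency expressed in giga-operations per second per watt."""
    return ops_per_s / power_w / 1e9


budget_pj = joules_per_op(20 * MEGAWATT, EXA_OPS_PER_S) * 1e12  # picojoules
efficiency = gops_per_watt(20 * MEGAWATT, EXA_OPS_PER_S)

print(f"Energy budget per operation: {budget_pj:.1f} pJ")
print(f"Required efficiency: {efficiency:.1f} GOps/W")
```

Under this assumption the exascale target works out to roughly 20 pJ per operation, or about 50 GOps/W, which falls inside the 1-100 GOps/W range quoted for embedded platforms.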
Post Dennard Architecture Performance Scaling
W. J. Dally, Keynote IITC 2012
Operator_cost + Data_movement_cost
Three operands x 64 bits/operand
Specialization: heterogeneity and asymmetry
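The cost terms on this slide (Operator_cost + Data_movement_cost, with three operands x 64 bits/operand) can be sketched as a small energy model. This is a minimal illustration, not the model from the cited keynote: the per-bit energy and operator cost passed in below are placeholder values chosen for the example, not measurements.

```python
# Minimal energy-per-operation sketch. Only the operand count and width
# come from the slide; every energy value used below is a hypothetical
# placeholder, not a measured number.
OPERANDS_PER_OP = 3     # from the slide: three operands
BITS_PER_OPERAND = 64   # from the slide: 64 bits/operand


def data_movement_cost_pj(pj_per_bit_moved: float) -> float:
    """Energy to move all operands of one operation."""
    return OPERANDS_PER_OP * BITS_PER_OPERAND * pj_per_bit_moved


def energy_per_op_pj(operator_cost_pj: float, pj_per_bit_moved: float) -> float:
    """Operator_cost + Data_movement_cost for one operation."""
    return operator_cost_pj + data_movement_cost_pj(pj_per_bit_moved)


def movement_fraction(operator_cost_pj: float, pj_per_bit_moved: float) -> float:
    """Share of the per-operation energy spent moving data."""
    total = energy_per_op_pj(operator_cost_pj, pj_per_bit_moved)
    return data_movement_cost_pj(pj_per_bit_moved) / total


# Placeholder inputs: 10 pJ operator cost, 0.5 pJ per bit moved.
total_pj = energy_per_op_pj(10.0, 0.5)
share = movement_fraction(10.0, 0.5)
print(f"total = {total_pj:.1f} pJ, data movement share = {share:.0%}")
```

Because each operation moves 192 bits, even a small per-bit cost can dominate the total, which is the sense in which data movement becomes more expensive than computation.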
Scaling Performance: Simplify and Multiply
AMD Bulldozer Core
ARM A7 Core (arm.com)
NVIDIA Fermi
Extracting single thread performance costs energy
Out-of-order execution, branch prediction, scheduling, etc.
Multithread performance exploits parallelism
Simpler pipelines, core scaling
Still important!
Asymmetry vs. Heterogeneity
Multiple voltage and frequency islands
Different memory technologies
STT-RAM, PCM, Flash
[Figure: two tiled many-core layouts, each built from tiles (Tile) and memory controllers (MC)]
Performance Asymmetry
Functional Asymmetry
Heterogeneous
Complex cores and simple cores
Shared instruction set architecture (ISA)
Subset ISA
Distinct microarchitecture
Fault and migrate model of operation [1]
Uniform ISA Multi-ISA
[1] T. Li et al., “Operating system support for shared ISA asymmetric multi-core architectures,” in WIOSCA, 2008.
Multi-ISA Microarchitecture
Memory & Interconnect hierarchy
The Challenge: The Memory System
Xeon Phi
Hybrid Memory Cube
What should the memory hierarchy look like?
Parallelism vs. locality tradeoffs
Minimize data movement
Processor in Memory?
H. Kim (SCS) and S. Yalamanchili (ECE)
Sponsor: Sandia National Labs
Thermal Capacity
Exploit package physics
Temperature changes on the order of milliseconds
Workload behaviors change on the order of microseconds
Impact on device behavior?
[Figure: instructions/cycle vs. time]
Time Varying Workload
Thermal Capacity
Power-Performance Management!
Figures: psdgraphics.com and wikipedia.org
Adapting to Power + Thermal Constraints
Exploit package physics
Temperature changes on the order of milliseconds
Use the thermal headroom
Max Power
TDP Power
Low power – build up thermal credits
Turbo boost region
10s of seconds
Intel Sandy Bridge
Time
E. Rotem et al., “Power-Management Architecture of the Intel Microarchitecture Code-Named Sandy Bridge,” IEEE Micro, March 2012
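The idea of building up thermal credits at low power and spending them in the turbo boost region can be illustrated with a lumped thermal RC model. This is a rough sketch under stated assumptions and not Intel's actual algorithm: the thermal resistance, capacitance, ambient temperature, TDP, and boost power below are all hypothetical values, chosen so that the time constant is 10 seconds.

```python
# Lumped RC package model: C * dT/dt = P - (T - T_ambient) / R.
# All parameter values are hypothetical and chosen only for illustration.

def step_temperature(temp_c, power_w, dt_s, r_th, c_th, t_ambient_c):
    """Advance the package temperature by one forward-Euler step."""
    return temp_c + dt_s * (power_w - (temp_c - t_ambient_c) / r_th) / c_th


def boost_seconds(start_temp_c, boost_power_w, limit_c,
                  r_th=0.5, c_th=20.0, t_ambient_c=40.0,
                  dt_s=0.01, max_s=600.0):
    """Time the package can run at boost_power_w before reaching limit_c."""
    temp_c, elapsed_s = start_temp_c, 0.0
    while temp_c < limit_c and elapsed_s < max_s:
        temp_c = step_temperature(temp_c, boost_power_w, dt_s,
                                  r_th, c_th, t_ambient_c)
        elapsed_s += dt_s
    return elapsed_s


R_TH, T_AMBIENT_C = 0.5, 40.0          # K/W and degrees C (hypothetical)
TDP_W, BOOST_W = 100.0, 130.0          # hypothetical power levels
LIMIT_C = T_AMBIENT_C + R_TH * TDP_W   # steady-state temperature at TDP

# Starting temperatures are the steady states reached at 20 W and 80 W.
after_idle = boost_seconds(T_AMBIENT_C + R_TH * 20.0, BOOST_W, LIMIT_C)
after_load = boost_seconds(T_AMBIENT_C + R_TH * 80.0, BOOST_W, LIMIT_C)

print(f"boost time after a low-power phase: {after_idle:.1f} s")
print(f"boost time after a near-TDP phase:  {after_load:.1f} s")
```

In this toy model a package that has been idling starts cooler and can therefore stay above TDP for longer before reaching the temperature limit, which is one way to read the thermal credits label in the figure.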
Summary: New Performance Scaling Rules
Energy efficiency: Scale performance by scaling energy efficiency -> diversify programming models?
Parallelism: Scale number of cores rather than performance of a single core -> multiply
Data Movement: Energy cost of data movement exceeds the energy cost of computation -> communication-centric
Physics Capacity: Scaling limited by thermal/power capacity -> power management
Physics of Computation
Physical phenomena interact with architecture to affect system level metrics such as energy, power, performance and reliability
These interactions are modulated by workloads
Need management techniques that can improve energy efficiency
Thermal Coupling
CU1, CU0, GPU
Thermal coupling between CPU and GPU leads to inefficient operation
Interaction with power management, e.g., turbo core/boost reduces efficiency
AMD Trinity APU
I. Paul et al. (ISCA 2013)
Thermal Signatures and Fields
CPUs are more thermally dense
Thermal gradient from CPUs to GPU
GPU temperature rises due to thermal pollution
Premature throttling and hence performance loss
Measurements on an AMD Trinity A8-4555M APU
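The thermal pollution effect described above can be sketched with a two-node thermal network in which the CPU and GPU each leak heat to ambient and also exchange heat through a coupling resistance. This is a toy model under stated assumptions, not the model or measurements from the cited work: every resistance, capacitance, and power value is hypothetical.

```python
# Two-node coupled thermal model (forward Euler). All parameter values
# are hypothetical; they are not measurements from the AMD Trinity APU.

def steady_temps(cpu_power_w, gpu_power_w, r_couple=1.0,
                 r_cpu=2.0, r_gpu=2.0, c_node=5.0,
                 t_ambient_c=40.0, dt_s=0.01, sim_s=200.0):
    """Simulate long enough to approach steady state; return (T_cpu, T_gpu)."""
    t_cpu = t_gpu = t_ambient_c
    for _ in range(int(sim_s / dt_s)):
        # Heat flowing from the CPU node to the GPU node through the die.
        flow_w = (t_cpu - t_gpu) / r_couple
        d_cpu = cpu_power_w - (t_cpu - t_ambient_c) / r_cpu - flow_w
        d_gpu = gpu_power_w - (t_gpu - t_ambient_c) / r_gpu + flow_w
        t_cpu += dt_s * d_cpu / c_node
        t_gpu += dt_s * d_gpu / c_node
    return t_cpu, t_gpu


GPU_POWER_W = 15.0  # the GPU load is held fixed in both runs
_, gpu_with_idle_cpu = steady_temps(5.0, GPU_POWER_W)
_, gpu_with_busy_cpu = steady_temps(25.0, GPU_POWER_W)

print(f"GPU temperature, CPU at 5 W:  {gpu_with_idle_cpu:.1f} C")
print(f"GPU temperature, CPU at 25 W: {gpu_with_busy_cpu:.1f} C")
```

With identical GPU power in both runs, the GPU node ends up hotter when the CPU draws more power. In a real part that extra temperature would bring the GPU closer to its throttling threshold, even though its own workload did not change.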
Thermal Coupling
[Figure: relative CPU and GPU power (GPU Pow, CPU CU0 Pow, CPU CU1 Pow) and peak die temperature (PeakDieTemp) vs. time in seconds]
CPU power is limited, GPU running at max DVFS state
Thermal coupling
Temp throttling
CU1, CU0, GPU
Thermal coupling between CPU and GPU accelerates temperature rise
Solution: Cooperative Boosting
AMD Trinity APU
CPU Core Performance States: AMD Trinity APU
Performance Coupling Effects
Need to balance performance coupling and thermal coupling effects
Power Savings Using Cooperative Boosting (CB)
Energy-Delay²
Related Projects
MIDAS
Probe Stations: Chip and Board Level Probe, Test, & Power Measurement
FPPE
On-Demand Cooling
Advanced micro-fabrication for microfluidics
FPGA Implementation
Multi-scale Transient Models for Coupled Multi-physics
Emulating the Physics
Microarchitecture Optimization
Locally transient Adaptations
[Figure labels: GPU; CPU core (x6); cache (x3)]
Globally Transient Adaptations
Power Model
Functional simulator (Apps, OS)
Cycle-Level Multicore Timing Simulator
PDN Model
Thermal Model
MIDAS Tasks
Indicates existing infrastructures/models
System Arch. Impact
Power Distr. Network
Novel Cooling Technology
Co-Design
M. Bakir (ECE), H. Kim (SCS), Y. Joshi (ME), S. Mukhopadhyay (ECE),
S. Yalamanchili (ECE)
Reliability Enhancement - Microarchitecture
Peak temperature and MTTF analysis from J. Srinivasan et al., “Lifetime Reliability: Toward An Architectural Solution,” Micro 2005.
64-core asymmetric chip multiprocessor layout and failure probability distribution
Scale: x10^-10. Legend: in-order core, out-of-order core
25% peak-to-peak difference in failure distribution across the processor die, induced by architectural asymmetry, thermal coupling, power management, and workload characteristics
Single-core processor lifetime reliability
Multicore processor lifetime reliability
S. Yalamanchili (ECE) and S. Mukhopadhyay (ECE)
Sponsor: SRC (IBM)
Reliability Enhancement - Software
Key Idea: low-level code injection
Framework: on-demand, customizable, transparent
Examples: alignment check, bounds check, control flow check
S. Yalamanchili (ECE) and K. Schwan (SCS)
Sponsor: NSF
Adaptive 3D Architecture
[Figure labels: Network; Many Core Processor; Memory (x4); GPU; CPU core (x6); cache (x3)]
Reliability
Functional simulator (Apps, OS)
Cycle-Level Multicore Timing Simulator
PDN Model
Power & Thermal Model
Applications and Threads
workload (w(t))
Manifold
1. Workload Characterization
2. Characterize Interacting Physical Effects
3. Explore Architecture Design Space
4. Microarchitectural Adaptation Mechanisms
S. Yalamanchili (ECE) and S. Mukhopadhyay (ECE)
Sponsor: SRC
Summary
Architecture
Applications
The physical phenomena should not be masked, but rather integrated into the design
We need tightly coupled integrated models of software, algorithms, architecture and technology
Technology
Thank You
Questions?