Dynamic Thermal Management in Modern Processors

Dynamic Thermal Management in Modern Processors Shervin Sharifi PhD Candidate CSE Department, UC San Diego


Page 1: Dynamic Thermal Management in Modern Processors

Dynamic Thermal Management

in Modern Processors

Shervin Sharifi

PhD Candidate

CSE Department, UC San Diego

Page 2: Dynamic Thermal Management in Modern Processors

Power Outlook

[Figure: Vdd (Volts) vs. year, 2001 to 2017, ideal vs. realistic scaling]

• Vdd scaling will slow down

• Power will increase constantly

• Feature sizes decrease

• Significant increase in power densities

[Figure: Power (Watts) vs. year, 2001 to 2017]

S. Borkar, "Thousand Core Chips: A Technology Perspective," DAC07 [B. Charlot & K. Torki, TIMA]

Page 3: Dynamic Thermal Management in Modern Processors

Temperature Induced Problems

• Thermal hot spots

– Accelerate failure mechanisms

• Exponentially dependent on temperature (electromigration, etc.)

– Performance loss

– Higher leakage power

• Spatial variations

– Performance mismatch, clock skew

– Mechanical stress

• Temporal variations

– Thermal cycling

Ajami, et al, ICCAD 01

Meterelliyoz, et al, ITC 2005

www.aitechnology.com

Page 4: Dynamic Thermal Management in Modern Processors

Thermal Management

• Techniques to control the chip temperature

• Power management is not enough for thermal management

– TM techniques are concerned with thermal hotspots and temperature variations

– PM techniques are usually concerned with the overall power consumption

• Off-chip : e.g. Cooling techniques

• On-chip : e.g. Temperature aware task scheduling

• Static : e.g. Temperature-aware floorplanning

• Dynamic : e.g. Thread migration

M. Santarini, EDN, Sep 2005

Page 5: Dynamic Thermal Management in Modern Processors

Dynamic Thermal Management

• Low-power design techniques are not sufficient

• Goals of DTM

– Address thermal hotspots

– Some recent techniques also address spatial and temporal temperature variations

• Design for typical temperature instead of worst-case temperature

• Achieve the highest performance under thermal constraints

• DTM techniques respond to thermal emergencies by

– Reducing the heat generation

– Distributing the heat generation

• DTM incurs overhead

– Performance

– Hardware overhead

[Figure: breakdown (0% to 100%) by temperature band (<75 °C, [75,80) °C, [80,85] °C, >85 °C) for Load Balancing, Energy-Aware Optimization, and Thermally-Aware Optimization]

[Coskun, et al. ASP-DAC 08]

Page 6: Dynamic Thermal Management in Modern Processors

Sequence of Events in DTM

[Figure: temperature vs. time under DTM. Events in order: trigger reached, turn DTM on, check temp, check temp, turn DTM off, execution ends. Intervals: startup delay, policy delay, shutoff delay. Annotations: DTM trigger level, maximum temperature without DTM, maximum temperature with DTM, reduction in max. temp., performance loss.]

D. Brooks, M. Martonosi, "Dynamic Thermal Management for High-Performance Microprocessors," HPCA01
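The sequence of events on this slide can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the first-order thermal model, the function name `simulate_dtm`, and every parameter value are hypothetical and not taken from the cited paper. Startup and shutoff delays are not modeled; only the policy delay (the minimum time DTM stays on before temperature is re-checked) is.

```python
def simulate_dtm(power_trace, trigger_c=80.0, ambient_c=45.0, r_th=0.8,
                 tau_steps=10.0, throttle=0.5, policy_delay=5):
    """Toy trigger-based DTM loop (all constants are illustrative assumptions)."""
    temp = ambient_c
    dtm_on = False
    hold = 0  # steps remaining before the next temperature check
    temps, states = [], []
    for power in power_trace:
        if dtm_on:
            power *= throttle  # response: reduce heat generation
        # First-order model: temperature relaxes toward a steady-state value.
        steady = ambient_c + r_th * power
        temp += (steady - temp) / tau_steps
        if dtm_on:
            hold -= 1
            if hold <= 0:  # policy delay elapsed: check temperature
                if temp < trigger_c:
                    dtm_on = False  # turn DTM off
                else:
                    hold = policy_delay  # stay on, check again later
        elif temp >= trigger_c:  # trigger reached: turn DTM on
            dtm_on = True
            hold = policy_delay
        temps.append(temp)
        states.append(dtm_on)
    return temps, states


if __name__ == "__main__":
    baseline, _ = simulate_dtm([60.0] * 200, trigger_c=1e9)  # DTM never fires
    managed, _ = simulate_dtm([60.0] * 200)
    print(f"max without DTM: {max(baseline):.1f} C, with DTM: {max(managed):.1f} C")
```

The gap between the two maxima corresponds to the "reduction in max. temp." annotation, and the steps spent with DTM on correspond to the performance loss.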

Page 7: Dynamic Thermal Management in Modern Processors

Classification of DTM Techniques

• Software

– Implemented only in software

– Example: Temperature aware scheduling

• Hardware throttling

– Global

• Hardware support only for throttling the whole chip

– Local

• Hardware support for throttling parts of the chip

– Examples: Clock gating

• Hybrid

– Combinations of previous classes

Page 8: Dynamic Thermal Management in Modern Processors

Classification of DTM Techniques

• Software

• Hardware throttling

– Global

– Local

• Hybrid

Page 9: Dynamic Thermal Management in Modern Processors

Temperature Aware Scheduling for a Single-threaded Processor

[Figure: kernel and temperature sensor]

• OS-level scheduler for a single processor

– Process-level control

– Access to hardware statistics

• Reacts to the thermal emergencies detected by the temperature sensors

• Hot processes are identified based on their CPU activity and slowed down

E. Rohou,et al., "Dynamically Managing Processor Temperature and Power", Workshop on Feedback-Directed Optimization, 1999

Page 10: Dynamic Thermal Management in Modern Processors

Temperature Aware Scheduling for a Single-threaded Processor

• Advantages

– No hardware overhead

– Only penalizes the hot processes; the rest run at full speed

– Fine granularity in adjusting the temperature

• Disadvantages

– Limited cooling capability

– Reduces the performance for the most demanding jobs

E. Rohou, et al., "Dynamically Managing Processor Temperature and Power", Workshop on Feedback-Directed Optimization, 1999

Page 11: Dynamic Thermal Management in Modern Processors

Temperature Aware Scheduling for Simultaneous Multithreading Processors

• Assumption

– A program's hotspot behavior is characterized by the intensity of its accesses to the int and fp register files

• Approach

– Picks instructions from the thread that is likely to cool the register files or heat them less quickly

• Hardware performance counters

– To identify int (fp) intensive threads

– To detect thermal danger

• To keep processor temperature below 85°C

– 30% performance increase compared to fetch gating

J. Donald, et al., "Leveraging Simultaneous Multithreading for Adaptive Thermal Control", Workshop on Temperature-Aware Computing Systems, 2005.

[Figure: int_reg and fp_reg register files with an i/f instruction sequence]

Page 12: Dynamic Thermal Management in Modern Processors

Temperature Aware Scheduling for Simultaneous Multithreading Processors

• Advantages

– No hardware overhead

– Doesn’t slow down the whole processor

• Disadvantages

– Limited cooling capability

– Works only for SMT processors

– Reduced thread fairness

J. Donald, et al., "Leveraging Simultaneous Multithreading for Adaptive Thermal Control", Workshop on Temperature-Aware Computing Systems, 2005.

Page 13: Dynamic Thermal Management in Modern Processors

ThermalHerd

• DTM for on-chip networks

• Dynamically steers network traffic to avoid hotspots

– Distributed traffic throttling

– Thermal correlation based traffic routing

• Distributed traffic throttling

– Quota reduces exponentially as temperature rises

Shang et al, “Temperature Aware On-Chip Networks,” IEEE Micro 2006

Page 14: Dynamic Thermal Management in Modern Processors

ThermalHerd

• Thermal correlation based routing

• Thermal correlation

– The mutual thermal effect of two units

• Choose minimal path with thermal correlations to hotspot less than a threshold

• Reduced network peak temperature by 10°C

– Throughput degradation of less than 1%

– Latency overhead of less than 1.2%

Shang et al, “Temperature Aware On-Chip Networks,” IEEE Micro 2006

Page 15: Dynamic Thermal Management in Modern Processors

ThermalHerd

• Advantages

– No hardware overhead

– Low performance degradation

• Disadvantages

– Specific to chips with on-chip networks

– Computation and communication are treated independently, which is not optimal

Shang et al, “Temperature Aware On-Chip Networks,” IEEE Micro 2006

Page 16: Dynamic Thermal Management in Modern Processors

Classification of DTM Techniques

• Response Mechanism

– Software

– Hardware throttling

• Global

• Local

– Hybrid

Page 17: Dynamic Thermal Management in Modern Processors

Global Clock Gating

• The clock signal to the bulk of the processor logic is stopped for a short time period

• Dynamic thermal management for Pentium 4

– Limited to a few microseconds

Gunther, et al. "Managing the impact of increasing microprocessor power consumption." Intel Technology Journal. 2001

Jacobson et al. HPCA 2005

Page 18: Dynamic Thermal Management in Modern Processors

Global Clock Gating

• Advantages

– High cooling capability

• Disables the whole processor

• Eliminates clock tree power

– Low invocation time

• Disadvantages

– Performance impact is high

• Slows down all processes

– Hardware overhead

Gunther, et al. "Managing the impact of increasing microprocessor power consumption." Intel Technology Journal. 2001

Page 19: Dynamic Thermal Management in Modern Processors

Global Dynamic Voltage and Frequency Scaling

• DVFS for DTM

• For DTM, only two voltage steps are enough

– Low voltage → low frequency, but less time to reduce temperature

• To keep temperature below 85°C

– 20-30% slowdown

Skadron, et al., "Temperature-Aware Computer Systems: Opportunities and Challenges", IEEE Micro, 2003.

[Figure: frequency and voltage over time]

Dynamic power: $\alpha C_L V_{DD}^2 f$
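The dynamic power relation on this slide (activity factor times load capacitance times supply voltage squared times frequency) explains why lowering voltage together with frequency is so effective. The following is a minimal sketch; the function names and the example scale factors are illustrative assumptions, not values from the cited paper.

```python
def dynamic_power(alpha, c_load, v_dd, freq):
    """Dynamic power = alpha * C_L * V_DD^2 * f (SI units assumed)."""
    return alpha * c_load * v_dd ** 2 * freq


def relative_power(v_scale, f_scale):
    """Power relative to nominal when V_DD and f are scaled by the given factors."""
    return v_scale ** 2 * f_scale


if __name__ == "__main__":
    # Hypothetical operating point: scale both voltage and frequency to 80%.
    print(f"relative dynamic power: {relative_power(0.8, 0.8):.3f}")
```

Because voltage enters quadratically, scaling both knobs by the same factor reduces dynamic power roughly with the cube of that factor, while performance drops only about linearly with frequency.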

Page 20: Dynamic Thermal Management in Modern Processors

Global DVFS

• Advantages

– Fast reduction of temperature

– The processor can continue running

• Disadvantages

– All processes are slowed down

– Hardware overhead

Page 21: Dynamic Thermal Management in Modern Processors

Classification of DTM Techniques

• Response Mechanism

– Software

– Hardware throttling

• Global

• Local

– Hybrid

Page 22: Dynamic Thermal Management in Modern Processors

Fetch Gating

• Utilizes ILP to hide the performance impact

• The important decision

– Duty cycle

• ILP helps only with mild duty cycles

• At more aggressive duty cycles, slowdown becomes proportional to the duty cycle

• To keep temperature below 85°C

– 10-20% slowdown

Skadron, et al., "Temperature-Aware Computer Systems: Opportunities and Challenges", IEEE Micro, 2003.

Page 23: Dynamic Thermal Management in Modern Processors

Fetch Gating

• Advantages

– Processor is not disabled completely; available ILP compensates for the wasted cycles

– Low invocation time

– Hardware overhead is relatively low

• Disadvantages

– Moderate cooling capability

– Choice of optimal duty cycle is not easy

Skadron, et al., "Temperature-Aware Computer Systems: Opportunities and Challenges", IEEE Micro, 2003.

Page 24: Dynamic Thermal Management in Modern Processors

Activity Migration

• Computations are migrated to spare units in cold areas of the chip

• Power density is reduced by distributing the heat generation

• Activity ping-ponging

– One unit is disabled when the other one is active

• Register file

– About 12°C Max. temperature reduction with about 2% IPC loss

[Figure: die with the original unit and a duplicated unit, illustrating activity ping-ponging]

Heo, et al. "Reducing Power Density through Activity Migration", ISLPED 2003

Page 25: Dynamic Thermal Management in Modern Processors

Activity Ping-ponging

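[Figure: temperature vs. time under activity ping-ponging between the original unit and the duplicated unit on the die, marking temperatures T1 and T2, the migration period, and the reduced temperature]

Heo, et al. "Reducing Power Density through Activity Migration", ISLPED 2003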

Page 26: Dynamic Thermal Management in Modern Processors

A Thermal-Aware Superscalar

Microprocessor

• A secondary pipeline

– Architecturally simple

– Ultra low power

• Response

– Clock gates the primary pipeline

– Secondary pipeline takes over

• To keep temperature below 85°C: at least 11% improvement in energy-CPU time product compared to global DVFS

Lim, et al. "A thermal-aware superscalar microprocessor." ISQED 2002.

Page 27: Dynamic Thermal Management in Modern Processors

Activity Migration

• Advantages

– Effective in reducing local hotspots

– Processor is not disabled completely

• Disadvantages

– Extra hardware for the redundant units

– Longer interconnects lead to higher delays and higher power

– Overhead of copying data to the new unit

Page 28: Dynamic Thermal Management in Modern Processors

Classification of DTM Techniques

• Response Mechanism

– Software

– Hardware throttling

• Global

• Local

– Hybrid

Page 29: Dynamic Thermal Management in Modern Processors

Hybrid Architectural DTM

Fetch Gating

– Mild thermal emergencies: + enough cooling capability, + low performance impact

– Severe thermal emergencies: - high performance impact

Global DVFS

– Mild thermal emergencies: - high performance impact

– Severe thermal emergencies: + high cooling capability

K. Skadron, “Hybrid architectural dynamic thermal management”, DATE 2004.

Page 30: Dynamic Thermal Management in Modern Processors

Hybrid Architectural DTM

• Combines

– Local Hardware Throttling

• Fetch gating

– Global Hardware Throttling

• Global DVFS

• To keep temperature below 85°C

– 25% reduction in DTM overhead compared to global DVFS

K. Skadron, “Hybrid architectural dynamic thermal management”, DATE 2004.
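The idea of matching the response to the severity of the emergency can be sketched as a simple selector. This is a hedged illustration only: the function name `choose_response`, the 85 °C trigger reuse, and the 2 °C severity margin are assumptions made for the example, and the cited paper's actual crossover logic is not reproduced here.

```python
def choose_response(temp_c, trigger_c=85.0, severe_margin_c=2.0):
    """Pick a DTM response by severity (thresholds are hypothetical).

    Below the trigger: no action.
    Mild emergency: fetch gating (local throttling, low performance impact).
    Severe emergency: global DVFS (high cooling capability).
    """
    if temp_c < trigger_c:
        return "none"
    if temp_c < trigger_c + severe_margin_c:
        return "fetch_gating"
    return "global_dvfs"


if __name__ == "__main__":
    for t in (80.0, 85.5, 90.0):
        print(t, choose_response(t))
```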

Page 31: Dynamic Thermal Management in Modern Processors

Hybrid Architectural DTM

• Advantages

– Low performance impact

– Fast temperature reduction

– Ability to adjust the response to the severity of the thermal emergency

• Disadvantages

– Design complexity

– Hardware support for both fetch gating and DVFS

K. Skadron, “Hybrid architectural dynamic thermal management”, DATE 2004.

Page 32: Dynamic Thermal Management in Modern Processors

HybDTM

• Combines

– Temperature-aware scheduling

• Mild thermal adjustments and preventing hotspots

– Global clock gating

• Severe thermal emergencies

• Execution time overhead to keep temperature below 65°C

– 10% compared to 20% in global clock gating

Kumar et al., “HybDTM: A Coordinated Hardware-Software Approach for Dynamic Thermal Management”, DAC 2006.

[Figure: HybDTM flow. A thermal model supplies per-process temperature and overall temperature. The proactive DTM trigger (T > Tsw) leads to software-directed DTM through the scheduler (priority management, timeslice adjustment); the reactive DTM trigger (T > Thw) leads to hardware-directed DTM. Levels 1, 2, and 3 are ordered by increasing temperature.]

Page 33: Dynamic Thermal Management in Modern Processors

HybDTM

• Advantages

– Low performance impact

– Fast temperature reduction

– Finely adjusts the response to the severity of the thermal emergency

• Disadvantages

– Design complexity

Kumar et al., “HybDTM: A Coordinated Hardware-Software Approach for Dynamic Thermal Management”, DAC 2006.

Page 34: Dynamic Thermal Management in Modern Processors

Dynamic Temperature Aware Scheduling

• For multiple processors, there is no optimal dynamic scheduling algorithm for dynamically changing tasks [Liu’73]

– Heuristics are required

• Temperature information is passed to scheduler at each interval

• Scheduler makes decisions based on temperature

– Fast and simple implementation

[Figure: OS-level scheduler assigning Job A, Job B, and Job C using temperature from sensors]

Page 35: Dynamic Thermal Management in Modern Processors

Adaptive-Random Policy

• Goal: balance workload, and minimize and balance temperature, with low scheduling complexity

• Updates the probability of sending workload to a core based on its temperature history

• Pn: probability of a core receiving workload

– Evaluated at each job arrival

• W: evaluated periodically

– Interval length: 1 sec

– $W = \beta / Av_{thr}$

– $Av_{thr}$: (Avg. T below $T_{thr}$) / $T_{thr}$

– $T_{thr}$: threshold temperature

– $\beta$: empirically set constant (system dependent)

[Figure: probability update by core temperature range]

– Hot, > 80°C: Pn = 0

– [75, 80]°C: Pn = Po

– Cool, < 75°C: Pn = Po + W

Ayse K. Coskun
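The probability update on this slide can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function names, the value of β, and in particular the renormalization step (so the probabilities sum to one) are additions made for the example.

```python
import random


def weight(avg_t_below_thr, t_thr, beta=0.1):
    """W = beta / Av_thr, with Av_thr = (avg. T below T_thr) / T_thr. beta is hypothetical."""
    return beta / (avg_t_below_thr / t_thr)


def update_probabilities(p_old, temps_c, weights, hot_c=80.0, cool_c=75.0):
    """Apply the slide's three-band rule per core, then renormalize (an assumption)."""
    p_new = []
    for po, t, w in zip(p_old, temps_c, weights):
        if t > hot_c:
            p_new.append(0.0)      # hot core receives no new workload
        elif t < cool_c:
            p_new.append(po + w)   # cool core becomes more likely
        else:
            p_new.append(po)       # in-between: unchanged
    total = sum(p_new)
    if total == 0.0:               # every core is hot: fall back to uniform
        return [1.0 / len(p_new)] * len(p_new)
    return [p / total for p in p_new]


def pick_core(probs, rng=random):
    """Randomly choose a core index according to the current probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    probs = update_probabilities([0.25] * 4, [82.0, 78.0, 70.0, 70.0], [0.1] * 4)
    print(probs, pick_core(probs))
```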

Page 36: Dynamic Thermal Management in Modern Processors

Adaptive Random

• OS level scheduler for MPSoCs

– Takes floorplan into account

– Addresses temperature variations

• Adaptive Random

• When scheduling is not enough, it is combined with thread migration or local DVFS

• Reduces the hotspots to ~1%

• >10X reduction in temporal variations

• Performance impact

– With thread migration : ~3%

– With DVFS : ~7%

A. Coskun, T. S. Rosing and K. Whisnant. “Temperature Aware Task Scheduling in MPSoCs,” DATE07

Page 37: Dynamic Thermal Management in Modern Processors

Adaptive Random

• Advantages

– Low overhead

– Local control of temperature

– Ability to reduce temperature variations

• Disadvantages

– Design complexity

– Large hardware overhead if local per-core DVFS is used

A. Coskun, T. S. Rosing and K. Whisnant. “Temperature Aware Task Scheduling in MPSoCs,” DATE07

Page 38: Dynamic Thermal Management in Modern Processors

Hybrid Control-theoretic DTM for MPSoCs

[Figure: control loop. Temperature setpoints and thermal sensor readings feed a PI controller for each core thermal system; a migration controller uses performance requirements and thread-core thermal trends to make migration decisions for the chip thermal system.]

• Combines

– Thread migration

• Balancing the heat and optimizing the performance

– Local per-core DVFS

• Fine-grained local adjustments

• To keep temperature below 85°C

– 2.5X instruction throughput compared to distributed clock gating

Donald et al., “Techniques for Multicore Thermal Management: Classification and New Exploration,” ISCA 2006.
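The per-core DVFS loop in this scheme is driven by a PI controller. The following is a minimal discrete-time sketch; the class name, gains, frequency limits, and the simple anti-windup rule are all assumptions for illustration and are not taken from the cited paper.

```python
class PIController:
    """Toy PI controller mapping core temperature to a frequency scale in [f_min, 1.0]."""

    def __init__(self, setpoint_c, kp, ki, f_min=0.5):
        self.setpoint_c = setpoint_c
        self.kp = kp            # proportional gain (hypothetical value chosen by caller)
        self.ki = ki            # integral gain (hypothetical value chosen by caller)
        self.f_min = f_min
        self.integral = 0.0

    def step(self, temp_c, dt=1.0):
        error = self.setpoint_c - temp_c  # negative when the core is too hot
        # Simple anti-windup (an assumption): never accumulate "cool" credit.
        self.integral = min(0.0, self.integral + error * dt)
        raw = 1.0 + self.kp * error + self.ki * self.integral
        return max(self.f_min, min(1.0, raw))


if __name__ == "__main__":
    ctrl = PIController(setpoint_c=85.0, kp=0.05, ki=0.01)
    for t in (80.0, 90.0, 90.0, 86.0):
        print(t, round(ctrl.step(t), 3))
```

A core below the setpoint runs at full frequency; the longer it stays above the setpoint, the further the integral term pulls the frequency down.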

Page 39: Dynamic Thermal Management in Modern Processors

Hybrid Control-theoretic DTM for MPSoCs

• Advantages

– Low performance impact

– Distributed local control of temperature

– Formal control theory allows accurate design and provable guarantees for temperature control

• Disadvantages

– Design complexity

– High hardware overhead

Donald et al., “Techniques for Multicore Thermal Management: Classification and New Exploration,” ISCA 2006.

Page 40: Dynamic Thermal Management in Modern Processors

Summary

Class                          Performance Impact   Cooling Capability   Design Complexity   Hardware Cost
Software                       Low                  Low                  Low                 None
Hardware, Global Throttling    High                 High                 Low                 Low
Hardware, Local Throttling     Average              Average-High         Average-High        Low-High
Hybrid                         Low-Average          Average-High         High                Average-High

Page 41: Dynamic Thermal Management in Modern Processors

Proactive Dynamic Thermal Management

Page 42: Dynamic Thermal Management in Modern Processors

Reactive vs. Proactive

• Reactive

– Not activated until the thermal emergency

– Typically have high performance impact

• Proactive

– Prevent thermal emergencies

– Part of the normal system operation

Page 43: Dynamic Thermal Management in Modern Processors

Reactive vs. Proactive Management

• Reactive

[Figure: temperature (°C, 70 to 90) vs. time]

Ayse K. Coskun

Page 44: Dynamic Thermal Management in Modern Processors

Reactive vs. Proactive Management

• Reactive

• e.g., DVFS, fetch-gating, workload migration

[Figure: temperature (°C, 70 to 90) vs. time]

• Proactive

[Figure: temperature (°C, 70 to 90) vs. time, with forecast]

Ayse K. Coskun

Page 45: Dynamic Thermal Management in Modern Processors

Reactive vs. Proactive Management

• Reactive

• e.g., DVFS, fetch-gating, workload migration

[Figure: temperature (°C, 70 to 90) vs. time]

• Proactive

• Reduce and balance temperature

– Adjust workload, V/f setting, etc.

[Figure: temperature (°C, 70 to 90) vs. time, with forecast and T after proactive management]

Ayse K. Coskun

Page 46: Dynamic Thermal Management in Modern Processors

Flow

• Temperature data from thermal sensors

• Predictor (ARMA), with periodic ARMA model validation and model update

• Predicted temperature at time (tcurrent + tn) for all cores

• Scheduler: temperature-aware allocation on cores

[Coskun, et al. ISLPED 08]
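The forecasting step in this flow can be illustrated with a much simpler stand-in for ARMA: a first-order autoregressive, AR(1), model fitted by least squares around the sample mean. This is explicitly a simplification and not the predictor used in the cited work; the function names are hypothetical, and the model validation and update stages are not shown.

```python
def fit_ar1(series):
    """Least-squares AR(1) fit on the mean-removed series. Returns (mean, phi)."""
    mean = sum(series) / len(series)
    x = [s - mean for s in series]
    num = sum(cur * prev for cur, prev in zip(x[1:], x[:-1]))
    den = sum(prev * prev for prev in x[:-1])
    phi = num / den if den else 0.0  # flat history: predict the mean
    return mean, phi


def forecast(series, steps):
    """Forecast `steps` future samples from the last observed temperature."""
    mean, phi = fit_ar1(series)
    deviation = series[-1] - mean
    out = []
    for _ in range(steps):
        deviation *= phi
        out.append(mean + deviation)
    return out


if __name__ == "__main__":
    history = [70.0, 72.0, 73.5, 74.2, 75.0, 75.4, 75.9, 76.1]  # made-up readings
    print(forecast(history, 3))
```

A scheduler would compare such forecasts against a threshold and move or throttle work before the predicted emergency occurs.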

Page 47: Dynamic Thermal Management in Modern Processors

Temperature Modeling and Sensing

Page 48: Dynamic Thermal Management in Modern Processors

Thermal model

K. Skadron, et al. Micro 03

• Based on the duality between heat flow and electrical phenomena

• Heat flow can be considered as a current passing through a thermal resistance, leading to a temperature difference analogous to voltage

• Thermal Rs and Cs are calculated based on the system characteristics

Duality between Thermal and Electrical Quantities

Thermal quantity                    Electrical quantity
Power                               Current
Temperature difference              Voltage difference
Thermal resistance (Rth)            Electrical resistance (R)
Thermal capacitance (Cth)           Electrical capacitance (C)
Thermal RC constant (Rth × Cth)     Electrical RC constant (R × C)
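The thermal/electrical duality can be made concrete with a single-node RC model, where power plays the role of current and temperature difference the role of voltage: Cth · dT/dt = P − (T − T_amb)/Rth. The sketch below integrates this with forward Euler; the single-node simplification, the function name, and all numeric values are assumptions for illustration (a real model of this kind uses a network of many Rs and Cs).

```python
def simulate_rc(power_w, r_th, c_th, t_amb_c, dt_s, t0_c=None):
    """Forward-Euler integration of C_th * dT/dt = P - (T - T_amb) / R_th."""
    temp = t_amb_c if t0_c is None else t0_c
    trace = []
    for p in power_w:
        heat_out = (temp - t_amb_c) / r_th      # "current" leaving through R_th
        temp += dt_s * (p - heat_out) / c_th    # charge or discharge C_th
        trace.append(temp)
    return trace


if __name__ == "__main__":
    # Hypothetical block: 10 W, R_th = 2 K/W, C_th = 0.5 J/K, so R_th * C_th = 1 s.
    trace = simulate_rc([10.0] * 5000, r_th=2.0, c_th=0.5, t_amb_c=45.0, dt_s=0.01)
    print(f"after one time constant: {trace[99]:.2f} C, steady state: {trace[-1]:.2f} C")
```

The steady-state rise equals P × Rth (the analogue of V = I × R), and the thermal RC constant sets how quickly the block heats up.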

Page 49: Dynamic Thermal Management in Modern Processors

Thermal Model

K. Skadron, et. al. ACM TACO 04, Vol. 1, No. 1

Page 50: Dynamic Thermal Management in Modern Processors

Challenges in using thermal sensors

• Limitations in sensor placement

– Sensors may not be placed at locations of interest

– Routing and I/O considerations

• Sensor overhead

– Sensing circuitry

– A/D converter

• Few sensors available

– Silicon real estate

– Low mean time to failure

• Sensor noise

– Inaccuracies due to power variations, sensor degradation, etc.

• Dynamic change of hot spot locations

– Static placement cannot cover all locations

Page 51: Dynamic Thermal Management in Modern Processors

Maximum temperature differences

• Temperature differences were traditionally found by extensive simulations

• Long simulation times

• Workload dependent

• No guarantee on the maximum temperature differences found

• Under what situations does this maximum difference happen?

[Figure: contour map of temperature difference to a location of interest on die; width 0.5 mm, height 0.75 mm]

[Sharifi, et al., TCAD’10]

Page 52: Dynamic Thermal Management in Modern Processors

Sensor placement

• Sensor overhead

• Sensor placement limitations

• Sensor placement error

– Temperature difference between the hotspot and its corresponding sensor

• Sensor placement technique

– Covering all locations of interest guaranteeing a maximum tolerable placement error

– Minimum number of sensors

[Figure: die map of sensors (o) and points of interest (X)]
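Covering every point of interest with as few sensors as possible is a set-cover style problem. The sketch below uses a generic greedy heuristic, which is swapped in purely for illustration and is not the placement technique of the cited work. It assumes a hypothetical precomputed map from each candidate sensor site to the points of interest it can observe within the tolerable placement error.

```python
def greedy_sensor_placement(points, coverage):
    """Greedy set cover.

    points: iterable of points of interest.
    coverage: dict mapping candidate site -> set of points it observes
              within the maximum tolerable placement error (assumed precomputed).
    Returns the chosen sites in selection order.
    """
    uncovered = set(points)
    chosen = []
    while uncovered:
        # Pick the site that observes the most still-uncovered points.
        best = max(coverage, key=lambda site: len(coverage[site] & uncovered))
        gained = coverage[best] & uncovered
        if not gained:
            raise ValueError(f"points cannot be covered: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= gained
    return chosen


if __name__ == "__main__":
    sites = {"s1": {"a", "b"}, "s2": {"b", "c"}, "s3": {"c"}, "s4": {"a", "b", "c"}}
    print(greedy_sensor_placement("abc", sites))
```

Greedy selection does not guarantee the minimum number of sensors, but it is a common baseline for this kind of covering problem.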

Page 53: Dynamic Thermal Management in Modern Processors

Sensor Placement Comparison

                   SOC1             SOC2
Accuracy (°C)      1 2 3 4 5 6 7    1 2 3 4 5 6 7
Our technique      7 7 6 5 5 5 4    8 8 8 7 6 6 5
Circular range*    7 7 7 7 6 6 6    8 8 8 8 8 8 8

• * Circular range [Lee et al., ICCD 05]: based on exponential dependence of temperature difference on the distance from a location of interest

[Figure: points of interest (X) and a potential sensor location (*), comparing the circular ranges of hotspots 1 and 2 in other techniques with the observability areas of hotspots 1 and 2]

[Sharifi, et al., TCAD’10]

Page 54: Dynamic Thermal Management in Modern Processors

Indirect temperature sensing

• Runtime temperature information is needed

– Few sensors available

– Sensor noise

– Dynamic change of hot spot locations

• Accurate temperature estimates are required at points other than sensor locations

• Applications

– Small number of deployed sensors due to overhead/placement limitations

– Operational systems with degraded/failed sensors

– Changes in the locations of interest

Page 55: Dynamic Thermal Management in Modern Processors

Accurate Temperature Estimation

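[Figure: Kalman filter estimation loop. Power consumption values drive the thermal network (subject to system error sources); thermal sensors provide the observed temperature (subject to measurement error sources); the Kalman filter alternates predict (time update) and correct (measurement update) steps to produce accurate temperature estimates.]

Shervin Sharifi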
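The predict/correct loop can be sketched with a scalar Kalman filter for a single temperature state. This is a textbook one-dimensional filter shown for illustration, not the multi-node estimator of the cited work; the model coefficients `a` and `b` and the noise variances `q` and `r` are hypothetical parameters.

```python
def kalman_step(x, p, u, z, a, b, q, r):
    """One predict + correct cycle of a scalar Kalman filter.

    x, p: previous temperature estimate and its variance
    u:    power input for this step
    z:    noisy sensor reading
    a, b: thermal model coefficients (x_next = a * x + b * u), assumed known
    q, r: process (system) and measurement noise variances
    """
    # Predict (time update): propagate the thermal model.
    x_pred = a * x + b * u
    p_pred = a * p * a + q
    # Correct (measurement update): blend prediction with the observation.
    gain = p_pred / (p_pred + r)
    x_new = x_pred + gain * (z - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new


if __name__ == "__main__":
    x, p = 50.0, 1.0
    for reading in (60.0, 58.0, 61.0):  # made-up sensor readings
        x, p = kalman_step(x, p, u=0.0, z=reading, a=1.0, b=0.0, q=0.01, r=1.0)
        print(round(x, 2), round(p, 3))
```

A noisy sensor (large `r`) makes the filter trust the model prediction more, while an uncertain model (large `q`) makes it follow the readings more closely.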

Page 56: Dynamic Thermal Management in Modern Processors

Accurate Temperature Estimation (Cont’d)

[Sharifi, et al., TCAD’10]

Page 57: Dynamic Thermal Management in Modern Processors

Questions?