Challenges for High Performance Processors
Hiroshi NAKAMURA
Research Center for Advanced Science and Technology, The University of Tokyo
2007.11.2, France-Japan PAAP Workshop
What's the challenge?
Our primary goal: performance. How?
- increase the number and/or operating frequency of functional units
- AND supply the functional units with sufficient data (bandwidth)
Problems:
- Memory Wall: system performance is limited by poor memory performance
- Power Wall: power consumption is approaching the cooling limitation
Memory Wall Problem
Performance improvement:
- CPU: 55% / year
- DRAM: 7% / year
[Chart: relative performance of CPU vs. memory by year, on a log scale from 1 to 1,000,000]
Example of Memory Wall: performance of a 2 GHz Pentium 4 for a[i] = b[i] + c[i]
[Chart: performance (MFLOPS, 0 to 500) vs. vector length (10 to 1,000,000), with regions where data hits in L1, hits in L2, or misses the cache; performance falls to roughly 1/6 in the cache-miss region]
- lack of effective memory throughput
- even with a non-blocking cache and out-of-order issue
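The slide's measurement can be sketched as a tiny micro-benchmark. A hedged Python version (the 2 GHz Pentium 4 numbers cannot be reproduced here, and interpreter overhead hides most of the cache effect, so this only illustrates the method; sizes and repeat counts are arbitrary):

```python
# Sketch of the a[i] = b[i] + c[i] measurement: sustained MFLOPS vs. vector
# length. In a compiled language this curve steps down at each cache level.
import time
import array

def mflops(n, repeats=3):
    b = array.array('d', [1.0] * n)
    c = array.array('d', [2.0] * n)
    a = array.array('d', [0.0] * n)
    start = time.perf_counter()
    for _ in range(repeats):
        for i in range(n):
            a[i] = b[i] + c[i]      # one floating-point add per element
    elapsed = time.perf_counter() - start
    return (n * repeats) / elapsed / 1e6

for n in (1_000, 100_000, 1_000_000):
    print(f"n={n:>9,}: {mflops(n):8.1f} MFLOPS")
```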
Recap: Memory Wall Problem
- growing gap between processor and memory speed
- performance is limited by memory capability in High Performance Computing (HPC):
  - long access latency of main memory
  - insufficient main-memory throughput
- making full use of wide-bandwidth local (on-chip) memory is indispensable
  - on-chip memory space is a valuable resource, not large enough for HPC
  - should exploit data locality
- e.g. Itanium 2 / Montecito: huge L3 cache (12 MB x 2)
Does cache work well in HPC?
Cache works well in many cases, but is not the best for HPC:
- data placement and replacement are decided by hardware
  - unfortunate line conflicts occur even though most data accesses are regular
  - e.g. data used only once flushes out other useful data
- the cache-to-off-chip transfer size is fixed
  - for consecutive data, a larger transfer size is preferable
  - for non-consecutive data, a large line transfer incurs unnecessary data transfer (waste of bandwidth)
Most HPC applications exhibit regularity in data access, which is sometimes not well exploited.
SCIMA (Software Controlled Integrated Memory Architecture) [Kondo-ICCD2000]
(joint work with Prof. Boku @ Univ. of Tsukuba and others)
Overview of SCIMA:
[Diagram: registers, ALU, and FPU backed by both an SCM and a cache within the address space; an NIA connects the chip to the network, with off-chip DRAM as main memory]
- an addressable SCM (Software Controllable Memory) in addition to the ordinary cache
- the SCM occupies a part of the logical address space; there is no inclusion relation between SCM and cache
- SCM and cache are reconfigurable at the granularity of a way
Data Transfer Instructions
- load/store: register <-> SCM/cache
- line transfer: cache <-> off-chip memory
- page-load/page-store (new): SCM <-> off-chip memory
  - large-granularity transfer: wider effective bandwidth by reducing latency stalls
  - block-stride transfer: avoids unnecessary data transfer, for more effective utilization of on-chip memory
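The bandwidth argument for block-stride transfer can be illustrated with a small traffic count. A sketch under assumed parameters (128 B line, 8 B words; not SCIMA's actual transfer engine):

```python
# Off-chip traffic for reading every stride-th 8-byte word: a fixed cache
# line pulls in whole lines, while a block-stride page-load moves only the
# requested words.
LINE = 128   # assumed cache line size in bytes
WORD = 8     # double-precision word

def cache_traffic(n_words, stride):
    lines = {(i * stride * WORD) // LINE for i in range(n_words)}
    return len(lines) * LINE          # bytes moved over the bus

def block_stride_traffic(n_words, stride):
    return n_words * WORD             # only the needed words are moved

print(cache_traffic(1024, 4))         # -> 32768 bytes
print(block_stride_traffic(1024, 4))  # -> 8192 bytes, 4x less
```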
Strategy of Software Control
The SCM must be controlled by software. Arrays are classified into 6 groups by consecutiveness and reusability:

                  not-reusable                      reusable
  consecutive     (1) use SCM as a stream buffer    (4) reserve SCM for reused data
  stride          (2) use SCM as a stream buffer    (5) reserve SCM for reused data
  irregular       (3) do not use SCM                (6) reserve SCM for reused data

Allocation order:
- first, apply (1) and (2): allocate small stream buffers in the SCM
- second, apply (4), (5), and (6): allocate the remaining SCM area for reused data
A prototype semi-automatic compiler exists: users specify hints on the reusability of data arrays.
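The six-group classification can be written down directly. A sketch of the decision rule (the group semantics follow the table above; this is not the actual SCIMA compiler):

```python
def scm_policy(reusable: bool, pattern: str) -> str:
    # pattern is one of "consecutive", "stride", "irregular"
    if reusable:
        return "reserve SCM for reused data"   # groups (4)(5)(6)
    if pattern in ("consecutive", "stride"):
        return "use SCM as a stream buffer"    # groups (1)(2)
    return "do not use SCM"                    # group (3)

print(scm_policy(False, "stride"))     # -> use SCM as a stream buffer
print(scm_policy(True, "irregular"))   # -> reserve SCM for reused data
```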
Results of Memory Traffic
- 1% - 61% of memory traffic is eliminated by SCIMA
  - due to full exploitation of data reusability
  - unnecessary memory traffic is suppressed
[Chart: normalized memory traffic of Cache vs. SCIMA for CG, FT, and QCD, per line size, broken down into cache-miss and page-load/store traffic]
Assumptions:
- cache model: cache size = 64 KB (4-way), SCM size = 0 KB
- SCIMA model: cache size = 16 KB (1-way), SCM size = 48 KB
- total number of ways: 4; line size: 32 B or 128 B
- benchmark programs: CG, FT, QCD
Results of Performance
Execution time is broken down into:
- CPU busy time
- latency stall: elapsed time due to memory latency
- throughput stall: elapsed time due to lack of throughput
SCIMA is 1.3-2.5 times faster than cache:
- latency stalls reduced by large-granularity data transfer
- throughput stalls reduced by suppressing unnecessary data transfer
Assumptions: load/store latency: 2 cycles; bus throughput: 4 B/cycle; memory latency: 40 cycles
[Chart: normalized execution time of Cache vs. SCIMA for CG, FT, and QCD at line sizes 32 B and 128 B, broken down into CPU busy time, latency stall, and throughput stall]
Power Wall
Next focus: power consumption of processors.
Is there any room for power reduction? If yes, how do we reduce it?
[Chart: trends of heat density of processors; e.g. Itanium dissipates 130 W]
Observation (1) - Moore's Law -
Number of transistors: doubles every 18 months
Observation (2) - frequency -
- Frequency: doubles every 3 years
- Number of transistors: doubles every 18 months (i.e. x4 every 3 years)
- => number of switching events on a chip: x8 every 3 years
Observation (3) - performance -
- Number of switching events on a chip: x8 every 3 years
- Effective performance: x4 every 3 years
  ("microprocessor performance improved 55% per year", from "Computer Architecture: A Quantitative Approach" by J. Hennessy and D. Patterson, Morgan Kaufmann; 1.55^3 ≈ 3.7)
- => unnecessary switching = chance of power reduction: doubles every 3 years (8 / 4 = 2)
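The arithmetic behind the three observations can be checked in a few lines (the 18-month, 3-year, and 55%/year figures come from the slides):

```python
transistors = 2 ** (36 / 18)         # doubles every 18 months -> x4 in 3 years
frequency = 2 ** (3 / 3)             # doubles every 3 years   -> x2
switching = transistors * frequency  # -> x8 in 3 years
performance = 1.55 ** 3              # 55%/year -> ~x3.7, roughly x4
waste = switching / performance      # unnecessary switching -> ~x2
print(f"switching x{switching:.0f}, performance x{performance:.1f}, waste x{waste:.1f}")
```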
Evidence for the Observation
- unnecessary switching = x2 / 3 years - [Zyuban00] @ ISLPED'00
[Chart: access energy per instruction (nJ) vs. issue width (4 to 12), for committed and flushed instructions, broken down into rename map table, bypass mechanism, load/store window, issue window, register file, and functional units]
- energy per instruction increases as ILP is exploited for higher performance
  - functional units: no increase
  - issue window, register file: increase
  - flushed instructions due to incorrect prediction: increase
=> waste of power
Registers
Register files consume a lot of power:
- roughly speaking, power ∝ (num. of registers) x (num. of ports)
- high-performance wide-issue superscalar processors need more registers and more read/write ports
Open question: in HPC, what is the best way to feed many functional units (or accelerators), from the perspective of register file design?
- scalar registers with SIMD operations
- vector registers with vector operations
- ...
Personal impression:
- vector registers are accessed in a well-organized fashion, so it is easy to reduce the number of ports by sub-banking
- but can vector operations make good use of local on-chip memory? (at least, traditional vector processors never could!)
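The power ∝ registers x ports rule of thumb can be made concrete with illustrative numbers (the port counts below are assumptions for the sake of the example, not measurements of any real design):

```python
def regfile_power(num_regs, num_ports):
    # rough rule of thumb from the slide, in arbitrary units
    return num_regs * num_ports

# an 8-issue superscalar needs ~2 read + 1 write ports per issued instruction
scalar = regfile_power(128, 8 * 3)
# a sub-banked vector register file can get by with a few ports per bank
vector = regfile_power(128, 4)
print(f"port-driven power ratio: {scalar / vector:.0f}x")
```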
Dual Core helps ...
Rule of thumb:
  Voltage: -1%  ->  Frequency: -1%, Power: -3%, Performance: -0.66%
In the same process technology:
- single core:                      Voltage = 1, Freq = 1, Area = 1, Power = 1, Perf = 1
- dual core (two core/cache pairs): Voltage = -15%, Freq = -15%, Area = 2, Power = 1, Perf = ~1.8
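The dual-core numbers follow from the rule of thumb by linear extrapolation. A quick check (the -15% and x2 figures come from the slide; the linear scaling is the slide's own approximation):

```python
dv = 15                          # percent of voltage reduction
power_per_core = 1 - 0.03 * dv   # -3% power per -1% voltage  -> 0.55
perf_per_core = 1 - 0.0066 * dv  # -0.66% perf per -1% voltage -> ~0.90
cores = 2
print(f"power ~ {cores * power_per_core:.2f}")  # roughly 1
print(f"perf  ~ {cores * perf_per_core:.2f}")   # roughly 1.8
```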
Multi-Core helps more ...
[Diagram: one large core with cache vs. four small cores (C1-C4) sharing a cache]
- each small core: power = 1/4, performance = 1/2 of the large core
- multi-core is power efficient: four small cores deliver ~2x the performance at the same total power
- better power and thermal management
- no need for wider instruction issue
Leakage problem
How to attack the leakage problem?
[Diagram (IEEE Computer Magazine): a transistor whose input is 0 and which is switched off still conducts a leakage current from VDD]
[Chart from [Borkar-MICRO05]: power and power density (W, W/cm2, 0 to 1400) of a 10 mm die vs. technology node (90 nm, 65 nm, 45 nm, 32 nm, 22 nm, 16 nm), broken down into active power, sub-threshold leakage (SD Lkg), and gate-oxide leakage (SiO2 Lkg)]
Introduction of our research: "Innovative Power Control for Ultra Low-Power and High-Performance System LSIs"
- 5-year project started October 2006
- supported by JST (Japan Science and Technology Agency) as a CREST (Core Research for Evolutional Science and Technology) program
Objective: drastic power reduction of high-performance system LSIs by innovative power control through tight cooperation of various design levels, including circuit, architecture, and system software.
Members:
- Prof. H. Nakamura (U. Tokyo): architecture & compiler [leader]
- Prof. M. Namiki (Tokyo Univ. of Agri. & Tech.): OS
- Prof. H. Amano (Keio Univ.): architecture & F/E design
- Prof. K. Usami (Shibaura I.T.): circuit & B/E design
How to reduce leakage: Power Gating
- we focus on power gating for reducing leakage
- insert a power switch (PS) between VDD and GND
- turn the PS off during sleep
[Diagram: logic gates between VDD and a virtual GND; a sleep signal controls the power switch that connects the virtual GND to the real GND]
Run-time Power Gating (RTPG)
[Diagram: a sleep-control circuit drives individual power switches for circuits A, B, and C]
- control the power switches at run time
- coarse grain: mobile processor by Renesas (independent power domains for the BB module, MPEG module, ...)
- fine grain (our target): power gating within a module
Fine-grain Run-time Power Gating
- longer sleep time is preferable: larger leakage savings against the overheads (power penalties for wakeup)
- evaluation through a real chip had not been reported before
- test vehicle: a 32b x 32b multiplier
  - one or both operands (input data) are likely to fit within 16 bits
  - the circuit portions that compute the upper bits of the product then need not operate: they waste leakage power
  - by detecting zeros in the upper 16 bits of the operands, the internal multiplier array is power-gated
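The gating condition can be sketched in software. The mapping of operand widths to the sleeping domains (named 'H' and 'M' after the domains in the measurement on the next slide) is an assumption for illustration, not the actual chip logic:

```python
def upper16_is_zero(x: int) -> bool:
    # true when a 32-bit operand fits in its low 16 bits
    return (x >> 16) == 0

def sleeping_domains(a: int, b: int) -> set:
    # 'H': sub-array for high x high partial products
    # 'M': sub-arrays mixing a high half with a low half
    if upper16_is_zero(a) and upper16_is_zero(b):
        return {"H", "M"}    # only the low 16x16 array must stay awake
    if upper16_is_zero(a) or upper16_is_zero(b):
        return {"H"}         # the high x high array is unused
    return set()             # full-width operands: nothing can sleep

print(sleeping_domains(0x1234, 0x00FF) == {"H", "M"})   # -> True
```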
Test chip "Pinnacle"
[Die photo: the multiplier without FG-RTPG vs. with FG-RTPG applied]
  Technology:       STARC 90nm CMOS
  Multiplier core:  area 0.544 x 0.378 mm2, 15,000 cells
  Design time:      4.5 months
  Design members:   3 Master students, 1 Bachelor student, 1 Faculty
[Chart: measured power dissipation (2.0 to 4.0 mW) at 25C, 85C, and 125C for Sequence 1 (no sleep), Sequence 2 (Domain H sleeps), and Sequence 3 (Domains H and M sleep)]
- real measurement exhibits good power reduction
Current status:
- designing a pipelined microprocessor with FG-RTPG
- compiler (instruction scheduler) to increase sleep time
Low Power Linux Scheduler based on statistical modeling
Co-optimization of system software and architecture.
Objective: a process scheduler that reduces power consumption by DVFS (dynamic voltage and frequency scaling) of each process while satisfying its performance constraint.
How do we find the lowest frequency that still satisfies the performance constraint?
- it depends on hardware and program characteristics
- the performance ratio differs from the frequency ratio
- hard to find the answer straightforwardly
=> modeling by statistical analysis of hardware events
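The frequency-selection problem can be sketched with a toy performance model. The real scheduler fits a statistical model to hardware event counters; the linear CPU/memory split and the DVFS steps below are only illustrative assumptions:

```python
def predicted_perf(freq_ratio, mem_fraction):
    # execution time = CPU part (scales with 1/f) + memory part (does not),
    # which is why the performance ratio differs from the frequency ratio
    time = (1 - mem_fraction) / freq_ratio + mem_fraction
    return 1.0 / time          # performance relative to full frequency

def lowest_frequency(freqs, mem_fraction, threshold):
    # pick the lowest available frequency whose predicted relative
    # performance still meets the threshold
    for f in sorted(freqs):
        if predicted_perf(f, mem_fraction) >= threshold:
            return f
    return max(freqs)

steps = [0.5, 0.6, 0.75, 0.875, 1.0]     # hypothetical DVFS steps
print(lowest_frequency(steps, mem_fraction=0.4, threshold=0.9))  # -> 0.875
```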
Evaluation result
- the specified threshold is the black dotted line
- performance stays within the threshold in all cases except mgrid (3-7% below the threshold)
- an accurate model is obtained; a Linux scheduler using this model has been developed
[Chart: relative performance (0.4 to 1.0) vs. performance threshold (0.5 to 1.0) for mcf, bzip2, swim, mgrid, and matrix (50/600/1000), measured on a Pentium M 760 (max 2.00 GHz, FSB 533 MHz)]
Summary
Challenges for high performance processors: the Memory Wall and the Power Wall.
One solution to the memory wall:
- make good use of on-chip memory with software controllability
Solutions to the power wall:
- many cores will relax the problem, but leakage current is becoming a big problem
- new research and approaches are required
- our project "Innovative Power Control for Ultra Low-Power and High-Performance System LSIs" was introduced