CS 61C: Great Ideas in Computer Architecture (Machine Structures)
Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.berkeley.edu/~cs61c/fa10
Fall 2010 -- Lecture #9, 9/17/10
Agenda
• Instruction Stages Revisited
• Administrivia
• Technology Break
• Rise of the Warehouse-Scale Computer
Instruction-Level Parallelism
          P1  P2  P3  P4  P5  P6  P7  P8  P9  P10 P11 P12
Instr 1   IF  ID  ALU MEM WR
Instr 2       IF  ID  ALU MEM WR
Instr 3           IF  ID  ALU MEM WR
Instr 4               IF  ID  ALU MEM WR
Instr 5                   IF  ID  ALU MEM WR
Instr 6                       IF  ID  ALU MEM WR
Instr 7                           IF  ID  ALU MEM WR
Instr 8                               IF  ID  ALU MEM WR
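The pipeline diagram shows the payoff in cycle counts. A minimal sketch (in Python, not part of the original slides) of the ideal-pipeline arithmetic: without pipelining each instruction occupies the datapath for all five stages, while with pipelining a new instruction enters every period.

```python
STAGES = 5  # IF, ID, ALU, MEM, WR

def sequential_cycles(n_instr: int) -> int:
    # Without pipelining, each instruction runs all five stages
    # before the next one starts.
    return n_instr * STAGES

def pipelined_cycles(n_instr: int) -> int:
    # With an ideal pipeline, the first instruction takes STAGES
    # cycles and each later one adds just 1.
    return STAGES + (n_instr - 1)

# The 8 instructions in the diagram finish in period P12:
print(sequential_cycles(8))  # 40
print(pipelined_cycles(8))   # 12
```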
Conceptual MIPS Datapath
Stages of the Datapath (1/5)
• There is a wide variety of MIPS instructions: so what general steps do they have in common?
• Stage 1: Instruction Fetch
– No matter what the instruction, the 32-bit instruction word must first be fetched from memory (the cache-memory hierarchy)
– Also, this is where we increment the PC (that is, PC = PC + 4, to point to the next instruction: byte addressing, so + 4)
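As a sketch of Stage 1 (assumptions: instruction memory modeled as a Python list of 32-bit words, and placeholder instruction values), fetching reads the word at PC and advances PC by 4:

```python
def fetch(memory: list, pc: int):
    """Stage 1: fetch the 32-bit word at PC, then advance PC by 4.

    Memory is modeled as a list of 32-bit words; since MIPS is
    byte-addressed, word i lives at byte address 4*i.
    """
    instruction = memory[pc // 4]
    return instruction, pc + 4

imem = [0x20080005, 0x20090003]  # two placeholder instruction words
instr, pc = fetch(imem, 0)
assert pc == 4  # PC = PC + 4
```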
Stages of the Datapath (2/5)
• Stage 2: Instruction Decode
– Upon fetching the instruction, we next gather data from the fields (decode all necessary instruction data)
– First, read the opcode to determine instruction type and field lengths
– Second, read in data from all necessary registers
  • For add, read two registers
  • For addi, read one register
  • For jal, no reads necessary
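Field extraction in Stage 2 can be sketched with bit shifts and masks (Python, following the standard MIPS encoding: opcode in the top 6 bits, then rs and rt; R-type instructions, opcode 0, add rd/shamt/funct, while I-type keeps a 16-bit immediate in the low half):

```python
def decode(word: int) -> dict:
    """Stage 2: pull the fields out of a 32-bit MIPS instruction word."""
    fields = {
        "opcode": (word >> 26) & 0x3F,  # bits 31..26
        "rs":     (word >> 21) & 0x1F,  # bits 25..21
        "rt":     (word >> 16) & 0x1F,  # bits 20..16
    }
    if fields["opcode"] == 0:           # R-type, e.g. add
        fields["rd"]    = (word >> 11) & 0x1F
        fields["shamt"] = (word >> 6) & 0x1F
        fields["funct"] = word & 0x3F
    else:                               # I-type, e.g. addi, lw
        fields["imm"] = word & 0xFFFF
    return fields

# addi $t0, $zero, 5 encodes as 0x20080005 (opcode 8, rs 0, rt 8, imm 5)
print(decode(0x20080005))
```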
Stages of the Datapath (3/5)
• Stage 3: ALU (Arithmetic-Logic Unit)
– Real work of most instructions is done here: arithmetic (+, -, *, /), shifting, logic (&, |), comparisons (slt)
– What about loads and stores?
  • lw $t0, 40($t1)
  • Address we are accessing in memory = the value in $t1 PLUS the value 40
  • So we do this addition in this stage
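The address calculation for lw/sw can be sketched as follows (a simplified model: the register file is a Python list, and the 16-bit offset is sign-extended before the ALU adds it to the base register):

```python
def sign_extend(imm16: int) -> int:
    # Interpret the 16-bit immediate as a signed value.
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def lw_address(reg: list, rs: int, imm16: int) -> int:
    # Stage 3 for lw/sw: the ALU adds the base register to the
    # sign-extended offset to form the memory address.
    return reg[rs] + sign_extend(imm16)

reg = [0] * 32
reg[9] = 1000                   # pretend $t1 (register 9) holds 1000
print(lw_address(reg, 9, 40))   # lw $t0, 40($t1) -> address 1040
```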
Stages of the Datapath (4/5)
• Stage 4: Memory Access
– Actually only the load and store instructions do anything during this phase; the others remain idle during this phase or skip it altogether
– Since these instructions have a unique step, we need this extra phase to account for them
– As a result of the cache system, this phase is expected to be fast
Stages of the Datapath (5/5)
• Stage 5: Register Write
– Most instructions write the result of some computation into a register
– E.g.: arithmetic, logical, shifts, loads, slt
– What about stores, branches, jumps?
  • Don’t write anything into a register at the end
  • These remain idle during this fifth phase or skip it altogether
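The five slides above can be summarized by mapping each instruction class to the stages where it does real work (a sketch; the instruction classes chosen here are illustrative examples, not an exhaustive list):

```python
# Which stages do real work for each instruction class (others idle).
STAGE_USE = {
    "add": ["IF", "ID", "ALU", "WR"],         # R-type arithmetic: no MEM
    "lw":  ["IF", "ID", "ALU", "MEM", "WR"],  # only loads/stores use MEM
    "sw":  ["IF", "ID", "ALU", "MEM"],        # stores write no register
    "beq": ["IF", "ID", "ALU"],               # branches write no register
}

for op, stages in STAGE_USE.items():
    print(f"{op}: {' -> '.join(stages)}")
```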
Limits to Performance: Latency vs. Bandwidth
• Latency: the time to access the first item
• Bandwidth: # of items accessed per unit time
• Historically, bandwidth has improved much faster than latency
• Why?
Latency vs. Bandwidth: Physical Analogy
• Latency: time to first drop
• Bandwidth: time to fill glass; water per unit time
[Figure: a water tank feeding glasses through pipes; the length and diameter of the pipes affect latency and bandwidth]
Latency vs. Bandwidth: Which is “Faster”?
• SD → SF, 1 truck, 10 hours, 1000 × 1 TByte disks (1 PByte)
  – Time to first byte: 10 hours
  – Time to last byte: 10 hours
  – Bandwidth: 100 TBytes/hr (≈ 222 Gbps)
• SD → SF, 100 Gbps fiber link (≈ 10 GB per sec)
  – Time to first byte: 2.6 ms (speed of light @ 500 mi!)
  – Time to last byte: ≈ 28 hours
  – Bandwidth: 10 GB/s
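The arithmetic behind these numbers can be checked directly (decimal units, 1 PB = 10^15 bytes, as the slide uses):

```python
# Reproducing the slide's arithmetic for the truck-vs-fiber comparison.
PBYTE = 1e15  # 1 petabyte, decimal

truck_hours = 10
truck_bw_bytes_per_hr = PBYTE / truck_hours                 # 100 TB/hr
truck_bw_gbps = truck_bw_bytes_per_hr * 8 / 3600 / 1e9      # bits/s -> Gbps

fiber_bytes_per_sec = 10e9                                  # ~10 GB/s
fiber_hours = PBYTE / fiber_bytes_per_sec / 3600

print(f"Truck bandwidth: {truck_bw_gbps:.0f} Gbps")         # ~222 Gbps
print(f"Fiber transfer time: {fiber_hours:.0f} hours")      # ~28 hours
```

The truck wins on bandwidth; the fiber link wins (by seven orders of magnitude) on latency to the first byte.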
Administrivia
• Due dates for Project 2/First Part (Saturday, 18 September) and Project 2/Second Part (Saturday, 25 September) are @23:59:59
• Midterm Examination, 6 October, 6-9 PM, 1 Pimentel
Growth in Access Devices
The ARM Inside the iPhone
iPhone Innards
1 GHz ARM Cortex A8
E.g., Google’s Oregon Datacenter
Energy Proportional Computing
Figure 1. Average CPU utilization of more than 5,000 servers during a six-month period. Servers are rarely completely idle and seldom operate near their maximum utilization, instead operating most of the time at between 10 and 50 percent of their maximum.
It is surprisingly hard to achieve high levels of utilization of typical servers (and your home PC or laptop is even worse).
“The Case for Energy-Proportional Computing,” Luiz André Barroso, Urs Hölzle, IEEE Computer, December 2007
Energy Proportional Computing
Figure 2. Server power usage and energy efficiency at varying utilization levels, from idle to peak performance. Even an energy-efficient server still consumes about half its full power when doing virtually no work.
Doing nothing well … NOT!
Energy Efficiency = Utilization / Power
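The efficiency formula makes the problem concrete. A sketch under a simplifying assumption (power grows linearly from an idle floor to peak; the 50% idle floor matches the slide, the linear model is an assumption for illustration):

```python
def efficiency(utilization: float, idle_frac: float = 0.5) -> float:
    """Energy efficiency = utilization / power, for a server whose
    power grows linearly from idle_frac of peak (at 0% load) to
    peak (at 100% load). Power is a fraction of peak."""
    power = idle_frac + (1 - idle_frac) * utilization
    return utilization / power

# A server burning half its peak power while idle is badly
# inefficient in the 10-50% utilization band where servers live:
for u in (0.1, 0.3, 0.5, 1.0):
    print(f"util {u:.0%}: efficiency {efficiency(u):.2f}")
```

Lowering the idle floor (idle_frac) is exactly what "energy-proportional" design aims for: efficiency(0.1, idle_frac=0.1) is already above 0.5.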
Energy Proportional Computing
Figure 3. CPU contribution to total server power for two generations of Google servers at peak performance (the first two bars) and for the later generation at idle (the rightmost bar).
CPU energy improves, but what about the rest of the server architecture?
Energy Proportional Computing
Figure 4. Power usage and energy efficiency in a more energy-proportional server. This server has a power efficiency of more than 80 percent of its peak value for utilizations of 30 percent and above, with efficiency remaining above 50 percent for utilization levels as low as 10 percent.
Design for wide dynamic power range and active low-power modes
Doing nothing VERY well
Energy Efficiency = Utilization / Power
Energy Use in Datacenters
[Figures: datacenter energy overheads; sources: LBNL and Michael Patterson, Intel]
Datacenter Power
[Figure: peak power % breakdown across the datacenter]
Nameplate vs. Actual Peak

Component     Peak Power   Count   Total
CPU           40 W         2       80 W
Memory        9 W          4       36 W
Disk          12 W         1       12 W
PCI Slots     25 W         2       50 W
Motherboard   25 W         1       25 W
Fan           10 W         1       10 W
System Total                       213 W

Nameplate peak: 213 W. Measured peak (power-intensive workload): 145 W.
X. Fan, W.-D. Weber, L. Barroso, “Power Provisioning for a Warehouse-Sized Computer,” ISCA ’07, San Diego (June 2007).
In Google’s world, for a given datacenter power budget, deploy as many machines as possible.
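The point of the table can be sketched numerically: provisioning by measured peak rather than the nameplate sum lets far more machines fit under the same power budget (component numbers from the slide; the 1 MW budget is a hypothetical value for illustration):

```python
# Sum the slide's component table, then size a deployment against a
# datacenter power budget using measured rather than nameplate peak.
components = {              # (watts each, count)
    "CPU": (40, 2), "Memory": (9, 4), "Disk": (12, 1),
    "PCI Slots": (25, 2), "Motherboard": (25, 1), "Fan": (10, 1),
}
nameplate = sum(w * n for w, n in components.values())
measured_peak = 145          # W, power-intensive workload

budget_w = 1_000_000         # hypothetical 1 MW budget
print(nameplate)                  # 213 W
print(budget_w // nameplate)      # machines if provisioned by nameplate
print(budget_w // measured_peak)  # ~47% more machines by measured peak
```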
Server Innards
Server Internals
Google Server
Summary
• Five Stages/Phases of an Instruction
  – Instruction Fetch (IF)
  – Instruction Decode (ID)
  – Execute (ALU)
  – Memory (MEM)
  – Write Results (WR)
• Bandwidth vs. Latency
  – Easier to increase bandwidth than to reduce latency
• Rise of the Warehouse-Scale Computer
  – Energy Proportional Computing
    • Power (Watts), Energy (Power × Time, Watt-hours)
    • Subject to responsiveness goals, drive nodes to higher utilization to achieve better energy efficiency