1 Copyright © 2012, Elsevier Inc. All rights reserved.
Chapter 1
Fundamentals of Quantitative Design and Analysis
Computer Architecture A Quantitative Approach, Fifth Edition
2 Copyright © 2013, Daniel A. Menasce. All rights reserved.
Introduction
1970: The IBM 7044 Computer
3
1970: Magnetic Core Memory
1 core = 1 bit
4
How Does a Computer Work?
What is the Von Neumann computer architecture? What are the elements of a Von Neumann computer? How do these elements interact to execute programs?
5
Von Neumann Architecture
What is the Von Neumann computer architecture? What are the elements of a Von Neumann computer? How do these elements interact to execute programs?
6
Von Neumann Architecture
[Diagram: stored-program computer — memory holding both Data and Program, connected via the system bus to a CPU containing the ALU and Registers]
7
Von Neumann Architecture
[Same stored-program computer diagram, annotated:]
Program: a sequence of instructions: (1) Load/Store between memory and registers, (2) arithmetic/logical operations, (3) conditional and unconditional branches.
Registers: general registers, FP registers, program counter, etc.
8
Example of Instruction Formats
Register-register: opcode RX RY RZ  →  RZ ← RX op RY
Load/Store: opcode RX Address  →  RX ← Mem(Address) or Mem(Address) ← RX
Branch: opcode Address  →  unconditional or conditional jump to Address
9
Computer Technology
Performance improvements come from:
Improvements in semiconductor technology: feature size, clock speed
Improvements in computer architectures: enabled by HLL compilers and UNIX; led to RISC architectures
Together these have enabled: lightweight computers; productivity-based managed/interpreted programming languages
10
Single Processor Performance
[Graph of single-processor performance growth; annotations: RISC; move to multiprocessors; from ILP to DLP and TLP]
11
Moore’s Law
Source: Wikimedia Commons
The number of transistors on integrated circuits doubles approximately every two years. (Gordon Moore, 1965)
12
Current Trends in Architecture
Cannot continue to leverage instruction-level parallelism (ILP): single-processor performance improvement ended in 2003.
New models for performance: data-level parallelism (DLP), thread-level parallelism (TLP), request-level parallelism (RLP).
These require explicit restructuring of the application.
13
Instruction Level Parallelism (ILP)
In high-level language: e = a + b; f = c + d; g = e * f

[Dataflow graph: a,b → + → e; c,d → + → f; e,f → × → g]

In assembly language:
* e = a + b
001 LOAD A,R1
002 LOAD B,R2
003 ADD R1,R2,R3
004 STO R3,E
* f = c + d
005 LOAD C,R4
006 LOAD D,R5
007 ADD R4,R5,R6
008 STO R6,F
* g = e * f
009 MULT R3,R6,R7
010 STO R7,G
14
Instruction Level Parallelism
In the assembly listing on the previous slide, instructions that can potentially be executed in parallel:
Step 1: 001, 002, 005, 006
Step 2: 003, 007
Step 3: 004, 008, 009
Step 4: 010
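The grouping above can be derived mechanically. A minimal sketch (not part of the slides; operand lists are transcribed from the listing, and anti/output dependences are ignored for simplicity) that assigns each instruction the earliest step at which all of its inputs are ready:

```python
# Each instruction maps to (operands read, operands written),
# transcribed from the slide's 10-instruction listing.
instrs = {
    "001": (["A"], ["R1"]), "002": (["B"], ["R2"]),
    "003": (["R1", "R2"], ["R3"]), "004": (["R3"], ["E"]),
    "005": (["C"], ["R4"]), "006": (["D"], ["R5"]),
    "007": (["R4", "R5"], ["R6"]), "008": (["R6"], ["F"]),
    "009": (["R3", "R6"], ["R7"]), "010": (["R7"], ["G"]),
}

def schedule(instrs):
    """Assign each instruction the earliest step after its inputs are ready."""
    ready_at = {}   # operand -> step at which it becomes available
    steps = {}
    for name, (reads, writes) in instrs.items():  # program order
        step = 1 + max((ready_at.get(r, 0) for r in reads), default=0)
        steps[name] = step
        for w in writes:
            ready_at[w] = step
    return steps

steps = schedule(instrs)
for s in range(1, max(steps.values()) + 1):
    print(f"Step {s}:", [n for n, v in steps.items() if v == s])
```

Running it reproduces the four groups on the slide: {001, 002, 005, 006}, {003, 007}, {004, 008, 009}, {010}.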
15
Data Level Parallelism (DLP)
[Diagram: the same Task applied to multiple chunks of Data in parallel]
16
Task Level Parallelism (TLP)
[Diagram: multiple different Tasks executing in parallel]
17
Classes of Computers
Personal Mobile Device (PMD): e.g., smartphones, tablet computers; emphasis on energy efficiency and real-time performance
Desktop computing: emphasis on price-performance
Servers: emphasis on availability, scalability, throughput
Clusters / warehouse-scale computers: used for Software as a Service (SaaS); emphasis on availability and price-performance; sub-class: supercomputers, with emphasis on floating-point performance and fast internal networks
Embedded computers: emphasis on price
18
Parallelism
Classes of parallelism in applications: Data-Level Parallelism (DLP), Task-Level Parallelism (TLP)
Classes of architectural parallelism: Instruction-Level Parallelism (ILP), vector architectures / graphics processing units (GPUs), Thread-Level Parallelism, Request-Level Parallelism
19
Flynn’s Taxonomy
Single instruction stream, single data stream (SISD)
Single instruction stream, multiple data streams (SIMD): vector architectures, multimedia extensions, graphics processor units
Multiple instruction streams, single data stream (MISD): no commercial implementation
Multiple instruction streams, multiple data streams (MIMD): tightly coupled MIMD (thread-level parallelism), loosely coupled MIMD (cluster or warehouse-scale computing)
20
Defining Computer Architecture
“Old” view of computer architecture: Instruction Set Architecture (ISA) design, i.e., decisions regarding: registers (register-to-memory or load-store), memory addressing (e.g., byte addressing, alignment), addressing modes (e.g., register, immediate, displacement), instruction operands (type and size), available operations (CISC or RISC), control-flow instructions, instruction encoding (fixed vs. variable length)
“Real” computer architecture: specific requirements of the target machine; design to maximize performance within constraints: cost, power, and availability; includes ISA, microarchitecture, hardware
21
Trends in Technology
Integrated circuit technology: transistor density: 35%/year; die size: 10-20%/year; integration overall (Moore’s Law): 40-55%/year
DRAM capacity: 25-40%/year (slowing)
Flash capacity (non-volatile semiconductor memory): 50-60%/year; 15-20X cheaper/bit than DRAM
Magnetic disk technology: density increase: 40%/year; 15-25X cheaper/bit than Flash; 300-500X cheaper/bit than DRAM
22
Bandwidth and Latency
Bandwidth or throughput: total work done in a given time (e.g., MB/sec, MIPS); 10,000-25,000X improvement for processors; 300-1200X improvement for memory and disks
Latency or response time: time between the start and completion of an event; 30-80X improvement for processors; 6-8X improvement for memory and disks
23
Bandwidth and Latency
Log-log plot of bandwidth and latency milestones
24
Transistors and Wires
Feature size: the minimum size of a transistor or wire in the x or y dimension; from 10 microns in 1971 to 0.032 microns in 2011
Transistor performance scales linearly; integration density scales quadratically
Wire delay does not improve with feature size!
25
Power (1 Watt = 1 Joule/sec) and Energy (Joules)
Problem: get power in, get power out
Thermal Design Power (TDP): characterizes sustained power consumption; used as the target for the power supply and cooling system; lower than peak power, higher than average power consumption
Clock rate can be reduced dynamically to limit power consumption
Energy per task is often a better measurement
26
Energy and Power Example
Processor A: 20% higher average power consumption than B (1.2 P); executes the task in 70% of the time needed by B (0.7 T)
Processor B: average power P; execution time T
Energy consumption of A = 1.2 P × 0.7 T = 0.84 P T
Energy consumption of B = P T
Processor A is more energy efficient!
For comparing a fixed workload, it is better to use energy instead of power.
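The comparison above is a one-line computation; a sketch with arbitrary baseline values for B's power P and time T (the 0.84 ratio is independent of them):

```python
# Processor B's average power and execution time (arbitrary baseline units).
P, T = 1.0, 1.0
energy_B = P * T
energy_A = (1.2 * P) * (0.7 * T)   # A: 20% more power, but 30% less time
print(round(energy_A / energy_B, 2))  # 0.84 -> A is more energy efficient
```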
27
Dynamic Energy and Power
Dynamic energy: consumed when a transistor switches from 0 → 1 or 1 → 0; Energy = ½ × Capacitive load × Voltage²
Dynamic power: Power = ½ × Capacitive load × Voltage² × Frequency switched
Reducing the clock rate reduces power, not energy
28
Dynamic Energy and Power: Example
Assume a processor with Dynamic Voltage Scaling (DVS), and that a 15% reduction in voltage results in a 15% reduction in frequency. What is the impact on dynamic energy and dynamic power? Assume that the capacitance is unchanged.

Energy_new / Energy_old = (Voltage × 0.85)² / Voltage² = 0.85² ≈ 0.72

Power_new / Power_old = ((Voltage × 0.85)² × (Frequency × 0.85)) / (Voltage² × Frequency) = 0.85³ ≈ 0.61
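The two ratios can be checked numerically; a sketch using the dynamic energy and power formulas from the previous slide (the baseline values of 1.0 for capacitance, voltage, and frequency are arbitrary assumptions, since they cancel in the ratios):

```python
# Dynamic energy = 1/2 C V^2; dynamic power = 1/2 C V^2 f.
C, V, f = 1.0, 1.0, 1.0
scale = 0.85  # 15% reduction in both voltage and frequency

energy_ratio = (0.5 * C * (scale * V) ** 2) / (0.5 * C * V ** 2)
power_ratio = (0.5 * C * (scale * V) ** 2 * (scale * f)) / (0.5 * C * V ** 2 * f)
print(round(energy_ratio, 2), round(power_ratio, 2))  # 0.72 0.61
```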
29
Power
The Intel 80386 consumed ~2 W; a 3.3 GHz Intel Core i7 consumes 130 W
Heat must be dissipated from a 1.5 × 1.5 cm chip
This is the limit of what can be cooled by air
30
Reducing Power
Techniques for reducing power:
Do nothing well: turn off the clock of inactive modules
Dynamic voltage-frequency scaling
Low-power state for DRAM, disks (spin-down)
Overclocking some cores and turning off other cores
31
Static Power
Static power consumption: leakage current flows even when a transistor is off
Leakage can be as high as 50% due to large SRAM caches that need power to maintain their values
Power_static = Current_static × Voltage
Scales with the number of transistors
To reduce: power gating (i.e., turn off the power supply)
32
Trends in Cost
Cost driven down by the learning curve: yield (% of manufactured devices that survive testing)
DRAM: price closely tracks cost
Microprocessors: price depends on volume (less time to get down the learning curve, and increased manufacturing efficiency); 10% less for each doubling of volume
33
Integrated Circuit Cost
Cost of die = Cost of wafer / (Dies per wafer × Die yield)
Dies per wafer = π × (Wafer diameter / 2)² / Die area − π × Wafer diameter / √(2 × Die area)
(the second term accounts for partial dies along the wafer’s perimeter; the formula is empirical)
Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N   (Bose-Einstein formula)
Defects per unit area = 0.016-0.057 defects per square cm (2010)
N = process-complexity factor = 11.5-15.5 (40 nm, 2010)
34
Integrated Circuit Cost: Example
What is the number of dies on a 30 cm-diameter wafer for a die that is 1.5 cm on a side and for a die that is 1.0 cm on a side?

Dies per wafer = π × (30/2)² / (1.5 × 1.5) − π × 30 / √(2 × 1.5 × 1.5) ≈ 270

Dies per wafer = π × (30/2)² / (1.0 × 1.0) − π × 30 / √(2 × 1.0 × 1.0) ≈ 640
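The dies-per-wafer arithmetic can be packaged as a small helper; a sketch (the function name is ours, the formula is the slide's):

```python
import math

def dies_per_wafer(wafer_diam_cm, die_area_cm2):
    """Usable wafer area divided by die area, minus an edge-loss term
    for partial dies along the wafer's perimeter."""
    whole = math.pi * (wafer_diam_cm / 2) ** 2 / die_area_cm2
    edge = math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
    return whole - edge

print(round(dies_per_wafer(30, 1.5 * 1.5)))  # 270
print(round(dies_per_wafer(30, 1.0 * 1.0)))  # 640
```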
35
Integrated Circuit Cost: Example
What is the die yield for dies that are 1.5 cm on a side and 1.0 cm on a side, assuming a defect density of 0.031/cm², N = 13.5, and a wafer yield of 100%?

Die yield = 1 / (1 + 0.031 × 1.5 × 1.5)^13.5 ≈ 0.40

Die yield = 1 / (1 + 0.031 × 1.0 × 1.0)^13.5 ≈ 0.66
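The same check for die yield; a sketch of the Bose-Einstein formula used in the example (function and parameter names are ours):

```python
def die_yield(defect_density, die_area_cm2, n, wafer_yield=1.0):
    """Bose-Einstein die-yield formula: wafer_yield / (1 + d*A)^N."""
    return wafer_yield / (1 + defect_density * die_area_cm2) ** n

print(round(die_yield(0.031, 1.5 * 1.5, 13.5), 2))  # 0.4
print(round(die_yield(0.031, 1.0 * 1.0, 13.5), 2))  # 0.66
```

Note how strongly yield falls with die area: more than doubling the area (1.0 cm² to 2.25 cm²) drops the yield from 0.66 to 0.40.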
36
Dependability
Service Level Agreements (SLA) or Service Level Objectives (SLO) specify the expected level of service
Module reliability measures:
Mean time to failure (MTTF) — a reliability measure
Mean time to repair (MTTR) — measures service interruption
Mean time between failures (MTBF) = MTTF + MTTR
Availability = MTTF / MTBF
37 © 2004 D. A. Menascé. All Rights Reserved.
MTTR, MTTF, and MTBF
[Timeline: a failure occurs; the machine is fixed after MTTR; it then runs for MTTF until the next failure; MTBF = MTTR + MTTF]
38
Dependability Example
A disk subsystem has the following components and MTTF values:

Component        Quantity   MTTF (hours)
Disk             10         1,000,000 each
ATA controller   1          500,000
Power supply     1          200,000
Fan              1          200,000
ATA cable        1          1,000,000

Assuming that lifetimes are exponentially distributed and independent, what is the system’s MTTF?
39
Dependability Example (cont’d)

Failure Rate_system = 10 × 1/1,000,000 + 1/500,000 + 1/200,000 + 1/200,000 + 1/1,000,000
= 23/1,000,000 = 23,000 per billion hours

MTTF_system = 1 / Failure Rate_system ≈ 43,500 hours ≈ 4.96 years
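Because exponential, independent lifetimes mean failure rates simply add, the computation is a short sum; a sketch of the example's numbers:

```python
# (count, MTTF-per-unit in hours) for each component of the disk subsystem.
components = [(10, 1_000_000),  # disks
              (1, 500_000),     # ATA controller
              (1, 200_000),     # power supply
              (1, 200_000),     # fan
              (1, 1_000_000)]   # ATA cable

rate = sum(count / mttf for count, mttf in components)  # failures per hour
print(round(rate * 1e9))             # 23000 failures per billion hours

mttf_system = 1 / rate
print(round(mttf_system))            # 43478 hours (the slide rounds to 43,500)
print(round(mttf_system / 8760, 2))  # 4.96 years
```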
40
Measuring Performance
Typical performance metrics: response time (of interest to users); throughput (of interest to operators)
Speedup of X relative to Y = Execution time_Y / Execution time_X
Execution time: wall-clock time includes all system overheads; CPU time counts only computation time
Benchmarks: kernels (e.g., matrix multiply), toy programs (e.g., sorting), synthetic benchmarks (e.g., Dhrystone), benchmark suites (e.g., SPEC, TPC)
41
SPECrate and SPECratio
SPECrate: a throughput metric; it measures the number of jobs of a given type that can be processed in a given time interval.
SPECratio: the ratio between the elapsed time for a given job on a reference machine and the elapsed time of the same job on the machine being measured.

Comparing machines A and B:

Execution Time_reference = 10.5 sec
Execution Time_machine A = 5.25 sec → SPECratio_machine A = 10.5 / 5.25 = 2
Execution Time_machine B = 21 sec → SPECratio_machine B = 10.5 / 21 = 0.5

SPECratio_machine A / SPECratio_machine B
= (Execution Time_reference / Execution Time_machine A) / (Execution Time_reference / Execution Time_machine B)
= Execution Time_machine B / Execution Time_machine A = 21 / 5.25 = 4
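A quick numerical check of the derivation; note how the reference time cancels, so the ratio of SPECratios is just the ratio of execution times:

```python
# Execution times from the slide, in seconds.
t_ref, t_a, t_b = 10.5, 5.25, 21.0

spec_a = t_ref / t_a   # SPECratio of machine A
spec_b = t_ref / t_b   # SPECratio of machine B

print(spec_a)           # 2.0
print(spec_b)           # 0.5
print(spec_a / spec_b)  # 4.0, equal to t_b / t_a: the reference time cancels
```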
42
Geometric Mean of SPECratios
When computing the average of SPECratios one should use the geometric mean, not the arithmetic mean:

Geometric mean = (x₁ × x₂ × … × xₙ)^(1/n)

Program   SPECratio
A         10.2
B         21.5
C         15.2

Geometric mean = (10.2 × 21.5 × 15.2)^(1/3) = 3,333.36^(1/3) = 14.94
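The table's computation in a few lines; a sketch using `math.prod` (Python 3.8+):

```python
import math

def geometric_mean(xs):
    """n-th root of the product of the n values."""
    return math.prod(xs) ** (1 / len(xs))

ratios = [10.2, 21.5, 15.2]  # SPECratios of programs A, B, C
print(round(math.prod(ratios), 2))       # 3333.36
print(round(geometric_mean(ratios), 2))  # 14.94
```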
43
Principles of Computer Design
Take advantage of parallelism: e.g., multiple processors, disks, memory banks, pipelining, multiple functional units
Principle of locality: reuse of data and instructions
Focus on the common case
Amdahl’s Law
44
Amdahl’s Law: Example
A program takes 60 sec to execute, but 30% of its execution time can be improved by a factor of 5. What is the overall speedup?

Execution Time_new = 0.7 × 60 + (0.3 × 60) / 5 = 42 + 18/5 = 45.6 sec
Speedup_overall = 60 / 45.6 = 1.316
45
Amdahl’s Law: Example
There are two options to enhance the performance of a graphics application: speed up FP SQRT by a factor of 10, or speed up all FP operations by a factor of 1.6. Which is best?

Type of Enhancement   % Occurrence   Speedup of Enhancement   Overall Speedup
FP SQRT               20%            10                       1/(0.2/10 + 0.8) = 1.22
All FP Ops            50%            1.6                      1/(0.5 + 0.5/1.6) = 1.23
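Both worked examples above are instances of the same formula; a small sketch (the helper name is ours):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of execution time is sped up."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# First example: 30% of a 60 sec program sped up 5x.
print(round(60 / (0.7 * 60 + 0.3 * 60 / 5), 3))  # 1.316 (from execution times)
print(round(amdahl_speedup(0.3, 5), 3))          # 1.316 (same answer)

# Second example: FP SQRT (20% of time, 10x) vs. all FP ops (50% of time, 1.6x).
print(round(amdahl_speedup(0.2, 10), 2))   # 1.22
print(round(amdahl_speedup(0.5, 1.6), 2))  # 1.23 -> slightly better
```

The second example illustrates "focus on the common case": a modest speedup of a large fraction beats a dramatic speedup of a small one.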
46
Principles of Computer Design
The Processor Performance Equation:
CPU time = Instruction count × CPI × Clock cycle time
where CPI is the (average) number of clock cycles per instruction
47
Processor Equations: Example
We have the following measurements:

Measurement                      Value
% of FP operations               25%
Avg. CPI of FP operations        4.0
Avg. CPI of other instructions   1.33
% of FP SQRT                     2%
CPI of FP SQRT                   20

Compare the following alternatives: (a) decrease the CPI of FP SQRT to 2, or (b) decrease the average CPI of all FP operations to 2.5. Which is best?
48
Processor Equations: Example (cont’d)

Original CPI: CPI_orig = 0.25 × 4 + 0.75 × 1.33 = 2.0
CPI with enhanced FP SQRT: CPI_new FP SQRT = 2.0 − 0.02 × 20 + 0.02 × 2 = 1.64
CPI with enhanced FP: CPI_new FP = 2.0 − 0.25 × 4 + 0.25 × 2.5 = 1.625
49
Processor Equations: Example (cont’d)

Original CPI: CPI_orig = 0.25 × 4 + 0.75 × 1.33 = 2.0
CPI with enhanced FP: CPI_new FP = 2.0 − 0.25 × 4 + 0.25 × 2.5 = 1.625

Speedup = (IC × CPI_orig) / (IC × CPI_new FP) = 2.0 / 1.625 = 1.23

Most modern processors have counters for instructions executed and for clock cycles.
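Since the instruction count and clock cycle time are unchanged, the speedup reduces to a ratio of average CPIs; a sketch of the arithmetic (following the slide in rounding the original CPI of 1.9975 to 2.0):

```python
# Average CPI = sum over instruction classes of (fraction x class CPI).
cpi_orig = round(0.25 * 4 + 0.75 * 1.33, 1)     # 1.9975, rounded to 2.0 as on the slide

# (a) Replace FP SQRT's contribution (2% of instructions, CPI 20 -> 2).
cpi_new_sqrt = cpi_orig - 0.02 * 20 + 0.02 * 2
# (b) Replace all FP operations' contribution (25% of instructions, CPI 4 -> 2.5).
cpi_new_fp = cpi_orig - 0.25 * 4 + 0.25 * 2.5

print(round(cpi_new_sqrt, 2))           # 1.64
print(cpi_new_fp)                       # 1.625 -> option (b) is slightly better
print(round(cpi_orig / cpi_new_fp, 2))  # 1.23 speedup
```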
50
Fallacies and Pitfalls
Fallacy: Multiprocessors are a silver bullet. They improve performance by replacing a high-clock-rate, inefficient core with several lower-clock-rate, efficient cores, but the real performance-improvement burden is shifted to programmers.
Pitfall: Falling prey to Amdahl’s Law. One should check the overall improvement using Amdahl’s Law before spending effort on an enhancement.
51
Fallacies and Pitfalls (cont’d)
Pitfall: A single point of failure. Dependability is only as strong as the weakest link.
Fallacy: Hardware enhancements that increase performance improve energy efficiency, or are at worst energy neutral. Running SPEC2006 on an Intel Core i7 in Turbo mode showed a 7% performance increase at a cost of 37% more energy and 47% more power.
52
Fallacies and Pitfalls (cont’d)
Fallacy: Benchmarks remain valid indefinitely. There is a tendency to optimize performance for standard benchmarks.
Fallacy: The rated MTTF of disks is 1,200,000 hours (almost 140 years), so disks practically never fail. These numbers are exaggerated by manufacturers’ misleading testing processes; real-world MTTF is about 2 to 10 times worse than the manufacturer’s MTTF.
53
Fallacies and Pitfalls (cont’d)
Fallacy: Peak performance tracks observed performance. Observed performance as a fraction of peak performance varies significantly (from 5% to 60%) by benchmark and computer.
Pitfall: Fault detection can lower availability. A relatively high percentage of a system’s components can fail without affecting its correct operation; if the system’s operation is interrupted whenever these faults are detected, availability is reduced.