1 Copyright © 2012, Elsevier Inc. All rights reserved.
Chapter 1
Fundamentals of Quantitative Design and Analysis
Computer Architecture A Quantitative Approach, Fifth Edition
2 Copyright © 2013, Daniel A. Menasce. All rights reserved.
Introduction
1970: The IBM 7044 Computer
3
1970: Magnetic Core Memory
1 core = 1 bit
4
How Does a Computer Work?
What is the Von Neumann computer architecture? What are the elements of a Von Neumann computer? How do these elements interact to execute programs?
5
Von Neumann Architecture
What is the Von Neumann computer architecture? What are the elements of a Von Neumann computer? How do these elements interact to execute programs?
6
Von Neumann Architecture
[Diagram: stored-program computer — memory holding both Data and Program, connected via the system bus to a CPU containing the ALU and Registers]
7
Von Neumann Architecture
[Same stored-program computer diagram, annotated:]
Program: a sequence of instructions: (1) Load/Store between memory and registers, (2) arithmetic/logical operations, (3) conditional and unconditional branches.
Registers: general registers, FP registers, program counter, etc.
8
Example of Instruction Formats
Register-register: opcode RX RY RZ  →  RZ ← RX op RY
Load/Store: opcode RX Address  →  RX ← Mem(Address) or Mem(Address) ← RX
Branch: opcode Address  →  unconditional or conditional jump to Address
9
Computer Technology
Performance improvements come from:
Improvements in semiconductor technology: feature size, clock speed
Improvements in computer architectures: enabled by HLL compilers and UNIX; led to RISC architectures
Together these have enabled: lightweight computers; productivity-based managed/interpreted programming languages
10
Single Processor Performance
[Graph of single-processor performance growth; annotations: RISC; move to multiprocessors; from ILP to DLP and TLP]
11
Moore’s Law
Source: Wikimedia Commons
The number of transistors on integrated circuits doubles approximately every two years. (Gordon Moore, 1965)
12
Current Trends in Architecture
Cannot continue to leverage instruction-level parallelism (ILP): single-processor performance improvement ended in 2003.
New models for performance: data-level parallelism (DLP), thread-level parallelism (TLP), request-level parallelism (RLP).
These require explicit restructuring of the application.
13
Instruction Level Parallelism (ILP)
In high-level language: e = a + b; f = c + d; g = e * f

[Dataflow graph: a,b → + → e; c,d → + → f; e,f → × → g]

In assembly language:
* e = a + b
001 LOAD A,R1
002 LOAD B,R2
003 ADD R1,R2,R3
004 STO R3,E
* f = c + d
005 LOAD C,R4
006 LOAD D,R5
007 ADD R4,R5,R6
008 STO R6,F
* g = e * f
009 MULT R3,R6,R7
010 STO R7,G
14
Instruction Level Parallelism
In the assembly listing on the previous slide, instructions that can potentially be executed in parallel:
Step 1: 001, 002, 005, 006
Step 2: 003, 007
Step 3: 004, 008, 009
Step 4: 010
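The grouping above can be derived mechanically. A minimal sketch (not part of the slides; operand lists are transcribed from the listing, and anti/output dependences are ignored for simplicity) that assigns each instruction the earliest step at which all of its inputs are ready:

```python
# Each instruction maps to (operands read, operands written),
# transcribed from the slide's 10-instruction listing.
instrs = {
    "001": (["A"], ["R1"]), "002": (["B"], ["R2"]),
    "003": (["R1", "R2"], ["R3"]), "004": (["R3"], ["E"]),
    "005": (["C"], ["R4"]), "006": (["D"], ["R5"]),
    "007": (["R4", "R5"], ["R6"]), "008": (["R6"], ["F"]),
    "009": (["R3", "R6"], ["R7"]), "010": (["R7"], ["G"]),
}

def schedule(instrs):
    """Assign each instruction the earliest step after its inputs are ready."""
    ready_at = {}   # operand -> step at which it becomes available
    steps = {}
    for name, (reads, writes) in instrs.items():  # program order
        step = 1 + max((ready_at.get(r, 0) for r in reads), default=0)
        steps[name] = step
        for w in writes:
            ready_at[w] = step
    return steps

steps = schedule(instrs)
for s in range(1, max(steps.values()) + 1):
    print(f"Step {s}:", [n for n, v in steps.items() if v == s])
```

Running it reproduces the four groups on the slide: {001, 002, 005, 006}, {003, 007}, {004, 008, 009}, {010}.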
15
Data Level Parallelism (DLP)
[Diagram: the same Task applied to multiple chunks of Data in parallel]
16
Task Level Parallelism (TLP)
[Diagram: multiple different Tasks executing in parallel]
17
Classes of Computers
Personal Mobile Device (PMD): e.g., smartphones, tablet computers; emphasis on energy efficiency and real-time performance
Desktop computing: emphasis on price-performance
Servers: emphasis on availability, scalability, throughput
Clusters / warehouse-scale computers: used for Software as a Service (SaaS); emphasis on availability and price-performance; sub-class: supercomputers, with emphasis on floating-point performance and fast internal networks
Embedded computers: emphasis on price
18
Parallelism
Classes of parallelism in applications: Data-Level Parallelism (DLP), Task-Level Parallelism (TLP)
Classes of architectural parallelism: Instruction-Level Parallelism (ILP), vector architectures / graphics processing units (GPUs), Thread-Level Parallelism, Request-Level Parallelism
19
Flynn’s Taxonomy
Single instruction stream, single data stream (SISD)
Single instruction stream, multiple data streams (SIMD): vector architectures, multimedia extensions, graphics processor units
Multiple instruction streams, single data stream (MISD): no commercial implementation
Multiple instruction streams, multiple data streams (MIMD): tightly coupled MIMD (thread-level parallelism), loosely coupled MIMD (cluster or warehouse-scale computing)
20
Defining Computer Architecture
“Old” view of computer architecture: Instruction Set Architecture (ISA) design, i.e., decisions regarding: registers (register-to-memory or load-store), memory addressing (e.g., byte addressing, alignment), addressing modes (e.g., register, immediate, displacement), instruction operands (type and size), available operations (CISC or RISC), control-flow instructions, instruction encoding (fixed vs. variable length)
“Real” computer architecture: specific requirements of the target machine; design to maximize performance within constraints: cost, power, and availability; includes ISA, microarchitecture, hardware
21
Trends in Technology
Integrated circuit technology: transistor density: 35%/year; die size: 10-20%/year; integration overall (Moore’s Law): 40-55%/year
DRAM capacity: 25-40%/year (slowing)
Flash capacity (non-volatile semiconductor memory): 50-60%/year; 15-20X cheaper/bit than DRAM
Magnetic disk technology: density increase: 40%/year; 15-25X cheaper/bit than Flash; 300-500X cheaper/bit than DRAM
22
Bandwidth and Latency
Bandwidth or throughput: total work done in a given time (e.g., MB/sec, MIPS); 10,000-25,000X improvement for processors; 300-1200X improvement for memory and disks
Latency or response time: time between the start and completion of an event; 30-80X improvement for processors; 6-8X improvement for memory and disks
23
Bandwidth and Latency
Log-log plot of bandwidth and latency milestones
24
Transistors and Wires
Feature size: the minimum size of a transistor or wire in the x or y dimension; from 10 microns in 1971 to 0.032 microns in 2011
Transistor performance scales linearly; integration density scales quadratically
Wire delay does not improve with feature size!
25
Power (1 Watt = 1 Joule/sec) and Energy (Joules)
Problem: get power in, get power out
Thermal Design Power (TDP): characterizes sustained power consumption; used as the target for the power supply and cooling system; lower than peak power, higher than average power consumption
Clock rate can be reduced dynamically to limit power consumption
Energy per task is often a better measurement
26
Energy and Power Example
Processor A: 20% higher average power consumption than B (1.2 P); executes the task in 70% of the time needed by B (0.7 T)
Processor B: average power P; execution time T
Energy consumption of A = 1.2 P × 0.7 T = 0.84 P T
Energy consumption of B = P T
Processor A is more energy efficient!
For comparing a fixed workload, it is better to use energy instead of power.
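The comparison above is a one-line computation; a sketch with arbitrary baseline values for B's power P and time T (the 0.84 ratio is independent of them):

```python
# Processor B's average power and execution time (arbitrary baseline units).
P, T = 1.0, 1.0
energy_B = P * T
energy_A = (1.2 * P) * (0.7 * T)   # A: 20% more power, but 30% less time
print(round(energy_A / energy_B, 2))  # 0.84 -> A is more energy efficient
```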
27
Dynamic Energy and Power
Dynamic energy: consumed when a transistor switches from 0 → 1 or 1 → 0; Energy = ½ × Capacitive load × Voltage²
Dynamic power: Power = ½ × Capacitive load × Voltage² × Frequency switched
Reducing the clock rate reduces power, not energy
28
Dynamic Energy and Power: Example
Assume a processor with Dynamic Voltage Scaling (DVS), and that a 15% reduction in voltage results in a 15% reduction in frequency. What is the impact on dynamic energy and dynamic power? Assume that the capacitance is unchanged.

Energy_new / Energy_old = (Voltage × 0.85)² / Voltage² = 0.85² ≈ 0.72

Power_new / Power_old = ((Voltage × 0.85)² × (Frequency × 0.85)) / (Voltage² × Frequency) = 0.85³ ≈ 0.61
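The two ratios can be checked numerically; a sketch using the dynamic energy and power formulas from the previous slide (the baseline values of 1.0 for capacitance, voltage, and frequency are arbitrary assumptions, since they cancel in the ratios):

```python
# Dynamic energy = 1/2 C V^2; dynamic power = 1/2 C V^2 f.
C, V, f = 1.0, 1.0, 1.0
scale = 0.85  # 15% reduction in both voltage and frequency

energy_ratio = (0.5 * C * (scale * V) ** 2) / (0.5 * C * V ** 2)
power_ratio = (0.5 * C * (scale * V) ** 2 * (scale * f)) / (0.5 * C * V ** 2 * f)
print(round(energy_ratio, 2), round(power_ratio, 2))  # 0.72 0.61
```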
29
Power
The Intel 80386 consumed ~2 W; a 3.3 GHz Intel Core i7 consumes 130 W
Heat must be dissipated from a 1.5 × 1.5 cm chip
This is the limit of what can be cooled by air
30
Reducing Power
Techniques for reducing power:
Do nothing well: turn off the clock of inactive modules
Dynamic voltage-frequency scaling
Low-power state for DRAM, disks (spin-down)
Overclocking some cores and turning off other cores
31
Static Power
Static power consumption: leakage current flows even when a transistor is off
Leakage can be as high as 50% due to large SRAM caches that need power to maintain their values
Power_static = Current_static × Voltage
Scales with the number of transistors
To reduce: power gating (i.e., turn off the power supply)
32
Trends in Cost
Cost driven down by the learning curve: yield (% of manufactured devices that survive testing)
DRAM: price closely tracks cost
Microprocessors: price depends on volume (less time to get down the learning curve, and increased manufacturing efficiency); 10% less for each doubling of volume
33
Integrated Circuit Cost
Cost of die = Cost of wafer / (Dies per wafer × Die yield)
Dies per wafer = π × (Wafer diameter / 2)² / Die area − π × Wafer diameter / √(2 × Die area)
(the second term accounts for partial dies along the wafer’s perimeter; the formula is empirical)
Die yield = Wafer yield × 1 / (1 + Defects per unit area × Die area)^N   (Bose-Einstein formula)
Defects per unit area = 0.016-0.057 defects per square cm (2010)
N = process-complexity factor = 11.5-15.5 (40 nm, 2010)
34
Integrated Circuit Cost: Example
What is the number of dies on a 30 cm-diameter wafer for a die that is 1.5 cm on a side and for a die that is 1.0 cm on a side?

Dies per wafer = π × (30/2)² / (1.5 × 1.5) − π × 30 / √(2 × 1.5 × 1.5) ≈ 270

Dies per wafer = π × (30/2)² / (1.0 × 1.0) − π × 30 / √(2 × 1.0 × 1.0) ≈ 640
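The dies-per-wafer arithmetic can be packaged as a small helper; a sketch (the function name is ours, the formula is the slide's):

```python
import math

def dies_per_wafer(wafer_diam_cm, die_area_cm2):
    """Usable wafer area divided by die area, minus an edge-loss term
    for partial dies along the wafer's perimeter."""
    whole = math.pi * (wafer_diam_cm / 2) ** 2 / die_area_cm2
    edge = math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
    return whole - edge

print(round(dies_per_wafer(30, 1.5 * 1.5)))  # 270
print(round(dies_per_wafer(30, 1.0 * 1.0)))  # 640
```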
35
Integrated Circuit Cost: Example
What is the die yield for dies that are 1.5 cm on a side and 1.0 cm on a side, assuming a defect density of 0.031/cm², N = 13.5, and a wafer yield of 100%?

Die yield = 1 / (1 + 0.031 × 1.5 × 1.5)^13.5 ≈ 0.40

Die yield = 1 / (1 + 0.031 × 1.0 × 1.0)^13.5 ≈ 0.66
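The same check for die yield; a sketch of the Bose-Einstein formula used in the example (function and parameter names are ours):

```python
def die_yield(defect_density, die_area_cm2, n, wafer_yield=1.0):
    """Bose-Einstein die-yield formula: wafer_yield / (1 + d*A)^N."""
    return wafer_yield / (1 + defect_density * die_area_cm2) ** n

print(round(die_yield(0.031, 1.5 * 1.5, 13.5), 2))  # 0.4
print(round(die_yield(0.031, 1.0 * 1.0, 13.5), 2))  # 0.66
```

Note how strongly yield falls with die area: more than doubling the area (1.0 cm² to 2.25 cm²) drops the yield from 0.66 to 0.40.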
36
Dependability
Service Level Agreements (SLA) or Service Level Objectives (SLO) specify the expected level of service
Module reliability measures:
Mean time to failure (MTTF) — a reliability measure
Mean time to repair (MTTR) — measures service interruption
Mean time between failures (MTBF) = MTTF + MTTR
Availability = MTTF / MTBF
37 © 2004 D. A. Menascé. All Rights Reserved.
MTTR, MTTF, and MTBF
[Timeline: a failure occurs; the machine is fixed after MTTR; it then runs for MTTF until the next failure; MTBF = MTTR + MTTF]
38
Dependability Example
A disk subsystem has the following components and MTTF values:

Component        Quantity   MTTF (hours)
Disk             10         1,000,000 each
ATA controller   1          500,000
Power supply     1          200,000
Fan              1          200,000
ATA cable        1          1,000,000

Assuming that lifetimes are exponentially distributed and independent, what is the system’s MTTF?
39
Dependability Example (cont’d)

Failure Rate_system = 10 × 1/1,000,000 + 1/500,000 + 1/200,000 + 1/200,000 + 1/1,000,000
= 23/1,000,000 = 23,000 per billion hours

MTTF_system = 1 / Failure Rate_system ≈ 43,500 hours ≈ 4.96 years
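Because exponential, independent lifetimes mean failure rates simply add, the computation is a short sum; a sketch of the example's numbers:

```python
# (count, MTTF-per-unit in hours) for each component of the disk subsystem.
components = [(10, 1_000_000),  # disks
              (1, 500_000),     # ATA controller
              (1, 200_000),     # power supply
              (1, 200_000),     # fan
              (1, 1_000_000)]   # ATA cable

rate = sum(count / mttf for count, mttf in components)  # failures per hour
print(round(rate * 1e9))             # 23000 failures per billion hours

mttf_system = 1 / rate
print(round(mttf_system))            # 43478 hours (the slide rounds to 43,500)
print(round(mttf_system / 8760, 2))  # 4.96 years
```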
40
Measuring Performance
Typical performance metrics: response time (of interest to users); throughput (of interest to operators)
Speedup of X relative to Y = Execution time_Y / Execution time_X
Execution time: wall-clock time includes all system overheads; CPU time counts only computation time
Benchmarks: kernels (e.g., matrix multiply), toy programs (e.g., sorting), synthetic benchmarks (e.g., Dhrystone), benchmark suites (e.g., SPEC, TPC)
41
SPECrate and SPECratio
SPECrate: a throughput metric; it measures the number of jobs of a given type that can be processed in a given time interval.
SPECratio: the ratio between the elapsed time for a given job on a reference machine and the elapsed time of the same job on the machine being measured.

Comparing machines A and B:

Execution Time_reference = 10.5 sec
Execution Time_machine A = 5.25 sec → SPECratio_machine A = 10.5 / 5.25 = 2
Execution Time_machine B = 21 sec → SPECratio_machine B = 10.5 / 21 = 0.5

SPECratio_machine A / SPECratio_machine B
= (Execution Time_reference / Execution Time_machine A) / (Execution Time_reference / Execution Time_machine B)
= Execution Time_machine B / Execution Time_machine A = 21 / 5.25 = 4
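A quick numerical check of the derivation; note how the reference time cancels, so the ratio of SPECratios is just the ratio of execution times:

```python
# Execution times from the slide, in seconds.
t_ref, t_a, t_b = 10.5, 5.25, 21.0

spec_a = t_ref / t_a   # SPECratio of machine A
spec_b = t_ref / t_b   # SPECratio of machine B

print(spec_a)           # 2.0
print(spec_b)           # 0.5
print(spec_a / spec_b)  # 4.0, equal to t_b / t_a: the reference time cancels
```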
42
Geometric Mean of SPECratios
When computing the average of SPECratios one should use the geometric mean, not the arithmetic mean:

Geometric mean = (x₁ × x₂ × … × xₙ)^(1/n)

Program   SPECratio
A         10.2
B         21.5
C         15.2

Geometric mean = (10.2 × 21.5 × 15.2)^(1/3) = 3,333.36^(1/3) = 14.94
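The table's computation in a few lines; a sketch using `math.prod` (Python 3.8+):

```python
import math

def geometric_mean(xs):
    """n-th root of the product of the n values."""
    return math.prod(xs) ** (1 / len(xs))

ratios = [10.2, 21.5, 15.2]  # SPECratios of programs A, B, C
print(round(math.prod(ratios), 2))       # 3333.36
print(round(geometric_mean(ratios), 2))  # 14.94
```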
43
Principles of Computer Design
Take advantage of parallelism: e.g., multiple processors, disks, memory banks, pipelining, multiple functional units
Principle of locality: reuse of data and instructions
Focus on the common case
Amdahl’s Law
44
Amdahl’s Law: Example
A program takes 60 sec to execute, but 30% of its execution time can be improved by a factor of 5. What is the overall speedup?

Execution Time_new = 0.7 × 60 + (0.3 × 60) / 5 = 42 + 18/5 = 45.6 sec
Speedup_overall = 60 / 45.6 = 1.316
45
Amdahl’s Law: Example
There are two options to enhance the performance of a graphics application: speed up FP SQRT by a factor of 10, or speed up all FP operations by a factor of 1.6. Which is best?

Type of Enhancement   % Occurrence   Speedup of Enhancement   Overall Speedup
FP SQRT               20%            10                       1/(0.2/10 + 0.8) = 1.22
All FP Ops            50%            1.6                      1/(0.5 + 0.5/1.6) = 1.23
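Both worked examples above are instances of the same formula; a small sketch (the helper name is ours):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of execution time is sped up."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# First example: 30% of a 60 sec program sped up 5x.
print(round(60 / (0.7 * 60 + 0.3 * 60 / 5), 3))  # 1.316 (from execution times)
print(round(amdahl_speedup(0.3, 5), 3))          # 1.316 (same answer)

# Second example: FP SQRT (20% of time, 10x) vs. all FP ops (50% of time, 1.6x).
print(round(amdahl_speedup(0.2, 10), 2))   # 1.22
print(round(amdahl_speedup(0.5, 1.6), 2))  # 1.23 -> slightly better
```

The second example illustrates "focus on the common case": a modest speedup of a large fraction beats a dramatic speedup of a small one.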
46
Principles of Computer Design
The Processor Performance Equation:
CPU time = Instruction count × CPI × Clock cycle time
where CPI is the (average) number of clock cycles per instruction
47
Processor Equations: Example
We have the following measurements:

Measurement                      Value
% of FP operations               25%
Avg. CPI of FP operations        4.0
Avg. CPI of other instructions   1.33
% of FP SQRT                     2%
CPI of FP SQRT                   20

Compare the following alternatives: (a) decrease the CPI of FP SQRT to 2, or (b) decrease the average CPI of all FP operations to 2.5. Which is best?
48
Processor Equations: Example (cont’d)

Original CPI: CPI_orig = 0.25 × 4 + 0.75 × 1.33 = 2.0
CPI with enhanced FP SQRT: CPI_new FP SQRT = 2.0 − 0.02 × 20 + 0.02 × 2 = 1.64
CPI with enhanced FP: CPI_new FP = 2.0 − 0.25 × 4 + 0.25 × 2.5 = 1.625
49
Processor Equations: Example (cont’d)

Original CPI: CPI_orig = 0.25 × 4 + 0.75 × 1.33 = 2.0
CPI with enhanced FP: CPI_new FP = 2.0 − 0.25 × 4 + 0.25 × 2.5 = 1.625

Speedup = (IC × CPI_orig) / (IC × CPI_new FP) = 2.0 / 1.625 = 1.23

Most modern processors have counters for instructions executed and for clock cycles.
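Since the instruction count and clock cycle time are unchanged, the speedup reduces to a ratio of average CPIs; a sketch of the arithmetic (following the slide in rounding the original CPI of 1.9975 to 2.0):

```python
# Average CPI = sum over instruction classes of (fraction x class CPI).
cpi_orig = round(0.25 * 4 + 0.75 * 1.33, 1)     # 1.9975, rounded to 2.0 as on the slide

# (a) Replace FP SQRT's contribution (2% of instructions, CPI 20 -> 2).
cpi_new_sqrt = cpi_orig - 0.02 * 20 + 0.02 * 2
# (b) Replace all FP operations' contribution (25% of instructions, CPI 4 -> 2.5).
cpi_new_fp = cpi_orig - 0.25 * 4 + 0.25 * 2.5

print(round(cpi_new_sqrt, 2))           # 1.64
print(cpi_new_fp)                       # 1.625 -> option (b) is slightly better
print(round(cpi_orig / cpi_new_fp, 2))  # 1.23 speedup
```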
50
Fallacies and Pitfalls
Fallacy: Multiprocessors are a silver bullet. They improve performance by replacing a high-clock-rate, inefficient core with several lower-clock-rate, efficient cores, but the real performance-improvement burden is shifted to programmers.
Pitfall: Falling prey to Amdahl’s Law. One should check the overall improvement using Amdahl’s Law before spending effort on an enhancement.
51
Fallacies and Pitfalls (cont’d)
Pitfall: A single point of failure. Dependability is only as strong as the weakest link.
Fallacy: Hardware enhancements that increase performance improve energy efficiency, or are at worst energy neutral. Running SPEC2006 on an Intel Core i7 in Turbo mode showed a 7% performance increase at a cost of 37% more energy and 47% more power.
52
Fallacies and Pitfalls (cont’d)
Fallacy: Benchmarks remain valid indefinitely. There is a tendency to optimize performance for standard benchmarks.
Fallacy: The rated MTTF of disks is 1,200,000 hours (almost 140 years), so disks practically never fail. These numbers are exaggerated by manufacturers’ misleading testing processes; real-world MTTF is about 2 to 10 times worse than the manufacturer’s MTTF.
53
Fallacies and Pitfalls (cont’d)
Fallacy: Peak performance tracks observed performance. Observed performance as a fraction of peak performance varies significantly (from 5% to 60%) by benchmark and computer.
Pitfall: Fault detection can lower availability. A relatively high percentage of a system’s components can fail without affecting its correct operation; if the system’s operation is interrupted whenever these faults are detected, availability is reduced.