Rosetta Demostrator Project MASC, Adelaide University and ... · Rosetta Demostrator Project MASC,...

LECTURE 1:UNDERSTANDING COMPUTER

ARCHITECTURE

Abridged version of Patterson & Hennessy (2017):Ch.1

Hennessy & Patterson (2017): Ch.1

1

The Computer Revolution

Progress in computer technology Underpinned by Moore’s Law

Makes novel applications feasible Computers in automobiles Cell phones Human genome project World Wide Web Search Engines

Computers are pervasive

2

Classes of Computers Personal Mobile Device (PMD)

e.g. start phones, tablet computers Emphasis on energy efficiency and real-time

Desktop Computing Emphasis on price-performance

Servers Emphasis on availability, scalability, throughput

Clusters / Warehouse Scale Computers Used for “Software as a Service (SaaS)” Emphasis on availability and price-performance Sub-class: Supercomputers, emphasis: floating-point

performance and fast internal networks Embedded Computers

Emphasis: price

3

The PostPC Era

4

Opening the BoxCapacitive multitouch LCD screen

3.8 V, 25 Watt-hour battery

Computer board

5

Inside the Processor

E.g., Apple A5 Contain

Datapath Control Cache I/O

6

Below Your Program Application software

Written in high-level language System software

Compiler: translates HLL code to machine code

Operating System: service code Handling input/output Managing memory and storage Scheduling tasks & sharing resources

Hardware Processor, memory, I/O controllers

7

Levels of Program Code High-level language

Level of abstraction closer to problem domain

Provides for productivity and portability

Assembly language Textual representation of

instructions Hardware representation

Binary digits (bits) Encoded instructions and

data

8

Defining Computer Architecture

“Old” view of computer architecture: Instruction Set Architecture (ISA) design i.e. decisions regarding:

registers, memory addressing, addressing modes, instruction operands, available operations, control flow instructions, instruction encoding

“Real” computer architecture: Specific requirements of the target machine Design to maximize performance within

constraints: cost, power, and availability Includes ISA, microarchitecture, hardware

9

Understanding Performance

Algorithm Determines number of operations executed

Programming language, compiler, architecture Determine number of machine instructions executed

per operation Processor and memory system

Determine how fast instructions are executed I/O system (including OS)

Determines how fast I/O operations are executed

10

Eight Great Ideas

Design for Moore’s Law

Use abstraction to simplify design

Make the common case fast

Performance via parallelism

Performance via pipelining

Performance via prediction

Hierarchy of memories

Dependability via redundancy

11

Abstractions

Abstraction helps us deal with complexity Hide lower-level detail

Instruction set architecture (ISA) The hardware/software interface

Application binary interface The ISA plus system software interface

Implementation The details underlying and interface

The BIG Picture

12

Intel Core i7 Wafer

300mm wafer, 280 chips, 32nm technology Each chip is 20.7 x 10.5 mm

13

Transistors and Wires

Feature size Minimum size of transistor or wire in x or y

dimension 10 microns in 1971 to .011 microns in 2017 Transistor performance scales linearly

Wire delay does not improve with feature size! Integration density scales quadratically

14

Technology Trends Electronics

technology continues to evolve

Increased capacity and performance

Reduced cost

Year Technology Relative performance/cost

1951 Vacuum tube 1

1965 Transistor 35

1975 Integrated circuit (IC) 900

1995 Very large scale IC (VLSI) 2,400,000

2013 Ultra large scale IC 250,000,000,000

DRAM capacity

15

Trends in Technology

Integrated circuit technology Transistor density: 35%/year Die size: 10-20%/year Integration overall: 40-55%/year

DRAM capacity: 25-40%/year (slowing) Flash capacity: 50-60%/year

15-20X cheaper/bit than DRAM Magnetic disk technology: 40%/year

15-25X cheaper/bit then Flash 300-500X cheaper/bit than DRAM

16

Measuring Performance Typical performance metrics:

Response time Throughput

Speedup of X relative to Y Execution timeY / Execution timeX

Execution time Wall clock time: includes all system overheads CPU time: only computation time

Benchmarks Kernels (e.g. matrix multiply) Toy programs (e.g. sorting) Synthetic benchmarks (e.g. Dhrystone) Benchmark suites (e.g. SPEC06fp, TPC-C)

17

Bandwidth and Latency

Bandwidth or throughput Total work done in a given time 10,000-25,000X improvement for processors 300-1200X improvement for memory and disks

Latency or response time Time between start and completion of an event 30-80X improvement for processors 6-8X improvement for memory and disks

18

Bandwidth and Latency

Log-log plot of bandwidth and latency milestones

19

Relative Performance

Define Performance = 1/Execution Time “X is n time faster than Y”

n XY

YX

time Executiontime Execution

ePerformancePerformanc

Example: time taken to run a program 10s on A, 15s on B Execution TimeB / Execution TimeA

= 15s / 10s = 1.5 So A is 1.5 times faster than B

20

Measuring Execution Time

Elapsed time Total response time, including all aspects

Processing, I/O, OS overhead, idle time Determines system performance

CPU time Time spent processing a given job

Discounts I/O time, other jobs’ shares Comprises user CPU time and system CPU

time Different programs are affected differently by

CPU and system performance

21

Performance Summary

Performance depends on Algorithm: affects IC, possibly CPI Programming language: affects IC, CPI Compiler: affects IC, CPI Instruction set architecture: affects IC, CPI, Tc

The BIG Picture

cycle Clock

Seconds

nInstructio

cycles Clock

Program

nsInstructioTime CPU

22

SPEC CPU Benchmark Programs used to measure performance

Supposedly typical of actual workload Standard Performance Evaluation Corp (SPEC)

Develops benchmarks for CPU, I/O, Web, … SPEC CPU2006

Elapsed time to execute a selection of programs Negligible I/O, so focuses on CPU performance

Normalize relative to reference machine Summarize as geometric mean of performance ratios

CINT2006 (integer) and CFP2006 (floating-point)

n

n

1iiratio time Execution

23

CINT2006 for Intel Core i7 920

24

Pitfall: Amdahl’s Law

Improving an aspect of a computer and expecting a proportional improvement in overall performance

unaffectedaffected

improved Tfactor timprovemen

TT

Corollary: make the common case fast Speed-up S(N) = [(1 – f) + f/N]-1

25

Pitfall: MIPS as a Performance Metric

MIPS: Millions of Instructions Per Second Doesn’t account for

Differences in ISAs between computers Differences in complexity between instructions

66

6

10CPI

rate Clock

10rate Clock

CPIcount nInstructiocount nInstructio

10time Execution

count nInstructioMIPS

CPI varies between programs on a given CPU

26

Power and Energy Problem: Get power in, get power out Thermal Design Power (TDP)

Characterizes sustained power consumption Used as target for power supply and cooling system Lower than peak power (1.5X higher), higher than

average power consumption Clock rate can be reduced dynamically to limit

power consumption Energy per task is often a better measurement

27

Power Intel 80386

consumed ~ 2 W 3.3 GHz Intel

Core i7 consumes 130 W

Heat must be dissipated from 1.5 x 1.5 cm chip

This is the limit of what can be cooled by air

28

Dynamic Energy and Power Dynamic energy

Transistor switch from 0 -> 1 or 1 -> 0 ½ x Capacitive load x Voltage2

Dynamic power ½ x Capacitive load x Voltage2 x Frequency switched

Reducing clock rate reduces power, not energy

29

Reducing Power Techniques for reducing power:

Do nothing well Dynamic Voltage-Frequency Scaling

Low power state for DRAM, disks Overclocking, turning off cores

30

Static Power Static power consumption

25-50% of total power Currentstatic x Voltage Scales with number of transistors To reduce: power gating

31

Power Trends

In CMOS IC technology

FrequencyVoltageload CapacitivePower 2

×1000×30 5V → 1V

32

SPEC Power Benchmark

Power consumption of server at different workload levels Performance: ssj_ops/sec Power: Watts (Joules/sec)

10

0ii

10

0ii powerssj_ops Wattper ssj_ops Overall

33

SPECpower_ssj2008 for Xeon X5650

34

Reducing Power

Suppose a new CPU has 85% of capacitive load of old CPU 15% voltage and 15% frequency reduction

0.520.85FVC

0.85F0.85)(V0.85C

P

P 4

old2

oldold

old2

oldold

old

new

The power wall We can’t reduce voltage further We can’t remove more heat

How else can we improve performance?

35

Fallacy: Low Power at Idle

Look back at i7 power benchmark At 100% load: 258W At 50% load: 170W (66%) At 10% load: 121W (47%)

Google data center Mostly operates at 10% – 50% load At 100% load less than 1% of the time

Consider designing processors to make power proportional to load

Can it be done?36

Single Processor Performance

RISC

Move to multi-processor

37

Current Trends in Architecture

Cannot continue to leverage Instruction-Level parallelism (ILP) Single processor performance improvement ended in

2003 New models for performance:

Data-level parallelism (DLP) Thread-level parallelism (TLP) Request-level parallelism (RLP)

These require explicit restructuring of the application

38

Multiprocessors

Multicore microprocessors More than one processor per chip

Requires explicitly parallel programming Compare with instruction level parallelism

Hardware executes multiple instructions at once Hidden from the programmer

Hard to do Programming for performance Load balancing Optimizing communication and synchronization

39

Fallacies and Pitfalls

Microprocessors are a silver bullet Performance is now a programmer’s burden

Falling prey to Amdahl’s Law A single point of failure Hardware enhancements that increase

performance also improve energy efficiency, or are at worst energy neutral

Benchmarks remain valid indefinitely Compiler optimizations target benchmarks

40

Fallacies and Pitfalls

All exponential laws must come to an end Dennard scaling (constant power density)

Stopped by threshold voltage Disk capacity

30-100% per year to 5% per year Moore’s Law

Most visible with DRAM capacity ITRS disbanded Only four foundries left producing state-of-the-art

logic chips 11 nm, 3 nm might be the limit

41

Rosetta Demostrator Project MASC, Adelaide University and ... · Rosetta Demostrator Project MASC,...

Documents

Transcript of Rosetta Demostrator Project MASC, Adelaide University and ... · Rosetta Demostrator Project MASC,...