Advanced Computer Architectures – Part 2.1

Advanced Computer Architectures HB49 Part 2.1 Vincenzo De Florio K.U.Leuven / ESAT / ELECTA

Description

Part 2.1 of the slides I wrote for the course "Advanced Computer Architectures", which I taught in the framework of the Advanced Masters Programme in Artificial Intelligence of the Catholic University of Leuven, Leuven (B)

Transcript of Advanced Computer Architectures – Part 2.1

Page 1: Advanced Computer Architectures – Part 2.1

Advanced Computer Architectures

– HB49 –

Part 2.1

Vincenzo De Florio

K.U.Leuven / ESAT / ELECTA

Page 2: Advanced Computer Architectures – Part 2.1

© V. De Florio

KULeuven 2003


Course contents

• Basic Concepts

• Computer Design

• Computer Architectures for AI

• Computer Architectures in Practice

Page 3: Advanced Computer Architectures – Part 2.1


Computer Design

• Quantitative assessments

• Instruction sets

• Pipelining

Page 4: Advanced Computer Architectures – Part 2.1


Computer design

• First part of the course: a survey of

computer history

• Key aspect of this history:

In the last 60 years computers have experienced a formidable growth in performance and a huge decrease in cost.

A €1000 PC today provides its user with more performance, memory, and disk space than a $1M mainframe of the Sixties.

Page 5: Advanced Computer Architectures – Part 2.1


Computer design

• How was this possible?

• Through

Advances in computer technology

Advances in computer design

Page 6: Advanced Computer Architectures – Part 2.1


Computer design

• The tasks of a computer designer:

Determine key attributes for a new machine

E.g., design a machine that maximizes performance while keeping costs under control

Aspects:

Instruction set design

Functional organization

Logic design

Implementation

(To be defined later)

Page 7: Advanced Computer Architectures – Part 2.1


Significant improvements

• First 25 years:

From both technology and design

• From the Seventies:

Mainly from IC technology

Main concern = compatibility with the past (to save investments)

Compatibility at ML (machine-language level)

No room for design improvements

Performance growth: 20-30% per year for mainframes and minis

• Late Seventies: advent of the mP (microprocessor)

Higher rate (35% per year)

Page 8: Advanced Computer Architectures – Part 2.1


Significant improvements: the mP

• The mP

Mass-produced ⇒ lower costs

Significant changes in computer

marketplace

Higher level language compatibility (no need

for object code compatibility)

Availability of standard, vendor-independent

OS (less risks and costs in producing a new

architecture)

These made it possible to develop a new concept:

RISC architectures

Page 9: Advanced Computer Architectures – Part 2.1


Significant improvements: RISC

RISC architectures

Designed in the Eighties, on the market ca.‘85

Since then, a 50% improvement per year

[Chart: performance vs. year, 1987-1995 (scale 0-300) — Sun-4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC AXP 3000, IBM Power 2/590, DEC 21064a, Sun UltraSparc; growth 1.35X/yr in the earlier years, 1.54X/yr later]

Page 10: Advanced Computer Architectures – Part 2.1


Technology Trends

[Chart: relative performance (log scale, 0.1-1000) vs. year, 1965-2000, for supercomputers, mainframes, minicomputers, and microprocessors]

Page 11: Advanced Computer Architectures – Part 2.1


Computer design

• The mP allowed a 50% per-year performance increase. How was that possible?

Enhanced capability for users

The fastest supercomputer of 1988 (Cray Y-MP) has approximately the same performance as the fastest workstation of 1993 (IBM Power 2)

Price: 1/10

Computers became more and more mP-based

Mainframes were disappearing or becoming based on off-the-shelf mPs

Page 12: Advanced Computer Architectures – Part 2.1


Computer design

• Big consequence

No more market urge for

object code compatibility

Freedom from compatibility with old designs

Renaissance in computer design

Again, significant improvements from both

technology and design

50% performance growth per year!

Page 13: Advanced Computer Architectures – Part 2.1


Computer design

• The highest-performance mP in '95 is mainly a result of design improvements (1-to-5)

• In this section we focus on the design techniques that allowed this state of affairs

Page 14: Advanced Computer Architectures – Part 2.1


Performance

• What are the aspects to be taken into

account in order to reach a higher

performance?

• How to choose between different

alternatives?

Amdahl's law

Quantitative assessment

Page 15: Advanced Computer Architectures – Part 2.1


Amdahl's law

• Speed-up:

S = (Execution time for the entire task w/o using the "enhancement") / (Execution time for the entire task using the enhancement when possible)

• Amdahl's law on speed-up: the speed-up depends on the fraction of time that may be affected by the enhancement

Page 16: Advanced Computer Architectures – Part 2.1


Amdahl's law

Let us call F the fraction of time affected by the enhancement

For instance, F = 0.40 means that the original program would benefit from the enhancement for 40% of its execution time

What do we gain by introducing the enhancement?

Exec-timeNEW = Exec-timeOLD × ((1 − F) + F / SENH)

where SENH is the speedup in the enhanced mode. Hence,

S = Exec-timeOLD / Exec-timeNEW = 1 / ((1 − F) + F / SENH)
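The formula above drops straight into code; a minimal sketch (the function name is mine, not from the slides):

```python
def amdahl_speedup(f, s_enh):
    """Amdahl's law: overall speedup S when a fraction f of
    execution time is accelerated by a factor s_enh."""
    return 1.0 / ((1.0 - f) + f / s_enh)

# F = 0.40, enhancement 10x faster -> S = 1 / 0.64 = 1.5625
print(round(amdahl_speedup(0.40, 10), 2))  # 1.56
```

Note how even an infinitely fast enhanced mode (s_enh → ∞) cannot push S past 1 / (1 − f), which is the "law of diminishing returns" discussed on the following slides.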

Page 17: Advanced Computer Architectures – Part 2.1


Amdahl's law

[Chart: with F = 40%, the overall speedup SOVER levels off even as SENH grows]

Page 18: Advanced Computer Architectures – Part 2.1


Amdahl's law

• Law of diminishing returns: the incremental improvement in speedup gained by an additional improvement in the performance of just a portion of the computation diminishes as improvements are added

lim (SENH → ∞) S = lim (SENH → ∞) 1 / ((1 − F) + F / SENH) = 1 / (1 − F) = SMAX

Page 19: Advanced Computer Architectures – Part 2.1


Amdahl's law

To reach a maximum speedup of 3, F must be at least 66%
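The 66% figure follows from solving SMAX = 1 / (1 − F) for F; a quick check (the helper name is my own):

```python
def min_fraction_for_max_speedup(s_max):
    """From S_MAX = 1 / (1 - F): the fraction F of execution time
    that must be enhanceable to ever reach a maximum speedup s_max."""
    return 1.0 - 1.0 / s_max

print(min_fraction_for_max_speedup(3))  # ~0.667, i.e. F >= 66%
```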

Page 20: Advanced Computer Architectures – Part 2.1


Amdahl's law…

• "…can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve cost/performance.

• The goal, clearly, is to spend resources proportional to where time is spent."

Page 21: Advanced Computer Architectures – Part 2.1


Amdahl's law

• Example 1 (p.30 P&H)

A method allows an improvement by a factor 10

It can be exploited for 40% of the time

speedupoverall = 1 / ((1 − fractionenhanced) + fractionenhanced / speedupenhanced)
= 1 / ((1 − 0.4) + 0.4 / 10) = 1 / 0.64 ≈ 1.56

Page 22: Advanced Computer Architectures – Part 2.1


Amdahl's law

Example 2 (p.31 P&H)

50% of the instructions of a given benchmark are floating-point instructions

FPSQR applies to 20% of the same benchmark

Alternative 1: extra hardware makes FPSQR 10 times faster

Alternative 2: all the FP instructions go 2 times faster

speedupFPSQR = 1 / ((1 − 0.2) + 0.2 / 10) = 1 / 0.82 ≈ 1.22

speedupFP = 1 / ((1 − 0.5) + 0.5 / 2) = 1 / 0.75 ≈ 1.33
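Both alternatives are plain applications of Amdahl's law; a small comparison sketch (the names are mine):

```python
def amdahl_speedup(f, s_enh):
    # overall speedup when a fraction f of time is sped up by s_enh
    return 1.0 / ((1.0 - f) + f / s_enh)

s_fpsqr = amdahl_speedup(0.20, 10)  # alt. 1: dedicated FPSQR hardware
s_fp    = amdahl_speedup(0.50, 2)   # alt. 2: all FP instructions 2x faster
print(round(s_fpsqr, 2), round(s_fp, 2))  # 1.22 1.33 -> alternative 2 wins
```

Alternative 2 wins even though it is the "weaker" enhancement, because it applies to a larger fraction of the execution time.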

Page 23: Advanced Computer Architectures – Part 2.1


Quantitative assessment

• CPUTIME(p) = time spent by the CPU to run program p

• Clock cycle time = tcc, clock rate = 1 / tcc

• CPUTIME(p) = #clock cycles × tcc = #clock cycles / clock rate

• E.g.: clock cycle time = 2 ns ⇔ clock rate = 500 MHz

• #CC(p) = number of clock cycles spent in the execution of p

Page 24: Advanced Computer Architectures – Part 2.1


Quantitative assessment

• Instruction count

• IC(c,p) = number of instructions that CPU c executed during the activity of program p

• Often written simply IC(p), when the CPU is clear from the context

Page 25: Advanced Computer Architectures – Part 2.1


• Clock cycles per instruction

• CPI(p) = #CC(p) / IC(p): the average number of clock cycles needed to execute one instruction of p

Quantitative assessment

Page 26: Advanced Computer Architectures – Part 2.1


• CPUTIME(p) =
= #clock cycles × clock cycle time
= #CC(p) × tcc
= IC(p) × CPI(p) × tcc
= IC(p) × CPI(p) / clock rate

We can influence the performance of a given

program p by optimizing the three key

variables IC(p), CPI(p), and clock rate.

Quantitative assessment
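The final form of the formula can be sketched directly (the function name and the example figures are mine):

```python
def cpu_time(ic, cpi, clock_rate_hz):
    """CPU_TIME(p) = IC(p) * CPI(p) / clock rate."""
    return ic * cpi / clock_rate_hz

# e.g. 10^9 executed instructions at CPI 1.2 on a 500 MHz clock
print(cpu_time(1_000_000_000, 1.2, 500_000_000))  # 2.4 seconds
```

Halving any one of the three factors halves the execution time, which is why the next slides treat all three as equally important.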

Page 27: Advanced Computer Architectures – Part 2.1


Quantitative assessment

• CPU performance is equally dependent

upon three characteristics

Clock rate (the higher, the better)

Clock cycles per instruction (the lower, the better)

Instruction count (the lower, the better)

Page 28: Advanced Computer Architectures – Part 2.1


Quantitative assessment

• CPU performance is equally dependent

upon three characteristics

Clock rate (HW technology & organization)

Clock cycles per instruction

(organization & instruction set architecture)

Instruction count

(instruction set architecture &

compiler technology)

• Note: technologies are not independent of

each other!

Page 29: Advanced Computer Architectures – Part 2.1


Quantitative assessment

CPU time = Seconds / Program = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)

What affects what:

                Inst Count   CPI    Clock Rate
Program             X
Compiler            X        (X)
Inst. Set           X         X
Organization                  X         X
Technology                              X

Page 30: Advanced Computer Architectures – Part 2.1


Quantitative assessment

• Decades long challenge: optimizing

CPUTIME(p) = IC(p) × CPI(p) / clock rate

• This is a function of p!

• The choice of benchmarks is

important

Page 31: Advanced Computer Architectures – Part 2.1


Quantitative assessment

• Which methods to use?

CPUTIME(p) = IC(p) × CPI(p) / clock rate

• Method 1: increasing the clock rate (Note: independent of p!)

• Method 2: decreasing IC(p)

• Method 3: decreasing CPI(p)

• Each method is equally important

• Some methods are more effective than others

Page 32: Advanced Computer Architectures – Part 2.1


Quantitative assessment:

how to calculate CPI?

CPI = ( Σi=1..n CPIi × ICi ) / Instr. Count = Σi=1..n CPIi × ( ICi / Instr. Count )

ICi = number of times instruction i is executed by p

CPIi = average number of clock cycles for instruction i

CPIi needs to be measured, and not just read from a table in the Reference Manual! That is, we need to take into account the memory access time (cache misses do count… a lot)
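The weighted-average formula can be sketched as follows (the function name and the instruction mix are mine; the CPIi values are assumed to be measured, stalls included):

```python
def average_cpi(per_instr):
    """per_instr: list of (IC_i, CPI_i) pairs.
    Returns program-level CPI = sum(CPI_i * IC_i) / sum(IC_i)."""
    total_ic = sum(ic for ic, _ in per_instr)
    return sum(cpi * ic for ic, cpi in per_instr) / total_ic

# hypothetical mix: 80 one-cycle instructions, 20 two-cycle branches
print(average_cpi([(80, 1), (20, 2)]))  # 1.2
```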

Page 33: Advanced Computer Architectures – Part 2.1


Quantitative assessment

• Example 3: 2 alternatives for a conditional branch instruction

A: a CMP that sets a condition code (Z bit) followed by a JZ

B: a single instruction to do CMP and JZ

   Arch. A              Arch. B
   LD  R1, 0            LD  R1, 0
L: INC R1            L: INC R1
   CMP R1, 5            JRZ R1, 5, L
   JZ  L                RET
   RET

We assume that JZ and JRZ take 2 cycles; all the other instructions take 1 cycle

Page 34: Advanced Computer Architectures – Part 2.1


Quantitative assessment

   Arch. A              Arch. B
   LD  R1, 0            LD  R1, 0
L: INC R1            L: INC R1
   CMP R1, 5            JRZ R1, 5, L
   JZ  L                RET
   RET

• 20% of the instructions are c.jumps (instructions such as JZ or JRZ)

• 80% are other instructions

• On A, for each c.jump there is a CMP ⇒ on A, 20% are c.jumps and 20% are CMPs

• 60% are other instructions

Because of the extra complexity in B, the clock of A is faster (CTB = 1.25 × CTA)

Page 35: Advanced Computer Architectures – Part 2.1


Quantitative assessment

• CPIA = ( Σi ICi × cyclesi ) / ICA
= (c.jump fraction)A × cyclesc.jump + (other fraction)A × cyclesother
= 20% × 2 + 80% × 1 = 1.2

• CPUA = ICA × CPIA × CTA = ICA × 1.2 × CTA

• CPIB = ( Σi ICi × cyclesi ) / ICB
= (c.jump fraction)B × cyclesc.jump + (other fraction)B × cyclesother

Page 36: Advanced Computer Architectures – Part 2.1


Quantitative assessment

• Now, on B:

One spares 20% of the instructions (the extra CMPs), hence:

(c.jump fraction)B = 20 / (100 − 20) = 0.25 (25%)

Furthermore, ICB = 0.8 × ICA

• Hence CPIB = 0.25 × 2 + 0.75 × 1 = 1.25

• CPUB = ICB × CPIB × CTB = 0.8 ICA × 1.25 × 1.25 CTA = 1.25 × ICA × CTA

• CPUA = 1.2 × ICA × CTA

So A is faster (for which P?)
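The whole comparison can be replayed numerically; a sketch with my own variable names (CTA and ICA are arbitrary units — only the ratio matters):

```python
ct_a = 1.0                  # clock cycle time of A (arbitrary unit)
ct_b = 1.25 * ct_a          # B's clock is 25% slower

cpi_a = 0.20 * 2 + 0.80 * 1   # = 1.2
cpi_b = 0.25 * 2 + 0.75 * 1   # = 1.25

ic_a = 100.0                # arbitrary instruction count on A
ic_b = 0.8 * ic_a           # B executes no CMPs

cpu_a = ic_a * cpi_a * ct_a   # = 120
cpu_b = ic_b * cpi_b * ct_b   # = 125 -> A is faster
print(cpu_a, cpu_b)
```

The result depends on the assumed instruction mix (20% conditional jumps), hence the slide's closing question: faster "for which program?".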

Page 37: Advanced Computer Architectures – Part 2.1


Performance

• A straightforward enhancement is given

by increasing the clock rate

• The entire program benefits

• Also, independent of the particular

program

• Dependent on the efficiency of the

compiler etc.

Page 38: Advanced Computer Architectures – Part 2.1


Clock Frequency Growth Rate

• 30% per year

[Chart: clock rate in MHz (log scale, 0.1-1,000) vs. year, 1970-2005, for i4004, i8008, i8080, i8086, i80286, i80386, Pentium100, R10000]

Page 39: Advanced Computer Architectures – Part 2.1


Transistor Count Growth Rate

• 100 million transistors on chip in early year 2000.

• Transistor count grows much faster than clock rate

[Chart: transistor count (log scale, 1,000-100,000,000) vs. year, 1970-2005, for i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000]

Page 40: Advanced Computer Architectures – Part 2.1


Performance

• Another important factor for performance

is given by

Memory accesses

I/O (disk accesses)

Page 41: Advanced Computer Architectures – Part 2.1


Memory

• Semiconductor DRAM technology

Density: increase of 60% per year (quadruples in 3 years)

Cycle time: improves far more slowly!

        Capacity        Speed
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   1.4x in 10 years
Disk    2x in 3 years   1.4x in 10 years

Speed increases of memory and I/O have not kept pace with processor speed increases.

Page 42: Advanced Computer Architectures – Part 2.1


Memory size

[Chart: DRAM bits per chip (log scale, 1,000-1,000,000,000) vs. year, 1970-2000]

year   size (Mb)   cycle time
1980   0.0625      250 ns
1983   0.25        220 ns
1986   1           190 ns
1989   4           165 ns
1992   16          145 ns
1996   64          120 ns
2000   256         100 ns

Page 43: Advanced Computer Architectures – Part 2.1


Basic definitions

1. Bandwidth: the rate at which data can be

transferred. Bandwidth is typically measured in

bytes per second.

2. Block size: the amount of data transferred per

request. Block size is typically measured in bytes.

3. Latency: the time between making a request (e.g.

to read or write a block of data) and completing the

request. Latency is typically measured in seconds.

4. Throughput: The number of requests that can be

completed per unit time. Throughput is typically

measured in requests per second.

Page 44: Advanced Computer Architectures – Part 2.1


Memory

• DRAM: main memory of all computers

Commodity chip industry: no company >20% share

Packaged in SIMMs or DIMMs (e.g., 16 DRAMs/SIMM)

• Capacity: 4X / 3 years (60%/year) — Moore's Law

• MB/$: +25%/year

• Latency: −7%/year; Bandwidth: +20%/year (so far)

source: www.pricewatch.com, 5/21/98

SIMM = single in-line memory module, a small circuit board that can hold a group of memory chips (measured in bytes vs. bits); 32-bit path to memory

DIMM = dual in-line memory module; 64-bit path to memory

Page 45: Advanced Computer Architectures – Part 2.1


Processor Limit: DRAM Gap

[Chart: "Moore's Law" — processor vs. DRAM performance (log scale, 1-1,000), 1980-2000; µProc improves 60%/yr, DRAM 7%/yr; the processor-memory performance gap grows 50%/year]

Page 46: Advanced Computer Architectures – Part 2.1


Memory Summary

• DRAM:

rapid improvements in capacity, MB/$, bandwidth;

slow improvement in latency

Processor-memory interface

is a bottleneck to delivered bandwidth

Page 47: Advanced Computer Architectures – Part 2.1


Disk Components

Page 48: Advanced Computer Architectures – Part 2.1


Disk Components: Platters

• Platters: the recording surfaces.

i. 1 to 8 inches in diameter (2.5 to 20 cm).

ii. Stacked on a spindle: typical disks have 1-12

platters.

iii. Data can be stored on one or both surfaces.

iv. Spindle and platters rotate at 3600 - 10000 rpm

(60-165 Hz).

v. Recording density depends on applying a

magnetic film with few defects.

vi. Rotation rate limited by bearings and power

consumption.

Page 49: Advanced Computer Architectures – Part 2.1


Disk Components: Heads

• Heads: write and read data to and from platters.

i. Data stored as presence or absence of

magnetization.

ii. Head “floats” on air-film that rotates with the disk.

Bernoulli effect pulls head toward disk but not into

it. A dust particle can cause a “head crash” where

the disk surface is scratched and any data on it is

lost.

iii. Disk heads are manufactured using thin film

technology. Advancing technology allows smaller

heads and therefore more closely spaced tracks

and bits.

Page 50: Advanced Computer Architectures – Part 2.1


Disk Components: Actuators

• Actuators: move heads radially over the platters.

i. Actuator arm needs to be light to move quickly.

ii. Actuator arm needs to be stiff to prevent flexing.

iii. Smaller platters allow shorter arms: therefore

lighter and stiffer.

iv. Actuators limited by

• power of actuator motor and

• weight and strength of actuator components

Page 51: Advanced Computer Architectures – Part 2.1


Disks: Data Layout

• Each surface consists of concentric rings called

tracks

• The set of tracks that are at the same relative position on each surface forms a cylinder

• Each track is divided into sectors. Data is written to and read from the disk a whole sector at a time

Page 52: Advanced Computer Architectures – Part 2.1


Three Components of Disk Access Time

1. Seek time: the time to move the heads to the

desired cylinder

Advertised to be 8 to 12 ms. May be lower in real life

2. Rotational latency: the time for the desired sector

to arrive under the head

4.1 ms at 7200 RPM and 8.3 ms at 3600 RPM

3. Transfer time: the time to read the data from the

disk and send it over the I/O bus to the processor

2 to 12 MB per second

Response time = Queue time + Controller time + Device service time

[Diagram: Proc → Queue → IOC → Device; the device service time is the disk access time]

Page 53: Advanced Computer Architectures – Part 2.1


Hard Disks

Disk Latency = Queueing Time + Controller Time + Seek Time + Rotation Time + Xfer Time

Order-of-magnitude times for 4 KB transfers: Average Seek: 8 ms or less; Rotate: 4.2 ms @ 7200 rpm; Xfer: 1 ms @ 7200 rpm
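The latency sum above can be sketched as code; the helper name is mine, and the average rotational latency is taken as half a revolution:

```python
def disk_access_ms(seek_ms, rpm, transfer_ms, queue_ms=0.0, ctrl_ms=0.0):
    """Average access time: queueing + controller + seek +
    average rotational latency (half a revolution) + transfer."""
    rotation_ms = 0.5 * 60_000.0 / rpm   # half a revolution, in ms
    return queue_ms + ctrl_ms + seek_ms + rotation_ms + transfer_ms

# the slide's order-of-magnitude figures for a 4 KB transfer
print(round(disk_access_ms(8.0, 7200, 1.0), 1))  # ~13.2 ms
```

At 7200 rpm a full revolution takes 60,000 / 7200 ≈ 8.3 ms, hence the 4.2 ms average rotational latency quoted on the slide.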

Page 54: Advanced Computer Architectures – Part 2.1


Hard Disks

• Capacity: +60%/year (2X / 1.5 yrs)

• Transfer rate (BW): +40%/year (2X / 2.0 yrs)

• Rotation + Seek time: −8%/year (1/2 in 10 yrs)

• MB/$: >60%/year (2X / <1.5 yrs)

Latency = Queuing Time + Controller Time + Seek Time + Rotation Time (per access) + Size / Bandwidth (per byte)

source: Ed Grochowski, 1996, "IBM leadership in disk drive technology"; www.storage.ibm.com/storage/technolo/grochows/grocho01.htm

Page 55: Advanced Computer Architectures – Part 2.1


Hard disks

1973: 1.7 Mbit/sq. in; 140 MBytes

1979: 7.7 Mbit/sq. in; 2,300 MBytes

Page 56: Advanced Computer Architectures – Part 2.1


Hard Disks

[Chart: areal density (Mbit/sq. in, log scale 1-10,000) vs. year, 1970-2000]

1989: 63 Mbit/sq. in; 60,000 MBytes

1997: 1450 Mbit/sq. in; 1600 MBytes

1997: 3090 Mbit/sq. in; 8100 MBytes

Page 57: Advanced Computer Architectures – Part 2.1


Hard Disks

• Continued advance in capacity (60%/yr)

and bandwidth (40%/yr.)

• Slow improvement in seek, rotation

(8%/yr)

• Time to read the whole disk:

Year   Sequentially   Randomly
1990   4 minutes      6 hours
2000   12 minutes     1 week

Page 58: Advanced Computer Architectures – Part 2.1


Memory/Disk Summary

• Memory:

DRAM rapid improvements in capacity, MB/$,

bandwidth; slow improvement in latency

• Disk:

Continued advance in capacity, cost/bit,

bandwidth; slow improvement in seek,

rotation

• Huge gap between CPU and external

memories

• How to address this problem?

• Classical way: memory hierarchies

Page 59: Advanced Computer Architectures – Part 2.1


Memory hierarchies

• Axiom of the HW designer: smaller is faster

Larger memories ⇒ larger signal delay

More levels are required to encode addresses

In a smaller memory the designer can use more power per cell ⇒ shorter access times

• Crucial features for performance

Huge bandwidth (in MB/sec)

Short access times

• Principle of locality

The data most recently used is very likely to be accessed again in the near future (temporal locality)

Memory cells close to the most recently used one are likely to be accessed in the near future (spatial locality)

• Combining the above with Amdahl's law, the "best" enhancement is using hierarchies of memories

Page 60: Advanced Computer Architectures – Part 2.1


Typical memory hierarchy ('95)

CPU registers – Cache – (memory bus) – Memory – (I/O bus) – I/O devices

Size:   200 B   64 KB   32 MB    2 GB
Speed:  5 ns    10 ns   100 ns   5 ms

Page 61: Advanced Computer Architectures – Part 2.1


Memory hierarchies

[Diagram: course map relating the Instruction Set Architecture (addressing, protection, exception handling) to Pipelining and Instruction Level Parallelism (hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP), the Memory Hierarchy (L1 cache, L2 cache, DRAM; coherence, bandwidth, latency; interleaving, bus protocols, emerging technologies, VLSI), and Input/Output and Storage (disks, WORM, tape, RAID)]

Page 62: Advanced Computer Architectures – Part 2.1


Memory hierarchies

• Registers: smallest and fastest memory

• Size: less than 1KB

• Access time: 2-5 ns

• Bandwidth: 4000-32000 MB/sec

• Managed by the compiler (or the

assembly programmer)

register int a;

• Special purpose vs. general purpose

• Monolithic or double-shaped

Rx = Rl + Rh

• Backed in cache

• Implemented via custom memory with

multiple ports

Page 63: Advanced Computer Architectures – Part 2.1


Memory hierarchies

• Cache = small, fast memory located close

to the CPU

• The cache holds the most recently

accessed code or data

Managed by HW

No way to tell “put these data in cache” at SW

New research: cache-conscious data

structures

• Size: less than 4 MB

• Access time: 3-10 ns

• Bandwidth: 800-5000 MB/sec

• Backed in main memory

• Implemented with (on- or off-chip) CMOS

SRAM

Page 64: Advanced Computer Architectures – Part 2.1


• Cache terminology: cache hit, cache miss, cache block

Cache hit: the CPU has been able to find the requested data in the cache

Cache miss: the opposite of a cache hit — the requested data is not in the cache

Cache block: the fixed-size buffer used to load a portion of memory into the cache

• A cache miss blocks the CPU until the corresponding memory block gets cached

Memory hierarchies

Page 65: Advanced Computer Architectures – Part 2.1


Memory hierarchies

• Virtual memory: same principles behind

the use of cache, but implemented

between main memory and disk storage

• At any point in time, not all the data

referenced by p need to be in main

memory

• Address space is partitioned into fixed-size blocks: pages

• A page is either in memory or on disk

• When the CPU references an item within a page:

if ( Check_if_in_cache() == CACHE_MISS )
    if ( Check_if_in_memory() == MEM_MISS )
        PageFault(); // loads the page into memory

The CPU doesn't stall — it switches to other tasks

Page 66: Advanced Computer Architectures – Part 2.1


Cache performance

• Example: speedup using a cache

Cache 10 times faster than main memory

Cache is used 90% of the cases

speedup(overall) = 1 / ((1 − fraction(enhanced)) + fraction(enhanced) / speedup(enhanced))

= 1 / ((1 − 0.9) + 0.9 / 10) = 1 / 0.19 ≈ 5.26
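The speedup computed above can be checked with a short function; a minimal sketch (the function name is my own, not from the course):

```c
#include <assert.h>

/* Amdahl's law: overall speedup when `fraction` of the execution
 * time is accelerated by a factor of `factor`. */
static double overall_speedup(double fraction, double factor)
{
    return 1.0 / ((1.0 - fraction) + fraction / factor);
}
```

With fraction = 0.9 and factor = 10, `overall_speedup` returns about 5.26, matching the slide's figure.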

Page 67: Advanced Computer Architectures – Part 2.1


Cache performance

CPU time = (CPU clock cycles + memory stall cycles) × clock cycle time

Memory stall cycles = #(misses) × miss penalty

= IC × #(misses per instruction) × miss penalty

= IC × #(memory references per instr.) × miss rate × miss penalty

Page 68: Advanced Computer Architectures – Part 2.1


Cache performance

• Example (P&H, p. 43)

A computer has CPI = 2 when all data is in cache

Memory access is only required by load and store instructions (40% of the total instruction count)

Miss penalty = 25 clock cycles

Cache miss rate = 2%

? How much faster would the machine be if no cache miss ever occurred?

CPUall-hit = (CPU clock cycles + memory stall cycles) × clock cycle time

= (IC × CPI + 0) × clock cycle time

= IC × 2 × clock cycle time

Page 69: Advanced Computer Architectures – Part 2.1


Cache performance

? How fast is the machine when cache misses do occur?

1. Compute the memory stall cycles (msc)

msc = IC × memory references per instruction × miss rate × miss penalty

= IC × (1 + 0.4) × 0.02 × 25 (1 for the instruction access, 0.4 for data accesses)

= IC × 0.7

2. Compute total performance:

CPUcache = (CPU clock cycles + msc) × clock cycle time

= (IC × 2 + IC × 0.7) × clock cycle time

= 2.7 × IC × clock cycle time

→ the machine with no misses is 2.7 / 2 = 1.35 times faster
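The effective CPI in the P&H example can be reproduced with a one-line helper; a sketch (names and parameterization are my own):

```c
#include <assert.h>

/* Effective CPI = base CPI + memory stall cycles per instruction,
 * where stalls = memory references per instr. x miss rate x miss penalty. */
static double effective_cpi(double base_cpi,
                            double mem_refs_per_instr,
                            double miss_rate,
                            double miss_penalty)
{
    return base_cpi + mem_refs_per_instr * miss_rate * miss_penalty;
}
```

Plugging in the example's numbers, `effective_cpi(2.0, 1.4, 0.02, 25.0)` gives 2.7, i.e. a 2.7/2 = 1.35 slowdown relative to the all-hits machine.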

Page 70: Advanced Computer Architectures – Part 2.1


Computer Design

• Quantitative assessments

Instruction sets

• Pipelining

Page 71: Advanced Computer Architectures – Part 2.1


Computer design

• Instruction-set architecture:

The architecture of the machine level

The boundary between SW and HW

• Organization:

High level aspects: memory system, bus

structure, internal CPU design

• Hardware:

The specifics of a machine: detailed logic

design, packaging technology…

• Architecture = I + O + H

Page 72: Advanced Computer Architectures – Part 2.1


Instruction Sets

• IS = Instruction sets = The architecture of

the machine language

• IS Classification

• Roles of the compilers

• DLX

Page 73: Advanced Computer Architectures – Part 2.1


Computer Design IS

IS Classification

• Role of the compilers

• DLX

Page 74: Advanced Computer Architectures – Part 2.1


Computer Design IS

IS Classification

• Key: type of internal storage in the CPU

• Three main classes

Stack architectures

Accumulator architectures

General-purpose register architectures

Page 75: Advanced Computer Architectures – Part 2.1


Computer Design IS

IS Classification Stack A.

• Stack architecture:

• Operands are implicitly referred to

• Top two items on the system stack

• Example: C = A + B

1. PUSH A (stack: A)

2. PUSH B (stack: A B)

3. ADD

ADD = PUSH (POP + POP)

Page 76: Advanced Computer Architectures – Part 2.1


Computer Design IS

IS Classification Stack A.

• Stack architecture:

• Operands are implicitly referred to

• Top two items on the system stack

• Example: C = A + B

1. PUSH A (stack: A)

2. PUSH B

3. ADD

ADD = PUSH (POP + POP)

ADD = PUSH (B + POP)

Page 77: Advanced Computer Architectures – Part 2.1


Computer Design IS

IS Classification Stack A.

• Stack architecture:

• Operands are implicitly referred to

• Top two items on the system stack

• Example: C = A + B

1. PUSH A

2. PUSH B

3. ADD

ADD = PUSH (POP + POP)

ADD = PUSH (B + POP)

ADD = PUSH (B + A)

(stack: B+A)

Page 78: Advanced Computer Architectures – Part 2.1


Computer Design IS

IS Classification Stack A.

• Stack architecture:

• Operands are implicitly referred to

• Top two items on the system stack

• Example: C = A + B

1. PUSH A

2. PUSH B

3. ADD

C = TOP STACK = A+B

4. POP C

An example: the ARIEL virtual machine (Part 1, Slides 91 –)
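The PUSH/PUSH/ADD/POP sequence above can be sketched as a tiny stack machine in C (a didactic sketch with my own names, not the ARIEL implementation):

```c
#include <assert.h>

/* A tiny stack machine: PUSH A; PUSH B; ADD; POP C computes C = A + B,
 * with both ADD operands implicitly taken from the top of the stack. */
#define STACK_MAX 16

static int stack[STACK_MAX];
static int sp = 0;                 /* index of the next free slot */

static void push(int v) { stack[sp++] = v; }
static int  pop(void)   { return stack[--sp]; }

/* ADD = PUSH (POP + POP) */
static void add(void)
{
    int b = pop();
    int a = pop();
    push(a + b);
}
```

Evaluating C = A + B then reads: `push(A); push(B); add(); C = pop();`.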

Page 79: Advanced Computer Architectures – Part 2.1


Computer Design IS

IS Classification Accumulator A.

• Accumulator Architectures

• A special register (the accumulator)

plays the role of an implicit argument

• Example: C = A + B

1. LOAD A ; let Acml = A

2. ADD B ; let Acml = Acml + B

3. STORE C ; let C = Acml

Page 80: Advanced Computer Architectures – Part 2.1


Computer Design IS

IS Classification Register A.

• General-purpose Register Architecture

• Explicit operands only

• Either registers or memory locations

• Two flavors:

Register-memory architectures (RMA)

Register-register architectures (RRA)

• Example: C = A + B

RMA: Load R1, A

Add R1, B ; in C, R1 += B

Store C, R1

RRA: Load R1, A

Load R2, B

Add R3, R1, R2

Store C, R3

Page 81: Advanced Computer Architectures – Part 2.1


Computer Design IS

IS Classification RRA

• Some old machines used stack or

accumulator architectures

For instance, T800 and 6502/6510

• Today the de facto standard is RRA

Regs are fast

Regs are easier to use (compiler writers)

Do not require dealing with associativity issues

Stacks do!

Regs can hold variables

register int I;

for (I=0; I<1000000;I++)

{ do-stgh(I); … }

Using regs you don’t need a memory address

Page 82: Advanced Computer Architectures – Part 2.1


Computer Design IS

IS Classification Register A.

• RRA: no memory operands

All instructions are similar in size -> take

similar number of clocks to execute (very

useful property… see later)

No side effects

Higher instruction count

• RMA: one memory operand

One load can be spared

A register operand is destroyed ( R += B )

Clocks per instruction varies by operand

location

• Memory-memory:

Compact

Large variation of work per instruction

Large variation in instruction size

Page 83: Advanced Computer Architectures – Part 2.1


Computer Design IS

Memory addressing

• How is memory organized?

• What does it mean, e.g., read memory at

address 512?

• What do we read?

Bytes, half words, words, double words

• How are consecutive bytes stored in a

word? (Assumption: word is 4 bytes)

Little endian: &word = &LSB

Big endian: &word = &MSB

XDR routines are needed to exchange data between machines of different byte order

(&word denotes the address of word)

Page 84: Advanced Computer Architectures – Part 2.1


A memory model for didactics

• Memory can be thought of as a finite, long array of cells, each 1 byte in size

0 1 2 3 4 5 6 7 …

• Each cell has a label, called its address, and a content, i.e. the byte stored in it

• Think of a chest of drawers, with a label on each drawer and possibly something in it

Page 85: Advanced Computer Architectures – Part 2.1


(figure: drawers labelled with the addresses 1–4, each with an Address label and a Content)

A memory model for didactics

Page 86: Advanced Computer Architectures – Part 2.1


• The character * has a special meaning

• It refers to the contents of a cell

A memory model for didactics

• For instance:

*(1)

This notation means we’re inspecting the contents of a cell (we open a drawer and see what’s in it)

Page 87: Advanced Computer Architectures – Part 2.1


• The character * has a special meaning

• It refers to the contents of a cell

A memory model for didactics

• For instance:

*(1)

Used on the left-hand side of an assignment, the notation means we’re writing new contents into a cell (we open a drawer and change its contents)
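The `*(address)` notation of this memory model maps directly onto pointer dereferencing in C; a minimal sketch (the `peek`/`poke` names are my own):

```c
#include <assert.h>

static unsigned char memory[8];    /* the "chest of drawers" */

/* inspect the contents of the cell at the given address: *(address) */
static unsigned char peek(int address)
{
    return memory[address];
}

/* write new contents into the cell: *(address) = value */
static void poke(int address, unsigned char value)
{
    memory[address] = value;
}
```

For instance, `poke(1, 42)` followed by `peek(1)` opens drawer 1, changes its contents, then reads them back.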

Page 88: Advanced Computer Architectures – Part 2.1


A memory model for didactics

• Memory is (often) byte addressable,

though it is organized into small groups of

bytes: the machine word

• A common size for the machine word is 4

bytes (32 bits)

• Two possible organizations for the bytes

in a word

Little endian

Big endian

Page 89: Advanced Computer Architectures – Part 2.1


Little endian versus Big endian

Big endian (Motorola): the MSB of a word is stored at the lowest address

word at addresses 0–3: byte 0 = MSB … byte 3 = LSB

Little endian (Intel): the LSB of a word is stored at the lowest address

word at addresses 0–3: byte 3 = MSB … byte 0 = LSB

Page 90: Advanced Computer Architectures – Part 2.1


Little endian versus Big endian

Big endian (Motorola) stores the MSB at the lowest address; little endian (Intel) stores the LSB there

Problem: communication between the two

The same four bytes are interpreted as different numbers:

bytes 00 00 00 01 → big endian: 1; little endian: 16777216 (0x01000000)

bytes 10 00 00 00 → big endian: 268435456 (0x10000000); little endian: 16

So the stored bytes are the same, though they are interpreted as if they were different values
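The "same bytes, different value" effect can be demonstrated directly; a sketch (the function names are my own):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Interpret four stored bytes as a 32-bit integer using the byte
 * order of the machine running the code. */
static uint32_t word_from_bytes(const unsigned char bytes[4])
{
    uint32_t w;
    memcpy(&w, bytes, 4);
    return w;
}

/* 1 if this machine is little endian, 0 if it is big endian */
static int is_little_endian(void)
{
    const unsigned char one[4] = { 0x01, 0x00, 0x00, 0x00 };
    return word_from_bytes(one) == 1u;
}
```

The byte sequence 01 00 00 00 reads back as 1 on a little-endian machine but as 16777216 on a big-endian one, which is why XDR-style conversion routines are needed when the two exchange binary data.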

Page 91: Advanced Computer Architectures – Part 2.1


Computer Design IS

Memory addressing

• Alignment is mandatory on some machines

Object O; int t = sizeof(O);

ALIGNED(O) means &O modulo t is 0 (“access to O is aligned”)

For instance, if access to integers (4 bytes) is aligned, then an integer can only be stored at addresses divisible by 4

Alignment is sometimes required because it prevents hardware complications

Aligned accesses are faster
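The ALIGNED(O) predicate above is easy to express in C; a minimal sketch (the helper name is my own):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ALIGNED(addr, size): the address modulo the object's size is 0 */
static int is_aligned(const void *addr, size_t size)
{
    return ((uintptr_t)addr % size) == 0;
}
```

For a 4-byte integer, `is_aligned` accepts addresses divisible by 4 (e.g. 0x1000) and rejects 0x1001, mirroring the slide's example.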

Page 92: Advanced Computer Architectures – Part 2.1


Computer Design IS

Memory addressing

• Addressing modes: ways to specify the

address of an object in memory

• An addressing mode can specify

A constant

A register

A memory location

In what follows,

A += B means A = A + B

* (x) means return the contents of memory at

address x

x++ means “at the end, let x = x + 1”

--x means “at the beginning, let x = x – 1”

Rx means register x

Page 93: Advanced Computer Architectures – Part 2.1


Computer Design IS

Memory addressing

Mode           Example              Meaning

Register       Add R4, R3           R4 += R3
Immediate      Add R4, #3           R4 += 3
Displacement   Add R4, 100(R1)      R4 += *(100 + R1)
Indirect       Add R4, (R1)         R4 += *(R1)
Indexed        Add R4, (R1 + R2)    R4 += *(R1 + R2)
Absolute       Add R4, (100)        R4 += *(100)
Deferred       Add R4, @(R3)        R4 += *(*(R3))
Autoincrement  Add R4, (R3)+        indirect, then R3++
Autodecrement  Add R4, -(R2)        R2--, then indirect
Scaled         Add R4, 100(R2)[R3]  R4 += *(100 + R2 + R3 × d)

d = size of the addressed data (1, 2, 4, 8, or 16)
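The Meaning column maps naturally onto C array indexing; a sketch simulating memory and registers as arrays for three of the modes (the names are my own):

```c
#include <assert.h>

/* simulate memory and the register file as plain arrays */
static long mem[16];
static long reg[8];

/* Displacement: Rd += *(offset + Rbase) */
static void add_displacement(int rd, int offset, int rbase)
{
    reg[rd] += mem[offset + reg[rbase]];
}

/* Indirect: Rd += *(Rbase) */
static void add_indirect(int rd, int rbase)
{
    reg[rd] += mem[reg[rbase]];
}

/* Deferred (memory indirect): Rd += *(*(Rbase)) */
static void add_deferred(int rd, int rbase)
{
    reg[rd] += mem[mem[reg[rbase]]];
}
```

Each extra level of indirection costs one more memory access, which is one reason complex modes can raise CPI.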

Page 94: Advanced Computer Architectures – Part 2.1


Computer Design IS

Memory addressing

• Addressing modes can reduce IC

• Complex addressing modes increase the complexity of the hardware → this can increase CPI

• Displacement, immediate and deferred account for between 75% and 99% of the addressing modes used (measurements with TeX, spice, and gcc)

• IC(p) = number of instructions that the CPU executes while running program p

• CPI(p) = clock cycles per instruction = #CC(p) / IC(p), the average number of clock cycles needed to execute one instruction of p

Page 95: Advanced Computer Architectures – Part 2.1


Computer Design IS

Operations

• Arithmetical and logical (add, and, sub...)

• Data transfer (move, store)

• Control (br, jmp, call, ret, iret…)

• System (virtual memory mngt…)

• Floating point (add, mul, …)

• Decimal (decimal add, decimal mul…)

• String (str move, str cmp, str search)

• Graphics (pixel operations)

• Benchmarks show that often a small set of simple instructions accounts for something like 95% of the instructions executed

(see Fig. 2.11, P&H p. 81)

Page 96: Advanced Computer Architectures – Part 2.1


Computer Design IS

Operations

• Control Flow Instructions

Branch (conditional change)

Jump (unconditional change)

Procedure calls

Procedure returns

• Most of the comparisons in conditional branches are simple “==” or “!=” tests against 0!

• In some cases, the address to go to

is only known at run-time

“Return” uses a stack

Switch statements

Dynamic libraries

Page 97: Advanced Computer Architectures – Part 2.1


Computer Design IS

Operands

• When we say, e.g.,

“Add R1, #5”

do we work with bytes? Half-words?

Words?

• How do we specify the type of the

operand?

1. Classical method: the type of the operand is part of the opcode

• The Add family is coded as ffff…fffvv, where the f are fixed bits and the vv bits specify the operand type

Page 98: Advanced Computer Architectures – Part 2.1


Computer Design IS

Operands and types

• 1011010100010000 = Add float words

1011010100010001 = Add words

1011010100010010 = Add half-words

1011010100010011 = Add bytes

• Example: Add family =

10110101000100vv

• Old fashioned method:

operand = data + tag

• Tag describes a type

• Tag is interpreted by HW

• Operation is chosen accordingly

Page 99: Advanced Computer Architectures – Part 2.1


Computer Design IS

Operands and types

• Which types to support?

• Old-fashioned solution: all (bytes, half-words, words, f.p., double words, double-precision f.p., …)

• Current trend: Only operations on items

greater than or equal to 32 bits

• On the DEC Alpha one needs multiple

instructions to access objects smaller

than 32 bits

Page 100: Advanced Computer Architectures – Part 2.1


Computer Design IS

Operands and types

• Floating point numbers:

IEEE standard 754

• In the early ’80s, each manufacturer had its own f.p. representation

• Sometimes string operations are available

(strcmp, strcpy…)

• Sometimes BCD is used to code numbers

Four bits are used to code a decimal digit

A byte codes two decimal digits

Functions for “packing” and “unpacking” are

required

It is unclear if this will stay in the future
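The packing and unpacking the slide mentions are simple bit operations; a sketch of packed BCD (one byte holding two decimal digits; the function names are my own):

```c
#include <assert.h>

/* Packed BCD: one byte codes two decimal digits,
 * the high nibble and the low nibble. */
static unsigned char bcd_pack(unsigned hi, unsigned lo)
{
    return (unsigned char)((hi << 4) | lo);   /* hi, lo must be in 0..9 */
}

static unsigned bcd_hi(unsigned char b) { return b >> 4;  }
static unsigned bcd_lo(unsigned char b) { return b & 0xF; }
```

For example, the decimal number 42 packs into the single byte 0x42, which is why BCD values print so conveniently in hex dumps.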

Page 101: Advanced Computer Architectures – Part 2.1


Computer Design IS

• IS Classification

Role of the compilers

• DLX

Page 102: Advanced Computer Architectures – Part 2.1


Computer Design IS

Role of the compiler

• In the past, the role of Assembly language

was crucial

• Architectural decisions aimed at easing

assembly language programming

• Now, the user interface is a high level

language (C, C++, Java…)

• The user interfaces the machine via the

HLL, though the machine actually

executes some lower level code

• This lower level code is produced by a

compiler

The role of the compiler is fundamental

The IS architecture needs to take the

compiler into strong account

Page 103: Advanced Computer Architectures – Part 2.1


Computer Design IS

Role of the compiler

• Goals of the compiler writer

Correctness

Performance

…Fast compilation, debugging support, …

• Strategy for writing a compiler

Use a number of “passes”

From high level structures down to

lower levels, until machine level

This way complexity is decomposed into smaller blocks

But optimizing becomes more difficult

Page 104: Advanced Computer Architectures – Part 2.1


Computer Design IS

Role of the compiler

Pass            Dependencies   Function

Front-end       D(language)    transform the language into a common intermediate form
HL Opt          D(language)    loop transformations, function inlining…
Global Opt      D(machine)     global optimizations, register allocation…
Code generator  D(machine)     instruction selection, machine-dependent optimizations

Page 105: Advanced Computer Architectures – Part 2.1


Computer Design IS

Role of the compiler

• HL Optimizations: source-level optimizations (code → code’)

• Local optimizations: basic block

optimizations

• Global optimizations: loop optimization

and basic blocks optimizations

• Machine-dependent optimization: using

low level architectural knowledge

• Basic Block = a straight-line code fragment

Page 106: Advanced Computer Architectures – Part 2.1


Computer Design IS

Role of the compiler

• Compilers have different optimization

levels

-O1 .. -On

• Optimization can have a big impact on instruction count → and hence on performance

Page 107: Advanced Computer Architectures – Part 2.1


Computer Design IS

Role of the compiler

Page 108: Advanced Computer Architectures – Part 2.1


Computer Design IS

Role of the compiler

• In some cases, though, optimization may

be counterproductive!

• This happens because there might be

conflicts between local and global

optimization tasks

• Example:

a = sqrt(x*x + y*y) + f()… ;

b = sqrt(x*x + y*y) + g()…;

SAME EXPRESSION

• Idea:

tmp = sqrt(x*x + y*y);

a = tmp + f() …;

b = tmp + g() …;

Page 109: Advanced Computer Architectures – Part 2.1


Computer Design IS

Role of the compiler

• Effective, but only if tmp can be stored in

a register

• No free register → tmp lives in memory → cache misses → … bad performance

• The problem is:

When the compiler performs code transformations like the one in the example, it does not yet know whether a register will actually be available

This only becomes clear later (at the global optimization level)

• (Phase ordering problem)

Page 110: Advanced Computer Architectures – Part 2.1


Computer Design IS

Role of the compiler

• Key resource is the register file

• “Intelligent” register allocation

techniques are a must

• Current solution: graph coloring (build a graph of the candidates for allocation to a register)

• NP-complete, though effective heuristic

algorithms exist

Page 111: Advanced Computer Architectures – Part 2.1


Computer Design IS

Role of the compiler

• A special class of compilers – Algorithm-

driven software generation

FFTW approach: Software generation system

based on symbolic computation

Objective Caml

Sort of FFT compiler that generates optimal C

code via symbolic computing

Possible future steps (project works, theses…):

Extending the approach going down to code

generation for, e.g., the TI ‘C67 DSP and other

VLIW CPUs

Page 112: Advanced Computer Architectures – Part 2.1


Exam of 16 Jan 2002

• A program is composed of three classes of instructions: i1 (integer instructions), i2 (load-store instructions), and i3 (floating point instructions)

• The three classes are responsible for r1 = 60%, r2 = 30% and r3 = 10% of the overall execution time, respectively

• You can choose between three levels of optimisation on your computer: O1, O2, and O3. O1 optimises i1, O2 optimises i2, and O3 optimises i3

• The corresponding enhancements would be e1 = 2, e2 = 3, e3 = 10

• Suppose you can only choose one of the three levels of optimisation. Which one would you choose? Justify your choice

Page 113: Advanced Computer Architectures – Part 2.1


Solution

• r1 = 60%, e1 = 2

r2 = 30%, e2 = 3

r3 = 10%, e3 = 10

• S = Exec-timeOLD / Exec-timeNEW = 1 / ((1 − r) + r / e)

• s1 = 1 / (0.4 + 0.6/2) = 1.42857

s2 = 1 / (0.7 + 0.3/3) = 1.25

s3 = 1 / (0.9 + 0.1/10) = 1.0989

→ O1 gives the largest overall speedup, so it is the best choice
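The three speedups and the resulting choice can be checked mechanically; a sketch (the function names are my own):

```c
#include <assert.h>

/* Amdahl's law for the exam question: S = 1 / ((1 - r) + r / e) */
static double speedup(double r, double e)
{
    return 1.0 / ((1.0 - r) + r / e);
}

/* index (1..3) of the optimisation level giving the largest speedup */
static int best_choice(void)
{
    double s1 = speedup(0.60, 2.0);
    double s2 = speedup(0.30, 3.0);
    double s3 = speedup(0.10, 10.0);

    if (s1 >= s2 && s1 >= s3) return 1;
    return (s2 >= s3) ? 2 : 3;
}
```

`best_choice()` returns 1: despite O3's large per-class enhancement (e3 = 10), O1 wins because it targets the class responsible for most of the execution time.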