Computer Architecture 2009 – Introduction 1 MAMAS – Computer Architecture 234267 Lecturer: Dr....

MAMAS – Computer Architecture 234267

Lecturer: Dr. Lihu Rappoport

Some of the slides were taken from Avi Mendelson, Randi Katz, Patterson, Gabriel Loh


General Course Information

Grade
  20% Exercise (mandatory, binding)
  80% Final exam
  No midterm exam

Textbooks
  Computer Architecture: A Quantitative Approach, Hennessy & Patterson

Other course information
  Course web site: http://webcourse.cs.technion.ac.il/234267
  Foils will be on the web several days before the class


Lecturer details

Name: Lihu Rappoport
Phone: 04-865-1554
Email: [email protected]


Class Focus

CPU
  Introduction: performance, instruction set (RISC vs. CISC)
  Pipeline, hazards
  Branch prediction
  Out-of-order execution

Memory Hierarchy
  Cache
  Main memory
  Virtual Memory

Advanced Topics

PC Architecture
  Motherboard & chipset, DRAM, I/O, Disk, peripherals


Computer System Structure

[Figure: block diagram of a PC system. Labeled components: CPU with cache; CPU bus; North Bridge with memory controller and on-board graphics; memory bus to DDRII channels 1 and 2; PCI express ×16 to an external graphics card; South Bridge; PCI; PCI express ×1; IDE controller; SATA controller; USB controller; IO controller with parallel port, serial port, floppy drive, keyboard and mouse; sound card and speakers; LAN adapter and LAN; DVD drive; hard disk.]


Architecture & Microarchitecture

Architecture
  The processor features seen by the “user”
  Instruction set, addressing modes, data width, …

Micro-architecture
  The way a processor is implemented
  Cache size and structure, number of execution units, …
  Timing is considered uArch (though it is user visible)

Processors with different uArch can support the same Architecture


Compatibility

Backward compatibility
  New hardware can run existing software
    • Core2 Duo can run SW written for Pentium4, PentiumM, Pentium III, Pentium II, Pentium, 486, 386, 286

Forward compatibility
  New software can run on existing hardware
  Example: new software written with SSE2™ runs on an older processor which does not support SSE2™
  Commonly supports one or two generations behind

Architecture independent SW
  JIT – just in time compiler: Java and .NET
  Binary translation


Performance


Technology Trends and Performance

Computing capacity: 4× per 3 years
  If we could keep all the transistors busy all the time
  Actual: 3.3× per 3 years

Moore’s Law: Performance is doubled every ~18 months
  Trend is slowing: process scaling declines, power is up

[Figure: two log-scale charts, Speed and Capacity, each comparing Logic and DRAM over time.]

              Logic            DRAM
  Speed       2× in 3 years    1.1× in 3 years
  Capacity    2× in 3 years    4× in 3 years

CPU speed and Memory speed grow apart
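To get a feel for how quickly the two speed curves diverge, the per-3-year factors can be compounded over a longer period. The following is a minimal sketch, assuming the growth rates stay constant; the 12-year horizon and the helper name `growth` are illustrative choices and do not come from the slides.

```python
def growth(factor_per_3_years, years):
    """Compound growth after `years`, given a multiplicative factor per 3 years."""
    return factor_per_3_years ** (years / 3.0)


# Assumed horizon of 12 years (four 3-year steps); rates taken from the chart labels.
years = 12
logic_speed = growth(2.0, years)   # logic speed: 2x in 3 years
dram_speed = growth(1.1, years)    # DRAM speed: 1.1x in 3 years

print(f"Logic speed after {years} years: {logic_speed:.2f}x")
print(f"DRAM speed after {years} years:  {dram_speed:.2f}x")
print(f"Gap (logic / DRAM):              {logic_speed / dram_speed:.2f}x")
```

Under these assumptions logic gets 16× faster while DRAM gets roughly 1.46× faster, which is one way to see why the memory hierarchy takes up a large part of the course.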


Moore’s Law

Graph taken from: http://www.intel.com/technology/mooreslaw/index.htm


CPI – Cycles Per Instruction

CPUs work according to a clock signal
  Clock cycle is measured in nsec ($10^{-9}$ of a second)
  Clock frequency (= 1/clock cycle) is measured in GHz ($10^{9}$ cycles/sec)

Instruction Count (IC)
  Total number of instructions executed in the program

CPI – Cycles Per Instruction
  Average #cycles per instruction (in a given program)

$$\mathrm{CPI} = \frac{\#\text{cycles required to execute the program}}{\mathrm{IC}}$$

  IPC (= 1/CPI): Instructions per cycle


CPU Time

CPU Time – time required to execute a program

$$\text{CPU Time} = \mathrm{IC} \times \mathrm{CPI} \times \text{clock cycle}$$

Our goal: minimize CPU Time
  Minimize clock cycle: more GHz (process, circuit, uArch)
  Minimize CPI: uArch (e.g.: more execution units)
  Minimize IC: architecture (e.g.: SSE™)
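The CPU Time relation can be checked with a few lines of arithmetic. This is a small sketch; the instruction count, CPI and clock frequency below are made-up illustrative values, and `cpu_time_seconds` is a hypothetical helper name.

```python
def cpu_time_seconds(ic, cpi, clock_hz):
    """CPU Time = IC x CPI x clock cycle, where clock cycle = 1 / clock frequency."""
    clock_cycle = 1.0 / clock_hz
    return ic * cpi * clock_cycle


# Illustrative values only: 1e9 instructions, average CPI of 1.5, 3 GHz clock.
baseline = cpu_time_seconds(ic=1e9, cpi=1.5, clock_hz=3e9)

# Each of the three levers reduces CPU Time independently.
faster_clock = cpu_time_seconds(ic=1e9, cpi=1.5, clock_hz=4e9)   # shorter clock cycle
lower_cpi = cpu_time_seconds(ic=1e9, cpi=1.0, clock_hz=3e9)      # better uArch
fewer_instr = cpu_time_seconds(ic=0.8e9, cpi=1.5, clock_hz=3e9)  # richer ISA

print(baseline, faster_clock, lower_cpi, fewer_instr)
```

In practice the three factors are not independent: a change that lowers IC, for example, may raise CPI or lengthen the clock cycle, so the product is what should be compared.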


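Amdahl’s Law

Suppose enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected, then:

$$\mathrm{ExTime}_{new} = \mathrm{ExTime}_{old} \times \left[ (1 - \mathrm{Fraction}_{enhanced}) + \frac{\mathrm{Fraction}_{enhanced}}{\mathrm{Speedup}_{enhanced}} \right]$$

$$\mathrm{Speedup}_{overall} = \frac{\mathrm{ExTime}_{old}}{\mathrm{ExTime}_{new}} = \frac{1}{(1 - \mathrm{Fraction}_{enhanced}) + \dfrac{\mathrm{Fraction}_{enhanced}}{\mathrm{Speedup}_{enhanced}}}$$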


Amdahl’s Law: Example

• Floating point instructions improved to run at 2×, but only 10% of executed instructions are FP

$$\mathrm{ExTime}_{new} = \mathrm{ExTime}_{old} \times (0.9 + 0.1 / 2) = 0.95 \times \mathrm{ExTime}_{old}$$

$$\mathrm{Speedup}_{overall} = \frac{1}{0.95} = 1.053$$

Corollary: Make The Common Case Fast
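The floating point example can be reproduced directly from Amdahl's Law. The sketch below uses a hypothetical helper, `amdahl_speedup`, and also shows the upper bound on the overall speedup when the enhanced part becomes arbitrarily fast.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when `fraction_enhanced` of the task runs `speedup_enhanced` times faster."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# The slide's example: 10% of executed instructions are FP, and FP runs 2x faster.
print(round(amdahl_speedup(0.10, 2.0), 3))

# Even an (almost) infinitely fast FP unit cannot beat 1 / (1 - 0.10).
print(round(amdahl_speedup(0.10, 1e12), 3))

# Speeding up a common case pays off far more for the same factor of 2.
print(round(amdahl_speedup(0.90, 2.0), 3))
```

The last call is an assumed contrast case (90% of the task enhanced) and is not from the slides; it illustrates the corollary that the common case is the one worth making fast.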


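Calculating the CPI of a Program

$\mathrm{IC}_i$: #times instruction of type i is executed in the program

IC: #instructions executed in the program:

$$\mathrm{IC} = \sum_{i=1}^{n} \mathrm{IC}_i$$

$F_i$: relative frequency of instruction of type i: $F_i = \mathrm{IC}_i / \mathrm{IC}$

$\mathrm{CPI}_i$: #cycles to execute instruction of type i
  e.g.: $\mathrm{CPI}_{add} = 1$, $\mathrm{CPI}_{mul} = 3$

#cycles required to execute the program:

$$\#\mathrm{cyc} = \sum_{i=1}^{n} \mathrm{CPI}_i \times \mathrm{IC}_i = \mathrm{CPI} \times \mathrm{IC}$$

CPI:

$$\mathrm{CPI} = \frac{\#\mathrm{cyc}}{\mathrm{IC}} = \frac{\sum_{i=1}^{n} \mathrm{CPI}_i \times \mathrm{IC}_i}{\mathrm{IC}} = \sum_{i=1}^{n} \mathrm{CPI}_i \times \frac{\mathrm{IC}_i}{\mathrm{IC}} = \sum_{i=1}^{n} \mathrm{CPI}_i \times F_i$$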
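The weighted-average form of the CPI can be computed from a per-type instruction mix. In this sketch the per-type CPIs for add and mul are the ones quoted in the slide, while the instruction counts and the function name `program_cpi` are assumptions made for illustration.

```python
def program_cpi(counts, cpi_per_type):
    """Return (CPI, total cycles, IC) for a program.

    counts:       {instruction type: IC_i}
    cpi_per_type: {instruction type: CPI_i}
    """
    ic = sum(counts.values())                                        # IC = sum of IC_i
    cycles = sum(cpi_per_type[t] * n for t, n in counts.items())     # #cyc = sum of CPI_i * IC_i
    return cycles / ic, cycles, ic


# CPI_add = 1 and CPI_mul = 3 come from the slide; the counts are hypothetical.
counts = {"add": 700, "mul": 300}
cpi_per_type = {"add": 1, "mul": 3}

cpi, cycles, ic = program_cpi(counts, cpi_per_type)

# Same result via relative frequencies: CPI = sum of CPI_i * F_i.
cpi_from_freq = sum(cpi_per_type[t] * (n / ic) for t, n in counts.items())

print(cpi, cycles, ic, cpi_from_freq)
```

Both routes give the same CPI, which is the point of the derivation: the program's CPI is the frequency-weighted average of the per-type CPIs.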


Evaluating Performance

Use a performance simulator to evaluate the performance of a new feature / algorithm
  Models the uarch in great detail
  Run 100’s of representative applications

Produce the performance s-curve
  Sort the applications according to the IPC increase
  Baseline (0) is the processor without the new feature

[Figure: two S-curves showing the IPC change per application, in percent, with applications sorted by IPC increase. Bad S-curve: negative outliers and positive outliers. Good S-curve: small negative outliers and positive outliers.]
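The data behind an S-curve is simply the per-application IPC change, sorted. The sketch below assumes the simulator has already produced an IPC value per application with and without the new feature; the application names, the IPC numbers and the function name `s_curve` are all hypothetical.

```python
def s_curve(baseline_ipc, new_ipc):
    """Return [(application, IPC change in %)] sorted from worst to best."""
    changes = {
        app: (new_ipc[app] / baseline_ipc[app] - 1.0) * 100.0
        for app in baseline_ipc
    }
    return sorted(changes.items(), key=lambda item: item[1])


# Hypothetical simulator output (IPC per application).
baseline_ipc = {"app_a": 1.00, "app_b": 2.00, "app_c": 1.00, "app_d": 0.50}
new_ipc = {"app_a": 1.05, "app_b": 1.96, "app_c": 1.00, "app_d": 0.51}

curve = s_curve(baseline_ipc, new_ipc)
for app, change in curve:
    print(f"{app}: {change:+.1f}%")

# The ends of the sorted list are the negative and positive outliers.
worst, best = curve[0], curve[-1]
```

Plotting the sorted changes against their rank gives the S shape; a feature is attractive when the left tail (the losses) stays small.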


Comparing Performance

Peak Performance
  MIPS, MFLOPS
  Often not useful: unachievable / unsustainable in practice

Benchmarks
  Real applications, or representative parts of real apps
  Targeted at the specific system usages

SPEC INT – integer applications
  Data compression, C compiler, Perl interpreter, database system, chess-playing, text-processing, …

SPEC FP – floating point applications
  Mostly important scientific applications

TPC Benchmarks
  Measure transaction-processing throughput


Instruction Set Design

The ISA is what the user / compiler sees
The HW implements the ISA

[Figure: the instruction set shown as the interface between software and hardware.]


ISA Considerations

Code size
  Long instructions take more time to fetch
  Longer instructions require a larger memory
    • Important in small devices, e.g., cell phones

Number of instructions (IC)
  Reducing IC reduces execution time
    • At a given CPI and frequency

Code “simplicity”
  Simple HW implementation
    • Higher frequency and lower power
  Code optimization can better be applied to “simple code”


Architectural Consideration Example

Immediate data size
  1% of data values > 16-bits
  12 – 16 bits are needed

[Figure: distribution of immediate values by number of bits needed. X-axis: immediate data bits (0 – 15); y-axis: 0% – 30%; series: Int. Avg. and FP Avg.]


CISC Processors

CISC - Complex Instruction Set Computer
  The idea: a high level machine language
  Example: x86

Characteristic
  Many instruction types, with many addressing modes
  Some of the instructions are complex
    • Execute complex tasks
    • Require many cycles
  ALU operations directly on memory
    • Only a few registers, in many cases not orthogonal
  Variable length instructions
    • Common instructions get short codes, which saves code length


Top 10 x86 Instructions

  Rank  Instruction              % of total executed
  1     load                     22%
  2     conditional branch       20%
  3     compare                  16%
  4     store                    12%
  5     add                      8%
  6     and                      6%
  7     sub                      5%
  8     move register-register   4%
  9     call                     1%
  10    return                   1%
        Total                    96%

Simple instructions dominate instruction frequency


CISC Drawbacks

Complex instructions and complex addressing modes
  complicate the processor
  slow down the simple, common instructions
  contradict Make The Common Case Fast

Compilers don’t use complex instructions / indexing methods

Variable length instructions are a real pain in the neck
  Difficult to decode a few instructions in parallel
    • As long as an instruction is not decoded, its length is unknown
      It is unknown where the instruction ends
      It is unknown where the next instruction starts
  An instruction may span more than a single cache line
  An instruction may span more than a single page
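The decoding problem with variable length instructions can be illustrated with a toy model. The encoding below is invented for this sketch (the first byte of each instruction holds its length in bytes) and is not the x86 format; the function names are likewise hypothetical.

```python
def fixed_length_starts(code, width=4):
    """Fixed-length ISA: every start offset is known up front, independently of the others."""
    return list(range(0, len(code), width))


def variable_length_starts(code):
    """Toy variable-length ISA: each start is known only after the previous instruction is decoded."""
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        length = code[pc]  # toy rule (assumption): first byte = instruction length in bytes
        if length == 0:
            raise ValueError("invalid instruction length 0")
        pc += length       # the next start depends on this decode step
    return starts


# Four toy instructions of lengths 2, 3, 1 and 4 bytes.
toy_code = bytes([2, 0xAA, 3, 0xBB, 0xCC, 1, 4, 0x00, 0x00, 0x00])

print(fixed_length_starts(bytes(12)))      # offsets computable in parallel
print(variable_length_starts(toy_code))    # offsets found by a sequential scan
```

The loop-carried dependence on `pc` is the software analogue of the hardware difficulty: a decoder that wants to handle several instructions per cycle must first find out where each of them begins.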


RISC Processors

RISC - Reduced Instruction Set Computer
  The idea: simple instructions enable fast hardware

Characteristic
  A small instruction set, with only a few instruction formats
  Simple instructions
    • Execute simple tasks
    • Most of them require a single cycle (with pipeline)
  A few indexing methods
  ALU operations on registers only
    • Memory is accessed using Load and Store instructions only
    • Many orthogonal registers
    • Three address machine: Add dst, src1, src2
  Fixed length instructions

Examples: MIPS™, Sparc™, Alpha™, Power™


RISC Processors (Cont.)

Simple architecture → simple micro-architecture
  Simple, small and fast control logic
  Simpler to design and validate
  Room for large on die caches
  Shorten time-to-market

Using a smart compiler
  Better pipeline usage
  Better register allocation

Existing RISC processors are not “pure” RISC
  e.g., support division which takes many cycles


Compilers and ISA

Ease of compilation
  Orthogonality:
    • no special registers
    • few special cases
    • all operand modes available with any data type or instruction type
  Regularity:
    • no overloading for the meanings of instruction fields
  Streamlined:
    • resource needs easily determined

Register Assignment is critical too
  Easier if lots of registers


CISC Is Dominant

The x86 architecture, which is a CISC architecture, dominates the processor market
  A vast amount of existing software
  Intel, AMD, Microsoft and others benefit from this
    • Intel and AMD put a lot of money into making high performance x86 processors, despite the architectural disadvantage
    • Current x86 processors give the best cost/performance

CISC processors use arch ideas from the RISC world
  Starting at Pentium II and K6, x86 processors translate CISC instructions into RISC-like operations internally
    • The inside core looks much like that of a RISC processor


Software Specific Extensions

Extend arch to accelerate exec of specific apps

Example: SSE™ – Streaming SIMD Extensions
  128-bit packed (vector) / scalar single precision FP (4×32)
  Introduced on Pentium® III in ’99
  8 new 128 bit registers (XMM0 – XMM7)
  Accelerates graphics, video, scientific calculations, …

[Figure: two 128-bit operands, x0 x1 x2 x3 and y0 y1 y2 y3. Packed: all four element pairs are added, giving x0+y0, x1+y1, x2+y2, x3+y3. Scalar: only one element pair is added, giving x0+y0, y1, y2, y3.]
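The difference between the packed and scalar forms can be modeled with plain Python lists standing in for 128-bit registers holding four 32-bit values. This is only a behavioral sketch of the figure, not SSE code: the function names are hypothetical, Python floats are double precision rather than single, and the choice of which operand supplies the untouched elements in the scalar case simply follows the figure as extracted.

```python
def add_packed(x, y):
    """Packed add: all four element pairs are added in one operation."""
    assert len(x) == len(y) == 4
    return [a + b for a, b in zip(x, y)]


def add_scalar(x, y):
    """Scalar add: only element 0 is added; elements 1..3 pass through from y (as in the figure)."""
    assert len(x) == len(y) == 4
    return [x[0] + y[0]] + list(y[1:])


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]

print(add_packed(x, y))  # four additions' worth of work per instruction
print(add_scalar(x, y))  # one addition per instruction
```

With the packed form a loop over an array of floats needs roughly a quarter of the add instructions, which is how such an extension reduces IC.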
