CS 15-447: Computer Architecture, Fall 2008
Lecture 26: Emerging Architectures
November 19, 2007
Nael Abu-Ghazaleh
[email protected]
http://www.qatar.cmu.edu/~msakr/15447-f08
Last Time: Buses and I/O
• Buses: a bunch of wires
• Shared interconnect: multiple “devices” connect to the same bus
• Versatile: new devices can connect (even ones we didn’t know existed when the bus was designed)
• Can become a bottleneck
  – Shorter -> faster; fewer devices -> faster
• Have to:
  – Define the protocol that lets devices communicate
  – Come up with an arbitration mechanism
(Figure: a shared bus, showing its data lines and control lines)
Types of Buses
• System bus
  – Connects processor and memory
  – Short, fast, synchronous, design specific
• I/O bus
  – Usually lengthy and slower; industry standard
  – Needs to match a wide range of I/O devices
  – Connects to the processor-memory bus or backplane bus
(Figure: processor and memory on a processor-memory bus; bus adaptors connect it to a backplane bus and to I/O buses)
Bus “Mechanics”
• Master/slave organization
• Have to define how devices handshake
  – Depends on whether the bus is synchronous or not
• Bus arbitration protocol
  – Contention vs. reservation; centralized vs. distributed
• I/O model
  – Programmed I/O; interrupt-driven I/O; DMA (see the polling sketch below)
• Increasing performance (mainly bandwidth)
  – Shorter; closer; wider
  – Block transfers (instead of byte transfers)
  – Split-transaction buses
  – …
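To make the programmed-I/O model concrete, here is a minimal polling sketch in C. The register addresses, names, and ready bit are invented for illustration (real memory-mapped devices define their own); only the structure — poll the status register, then move the data yourself — is the point.

#include <stdint.h>

/* Hypothetical memory-mapped device registers; addresses and the
 * READY bit are made up for this sketch, not from any real device. */
#define DEV_STATUS ((volatile uint32_t *)0xFFFF0000u)
#define DEV_DATA   ((volatile uint32_t *)0xFFFF0004u)
#define DEV_READY  0x1u

/* Programmed I/O: the CPU busy-waits on the status register, doing no
 * useful work, then reads the data register itself. Interrupt-driven
 * I/O replaces the loop with an interrupt; DMA moves the data without
 * involving the CPU at all. */
uint32_t pio_read_word(void)
{
    while ((*DEV_STATUS & DEV_READY) == 0)
        ;                       /* poll until the device signals ready */
    return *DEV_DATA;           /* the CPU itself moves the data */
}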
Today: Emerging Architectures
• We are at an interesting point in computer architecture evolution
• What is emerging and why is it emerging?
Uniprocessor Performance (SPECint)

(Figure: performance relative to the VAX-11/780, 1978-2006, log scale from 1 to 10,000; growth of 25%/year through 1986, then 52%/year through 2002, then a flattening to ??%/year that leaves roughly a 3X gap versus the old trend)

• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006

Sea change in chip design: what is emerging?
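A quick sanity check on what those rates compound to over the sixteen years from 1986 to 2002 (our arithmetic, not from the slide):

\[
1.52^{16} \approx 812 \qquad\text{vs.}\qquad 1.25^{16} \approx 36
\]

So the post-1986 growth rate bought nearly three orders of magnitude of performance; at the old VAX rate it would have been only about 36x.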
How did we get there?
• First, what allowed the ridiculous 52% improvement per year to continue for around 20 years?
  – If cars had improved as much, we would have million-km/hr cars!
• Is it just the number of transistors / the clock rate?
• No! It’s also all the stuff that we’ve been learning about!
Walk down memory lane
• What was the first processor organization we looked at?
  – Single-cycle processors
• How did multi-cycle processors improve on those?
• What did we do after that to improve performance?
  – Pipelining; why does that help? What are the limitations?
• From there we discussed superscalar architectures
  – Out-of-order execution; multiple ALUs
  – This is basically the state of the art in uniprocessors
  – What gave us problems there?
Detour: a couple of other design points

• Very Long Instruction Word (VLIW) architectures: let the compiler do the work
• Great for energy efficiency: less hardware spent hunting for instruction-level parallelism
• Not binary compatible? Transmeta Crusoe processor
SIMD ISA Extensions: Parallelism from the Data?
• The same instruction applied to multiple data elements at the same time
  – How can this help?
• MMX (Intel) and 3DNow! (AMD) ISA extensions
• Great for graphics; originally invented for scientific codes (vector processors)
  – Not a general solution
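As a concrete sketch of the idea (using SSE rather than the MMX/3DNow! extensions named above, since SSE’s float intrinsics make the cleanest example; the function name is ours):

#include <xmmintrin.h>  /* SSE intrinsics */

/* One vector add instruction processes four floats at once:
 * the same instruction, applied to multiple data elements. */
void add_arrays(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);           /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  /* 4 adds, 1 instruction */
    }
    for (; i < n; i++)          /* scalar cleanup for leftover elements */
        c[i] = a[i] + b[i];
}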
• End of detour!
Back to Moore’s law
• Why are the “good times” over?
  – Three walls

1. The Instruction-Level Parallelism (ILP) Wall
  – Less parallelism available in programs (2 -> 4 -> 8 -> 16 issue width)
  – Tremendous increase in complexity to get more
  – Does VLIW help?
  – What can help?
  – Conclusion: standard architectures cannot continue to do their part in sustaining Moore’s law
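A small sketch of where the parallelism runs out (our example): in the first loop each iteration needs the previous iteration’s result, so no issue width helps; in the second the iterations are independent and a superscalar core can overlap them.

/* Serial dependence chain: iteration i+1 cannot start its multiply-add
 * until iteration i has produced s. ILP is ~1 no matter how wide the core. */
double chained(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s = s * x[i] + 1.0;
    return s;
}

/* Independent iterations: a wide out-of-order core can execute
 * several of these multiply-adds in the same cycle. */
void independent(const double *x, double *y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] * x[i] + 1.0;
}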
Wall 2: Memory Wall
• What did we do to help with this?
  – It is still very, very expensive to access memory
• How do we see the impact in practice?
• Very different from when I learned architecture!
(Figure, labeled “Moore’s Law”: processor vs. DRAM performance, 1980-2000, log scale; µProc improves at 52%/yr (2X every 1.5 years) while DRAM improves at 9%/yr (2X every 10 years), so the processor-memory performance gap grows about 50% per year)
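One way to see the memory wall in practice (a sketch, with sizes chosen by us): both functions below do identical arithmetic, but the column-order walk strides through memory and misses the cache on nearly every access, so on a typical machine it runs several times slower.

#define ROWS 1024
#define COLS 1024
static double m[ROWS][COLS];   /* 8 MB: far larger than typical caches */

double sum_row_order(void)     /* unit stride: streams through memory */
{
    double s = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

double sum_col_order(void)     /* stride of one full row: cache hostile */
{
    double s = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}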
Ways out? Multithreaded Processors
• Can we switch to other threads if we need to access memory?
  – When do we need to access memory?
• What support is needed?
• Can I use it to help with the ILP wall as well?
Simultaneous Multithreaded (SMT) Processors
• How do I switch between threads?
• Hardware support for that
• How does this help?
• But, increased contention for everything (BW, TLB, caches…)
Third Wall: Physics/Power wall
• We’re down to the level of playing with a few atoms
• More error-prone; lower yield
• But also soft errors and wear-out
  – Logic that sometimes works!
  – Can we do something in architecture to recover?
Power! Our topic next class
So, what is our way out? Any ideas?
• Maybe architecture becomes a commodity; this is the best we can do
  – This happens to a lot of technologies: why don’t we have the million-km/hr car?
• Do we actually need more processing power?
  – 8-bit embedded processors are good enough for calculators; 4-bit ones are probably good enough for elevators
  – Is there any sense in continuing to invest so much time and energy in this stuff?

Power Wall + Memory Wall + ILP Wall = Brick Wall
A lifeline? Multi-core architectures
• How does this help?
• Think of the three walls
• The new Moore’s law:
  – The number of cores will double every 3 years!
  – Many-core architectures
Overcoming the three walls
• ILP wall?
  – No need to restrict ourselves to a single thread
  – Natural parallelism available across threads/programs (see the sketch after this list)
• Memory wall?
  – Hmm, that is a tough one; on the surface, it seems like we made it worse
  – Maybe help is coming from industry
• Physics/power wall?
  – Use less aggressive core technology
    • Simpler processors, shallower pipelines
    • But more processors
  – Throw-away cores to improve yield
• Do you buy it?
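A minimal sketch of the thread-level parallelism being appealed to here (our example, using POSIX threads; compile with -lpthread): the array sum is split across one worker per core, and each worker’s loop is independent, sidestepping the single-thread ILP limit.

#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4                /* e.g., one worker per core */

static double a[N];
static double partial[NTHREADS];

static void *worker(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / NTHREADS), hi = lo + (N / NTHREADS);
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += a[i];
    partial[id] = s;              /* private slot: no locking needed */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < N; i++)
        a[i] = 1.0;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    double sum = 0.0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        sum += partial[i];
    }
    printf("sum = %.0f\n", sum);
    return 0;
}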
7 Questions for Parallelism
• Applications:
  1. What are the apps?
  2. What are the kernels of the apps?
• Hardware:
  3. What are the HW building blocks?
  4. How do we connect them?
• Programming models:
  5. How do we describe the apps and kernels?
  6. How do we program the HW?
• Evaluation:
  7. How do we measure success?

(Inspired by a view of the Golden Gate Bridge from Berkeley)
Sea Change in Chip Design
• Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10-micron PMOS, 11 mm² chip
• RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3-micron NMOS, 60 mm² chip
• A 125 mm² chip in 0.065-micron CMOS holds 2312 copies of RISC II + FPU + I-cache + D-cache
  – RISC II shrinks to 0.02 mm² at 65 nm

Processor is the new transistor!
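The shrink arithmetic behind that last bullet, roughly (our back-of-the-envelope; area scales with the square of the feature size, ignoring wiring and layout differences):

\[
60\ \mathrm{mm}^2 \times \left(\frac{0.065\ \mu\mathrm{m}}{3\ \mu\mathrm{m}}\right)^{2} \approx 0.03\ \mathrm{mm}^2
\]

which is in the same ballpark as the 0.02 mm² quoted above.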
Architecture Design space
• What should each core look like?
• Should all cores look the same?
• How should the chip interconnect between them look?
• What level of the cache should they share?
  – And what are the implications of that?
• Are there new security issues?
  – Side-channel attacks; denial-of-service attacks
• Many other questions…

Brand new playground; exciting time to do architecture research
Hardware Building Blocks: Small is Beautiful
• Given the difficulty of design/validation of large designs…
• Given power limits on what we can build, parallelism is the energy-efficient way to achieve performance
  – Lower threshold voltage means much lower power
• Given that redundant processors can improve chip yield
  – Cisco Metro: 188 processors + 4 spares
  – Sun sells 6- or 8-processor versions of Niagara
• Expect modestly pipelined (5- to 9-stage) CPUs, FPUs, vector and SIMD PEs
• One size fits all?
  – Amdahl’s Law suggests a few fast cores + many small cores (worked out below)
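To see why Amdahl’s Law points toward a few fast cores plus many small ones, a worked instance (the 95% parallel fraction is our assumption, for illustration):

\[
\mathrm{Speedup}(N) = \frac{1}{(1-p) + p/N}, \qquad
\mathrm{Speedup}(1024)\Big|_{p=0.95} = \frac{1}{0.05 + 0.95/1024} \approx 19.6
\]

The serial 5% caps the speedup at 20x no matter how many small cores we add, so speeding up the serial part with a few fast cores pays off disproportionately.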
Elephant in the room
• We tried this parallel-processing thing before
  – Very difficult
• It failed, pretty much
  – A lot of academic progress and neat algorithms, but little commercial impact
• We actually have to do new programming
  – A lot of effort to develop; error prone; etc.
  – The La-Z-Boy programming era is over
  – Need new programming models
• Amdahl’s law
• Applications: what will you use 1024 cores for?
• These concerns are being voiced by a substantial segment of academia/industry
  – What do you think?
  – It’s coming, no matter what