Copyright Gordon Bell & Jim Gray ISCA2000
All the chips outside… and around the PC
what new platforms? Apps?
Challenges, what’s interesting, and what needs doing?
Gordon Bell
Bay Area Research Center
Microsoft Corporation
Architecture changes when everyone and everything is mobile!
Power, security, RF, WWW, display, data-types e.g. video & voice…
it’s the application of architecture!
The architecture problem
The apps
– Data-types: video, voice, RF, etc.
– Environment: power, speed, cost
The material: clock, transistors…
Performance… it’s about parallelism
– Program & programming environment
– Network e.g. WWW and Grid
– Clusters
– Multiprocessors
– Storage, cluster, and network interconnect
– Processor and special processing
– Multi-threading and multiple processors per chip
– Instruction Level Parallelism vs
– Vector processors
IP On Everything
poochi
Sony Playstation export limits
PC At An Inflection Point?
PCs vs non-PC devices and the Internet
It needs to continue to be upward. These scalable systems provide the highest technical (Flops) and commercial (TPC) performance.
They drive microprocessor competition!
The Dawn Of The PC-Plus Era, Not The Post-PC Era…
devices aggregate via PCs!!!
Consumer PCs, TV/AV, Mobile Companions, Household Management, Communications, Automation & Security
PC will prevail for the next decade as a dominant platform … 2nd to smart, mobile devices
Moore’s Law increases performance; and alternatively reduces prices
PC server clusters with low-cost OS beat proprietary switches, smPs, and DSMs
Home entertainment & control …
– Very large disks (1 TB by 2005) to “store everything”
– Screens to enhance use
Mobile devices, etc. dominate WWW >2003!
Voice and video become important apps!
C = Commercial; C’ = Consumer
Where’s the action? Problems?
Constraints: speech, video, mobility, RF, GPS, security…
Moore’s Law, including network speed
Scalability and high-performance processing
– Building them: clusters vs DSM
– Structure: where’s the processing, memory, and switches (disk and TCP/IP processing)
– Micros: getting the most from the nodes
Not ISAs: change can delay the Moore’s Law effect … and wipe out software investment! Please, please, just interpret my object code!
System-on-a-chip alternatives… apps drive
– Data-types (e.g. video, voice, RF), performance, portability/power, and cost
High Performance Computing
A 60+ year view
High performance architecture/program timeline
1950 . 1960 . 1970 . 1980 . 1990 . 2000
Vtubes | Trans. | MSI (mini) | Micro | RISC | nMicro
Sequential programming ----> (single execution stream)
SIMD Vector --//-- Parallelization ---
Parallel programs aka cluster computing: multicomputers <-- MPP era; ultracomputers 10X in size & price! 10x MPP
“in situ” resources 100x in //sm: NOW, VLSCC, geographically dispersed Grid
Computer types
[Chart: computer types arranged along a connectivity axis -- WAN/LAN, SAN, DSM, SM -- spanning networked supers (GRID, Legion, Condor, Beowulf, NT clusters), clusters (VPP uni, T3E, SP2 (mP), NOW, NEC mP), SGI DSM clusters & SGI DSM, NEC super and Cray X…T (all mPv), mainframes, multis, WSs, and PCs; micros and vector machines.]

Technical computer types
[Chart: the same taxonomy divided into the old world (one program stream) and the new world of clustered computing (multiple program streams).]
Dead Supercomputer Society
ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Convex, Cray Computer, Cray Research, Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar/Stardent, Denelcor, Elexsi, ETA Systems, Evans and Sutherland Computer, Floating Point Systems, Galaxy YH-1, Goodyear Aerospace MPP, Gould NPL, Guiltech, Intel Scientific Computers, International Parallel Machines, Kendall Square Research, Key Computer Laboratories, MasPar, Meiko, Multiflow, Myrias, Numerix, Prisma, Tera, Thinking Machines, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Supertek, Supercomputer Systems, Suprenum, Vitesse Electronics
SCI Research c1985-1995
35 university and corporate R&D projects
2 or 3 successes… all the rest failed to work or to be successful
How to build scalables?
To cluster or not to cluster… don’t we need a single, shared memory?
Application Taxonomy
Technical:
– General purpose, non-parallelizable codes (PCs have it!)
– Vectorizable & //able (supers & small DSMs)
– Hand-tuned, one-of MPP coarse grain
– MPP embarrassingly // (clusters of PCs...)
Commercial:
– Database; Database/TP; Web host; streaming audio/video
If central control & rich, then IBM or large SMPs; else PC clusters
SNAP … c1995
Scalable Network And Platforms: A View of Computing in 2000+
We all missed the impact of WWW!
Gordon Bell & Jim Gray [figure: Network + Platform]
[Figure: computing SNAP built entirely from PCs -- wide- & local-area networks for terminals, PCs, workstations, & servers; centralized & departmental uni- & mP servers (UNIX & NT); legacy mainframe & minicomputer servers & terminals; wide-area global network; centralized & departmental servers built from PCs; scalable computers built from PCs; TC=TV+PC home … (CATV or ATM or satellite); portables; person servers (PCs); mobile nets.]
A space, time (bandwidth), & generation scalable environment
Bell Prize and Future Peak Tflops (t)
[Chart: log scale, 0.0001 to 1000 Tflops, 1985-2010; points include XMP, NCube, CM2, NEC, *IBM; the trend extrapolates to the petaflops study target.]
Top 10 TPC-C
Top two Compaq systems are: 1.1 & 1.5X faster than IBM SPs; 1/3 the price of IBM; 1/5 the price of SUN
Courtesy of Dr. Thomas Sterling, Caltech
Five Scalabilities
Size scalable -- designed from a few components, with no bottlenecks
Generation scaling -- no rewrite/recompile or user effort to run across generations of an architecture
Reliability scaling… choose any level
Geographic scaling -- compute anywhere (e.g. multiple sites or in situ workstation sites)
Problem x machine scalability -- the ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer.
Problem x machine space => run time: problem scale, machine scale (#p), run time; implies speedup and efficiency
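The speedup and efficiency bookkeeping in the last bullet can be sketched in a few lines; the run-time numbers below are illustrative, not from the talk.

```python
# Problem x machine space => run time: derive speedup and efficiency
# from serial vs parallel run times and the processor count (#p).

def speedup(t_serial: float, t_parallel: float) -> float:
    return t_serial / t_parallel

def efficiency(t_serial: float, t_parallel: float, p: int) -> float:
    # Fraction of ideal linear speedup actually achieved.
    return speedup(t_serial, t_parallel) / p

# e.g. a code taking 100 s serially and 2 s on 64 processors:
s = speedup(100.0, 2.0)         # 50x speedup
e = efficiency(100.0, 2.0, 64)  # ~0.78 efficiency
```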
Why I gave up on large smPs & DSMs
Economics: Perf/Cost is lower…unless a commodity Economics: Longer design time & life. Complex.
=> Poorer tech tracking & end of life performance. Economics: Higher, uncompetitive costs for processor &
switching. Sole sourcing of the complete system. DSMs … NUMA! Latency matters.
Compiler, run-time, O/S locate the programs anyway. Aren’t scalable. Reliability requires clusters. Start there. They aren’t needed for most apps… hence, a small
market unless one can find a way to lock in a user base. Important as in the case of IBM Token Rings vs Ethernet.
FVCORE Performance -- Finite Volume Community Climate Model, joint code development by NASA, LLNL and NCAR
[Chart: Gflops (0-30) vs number of SGI processors (0-600), comparing MPI on SGI with MLP on SGI; reference lines for Max T3E, Max C90-16, SX-4, and SX-550.]
Cache based systems are nothing more than “vector” processors with a highly programmable “vector” register set (the caches). These caches are 1000x larger than the vector registers on a Cray vector system, and provide the opportunity to execute vector work at a very high sustained rate. In particular, note 512 CPU Origins contain 4 GBytes of cache. This is larger than most problems of interest, and offers a tremendous opportunity for high performance across a large number of CPUs. This has been borne out in fact at NASA Ames.
Architectural Contrasts – Vector vs Microprocessor
Vector system: CPU -- vector registers (8 KBytes) -- memory; vector lengths arbitrary; vectors fed at high speed; two results per clock; 500 MHz
Microprocessor system: CPU -- 1st & 2nd level caches (8 MBytes) -- memory; vector lengths fixed; vectors fed at low speed; two results per clock (will be 4 in next-gen SGI); 600 MHz
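The arithmetic behind the contrast above is easy to check: the on-chip cache really is three orders of magnitude larger than a vector register file, and at two results per clock the peak rates differ only by clock speed.

```python
# Cache-vs-vector-register sizes from the slide, and peak rates at
# two results per clock. Numbers are the slide's, not measurements.
vector_regs_bytes = 8 * 1024        # Cray-style vector registers: 8 KB
caches_bytes = 8 * 1024 * 1024      # 1st + 2nd level caches: 8 MB
ratio = caches_bytes / vector_regs_bytes  # the "1000x larger" claim

peak_vector = 2 * 500e6   # 2 results/clock at 500 MHz -> 1.0 Gflop/s
peak_micro = 2 * 600e6    # 2 results/clock at 600 MHz -> 1.2 Gflop/s
print(ratio, peak_vector, peak_micro)
```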
Convergence to one architecture
[Figure: evolution of scalable multiprocessors, multicomputers, & workstations to shared memory computers, c1995. Note only two structures: 1. shared memory mP with uniform & non-uniform memory access; and 2. networked workstations, shared nothing. mPs continue to be the main line.
– Limited scalability: mP, uniform memory access. Bus-based multis: mini, W/S (DEC, Encore, Sequent, Stratus, SGI, SUN, etc.); ring-based multis; mP mainframes & supers (Convex, Cray, Fujitsu, IBM, Hitachi, NEC mainframes & supers).
– Experimental, scalable multicomputer: smC, non-uniform memory access. 1st smC: hypercube, Transputer (grid); fine-grain smC (Mosaic-C, J-machine); medium-to-coarse grain smC (Cosmic Cube, iPSC 1, NCUBE, Transputer-based; Fujitsu, Intel, Meiko, NCUBE, TMC; 1985-1994); coarse-grain smC clusters; very coarse grain smC: networked workstations (Apollo, SUN, HP, etc.), WS clusters via special switches 1994 & ATM 1995.
– Scalable mP: smP, non-uniform memory access. 1st smP, no cache (Cm* ('75), Butterfly ('85), Cedar ('88)); smP DSM, some cache (DASH, Convex, Cray T3D, SCI); smP all-cache arch. (KSR Allcache); next-gen smP research, e.g. DDM, DASH+.
Natural evolution: cache for locality; WS micros + fast switch; high-bandwidth switch & comm. protocols, e.g. ATM; smC med-coarse grain and DSM => smP by 1995?]
“Jim, what are the architectural challenges … for clusters?”
WANs (and even LANs) faster than backplanes at 40 Gbps
End of busses (FC = 100 MBps)… except on a chip
What are the building blocks or combinations of processing, memory, & storage?
Infiniband (http://www.infinibandta.org) starts at OC48, but it may not go far or fast enough if it ever exists. OC192 is being deployed.
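The OC figures translate directly to line rates: a SONET OC-n channel runs at n x 51.84 Mbps, which is where the slide's "10 Gbps" for OC192 comes from.

```python
# SONET/SDH line rates: OC-n carries n x 51.84 Mbps.
def oc_rate_gbps(n: int) -> float:
    return n * 51.84e-3  # result in Gbps

print(oc_rate_gbps(48))   # Infiniband's starting point, ~2.5 Gbps
print(oc_rate_gbps(192))  # the "10 Gbps" being deployed, ~9.95 Gbps
```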
What is the basic structure of these scalable systems?
Overall
Disk connection, especially wrt Fibre Channel SAN, especially with fast WANs & LANs
Modern scalable switches … also hide a supercomputer
Scale from <1 to 120 Tbps of switch capacity
1 Gbps Ethernet switches scale to 10s of Gbps
SP2 scales from 1.2 Gbps
GB plumbing from the baroque: evolving from the 2 dance-hall model
[Diagram, PMS notation: Mp — S — Pc, with S.fc — Ms, S.Cluster, and S.WAN switches, evolving to MpPcMs — S.Lan/Cluster/Wan]
SNAP Architecture
ISTORE Hardware Vision
System-on-a-chip enables computer & memory without significantly increasing the size of the disk
5-7 year target, MicroDrive (1.7” x 1.4” x 0.2”):
– 1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
– 2006: 9 GB, 50 MB/s? (1.6X/yr capacity, 1.4X/yr BW)
Integrated IRAM processor (2x height): 16 Mbytes; 1.6 Gflops; 6.4 Gops
Connected via crossbar switch growing like Moore’s law
10,000+ nodes in one rack! 100/board = 1 TB; 0.16 Tflops
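The 2006 targets follow from compounding the quoted growth rates over seven years from the 1999 baseline; a quick check:

```python
# Extrapolate the 1999 MicroDrive figures at 1.6x/yr capacity and
# 1.4x/yr bandwidth for 7 years (1999 -> 2006).
capacity_mb = 340 * 1.6 ** 7    # ~9,100 MB, i.e. the ~9 GB target
bandwidth_mb_s = 5 * 1.4 ** 7   # ~53 MB/s, i.e. the ~50 MB/s target
print(round(capacity_mb), round(bandwidth_mb_s, 1))
```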
The Disk Farm? Or a System On a Card?
The 500 GB disc card: an array of discs on a 14” card. Can be used as 100 discs, 1 striped disc, 50 FT discs, etc. LOTS of accesses/second and lots of bandwidth.
A few disks are replaced by 10s of Gbytes of RAM and a processor to run apps!!
Map of Gray Bell Prize results
[Map: Redmond/Seattle, WA; San Francisco, CA; New York; Arlington, VA -- 5626 km, 10 hops]
single-thread, single-stream tcp/ip via 7 hops, desktop-to-desktop … Win 2K out-of-the-box performance*
1 GBps
Ubiquitous 10 GBps SANs in 5 years
1 Gbps Ethernet is a reality now. Also FibreChannel, MyriNet, GigaNet, ServerNet, ATM,…
10 Gbps x4 WDM deployed now (OC192)
– 3 Tbps WDM working in lab
In 5 years, expect 10x, wow!!
[Figure ladder: 5 MBps, 20 MBps, 40 MBps, 80 MBps, 120 MBps (1 Gbps)]
The Promise of SAN/VIA: 10x in 2 years (http://www.ViArch.org/)
[Chart: time in µs to send 1 KB (transmit, receiver cpu, sender cpu), 0-250 µs, at 100 Mbps vs Gbps SAN]
Yesterday:
– 10 MBps (100 Mbps Ethernet)
– ~20 MBps tcp/ip saturates 2 cpus
– round-trip latency ~250 µs
Now:
– Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet,…
– Fast user-level communication: tcp/ip ~100 MBps at 10% cpu; round-trip latency is 15 µs
1.6 Gbps demoed on a WAN
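A hedged back-of-envelope for the chart above: time to move a small message is roughly per-message overhead plus size over bandwidth. The overhead figures are the slide's round-trip latencies used as stand-ins, not measurements.

```python
# Time (µs) to send a message: software overhead + wire time.
# One Mbps moves exactly one bit per µs, which keeps the units simple.
def send_time_us(size_bytes: int, bandwidth_mbps: float, overhead_us: float) -> float:
    wire_us = size_bytes * 8 / bandwidth_mbps
    return overhead_us + wire_us

# 100 Mbps Ethernet with ~250 µs overhead: overhead dominates.
print(send_time_us(1024, 100, 250))
# Gbps SAN with ~15 µs user-level overhead: roughly a 10x improvement.
print(send_time_us(1024, 1000, 15))
```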
Processor improvements… 90% of ISCA’s focus
We get more of everything: mainframes, minis, micros, and RISC
[Chart: Performance (VAX 780s), 0.1 to 100, vs time 1980-1990, for several computers: 780 (5 MHz), 8600, 9000 (TTL/ECL at 15%/yr); MV10K, 68K, uVAX 4K, uVAX 6K (CMOS CISC at 38%/yr); MIPS (8 MHz), Mips 25 MHz, Mips (65 MHz) (RISC at 60%/yr). Will RISC continue on a 60%/yr (x4 / 3 years) Moore’s speed law?]
Computer ops/sec x word length / $
[Chart, 1880-2000, 1.E-06 to 1.E+09; fitted trends y = 1E-248·e^(0.2918x) and y = 1.565^(t-1959.4); doubling time improves from every 7.5 years, to every 2.3, to every 1.0.]
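The fitted exponentials convert directly into doubling times (my arithmetic on the slide's fits):

```python
import math

# Doubling time implied by an annual growth factor: ln 2 / ln(factor).
def doubling_time_years(annual_factor: float) -> float:
    return math.log(2) / math.log(annual_factor)

# y = 1.565^(t - 1959.4): 56.5%/yr growth -> ~1.55-year doubling
print(doubling_time_years(1.565))
# y = 1E-248 * e^(0.2918 t): factor e^0.2918 per year -> ~2.4-year doubling
print(doubling_time_years(math.e ** 0.2918))
```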
Growth of microprocessor performance
[Chart: performance in Mflop/s, 0.01 to 10,000 (log scale), 1980-1998. Micros: 8087, 80287, 68881, 80387, R2000, i860, RS6000/540, Alpha, RS6000/590, Alpha. Supers: Cray 1S, Cray X-MP, Cray 2, Cray Y-MP, Cray C90, Cray T90.]
Albert Yu predictions ‘96
When 2000 2006
Clock (MHz) 900 4000 4.4x
MTransistors 40 350 8.75x
Mops 2400 20,000 8.3x
Die (sq. in.) 1.1 1.4 1.3x
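Yu's 2000-to-2006 ratios compound over six years; annualizing them (my arithmetic, not from the slide) shows what growth rates they assume:

```python
# Compound annual growth rate implied by the table's 2000 -> 2006 ratios.
def cagr(v0: float, v1: float, years: int = 6) -> float:
    return (v1 / v0) ** (1 / years) - 1

print(round(cagr(900, 4000) * 100))     # clock: ~28%/yr
print(round(cagr(40, 350) * 100))       # transistors: ~44%/yr
print(round(cagr(2400, 20000) * 100))   # Mops: ~42%/yr
```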
Processor Limit: the DRAM Gap
[Chart: relative performance, 1 to 1000 (log scale), 1980-2000. µProc (“Moore’s Law”) improves 60%/yr; DRAM improves 7%/yr; the processor-memory performance gap grows 50%/yr.]
• Alpha 21264 full cache miss / instructions executed: 180 ns / 1.7 ns = 108 clks x 4-issue, or 432 instructions
• Caches in Pentium Pro: 64% area, 88% transistors
*Taken from the Patterson-Keeton talk to SIGMOD
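The slide's two headline numbers both fall out of simple arithmetic: the gap growth from the two improvement rates, and the lost-instruction count from the miss latency (the slide rounds 180/1.7 up to 108 clocks):

```python
# CPU improves 60%/yr, DRAM 7%/yr: the relative gap grows ~50%/yr.
gap_growth = 1.60 / 1.07 - 1
print(round(gap_growth * 100))  # ~50

# Alpha 21264: a 180 ns full miss at a 1.7 ns cycle stalls ~106 clocks;
# on a 4-issue core that's ~420-430 instruction slots lost per miss.
miss_clks = 180 / 1.7
lost_instructions = miss_clks * 4
```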
The “memory gap”
Multiple (e.g. 4) processors/chip, to increase the ops/chip while waiting out the inevitable access delays
Or, alternatively, multi-threading (MTA)
Vector processors with a supporting memory system
System-on-a-chip… to reduce chip boundary crossings
If system-on-a-chip is the answer, what is the problem?
Small, high-volume products:
– Phones, PDAs
– Toys & games (to sell batteries)
– Cars
– Home appliances
– TV & video
Communication infrastructure
Plain old computers… and portables
SOC Alternatives… not including C/C++ CAD tools
The blank sheet of paper: FPGA
Auto design of a basic system: Tensilica
Standardized, committee-designed components*, cells, and custom IP
Standard components including more application-specific processors*, IP add-ons, and custom
One chip does it all: SMOP
*Processors, Memory, Communication & Memory Links
Xilinx 10Mg, 500Mt, .12 mic
Free 32-bit processor core
System-on-a-chip alternatives
FPGA: sea of un-committed gate arrays -- Xilinx, Altera
Compile a system: unique processor for every app -- Tensilica
Systolic array: many pipelined or parallel processors + custom
DSP / VLIW: special-purpose processor cores + custom -- TI
Pc & Mp ASICs: general-purpose cores, specialized by I/O, etc. -- IBM, Intel, Lucent
Universal Micro: multiprocessor array, programmable I/O -- Cradle
Cradle: Universal Microsystem -- trading Verilog & hardware for C/C++
Single part for all apps; app spec’d at run time using FPGA & ROM
5 quad mPs at 3 Gflops/quad = 15 Gflops
Single shared memory space, caches
Programmable periphery including: 1 GB/s; 2.5 Gips; PCI, 100baseT, firewire
$4 per Gflops; 150 mW/Gflops
UMS : VLSI = microprocessor : special systems = Software : Hardware
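The quoted Cradle figures are self-consistent; checking the totals (the power figure is derived from the slide's mW/Gflops number, not stated directly):

```python
# Five quad-processor clusters at 3 Gflops per quad.
quads = 5
gflops = quads * 3          # 15 Gflops for the whole part
# At 150 mW/Gflops, total compute power is about 2.25 W.
watts = gflops * 0.150
print(gflops, watts)
```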
UMS Architecture
[Block diagram: four clusters of four MSPs, each cluster with a shared memory block; DRAM and DRAM control; clocks & debug; NV memory; programmable I/O blocks around the periphery.]
Memory bandwidth scales with processing
Scalable processing, software, I/O
Each app runs on its own pool of processors
Enables durable, portable intellectual property
Recapping the challenges
Scalable systems
– Latency in a distributed memory
– Structure of the system and nodes
– Network performance for OC192 (10 Gbps)
– Processing nodes and legacy software
Mobile systems… power, RF, voice, I/O
– Design time!
The End