Copyright Gordon Bell & Jim Gray ISCA2000
All the chips outside… and around the PC
what new platforms? Apps?
Challenges, what’s interesting, and what needs doing?
Gordon Bell
Bay Area Research Center
Microsoft Corporation
Architecture changes when everyone and everything is mobile!
Power, security, RF, WWW, display, data-types e.g. video & voice…
it’s the application of architecture!
The architecture problem
The apps
– Data-types: video, voice, RF, etc.
– Environment: power, speed, cost
The material: clock, transistors…
Performance… it’s about parallelism
– Program & programming environment
– Network e.g. WWW and Grid
– Clusters
– Multiprocessors
– Storage, cluster, and network interconnect
– Processor and special processing
– Multi-threading and multiple processors per chip
– Instruction Level Parallelism vs
– Vector processors
IP On Everything
poochi
Sony Playstation export limits
PC At An Inflection Point?
PCs vs non-PC devices and the Internet
It needs to continue to be upward. These scalable systems provide the highest technical (Flops) and commercial (TPC) performance.
They drive microprocessor competition!
The Dawn Of The PC-Plus Era, Not The Post-PC Era…
devices aggregate via PCs!!!
Consumer PCs, TV/AV, Mobile Companions, Household Management, Communications, Automation & Security
PC will prevail for the next decade as a dominant platform … 2nd to smart, mobile devices
Moore’s Law increases performance; and alternatively reduces prices
PC server clusters with low-cost OS beat proprietary switches, smPs, and DSMs
Home entertainment & control …
– Very large disks (1 TB by 2005) to “store everything”
– Screens to enhance use
Mobile devices, etc. dominate WWW >2003!
Voice and video become important apps!
C = Commercial; C’ = Consumer
Where’s the action? Problems?
Constraints: speech, video, mobility, RF, GPS, security…
Moore’s Law, including network speed
Scalability and high-performance processing
– Building them: clusters vs DSM
– Structure: where’s the processing, memory, and switches (disk and TCP/IP processing)
– Micros: getting the most from the nodes
Not ISAs: change can delay the Moore’s Law effect … and wipe out software investment! Please, please, just interpret my object code!
System-on-a-chip alternatives… apps drive
– Data-types (e.g. video, voice, RF), performance, portability/power, and cost
High Performance Computing
A 60+ year view
High performance architecture/program timeline
1950 . 1960 . 1970 . 1980 . 1990 . 2000
Vtubes | Trans. | MSI (mini) | Micro | RISC | nMicro
Sequential programming ----> (single execution stream)
SIMD Vector --//-- Parallelization ---
Parallel programs aka cluster computing: multicomputers <-- MPP era; ultracomputers 10X in size & price! 10x MPP
“in situ” resources 100x in //sm: NOW, VLSCC, geographically dispersed Grid
Computer types
[Chart: computer types arranged along a connectivity axis -- WAN/LAN, SAN, DSM, SM -- spanning networked supers (GRID, Legion, Condor, Beowulf, NT clusters), clusters (VPP uni, T3E, SP2 (mP), NOW, NEC mP), SGI DSM clusters & SGI DSM, NEC super and Cray X…T (all mPv), mainframes, multis, WSs, and PCs; micros and vector machines.]

Technical computer types
[Chart: the same taxonomy divided into the old world (one program stream) and the new world of clustered computing (multiple program streams).]
Dead Supercomputer Society
ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Convex, Cray Computer, Cray Research, Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar/Stardent, Denelcor, Elexsi, ETA Systems, Evans and Sutherland Computer, Floating Point Systems, Galaxy YH-1, Goodyear Aerospace MPP, Gould NPL, Guiltech, Intel Scientific Computers, International Parallel Machines, Kendall Square Research, Key Computer Laboratories, MasPar, Meiko, Multiflow, Myrias, Numerix, Prisma, Tera, Thinking Machines, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Supertek, Supercomputer Systems, Suprenum, Vitesse Electronics
SCI Research c1985-1995
35 university and corporate R&D projects
2 or 3 successes… all the rest failed to work or to be successful
How to build scalables?
To cluster or not to cluster… don’t we need a single, shared memory?
Application Taxonomy
Technical:
– General purpose, non-parallelizable codes (PCs have it!)
– Vectorizable & //able (supers & small DSMs)
– Hand-tuned, one-of MPP coarse grain
– MPP embarrassingly // (clusters of PCs...)
Commercial:
– Database; Database/TP; Web host; streaming audio/video
If central control & rich, then IBM or large SMPs; else PC clusters
SNAP … c1995
Scalable Network And Platforms: A View of Computing in 2000+
We all missed the impact of WWW!
Gordon Bell & Jim Gray [figure: Network + Platform]
[Figure: computing SNAP built entirely from PCs -- wide- & local-area networks for terminals, PCs, workstations, & servers; centralized & departmental uni- & mP servers (UNIX & NT); legacy mainframe & minicomputer servers & terminals; wide-area global network; centralized & departmental servers built from PCs; scalable computers built from PCs; TC=TV+PC home … (CATV or ATM or satellite); portables; person servers (PCs); mobile nets.]
A space, time (bandwidth), & generation scalable environment
Bell Prize and Future Peak Tflops (t)
[Chart: log scale, 0.0001 to 1000 Tflops, 1985-2010; points include XMP, NCube, CM2, NEC, *IBM; the trend extrapolates to the petaflops study target.]
Top 10 TPC-C
Top two Compaq systems are: 1.1 & 1.5X faster than IBM SPs; 1/3 the price of IBM; 1/5 the price of SUN
Courtesy of Dr. Thomas Sterling, Caltech
Five Scalabilities
Size scalable -- designed from a few components, with no bottlenecks
Generation scaling -- no rewrite/recompile or user effort to run across generations of an architecture
Reliability scaling… choose any level
Geographic scaling -- compute anywhere (e.g. multiple sites or in situ workstation sites)
Problem x machine scalability -- the ability of an algorithm or program to exist at a range of sizes that run efficiently on a given, scalable computer.
Problem x machine space => run time: problem scale, machine scale (#p), run time; implies speedup and efficiency
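The speedup and efficiency bookkeeping in the last bullet can be sketched in a few lines; the run-time numbers below are illustrative, not from the talk.

```python
# Problem x machine space => run time: derive speedup and efficiency
# from serial vs parallel run times and the processor count (#p).

def speedup(t_serial: float, t_parallel: float) -> float:
    return t_serial / t_parallel

def efficiency(t_serial: float, t_parallel: float, p: int) -> float:
    # Fraction of ideal linear speedup actually achieved.
    return speedup(t_serial, t_parallel) / p

# e.g. a code taking 100 s serially and 2 s on 64 processors:
s = speedup(100.0, 2.0)         # 50x speedup
e = efficiency(100.0, 2.0, 64)  # ~0.78 efficiency
```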
Why I gave up on large smPs & DSMs
Economics: Perf/Cost is lower…unless a commodity Economics: Longer design time & life. Complex.
=> Poorer tech tracking & end of life performance. Economics: Higher, uncompetitive costs for processor &
switching. Sole sourcing of the complete system. DSMs … NUMA! Latency matters.
Compiler, run-time, O/S locate the programs anyway. Aren’t scalable. Reliability requires clusters. Start there. They aren’t needed for most apps… hence, a small
market unless one can find a way to lock in a user base. Important as in the case of IBM Token Rings vs Ethernet.
FVCORE Performance -- Finite Volume Community Climate Model, joint code development by NASA, LLNL and NCAR
[Chart: Gflops (0-30) vs number of SGI processors (0-600), comparing MPI on SGI with MLP on SGI; reference lines for Max T3E, Max C90-16, SX-4, and SX-550.]
Cache based systems are nothing more than “vector” processors with a highly programmable “vector” register set (the caches). These caches are 1000x larger than the vector registers on a Cray vector system, and provide the opportunity to execute vector work at a very high sustained rate. In particular, note 512 CPU Origins contain 4 GBytes of cache. This is larger than most problems of interest, and offers a tremendous opportunity for high performance across a large number of CPUs. This has been borne out in fact at NASA Ames.
Architectural Contrasts – Vector vs Microprocessor
Vector system: CPU -- vector registers (8 KBytes) -- memory; vector lengths arbitrary; vectors fed at high speed; two results per clock; 500 MHz
Microprocessor system: CPU -- 1st & 2nd level caches (8 MBytes) -- memory; vector lengths fixed; vectors fed at low speed; two results per clock (will be 4 in next-gen SGI); 600 MHz
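The arithmetic behind the contrast above is easy to check: the on-chip cache really is three orders of magnitude larger than a vector register file, and at two results per clock the peak rates differ only by clock speed.

```python
# Cache-vs-vector-register sizes from the slide, and peak rates at
# two results per clock. Numbers are the slide's, not measurements.
vector_regs_bytes = 8 * 1024        # Cray-style vector registers: 8 KB
caches_bytes = 8 * 1024 * 1024      # 1st + 2nd level caches: 8 MB
ratio = caches_bytes / vector_regs_bytes  # the "1000x larger" claim

peak_vector = 2 * 500e6   # 2 results/clock at 500 MHz -> 1.0 Gflop/s
peak_micro = 2 * 600e6    # 2 results/clock at 600 MHz -> 1.2 Gflop/s
print(ratio, peak_vector, peak_micro)
```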
Convergence to one architecture
[Figure: evolution of scalable multiprocessors, multicomputers, & workstations to shared memory computers, c1995. Note only two structures: 1. shared memory mP with uniform & non-uniform memory access; and 2. networked workstations, shared nothing. mPs continue to be the main line.
– Limited scalability: mP, uniform memory access. Bus-based multis: mini, W/S (DEC, Encore, Sequent, Stratus, SGI, SUN, etc.); ring-based multis; mP mainframes & supers (Convex, Cray, Fujitsu, IBM, Hitachi, NEC mainframes & supers).
– Experimental, scalable multicomputer: smC, non-uniform memory access. 1st smC: hypercube, Transputer (grid); fine-grain smC (Mosaic-C, J-machine); medium-to-coarse grain smC (Cosmic Cube, iPSC 1, NCUBE, Transputer-based; Fujitsu, Intel, Meiko, NCUBE, TMC; 1985-1994); coarse-grain smC clusters; very coarse grain smC: networked workstations (Apollo, SUN, HP, etc.), WS clusters via special switches 1994 & ATM 1995.
– Scalable mP: smP, non-uniform memory access. 1st smP, no cache (Cm* ('75), Butterfly ('85), Cedar ('88)); smP DSM, some cache (DASH, Convex, Cray T3D, SCI); smP all-cache arch. (KSR Allcache); next-gen smP research, e.g. DDM, DASH+.
Natural evolution: cache for locality; WS micros + fast switch; high-bandwidth switch & comm. protocols, e.g. ATM; smC med-coarse grain and DSM => smP by 1995?]
“Jim, what are the architectural challenges … for clusters?”
WANs (and even LANs) faster than backplanes at 40 Gbps
End of busses (FC = 100 MBps)… except on a chip
What are the building blocks or combinations of processing, memory, & storage?
Infiniband (http://www.infinibandta.org) starts at OC48, but it may not go far or fast enough if it ever exists. OC192 is being deployed.
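The OC figures translate directly to line rates: a SONET OC-n channel runs at n x 51.84 Mbps, which is where the slide's "10 Gbps" for OC192 comes from.

```python
# SONET/SDH line rates: OC-n carries n x 51.84 Mbps.
def oc_rate_gbps(n: int) -> float:
    return n * 51.84e-3  # result in Gbps

print(oc_rate_gbps(48))   # Infiniband's starting point, ~2.5 Gbps
print(oc_rate_gbps(192))  # the "10 Gbps" being deployed, ~9.95 Gbps
```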
What is the basic structure of these scalable systems?
Overall
Disk connection, especially wrt Fibre Channel SAN, especially with fast WANs & LANs
Modern scalable switches … also hide a supercomputer
Scale from <1 to 120 Tbps of switch capacity
1 Gbps Ethernet switches scale to 10s of Gbps
SP2 scales from 1.2 Gbps
GB plumbing from the baroque: evolving from the 2 dance-hall model
[Diagram, PMS notation: Mp — S — Pc, with S.fc — Ms, S.Cluster, and S.WAN switches, evolving to MpPcMs — S.Lan/Cluster/Wan]
SNAP Architecture
ISTORE Hardware Vision
System-on-a-chip enables computer & memory without significantly increasing the size of the disk
5-7 year target, MicroDrive (1.7” x 1.4” x 0.2”):
– 1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
– 2006: 9 GB, 50 MB/s? (1.6X/yr capacity, 1.4X/yr BW)
Integrated IRAM processor (2x height): 16 Mbytes; 1.6 Gflops; 6.4 Gops
Connected via crossbar switch growing like Moore’s law
10,000+ nodes in one rack! 100/board = 1 TB; 0.16 Tflops
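The 2006 targets follow from compounding the quoted growth rates over seven years from the 1999 baseline; a quick check:

```python
# Extrapolate the 1999 MicroDrive figures at 1.6x/yr capacity and
# 1.4x/yr bandwidth for 7 years (1999 -> 2006).
capacity_mb = 340 * 1.6 ** 7    # ~9,100 MB, i.e. the ~9 GB target
bandwidth_mb_s = 5 * 1.4 ** 7   # ~53 MB/s, i.e. the ~50 MB/s target
print(round(capacity_mb), round(bandwidth_mb_s, 1))
```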
The Disk Farm? Or a System On a Card?
The 500 GB disc card: an array of discs on a 14” card. Can be used as 100 discs, 1 striped disc, 50 FT discs, etc. LOTS of accesses/second and lots of bandwidth.
A few disks are replaced by 10s of Gbytes of RAM and a processor to run apps!!
Map of Gray Bell Prize results
[Map: Redmond/Seattle, WA; San Francisco, CA; New York; Arlington, VA -- 5626 km, 10 hops]
single-thread, single-stream tcp/ip via 7 hops, desktop-to-desktop … Win 2K out-of-the-box performance*
1 GBps
Ubiquitous 10 GBps SANs in 5 years
1 Gbps Ethernet is a reality now. Also FibreChannel, MyriNet, GigaNet, ServerNet, ATM,…
10 Gbps x4 WDM deployed now (OC192)
– 3 Tbps WDM working in lab
In 5 years, expect 10x, wow!!
[Figure ladder: 5 MBps, 20 MBps, 40 MBps, 80 MBps, 120 MBps (1 Gbps)]
The Promise of SAN/VIA: 10x in 2 years (http://www.ViArch.org/)
[Chart: time in µs to send 1 KB (transmit, receiver cpu, sender cpu), 0-250 µs, at 100 Mbps vs Gbps SAN]
Yesterday:
– 10 MBps (100 Mbps Ethernet)
– ~20 MBps tcp/ip saturates 2 cpus
– round-trip latency ~250 µs
Now:
– Wires are 10x faster: Myrinet, Gbps Ethernet, ServerNet,…
– Fast user-level communication: tcp/ip ~100 MBps at 10% cpu; round-trip latency is 15 µs
1.6 Gbps demoed on a WAN
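A hedged back-of-envelope for the chart above: time to move a small message is roughly per-message overhead plus size over bandwidth. The overhead figures are the slide's round-trip latencies used as stand-ins, not measurements.

```python
# Time (µs) to send a message: software overhead + wire time.
# One Mbps moves exactly one bit per µs, which keeps the units simple.
def send_time_us(size_bytes: int, bandwidth_mbps: float, overhead_us: float) -> float:
    wire_us = size_bytes * 8 / bandwidth_mbps
    return overhead_us + wire_us

# 100 Mbps Ethernet with ~250 µs overhead: overhead dominates.
print(send_time_us(1024, 100, 250))
# Gbps SAN with ~15 µs user-level overhead: roughly a 10x improvement.
print(send_time_us(1024, 1000, 15))
```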
Processor improvements… 90% of ISCA’s focus
We get more of everything: mainframes, minis, micros, and RISC
[Chart: Performance (VAX 780s), 0.1 to 100, vs time 1980-1990, for several computers: 780 (5 MHz), 8600, 9000 (TTL/ECL at 15%/yr); MV10K, 68K, uVAX 4K, uVAX 6K (CMOS CISC at 38%/yr); MIPS (8 MHz), Mips 25 MHz, Mips (65 MHz) (RISC at 60%/yr). Will RISC continue on a 60%/yr (x4 / 3 years) Moore’s speed law?]
Computer ops/sec x word length / $
[Chart, 1880-2000, 1.E-06 to 1.E+09; fitted trends y = 1E-248·e^(0.2918x) and y = 1.565^(t-1959.4); doubling time improves from every 7.5 years, to every 2.3, to every 1.0.]
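The fitted exponentials convert directly into doubling times (my arithmetic on the slide's fits):

```python
import math

# Doubling time implied by an annual growth factor: ln 2 / ln(factor).
def doubling_time_years(annual_factor: float) -> float:
    return math.log(2) / math.log(annual_factor)

# y = 1.565^(t - 1959.4): 56.5%/yr growth -> ~1.55-year doubling
print(doubling_time_years(1.565))
# y = 1E-248 * e^(0.2918 t): factor e^0.2918 per year -> ~2.4-year doubling
print(doubling_time_years(math.e ** 0.2918))
```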
Growth of microprocessor performance
[Chart: performance in Mflop/s, 0.01 to 10,000 (log scale), 1980-1998. Micros: 8087, 80287, 68881, 80387, R2000, i860, RS6000/540, Alpha, RS6000/590, Alpha. Supers: Cray 1S, Cray X-MP, Cray 2, Cray Y-MP, Cray C90, Cray T90.]
Albert Yu predictions ‘96
When 2000 2006
Clock (MHz) 900 4000 4.4x
MTransistors 40 350 8.75x
Mops 2400 20,000 8.3x
Die (sq. in.) 1.1 1.4 1.3x
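Yu's 2000-to-2006 ratios compound over six years; annualizing them (my arithmetic, not from the slide) shows what growth rates they assume:

```python
# Compound annual growth rate implied by the table's 2000 -> 2006 ratios.
def cagr(v0: float, v1: float, years: int = 6) -> float:
    return (v1 / v0) ** (1 / years) - 1

print(round(cagr(900, 4000) * 100))     # clock: ~28%/yr
print(round(cagr(40, 350) * 100))       # transistors: ~44%/yr
print(round(cagr(2400, 20000) * 100))   # Mops: ~42%/yr
```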
Processor Limit: the DRAM Gap
[Chart: relative performance, 1 to 1000 (log scale), 1980-2000. µProc (“Moore’s Law”) improves 60%/yr; DRAM improves 7%/yr; the processor-memory performance gap grows 50%/yr.]
• Alpha 21264 full cache miss / instructions executed: 180 ns / 1.7 ns = 108 clks x 4-issue, or 432 instructions
• Caches in Pentium Pro: 64% area, 88% transistors
*Taken from the Patterson-Keeton talk to SIGMOD
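The slide's two headline numbers both fall out of simple arithmetic: the gap growth from the two improvement rates, and the lost-instruction count from the miss latency (the slide rounds 180/1.7 up to 108 clocks):

```python
# CPU improves 60%/yr, DRAM 7%/yr: the relative gap grows ~50%/yr.
gap_growth = 1.60 / 1.07 - 1
print(round(gap_growth * 100))  # ~50

# Alpha 21264: a 180 ns full miss at a 1.7 ns cycle stalls ~106 clocks;
# on a 4-issue core that's ~420-430 instruction slots lost per miss.
miss_clks = 180 / 1.7
lost_instructions = miss_clks * 4
```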
The “memory gap”
Multiple (e.g. 4) processors/chip, to increase the ops/chip while waiting out the inevitable access delays
Or, alternatively, multi-threading (MTA)
Vector processors with a supporting memory system
System-on-a-chip… to reduce chip boundary crossings
If system-on-a-chip is the answer, what is the problem?
Small, high-volume products:
– Phones, PDAs
– Toys & games (to sell batteries)
– Cars
– Home appliances
– TV & video
Communication infrastructure
Plain old computers… and portables
SOC Alternatives… not including C/C++ CAD tools
The blank sheet of paper: FPGA
Auto design of a basic system: Tensilica
Standardized, committee-designed components*, cells, and custom IP
Standard components including more application-specific processors*, IP add-ons, and custom
One chip does it all: SMOP
*Processors, Memory, Communication & Memory Links
Xilinx 10Mg, 500Mt, .12 mic
Free 32-bit processor core
System-on-a-chip alternatives
FPGA: sea of un-committed gate arrays -- Xilinx, Altera
Compile a system: unique processor for every app -- Tensilica
Systolic array: many pipelined or parallel processors + custom
DSP / VLIW: special-purpose processor cores + custom -- TI
Pc & Mp ASICs: general-purpose cores, specialized by I/O, etc. -- IBM, Intel, Lucent
Universal Micro: multiprocessor array, programmable I/O -- Cradle
Cradle: Universal Microsystem -- trading Verilog & hardware for C/C++
Single part for all apps; app spec’d at run time using FPGA & ROM
5 quad mPs at 3 Gflops/quad = 15 Gflops
Single shared memory space, caches
Programmable periphery including: 1 GB/s; 2.5 Gips; PCI, 100baseT, firewire
$4 per Gflops; 150 mW/Gflops
UMS : VLSI = microprocessor : special systems = Software : Hardware
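The quoted Cradle figures are self-consistent; checking the totals (the power figure is derived from the slide's mW/Gflops number, not stated directly):

```python
# Five quad-processor clusters at 3 Gflops per quad.
quads = 5
gflops = quads * 3          # 15 Gflops for the whole part
# At 150 mW/Gflops, total compute power is about 2.25 W.
watts = gflops * 0.150
print(gflops, watts)
```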
UMS Architecture
[Block diagram: four clusters of four MSPs, each cluster with a shared memory block; DRAM and DRAM control; clocks & debug; NV memory; programmable I/O blocks around the periphery.]
Memory bandwidth scales with processing
Scalable processing, software, I/O
Each app runs on its own pool of processors
Enables durable, portable intellectual property
Recapping the challenges
Scalable systems
– Latency in a distributed memory
– Structure of the system and nodes
– Network performance for OC192 (10 Gbps)
– Processing nodes and legacy software
Mobile systems… power, RF, voice, I/O
– Design time!
The End