NOW and Beyond
Workshop on Clusters and Computational Grids for Scientific Computing
David E. Culler
Computer Science Division
Univ. of California, Berkeley
http://now.cs.berkeley.edu/
7/30/98 HPDC Panel 2
NOW Project Goals
• Make a fundamental change in how we design and construct large-scale systems
– market reality:
» 50%/year performance growth => cannot allow 1-2 year engineering lag
– technological opportunity:
» single-chip “Killer Switch” => fast, scalable communication
• Highly integrated building-wide system
• Explore novel system design concepts in this new “cluster” paradigm
Berkeley NOW
• 100 Sun UltraSparcs
– 200 disks
• Myrinet SAN
– 160 MB/s
• Fast comm.– AM, MPI, ...
• Ether/ATM switched external net
• Global OS
• Self Config
Minute Sort
[Chart: gigabytes sorted in one minute vs. number of processors (up to 100), comparing Berkeley NOW with the SGI Power Challenge and SGI Origin]
Landmarks
• Top 500 Linpack Performance List
• MPI, NPB performance on par with MPPs
• RSA 40-bit Key challenge
• World Leading External Sort
• Inktomi search engine
• NPACI resource site
Taking Stock
• Surprising successes
– virtual networks
– implicit co-scheduling
– reactive IO
– service-based applications
– automatic network mapping
• Surprising unsuccesses
– global system layer
– xFS file system
• New directions for Millennium
– Paranoid construction
– Computational Economy
– Smart Clients
Fast Communication
• Fast communication on clusters is obtained through direct access to the network, as on MPPs
• Challenge is to make this general purpose
– the system implementation should not dictate how it can be used
[Chart: measured LogP communication parameters (g, L, Or, Os) in µs, on a 0–16 µs scale]
Virtual Networks
• Endpoint abstracts the notion of “attached to the network”
• Virtual network is a collection of endpoints that can name each other.
• Many processes on a node can each have many endpoints, each with own protection domain.
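The endpoint idea can be illustrated with a small sketch. This is not the actual AM-II or NIC interface – the types `endpoint_t` and `vnet_t` and their fields are hypothetical – but it shows how processes name peers only by small indices inside a virtual network, with a per-endpoint protection tag:

```c
#include <assert.h>

/* Hypothetical sketch (not a real API): an endpoint is a named
 * attachment point to the network; a virtual network is the set of
 * endpoints that can address one another via small integer indices. */
#define MAX_EP 8

typedef struct {
    int node;             /* physical node hosting this endpoint */
    int protection_tag;   /* per-endpoint protection domain id   */
} endpoint_t;

typedef struct {
    endpoint_t ep[MAX_EP]; /* translation table: index -> endpoint */
    int n;
} vnet_t;

/* Bind a new endpoint into the virtual network; returns its index,
 * or -1 if the table is full. */
int vnet_attach(vnet_t *vn, int node, int tag) {
    if (vn->n == MAX_EP) return -1;
    vn->ep[vn->n].node = node;
    vn->ep[vn->n].protection_tag = tag;
    return vn->n++;
}

/* A send names its peer only by index within the virtual network;
 * resolution fails for bad names or a mismatched protection domain,
 * so a process never handles raw physical addresses. */
int vnet_resolve(const vnet_t *vn, int idx, int tag) {
    if (idx < 0 || idx >= vn->n) return -1;
    if (vn->ep[idx].protection_tag != tag) return -1;
    return vn->ep[idx].node;
}
```

The point of the indirection is the same as in the slide: many endpoints per process, each in its own protection domain, with the name space decoupled from physical placement.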
How are they managed?
• How do you get direct hardware access for performance with a large space of logical resources?
• Just like virtual memory
– active portion of large logical space is bound to physical resources
[Diagram: processes 1..n with endpoints in host memory; the active endpoints are bound into NIC memory on the network interface]
Network Interface Support
• NIC has endpoint frames
• Services active endpoints
• Signals misses to driver
– using a system endpoint
[Diagram: NIC with endpoint frames 0–7, transmit and receive paths, and an endpoint-miss signal to the driver]
Communication under Load
[Diagram: clients bursting work to a pool of servers]
[Chart: aggregate msgs/s (up to ~80,000) vs. number of virtual networks (1–28); curves for continuous traffic and bursts of 1024, 2048, 4096, 8192, and 16384 msgs]
=> Use of networking resources adapts to demand.
Implicit Coscheduling
• Problem: parallel programs designed to run in parallel => huge slowdowns with local scheduling
– gang scheduling is rigid, fault prone, and complex
• Coordinate schedulers implicitly using the communication in the program
– very easy to build, robust to component failures
– inherently “service on-demand”, scalable
– local service component can evolve
[Diagram: application processes (A) coordinated by per-node local schedulers (LS) rather than a global gang scheduler (GS)]
Why it works
• Infer non-local state from local observations
• React to maintain coordination

observation       | implication            | action
fast response     | partner scheduled      | spin
delayed response  | partner not scheduled  | block
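That observation/action rule amounts to a two-phase wait. A minimal sketch, where the threshold value is illustrative rather than anything measured in the talk:

```c
#include <assert.h>

/* Sketch of the implicit-coscheduling decision rule: spin while the
 * partner appears scheduled (fast responses), block once the observed
 * wait implies the partner has been descheduled. The threshold would
 * in practice be tied to round-trip and context-switch costs. */
enum action { SPIN, BLOCK };

enum action wait_policy(long observed_wait_us, long spin_threshold_us) {
    /* Fast response -> partner is running -> stay on the CPU. */
    if (observed_wait_us <= spin_threshold_us)
        return SPIN;
    /* Delayed response -> partner not scheduled -> yield the CPU
     * so the local scheduler can run competing jobs. */
    return BLOCK;
}
```

Because every node applies the same local rule, communicating processes tend to converge on running at the same time – coordination emerges without any global scheduler.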
[Diagram: four workstations; WS 1 runs Job A while WS 2–4 run Job B then Job A; a request/response exchange with spin or sleep keeps Job A’s processes coordinated]
Example
• Range of granularity and load imbalance
– spin-wait alone: 10x slowdown
I/O Lessons from NOW sort
• Complete system on every node is a powerful basis for data-intensive computing
– complete disk sub-system
– independent file systems
» MMAP not read, MADVISE
– full OS => threads
• Remote I/O (with fast comm.) provides same bandwidth as local I/O.
• I/O performance is very temperamental
– variations in disk speeds
– variations within a disk
– variations in processing, interrupts, messaging, ...
Reactive I/O
• Loosen data semantics– ex: unordered bag of records
• Build flows from producers (e.g., disks) to consumers (e.g., summation)
• Flow data to where it can be consumed
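The flow idea can be sketched as a toy shared queue from which consumers pull at their own rate, so a faster consumer naturally drains more records. The single-queue simplification and all names here are illustrative, not the NOW-sort implementation:

```c
#include <assert.h>

/* Records are plain ints here; semantics are an unordered bag, so
 * which consumer gets which record does not matter. */
#define QCAP 64

typedef struct { int buf[QCAP]; int head, tail; } dqueue_t;

int dq_put(dqueue_t *q, int rec) {
    if (q->tail - q->head == QCAP) return 0;   /* full  */
    q->buf[q->tail++ % QCAP] = rec;
    return 1;
}

int dq_get(dqueue_t *q, int *rec) {
    if (q->head == q->tail) return 0;          /* empty */
    *rec = q->buf[q->head++ % QCAP];
    return 1;
}

/* Simulate two consumers: A polls twice per round, B once, modeling a
 * 2x-faster consumer. Flow follows capacity: A ends up with roughly
 * twice the records, with no static assignment anywhere. */
void simulate(int n, int *got_a, int *got_b) {
    dqueue_t q = {0};
    int r, i;
    *got_a = *got_b = 0;
    for (i = 0; i < n; i++) dq_put(&q, i);
    while (q.head != q.tail) {
        if (dq_get(&q, &r)) (*got_a)++;
        if (dq_get(&q, &r)) (*got_a)++;
        if (dq_get(&q, &r)) (*got_b)++;
    }
}
```

Loosening the data semantics to an unordered bag is what makes this legal: no consumer has to wait for “its” records.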
[Diagram: static parallel aggregation (each disk D feeding a fixed aggregator A) vs. adaptive parallel aggregation (disks feeding aggregators through a distributed queue)]
Performance Scaling
• Allows more data to go to faster consumer
[Charts: % of peak I/O rate vs. number of nodes, and vs. number of nodes perturbed; adaptive aggregation stays near 100% of peak while static aggregation degrades]
Service Based Applications
• Application provides services to clients
• Grows/Shrinks according to demand, availability, and faults
[Diagram: Transcend transcoding proxy – service requests flow through front-end service threads, caches, and a user-profile database manager mapped onto physical processors]
On the other hand
• Glunix
– offered much that was not available elsewhere
» interactive use, load balancing, transparency (partial), …
– straightforward master-slave architecture
– millions of jobs served, reasonable scalability, flexible partitioning
– crash-prone, inscrutable, unaware, …
• xFS
– very sophisticated co-operative caching + network RAID
– integrated at vnode layer
– never robust enough for real use
Both are hard, outstanding problems
Lessons
• Strength of clusters comes from
– complete, independent components
– incremental scalability (up and down)
– nodal isolation
• Performance heterogeneity and change are fundamental
• Subsystems and applications need to be reactive and self-tuning
• Local intelligence + simple, flexible composition
Millennium
• Campus-wide cluster of clusters
• PC based (Solaris/x86 and NT)
• Distributed ownership and control
• Computational science and internet systems testbed
[Diagram: Gigabit Ethernet backbone connecting departmental clusters – SIMS, C.S., E.E., M.E., BMRC, N.E., IEOR, C.E., MSME, NERSC, Transport, Business, Chemistry, Astro, Physics, Biology, Economy, Math]
Paranoid Construction
• What must work for RSH, DCOM, RMI, read, …?
• A page of C to safely read a line from a socket!
=> carefully controlled set of cluster system ops
=> non-blocking with timeout and full error checking– even if need a watcher thread
=> optimistic with fail-over of implementation
=> global capability at physical level
=> indirection used for transparency must track fault envelope, not just provide mapping
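As a rough illustration of that “page of C”, here is a hedged sketch of a timeout-guarded line read using POSIX `poll()`. The error-handling policy (return -1 on timeout, even mid-line; byte-at-a-time reads) is one assumed design among several, chosen for clarity rather than performance:

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>
#include <assert.h>

/* Paranoid line read: every read() is guarded by a poll() timeout and
 * full error checking, so the caller can never block forever.
 * Returns the number of bytes read on success (line ends at '\n' or
 * EOF), or -1 on error/timeout. Buffer is always NUL-terminated. */
ssize_t read_line_timeout(int fd, char *buf, size_t cap, int timeout_ms) {
    size_t n = 0;
    if (cap == 0) return -1;
    while (n + 1 < cap) {
        struct pollfd pfd = { fd, POLLIN, 0 };
        int pr = poll(&pfd, 1, timeout_ms);
        if (pr < 0) { if (errno == EINTR) continue; return -1; }
        if (pr == 0) return -1;                 /* timed out */
        char c;
        ssize_t r = read(fd, &c, 1);
        if (r < 0) { if (errno == EINTR) continue; return -1; }
        if (r == 0) break;                      /* EOF */
        buf[n++] = c;
        if (c == '\n') break;
    }
    buf[n] = '\0';
    return (ssize_t)n;
}
```

Even this sketch shows why a fully checked version runs to a page: timeout, EINTR, EOF, short reads, and buffer exhaustion each need an explicit decision.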
Computational Economy Approach
• System has a supply of various resources
• Demand for resources revealed in price
– distinct from the cost of acquiring the resources
• User has unique assessment of value
• Client agent negotiates for system resources on user’s behalf
– submits requests, receives bids or participates in auctions
– selects resources of highest value at least cost
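One way a client agent might rank received bids is by the user’s surplus – private value minus quoted price. This is a sketch; the surplus rule and the `bid_t` fields are assumptions for illustration, not the talk’s protocol:

```c
#include <assert.h>

/* A bid quotes a price for some resource; the user's private value
 * for running on that resource is carried alongside. */
typedef struct { int resource_id; double price; double value; } bid_t;

/* "Highest value at least cost" as a surplus-maximization rule:
 * return the index of the bid with the largest positive surplus
 * (value - price), or -1 if no bid is worth taking. */
int choose_bid(const bid_t *bids, int n) {
    int best = -1;
    double best_surplus = 0.0;
    for (int i = 0; i < n; i++) {
        double s = bids[i].value - bids[i].price;
        if (s > best_surplus) { best_surplus = s; best = i; }
    }
    return best;
}
```

The -1 case matters: an economic client declines to run at all when every price exceeds the user’s value, which is exactly the demand-revealing behavior the approach relies on.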
Advantages of the Approach
• Decentralized load balancing
– according to user’s perception of importance, not system’s
– adapts to system and workload changes
• Creates incentive to adopt efficient modes of use
– maintain resources in usable form
– avoid excessive usage when needed by others
– exploit under-utilized resources
– maximize flexibility (e.g., migratable, restartable applications)
• Establishes user-to-user feedback on resource usage
– basis for exchange rate across resources
• Powerful framework for system design
– natural for client to be watchful, proactive, and wary
– generalizes from resources to services
• Rich body of theory ready for application
Resource Allocation
• Traditional approach allocates requests to resources to optimize some system utility function
– e.g., put work on least loaded, most free mem, short queue, ...
• Economic approach views each user as having a distinct utility function
– e.g., can exchange resource and have both happy!
[Diagram: allocator fed by a stream of (incomplete) client requests and a stream of (partial, delayed, or incomplete) resource status information]
Pricing and all that
• What’s the value of a CPU-minute, a MB-sec, a GB-day?
• Many iterative market schemes
– raise price till load drops
• Auctions avoid setting a price
– Vickrey (second-price sealed-bid) auctions cause resources to go where they are most valued, at the lowest price
– in self-interest to reveal true utility function!
• Small problem: auctions are awkward for most real allocation problems
• Big problem: people (and their surrogates) don’t know what value to place on computation and storage!
Smart Clients
• Adopt the NT stance that “everything is two-tier, at least”
– UI stays on the desktop and interacts with computation “in the cluster of clusters” via distributed objects
– single-system image provided by a wrapper
• Client can provide complete functionality
– resource discovery, load balancing
– request remote execution service
• Flexible applications will monitor availability and adapt.
• Higher-level services as a 3-tier optimization
– directory service, membership, parallel startup
Everything is a service
• Load-balancing
• Brokering
• Replication
• Directories
=> they need to be cost-effective or the client will fall back to “self support”
– if they are cost-effective, competitors might arise
• Useful applications should be packaged as services
– their value may be greater than the cost of resources consumed
Conclusions
• We’ve got the building blocks for very interesting clustered systems
– fast communication, authentication, directories, distributed object models
• Transparency and uniform access are convenient, but...
• It is time to focus on exploiting the new characteristics of these systems in novel ways.
• We need to get real serious about availability.
• Agility (wary, reactive, adaptive) is fundamental.
• Gronky “F77 + MPI and no IO” codes will seriously hold us back
• Need to provide a better framework for cluster applications