
Cluster and Grid Computing

Pittsburgh Supercomputing Center

John Kochmar, J. Ray Scott, (Derek Simmel), (Jason Sommerfield)

Pittsburgh Supercomputing Center: Who We Are

• Cooperative effort of
  – Carnegie Mellon University
  – University of Pittsburgh
  – Westinghouse Electric

• Research Department of Carnegie Mellon
• Offices in Mellon Institute, Oakland
  – On CMU campus
  – Adjacent to University of Pittsburgh campus

Westinghouse Electric Company

Energy Center, Monroeville, PA

Agenda

• HPC Clusters

• Large Scale Clusters

• Commodity Clusters

• Cluster Software

• Grid Computing

TOP500 Benchmark Completed October 1, 2001

[Timeline of TCS milestones, May 1999 through August–October 2001]

Three Systems in the Top 500

• HP AlphaServer SC ES40 “TCSINI” – Ranked 246 with 263.6 GFlops Linpack Performance
• Cray T3E900 “Jaromir” – Ranked 182 with 341 GFlops Linpack Performance
• HP AlphaServer SC ES45 “LeMieux” – Ranked 6 with 4.463 TFlops Linpack Performance

Top Academic System

Cluster Node Count

Rank  Installation Site                          Nodes
 1    Earth Simulator Center                       640
 2    Los Alamos National Laboratory              1024
 3    Los Alamos National Laboratory              1024
 4    Lawrence Livermore National Laboratory       512
 5    Lawrence Livermore National Laboratory       128
 6    Pittsburgh Supercomputing Center             750
 7    Commissariat a l'Energie Atomique            680
 8    Forecast Systems Laboratory - NOAA           768
 9    HPCx                                          40
10    National Center for Atmospheric Research      40

One Year of Production

lemieux.psc.edu

It’s Really all About Applications

• Single CPU with common data stream – seti@home

• Large Shared Memory Jobs

• Multi-CPU Jobs

• …but, let’s talk systems!

HPC Systems Architectures

HPC Systems

• Larger SMPs

• MPP – Massively Parallel Machines

• Non Uniform Memory Access (NUMA) machines

• Clusters of smaller machines

Larger SMPs

• Pros:
  – Use existing technology and management techniques
  – Maintain parallelization paradigm (threading; see the sketch after this list)
  – It’s what users really want!

• Cons:
  – Cache coherency gets difficult
  – Increased resource contention
  – Pin counts add up
  – Increased incremental cost
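To make the threading point concrete, here is a minimal shared-memory sketch in C (a hypothetical example, not taken from the slides): several POSIX threads sum pieces of one array that all of them read directly, because an SMP gives every thread the same address space.

/*
 * Minimal shared-memory threading sketch (hypothetical example, not from
 * the slides): POSIX threads sum pieces of one array that every thread
 * sees in the single shared address space an SMP provides.
 * Build with "cc -O2 sum.c -lpthread" (filename is arbitrary).
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];             /* shared by every thread */
static double partial[NTHREADS];   /* one slot per thread, no locking needed */

static void *worker(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = lo + (N / NTHREADS);
    long i;
    double sum = 0.0;

    for (i = lo; i < hi; i++)      /* read shared data directly; no messages */
        sum += data[i];
    partial[id] = sum;
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    long i;
    double total = 0.0;

    for (i = 0; i < N; i++)
        data[i] = 1.0;
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];
    }
    printf("sum = %.0f\n", total);
    return 0;
}

The same loop spread across cluster nodes would need explicit message passing, which is the programming-model shift the cluster slides below come back to.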

HPC Clusters

• Rationale
  – If one box can’t do it, maybe 10 can…
  – Commodity hardware is advancing rapidly
  – Potentially far less costly than a single larger system
  – Big systems are only so big

HPC Clusters

• Central Issues
  – Management of multiple systems
  – Performance
    • Within each node
    • Interconnections
  – Effects on parallel programming methodology
    • Varying communication characteristics

The Next Contender?

• CPU: 128-bit CPU
• System clock frequency: 294.912 MHz
• Main memory: 32 MB Direct RDRAM
• Embedded cache VRAM: 4 MB

• I/O Processor

• CD-ROM and DVD-ROM

Why not let everyone play?

What’s a Cluster? Base Hardware

• Commodity Nodes
  – Single, Dual, Quad, ???
  – Intel, AMD
  – Switch port cost vs. CPU

• Interconnect
  – Bandwidth
  – Latency

• Storage
  – Node local
  – Shared filesystem

Terascale Computing System

Hardware Summary

• 750 ES45 Compute Nodes
• 3000 EV68 CPUs @ 1 GHz
• 6 Tflop (see the note after this list)
• 3 TB memory
• 41 TB node disk, ~90 GB/s
• Multi-rail fat-tree network
• Redundant interactive nodes
• Redundant monitor/ctrl
• WAN/LAN accessible
• File servers: 30 TB, ~32 GB/s
• Mass Store buffer disk, ~150 TB
• Parallel visualization
• ETF coupled
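A quick sanity check on the 6 Tflop number, under the assumption (not stated on the slide) that each EV68 retires two floating-point results per clock:

$3000\ \mathrm{CPUs} \times 1\ \mathrm{GHz} \times 2\ \mathrm{flops/cycle} = 6\ \mathrm{Tflop/s\ (peak)}$

The 4.463 TFlops Linpack result quoted earlier for LeMieux is the measured fraction of this peak.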

[System diagram: compute nodes and interactive nodes on the Quadrics fabric and a control LAN, with file servers (/tmp, /usr) and WAN/LAN connectivity]

• Compute Nodes: AlphaServer ES45
  – 5 nodes per cabinet
  – 3 local disks per node

Row upon row…

PSC/HP Grid Alliance

• A strategic alliance to demonstrate the potential of the National Science Foundation's Extensible TeraGrid
• 16-node HP Itanium2/Linux cluster
• Through this collaboration, PSC and HP expect to further the TeraGrid goals of enabling scalable, open source, commodity computing on IA64/Linux to address real-world problems

What’s a Cluster? Base Hardware

• Commodity Nodes
  – Single, Dual, Quad, ???
  – Switch port cost vs. CPU

• Interconnect
  – Bandwidth
  – Latency

• Storage
  – Node local
  – Shared filesystem

Cluster Interconnect: Low End

• 10/100 Mbit Ethernet
  – Very cheap
  – Slow, with high latency

• Gigabit Ethernet
  – Sweet spot
  – Especially with:
    • Channel bonding
    • Jumbo frames (see the MTU sketch after this slide)
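As a concrete illustration of the jumbo-frames bullet, the sketch below raises a Linux interface's MTU to 9000 bytes through the SIOCSIFMTU ioctl. It is a hypothetical example rather than anything from the slides: the interface name eth0 is only a placeholder, the usual tool for this is ifconfig or ip, it needs root privileges, and the switch plus every peer node must also be configured for the larger frames.

/* Hypothetical jumbo-frame sketch (not from the slides): set a Linux
 * network interface's MTU to 9000 via the SIOCSIFMTU ioctl. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

int main(int argc, char **argv)
{
    const char *ifname = (argc > 1) ? argv[1] : "eth0";  /* placeholder name */
    int mtu = (argc > 2) ? atoi(argv[2]) : 9000;         /* jumbo-frame MTU */
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);             /* any socket works */

    if (fd < 0) {
        perror("socket");
        return 1;
    }
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    ifr.ifr_mtu = mtu;
    if (ioctl(fd, SIOCSIFMTU, &ifr) < 0) {               /* needs root */
        perror("SIOCSIFMTU");
        close(fd);
        return 1;
    }
    printf("%s MTU set to %d\n", ifname, mtu);
    close(fd);
    return 0;
}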

Cluster Interconnect, cont.: Mid-Range

• Myrinet
  – http://www.myrinet.com/
  – High speed with good (not great) latency
  – High port-count switches
  – Well adopted and supported in the cluster community

• InfiniBand
  – Emerging
  – Should be inexpensive and pervasive

Cluster Interconnect, cont.: Outta Sight!

• Quadrics Elan
  – http://www.quadrics.com/
  – Very high performance
    • Great speed
    • Spectacular latency
  – Software
    • RMS
    • QSNET
  – Becoming more “commodity”

(A simple way to measure an interconnect's latency and bandwidth is sketched below.)
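The bandwidth and latency bullets on the last three slides are exactly what a two-rank ping-pong test measures. Below is a minimal MPI sketch in C, a generic example rather than PSC benchmark code: rank 0 and rank 1 bounce messages back and forth, timing empty messages for latency and 1 MB messages for bandwidth.

/* Minimal MPI ping-pong sketch (hypothetical example): estimate one-way
 * latency with empty messages and bandwidth with 1 MB messages. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int reps = 1000;
    const int nbytes = 1 << 20;        /* 1 MB payload for the bandwidth test */
    int rank, size, i;
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    buf = malloc(nbytes);

    /* Latency: round-trip empty messages, report half the round-trip time. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency ~ %.1f us\n", (t1 - t0) / (2.0 * reps) * 1e6);

    /* Bandwidth: round-trip 1 MB messages, count bytes moved per second. */
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("bandwidth ~ %.1f MB/s\n", 2.0 * reps * nbytes / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

Run it with two ranks placed on different nodes (for example, mpirun -np 2 ./pingpong) so the numbers reflect the interconnect rather than shared memory inside one box.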

Wiring: Quadrics

[Diagram and photos: a federated 512–1024-way switch (4096- and 8192-way are the same, only bigger) built from 8 x 16-way switches feeding 8–16 64U64D switches (13 for TCS); overhead cables; a fully wired switch cabinet, 1 of 24, with wires running up and down]

What’s a Cluster? Base Hardware

• Commodity Nodes
  – Single, Dual, Quad, ???
  – Switch port cost vs. CPU

• Interconnect
  – Bandwidth
  – Latency

• Storage
  – Node local
  – Shared filesystem

Commodity Cache Servers

• Linux
• Custom software
  – libtcom/tcsiod
  – Coherency Manager (SLASH)
• Special-purpose DASP
  – Connection to outside
  – Multi-protocol
    • *ftp
    • SRB
    • Globus
• 3Ware SCSI/ATA disk controllers

What’s a Cluster? System Software

• Installation

• Replication

• Consistency

• Parallel File System

• Resource Management

• Job Control

• Installation
• Replication
• Consistency

Job Management Software

[Diagram: users submit jobs to queues; batch job management via PBS/RMS and the “Simon” scheduler (TCS scheduling practices) drives job invocation and process distribution, execution, and control on the compute nodes, with checkpoint/restart (CPR), tcscomm, and requeue; supporting pieces include usage accounting (database), monitoring, node event management tied to a call-tracking / field-service database and user notification, tcscopy / hsmtcscopy to the user file servers and HSM, and the visualization nodes]

PSC Terascale Computing System

Monitoring / Non-Contiguous Scheduling

What’s a Cluster? Application Support

• Parallel Execution
• MPI (minimal example below)
  – http://www.mpi-forum.org/
• Shared Memory
• Other…
  – Portals
  – Global Arrays
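For the MPI bullet, here is a minimal “hello” sketch in C (a generic example, not code from the slides): every rank reports its rank, the size of MPI_COMM_WORLD, and the node it landed on.

/* Minimal MPI "hello" sketch (hypothetical example): each rank reports
 * who it is and where it is running. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of ranks */
    MPI_Get_processor_name(name, &len);        /* host this rank runs on */

    printf("rank %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched through the site's batch system, this is usually the first check that a cluster's MPI stack and interconnect are wired up correctly.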

Building Your Cluster

• Pre-built
  – PSSC – Chemistry
  – Tempest

• Roll-your-own
  – Campus resources
  – Web

• Use PSC
  – Rich Raymond (raymond@psc.edu)
  – http://www.psc.edu/homepage_files/state_funding.html

OSCAR

• Open Source Cluster Application Resources
• Cluster on a CD – automates the cluster install process
• Wizard driven
• Nodes are built over the network
• OSCAR <= 64-node clusters for the initial target
• Works on PC commodity components
• RedHat based (for now)
• Components: open source and BSD-style license
• NCSA “Cluster in a Box” base

www.oscar.sourceforge.net

NPACI Rocks

• Enable application scientists to build and manage their own resources
  – Hardware cost is not the problem
  – System administrators cost money, and do not scale
  – Software can replace much of the day-to-day grind of system administration

• Train the next generation of users on loosely coupled parallel machines
  – Current price-performance leader for HPC
  – Users will be ready to “step up” to NPACI (or other) resources when needed

• Rocks scales to Top500-sized resources
  – Experiment on small clusters
  – Build your own supercomputer with the same software!

www.rocksclusters.org


GriPhyN and European DataGrid

[Diagram: layered grid architecture – interactive user tools for the production team, individual investigators, and other users; virtual data tools, request planning and scheduling tools, and request execution management tools; resource management, security and policy, and other grid services; transforms and raw data sources over distributed resources (code, storage, computers, and network)]

Illustration courtesy C. Catlett, ©2001 Global Grid Forum

Extensible Terascale Facility - ETF "TeraGrid"

[Diagram: five sites connected by the Extensible Backplane Network through LA and Chicago hubs, with 30 Gb/s site links and a 40 Gb/s hub-to-hub link]

• NCSA (Compute-Intensive): 10 TF IA-64, 128 large-memory nodes, 230 TB storage
• SDSC (Data-Intensive): 5 TF IA-64, DB2 server, 500 TB storage, 1.1 TF Power4
• PSC (Compute-Intensive): 6 TF EV68, 71 TB storage; 0.3 TF EV7 shared-memory, 150 TB storage server
• ANL (Visualization): 1.25 TF IA-64, 96 visualization nodes, 20 TB storage
• Caltech (Data collection analysis): 0.4 TF IA-64, IA32 Datawulf, 80 TB storage

Grid Building Blocks

Middleware: hardware and software infrastructure to enable access to computational resources

Services:
• Security
• Information Services
• Resource Discovery / Location
• Resource Management
• Fault Tolerance / Detection

www.globus.org

Thank You

lemieux.psc.edu