High Performance Linux Clusters


Transcript of High Performance Linux Clusters

Page 1: High Performance Linux Clusters

High Performance Linux Clusters

Guru Session, Usenix, Boston

June 30, 2004

Greg Bruno, SDSC

Page 2: High Performance Linux Clusters

Overview of San Diego Supercomputer Center

Founded in 1985

Non-military access to supercomputers

Over 400 employees

Mission: innovate, develop, and deploy technology to advance science

Recognized as an international leader in: grid and cluster computing, data management, high performance computing, networking, visualization

Primarily funded by NSF

Page 3: High Performance Linux Clusters

My Background

1984 - 1998: NCR - Helped build the world’s largest database computers; saw the transition from proprietary parallel systems to clusters

1999 - 2000: HPVM - Helped build Windows clusters

2000 - Now: Rocks - Helping to build Linux-based clusters

Page 4: High Performance Linux Clusters

Why Clusters?

Page 5: High Performance Linux Clusters

Moore’s Law

Page 6: High Performance Linux Clusters

Cluster Pioneers

In the mid-1990s, Network of Workstations project (UC Berkeley) and the Beowulf Project (NASA) asked the question:

Can You Build a High Performance Machine From Commodity Components?

Page 7: High Performance Linux Clusters

The Answer is: Yes

Source: Dave Pierce, SIO

Page 8: High Performance Linux Clusters

The Answer is: Yes

Page 9: High Performance Linux Clusters

Types of Clusters

High Availability: generally small (less than 8 nodes)

Visualization

High Performance: computational tools for scientific computing, large database machines

Page 10: High Performance Linux Clusters

High Availability Cluster

Composed of redundant components and multiple communication paths

Page 11: High Performance Linux Clusters

Visualization Cluster

Each node in the cluster drives a display

Page 12: High Performance Linux Clusters

High Performance Cluster

Constructed with many compute nodes and often a high-performance interconnect

Page 13: High Performance Linux Clusters

Cluster Hardware Components

Page 14: High Performance Linux Clusters

Cluster Processors

Pentium/Athlon, Opteron, Itanium

Page 15: High Performance Linux Clusters

Processors: x86

Most prevalent processor used in commodity clustering

Fastest integer processor on the planet: 3.4 GHz Pentium 4, SPEC2000int: 1705

Page 16: High Performance Linux Clusters

Processors: x86

Capable floating point performance: the #5 machine on the Top500 list is built with Pentium 4 processors

Page 17: High Performance Linux Clusters

Processors: Opteron

Newest 64-bit processor

Excellent integer performance: SPEC2000int: 1655

Good floating point performance: SPEC2000fp: 1691

#10 machine on Top500

Page 18: High Performance Linux Clusters

Processors: Itanium

First systems released June 2001

Decent integer performance: SPEC2000int: 1404

Fastest floating-point performance on the planet: SPEC2000fp: 2161

Impressive Linpack efficiency: 86%

Page 19: High Performance Linux Clusters

Processors Summary

Processor       GHz    SPECint    SPECfp    Price ($)
Pentium 4 EE    3.4    1705       1561      791
Athlon FX-51    2.2    1447       1423      728
Opteron 150     2.4    1655       1644      615
Itanium 2       1.5    1404       2161      4798
Itanium 2       1.3    1162       1891      1700
Power4+         1.7    1158       1776      ????

Page 20: High Performance Linux Clusters


But What Do You Really Build?

Itanium: Dell PowerEdge 3250, two 1.4 GHz CPUs (1.5 MB cache), 11.2 Gflops peak

2 GB memory, 36 GB disk: $7,700

Two 1.5 GHz CPUs (6 MB cache) makes the system cost ~$17,700

1.4 GHz vs. 1.5 GHz: ~7% slower, and the 1.5 GHz system costs ~130% more

Page 21: High Performance Linux Clusters


Opteron

IBM eServer 325: two 2.0 GHz Opteron 246 CPUs, 8 Gflops peak

2 GB memory, 36 GB disk: $4,539

Two 2.4 GHz CPUs: $5,691

2.0 GHz vs. 2.4 GHz: ~17% slower, ~25% cheaper

Page 22: High Performance Linux Clusters

Pentium 4 Xeon

HP DL140: two 3.06 GHz CPUs, 12 Gflops peak

2 GB memory, 80 GB disk: $2,815

Two 3.2 GHz CPUs: $3,368

3.06 GHz vs. 3.2 GHz: ~4% slower, ~20% cheaper


Page 23: High Performance Linux Clusters

If You Had $100,000 To Spend On A Compute Farm

System                 # of Boxes    Peak GFlops    Aggregate SPEC2000fp    Aggregate SPEC2000int
Pentium 4 3 GHz        35            420            89810                   104370
Opteron 246 2.0 GHz    22            176            56892                   57948
Itanium 1.4 GHz        12            132            46608                   24528
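The aggregate figures in this table follow from simple arithmetic: divide the budget by the per-box price, then multiply the resulting box count by two CPUs and the per-CPU SPEC score. A minimal Python sketch of that calculation; the prices and per-CPU scores below are rounded assumptions for illustration, not the exact figures behind the table.

# Rough price/performance arithmetic for a fixed hardware budget.
# Prices and per-CPU SPEC scores are illustrative assumptions, not quotes.
BUDGET = 100_000  # USD

# name: (dual-CPU box price, peak GFlops per box, SPECfp per CPU, SPECint per CPU)
systems = {
    "Pentium 4 3 GHz":     (2_800, 12.0, 1280, 1490),
    "Opteron 246 2.0 GHz": (4_500,  8.0, 1290, 1320),
    "Itanium 1.4 GHz":     (7_700, 11.2, 1940, 1020),
}

for name, (price, gflops, specfp, specint) in systems.items():
    boxes = BUDGET // price          # how many dual-CPU boxes fit in the budget
    print(f"{name:22s} boxes={boxes:3d} "
          f"peak={boxes * gflops:5.0f} GFlops "
          f"SPECfp={boxes * 2 * specfp:6d} "
          f"SPECint={boxes * 2 * specint:6d}")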

Page 24: High Performance Linux Clusters

What People Are Buying

Gartner study

Servers shipped in 1Q04 Itanium: 6,281 Opteron: 31,184

Opteron shipped 5x more servers than Itanium

Page 25: High Performance Linux Clusters

What Are People Buying

Gartner study

Servers shipped in 1Q04 Itanium: 6,281 Opteron: 31,184 Pentium: 1,000,000

Pentium shipped 30x more than Opteron

Page 26: High Performance Linux Clusters

Interconnects

Page 27: High Performance Linux Clusters

Interconnects

Ethernet: most prevalent on clusters

Low-latency interconnects: Myrinet, Infiniband, Quadrics, Ammasso

Page 28: High Performance Linux Clusters

Why Low-Latency Interconnects?

Performance: lower latency and higher bandwidth

Accomplished through OS-bypass

Page 29: High Performance Linux Clusters

How Low Latency Interconnects Work

Decrease the latency of a packet by reducing the number of memory copies per packet
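Latency like this is usually measured with a ping-pong microbenchmark: half the round-trip time of a small message between two ranks approximates the one-way latency that OS-bypass is trying to shrink. A minimal sketch, assuming mpi4py is available (it is not part of the stack described in these slides).

# Ping-pong latency microbenchmark: half the average round-trip time of a
# small message between rank 0 and rank 1 approximates the one-way latency.
# Run with two ranks, e.g.: mpirun -np 2 python pingpong.py (file name arbitrary)
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
msg = bytearray(8)            # small message, so the cost is latency-dominated
reps = 10_000

comm.Barrier()
start = time.perf_counter()
for _ in range(reps):
    if rank == 0:
        comm.Send(msg, dest=1)
        comm.Recv(msg, source=1)
    elif rank == 1:
        comm.Recv(msg, source=0)
        comm.Send(msg, dest=0)
elapsed = time.perf_counter() - start

if rank == 0:
    print(f"one-way latency ~ {elapsed / reps / 2 * 1e6:.1f} us")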

Page 30: High Performance Linux Clusters

Bisection Bandwidth

Definition: if you split the system in half, what is the maximum amount of data that can pass between the two halves?

Assuming 1 Gb/s links: Bisection bandwidth = 1 Gb/s

Page 31: High Performance Linux Clusters

Bisection Bandwidth

Assuming 1 Gb/s links: Bisection bandwidth = 2 Gb/s

Page 32: High Performance Linux Clusters

Bisection Bandwidth

Definition: Full bisection bandwidth is a network topology that can support N/2 simultaneous communication streams.

That is, the nodes on one half of the network can communicate with the nodes on the other half at full speed.

Page 33: High Performance Linux Clusters

Large Networks

When you run out of ports on a single switch, you must add another network stage

In the example above: assuming 1 Gb/s links, the uplinks from stage-1 switches to stage-2 switches must carry at least 6 Gb/s

Page 34: High Performance Linux Clusters

Large Networks

With low-port-count switches, large systems need many switches to maintain full bisection bandwidth

A 128-node system built from 32-port switches requires 12 switches and 256 total cables
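The 12-switch / 256-cable figure falls out of a two-stage (leaf/spine) layout in which each leaf switch splits its ports evenly between nodes and uplinks, so every node link is matched by an uplink. A short Python sketch of the arithmetic:

# Two-stage full-bisection network sizing: 128 nodes, 32-port switches.
nodes = 128
ports = 32

nodes_per_leaf = ports // 2                  # half the ports face nodes: 16
leaf_switches = nodes // nodes_per_leaf      # 8 leaf switches
uplinks = leaf_switches * (ports // 2)       # 128 uplink cables (full bisection)
spine_switches = uplinks // ports            # 4 spine switches

print("switches =", leaf_switches + spine_switches)   # 12
print("cables   =", nodes + uplinks)                  # 128 node + 128 uplink = 256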

Page 35: High Performance Linux Clusters

Myrinet

Long-time interconnect vendor; delivering products since 1995

Delivers a single 128-port full-bisection-bandwidth switch

MPI performance: latency 6.7 us, bandwidth 245 MB/s

Cost/port (based on a 64-port configuration): $1,000 (switch + NIC + cable)

http://www.myri.com/myrinet/product_list.html


Page 36: High Performance Linux Clusters

Myrinet

Recently announced a 256-port switch, available August 2004


Page 37: High Performance Linux Clusters

Myrinet

#5 system on the Top500 list

Sustains 64% of peak performance, but smaller Myrinet-connected systems hit 70-75% of peak

Page 38: High Performance Linux Clusters

Quadrics

QsNetII E-series released at the end of May 2004

Delivers 128-port standalone switches

MPI performance: latency 3 us, bandwidth 900 MB/s

Cost/port (based on a 64-port configuration): $1,800 (switch + NIC + cable)

http://doc.quadrics.com/Quadrics/QuadricsHome.nsf/DisplayPages/A3EE4AED738B6E2480256DD30057B227

Page 39: High Performance Linux Clusters

Quadrics

#2 on the Top500 list sustains 86% of peak

Other Quadrics-connected systems on the Top500 list sustain 70-75% of peak

Page 40: High Performance Linux Clusters

Infiniband

Newest cluster interconnect

Currently shipping 32-port switches and 192-port switches

MPI performance: latency 6.8 us, bandwidth 840 MB/s

Estimated cost/port (based on a 64-port configuration): $1,700 - $3,000 (switch + NIC + cable)

http://www.techonline.com/community/related_content/24364

Page 41: High Performance Linux Clusters

Ethernet

Latency: 80 us

Bandwidth: 100 MB/s

The Top500 list has Ethernet-based systems sustaining between 35% and 59% of peak

Page 42: High Performance Linux Clusters

Ethernet

What we did with 128 nodes and a $13,000 Ethernet network: $101/port ($28/port with our latest Gigabit Ethernet switch); sustained 48% of peak

With Myrinet, we would have sustained ~1 Tflop, at a cost of ~$130,000, roughly 1/3 the cost of the system

Page 43: High Performance Linux Clusters

Rockstar Topology

24-port switches; not a symmetric network

Best case: 4:1 bisection bandwidth; worst case: 8:1; average: 5.3:1
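These ratios are oversubscription ratios: the node-facing bandwidth into a leaf switch divided by the uplink bandwidth out of it. A hedged Python sketch of the arithmetic; the node counts and 4 Gb/s trunks below are assumptions chosen only to illustrate the calculation, not Rockstar's exact wiring.

# Oversubscription (bisection) ratio for a leaf switch:
# total node-facing bandwidth divided by total uplink bandwidth.
def oversubscription(node_count, node_gbps, uplink_gbps):
    return (node_count * node_gbps) / uplink_gbps

# Assumed examples: a 24-port gigabit switch with a 4 Gb/s trunk uplink.
print(oversubscription(16, 1, 4))   # 4.0 -> a "4:1" switch
print(oversubscription(20, 1, 4))   # 5.0 -> a "5:1" switch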

Page 44: High Performance Linux Clusters

Low-Latency Ethernet

Brings OS-bypass to Ethernet

Projected performance: latency less than 20 us, bandwidth 100 MB/s

Could potentially merge the management and high-performance networks

Vendor: Ammasso

Page 45: High Performance Linux Clusters

Application Benefits

Page 46: High Performance Linux Clusters

Storage

Page 47: High Performance Linux Clusters

Local Storage

Exported to compute nodes via NFS

Page 48: High Performance Linux Clusters

Network Attached Storage

A NAS box is an embedded NFS appliance

Page 49: High Performance Linux Clusters

Storage Area Network

Provides a disk block interface over a network (Fibre Channel or Ethernet)

Moves the shared disks out of the servers and onto the network

Still requires a central service to coordinate file system operations

Page 50: High Performance Linux Clusters

Parallel Virtual File System

PVFS version 1 has no fault tolerance PVFS version 2 (in beta) has fault tolerance mechanisms

Page 51: High Performance Linux Clusters

Lustre

Open Source “Object-based” storage

Files become objects, not blocks

Page 52: High Performance Linux Clusters

Cluster Software

Page 53: High Performance Linux Clusters

Cluster Software Stack

Linux kernel/environment: Red Hat, SuSE, Debian, etc.

Page 54: High Performance Linux Clusters

Cluster Software Stack

HPC device drivers: interconnect drivers (e.g., Myrinet, Infiniband, Quadrics) and storage drivers (e.g., PVFS)

Page 55: High Performance Linux Clusters

Cluster Software Stack

Job scheduling and launching: Sun Grid Engine (SGE), Portable Batch System (PBS), Load Sharing Facility (LSF)

Page 56: High Performance Linux Clusters

Cluster Software Stack

Cluster software management: e.g., Rocks, OSCAR, Scyld

Page 57: High Performance Linux Clusters

Cluster Software Stack

Cluster state management and monitoring

Monitoring: Ganglia, Clumon, Nagios, Tripwire, Big Brother

Management: node naming and configuration (e.g., DHCP)
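Ganglia's gmond daemon publishes cluster state as an XML document over TCP (port 8649 by default), so even a small script can pull monitoring data. A minimal Python sketch, assuming a reachable gmond and hosts that report the standard load_one metric.

# Pull the Ganglia gmond XML dump and print each host's one-minute load.
# Assumes gmond is running and reachable; 8649 is its default TCP port.
import socket
import xml.etree.ElementTree as ET

def gmond_xml(host="localhost", port=8649):
    chunks = []
    with socket.create_connection((host, port)) as s:
        while True:
            data = s.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

root = ET.fromstring(gmond_xml())
for node in root.iter("HOST"):
    metrics = {m.get("NAME"): m.get("VAL") for m in node.iter("METRIC")}
    print(node.get("NAME"), "load_one =", metrics.get("load_one"))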

Page 58: High Performance Linux Clusters

Cluster Software Stack

Message passing and communication layer: e.g., sockets, MPICH, PVM

Page 59: High Performance Linux Clusters

Cluster Software Stack

Parallel code / web farm / grid / computer lab: locally developed code

Page 60: High Performance Linux Clusters

Cluster Software Stack

Questions:

How do you deploy this stack across every machine in the cluster?

How do you keep this stack consistent across every machine?

Page 61: High Performance Linux Clusters

Software Deployment

Known methods:

Manual approach

“Add-on” method: bring up a frontend, then add cluster packages (OpenMosix, OSCAR, Warewulf)

Integrated: cluster packages are added at frontend installation time (Rocks, Scyld)

Page 62: High Performance Linux Clusters

Rocks

Page 63: High Performance Linux Clusters

Primary Goal

Make clusters easy

Target audience: Scientists who want a capable computational resource in their own lab

Page 64: High Performance Linux Clusters

Philosophy

It’s not fun to handle the “care and feeding” of a system

All compute nodes are 100% automatically installed; this is critical for scaling

It is essential to track software updates: RHEL 3.0 has issued 232 source RPM updates since Oct 21, roughly 1 updated SRPM per day

Run on heterogeneous, standard, high-volume components; use the components that offer the best price/performance!

Page 65: High Performance Linux Clusters

More Philosophy

Use installation as the common mechanism to manage a cluster; everyone installs a system: on initial bring-up, when replacing a dead node, and when adding new nodes

Rocks also uses installation to keep software consistent: if you catch yourself wondering whether a node’s software is up-to-date, reinstall! In 10 minutes, all doubt is erased

Rocks doesn’t attempt to incrementally update software

Page 66: High Performance Linux Clusters

Rocks Cluster Distribution

Fully-automated, cluster-aware distribution; a cluster on a CD set

Software packages: a full Red Hat Linux distribution (Red Hat Enterprise Linux 3.0 rebuilt from source), de-facto standard cluster packages, Rocks packages, and Rocks community packages

System configuration: configures the services in the packages

Page 67: High Performance Linux Clusters

Rocks Hardware Architecture

Page 68: High Performance Linux Clusters

Minimum Components

X86, Opteron, IA64 server

Local hard drive

Power

Ethernet

OS on all nodes (not SSI)

Page 69: High Performance Linux Clusters

Optional Components

Myrinet high-performance network (Infiniband support in Nov 2004)

Network-addressable power distribution unit

A keyboard/video/mouse network is not required: it is non-commodity (how do you manage your management network?), and crash carts have a lower TCO

Page 70: High Performance Linux Clusters

Storage

NFS: the frontend exports all home directories

Parallel Virtual File System version 1: system nodes can be targeted as Compute + PVFS or strictly PVFS nodes

Page 71: High Performance Linux Clusters

Minimum Hardware Requirements

Frontend: 2 Ethernet connections, 18 GB disk drive, 512 MB memory

Compute: 1 Ethernet connection, 18 GB disk drive, 512 MB memory

Power and Ethernet switches

Page 72: High Performance Linux Clusters

Cluster Software Stack

Page 73: High Performance Linux Clusters

Rocks ‘Rolls’

Rolls are containers for software packages and the configuration scripts for the packages

Rolls dissect a monolithic distribution

Page 74: High Performance Linux Clusters

Rolls: User-Customizable Frontends

Rolls are added by the Red Hat installer; software is added and configured at initial installation time

Benefit: security patches are applied during the initial installation, which is more secure than the add-on method

Page 75: High Performance Linux Clusters

Red Hat Installer Modified to Accept Rolls

Page 76: High Performance Linux Clusters

Approach

Install a frontend:
1. Insert the Rocks Base CD
2. Insert Roll CDs (optional components)
3. Answer 7 screens of configuration data
4. Drink coffee (takes about 30 minutes to install)

Install compute nodes:
1. Login to the frontend
2. Execute insert-ethers
3. Boot a compute node with the Rocks Base CD (or PXE)
4. insert-ethers discovers the node
5. Go to step 3

Add user accounts

Start computing

Optional Rolls: Condor, Grid (based on NMI R4), Intel (compilers), Java, SCE (developed in Thailand), Sun Grid Engine, PBS (developed in Norway), Area51 (security monitoring tools)

Page 77: High Performance Linux Clusters

Login to Frontend

Create an ssh public/private key pair; you will be asked for a passphrase. These keys are used to securely log into compute nodes without having to enter a password each time

Execute ‘insert-ethers’: this utility listens for new compute nodes

Page 78: High Performance Linux Clusters

Insert-ethers

Used to integrate “appliances” into the cluster

Page 79: High Performance Linux Clusters

Boot a Compute Node in Installation Mode

Instruct the node to network boot; this forces the compute node to run the PXE (Pre-eXecution Environment) protocol

You can also use the Rocks Base CD; if there is no CD and no PXE-enabled NIC, you can use a boot floppy built from ‘Etherboot’ (http://www.rom-o-matic.net)

Page 80: High Performance Linux Clusters

Insert-ethers Discovers the Node

Page 81: High Performance Linux Clusters

Insert-ethers Status

Page 82: High Performance Linux Clusters

eKV: Ethernet Keyboard and Video

Monitor your compute node installation over the Ethernet network; no KVM required!

Execute: ‘ssh compute-0-0’

Page 83: High Performance Linux Clusters

Node Info Stored In A MySQL Database

If you know SQL, you can execute some powerful commands
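As a hedged illustration of that idea, the Python sketch below queries the cluster database for node names and their MAC/IP pairs. The table, column, and account names are assumptions made for this example and may not match the actual Rocks schema.

# List node name / MAC / IP from the cluster database.
# Table, column, and account names are illustrative assumptions;
# adjust them to the real schema.
import MySQLdb   # assumes the MySQLdb module is installed

db = MySQLdb.connect(host="localhost", user="apache", db="cluster")
cur = db.cursor()
cur.execute(
    "SELECT nodes.name, networks.mac, networks.ip "
    "FROM nodes JOIN networks ON networks.node = nodes.id "
    "ORDER BY nodes.name"
)
for name, mac, ip in cur.fetchall():
    print(name, mac, ip)
db.close()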

Page 84: High Performance Linux Clusters

Cluster Database

Page 85: High Performance Linux Clusters

Kickstart

Red Hat’s Kickstart: a monolithic flat ASCII file with no macro language; requires forking based on site information and node type

Rocks XML Kickstart: decomposes a kickstart file into nodes and a graph; the graph specifies an OO framework, each node specifies a service and its configuration, macros and SQL handle site configuration, and generation is driven from a web CGI script

Page 86: High Performance Linux Clusters

Sample Node File

<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@KICKSTART_DTD@" [<!ENTITY ssh "openssh">]>
<kickstart>

  <description>Enable SSH</description>

  <package>&ssh;</package>
  <package>&ssh;-clients</package>
  <package>&ssh;-server</package>
  <package>&ssh;-askpass</package>

  <post>

  <file name="/etc/ssh/ssh_config">
  Host *
        CheckHostIP no
        ForwardX11 yes
        ForwardAgent yes
        StrictHostKeyChecking no
        UsePrivilegedPort no
        FallBackToRsh no
        Protocol 1,2
  </file>

  chmod o+rx /root
  mkdir /root/.ssh
  chmod o+rx /root/.ssh

  </post>
</kickstart>

Page 87: High Performance Linux Clusters

Sample Graph File

<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@GRAPH_DTD@">

<graph>
  <description>Default Graph for NPACI Rocks.</description>

  <edge from="base" to="scripting"/>
  <edge from="base" to="ssh"/>
  <edge from="base" to="ssl"/>
  <edge from="base" to="lilo" arch="i386"/>
  <edge from="base" to="elilo" arch="ia64"/>
  …
  <edge from="node" to="base" weight="80"/>
  <edge from="node" to="accounting"/>
  <edge from="slave-node" to="node"/>
  <edge from="slave-node" to="nis-client"/>
  <edge from="slave-node" to="autofs-client"/>
  <edge from="slave-node" to="dhcp-client"/>
  <edge from="slave-node" to="snmp-server"/>
  <edge from="slave-node" to="node-certs"/>
  <edge from="compute" to="slave-node"/>
  <edge from="compute" to="usher-server"/>
  <edge from="master-node" to="node"/>
  <edge from="master-node" to="x11"/>
  <edge from="master-node" to="usher-client"/>
</graph>
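Because the graph is just XML edges, assembling the set of configuration nodes that make up an appliance is a reachability walk from its name. The Python sketch below illustrates that idea; it is not the actual Rocks kickstart generator, and the graph file name default.xml is an assumption.

# Walk a Rocks-style kickstart graph: collect every configuration node
# reachable from a starting appliance, honoring per-edge "arch" annotations.
# Illustrative sketch only, not the real Rocks tooling.
import xml.etree.ElementTree as ET
from collections import defaultdict

def load_edges(graph_file, arch="i386"):
    edges = defaultdict(list)
    for edge in ET.parse(graph_file).getroot().iter("edge"):
        # Keep edges with no arch annotation or one matching the target arch.
        if edge.get("arch") in (None, arch):
            edges[edge.get("from")].append(edge.get("to"))
    return edges

def reachable(edges, start):
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges[node])
    return seen

edges = load_edges("default.xml", arch="i386")
print(sorted(reachable(edges, "compute")))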

Page 88: High Performance Linux Clusters

Kickstart framework

Page 89: High Performance Linux Clusters

Appliances

Laptop / desktop appliances are final classes (node types)

Desktop IsA standalone

Laptop IsA standalone, pcmcia

Code re-use is good

Page 90: High Performance Linux Clusters

Architecture Differences

Conditional inheritance: annotate edges with target architectures

If i386: Base IsA grub
If ia64: Base IsA elilo

One graph, many CPUs: heterogeneity is easy (not so for SSI or imaging)

Page 91: High Performance Linux Clusters

Installation Timeline

Page 92: High Performance Linux Clusters

Status

Page 93: High Performance Linux Clusters

But Are Rocks Clusters High Performance Systems?

Rocks clusters on the June 2004 Top500 list:

Page 94: High Performance Linux Clusters
Page 95: High Performance Linux Clusters

What We Proposed To Sun

Let’s build a Top500 machine … … from the ground up … … in 2 hours … … in the Sun booth at Supercomputing ‘03

Page 96: High Performance Linux Clusters

Rockstar Cluster (SC’03)

Demonstrate that we are now in the age of “personal supercomputing”

Highlight the abilities of Rocks and SGE

Top500 list: #201 - November 2003, #413 - June 2004

Hardware: 129 Intel Xeon servers (1 frontend node, 128 compute nodes)

Gigabit Ethernet: $13,000 (US), 9 24-port switches, 8 4-gigabit trunk uplinks