Clusters to Supercomputers
Schenk’s System Administration April 2008
Matthew Woitaszek, University of Colorado, Boulder
NCAR Computer Science Section
Presented for Chris Schenk's CSCI 4113: Unix System Administration
We’re Hiring!
System Administrators
Software Developers
Web Technology Geeks (nobody does server-side refresh anymore)
Job positions at NCAR and CU: full-time, part-time, and occasional
April 2008 2
Outline
Motivation: My other computer is a …
Parallel Computing: Processors, Networks, Storage, Software
Grid Computing: Software, Platforms
April 2008 3
James Demmel’s Reasons for HPC
Traditional scientific and engineering paradigm: do theory or paper design; perform experiments or build the system
Replacing both by numerical experiments: real phenomena are too complicated to model by hand, and real experiments are:
too hard (e.g., build large wind tunnels), too expensive (e.g., build a throw-away passenger jet), too slow (e.g., wait for climate or galactic evolution), too dangerous (e.g., weapons, drug design)
April 2008 5
Time-Critical Simulations
NCAR's time-critical HPC simulations: mesoscale meteorology, global climate
My favorite: Traffic simulations
Require more than a single processor to complete in a reasonable amount of time
Realtime Computing: model and simulate to predict
The forecast simulation has to be done before the weather happens.
April 2008 6
Performance: Vector vs. Parallel MM5 (1999)
Chart: non-realtime MM5 execution time (sec, 0-7000) vs. number of processors (2-16), comparing the Cray J90, the Cray T3E, and a Linux cluster (theHIVE); 3600 sec is marked on the time axis.
J. Dorband, J. Kouatchou, J. Michalakes, and U. Ranawake, “Implementing MM5 on NASA Goddard Space Flight Center computing systems: a performance study”, 1999.
April 2008 7
Performance: POP 640x768 (2003)
Chart: POP 640x768 simulated years per wall-clock day (0-3.5) vs. number of processors (8-72) for Xeon 2.4 dual/Dolphin, Xeon 2.4 single/Dolphin, Xeon 2.4 single/Myrinet D, Xeon 2.8 dual/Myrinet D, Xeon 3.06 dual/Infiniband, Opteron 2.0/Myrinet D (A), Opteron 2.0/Myrinet D (B), and IBM p690.
POP on Xeon: memory bandwidth limit
M. Woitaszek, M. Oberg, and H. M. Tufo, “Comparing Linux Clusters for the Community Climate Systems Model”, 2003.
April 2008 8
Realtime Computing by Parallelization
Minneapolis I-494 Highway Simulation (1997)
Traffic flow simulation: 15.5 miles, 17 exit ramps, 19 entry ramps, t = 0.5 s, d = 100 ft, T = 24 hr
April 2008 9
Performance: Realtime Computing by Parallelization
Legacy code and hardware: Intel P133 single processor, 65.7 minutes (3942 seconds) simulation time
Massively parallel implementation: Cray T3E, 450 MHz Alpha 21164
67.04 seconds with 1 PE (60x faster than the P133)
6.26 seconds with 16 PEs (629x)
2.39 seconds with 256 PEs (1649x)
Wait! If it takes ~1 minute on 1 PE, shouldn't it take 67 seconds / 16 ≈ 4.2 s on 16 PEs, or 67 seconds / 256 ≈ 0.26 s on 256 PEs?
C. Johnston and A. Chronopoulos, "The parallelization of a highway traffic flow simulation", 1999.
April 2008 10
Speedup and Overhead
Amdahl's Law: the part you don't optimize comes back to haunt you!
Speedup is limited by memory latency, disk I/O bottlenecks, network bandwidth and latency, and the algorithm
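A rough worked reading of the previous slide's numbers (the split between serial and parallel work is an illustrative assumption): if a fraction p of the single-PE time parallelizes perfectly, Amdahl's Law bounds the speedup on n PEs by 1 / ((1 - p) + p/n). With about 3% of the 67-second run not parallelizing (p ≈ 0.97), 256 PEs give at most roughly 1 / (0.03 + 0.97/256) ≈ 30x over one PE, which matches the observed 67.04 s / 2.39 s ≈ 28x far better than the naive 256x.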
April 2008 11
Performance: HOMME on BlueGene/L (2007)
G. Bhanot, J.M. Dennis, J. Edwards, W. Grabowski, M. Gupta, K. Jordan, R.D. Loft, J. Sexton, A. St-Cyr, S.J. Thomas, H.M. Tufo, T. Voran, R. Walkup, and A.A. Wyszogrodski, "Early Experiences with the 360TF IBM BlueGene/L Platform," International Journal of Computational Methods, 2006.
Chart: sustained MFLOP/sec/processor (0-300) vs. processor count (4096, 8192, 16384, 32768) for coprocessor mode with snake 2x2 mapping, virtual-node mode with snake 2x2 mapping, and virtual-node mode with grouped mapping (eager limit = 500 in all cases).
April 2008 12
Performance: HOMME on BlueGene/L (2007)
Figure: mapping of HOMME onto the BlueGene/L 3D torus; processor coordinates 0-7 along the X, Y, and Z axes of an 8x8x8 partition.
April 2008 13
Outline
Motivation: My other computer is a …
Parallel Computing: Processors, Networks, Storage, Software
Grid Computing: Software, Platforms
April 2008 14
Processors
From Gregory Pfister’s In Search of Clusters:
A pack of pooches.
A savage multi-headed pooch (SMP).
A pooch.
April 2008 15
Parallel Architectures
Diagram: a processor (one CPU with cache and RAM); a symmetric multiprocessor (SMP) (several CPUs, each with its own cache, sharing RAM); a pack of processors (independent CPU/cache/RAM machines).
April 2008 16
Parallel Architectures
A cache-coherent non-uniform [distributed shared] memory (ccNUMA) cluster of chip-multiprocessor (CMP) symmetric multiprocessors (SMP).
Diagram: two nodes, each with two CPUs (private L1 caches), a shared Ln cache, and local RAM, linked into a single ccNUMA system.
April 2008 17
Scalable Parallel Architectures
Emerging massively parallel architectures: IBM BlueGene/L, 65,536 chips with 2 processors each (131,072 processors in virtual-node mode)
Multi-core commodity architectures: AMD Opteron, now Intel
Two CPUs per chip
Two chips per card
32 chips per node card
32 node cards per rack
64 racks in system
Source: IBM
April 2008 18
Networks
Network types: message passing (MPI), file system, job control, system monitoring
Technologies and competitors: 1Gbps Ethernet and RDMA, 10Gbps Ethernet, fixed topology (3D torus, tree, Scali, etc.), switched (Infiniband, Myrinet)
April 2008 19
Gigabit Ethernet Performance (2006)
RDMA has highest throughput (switched configuration): 110 MB/s RDMA, 66 MB/s legacy, 45 MB/s motherboard
Chart: throughput (MB/s, 1-120) vs. payload size (1 byte to 4 MB). Measured peaks: Ammasso RDMA crossover 110.21 MB/s, Ammasso RDMA switch 110.07 MB/s, Ammasso legacy crossover 66.79 MB/s, Ammasso legacy switch 66.40 MB/s, Intel crossover 50.39 MB/s, Intel switch 45.07 MB/s.
M. Oberg, H. M. Tufo, T. Voran, and M. Woitaszek, “Evaluation of RDMA Over Ethernet Technology for Building Cost Effective Linux Clusters”, May 2006.
April 2008 20
RDMA for High-Performance Applications
Single network interface for all communications
RDMA for MPI, DAPL (Direct Access Programming Library), and Sockets Direct Protocol (SDP)
RDMA bypasses the operating system kernel
Legacy interface for standard operating system TCP/IP
Diagram: two hosts, each with a user-space application, OS kernel, and RDMA NIC; RDMA traffic moves NIC to NIC, bypassing the kernel.
Zero-copy, interrupt-free RDMA for MPI applications
April 2008 21
Interconnect Performance (2006)
Interconnect   Minimum Latency   Peak Bandwidth
GigE           30 - 60 us        125 MB/s
10GigE         30 - 60 us        1250 MB/s
Myrinet        4 us              250 MB/s
SCI            4 us              250 MB/s
Atoll          4 us              250 MB/s
Infiniband     5 us              1250 MB/s
Atoll: benchmarking results; others: manufacturers' ratings.
For reference: 1.5 Mbps = 192 KB/s and 1 Gbps = 125 MB/s.
April 2008 22
10Gbps Ethernet Performance (2007)
Ethernet approaches 10Gbps (and can be trunked!)
Infiniband (4x) reported at 8Gbps sustainable
April 2008 23
Storage
Diagram: archival storage (tape silo systems, 3+ PB); archive management and disk cache controller; shared storage cluster with shared file system (100 – 500 TB); grid gateway (GridFTP servers); supercomputers and local working storage (1 – 100 TB per system); visualization systems (e.g., Origin 3200) and local working storage.
April 2008 24
Thousands of Disks
April 2008 25
The Single Server Limitation
Chart: Occam, average aggregate NFS bandwidth (MB/s, 0-35) vs. number of concurrent clients (0-30), with separate read and write curves; annotated "just you" at one client and "SHARED with others" as clients increase.
Aggregate bandwidth decreases with increasing concurrent use!
April 2008 26
Cluster File Systems – Read Rate (2005)
Chart: average aggregate read rate (MB/s, 0-120) vs. number of concurrent clients (0-8) for NFS, PVFS2, PVFS2-1Gbps, Lustre, Lustre-1Gbps, TerraFS, and GPFS.
April 2008 27
Cluster File Systems – Write Rate (2005)
Chart: average aggregate write rate (MB/s, 0-120) vs. number of concurrent clients (0-8) for NFS, PVFS2, PVFS2-1Gbps, Lustre, Lustre-1Gbps, TerraFS, and GPFS.
April 2008 28
Table of Administrator Pain and Agony
Node role                    GPFS 2.3               Lustre 1.4.0           PVFS2               TerraFS
Intel x86-64 metadata server Not Used               Restricted SLES .141   No Change           Not Used
Intel Xeon storage server    Restricted SLES .111   Restricted SLES .141   No Change           No Change
PPC970 client                Restricted SLES .111   Restricted SLES .141   All (module only)   N/A
Intel Xeon client            Restricted SLES .111   Restricted SLES .141   All (module only)   Custom 2.4.26 patch
The original goal was to fit the file system into the environment, but the file system influences the operating system stack
GPFS required a commercial OS and a specific kernel version; Lustre required a commercial OS and a specific kernel patch; TerraFS required a custom kernel
April 2008 29
Bullet Points of Administrator Pain and Agony
Remain responsive even in failure conditions
Filesystem failure should not interrupt standard UNIX commands used by administrators: "ls -la /mnt" or "df" should not hang the console, and zombies should respond to "kill -s 9"
Support clean normal and abnormal termination: support both service start and shutdown commands
Provide an Emergency Stop feature: cut losses and let the administrators fix things
Never hang the Linux "reboot" command
April 2008 30
Block-Based Access is Complicated!
(Adapted from May, 2001, p. 79)
Diagram (logical file view and physical block placement on two servers): one file shared for writes on four nodes with a cyclic mapping, contrasting the logical access size with the filesystem block size; blocks alternate between Server 1 and Server 2.
Consider the overhead of correlating blocks to servers. (Example: where is the first byte of the red data stored?)
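As a rough worked example (assuming simple round-robin striping, which may not match the exact layout in the figure): with block size B and S servers, the byte at offset x lives on server (x div B) mod S. For B = 4 KB and S = 2, byte 5000 falls in block 5000 div 4096 = 1, so it sits on server 1 mod 2 = 1, i.e., the second server. Every client access has to repeat this kind of translation.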
Blue Gene/L Single-Partition Performance (2008)
April 2008 31
Chart: average throughput (MB/s, 0-140) vs. clients on a single I/O node (0-30) for GPFS, Lustre, and PVFS on Xyratex storage with 2, 4, and 6 hosts and on DDN with 4 hosts, plus GPFS on FAStT-900, Terascala (10), and GPFS on FAStT-500.
Blue Gene/L Storage Performance (2008)
April 2008 32
Chart: average throughput (MB/s, 0-1400) vs. number of clients (0-1000) for Xyratex (2, 4, and 6 hosts), DDN (4 hosts), FAStT-900, and FAStT-500.
April 2008 33
Software
Parallel Execution: MPI
Job Control: batch queues (PBS, Torque/Maui)
Libraries: optimized math routines (BLAS, LAPACK)
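A minimal sketch of how these pieces fit together (the program name, libraries, and processor counts are illustrative assumptions, not a CSC-specific recipe):
mpicc -O3 -o mymodel mymodel.c -llapack -lblas        # build an MPI program against BLAS/LAPACK
mpirun -np 8 -machinefile $PBS_NODEFILE ./mymodel     # launched from inside a batch job, not by hand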
The next slides show what we tell the users…
April 2008 34
The Batch Queue System
Batch queues control access to compute nodes: please don't ssh to a node and run programs, and please don't mpirun on the head node itself; people expect to have the whole node for performance runs!
Resource management: flags and disables offline nodes (down or administrative), matches job requests to nodes, and reserves nodes to prevent oversubscribing
Scheduling: queue prioritization spreads CPUs among users, and queue limits prevent a single user from hogging the cluster
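A minimal PBS job script in this spirit might look like the following sketch (queue, node counts, and file names are assumptions for illustration):
#!/bin/bash
#PBS -q friendlyq
#PBS -N sim_test
#PBS -l nodes=2:ppn=2,walltime=02:00:00
cd $PBS_O_WORKDIR                                  # start in the directory the job was submitted from
mpirun -np 4 -machinefile $PBS_NODEFILE ./mymodel  # run on the nodes PBS assigned, never on the head node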
April 2008 35
PBS Queues on CSC Systems
speedq: debugging for HPSC students; limited to 8 nodes, 10 minutes
friendlyq: default queue for "friendly" jobs; limited to 16 nodes, 24 hours
workq: queue for large and long-running jobs; no resource limit, only 1 running job
reservedq: queue for users with special projects approved by the people in charge
April 2008 36
PBS Commands – Queue Status
[matthew@hemisphere]$ qstat -a
hemisphere.cs.colorado.edu:
                                                                  Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
320102.hemisphe jkihm    friendly WL_SIMP_17   7032   1   1 1024mb 24:00 R 09:52
320103.hemisphe jkihm    friendly WL_SIMP_18   7078   1   1 1024mb 24:00 R 09:52
320355.hemisphe jkihm    friendly WL_SIMP_17   4537   1   1 1024mb 24:00 R 08:18
320388.hemisphe jkihm    friendly WL_SIMP_25     --   1   1 1024mb 24:00 Q    --
320389.hemisphe jkihm    friendly WL_SIMP_25     --   1   1 1024mb 24:00 Q    --
320390.hemisphe jkihm    friendly WL_SIMP_30     --   1   1 1024mb 24:00 Q    --
321397.hemisphe barcelos workq    missile     21769  16  32     -- 01:12 R 00:04
What jobs are running? What jobs are waiting?
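Submitting and managing a job from the command line might look like this sketch (the script name is an assumption):
$ qsub -q friendlyq myjob.pbs     # prints the new job ID
$ qstat -u matthew                # show only my jobs
$ qdel <jobid>                    # remove a queued or running job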
April 2008 37
Playing Nicely in the Cluster Sandbox
Security considerations: don't share your account or your files (o+rw), and don't put the current directory (.) in your path
Compute time considerations: don't submit more than 10-60 jobs to PBS at a time, and don't submit from a shell script without a `sleep 1' statement (see the sketch after this list)
Storage space considerations: keep large input and output sets in /quicksand, not /home; don't keep large files around forever (compress or delete); please store your personal media collections elsewhere
Don’t use a password you have ever used anywhere else!
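For example, a polite bulk submission might look like this sketch (job script names are assumptions):
# Submit several jobs with a short pause so the scheduler keeps up
for i in 01 02 03 04 05; do
    qsub job_$i.pbs
    sleep 1
done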
April 2008 38
Outline
Motivation: My other computer is a …
Parallel Computing: Processors, Networks, Storage, Software
Grid Computing: Software, Platforms
April 2008 39
Sharing Computing and Data with Grids
Grids link computers together… more than a network! Networks connect computers; grids allow distant computers to work on a single problem
Services look like web servers: HTTP for data transfer, XML-based Simple Object Access Protocol (SOAP) instead of HTML
Grid services: Metadata and Discovery Services (WS MDS), job execution (WS GRAM), data transfer (GridFTP), and workflow management (that's what we do!)
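From the command line, the corresponding Globus Toolkit clients look roughly like this sketch (host names and paths are assumptions):
# Run a command through a remote WS GRAM managed job factory
globusrun-ws -submit -F https://gridnode.example.org:8443/wsrf/services/ManagedJobFactoryService -c /bin/hostname
# Move a file with GridFTP
globus-url-copy gsiftp://gridnode.example.org/data/input.nc file:///tmp/input.nc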
April 2008 40
Grid-BGC Carbon Cycle Model
Diagram: Grid-BGC architecture. Components include a web portal GUI and web service client, the Grid-BGC web service (project object manager, job execution interface, project database, job database), a workflow management service, a resource broker (Surfer), executables, WS-GRAM job execution, GridFTP and Globus RFT data transfer, NCAR mass storage, and scratch space, with a Grid Security Infrastructure boundary marked.
J. Cope, C. Hartsough, S. McCreary, P. Thornton, H. M. Tufo, N. Wilhelmi, and M. Woitaszek, “Experiences from Simulating the Global Carbon Cycle in a Grid Computing Environment”, 2005.
April 2008 41
TeraGrid – Extensible Terascale Facility
April 2008 42
A National Research Priority
Year  Award  Program                           Sites
2000  $36    Terascale Computing System        PSC
2001  $45    Distributed Terascale Facility    NCSA, SDSC, ANL, Caltech
2002  $35    Extensible Terascale Facility     PSC
2003  $150   TeraGrid Extension ($10M) + Ops   IU, Purdue, ORNL, TACC
2007  $65    Track 2 Mid-Range HPC             ORNL, TACC, NCAR
2007  $208   Track 1 "Blue Waters" Petascale   UIUC / NCSA
http://www.nsf.gov/news/news_summ.jsp?cntn_id=109850
http://www.nsf.gov/news/news_summ.jsp?cntn_id=106875
All figures are in millions.
A Few TeraGrid Resources
April 2008 43
RP         Name      Vendor     CPUs    CPU Type  TFLOP/s
TACC       Ranger    Sun        62,976  Opteron   504
ORNL NICS  Kraken    Cray XT4   7,488   Opteron   170
NCSA       Abe       Dell       9,600   Intel 64  89
TACC       Lonestar  Dell       5,840   Intel 64  62
PSC        BigBen    Cray XT4   4,180   Opteron   21
NCSA       Cobalt    SGI Altix  1,024   Itanium   6
NCAR       Frost     IBM BG/L   2,048   PPC 440   5
April 2008 44
Challenges and Definitions
Power consumption: BlueVista 276 kilowatts; average U.S. home 10.5 kilowatts
Physical space
What's the difference between a cluster and a supercomputer? Price, the number of SMP processors in a compute node, and the network used to connect nodes in the cluster
April 2008 46
Cluster Administration
Parallel and distributed shells: pdsh, dsh
$ sudo pdsh -w node0[01-27] /etc/init.d/sshd restart
Configuration file management: IBM CSM, xCAT
Automated operating system installation
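For example, a quick configuration-drift check across the cluster might look like this sketch (the node names and the file checked are assumptions):
# dshbak -c collapses identical output so differing nodes stand out
pdsh -w node0[01-27] md5sum /etc/ntp.conf | dshbak -c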
April 2008 47
Cluster Security
The most important question: how do you know if you've been compromised?
Centralized, inaccessible logging (send logs to a host the compute nodes cannot alter)
Intrusion detection: custom scripts; network monitoring is difficult at 10Gbps
Desperate measures: extreme firewalling (but don't depend on it!), virtual hosting for services, one-time passwords (RSA SecurID, CryptoCard)
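A default-deny firewall along the lines of the "extreme firewalling" bullet might be sketched as follows (the management subnet is an assumption):
iptables -P INPUT DROP                                           # drop everything by default
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT # keep existing connections working
iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 22 -j ACCEPT     # allow SSH only from the management subnet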
Questions?
Matthew Woitaszek
Thanks to my CU and NCAR colleagues:
Jason Cope, John Dennis, Bobby House,
Rory Kelly, Dustin Leverman, Paul Marshal,
Michael Oberg, Henry Tufo, and Theron Voran
Schenk’s System Administration April 2008