Building Scalable and Cost-Effective Clusters with Linux
TMSI
Contents
• Overview
• Hardware
  – Choice of CPU
  – Interconnection
  – Storage
• Software
  – CentOS
  – xCAT/Rocks
  – GlusterFS
  – Sun Grid Engine
  – MPICH + compilers
  – Zabbix
  – Backup/ZFS
Contents
• Performance data
  – IO tests
  – Hydrodynamics models
• Common issues
  – Hardware failure
  – GlusterFS
  – SGE/MPD
  – Samba share
• Upcoming
  – Puppet, Cobbler, … or Rocks?
  – SSO
  – More ZFS
About me
• Please call me «Chinh» ;) (millions of NGUYENs!)
• Grew up in Hue, Vietnam; studied at Lomonosov Moscow State University
• Senior systems engineer at the Tropical Marine Science Institute, National University of Singapore
• Doing different things from time to time: programming, systems integration, sysadmin, DBA, …
• Started with FreeBSD in 2003
• Converted to Linux in 2006 because of … a laptop
About TMSI
• Research institute under NUS
• Acoustic Research Lab
  – Sensing and signal processing, underwater communications and networking, bioacoustics, …
• Marine Biology & Ecology Lab
• Marine Mammal Research Lab
• Physical Oceanography Research Lab
  – Hydrodynamics, environmental processes, GIS, water resources, climate change
Why build our own cluster?
• We can't run WRF/Acusim properly on our IBM LS21/LS22 blades and our 6-year-old Sun cluster
• NUS Computer Center does have some big clusters, but the queues are very long
• Ex.: we need to simulate 100 years of climate for the sub-SEA region within a few months
• Remote cloud services do not work for us (EC2, POD, …)
• Budget limitation: ~200k only
• IHPC charges 70k/year for a 64-core Power5 setup (4 nodes)
Overview
• July 2010: 112 PCs, Intel Core i7-940 & i7-965 (8 MB L3)
• 51 permanent Linux nodes (the rest still run Windows HPC Server 2008)
• 15 sqm room at the Civil Engineering Dept. (hot!)
• Typical node configuration:
  – 4 cores at 2.93 GHz (HT disabled or not used)
  – 8-12 GB DDR3-1066
  – 2 x 500 GB or 3 x 500 GB SATA-II
  – Myricom 10G-PCIE + 1 Gb Ethernet (may not be connected)
• 1 head node (SGE master) + NAT, 50 compute nodes
• 22 GlusterFS brick nodes (11 TB storage)
• Deployed Apr-May 2009
Hardware — Core i7
• Pros
  – Nehalem-based, QPI 4.8/6.4 GT/s
  – Cost: ~600$, 30-40% cheaper than Xeon E55xx
  – Consumer mainboard, no ECC memory, PC chassis
  – Can buy off the shelf
  – Opteron Barcelona/Shanghai was too expensive
• Cons
  – HT does not work well
  – Very hot (130 W)
  – AMD Opteron 4000 came out in June 2010, <100$/CPU!
Interconnection — Myricom 10G
• Cons
  – It's old, and not likely to advance further
  – InfiniBand DDR/QDR dominates
  – 10G Ethernet with low latency is becoming popular
• BUT compared with 1GbE
  – 10x bandwidth
  – At least 5x lower latency (~10 µs with MPI)
• Compared with 10GbE/IB
  – < 1k$/node for a 128-node setup
  – 10GbE/IB: 1.5k$+/node
  – Latency comparable with 10GbE
  – Second-hand cards are even available on eBay!
Storage
• A centralized NFS server was good, until we ran WRF and similar models
• Storage bandwidth at 3-6 Gbps (IBM DS3xxx, …)
• Caused very high IO load on the head node (high «wa» in top)
• 1GbE network overloaded
Decentralized Storage
• Cluster file systems considered: Lustre, GlusterFS, GFS2*
• [Table: compared on maintenance, reliability, performance and scalability — ratings were shown graphically on the original slide]
* GFS2 is not in the same class as Lustre/GlusterFS
Storage — GlusterFS
• Pros
  – Easy to use, low maintenance cost
  – No single point of failure (no metadata server)
  – Scales well (1000+ nodes)
  – Flexible (e.g. distribute on top of stripe)
  – Stable enough (used for more than 1 year)
  – Self-heal works
  – Supports InfiniBand RDMA (but we don't have IB anyway)
• Cons
  – Completely in user space (using FUSE)
  – Inconsistency problems do exist
  – IO tuning may not work persistently
Software — CentOS
• Pros
  – Key point: a perfect clone of RHEL
  – De facto supported by all scientific software running under Linux
  – A lot of prebuilt packages and 3rd-party repositories (RPMForge, EPEL, Remi, …)
  – We have used CentOS for many years
• Cons
  – Conservative in updating packages in the repo (RHEL policy)
  – Maintained by a small non-dedicated team (huge thanks to them and their great commitment throughout the years!)
xCAT/Rocks
• xCAT
  – Recommended by the IBM guys (for our LS2x clusters)
  – No IPMI on our PCs
  – Does network install for nodes with an Ethernet connection
  – We need to install Myrinet-only nodes manually (bad!)
  – Commonly used commands: psh, pscp, node*
• Rocks
  – We're looking to use Rocks in the next cluster
  – Rocks was developed intensively over the last 3 years (NSF grant?)
  – Supports SGE and MPICH natively (Rocks 5.x)
GlusterFS
• Installation
  – Install glusterfsd on each «brick» node
  – Create a software RAID-1 (md) from 2 x 500 GB
  – Mount the md drive and export it using glusterfsd.vol
  – Create glusterfs.vol (type cluster/distribute) and distribute it to all nodes
  – Start glusterfsd and glusterfs everywhere

c57:/home            459G   71G  365G  17%  /home
c57:/cluster         450G   14G  413G   4%  /cluster
c57:/opt/sge         450G   14G  413G   4%  /opt/sge
glusterfs#127.0.0.1  9.5T  8.0T  1.1T  89%  /data
GlusterFS — glusterfsd.vol
• Several tuning parameters: read-ahead, write-behind, io-cache, io-threads, …
• Basic glusterfsd.vol:

volume md0
  type storage/posix
  option directory /glusterfs
end-volume

volume brick
  type protocol/server
  option transport-type tcp
  option auth.addr.md0.allow 192.168.10.*
  subvolumes md0
end-volume
GlusterFS — glusterfs.vol
• Flexible: e.g. distribute over stripes
• Local system gets priority for writes

...
volume c80
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.10.80
  option remote-subvolume md0
end-volume

volume data
  type cluster/distribute
  option min-free-disk 10%
  subvolumes c58 c59 c60 c61 c62 c63 c64 c65 c66 c67 c68 c69 c70 c71 c72 c73 c74 c75 c76 c77 c78 c79 c80
end-volume
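The «distribute on top of stripe» layout mentioned above could be sketched by grouping client bricks into stripe subvolumes first and then listing those in the distribute volume. This is an illustrative fragment, not our production config — the brick pairing and the block-size value are assumptions:

```
volume stripe0
  type cluster/stripe
  option block-size 1MB      # assumed value; tune per workload
  subvolumes c58 c59
end-volume

volume data
  type cluster/distribute
  option min-free-disk 10%
  subvolumes stripe0         # further stripe pairs would follow here
end-volume
```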
GlusterFS — Optimization
• Home-made hydrodynamics models: each MPI process reads/writes its own domain data (significant improvement)
• Try performance/read-ahead, performance/write-behind and performance/io-cache
• type cluster/stripe with a big block-size may work better than md (cross-striping between bricks)

volume distrcache
  type performance/io-cache
  option cache-size 512MB
  subvolumes data
end-volume
Sun Grid Engine
• Open source, supported by Sun/Oracle
• Easy to use, transparent to user programs (qrsh)
• Flexible: add/remove nodes on the fly, supports heterogeneous environments and multiple queues
• Supports multiple scheduling algorithms (e.g. round robin, fill up, unique, …)
• Supports basic policies (the commercial version, N1 Grid Engine, has more complex policy settings)
  – Resource quotas (memory usage, number of processors, …)
  – Deadline jobs
• Nice GUI
Sun Grid Engine — Resource quota
• Don't trust users :) Quotas must be in place to avoid problems
• A single process with excess memory usage will cause swapping and crash the compute node

{
  name         slotlim
  description  NONE
  enabled      TRUE
  limit        users * to slots=100
}
{
  name         vmemfree
  description  NONE
  enabled      TRUE
  limit        users * to virtual_free=4g
}
Sun Grid Engine — Different allocation rules
• Instruct users to select the best allocation rule depending on the nature of their programs and the cluster type
• E.g. the fill-up rule instead of the default round robin:

pe_name            mpi_fillup
slots              200
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/sge/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
MPICH + Compilers
• MPICH2
  – Most widely used
  – Well supported and actively developed (ANL)
  – Works with all models that we use
  – LAM/MPI (now Open MPI) was also not bad, but …
• Compilers
  – Intel C/Fortran is the best choice on the Intel platform
  – The Linux version was free for non-commercial use
  – Now only a 30-day trial is available for Intel Compiler Suite 11.1
  – Never use Intel compilers on AMD processors! *
  – GNU F90 does not always work (performance is another issue)

* http://www.agner.org/optimize/blog/read.php?i=49
MPICH2 — Compilers
• Must stick with Intel compilers for now because:
  – Several models were developed with Intel-specific functions (e.g. FVCOM)
  – Researchers don't want to change
  – The other good compiler that we know of, PathScale, is also not open source and free
  – They provide the best performance on Intel hardware (use PathScale for AMD)
  – Various numerical libraries are included/optimized (BLAS, LAPACK, ScaLAPACK, …)
Zabbix
• Nagios is probably the most popular
• A lot of other tools: Cacti, OpenNMS, Cricket, … and even MRTG
• But Zabbix
  – Has all the features (monitoring, graphing, trigger-based alerting, web monitoring, automatic actions, …)
  – Highly configurable: the user defines what to monitor, what threshold to alert on, and what alert level to assign
  – Supports distributed monitoring using proxies
  – Zabbix agent available for Linux, BSD, Windows, AIX, …
  – Access control per user and group
  – Easy to use, with a great GUI
Zabbix — Proxy & HA
• Proxy
  – Allows monitoring of nodes behind a strict firewall/head node
  – Effectively reduces load on the Zabbix server
• HA
  – No built-in support for HA
  – Recommended method: DRBD + Heartbeat
Backup/ZFS

• Before ZFS …
  – Used to buy NAS devices for storing backups
  – Costly: at least 15k$ for 16 TB of storage
• ZFS is great
  – Reliable: checksums everything, copy-on-write
  – Storage pools: easy to manage, no more fdisk, gpart, resize
  – No FSCK (!)
  – RAID-Z, RAID-Z2, RAID-Z3
  – Deduplication
  – Transparent compression
• Other than Solaris/OpenSolaris
  – For Linux: ZFS is only available via FUSE
  – The FreeBSD guys are doing a good job porting ZFS
ZFS — POC

• Installation
  – Supermicro board, 1 x quad-core Xeon 5620, 8 GB RAM, LSI SATA controllers, 12 x 1 TB SATA-II
  – Cost: 5.3k$
  – OpenSolaris build 134
  – Next to try: run OpenSolaris on 2 x USB sticks in rpool
• Rsync scripts pull from the clusters
• Savings

root@dory:/root/backup# zpool list
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
raid6  9.06T  5.82T  3.25T  64%  1.31x  ONLINE  -
rpool   928G  7.47G   921G   0%  1.06x  ONLINE  -
root@dory:/root/backup# zfs get compressratio raid6
NAME   PROPERTY       VALUE  SOURCE
raid6  compressratio  1.24x  -
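A hedged back-of-envelope reading of the «Savings» numbers (my arithmetic, not from the slide): with the 1.31x dedup and 1.24x compression ratios reported above, the 5.82 TB actually allocated holds roughly 5.82 × 1.31 × 1.24 TB of logical data:

```shell
# rough estimate of logical data held on the deduped, compressed pool
awk 'BEGIN { printf "%.2f TB of logical data on 5.82 TB of disk\n", 5.82 * 1.31 * 1.24 }'
```

That is, the pool stores on the order of 9.5 TB of backups in under 6 TB of disk.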
Performance — IO tests *

• Follows the same test as in the URL below
• Results are not the best because the storage is almost full
• Sequential write, 1 KB blocks
  – 71.1424 seconds, 14.4 MB/s
• Sequential write, 64 KB blocks
  – 3.06428 seconds, 334 MB/s
• Sequential read, 1 KB blocks
  – 1.58016 seconds, 648 MB/s
• Sequential read, 64 KB blocks
  – 1.21792 seconds, 841 MB/s (?)

* http://www.gluster.com/community/documentation/index.php/GlusterFS_2.0_I/O_Benchmark_Results
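As a sanity check on these figures: all four time/rate pairs are consistent with a 1 GB (1024 MB) test file — an assumption on my part, but one that every row matches — so the MB/s values follow directly from the elapsed times:

```shell
# reproduce the four MB/s figures from the elapsed times,
# assuming a 1024 MB test file
for t in 71.1424 3.06428 1.58016 1.21792; do
  awk -v t="$t" 'BEGIN { printf "%.1f MB/s\n", 1024 / t }'
done
```

This prints 14.4, 334.2, 648.0 and 840.8 MB/s — matching the slide within rounding.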
Performance — IO tests *

• Uses the scripts from the URL below
• 1 MB/file; buffer cache cleared before reading (echo 3 > /proc/sys/vm/drop_caches)
• How many files can be created in 10 minutes
  – 10 threads: ~302 MB/s
  – 20 threads: ~580 MB/s
• How many files can be read in 10 minutes
  – 10 threads: ~422 MB/s
  – 20 threads: ~786 MB/s
• Didn't do much stress testing to see where GlusterFS fails
* http://www.gluster.com/community/documentation/index.php/GlusterFS_2.0_I/O_Benchmark_Results
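Since each file is 1 MB, the MB/s rates above read directly as files per second; converting the 10-thread create rate to the total over the 10-minute window (my arithmetic, not from the slide):

```shell
# at 1 MB/file, 302 MB/s = 302 files/s; over a 600-second window:
awk 'BEGIN { printf "%d files in 10 minutes\n", 302 * 600 }'
```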
Performance — Hydrodynamics model

• Paper: «Modeling of extreme wave breaking», My Ha Dao, Hai Hua Xu, Pavel Tkalich, and Eng Soon Chan. EGU 2010, Vienna, May 2010
• Test case: 800k steps, 3,673,200 particles
• Each process reads/writes its own domain dataset
[Chart: compute time in hours vs. number of cores — 25 cores: 174.2 h, 50 cores: 86.4 h, 75 cores: 62.8 h, 90 cores: 55.7 h]
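A quick scaling check on the chart data (my arithmetic, not a claim from the talk): going from 25 to 90 cores cuts the runtime from 174.2 h to 55.7 h, close to the ideal 3.6x:

```shell
# speedup from 25 to 90 cores, and parallel efficiency vs. the ideal 90/25
awk 'BEGIN { s = 174.2/55.7; printf "speedup %.2fx, efficiency %.0f%%\n", s, 100*s/(90/25.0) }'
```

That is roughly 87% parallel efficiency over a 3.6x increase in cores.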
Performance — Hydrodynamics model

• Compared with the current IBM LS22 cluster (compute time in hours) for 2 random cases
• [Chart: case run on 24 cores — IBM LS22: 47.5 h, PC cluster: 28 h; case run on 50 cores — IBM LS22: 239.5 h, PC cluster: 86.4 h]
Common issues
• Hardware failure
  – Happens almost every month
  – Power supply units and mainboards
  – One time the aircon system failed
• Add/remove brick nodes
  – Adding a node: update glusterfs.vol and restart all glusterfs mounts
  – Removing a node (keeping data): update glusterfs.vol, restart all glusterfs mounts, copy local data from the node to glusterfs, disconnect the node
Common issues
• Non-root MPD ring
  – Initially we ran a common MPD under root
  – After a crash/reboot, a node needs to rejoin the MPD ring
  – MPD daemons also failed sometimes
  – Solution: put quick instructions in the motd asking each user to run/manage his own MPD ring
• SGE accounting
  – N: number of jobs of a user; C_i: number of cores used by job i; T_i: its running time
  – Cluster usage for each user (and the overall cluster utilization) is the sum over that user's jobs:

    usage = Σ (i = 1 .. N) C_i × T_i
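The per-user sum Σ C_i × T_i can be computed with a one-line awk over parsed accounting records. A sketch with hypothetical (user, cores, hours) triples standing in for fields extracted from SGE's accounting file — the sample names and numbers are made up:

```shell
# sum cores*hours per user; input format: "user cores hours"
printf 'alice 16 10.5\nalice 32 4\nbob 8 100\n' |
awk '{ use[$1] += $2 * $3 } END { for (u in use) printf "%s %.1f core-hours\n", u, use[u] }' |
sort
```

Summing each user's total against the cluster's total core-hours then gives the utilization share.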
Common issues
• Data access for users via Samba
• Time sync
  – Important to have time synced (some models rely on this)
  – Set up an NTP server on the head node
  – Configure the ntp service on all compute nodes to sync with the head node
• Updates for compute nodes (if there is no NAT)
  – Do the update on the head node
  – prsync /var/cache/yum to all nodes
  – yum update on the nodes
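The compute-node side of the time-sync setup above might look like this minimal /etc/ntp.conf sketch — the head-node address 192.168.10.1 is an assumption, chosen to be consistent with the 192.168.10.* cluster network used in the GlusterFS configs:

```
# /etc/ntp.conf on a compute node (sketch)
server 192.168.10.1 iburst    # head node as the only time source
driftfile /var/lib/ntp/drift
restrict default ignore       # accept no outside NTP traffic
restrict 127.0.0.1
restrict 192.168.10.1         # allow replies from the head node
```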
Upcoming
• Systems management v2.0?
  – Puppet/Cobbler/etc. are very hot now
  – For a single-purpose cluster, we can probably still stick with the old way
  – Rocks 5 now has most of the built-in features that we need
  – Going to test it and replace xCAT
• Upgrade the disks of the backup machine
• Convert more nodes from Windows to Linux
• Single Sign-On
  – We're going to implement LDAP (AD) authentication for all Linux machines
  – NUS Computer Center is using Centrify