Building Scalable and Cost-Effective Clusters with Linux
TMSI
Contents
• Overview
• Hardware
  – Choice of CPU
  – Interconnection
  – Storage
• Software
  – CentOS
  – xCAT/Rocks
  – GlusterFS
  – Sun Grid Engine
  – MPICH + compilers
  – Zabbix
  – Backup/ZFS
Contents
• Performance data
  – IO tests
  – Hydrodynamics models
• Common issues
  – Hardware failure
  – GlusterFS
  – SGE/MPD
  – Samba share
• Upcoming
  – Puppet, Cobbler, … or Rocks?
  – SSO
  – More ZFS
About me
• Please call me «Chinh» ;) (millions of NGUYENs!)
• Grew up in Hue, Vietnam; studied at Lomonosov Moscow State University
• Senior systems engineer at the Tropical Marine Science Institute, National University of Singapore
• Doing different things from time to time: programming, systems integration, sysadmin, DBA, …
• Started with FreeBSD in 2003
• Converted to Linux in 2006 because of … a laptop
About TMSI
• Research institute under NUS
• Acoustic Research Lab
  – Sensing and signal processing, underwater communications and networking, bioacoustics, …
• Marine Biology & Ecology Lab
• Marine Mammal Research Lab
• Physical Oceanography Research Lab
  – Hydrodynamics, environmental processes, GIS, water resources, climate change
Why build our own cluster?
• We can't run WRF/Acusim properly on our IBM LS21/LS22 blades and our 6-year-old Sun cluster
• NUS Computer Center does have some big clusters, but the queues are very long
• Ex.: we need to simulate 100 years of climate for the sub-SEA region within a few months
• Remote cloud services do not work for us (EC2, POD, …)
• Budget limitation: ~200k only
• IHPC charges 70k/year for a 64-core Power5 setup (4 nodes)
Overview
• July 2010: 112 PCs, Intel Core i7-940 & i7-965 (8 MB L3)
• 51 permanent Linux nodes (the rest still run Windows HPC Server 2008)
• 15 sqm room at the Civil Engineering Dept. (hot!)
• Typical node configuration:
  – 4 cores at 2.93 GHz (HT disabled or not used)
  – 8-12 GB DDR3-1066
  – 2 x 500 GB or 3 x 500 GB SATA-II
  – Myricom 10G-PCIE + 1 Gb Ethernet (may not be connected)
• 1 head node (SGE master) + NAT, 50 compute nodes
• 22 GlusterFS brick nodes (11 TB storage)
• Deployed Apr-May 2009
Hardware — Core i7
• Pros
  – Nehalem-based, QPI 4.8/6.4 GT/s
  – Cost: ~600$, 30-40% cheaper than Xeon E55xx
  – Consumer mainboard, no ECC memory, PC chassis
  – Can buy off the shelf
  – Opteron Barcelona/Shanghai was too expensive
• Cons
  – HT does not work well
  – Very hot (130 W)
  – AMD Opteron 4000 came out in June 2010, <100$/CPU!
Interconnection — Myricom 10G
• Cons
  – It's old, and not likely to advance further
  – InfiniBand DDR/QDR dominates
  – 10G Ethernet with low latency is becoming popular
• BUT compared with 1GbE
  – 10x bandwidth
  – At least 5x lower latency (~10 µs with MPI)
• Compared with 10GbE/IB
  – < 1k$/node for a 128-node setup
  – 10GbE/IB: 1.5k$+/node
  – Latency comparable with 10GbE
  – Second-hand cards are even available on eBay!
Storage
• A centralized NFS server was good, until we ran WRF and similar models
• Storage bandwidth at 3-6 Gbps (IBM DS3xxx, …)
• Caused very high IO load on the head node (high «wa» in top)
• 1GbE network overloaded
Decentralized Storage
• Cluster file systems considered: Lustre, GlusterFS, GFS2*
• [Table: compared on maintenance, reliability, performance and scalability — ratings were shown graphically on the original slide]
* GFS2 is not in the same class as Lustre/GlusterFS
Storage — GlusterFS
• Pros
  – Easy to use, low maintenance cost
  – No single point of failure (no metadata server)
  – Scales well (1000+ nodes)
  – Flexible (e.g. distribute on top of stripe)
  – Stable enough (used for more than 1 year)
  – Self-heal works
  – Supports InfiniBand RDMA (but we don't have IB anyway)
• Cons
  – Completely in user space (using FUSE)
  – Inconsistency problems do exist
  – IO tuning may not work persistently
Software — CentOS
• Pros
  – Key point: a perfect clone of RHEL
  – De facto supported by all scientific software running under Linux
  – A lot of prebuilt packages and 3rd-party repositories (RPMForge, EPEL, Remi, …)
  – We have used CentOS for many years
• Cons
  – Conservative in updating packages in the repo (RHEL policy)
  – Maintained by a small non-dedicated team (huge thanks to them and their great commitment throughout the years!)
xCAT/Rocks
• xCAT
  – Recommended by the IBM guys (for our LS2x clusters)
  – No IPMI on our PCs
  – Does network install for nodes with an Ethernet connection
  – We need to install Myrinet-only nodes manually (bad!)
  – Commonly used commands: psh, pscp, node*
• Rocks
  – We're looking to use Rocks in the next cluster
  – Rocks was developed intensively over the last 3 years (NSF grant?)
  – Supports SGE and MPICH natively (Rocks 5.x)
GlusterFS
• Installation
  – Install glusterfsd on each «brick» node
  – Create a software RAID-1 (md) from 2 x 500 GB
  – Mount the md drive and export it using glusterfsd.vol
  – Create glusterfs.vol (type cluster/distribute) and distribute it to all nodes
  – Start glusterfsd and glusterfs everywhere

c57:/home            459G   71G  365G  17%  /home
c57:/cluster         450G   14G  413G   4%  /cluster
c57:/opt/sge         450G   14G  413G   4%  /opt/sge
glusterfs#127.0.0.1  9.5T  8.0T  1.1T  89%  /data
GlusterFS — glusterfsd.vol
• Several tuning parameters: read-ahead, write-behind, io-cache, io-threads, …
• Basic glusterfsd.vol:

volume md0
  type storage/posix
  option directory /glusterfs
end-volume

volume brick
  type protocol/server
  option transport-type tcp
  option auth.addr.md0.allow 192.168.10.*
  subvolumes md0
end-volume
GlusterFS — glusterfs.vol
• Flexible: e.g. distribute over stripes
• Local system gets priority for writes

...
volume c80
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.10.80
  option remote-subvolume md0
end-volume

volume data
  type cluster/distribute
  option min-free-disk 10%
  subvolumes c58 c59 c60 c61 c62 c63 c64 c65 c66 c67 c68 c69 c70 c71 c72 c73 c74 c75 c76 c77 c78 c79 c80
end-volume
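The «distribute on top of stripe» layout mentioned above could be sketched by grouping client bricks into stripe subvolumes first and then listing those in the distribute volume. This is an illustrative fragment, not our production config — the brick pairing and the block-size value are assumptions:

```
volume stripe0
  type cluster/stripe
  option block-size 1MB      # assumed value; tune per workload
  subvolumes c58 c59
end-volume

volume data
  type cluster/distribute
  option min-free-disk 10%
  subvolumes stripe0         # further stripe pairs would follow here
end-volume
```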
GlusterFS — Optimization
• Home-made hydrodynamics models: each MPI process reads/writes its own domain data (significant improvement)
• Try performance/read-ahead, performance/write-behind and performance/io-cache
• type cluster/stripe with a big block-size may work better than md (cross-striping between bricks)

volume distrcache
  type performance/io-cache
  option cache-size 512MB
  subvolumes data
end-volume
Sun Grid Engine
• Open source, supported by Sun/Oracle
• Easy to use, transparent to user programs (qrsh)
• Flexible: add/remove nodes on the fly, supports heterogeneous environments and multiple queues
• Supports multiple scheduling algorithms (e.g. round robin, fill up, unique, …)
• Supports basic policies (the commercial version, N1 Grid Engine, has more complex policy settings)
  – Resource quotas (memory usage, number of processors, …)
  – Deadline jobs
• Nice GUI
Sun Grid Engine — Resource quota
• Don't trust users :) Quotas must be in place to avoid problems
• A single process with excess memory usage will cause swapping and crash the compute node

{
  name         slotlim
  description  NONE
  enabled      TRUE
  limit        users * to slots=100
}
{
  name         vmemfree
  description  NONE
  enabled      TRUE
  limit        users * to virtual_free=4g
}
Sun Grid Engine — Different allocation rules
• Instruct users to select the best allocation rule depending on the nature of their programs and the cluster type
• E.g. the fill-up rule instead of the default round robin:

pe_name            mpi_fillup
slots              200
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/sge/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
MPICH + Compilers
• MPICH2
  – Most widely used
  – Well supported and actively developed (ANL)
  – Works with all models that we use
  – LAM/MPI (now Open MPI) was also not bad, but …
• Compilers
  – Intel C/Fortran is the best choice on the Intel platform
  – The Linux version was free for non-commercial use
  – Now only a 30-day trial is available for Intel Compiler Suite 11.1
  – Never use Intel compilers on AMD processors! *
  – GNU F90 does not always work (performance is another issue)

* http://www.agner.org/optimize/blog/read.php?i=49
MPICH2 — Compilers
• Must stick with Intel compilers for now because:
  – Several models were developed with Intel-specific functions (e.g. FVCOM)
  – Researchers don't want to change
  – The other good compiler that we know of, PathScale, is also not open source and free
  – They provide the best performance on Intel hardware (use PathScale for AMD)
  – Various numerical libraries are included/optimized (BLAS, LAPACK, ScaLAPACK, …)
Zabbix
• Nagios is probably the most popular
• A lot of other tools: Cacti, OpenNMS, Cricket, … and even MRTG
• But Zabbix
  – Has all the features (monitoring, graphing, trigger-based alerting, web monitoring, automatic actions, …)
  – Highly configurable: the user defines what to monitor, what threshold to alert on, and what alert level to assign
  – Supports distributed monitoring using proxies
  – Zabbix agent available for Linux, BSD, Windows, AIX, …
  – Access control per user and group
  – Easy to use, with a great GUI
Zabbix — Proxy & HA
• Proxy
  – Allows monitoring of nodes behind a strict firewall/head node
  – Effectively reduces load on the Zabbix server
• HA
  – No built-in support for HA
  – Recommended method: DRBD + Heartbeat
Backup/ZFS

• Before ZFS …
  – Used to buy NAS devices for storing backups
  – Costly: at least 15k$ for 16 TB of storage
• ZFS is great
  – Reliable: checksums everything, copy-on-write
  – Storage pools: easy to manage, no more fdisk, gpart, resize
  – No FSCK (!)
  – RAID-Z, RAID-Z2, RAID-Z3
  – Deduplication
  – Transparent compression
• Other than Solaris/OpenSolaris
  – For Linux: ZFS is only available via FUSE
  – The FreeBSD guys are doing a good job porting ZFS
ZFS — POC

• Installation
  – Supermicro board, 1 x quad-core Xeon 5620, 8 GB RAM, LSI SATA controllers, 12 x 1 TB SATA-II
  – Cost: 5.3k$
  – OpenSolaris build 134
  – Next to try: run OpenSolaris on 2 x USB sticks in rpool
• Rsync scripts pull from the clusters
• Savings

root@dory:/root/backup# zpool list
NAME    SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
raid6  9.06T  5.82T  3.25T  64%  1.31x  ONLINE  -
rpool   928G  7.47G   921G   0%  1.06x  ONLINE  -
root@dory:/root/backup# zfs get compressratio raid6
NAME   PROPERTY       VALUE  SOURCE
raid6  compressratio  1.24x  -
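A hedged back-of-envelope reading of the «Savings» numbers (my arithmetic, not from the slide): with the 1.31x dedup and 1.24x compression ratios reported above, the 5.82 TB actually allocated holds roughly 5.82 × 1.31 × 1.24 TB of logical data:

```shell
# rough estimate of logical data held on the deduped, compressed pool
awk 'BEGIN { printf "%.2f TB of logical data on 5.82 TB of disk\n", 5.82 * 1.31 * 1.24 }'
```

That is, the pool stores on the order of 9.5 TB of backups in under 6 TB of disk.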
Performance — IO tests *

• Follows the same test as in the URL below
• Results are not the best because the storage is almost full
• Sequential write, 1 KB blocks
  – 71.1424 seconds, 14.4 MB/s
• Sequential write, 64 KB blocks
  – 3.06428 seconds, 334 MB/s
• Sequential read, 1 KB blocks
  – 1.58016 seconds, 648 MB/s
• Sequential read, 64 KB blocks
  – 1.21792 seconds, 841 MB/s (?)

* http://www.gluster.com/community/documentation/index.php/GlusterFS_2.0_I/O_Benchmark_Results
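As a sanity check on these figures: all four time/rate pairs are consistent with a 1 GB (1024 MB) test file — an assumption on my part, but one that every row matches — so the MB/s values follow directly from the elapsed times:

```shell
# reproduce the four MB/s figures from the elapsed times,
# assuming a 1024 MB test file
for t in 71.1424 3.06428 1.58016 1.21792; do
  awk -v t="$t" 'BEGIN { printf "%.1f MB/s\n", 1024 / t }'
done
```

This prints 14.4, 334.2, 648.0 and 840.8 MB/s — matching the slide within rounding.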
Performance — IO tests *

• Uses the scripts from the URL below
• 1 MB/file; buffer cache cleared before reading (echo 3 > /proc/sys/vm/drop_caches)
• How many files can be created in 10 minutes
  – 10 threads: ~302 MB/s
  – 20 threads: ~580 MB/s
• How many files can be read in 10 minutes
  – 10 threads: ~422 MB/s
  – 20 threads: ~786 MB/s
• Didn't do much stress testing to see where GlusterFS fails
* http://www.gluster.com/community/documentation/index.php/GlusterFS_2.0_I/O_Benchmark_Results
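Since each file is 1 MB, the MB/s rates above read directly as files per second; converting the 10-thread create rate to the total over the 10-minute window (my arithmetic, not from the slide):

```shell
# at 1 MB/file, 302 MB/s = 302 files/s; over a 600-second window:
awk 'BEGIN { printf "%d files in 10 minutes\n", 302 * 600 }'
```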
Performance — Hydrodynamics model

• Paper: «Modeling of extreme wave breaking», My Ha Dao, Hai Hua Xu, Pavel Tkalich, and Eng Soon Chan. EGU 2010, Vienna, May 2010
• Test case: 800k steps, 3,673,200 particles
• Each process reads/writes its own domain dataset
[Chart: compute time in hours vs. number of cores — 25 cores: 174.2 h, 50 cores: 86.4 h, 75 cores: 62.8 h, 90 cores: 55.7 h]
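A quick scaling check on the chart data (my arithmetic, not a claim from the talk): going from 25 to 90 cores cuts the runtime from 174.2 h to 55.7 h, close to the ideal 3.6x:

```shell
# speedup from 25 to 90 cores, and parallel efficiency vs. the ideal 90/25
awk 'BEGIN { s = 174.2/55.7; printf "speedup %.2fx, efficiency %.0f%%\n", s, 100*s/(90/25.0) }'
```

That is roughly 87% parallel efficiency over a 3.6x increase in cores.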
Performance — Hydrodynamics model

• Compared with the current IBM LS22 cluster (compute time in hours) for 2 random cases
• [Chart: case run on 24 cores — IBM LS22: 47.5 h, PC cluster: 28 h; case run on 50 cores — IBM LS22: 239.5 h, PC cluster: 86.4 h]
Common issues
• Hardware failure
  – Happens almost every month
  – Power supply units and mainboards
  – One time the aircon system failed
• Add/remove brick nodes
  – Adding a node: update glusterfs.vol and restart all glusterfs mounts
  – Removing a node (keeping data): update glusterfs.vol, restart all glusterfs mounts, copy local data from the node to glusterfs, disconnect the node
Common issues
• Non-root MPD ring
  – Initially we ran a common MPD under root
  – After a crash/reboot, a node needs to rejoin the MPD ring
  – MPD daemons also failed sometimes
  – Solution: put quick instructions in the motd asking each user to run/manage his own MPD ring
• SGE accounting
  – N: number of jobs of a user; C_i: number of cores used by job i; T_i: its running time
  – Cluster usage for each user (and the overall cluster utilization) is the sum over that user's jobs:

    usage = Σ (i = 1 .. N) C_i × T_i
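The per-user sum Σ C_i × T_i can be computed with a one-line awk over parsed accounting records. A sketch with hypothetical (user, cores, hours) triples standing in for fields extracted from SGE's accounting file — the sample names and numbers are made up:

```shell
# sum cores*hours per user; input format: "user cores hours"
printf 'alice 16 10.5\nalice 32 4\nbob 8 100\n' |
awk '{ use[$1] += $2 * $3 } END { for (u in use) printf "%s %.1f core-hours\n", u, use[u] }' |
sort
```

Summing each user's total against the cluster's total core-hours then gives the utilization share.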
Common issues
• Data access for users via Samba
• Time sync
  – Important to have time synced (some models rely on this)
  – Set up an NTP server on the head node
  – Configure the ntp service on all compute nodes to sync with the head node
• Updates for compute nodes (if there is no NAT)
  – Do the update on the head node
  – prsync /var/cache/yum to all nodes
  – yum update on the nodes
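The compute-node side of the time-sync setup above might look like this minimal /etc/ntp.conf sketch — the head-node address 192.168.10.1 is an assumption, chosen to be consistent with the 192.168.10.* cluster network used in the GlusterFS configs:

```
# /etc/ntp.conf on a compute node (sketch)
server 192.168.10.1 iburst    # head node as the only time source
driftfile /var/lib/ntp/drift
restrict default ignore       # accept no outside NTP traffic
restrict 127.0.0.1
restrict 192.168.10.1         # allow replies from the head node
```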
Upcoming
• Systems management v2.0?
  – Puppet/Cobbler/etc. are very hot now
  – For a single-purpose cluster, we can probably still stick with the old way
  – Rocks 5 now has most of the built-in features that we need
  – Going to test it and replace xCAT
• Upgrade the disks of the backup machine
• Convert more nodes from Windows to Linux
• Single Sign-On
  – We're going to implement LDAP (AD) authentication for all Linux machines
  – NUS Computer Center is using Centrify