Administration Tools for Managing Large Scale Linux Cluster
S. Kawabata, A. Manabe (CRC, KEK, Japan)
ACAT2002, 2002/6/26
PC cluster 4 (Neutron simulation)
Fujitsu TS225, 50 nodes: Pentium III 1GHz x 2 CPU, 512MB memory, 31GB disk, 100BaseTX x 2, 1U rack-mount model, RS232C x 2.
Remote BIOS setting, remote reset/power-off.
PC clusters
More than 400 nodes (>800 CPUs) of Linux PC clusters are already installed; only middle-sized and larger PC clusters are counted.
A major experiment group (Belle) plans to install several x100 nodes of blade servers this year.
All PC clusters are managed by the individual user groups themselves.
Center Machine (KEK CRC)
Currently the machines in the KEK Computer Center (CRC) are UNIX (Solaris, AIX) servers.
We plan to have a >1000-node Linux computing cluster in the near future (~2004).
It will be installed under a `~4-year rental' contract (hardware update every 2 years?).
Center Machine
The system will be shared among many user groups (not dedicated to only one group). Their demand for CPU power varies from month to month (high demand before an international conference, and so on). Of course, we use a load-balancing batch system.
Big groups use their own software frameworks; their jobs run only under some restricted version of the OS (Linux), middleware, and configuration.
R&D system
Frequent changes of system configuration / CPU partitioning.
To manage a PC cluster of this size, with such user requests, we need sophisticated administration tools.
Necessary admin. tools
System (SW) installation/update
Configuration
Status monitoring / system health check
Command execution
Installation tool
Two types of `installation tool':
Disk cloning
Application package installer (the system (kernel) is an application in this sense)
Installation tool (cloning)
Install the system/applications on a `master host'.
Copy the disk partition image to the nodes.
Image cloning.
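As a minimal illustration of the cloning idea (my sketch, not dolly+ itself; the device path, output file, and chunk size are hypothetical), the master image can be captured by reading the raw partition in fixed-size chunks:

    import gzip

    CHUNK = 4 * 1024 * 1024          # 4MB chunks, matching dolly+'s chunk size

    def capture_image(device="/dev/hda1", out="/tmp/hda1.img.gz"):
        """Read a raw partition and write a compressed master image."""
        with open(device, "rb") as src, gzip.open(out, "wb") as dst:
            while True:
                chunk = src.read(CHUNK)
                if not chunk:        # end of partition
                    break
                dst.write(chunk)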
Installation tool (package installer)
A package server holds a package-information DB and the package archive. Clients send a request to the server and receive the package image and control in return.
Remote Installation via NW
Cloning disk image:
SystemImager (VA) http://systemimager.sourceforge.net/
CATS-i (Soongsil Univ.)
CloneIt http://www.ferzkopp.net/Software/CloneIt/
Commercial: ImageCast, Ghost, ...
Packages/applications installation:
Kickstart + rpm (RedHat)
LUI (IBM) http://oss.software.ibm.com/developerworks/projects/lui
Lucie (TiTech) http://matsu-www.is.titech.ac.jp/~takamiya/lucie/
LCFGng, Arusha
(public domain software)
Dolly+
We developed an `image cloning via network' installer, `dolly+'.
WHY ANOTHER?
We install/update 100~1000 nodes simultaneously, maybe frequently (according to user needs).
Making packages for our own software is boring.
Traditional server/client-type software suffers from a server bottleneck.
Multicast copy with a ~GB image seems unstable. (No free software?)
The server can be a daemon process (you don't need to start it by hand).
(few) Server - (Many) Client model
Performance is not scalable against the number of nodes: server bottleneck, network congestion.
Multicasting or Broadcasting
No server bottleneck; gets the maximum performance out of a network whose switch fabrics support multicasting.
A node failure does not affect the whole process very much, so it can be robust; only the failed node needs re-transfer.
Speed is governed by the slowest node, as in a RING topology.
Not TCP but UDP, so the application must take care of transfer reliability.
Dolly and Dolly+
Dolly
A Linux application to copy/clone files and/or disk images among many PCs through a network.
Dolly was originally developed by the CoPs project at ETH (Switzerland) and is open software.
Dolly+ features
Transfer/copy of sequential files (no 2GB limitation) and/or normal files (optional: decompress and untar on the fly) via a TCP/IP network.
Virtual RING network connection topology to cope with the server bottleneck problem.
Pipelining and multi-threading mechanism for speed-up.
Fail-recovery mechanism for robust operation.
Dolly: Virtual Ring Topology
• The physical network connection can be whatever you like.
• Logically, Dolly makes a ring chain of nodes, specified in dolly's config file, and sends the data node by node, bucket-relay style.
• Though the transfer is only between two adjacent nodes, it can utilize the maximum performance of a switching network with full-duplex ports.
• Good for a network complex of many switches.
[Diagram: the master (the host having the original image) and node PCs connected through network hubs/switches; the physical connections vs. the logical (virtual) ring connection.]
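A minimal sketch of the bucket relay (my illustration, not dolly+ source; host names, port, and output path are hypothetical): each node stores what it receives from upstream and forwards it downstream.

    import socket

    CHUNK = 4 * 1024 * 1024   # 4MB chunks, as in dolly+

    def relay(listen_port, next_host, next_port, out_path):
        """Receive chunks from upstream, store locally, forward downstream."""
        srv = socket.socket()
        srv.bind(("", listen_port))
        srv.listen(1)
        upstream, _ = srv.accept()            # previous node in the ring
        down = socket.socket()
        down.connect((next_host, next_port))  # next node in the ring
        with open(out_path, "wb") as out:
            while True:
                chunk = upstream.recv(CHUNK)
                if not chunk:                 # upstream closed: end of image
                    break
                out.write(chunk)              # local copy
                down.sendall(chunk)           # bucket relay to the next node
        down.close()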
Cascade Topology
The server bottleneck can be overcome.
Cannot get the maximum network performance, but better than the many-clients-to-one-server topology.
Weak against a node failure: a failure spreads in a cascading way, too, and is difficult to recover from.
Pipelining & multi-threading
[Diagram: numbered 4MB file chunks flow from the server through Node 1, Node 2, ... over the network; on each node, three threads work in parallel, so receiving one chunk, writing another to disk, and sending a third overlap.]
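A sketch of the three-stage overlap (my illustration under assumed details, not dolly+ source): one thread receives chunks, one writes them to disk, one sends them downstream, with small queues between the stages.

    import queue, threading

    def pipeline(recv_chunk, write_chunk, send_chunk):
        """Overlap receive, disk write and send on a stream of chunks."""
        to_write, to_send = queue.Queue(4), queue.Queue(4)

        def receiver():                      # network in
            while True:
                chunk = recv_chunk()         # returns None at end of stream
                to_write.put(chunk)
                to_send.put(chunk)
                if chunk is None:
                    break

        def writer():                        # local disk
            while True:
                chunk = to_write.get()
                if chunk is None:
                    break
                write_chunk(chunk)

        def sender():                        # network out, to the next node
            while True:
                chunk = to_send.get()
                if chunk is None:
                    break
                send_chunk(chunk)

        threads = [threading.Thread(target=f)
                   for f in (receiver, writer, sender)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()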
Performance of dolly+
HW: Fujitsu TS225, Pentium III 1GHz x 2, SCSI disk, 512MB memory, 100BaseT network.
Less than 5 min for 100 nodes (expected)!
[Plot: elapsed time (min) for cloning vs. number of hosts (1 to ~500), for total 2GB and total 4GB disk image cloning; 4MB chunk size, ~10MB/s transfer speed measured on TS225.]
[Plot: transferred bytes (MB) vs. elapsed time (sec), with 10MB/s and 7MB/s reference lines.]
PC hardware spec (server & nodes): 1GHz Pentium III x 2, IDE-ATA/100 disk, 100BASE-TX network, 256MB memory.
Dolly+ transfer speed scalability with size of image
setup                elapsed time   speed
1 server - 1 node     230 sec       8.2MB/s
1 server - 2 nodes    252 sec       7.4MB/s x2
1 server - 7 nodes    266 sec       7.0MB/s x7
1 server - 10 nodes   260 sec       7.2MB/s x10
Fail recovery mechanism
• Even a single node failure could be a "show stopper" in a RING (= series connection) topology.
• Dolly+ provides an automatic `short cut' mechanism against node trouble.
• When a node fails, the upstream node detects it by a sending timeout.
• The upstream node negotiates with the downstream node for reconnection and re-transfer of a file chunk.
• The RING topology makes this easy to implement.
[Diagram: on a timeout, the upstream node short-cuts past the failed node and re-transfers the chunk over the short-cut connection.]
This works even with sequential files.
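A sketch of the short-cut idea (my illustration; the timeout value, port, and resume bookkeeping are assumptions, and the real protocol negotiates which chunk to re-transfer): if sending to the next node times out, reconnect to the node after it and resend the chunk in flight.

    import socket

    SEND_TIMEOUT = 30                        # seconds; assumed value

    def connect(host, port=9998):            # port number is an assumption
        s = socket.socket()
        s.settimeout(SEND_TIMEOUT)
        s.connect((host, port))
        return s

    def send_with_shortcut(chunks, ring):
        """Send chunks downstream; on timeout, skip the dead node."""
        nxt = 0                              # index of current downstream node
        sock = connect(ring[nxt])
        i = 0
        while i < len(chunks):
            try:
                sock.sendall(chunks[i])
                i += 1                       # chunk delivered, move on
            except socket.timeout:           # downstream node died
                sock.close()
                nxt += 1                     # short cut: reconnect next-next
                sock = connect(ring[nxt])
                # loop repeats with the same i: the chunk is re-transferred
        sock.close()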
Dolly+: how do you start it on Linux?
Server side (which has the original file):
% dollyS [-v] -f config_file
Node side:
% dollyC [-v]
Config file example:

    iofiles 3                    # number of files to transfer
    /dev/hda1 > /tmp/dev/hda1
    /data/file.gz >> /data/file
    boot.tar.Z >> /boot
    server n000.kek.jp           # master name
    firstclient n001.kek.jp
    lastclient n020.kek.jp
    clients 20                   # number of client nodes
    n001                         # client names
    n002
    :
    n020
    endconfig                    # end code

The left side of '>' is the input file on the server; the right side is the output file on the clients. '>' means dolly+ does not modify the image; '>>' indicates dolly+ should cook (decompress, untar, ...) the file according to the name of the file.
How does dolly+ clone the system after booting?
Nodes broadcast over the LAN in search of an installation server (PXE, Preboot eXecution Environment).
The PXE/DHCP server responds to the nodes with information about the node's IP address and the kernel download server.
The kernel and `ram disk image' are multicast-TFTP'ed to the nodes, and the kernel starts.
The kernel hands off to an installation script which runs a disk tool and `dolly+' (the scripts and applications are in the ram disk image).
How does dolly+ start after rebooting?
The code partitions the hard drive, creates file systems, and starts the `dolly+' client on the node.
You start the `dolly+' master on the master host to start up a disk-clone process.
The code then configures unique node information, such as the host name and IP address, from the DHCP information.
The node is then ready to boot from its hard drive for the first time.
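A minimal sketch of that personalization step (my illustration; it assumes a RedHat-style /etc/sysconfig/network layout, with the values taken from DHCP):

    def personalize(hostname, ip):
        """Write per-node identity obtained from DHCP into the cloned system."""
        with open("/etc/sysconfig/network", "w") as f:
            f.write("NETWORKING=yes\n")
            f.write("HOSTNAME=%s\n" % hostname)
        with open("/etc/hosts", "a") as f:
            f.write("%s %s\n" % (ip, hostname))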
PXE Trouble
By the way, we sometimes suffered PXE mtftp transfer failures when booting >20 nodes simultaneously.
If you have the same trouble, please mail me.
We have started rewriting the mtftp client code of the RedHat Linux PXE server.
(Sub)system Configuration
Linux (Unix) has a lot of configuration files for configuring sub-systems. If you have 1000 nodes, you have to manage (many) x 1000 config files.
To manage them, there are three types of solution (the third is the new proposal on the next slide):
A centralized information service server (like NIS); needs support from the sub-system (nsswitch).
Automatic remote editing of the raw config files (like cfengine, sketched below); must take care of each node's files separately.
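A toy sketch of the `edit raw config files' style (my illustration, not cfengine): idempotently ensure that a given line exists in a config file, run on every node.

    def ensure_line(path, line):
        """Append `line` to `path` unless an identical line is already there."""
        try:
            with open(path) as f:
                if line in (l.rstrip("\n") for l in f):
                    return False             # already configured
        except FileNotFoundError:
            pass
        with open(path, "a") as f:
            f.write(line + "\n")
        return True

    # e.g. ensure_line("/etc/ntp.conf", "server ntp.kek.jp")  # hypothetical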
Configuration: a new proposal from CS
Program (configure) the whole system from source code in an O.O. way: a systematic and uniform way of configuration, with source reuse (inheritance) as much as possible.
Template override of another site's configuration.
Arusha (http://ark.sourceforge.net), LCFGng (http://www.lcfg.org)
LCFGng (Univ. Edinburgh)
[Diagram: edit the source and compile a new profile on the server; nodes are notified, fetch the new profile, generate configuration files and execute control commands, and acknowledge.]
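A toy sketch of the compile step (my illustration of the idea, not LCFG code): turn `Hostname.Component.Parameter Value' source lines into one profile per host.

    from collections import defaultdict

    def compile_profiles(src_lines):
        """Group Hostname.Component.Parameter values into per-host profiles."""
        profiles = defaultdict(dict)
        for line in src_lines:
            key, value = line.split(None, 1)
            host, component, param = key.split(".", 2)
            profiles[host]["%s.%s" % (component, param)] = value
        return profiles

    # hypothetical source lines:
    src = ["n001.dns.servers 130.87.0.1", "n001.syslog.host loghost.kek.jp"]
    # compile_profiles(src)  ->  {'n001': {'dns.servers': ..., ...}}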
LCFGng
Good things:
The author says that it works on ~1000 nodes.
Fully automatic (you just edit the source code and compile it on one host).
Differences between sub-systems are hidden from the user (administrator), or moved into `components' (DB -> actual config file).
LCFGng
The configuration language is too primitive: Hostname.Component.Parameter Value.
Components are not so many, or you must write your own component script for each sub-system yourself; it is far easier to write the config file itself than to write a component.
The activation timing of a config change cannot be controlled.
Status Monitoring
System state monitoring (CPU/memory/disk/network utilization): Ganglia*1, Palantir*2
(Sub-)system service sanity check: Pikt*3 / Pica*4 / cfengine
*1 http://ganglia.sourceforge.net   *2 http://www.netsonde.com
*3 http://pikt.org   *4 http://pica.sourceforge.net/wtf.html
Ganglia (Univ. of California)
gmond (on each node): all nodes `multicast' their system status info to each other, so each node holds the current status of all nodes -> good redundancy and robustness. The authors declare that it works on ~1000 nodes.
Meta-daemon (on the web server): stores the volatile gmond data in a round-robin DB and presents an XML image of all nodes' activity.
Web interface.
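gmond publishes its cluster-state XML to anyone who connects; a minimal sketch of reading it (assuming the default gmond TCP port 8649 and a hypothetical node name):

    import socket
    import xml.etree.ElementTree as ET

    def read_gmond(host, port=8649):
        """Fetch the cluster-state XML that gmond dumps on connect."""
        s = socket.socket()
        s.connect((host, port))
        data = b""
        while True:
            buf = s.recv(65536)
            if not buf:                 # gmond closes after the dump
                break
            data += buf
        s.close()
        return ET.fromstring(data)

    # tree = read_gmond("n001.kek.jp")  # hypothetical node name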
Palantir (network adaptation)
Quick understanding of the system status from one web page.
Remote execution
Administrators sometimes need to issue a command to all (or part of the) nodes urgently.
Remote execution could be rsh/ssh/pikt/cfengine/SUT (mpich)*/gexec...
The points are:
To make it easy to see the execution result (failure or success) at a glance.
Parallel execution among the nodes; otherwise, if it takes 1 sec on each node, it takes 1000 sec for 1000 nodes. (A sketch follows below.)
*) Scalable Unix Tools for clusters: http://www-unix.mcs.anl.gov/sut/
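A minimal sketch of parallel execution with at-a-glance result collection (my illustration, not any of the tools above; it assumes passwordless ssh and hypothetical node names):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def run_on(node, command):
        """Run `command` on one node via ssh and capture its result."""
        r = subprocess.run(["ssh", node, command],
                           capture_output=True, text=True, timeout=60)
        return node, r.returncode, r.stdout

    def run_everywhere(nodes, command, width=50):
        """Fan the command out to all nodes in parallel; report per node."""
        with ThreadPoolExecutor(max_workers=width) as pool:
            for node, rc, out in pool.map(lambda n: run_on(n, command), nodes):
                status = "ok" if rc == 0 else "FAIL(%d)" % rc
                print("%-14s %s" % (node, status))

    # run_everywhere(["n%03d.kek.jp" % i for i in range(1, 101)], "uptime")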
WANI
A web-based remote command executor:
Easy to select the nodes concerned.
Easy to specify a script, or to type in command lines, to execute on the nodes.
Issues the commands to the nodes in parallel.
Collects the results, with error/failure detection.
Currently the software is a prototype built from combinations of existing protocols and tools. (Anyway, it works!)
Command execution result
[Screenshot: results from 200 nodes on one page, one cell per host name, with a control to switch to another page.]
Error detection
The background color marks errors detected by four checks:
1. the exit code
2. "fail/error" words (`grep -i`)
3. a check against the sys_errlist[] (perror) message list
4. a check against `strings /bin/sh` output
The frame color represents progress; white: initial, yellow: command started, black: finished. (A sketch of the four checks follows below.)
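A sketch of these detection heuristics (my illustration; the word lists here are abbreviated assumptions standing in for the full perror and shell message lists):

    import os, re

    # A few entries standing in for the full sys_errlist and shell lists.
    ERRNO_MSGS = [os.strerror(e) for e in range(1, 35)]   # "No such file...", etc.
    SHELL_MSGS = ["command not found", "No such file or directory"]

    def looks_failed(exit_code, output):
        """Apply the four WANI-style checks to one node's command result."""
        if exit_code != 0:                                    # 1. exit code
            return True
        if re.search(r"fail|error", output, re.IGNORECASE):   # 2. grep -i
            return True
        if any(msg in output for msg in ERRNO_MSGS):          # 3. sys_errlist
            return True
        if any(msg in output for msg in SHELL_MSGS):          # 4. shell messages
            return True
        return False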
[Diagram: a web browser talks to the PIKT/Webmin server; piktc issues the command, which piktc_svc executes on the node hosts; results come back via lpr to lpd, whose print_filter acts as the error detector and produces the error-marked result pages.]
Summary
I reviewed admin tools which can be used for a ~1000-node Linux PC cluster.
Installation: dolly+ installs/updates/switches hosts on >100 nodes very quickly.
Configuration manager: not matured yet, but we can expect a lot from DataGrid research.
Status monitor: several good software packages seem to exist already, at the cost of extra daemons and network traffic.
Summary
Remote command execution: `the result at a glance' is important for quick iteration, and parallel execution is important.
Some programs and links are/will be at
http://corvus.kek.jp/~manabe
Thank you for listening.
Backup: synchronizing time by rsync (dir = 4096, file size ~20kB, # of files = 43680, total size = 1.06GB).
[Plot: elapsed time (sec) vs. aggregate size of modified files (MB), for total 1GB (43680 files, ~20kB/file) and total 2GB (~50kB/file); polynomial fit $y=\sum_n a_n x^n$ with $a_0$ = 8.68524263e-01, $a_1$ = 4.24465056e-01 (a further fitted coefficient 2.04576224e+00), $|r|$ = 9.97385098e-01.]