
Administration Tools for Managing Large Scale Linux Cluster

S. Kawabata, A. Manabe (CRC, KEK, Japan), ACAT2002, 2002/6/26

[email protected]

Linux PC Clusters in KEK

PC Cluster 1: Pentium III Xeon 500 MHz, 144 CPUs (36 nodes)

PC Cluster 2: Pentium III 800 MHz, 80 CPUs (40 nodes)

PC Cluster 3 (Belle): Pentium III Xeon 700 MHz, 320 CPUs (80 nodes)

PC Cluster 4 (neutron simulation): Fujitsu TS225, 50 nodes; Pentium III 1 GHz x 2 CPUs, 512 MB memory, 31 GB disk, 100Base-TX x 2, RS-232C x 2; 1U rack-mount model with remote BIOS setting and remote reset/power-off

PC Cluster 5 (Belle): 1U servers, Pentium III 1.2 GHz, 256 CPUs (128 nodes)

PC Cluster 6: blade server (3U), LP Pentium III 700 MHz, 40 CPUs (40 nodes)


PC clusters

More than 400 nodes (>800 CPUs) of Linux PC clusters have already been installed; only medium-sized and larger PC clusters are counted. A major experiment group (Belle) plans to install several hundred blade-server nodes this year. All PC clusters are managed by the individual user groups themselves.


Center Machine (KEK CRC)

Currently the machines in the KEK Computer Center (CRC) are UNIX (Solaris, AIX) servers.

We plan to have a Linux computing cluster of more than 1000 nodes in the near future (~2004).

It will be installed under a `~4-year rental' contract (with a hardware update every 2 years?).


Center Machine

The system will be shared among many user groups (not dedicated to a single group), and their demand for CPU power varies from month to month (high demand before international conferences and so on). Of course, we use a load-balancing batch system. Big groups use their own software frameworks, and their jobs run only under restricted versions of the OS (Linux), middleware and configuration.


R&D system

Frequent changes of the system configuration and CPU partitioning are required.

To manage a PC cluster of this size under such user requirements, we need sophisticated administration tools.


Necessary admin. tools

System (SW) installation / update

Configuration

Status monitoring / system health check

Command execution


Installation tool



Two types of `installation tool':

Disk cloning

Application package installer (the system/kernel is just an application in this sense)


Installation tool (cloning)

Install the system and applications on a `master host'.

Copy the disk partition image to the nodes.

Image cloning.


Installation tool (package installer)

[Diagram: a package server holds a package-information DB and a package archive; clients send requests to the server, which returns the package image and control commands.]


Remote Installation via NW

Cloning disk images:
SystemImager (VA) http://systemimager.sourceforge.net/
CATS-i (Soongsil Univ.)
CloneIt http://www.ferzkopp.net/Software/CloneIt/
Commercial: ImageCast, Ghost, ...

Packages/applications installation:
Kickstart + rpm (RedHat)
LUI (IBM) http://oss.software.ibm.com/developerworks/projects/lui
Lucie (TiTech) http://matsu-www.is.titech.ac.jp/~takamiya/lucie/
LCFGng, Arusha

Public domain software


Dolly+

We developed `dolly+', an installer that clones disk images over the network.

WHY ANOTHER?

We install/update, possibly frequently (according to user needs), 100-1000 nodes simultaneously.

Making packages for our own software is tedious.

Traditional server/client-type software suffers from a server bottleneck.

Multicast copy of a ~GB image seems unstable (and there is no free software for it?).

(Few) server - (many) client model:
+ The server can be a daemon process (you don't need to start it by hand).
- Performance is not scalable with the number of nodes: server bottleneck and network congestion.

Multicasting or broadcasting:
+ No server bottleneck; gets the maximum performance out of a network whose switch fabric supports multicasting.
+ A node failure does not affect the whole process very much, so it can be robust: only the failed node needs a re-transfer.
- Speed is governed by the slowest node, as in a RING topology.
- Not TCP but UDP, so the application must take care of transfer reliability.

Dolly and Dolly+

Dolly is a Linux application to copy/clone files and/or disk images among many PCs through a network. It was originally developed by the CoPs project at ETH (Switzerland) and is open software.

Dolly+ features:

Transfers/copies sequential files (no 2 GB size limit) and/or normal files (optionally decompressed and untarred on the fly) via a TCP/IP network.

Virtual RING network connection topology to cope with the server bottleneck problem.

Pipelining and a multi-threading mechanism for speed-up.

Fail-recovery mechanism for robust operation.

Dolly: Virtual Ring Topology

• The physical network connection can be whatever you like.

• Logically, `Dolly' makes a ring chain of the nodes, specified in dolly's config file, and sends the data node by node like a bucket relay.

• Although each transfer is only between two adjacent nodes, it can use the maximum performance of a switching network with full-duplex ports.

• Good for a network complex of many switches.

[Diagram: the master (the host holding the original image) and the node PCs are physically connected through network hub switches; the logical (virtual) ring connection is overlaid on this physical topology.]

Cascade topology:
+ The server bottleneck can be overcome.
- Cannot reach the maximum network performance, but better than a many-clients-to-one-server topology.
- Weak against a node failure: a failure spreads down the cascade as well and is difficult to recover from.

Pipelining & multi-threading

[Diagram: the file is split into 4 MB chunks that flow from the server through Node 1, Node 2, ... to the next node over the network. Each host handles three chunks in parallel (3 threads), so while one chunk is being sent downstream the following chunks are already being received and written.]
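The idea can be illustrated with a small sketch (illustrative Python only, not the actual dolly+ code, which is a C program; the port number, queue depth and host names are made up):

# Sketch of a ring "bucket relay": each node receives 4 MB chunks from its
# upstream neighbour, writes them to disk, and forwards them downstream.
# A queue decouples receiving from forwarding so the two overlap (pipelining).

import socket
import threading
import queue

CHUNK = 4 * 1024 * 1024          # 4 MB chunks, as in dolly+
PORT = 9998                      # hypothetical port number

def forwarder(next_host, chunks):
    """Send chunks to the downstream node as soon as they arrive."""
    if next_host is None:                    # last node: nothing to forward
        while chunks.get() is not None:
            pass
        return
    with socket.create_connection((next_host, PORT)) as down:
        while True:
            chunk = chunks.get()
            if chunk is None:                # end of file reached
                break
            down.sendall(chunk)

def relay(out_path, next_host=None):
    """Receive the image from upstream, store it, and pipe it downstream."""
    chunks = queue.Queue(maxsize=3)          # a few chunks in flight at once
    t = threading.Thread(target=forwarder, args=(next_host, chunks))
    t.start()
    with socket.create_server(("", PORT)) as srv:
        upstream, _ = srv.accept()
        with upstream, open(out_path, "wb") as out:
            while True:
                data = upstream.recv(CHUNK)
                if not data:                 # upstream closed: end of file
                    break
                out.write(data)              # write locally ...
                chunks.put(data)             # ... and hand to the forwarder
    chunks.put(None)
    t.join()

# Example with hypothetical hosts: on n001, store the image and feed n002.
# relay("/tmp/disk.img", next_host="n002.kek.jp")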


Performance of dolly+

Hardware: Fujitsu TS225, Pentium III 1 GHz x 2, SCSI disk, 512 MB memory, 100Base-T network.

Less than 5 minutes is expected for 100 nodes!

[Figure: elapsed time (min) for cloning vs number of hosts, for a total 2 GB and a total 4 GB disk image; 4 MB chunk size, ~10 MB/s transfer speed measured on the TS225.]

[Figure: transferred bytes (MB) vs elapsed time (sec), with 10 MB/s and 7 MB/s reference lines, showing dolly+ transfer speed scalability with the size of the image. PC hardware (server & nodes): 1 GHz Pentium III x 2, IDE ATA/100 disk, 100BASE-TX network, 256 MB memory.]

setup                elapsed time   speed
1 server - 1 node       230 sec     8.2 MB/s
1 server - 2 nodes      252 sec     7.4 MB/s x 2
1 server - 7 nodes      266 sec     7.0 MB/s x 7
1 server - 10 nodes     260 sec     7.2 MB/s x 10



Fail recovery mechanism

• A single node failure could be a "show stopper" in a RING (series-connection) topology.

• Dolly+ provides an automatic `short cut' mechanism against node trouble: when a node fails, its upstream node detects the trouble through a send timeout; the upstream node then negotiates with the downstream node to reconnect and re-transfer the file chunk.

• The RING topology makes the implementation easy: the failed node is simply short-cut out of the ring.

[Diagram: re-transfer with short-cutting. The same pipelined flow of 4 MB chunks as above, but when a node times out it is bypassed and the chunks in flight are re-sent to the next surviving node. This works even for a sequential file.]
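A rough sketch of the upstream side of such a short cut (again illustrative Python, not the dolly+ source; the ring order, port and timeout value are assumptions):

# Sketch: the sender keeps the ring order and, when its next node stops
# responding, reconnects to the node after it and re-sends from the first
# chunk that was not fully handed over.

import socket

PORT = 9998
TIMEOUT = 30     # seconds of silence before a node is declared dead (made up)

def send_with_shortcut(ring, chunks):
    """ring: downstream hosts in ring order; chunks: list of byte blocks."""
    next_idx = 0                             # start with the nearest neighbour
    sent = 0                                 # chunks fully sent so far
    while sent < len(chunks):
        host = ring[next_idx]
        try:
            with socket.create_connection((host, PORT), timeout=TIMEOUT) as s:
                s.settimeout(TIMEOUT)
                for i in range(sent, len(chunks)):
                    s.sendall(chunks[i])     # raises on timeout / broken pipe
                    sent = i + 1
        except OSError:                      # timeout or connection error
            next_idx += 1                    # short cut: skip the dead node
            if next_idx >= len(ring):
                raise RuntimeError("no live downstream node left")
            # the loop reconnects to the new neighbour and re-sends
            # starting from chunk index `sent`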


Dolly+: how do you start it on Linux

Server side (the host which has the original files):

% dollyS [-v] -f config_file

Node side:

% dollyC [-v]

Config file example (the number of files to transfer, the master name, the number of client nodes, the client names, and the end code):

iofiles 3
/dev/hda1 > /tmp/dev/hda1
/data/file.gz >> /data/file
boot.tar.Z >> /boot
server n000.kek.jp
firstclient n001.kek.jp
lastclient n020.kek.jp
clients 20
n001
n002
:
n020
endconfig

The left-hand side of `>' is the input file on the server; the right-hand side is the output file on the clients. `>' means dolly+ does not modify the image; `>>' means dolly+ should cook the file (decompress, untar, ...) according to the name of the file.
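For illustration, a small parser for a config file in this format might look like the following (a hypothetical helper, not dolly+'s actual parser; the keyword handling is inferred from the example above):

# Sketch: parse a dolly+-style config file into transfer specs, the server
# name and the client list.

def parse_dolly_config(path):
    cfg = {"files": [], "clients": []}
    with open(path) as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith("iofiles"):
            n = int(line.split()[1])             # number of files to transfer
            for spec in lines[i + 1:i + 1 + n]:
                if ">>" in spec:                 # '>>': cook (gunzip/untar)
                    src, dst = (p.strip() for p in spec.split(">>"))
                    cfg["files"].append((src, dst, "cook"))
                else:                            # '>' : copy verbatim
                    src, dst = (p.strip() for p in spec.split(">"))
                    cfg["files"].append((src, dst, "raw"))
            i += n
        elif line.split()[0] in ("server", "firstclient", "lastclient"):
            key, value = line.split()
            cfg[key] = value
        elif line.split()[0] in ("client", "clients"):
            pass                                 # client count; names follow
        elif line == "endconfig":
            break
        elif line != ":":                        # ':' stands for the ellipsis
            cfg["clients"].append(line)
        i += 1
    return cfg

# parse_dolly_config("/etc/dollyplus.conf")      # hypothetical file name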


How does dolly+ clone the system after booting?

The nodes broadcast over the LAN in search of an installation server (Pre-eXecution Environment, PXE).

The PXE/DHCP server responds to the nodes with the node's IP address and the kernel download server.

The kernel and a `ram disk image' are transferred to the nodes by multicast TFTP, and the kernel starts.

The kernel hands off to an installation script which runs a disk tool and `dolly+' (the scripts and applications are contained in the ram disk image).


How does dolly+ start after rebooting?

The script partitions the hard drive, creates the file systems, and starts the `dolly+' client on the node.

You start the `dolly+' master on the master host to start the disk-cloning process.

The script then configures node-specific information such as the host name and IP address from the DHCP information.

The node is then ready to boot from its hard drive for the first time.
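As a rough illustration of what such an installation script does (the actual script in the ram disk is not shown in the talk; the disk layout, tool choices and file names below are assumptions):

# Sketch of the node-side install flow: partition, make file systems,
# receive the image with the dolly+ client, then write per-node identity.

import subprocess

def run(cmd, **kw):
    """Tiny helper: echo and execute a shell command, stop on failure."""
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True, **kw)

def install_node(hostname, ip):
    # 1. partition the disk (this layout is only an example)
    run("sfdisk /dev/hda", input=b",2048,83\n,512,82\n,,83\n")
    # 2. create swap and a scratch file system on the extra partitions
    run("mkswap /dev/hda2")
    run("mke2fs /dev/hda3")
    # 3. receive the cloned system image; what arrives on which partition is
    #    decided by the dolly+ config file on the master host
    run("dollyC -v")
    # 4. write node-specific configuration taken from DHCP
    run("mount /dev/hda1 /mnt")
    with open("/mnt/etc/hostname", "w") as f:    # file name depends on distro
        f.write(hostname + "\n")
    with open("/mnt/etc/hosts", "a") as f:
        f.write(f"{ip}\t{hostname}\n")
    run("umount /mnt")
    # the node is now ready to boot from its own disk for the first time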


PXE Trouble

By the way, we sometimes suffered PXE mtftp transfer failures when more than ~20 nodes booted simultaneously.

If you have the same trouble, please mail me.

We have started rewriting the mtftp client code of the RedHat Linux PXE server.


Configuration


(Sub)system configuration

Linux (Unix) has a lot of configuration files for configuring its sub-systems. If you have 1000 nodes, you have to manage (many) x 1000 config files.

To manage them, there are three types of solution (the third, configuring the whole system from a source code, is described on the next slide):

A centralized information service server (like NIS). Needs support from the sub-system (nsswitch).

Automatic remote editing of the raw config files (like cfengine). Must take care of each node's files separately.


Configuration - a new proposal from CS

Program (configure) the whole system from a source code in an object-oriented way: systematic and uniform configuration, source reuse (inheritance) as much as possible, and templates that can override another site's configuration.

Arusha (http://ark.sourceforge.net)
LCFGng (http://www.lcfg.org)

LCFGng (Univ. Edinburgh)

[Diagram: the administrator edits and compiles a new source profile; the nodes are notified, fetch the new profile, generate their configuration files and execute control commands, then acknowledge.]


LCFGng

Good points: the author says it works on ~1000 nodes. It is fully automatic (you just edit the source code and compile it on one host). The differences between sub-systems are hidden from the user (administrator), or rather moved into `components' which turn the DB into the actual config files.


LCFGng

The configuration language is too primitive: Hostname.Component.Parameter Value.

There are not many components, so you must write your own component scripts for each sub-system yourself; it is far easier to write the config file itself than to write a component.

The timing at which a configuration change is activated cannot be controlled.


Status monitoring


Status Monitoring

System state monitoring (CPU/memory/disk/network utilization): Ganglia*1, Plantir*2

(Sub-)system service sanity check: Pikt*3 / Pica*4 / cfengine

*1 http://ganglia.sourceforge.net  *2 http://www.netsonde.com
*3 http://pikt.org  *4 http://pica.sourceforge.net/wtf.html


Ganglia (Univ. of California)

gmond (on each node): all nodes `multicast' their system status information to each other, so every node has the current status of all nodes -> good redundancy and robustness. The authors declare that it works on ~1000 nodes.

Meta-daemon (on the web server): stores the volatile gmond data in a round-robin DB and presents an XML image of the activity of all nodes.

Web interface.
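For example, one way to pull that XML snapshot out of a gmond is sketched below (Python; it assumes gmond's default XML TCP port 8649, and the element, attribute and metric names follow Ganglia's format as I recall it, so treat them as assumptions):

# Sketch: read the cluster state that gmond publishes as XML on its TCP port
# and print per-host load and free memory.

import socket
import xml.etree.ElementTree as ET

def read_gmond(host="localhost", port=8649):
    with socket.create_connection((host, port)) as s:
        data = b""
        while True:
            buf = s.recv(65536)
            if not buf:
                break
            data += buf
    return ET.fromstring(data)

def show_status(root):
    for h in root.iter("HOST"):
        metrics = {m.get("NAME"): m.get("VAL") for m in h.iter("METRIC")}
        print(h.get("NAME"),
              "load_one =", metrics.get("load_one"),
              "mem_free =", metrics.get("mem_free"))

# show_status(read_gmond("n001.kek.jp"))   # hypothetical node name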


Plantir (network adaptation)

Quick understanding of the system status from one web page.


Remote Execution


Remote execution

Administrators sometimes need to issue a command to all (or part of) the nodes urgently.

Remote execution can be done with rsh / ssh / pikt / cfengine / SUT (mpich)* / gexec, ...

The points are: make it easy to see the execution result (failure or success) at a glance, and execute in parallel across the nodes. Otherwise, if it takes 1 second per node, it takes 1000 seconds for 1000 nodes.

*) Scalable Unix Tools for clusters: http://www-unix.mcs.anl.gov/sut/
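The `parallel execution with a result at a glance' idea can be sketched as follows (a minimal illustration only, not the WANI implementation; it assumes password-less ssh to the nodes and hypothetical node names):

# Sketch: run one command on many nodes in parallel over ssh and summarize
# success/failure by exit code, so the result can be read at a glance.

import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_on(node, command, timeout=30):
    p = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", node, command],
        capture_output=True, text=True, timeout=timeout)
    return node, p.returncode, p.stdout, p.stderr

def run_everywhere(nodes, command):
    with ThreadPoolExecutor(max_workers=50) as pool:
        results = list(pool.map(lambda n: run_on(n, command), nodes))
    for node, rc, out, err in results:
        flag = "OK  " if rc == 0 else "FAIL"
        print(f"{flag} {node:<16} rc={rc} {out.strip() or err.strip()}")

# run_everywhere([f"n{i:03d}.kek.jp" for i in range(1, 101)], "uptime")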


WANI

A web-based remote command executor:

Easy to select the nodes concerned.

Easy to specify a script, or to type in command lines, to execute on the nodes.

Issues the commands to the nodes in parallel.

Collects the results with error/failure detection.

Currently the software is a prototype built from a combination of existing protocols and tools (anyway, it works!).


WANI is implemented on the `Webmin' GUI.

[Screenshot: command input area, node selection and a start button.]


Command execution result

[Screenshot: results from 200 nodes on one page, listed by host name, with a control to switch to another page.]


Error detection

The result of each node is checked and its background color marks an error detected by:

1. the command exit code,
2. a case-insensitive grep for "fail/error" words,
3. a check against the sys_errlist[] (perror) message list,
4. a check against the output of `strings /bin/sh'.

The frame color represents the state: white = initial, yellow = command started, black = finished. Clicking on a result shows its stdout or stderr output.
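Those four checks might be coded roughly as follows (illustrative only; how WANI actually combines them, and the exact word lists, are assumptions):

# Sketch: classify a remote command result as failed/ok using the four
# heuristics above (exit code, fail/error words, perror-style messages,
# message strings embedded in the shell binary).

import os
import re
import subprocess

# 3. perror-style system error messages ("No such file or directory", ...)
SYS_ERRORS = [os.strerror(e) for e in range(1, 130)]

# 4. message strings embedded in /bin/sh ("command not found", ...);
#    requires the `strings` utility to be installed
SHELL_STRINGS = subprocess.run(
    ["strings", "/bin/sh"], capture_output=True, text=True).stdout.splitlines()

def looks_failed(exit_code, output):
    if exit_code != 0:                                    # 1. exit code
        return True
    if re.search(r"fail|error", output, re.IGNORECASE):   # 2. grep -i
        return True
    if any(msg in output for msg in SYS_ERRORS):          # 3. perror list
        return True
    if any(len(s) > 8 and s in output for s in SHELL_STRINGS):  # 4. sh strings
        return True
    return False

# looks_failed(0, "rm: cannot remove `x': No such file or directory")  -> True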

[Diagram: WANI prototype plumbing. The web browser talks to the Webmin server; commands are passed through the PIKT server (piktc) to piktc_svc on the node hosts for execution; the results are returned via lpr to lpd, whose print_filter acts as the error detector, and the error-marked results are assembled into the result pages shown in the browser.]


Summary

I reviewed administration tools which can be used for a ~1000-node Linux PC cluster.

Installation: dolly+ can install/update/switch more than 100 hosts very quickly.

Configuration managers: not mature yet, but we can expect a lot from DataGrid research.

Status monitoring: several good pieces of software already exist, at the cost of extra daemons and network traffic.


Summary

Remote command execution: a `result at a glance' is important for quick iteration, and so is parallel execution.

Some of the programs and links are / will be available at http://corvus.kek.jp/~manabe

Thank you for listening.


[Backup figure: elapsed time (sec) vs aggregate size of modified files (MB) when synchronizing by rsync. Test tree: 4096 directories, 43680 files of ~20 kB each, 1.06 GB in total (a second data set: ~2 GB total, ~50 kB/file). Polynomial fit y = sum_n a_n x^n with coefficients a0 = 8.685e-01, a1 = 4.245e-01, 2.046e+00; correlation |r| = 9.974e-01.]