Page 1

23 May 2001 LSCCW A.Manabe 1

System installation &

updates

A.Manabe (KEK)

Page 2

Installation & update

System (software) installation and updating is boring, hard work for me.

Question: How do you install or update the system for a cluster of more than 100 nodes?

Question: Have you ever postponed a system upgrade because the work was too much?

Page 3

Installation & Update methods

1. Pre-installed, pre-configured system
• You can postpone your work, but sooner or later...

2. Manual installation, one PC at a time
• Many operators working in parallel with many duplicated installation CDs. It requires many CRTs, many days, and money (to hire the operators).

3. Network installation
• With an NFS/FTP server and automated 'batch' installation. The server becomes too busy when installing to many nodes, and a lot of work still remains afterwards (utility software installation, ...).

Page 4

Installation & update methods

4. Duplicated disk image
• Attach many disks to one PC, duplicate the installed disk, then distribute the duplicated disks to the nodes. The hardware work is hard (attaching/detaching easy-swap disk units).

5. Diskless PC
• Local disks are used only for swap and the /var directory; all other directories come from an NFS server. A powerful server is necessary, and a node can do nothing on its own (troubleshooting may become difficult).

Page 5

An Idea

Make one installed host, then clone its disk image to the nodes via the network.

Objective: install 100 PCs in 10 minutes.

Keep the necessary operator intervention as small as possible.

Page 6

Our method (1)

Network disk cloning software
• dolly+ — for cloning the disk image.

Network booting
• PXE (Preboot Execution Environment) with an Intel NIC — for starting the installer.

Batch installer
• Modified RedHat kickstart — for disk formatting, network setup, and starting the cloning software; it also makes the node-private /etc/fstab, /etc/sysconfig/network, etc.
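Since a cloned image is byte-identical on every node, each node needs a few private identity files after cloning. A minimal sketch of generating a RedHat-style /etc/sysconfig/network per node (the hostnames n001... follow the talk's naming; the gateway address and function name are made-up examples):

```python
def network_config(hostname, gateway="192.168.1.1"):
    """Render a node-private /etc/sysconfig/network (RedHat-style)."""
    return (
        "NETWORKING=yes\n"
        f"HOSTNAME={hostname}\n"
        f"GATEWAY={gateway}\n"
    )

# One file per node, n001..n100, as in the talk's 100-node objective.
configs = {f"n{i:03d}": network_config(f"n{i:03d}") for i in range(1, 101)}
print(configs["n001"])
```

The same pattern applies to /etc/fstab when the partition layout differs between nodes.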

Page 7

Our method (2)

Remote power controller
• A network-controlled power tap (hardware) — for remote system reset (replacing pushing each reset button one by one).

Console server
• Uses the serial console feature of Linux — for watching that everything is done well.
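For reference, the Linux serial console of that era was typically enabled via kernel boot parameters; a lilo.conf fragment along these lines (illustrative, not taken from the talk) routes console output to the first serial port, where a console server can capture it:

```
# /etc/lilo.conf fragment (illustrative): send kernel console output
# to the first serial port as well as the local screen.
append="console=ttyS0,9600 console=tty0"
serial=0,9600n8   # LILO's own boot prompt on the same serial port
```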

Page 8

Dolly+ — 100 PC installation in 10 min.

Software to copy/clone files and/or disk images among many PCs through a network.

Runs on Linux as a user program. Free software.

Dolly was developed by the CoPs project at ETH (Switzerland).

Page 9

Dolly+

• Sequential file & block device transfer.
• RING network connection topology.
• Pipeline mechanism.
• Fail recovery mechanism.

Page 10

Config file — needed only on the server host.
(Server = the host having the original images or files.)

  iofiles 3
  /data/image_hda1 > /dev/hda1
  /data/image_hda5 > /dev/hda5
  /dev/hda6 > /dev/hda6
  server dcpcf001
  clients 10
  n001
  n002
  (listing of all nodes)
  endconfig
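As a reading aid, the layout above can be parsed mechanically. This sketch infers the grammar from the slide alone (dolly+'s real parser may well differ in detail):

```python
def parse_dolly_config(text):
    """Parse a dolly-style config: an iofiles count followed by
    'image > device' pairs, a server line, a client count, the
    client list, and a closing endconfig."""
    cfg = {"iofiles": [], "server": None, "clients": []}
    lines = iter(text.strip().splitlines())
    for line in lines:
        line = line.strip()
        if line.startswith("iofiles"):
            for _ in range(int(line.split()[1])):
                src, dst = next(lines).split(">")
                cfg["iofiles"].append((src.strip(), dst.strip()))
        elif line.startswith("server"):
            cfg["server"] = line.split()[1]
        elif line.startswith("clients"):
            pass          # declared client count; the list follows
        elif line == "endconfig":
            break
        elif line:
            cfg["clients"].append(line)
    return cfg

sample = """iofiles 3
/data/image_hda1 > /dev/hda1
/data/image_hda5 > /dev/hda5
/dev/hda6 > /dev/hda6
server dcpcf001
clients 10
n001
n002
endconfig"""
cfg = parse_dolly_config(sample)
print(cfg["server"], len(cfg["iofiles"]), cfg["clients"])
# dcpcf001 3 ['n001', 'n002']
```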

Page 11

Ring Topology

• Utilizes the maximum performance of full-duplex switch ports.

• Good for networks built from multiple switches (because a connection is only needed between adjacent nodes).

(Server = the host having the original image.)
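Concretely, the "ring" is a chain of point-to-point links from the server through each node in turn. This small sketch (a hypothetical helper, not from dolly+) lists the only connections that carry traffic, which is why adjacent ports on any switch suffice:

```python
def ring_links(server, clients):
    """Each host talks only to its successor, so a full-duplex port
    can receive from the previous hop while sending to the next."""
    hosts = [server] + clients
    return list(zip(hosts, hosts[1:]))

print(ring_links("dcpcf001", ["n001", "n002", "n003"]))
# [('dcpcf001', 'n001'), ('n001', 'n002'), ('n002', 'n003')]
```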

Page 12

Server bottleneck in a one server - many clients topology
(Server = the host having the original image.)

• The server is the bottleneck, both in the network and in the server itself.

Broadcast or multicast
• UDP.
• Difficulty in making a reliable transfer on multicast.

Page 13

Pipelining & multithreading

[Figure: pipeline diagram. The disk image is cut into 4MB chunks, numbered 1, 2, 3, ... from BOF to EOF. The server streams chunks over the network to Node 1, which forwards them to Node 2 (the "next node"), and so on. Each node runs 3 threads in parallel (receive, write to disk, send onward), so at any moment several consecutive chunks are in flight at different hops down the chain.]
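The diagram's flow can be sketched with threads and bounded queues. This toy model is mine, not dolly+ source: each node receives a chunk, "writes" it locally, and forwards it, so downstream nodes start writing long before the server finishes sending:

```python
import threading, queue

def node(inbox, outbox, disk):
    """One cluster node in the chain: receive a chunk, store it,
    forward it. A None chunk marks EOF and is propagated onward."""
    while True:
        chunk = inbox.get()
        if chunk is None:
            if outbox:
                outbox.put(None)
            return
        disk.append(chunk)        # 'write to local disk'
        if outbox:
            outbox.put(chunk)     # forward to the next node

chunks = list(range(8))                                   # 8 image chunks
q1, q2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)   # small buffers
disk1, disk2 = [], []
t1 = threading.Thread(target=node, args=(q1, q2, disk1))
t2 = threading.Thread(target=node, args=(q2, None, disk2))
t1.start(); t2.start()
for c in chunks:                  # the server streams chunks out
    q1.put(c)
q1.put(None)
t1.join(); t2.join()
assert disk1 == disk2 == chunks   # every node ends with the full image
```

The bounded queues stand in for the per-node chunk buffers: a slow node applies back-pressure upstream instead of exhausting memory.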

Page 14

Performance (measured)

1 server - 1 node (Pentium III 500MHz)
• IDE disk / 100BaseT network: ~4MB/s
• SCSI U2W / 100BaseT network: ~9MB/s
• 4GB image copy: ~17min. (IDE), ~8min. (SCSI)

1 server - 7 nodes (IDE / 100BaseT)
• 4GB image copy: ~17min. (only +8sec. over the single-node case)

Plus the time for the booting process.
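These measurements are consistent with the copy being throughput-bound: image size divided by the slower of disk and network rate. A quick check, using the slide's ~4 and ~9 MB/s:

```python
# 4GB (4096MB) image; throughput-limited copy time in minutes.
for label, rate_mb_s in [("IDE", 4), ("SCSI U2W", 9)]:
    minutes = 4096 / rate_mb_s / 60
    print(f"{label}: {minutes:.0f} min")
# IDE: 17 min, SCSI U2W: 8 min -- matching the measured values.
```

It also explains why 7 nodes cost almost nothing extra: with pipelining, each hop copies concurrently, so total time stays near the single-link time.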

Page 15

Expected performance

1 server - 100 nodes
• IDE/100BaseT: ~19min. (+2min. overhead)
• SCSI/100BaseT: ~9min. (+1min. overhead)
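A simple pipeline model reproduces these projections: the total time is the time to stream all chunks through one link, plus one chunk of fill delay per additional hop. The formula is my reading of the mechanism, not taken from the talk:

```python
def expected_minutes(image_mb, chunk_mb, rate_mb_s, n_nodes):
    """Pipelined chain: (n_chunks + n_nodes - 1) chunk-times end to end."""
    n_chunks = image_mb / chunk_mb
    return (n_chunks + n_nodes - 1) * (chunk_mb / rate_mb_s) / 60

print(round(expected_minutes(4096, 4, 4, 100)))  # 19 -- matches IDE/100 ~19min
print(round(expected_minutes(4096, 4, 9, 100)))  # 8 -- close to SCSI/100 ~9min
```

With 1024 chunks and 100 nodes, the fill delay adds under 10% — which is why the node count barely matters.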

Page 16

[Figure: "Time for cloning" — elapsed time (min) vs. number of hosts (0 to 1000), for 4GB and 8GB disk images, with 4MB chunk size and 8MB/s transfer speed. Caption question: how many minutes to install to 1000 nodes? Curve annotations: +50% (4GB), +100% (8GB).]

Page 17

Fail recovery mechanism

• In my experience, ~2% of nodes have initial hardware problems.

• Dolly+ provides an automatic 'short cut' mechanism for node problems: when a node times out, it is cut out of the chain and its predecessor connects directly to its successor.

• The RING topology makes this implementation easy.
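The short cut can be sketched as "probe successors until one answers". The helper names below are illustrative, not dolly+'s actual API; the real protocol also has to re-send any chunks the dead node never forwarded:

```python
import socket

def probe(host, port=9998, timeout=5.0):
    """True if the node accepts a TCP connection before the timeout.
    (Port number is a made-up example.)"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def next_alive(successors, alive=probe):
    """Return the first reachable successor in ring order, skipping
    dead nodes -- the 'short cut' past a failed machine."""
    for host in successors:
        if alive(host):
            return host
    raise RuntimeError("no live successor left in the ring")

# If n002 has failed, its predecessor reconnects straight to n003:
print(next_alive(["n002", "n003"], alive=lambda h: h != "n002"))  # n003
```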

Page 18

Cascade Topology

• The server bottleneck could be overcome.

• Weak against a node failure: a failure spreads down the cascade and is difficult to recover from.

Page 19

• A beta version will be available from

corvus.kek.jp/~manabe/pcf/dolly

after this workshop.

Page 20

Page 21

[Figure: aggregate transfer speed (MB/s, log scale) vs. number of hosts (5 to 1000, log scale), for 4GB and 8GB disk images; 4MB chunks, 8MB/s per-link transfer speed.]
