Managing Mature White Box Clusters at CERN LCW: Practical Experience Tim Smith CERN/IT.
2002/10/21 White Box Farms: [email protected] 2
Contents
  Scale
  Complexity
  Dynamics
  Behind the Scenes
  Hardware
  Software
  Practical Steps
  Legacy
  Projects
Scale
  ~1000 boxes
  140k jobs/wk
  2400 interactive users
  50 parallel reinstalls
  Parallel cmd engines
  350 kSI2000
  ~7/38 in top 500 clusters
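The "parallel cmd engines" and "50 parallel reinstalls" items can be illustrated with a minimal sketch. It assumes a bounded worker pool that applies one action to every node and records failures per node instead of aborting the sweep. The function `run_on_nodes`, the stand-in action `fake_reinstall`, and the node names are hypothetical and do not describe the tools actually used at CERN.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_nodes(nodes, action, max_parallel=50):
    """Apply action(node) to every node, at most max_parallel at a time.

    Returns a dict mapping node name to ("ok", result) or ("error", message).
    """
    def safe(node):
        try:
            return node, ("ok", action(node))
        except Exception as exc:  # one bad node must not stop the sweep
            return node, ("error", str(exc))

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return dict(pool.map(safe, nodes))


# Hypothetical stand-in for a real remote command (for example an ssh call).
def fake_reinstall(node):
    if node.endswith("13"):
        raise RuntimeError("no response")
    return "reinstalled"


nodes = [f"node{i:04d}" for i in range(1, 101)]
results = run_on_nodes(nodes, fake_reinstall)
failed = sorted(n for n, (status, _) in results.items() if status == "error")
print(failed)  # ['node0013']
```

Capping the pool at 50 mirrors the figure on the slide; in practice the limit would depend on what the installation servers and network can sustain.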
Complexity
  Hardware
    12 hardware acquisitions
    38 combinations of CPU/Mem/Disk
  Software
    4 versions of RedHat OS
    37 clusters (indep. configurations)
  User Communities
    30 expts/user communities + Public
    12,000 users
Dynamics
  Hardware Drift
    e.g. missing after reboot: CPUs, Memory, Disks
    Ethernet speed wrong
  Volatile configurations
    e.g. passwd file every couple of hours
  Hardware Failures
    Up to 4% of farm on holiday
    Replacements generate new configurations
  Monitoring
  Inventory Tracking
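Hardware drift of the kind listed above (CPUs, memory or disks missing after a reboot, wrong Ethernet speed) can be caught by comparing what a node reports against its inventory record. The following is a minimal sketch under that assumption; the attribute names and values are hypothetical, and how the observed values are collected is left out.

```python
def hardware_drift(expected, observed):
    """Return {attribute: (expected, observed)} for every attribute that differs."""
    return {
        key: (want, observed.get(key))
        for key, want in expected.items()
        if observed.get(key) != want
    }


# Hypothetical inventory record and post-reboot report for one node.
expected = {"cpus": 2, "memory_mb": 1024, "disks": 2, "eth_mbps": 100}
after_reboot = {"cpus": 1, "memory_mb": 1024, "disks": 2, "eth_mbps": 10}

drift = hardware_drift(expected, after_reboot)
print(drift)  # {'cpus': (2, 1), 'eth_mbps': (100, 10)}
```

For a sense of scale, 4% of a farm of roughly 1000 boxes is about 40 machines out of service at once, which is why a check like this needs to run automatically rather than by hand.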
Vendor Call Analysis
  [Bar chart: number of calls (axis 0 to 45) by reason: disks, dead, motherb., memory, video, processor, floppy, power/fan, tot. calls. Vendors: SIEMENS, ELONEX, TECH AS, SEIL.]
  1 every 2 days!
Acquisition Cycles
  [Chart: number of machines (axis 0 to 1200) over time, Jan-97 to Jan-05. Labels: SEIL - 1000, ELONEX - 800, TECH - 600, ELONEX - 600, SIEMENS - 550, ELONEX - 500, HP - 450, ELONEX - 450, ELONEX - 450, ELONEX - 300, COGESTRA - 266, COGESTRA - 200.]
  Out of Warranty
Addressing the Challenge
  Interactive: Refresh from uniform batch machines
  Batch: One large production facility
    Shares (and priorities)
    Selectable resources
    Flexibility
  Redundancy to reduce sensitivity to failures
  Remedy Hardware workflows
  But intractable
    Scatter in job return times
    Assumed but undeclared job requirements
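The contrast between "selectable resources" and "assumed but undeclared job requirements" can be made concrete with a small sketch: when a job declares its minimum needs, the facility can filter nodes for it, and when it declares nothing, every node looks equally suitable. The node names, attribute names and numbers below are hypothetical and not taken from the CERN batch system.

```python
def eligible_nodes(nodes, requirements):
    """Return names of nodes that meet every declared minimum requirement."""
    return [
        name
        for name, spec in nodes.items()
        if all(spec.get(key, 0) >= minimum for key, minimum in requirements.items())
    ]


# Hypothetical batch nodes from different acquisitions.
nodes = {
    "batch001": {"memory_mb": 512, "scratch_gb": 8},
    "batch002": {"memory_mb": 1024, "scratch_gb": 30},
    "batch003": {"memory_mb": 1024, "scratch_gb": 8},
}

declared = {"memory_mb": 1024, "scratch_gb": 20}
print(eligible_nodes(nodes, declared))  # ['batch002']

# With nothing declared, the job may land on any node, including unsuitable ones.
print(eligible_nodes(nodes, {}))  # ['batch001', 'batch002', 'batch003']
```

The second call shows why undeclared requirements are hard to manage: the scheduler has no basis for steering the job, so its run time depends on which hardware it happens to get.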
SW: Legacy from Maturity
  [Diagram: OS, Applications, Mgmt Tools; KickStart, SUE, ASIS, BIS; /home/usr/cute/usr/local/var/opt]
SW: Legacy from Maturity
  [Diagram: OS, Applications, Mgmt Tools; KickStart, SUE, ASIS, BIS; BIS DB, Oracle, AFS (x4), Local; acrontabs, crontabs; /home/usr/cute/usr/local/var/opt]
  Multiple owners, methods, formats
  Multiple locations
A Clean Restart
  Node Configuration System
  Monitoring System
  Installation System
  Fault Mgmt System
A Clean Restart: SnapShot
  Node Configuration System
  Monitoring System
  Installation System
  Fault Mgmt System
  [Diagram labels: HW, SW; Function, State; Software Update, Base Installation; RPM, API; PXE, Kickstart]
State and Configuration Mgt
  Clean Initial State: Linux Standards Base, RPM
  Externally Specified: Configuration System, local cache
  Versioned + Repository: CVS
  No inherent drift
    No external crontabs
    No unregistered application provider triggered updates
  Update verification: nodes + release cycle
  Procedures and Workflows
    Transactions
    Notifications
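The idea of an externally specified, versioned configuration with "no inherent drift" can be sketched as a comparison between the desired package set and what a node reports. Anything installed but absent from the specification shows up as unregistered. The function `plan_update` and all package names and versions are hypothetical illustrations, not the configuration system described in the talk.

```python
def plan_update(desired, installed):
    """Compare a desired package set with what a node reports.

    Both arguments map package name -> version string.
    Returns (to_install, to_upgrade, unregistered), each sorted by name.
    """
    to_install = sorted(p for p in desired if p not in installed)
    to_upgrade = sorted(
        p for p in desired if p in installed and installed[p] != desired[p]
    )
    unregistered = sorted(p for p in installed if p not in desired)
    return to_install, to_upgrade, unregistered


# Hypothetical desired state (as it might be checked out of a repository)
# and the state reported by one node.
desired = {"kernel": "2.4.18", "openssh": "3.1p1", "batch-client": "4.2"}
installed = {"kernel": "2.4.9", "openssh": "3.1p1", "local-hack": "0.1"}

to_install, to_upgrade, unregistered = plan_update(desired, installed)
print(to_install)    # ['batch-client']
print(to_upgrade)    # ['kernel']
print(unregistered)  # ['local-hack']
```

Running the same comparison on a few verification nodes before a wider release would correspond to the "update verification" item on the slide.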
Conclusions
  Maturity brings…
    Degradation of initial state definition (HW + SW)
    Accumulation of innocuous temporary procedures
  Scale brings…
    Marginal activities become full time
    Many hands on the systems
  Combat with strong management automation