7/2/2003 Supervision & Monitoring section 1
Supervision & Monitoring
Organization and work plan
Olof Bärring
Mandate
• Develop and deploy a monitoring solution that addresses LHC-era needs in areas such as data rates, data volumes and scalability, and that provides appropriate information for users, administrators, operators and management, both for individual component services and in logical service groupings.
• Develop and deploy an automated fault tolerance solution that is compatible with the deployed monitoring solution.
• Develop and maintain infrastructure for remote console access and system reset.
• Fulfil CERN’s commitments to the monitoring and fault tolerance tasks within EDG/WP4 + WP4 management & integration
LCG-1 monitoring: criteria
• All measurement data in Oracle
  – for service and computer center managers
    • powerful reporting tools
    • complex correlation queries
• Physics users must be given access to measurement data
  – API for query/subscription
  – web based query interface?
• Alarm display
  – for operators and service managers
LCG-1 monitoring: client
• WP4 Monitoring Sensor Agent (MSA) deployed on all CPU, disk and tape servers.
• Sensors:
  – FioSensor.pl: exception metrics
  – LinuxSensorProc: performance metrics
  – Castor performance/exception metrics would be desirable? e.g.
    • Tape queues length per device group
    • Tape pools (%free)
    • Drive status (physical and VDQM)
  – Network switches performance metrics
LCG-1 monitoring: Server(1)
• Measurement Repository, deploy:
  – WP4 MR server, TCP or UDP transport
  – PVSS, UDP transport
• Both need to be evaluated w.r.t.
  – Performance in a large deployment
  – Operational & maintenance burden
  – Physics user interface requirements
• Evaluation period: 1 – 2 months
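As an illustration of the UDP transport option, here is a minimal sketch of an agent sending one measurement to a repository over loopback UDP. The JSON wire format, the field names and the function names are assumptions made for this example; they are not the WP4 MR or PVSS protocol.

```python
import json
import socket
import time


def encode_sample(node: str, metric: str, value: float, timestamp: float) -> bytes:
    """Serialize one measurement as a JSON datagram (hypothetical wire format)."""
    return json.dumps(
        {"node": node, "metric": metric, "value": value, "ts": timestamp}
    ).encode("utf-8")


def decode_sample(datagram: bytes) -> dict:
    """Inverse of encode_sample."""
    return json.loads(datagram.decode("utf-8"))


def send_and_receive_once(sample: bytes) -> dict:
    """Send one datagram over loopback UDP and return what the receiver decoded."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as repo:
        repo.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        repo.settimeout(2.0)         # UDP gives no delivery guarantee
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as agent:
            agent.sendto(sample, repo.getsockname())
        datagram, _sender = repo.recvfrom(4096)
    return decode_sample(datagram)


if __name__ == "__main__":
    msg = encode_sample("node001", "load_avg_1min", 0.42, time.time())
    print(send_and_receive_once(msg))
```

With UDP a lost datagram is simply lost, which is one of the trade-offs a performance evaluation in a large deployment would have to weigh against TCP's connection overhead.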
LCG-1 monitoring: Server(2)
• Oracle DB
  – Use PVSS info server to regularly export to Oracle
  – Use WP4 MR server with Oracle backend from David Front (LCG/Israel)
• User interfaces
  – Service mgrs: Oracle tools
  – Users: WP4 repository API + web based query interface
  – Operators: alarm display
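To make the idea of correlation queries over measurement data in a relational database concrete, here is a small sketch. It uses Python's built-in sqlite3 as a stand-in for Oracle, and the table layout, metric names and function names are hypothetical; the actual schema is not described in these slides.

```python
import sqlite3


def open_repository() -> sqlite3.Connection:
    """Create an in-memory database with an assumed one-row-per-sample layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples (node TEXT, metric TEXT, ts INTEGER, value REAL)"
    )
    return conn


def store(conn: sqlite3.Connection, rows) -> None:
    """Insert (node, metric, ts, value) tuples."""
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def nodes_with_both(conn, metric_a, limit_a, metric_b, limit_b):
    """Nodes where both metrics exceed their limits at the same timestamp."""
    query = """
        SELECT DISTINCT a.node
        FROM samples AS a
        JOIN samples AS b ON a.node = b.node AND a.ts = b.ts
        WHERE a.metric = ? AND a.value > ?
          AND b.metric = ? AND b.value > ?
        ORDER BY a.node
    """
    cursor = conn.execute(query, (metric_a, limit_a, metric_b, limit_b))
    return [row[0] for row in cursor]


if __name__ == "__main__":
    conn = open_repository()
    store(conn, [
        ("n1", "load", 100, 9.0), ("n1", "io_errors", 100, 3.0),
        ("n2", "load", 100, 9.5), ("n2", "io_errors", 100, 0.0),
    ])
    print(nodes_with_both(conn, "load", 5.0, "io_errors", 1.0))
```

A self-join on node and timestamp is only one way to express a correlation; a production schema would likely need time windows rather than exact timestamp matches.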
Evaluation phase architecture
[Diagram labels: MSA (×6); oracleMonServer; PVSS; PVSS Info Server Export; W2K; Oracle DB (×2); API (×3). Annotation: "Can this be given to users?"]
Monitoring deployment: Issues
• WP4 alarm display: needs to be finalized and deployed
• Externalized repository API for PVSS: Andreu’s library requires PVSS client to be installed
• Continue to duplicate efforts for another 2 months knowing that ~half of the work will be thrown away afterwards
LCG-1 monitoring: Scenarios
• Test both solutions in parallel ~2 months
• Document the evaluation and decide:
  – WP4 solution selected
  – PVSS solution selected
  – Would need both because requirement scope too wide, e.g.
    • PVSS alarm display is best for the operators
    • WP4 implementation of the repository API is best for the users
Fault tolerance (FT): plans
• Model the escalation procedures: ~May
  – Tracing of recovery actions
  – Exception escalation hierarchies
• Evaluate WP4 FT framework: ~September
  – Adaptable to the modeled escalation procedure?
  – If not: survey other frameworks (e.g. Pete’s Oracle based correlation engine)
• Adapt the LCG-1 monitoring to the FT recovery action tracing if necessary. ~October
• Deploy. ~November
FT: modeling (~May)
• Model the escalation procedure
[Flow diagram labels: Exception raised; Try local repairs; Local recoveries; Try global repairs; Global recoveries; Try manual repairs; Alarm raised; Problem fixed!; Exception reset. Open questions: Escalation? Trace of actions?]
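One possible reading of the escalation procedure is: try local repairs first, then global repairs, and raise an alarm for manual repair only if both fail, recording every attempted action along the way. The sketch below models that reading; the ordering, the class and function names, and the trace format are assumptions for illustration, not the modeled procedure itself.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

# A repair action takes the exception name and reports whether it fixed the problem.
Repair = Callable[[str], bool]


@dataclass
class EscalationResult:
    fixed: bool = False
    alarm_raised: bool = False
    # Trace of automated actions as (level:action_name, succeeded) pairs.
    trace: List[Tuple[str, bool]] = field(default_factory=list)


def escalate(
    exception_name: str,
    local_repairs: Sequence[Repair],
    global_repairs: Sequence[Repair],
) -> EscalationResult:
    """Try local, then global repairs; raise an alarm if nothing worked."""
    result = EscalationResult()
    for level, repairs in (("local", local_repairs), ("global", global_repairs)):
        for repair in repairs:
            ok = repair(exception_name)
            result.trace.append((f"{level}:{repair.__name__}", ok))
            if ok:
                result.fixed = True  # exception can be reset
                return result
    # Automated recovery exhausted: hand over to manual repair.
    result.alarm_raised = True
    return result


if __name__ == "__main__":
    def restart_daemon(name: str) -> bool:
        return False

    def drain_node(name: str) -> bool:
        return True

    print(escalate("daemon_dead", [restart_daemon], [drain_node]))
```

Keeping the trace separate from the repair logic is one way to feed recovery actions back into monitoring, which the open question "Trace of actions?" hints at.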
FT: evaluation (~September)
• Evaluate WP4 FT framework
  – Does it scale to global correlations?
  – Is the rule syntax rich enough?
• Check other frameworks
  – Pete’s Oracle based solution
FT: deployment (~November)
• Make sure the framework works together with the LCG-1 monitoring
  – FT related metrics
  – Correlation engines need:
    • API for data consumption (subscription/queries)
    • API for action tracing (feedback to monitoring)
• Deploy the system and ...
  – Develop correlation engine and exception escalation hierarchies
  – Check that it works in production
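The two interfaces a correlation engine needs, data consumption and action tracing, could look roughly like the sketch below. `MonitoringBus`, `ThresholdEngine` and all method names are hypothetical and exist only for this example; they do not reproduce the WP4 repository API.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str, float]  # (node, metric, value)


class MonitoringBus:
    """Toy in-process monitoring layer with subscription, query and tracing."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Sample], None]]] = defaultdict(list)
        self._samples: List[Sample] = []
        self.action_trace: List[Tuple[str, str, str]] = []

    def subscribe(self, metric: str, callback: Callable[[Sample], None]) -> None:
        """Data consumption, push style: call back on every new sample of a metric."""
        self._subscribers[metric].append(callback)

    def query(self, metric: str) -> List[Sample]:
        """Data consumption, pull style: return all stored samples of a metric."""
        return [s for s in self._samples if s[1] == metric]

    def publish(self, sample: Sample) -> None:
        """Store a sample and notify subscribers of its metric."""
        self._samples.append(sample)
        for callback in self._subscribers[sample[1]]:
            callback(sample)

    def trace_action(self, node: str, metric: str, action: str) -> None:
        """Action tracing: feed a recovery action back to monitoring."""
        self.action_trace.append((node, metric, action))


class ThresholdEngine:
    """Minimal 'correlation engine': one rule, one threshold."""

    def __init__(self, bus: MonitoringBus, metric: str, limit: float) -> None:
        self.bus = bus
        self.limit = limit
        bus.subscribe(metric, self.on_sample)

    def on_sample(self, sample: Sample) -> None:
        node, metric, value = sample
        if value > self.limit:
            self.bus.trace_action(node, metric, "recovery_requested")


if __name__ == "__main__":
    bus = MonitoringBus()
    ThresholdEngine(bus, "tape_queue_length", 10)
    bus.publish(("tpsrv02", "tape_queue_length", 25.0))
    print(bus.action_trace)
```

A real engine would correlate several metrics across nodes; the single-threshold rule here only shows how the subscription and tracing calls fit together.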
Timelines
[Chart, time axis: Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec]
• Deploy WP4 server
• Deploy PVSS
• Run both systems in parallel
• Evaluation report
• Selection
• Gather input from selected set of LHC users
• Maintenance
• Fault tolerance: model escalation and tracing
• Evaluate WP4 FT framework
• Adapt and deploy
Other tasks
• Develop and maintain infrastructure for remote console access and system reset
  – Strategy, man-power ??
• WP4 management
  – WP4 manager
  – WP4 monitoring task leader
Who does what?
Tasks: PVSS dev/depl, WP4 mon depl, WP4 mon dev/plan, WP4 mgr, WP4 integr, FT esc, FT eval, Remote console
People: Bill, Hugo, Jan, Juan, Karim, Maite, Olof, Sylvain