
7/2/2003 Supervision & Monitoring section 1

Supervision & Monitoring

Organization and work plan

Olof Bärring


Mandate

• Develop and deploy a monitoring solution that addresses LHC-era needs in areas such as data rates, data volumes and scalability and that provides appropriate information for users, administrators, operators and management both for individual component services and in logical service groupings.

• Develop and deploy an automated fault tolerance solution that is compatible with the deployed monitoring solution.

• Develop and maintain infrastructure for remote console access and system reset.

• Fulfil CERN’s commitments to the monitoring and fault tolerance tasks within EDG/WP4 + WP4 management & integration


LCG-1 monitoring: criteria

• All measurement data in Oracle
  – for service and computer center managers
    • powerful reporting tools
    • complex correlation queries

• Physics users must be given access to measurement data
  – API for query/subscription
  – web based query interface?

• Alarm display
  – for operators and service managers
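
The "API for query/subscription" criterion can be illustrated with a minimal Python sketch. This is not the WP4 repository API: the class name `MeasurementRepository`, its methods `store`, `query` and `subscribe`, and the metric name `loadavg` are all hypothetical, and the in-memory list only stands in for the Oracle store.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Sample:
    node: str         # host that produced the measurement
    metric: str       # metric name (hypothetical, e.g. "loadavg")
    timestamp: float  # seconds since the epoch
    value: float


class MeasurementRepository:
    """In-memory stand-in for a measurement repository (illustration only)."""

    def __init__(self) -> None:
        self._samples: List[Sample] = []
        self._subscribers: Dict[str, List[Callable[[Sample], None]]] = defaultdict(list)

    def store(self, sample: Sample) -> None:
        # Keep the sample for later queries and push it to live subscribers.
        self._samples.append(sample)
        for callback in self._subscribers[sample.metric]:
            callback(sample)

    def query(self, metric: str, start: float, end: float) -> List[Sample]:
        # Historical access: all samples of one metric in a closed time window.
        return [s for s in self._samples
                if s.metric == metric and start <= s.timestamp <= end]

    def subscribe(self, metric: str, callback: Callable[[Sample], None]) -> None:
        # Live access: callback is invoked for every new sample of this metric.
        self._subscribers[metric].append(callback)


repo = MeasurementRepository()
received: List[Sample] = []
repo.subscribe("loadavg", received.append)
repo.store(Sample("node01", "loadavg", 100.0, 0.7))
repo.store(Sample("node01", "loadavg", 200.0, 1.9))
history = repo.query("loadavg", 0.0, 150.0)
```

In this sketch a query serves users who want history, while a subscription serves consumers that must react to new data as it arrives.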


LCG-1 monitoring: client

• WP4 Monitoring Sensor Agent (MSA) deployed on all CPU, disk and tape servers.

• Sensors:
  – FioSensor.pl: exception metrics
  – LinuxSensorProc: performance metrics
  – Castor performance/exception metrics would be desirable? e.g.
    • Tape queue length per device group
    • Tape pools (%free)
    • Drive status (physical and VDQM)
  – Network switches performance metrics


LCG-1 monitoring: Server(1)

• Measurement Repository, deploy:
  – WP4 MR server, TCP or UDP transport
  – PVSS, UDP transport

• Both need to be evaluated w.r.t.
  – Performance in a large deployment
  – Operational & maintenance burden
  – Physics user interface requirements

• Evaluation period: 1 – 2 months
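
To make the transport choice concrete, here is a minimal Python sketch of an agent pushing one measurement to a repository over UDP on the loopback interface. The text wire format (`node metric_id timestamp value`) and the function names are invented for illustration; they are not the WP4 MR or PVSS protocol.

```python
import socket
from typing import Tuple


def encode_sample(node: str, metric_id: int, timestamp: int, value: float) -> bytes:
    # Hypothetical wire format: four space-separated ASCII fields.
    return f"{node} {metric_id} {timestamp} {value}".encode("ascii")


def decode_sample(datagram: bytes) -> Tuple[str, int, int, float]:
    node, metric_id, timestamp, value = datagram.decode("ascii").split(" ")
    return node, int(metric_id), int(timestamp), float(value)


# "Repository" side: bind to any free port on the loopback interface.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)  # do not block forever if the datagram is lost
address = receiver.getsockname()

# "Agent" side: one datagram per sample, no connection set-up.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(encode_sample("node01", 42, 1044576000, 0.75), address)

datagram, _ = receiver.recvfrom(1024)
decoded = decode_sample(datagram)

sender.close()
receiver.close()
```

UDP keeps no per-agent connection state on the server, but a lost datagram is a lost sample. TCP delivers reliably at the cost of one connection per agent. That trade-off is one input to the "performance in a large deployment" criterion.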


LCG-1 monitoring: Server(2)

• Oracle DB
  – Use PVSS info server to regularly export to Oracle
  – Use WP4 MR server with Oracle backend from David Front (LCG/Israel)

• User interfaces
  – Service mgrs: Oracle tools
  – Users: WP4 repository API + web based query interface
  – Operators: alarm display


Evaluation phase architecture

[Figure: architecture diagram. Labels: MSA (×6), oracleMonServer, PVSS (W2K), PVSS Info Server Export, Oracle DB (×2), API (×3). Annotation on one API: "Can this be given to users?"]


Monitoring deployment: Issues

• WP4 alarm display: needs to be finalized and deployed

• Externalized repository API for PVSS: Andreu’s library requires PVSS client to be installed

• Continue to duplicate efforts for another 2 months knowing that ~half of the work will be thrown away afterwards


LCG-1 monitoring: Scenarios

• Test both solutions in parallel for ~2 months

• Document the evaluation and decide:
  – WP4 solution selected
  – PVSS solution selected
  – Would need both because requirement scope too wide, e.g.
    • PVSS alarm display is best for the operators
    • WP4 implementation of the repository API is best for the users


Fault tolerance (FT): plans

• Model the escalation procedures: ~May
  – Tracing of recovery actions
  – Exception escalation hierarchies

• Evaluate WP4 FT framework: ~September
  – Adaptable to the modeled escalation procedure?
  – If not: survey other frameworks (e.g. Pete’s Oracle based correlation engine)

• Adapt the LCG-1 monitoring to the FT recovery action tracing if necessary. ~October

• Deploy. ~November


FT: modeling (~May)

• Model the escalation procedure

[Figure: escalation flow diagram. Labels: Exception raised; Try local repairs; Local recoveries; Try global repairs; Global recoveries; Alarm raised; Try manual repairs; Exception reset; Problem fixed! Open questions: Escalation? Trace of actions?]
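
One way to read the escalation procedure is sketched below in Python. The ordering (local repair, then global repair, then an alarm and manual repair) is an assumption drawn from the diagram labels, and the function name `handle_exception` and the trace strings are hypothetical. It is not the WP4 FT framework.

```python
from typing import Callable, List

Repair = Callable[[], bool]  # returns True when the repair fixed the problem


def handle_exception(name: str,
                     local_repair: Repair,
                     global_repair: Repair,
                     manual_repair: Repair) -> List[str]:
    """Escalate through the repair levels and return a trace of every action."""
    trace = [f"exception raised: {name}"]
    # Automatic levels first: local, then global.
    for level, repair in (("local", local_repair), ("global", global_repair)):
        trace.append(f"try {level} repair")
        if repair():
            trace.append("exception reset")
            return trace
    # Automatic recovery failed: raise an alarm so a human takes over.
    trace.append("alarm raised")
    trace.append("try manual repair")
    if manual_repair():
        trace.append("exception reset")
    return trace


# Example: the local repair fails, the global repair succeeds.
trace = handle_exception("daemon_dead",
                         local_repair=lambda: False,
                         global_repair=lambda: True,
                         manual_repair=lambda: True)
```

Returning the trace rather than only a success flag reflects the "Trace of actions?" question: each recovery attempt becomes a record that could be fed back to the monitoring system.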


FT: evaluation (~September)

• Evaluate WP4 FT framework
  – Does it scale to global correlations?
  – Is the rule syntax rich enough?

• Check other frameworks
  – Pete’s Oracle based solution


FT: deployment (~November)

• Make sure the framework works together with the LCG-1 monitoring
  – FT related metrics
  – Correlation engines need:
    • API for data consumption (subscription/queries)
    • API for action tracing (feedback to monitoring)

• Deploy the system and ...
  – Develop correlation engine and exception escalation hierarchies
  – Check that it works in production
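
The two APIs a correlation engine needs can be sketched as follows. This is a hypothetical Python illustration, not the WP4 FT framework or the Oracle based engine: the class `CorrelationEngine`, the threshold rule and the cluster and exception names are assumptions. `on_exception` plays the role of the data-consumption API, and `report_action` that of the action-tracing API.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple


class CorrelationEngine:
    """Toy global correlation: many nodes with the same exception -> one global action."""

    def __init__(self, threshold: int, report_action: Callable[[str], None]) -> None:
        self.threshold = threshold
        self.report_action = report_action  # action tracing: feedback to monitoring
        self._affected: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
        self._escalated: Set[Tuple[str, str]] = set()

    def on_exception(self, cluster: str, node: str, exception: str) -> None:
        # Data consumption: called for each exception the engine subscribed to.
        key = (cluster, exception)
        self._affected[key].add(node)
        if key in self._escalated:
            return  # a global recovery is already under way for this key
        if len(self._affected[key]) >= self.threshold:
            self._escalated.add(key)
            self.report_action(f"global recovery for {exception} on cluster {cluster}")
        else:
            self.report_action(f"local recovery for {exception} on {node}")


actions: List[str] = []
engine = CorrelationEngine(threshold=3, report_action=actions.append)
for node in ("n1", "n2", "n3", "n4"):
    engine.on_exception("clusterA", node, "fs_full")
```

With a threshold of three, the first two nodes get local recoveries, the third triggers a single global recovery, and the fourth adds nothing because the cluster-wide action is already recorded.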


Timelines

[Figure: Gantt chart spanning Feb to Dec; bar positions are not recoverable from this transcript. Tasks shown:]

• Deploy WP4 server
• Deploy PVSS
• Run both systems in parallel
• Evaluation report
• Selection
• Gather input from selected set of LHC users
• Maintenance
• Fault tolerance: model escalation and tracing
• Evaluate WP4 FT framework
• Adapt and deploy


Other tasks

• Develop and maintain infrastructure for remote console access and system reset
  – Strategy, man-power ??

• WP4 management
  – WP4 manager
  – WP4 monitoring task leader


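Who does what?

[Table: assignment matrix of people against tasks; the individual cell assignments are not recoverable from this transcript.]

Tasks: PVSS dev/depl; WP4 mon depl; WP4 mon dev/plan; WP4 mgr; WP4 integr; FT esc; FT eval; Remote console

People: Bill, Hugo, Jan, Juan, Karim, Maite, Olof, Sylvain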