A proposal for improving Job Reliability Monitoring
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
A proposal for improving Job Reliability Monitoring
GDB, 2nd April 2008
Problem Statement
• We would like to be able to gather job state transitions from all jobs submitted in WLCG resources
  – EGEE
    • RB + WMS submitted jobs
    • Jobs submitted directly to a CE (via condor_g)
  – OSG
    • Jobs on an OSG CE for WLCG VOs
  – Nordugrid
    • ARC CE
• Use this to calculate resource reliability
  – And use for debugging, e.g. Dashboards
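As an illustration of how gathered state transitions could feed a reliability figure, here is a minimal Python sketch. The slides do not define the reliability metric; the formula used here (successful jobs divided by jobs that reached a terminal state), the `JobEvent` fields and the state names are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JobEvent:
    """One job state transition (all field names are hypothetical)."""
    job_id: str       # grid job identifier
    resource: str     # name of the CE or site the job ran on
    state: str        # e.g. "SUBMITTED", "RUNNING", "DONE", "ABORTED"
    timestamp: float  # seconds since the epoch


# Assumed terminal states; real systems define their own state models.
SUCCESS_STATES = {"DONE"}
FAILURE_STATES = {"ABORTED"}


def reliability_by_resource(events):
    """Return {resource: successes / (successes + failures)}."""
    # Keep only the most recent event seen for each job.
    latest = {}
    for ev in events:
        current = latest.get(ev.job_id)
        if current is None or ev.timestamp >= current.timestamp:
            latest[ev.job_id] = ev

    counts = defaultdict(lambda: [0, 0])  # resource -> [successes, failures]
    for ev in latest.values():
        if ev.state in SUCCESS_STATES:
            counts[ev.resource][0] += 1
        elif ev.state in FAILURE_STATES:
            counts[ev.resource][1] += 1
        # Jobs with no terminal state yet are left out rather than guessed at.

    return {res: ok / (ok + bad) for res, (ok, bad) in counts.items()}
```

Resources whose jobs have not yet reached a terminal state simply do not appear in the result, which avoids inventing an outcome for unfinished jobs.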
Principles
• Only gather this information once
  – Propagate to interested parties
• Using existing systems and expertise where possible
  – Don’t try and deploy components on every WMS/RB/L&B/CE/…
  – Get ‘cooked’ data from the systems
• Hook up with Pilot Jobs
  – Linkage between pilot and experiment jobs as a ‘state change’
• ‘Job Wrapper tests’ fit in here too
  – They’re just another state change
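The "gather once, propagate to interested parties" principle can be pictured as a simple fan-out. The Python sketch below is illustrative only: `StateChangeHub` and its methods are hypothetical names, and a real deployment would use a messaging system rather than in-process callbacks.

```python
class StateChangeHub:
    """Receives each state change once and hands it to every subscriber."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        """Register a callable that accepts one event."""
        self._subscribers.append(callback)

    def publish(self, event):
        """Deliver the event to all subscribers; return how many received it."""
        for callback in self._subscribers:
            callback(event)
        return len(self._subscribers)
```

Two consumers (for example a reliability calculation and a debugging dashboard) can subscribe to the same hub, so the producing system only has to report each state change once.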
Current situation - Gridview
• Currently mines L&B log files, and sends them via R-GMA
• Loses many records
• Hacks to ‘finish’ unfinished jobs after 24 hours
  – Inaccurate results
Current situation - Dashboard
• Jobs reported via experiment frameworks
  – Gathers from many sources: Imperial College XML files, job submission tools, MonALISA reporting from jobs, R-GMA
• But some information is missing for condor_g jobs
  – Info between submission and user job starting on WN
  – Job aborted
• Some work done (Sergey Belov/Dubna) on reporting state changes inside condor_g
Proposal
• Use WLCG Monitoring infrastructure (MSG) for collecting and transporting the data
  – Messaging system
  – Standard message formats
• Work with expert groups to instrument the job submission systems
• Visualization by Gridview + Dashboards
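To make "standard message formats" concrete, here is a minimal Python sketch of what a job state change message could look like on the wire. The actual WLCG message schema is not given in these slides; the JSON encoding, the `JobStateMessage` class and every field name below are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JobStateMessage:
    """A single job state change (hypothetical schema)."""
    job_id: str     # identifier of the job
    source: str     # reporting system, e.g. "L&B" or "condor_g"
    resource: str   # CE or site the job was sent to
    state: str      # new job state
    timestamp: str  # time of the change, as an ISO 8601 UTC string


REQUIRED_FIELDS = ("job_id", "source", "resource", "state", "timestamp")


def encode(message):
    """Serialise a message to a JSON string with a stable key order."""
    return json.dumps(asdict(message), sort_keys=True)


def decode(text):
    """Parse a JSON string, rejecting messages with missing fields."""
    data = json.loads(text)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return JobStateMessage(**{name: data[name] for name in REQUIRED_FIELDS})
```

Agreeing on one such format means each instrumented submission system can be developed by its own expert group while consumers only need a single parser.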
Effort
• Provide some effort to do the instrumentation
  – Coordination: WLCG Monitoring (D Rodrigues)
  – Messaging system integration: Gridview
  – EGEE WMS: L&B, Gridview
  – Condor-G: Condor team, Dashboard team
  – OSG CE: through OSG participation in the Joint Monitoring Group (OSG Operations (Rob Quick), measurements & metrics (Brian Bockelman))
EGEE
• L&B notifications mean we don’t have to run components mining L&B log files
  – Consumer of notifications can be remote
• L&B is stated to scale for our needs
  – Tested at >1m records/day
  – Testing of integration with notifications underway by Gridview team
• Message formats already defined
  – Old log mining approach will be moved to the messaging system to free Gridview from its R-GMA dependency
condor_g
• condor_g submitter instrumented to create L&B messages
  – Done by a separate listener process that is started by condor_g
  – Limited subset of condor_g state changes will be sent
• Listener/reporter can use different transports for reporting
  – Currently MonALISA as a transport layer
  – Will migrate to WLCG messaging system
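The idea of a listener that reports a limited subset of state changes over a swappable transport can be sketched as follows. This is not the actual condor_g listener: the class names, the set of reported states and the in-memory transport are all hypothetical stand-ins chosen so the example runs on its own.

```python
from typing import Protocol


class Transport(Protocol):
    """Anything that can deliver one message (assumed interface)."""

    def send(self, message: dict) -> None: ...


class InMemoryTransport:
    """Stand-in transport that only records what was sent."""

    def __init__(self):
        self.sent = []

    def send(self, message: dict) -> None:
        self.sent.append(message)


# Assumed subset of states worth reporting; the real subset is not listed here.
REPORTED_STATES = frozenset({"SUBMITTED", "RUNNING", "DONE", "ABORTED"})


class StateChangeReporter:
    """Filters state changes and forwards them over the chosen transport."""

    def __init__(self, transport: Transport, reported_states=REPORTED_STATES):
        self.transport = transport
        self.reported_states = reported_states

    def on_state_change(self, job_id: str, state: str) -> bool:
        """Report the change if it is in the subset; return True if sent."""
        if state not in self.reported_states:
            return False
        self.transport.send({"job_id": job_id, "state": state})
        return True
```

Because the reporter depends only on the `send` method, moving from one transport layer to another means passing in a different transport object and leaving the filtering logic untouched.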
EGEE Architecture
OSG
• Gratia is used to transport messages inside OSG
• A Gratia-MSG Bridge could be implemented
  – Similar to the RSV bridge used for OSG availability
• Plan to include discussion in the upcoming EGEE-OSG-WLCG design meeting in Madison at the end of May
  – Hope to further collaborate with OSG on the infrastructure for the analysis of the collected data, dashboards etc.
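A bridge of this kind is essentially a field-by-field translation from one record format to another. The Python sketch below shows the shape of such a translation only; it does not use the real Gratia record schema or the MSG API, and the field names on both sides (`osg_job_id`, `job_id`, and so on) are placeholders invented for the example.

```python
# Placeholder mapping from source-side field names to target-side names.
FIELD_MAP = {
    "osg_job_id": "job_id",
    "osg_site": "resource",
    "osg_status": "state",
    "osg_time": "timestamp",
}


def translate(record):
    """Convert one source record to the target format, or None if incomplete."""
    if any(name not in record for name in FIELD_MAP):
        return None
    message = {target: record[source] for source, target in FIELD_MAP.items()}
    message["source"] = "OSG"  # tag the origin so consumers can tell feeds apart
    return message


def bridge(records, publish):
    """Translate each record and pass it to publish; return the number forwarded."""
    forwarded = 0
    for record in records:
        message = translate(record)
        if message is None:
            continue  # incomplete records are skipped, not guessed at
        publish(message)
        forwarded += 1
    return forwarded
```

Here `publish` is any callable that accepts a dictionary, so the same bridge logic could be tested against a list and later pointed at a real messaging client.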
Nordugrid
• Nordugrid
  – Currently the only Nordugrid job info is via the ATLAS production DB
  – How do we get information from the CE?
• Will look to implement a similar bridge if needed
  – Need to work through the technical details with the experts
  – Discussion yet to start…
Pilot Jobs
• L&B client resides on every worker node
• Can be used to submit additional messages to L&B for a job
  – Timestamps + environment for Job Wrapper start/end
  – Timestamp of handover to user job
  – Linkage of pilot job to experiment job ID
  – …
• Benefit is that it’s all in one coherent data structure for a given job
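The benefit of "one coherent data structure for a given job" can be illustrated with a small Python sketch in which wrapper timestamps, the handover and the pilot-to-experiment linkage are all appended to a single record. This does not use the L&B client API; `JobRecord`, its methods and the event names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobRecord:
    """All events for one pilot job, kept together (hypothetical structure)."""
    pilot_job_id: str
    experiment_job_id: Optional[str] = None
    events: list = field(default_factory=list)  # (timestamp, name, details)

    def log(self, timestamp, name, **details):
        """Append one event, e.g. a Job Wrapper start with its environment."""
        self.events.append((timestamp, name, details))

    def hand_over(self, timestamp, experiment_job_id):
        """Record the handover to the user job and link the two job IDs."""
        self.experiment_job_id = experiment_job_id
        self.log(timestamp, "HANDOVER_TO_USER_JOB",
                 experiment_job_id=experiment_job_id)

    def timeline(self):
        """Return event names ordered by timestamp."""
        return [name for _, name, _ in sorted(self.events, key=lambda e: e[0])]
```

With everything attached to the same record, a consumer can reconstruct the full history of a pilot job and the experiment job it ran without joining data from separate sources.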
Summary
• Propose a more coherent approach to mining job state changes
  – Uses expert knowledge where possible to ‘cook’ the data into a useful structure
• Fits the principles of the WLCG monitoring activity
  – Use common message system components
  – Split the work across the relevant teams
• What have we missed?
  – Your feedback essential…