Monitoring and performance measurement in Production Grid Environments David Wallom.

Date posted: 18-Dec-2015

Overview

• Who uses monitoring?

• Aspects of performance measurement

• Tools for monitoring

• Adding a new service into a monitoring framework

Who are the consumers of monitoring?

• Grid/VO management
  – Responsible for designing & maintaining requirements
  – Verify fulfillment of SLAs by resource providers

• System administrators
  – Notified of problems
  – Enough information to understand context of problem

• End users
  – View results and compare to problems they are having
  – Debug user account/environment issues
  – Advanced users: feedback to Grid/VO

Monitoring from a user perspective

• Things that need to work for the Grid:
  – Can I log in?
  – Is my application(s) available on connected systems?
  – Can I get to my input data?
  – What credentials do I need?
  – Can I get the input data to the application?
  – How long will my application take to run?
  – …

Performance Measurement

• Depends on monitoring of:
  – Availability
  – Usage

Measuring Availability

• Test the following grid functionality
  – User authorization
  – System information publishing
  – Data transfer to and from system
  – Submission of tasks onto the system

• Measurement of other functionality
  – Type of system
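
As a rough illustration of the availability tests listed above, the following Python sketch runs one check per grid function and treats a system as available only if every check passes. The check functions are hypothetical placeholders, not part of any real grid middleware; in practice each would wrap a real authorization, information-publishing, data-transfer, or job-submission call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def run_availability_checks(checks: Dict[str, Callable[[], None]]) -> List[CheckResult]:
    """Run every check; a check signals failure by raising an exception."""
    results = []
    for name, check in checks.items():
        try:
            check()
            results.append(CheckResult(name, True))
        except Exception as exc:  # any error counts as "not available"
            results.append(CheckResult(name, False, str(exc)))
    return results


def is_available(results: List[CheckResult]) -> bool:
    """A system counts as available only if every tested function works."""
    return all(result.passed for result in results)


# Hypothetical placeholder checks (assumptions for illustration only).
def check_authorization() -> None:
    pass  # a real check would authenticate with a test credential


def check_data_transfer() -> None:
    raise RuntimeError("transfer timed out")  # simulated failure
```

Calling `run_availability_checks({"user authorization": check_authorization, "data transfer": check_data_transfer})` yields one passing and one failing result, so `is_available` reports the system as unavailable while the `detail` field keeps the reason for the administrator.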

Measuring Usage

• Within each system we need to know:
  – Current load
    • e.g. queue lengths, number of running processes on an SMP system
  – Network connectivity
  – Total throughput rate for a submitted user job
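
A minimal Python sketch of two of the load measurements above: the load per CPU on a single system, and queue lengths per job state. The two-column job listing format is an assumption made for this example; a real batch system's output would need its own parser.

```python
import os
from collections import Counter


def load_per_cpu() -> float:
    """One-minute load average divided by the CPU count (Unix only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def count_jobs_by_state(listing: str) -> Counter:
    """Count jobs per state in a listing of "<job_id> <state>" lines.

    The "<job_id> <state>" layout is hypothetical; lines with fewer than
    two fields are skipped.
    """
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts
```

For a listing such as `"101 running\n102 queued\n103 queued\n"`, `count_jobs_by_state` counts one running and two queued jobs, which gives the queue length directly.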

Tools for monitoring availability

• Systems status

• Grid status

• All system and grid status monitoring

Ganglia

• Developed out of the HPC community

• Monitors worker nodes as well as system head nodes

• Sub-nodes can report to a master to provide grid-wide monitoring

• Example:
  – http://oxgrid-vom.ierc.ox.ac.uk/ganglia/

Big Brother

• Designed to monitor individual systems

• Simple interface giving immediate feedback on overall system status

• Different providers can be added for additional services, such as different processes to be monitored

• Can be difficult to look at historical trends, though

• Example:
  – http://cerb-mds.bris.ac.uk/bb/bb.html

Grid Interoperability Test Scripts

• Developed by Southampton e-Science Centre

• Tests each of the standard grid functionalities in series for a specified node

• Wrapper to test many systems in parallel

• Example of the results:
  – http://www.ngs.ac.uk/ops/gits/oxford/NationalGridService.html
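
The pattern described above (tests in series per node, wrapped to cover many nodes in parallel) can be sketched in Python as follows. This is not the GITS code itself; the function names and the shape of a check (a callable taking a node name and returning True or False) are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple

# A named check: (label, function taking a node name and returning pass/fail).
NodeCheck = Tuple[str, Callable[[str], bool]]


def run_checks_on_node(node: str, checks: List[NodeCheck]) -> Dict[str, bool]:
    """Run each check against one node, strictly in the order given."""
    results = {}
    for label, check in checks:
        try:
            results[label] = bool(check(node))
        except Exception:  # a crashing check is recorded as a failure
            results[label] = False
    return results


def run_checks_on_all_nodes(
    nodes: Iterable[str], checks: List[NodeCheck], max_workers: int = 8
) -> Dict[str, Dict[str, bool]]:
    """Wrapper: one worker thread per node, so nodes are tested in parallel."""
    node_list = list(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_node = pool.map(lambda node: run_checks_on_node(node, checks), node_list)
        return dict(zip(node_list, per_node))
```

Threads are a reasonable choice here because such checks mostly wait on the network; the per-node ordering is preserved, so a later check (for example job submission) still runs after an earlier one (for example authorization) on the same node.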

INCA

• Developed by SDSC and TeraGrid

• Extensible framework for monitoring

• Tests the following as standard
  – Static system information
  – Installed software versions
  – Network performance
  – Load on both the head node and the queue system, if available

• Additionally, the UK NGS has developed a plug-in for the GITS tests

• Example:
  – http://inca.grid-support.ac.uk/

Testing the behaviour of a Grid

1. Define a set of concrete requirements for connected systems

2. Write tests to verify requirements

3. Periodically run tests and collect data across all of the systems

4. Publish data and archive for reporting

5. Automate Steps 3 and 4 to provide real time system status information
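
The steps above can be sketched as a small Python loop: requirements are expressed as named tests, each cycle collects results across all systems, and the snapshot is both published (latest status) and archived (history for reporting). The file layout, function names, and five-minute default interval are assumptions for this sketch, not part of any existing framework.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional

# A requirement is a named test taking a system name and returning pass/fail.
Requirements = Dict[str, Callable[[str], bool]]


def collect(systems: Iterable[str], requirements: Requirements) -> dict:
    """Run every requirement test against every system (step 3)."""
    snapshot = {"timestamp": time.time(), "systems": {}}
    for system in systems:
        snapshot["systems"][system] = {
            name: bool(test(system)) for name, test in requirements.items()
        }
    return snapshot


def publish_and_archive(snapshot: dict, status_file: Path, archive_file: Path) -> None:
    """Overwrite the current status and append to the archive (step 4)."""
    status_file.write_text(json.dumps(snapshot, indent=2))
    with archive_file.open("a") as archive:
        archive.write(json.dumps(snapshot) + "\n")


def monitor(
    systems: Iterable[str],
    requirements: Requirements,
    status_file: Path,
    archive_file: Path,
    interval_s: float = 300.0,
    cycles: Optional[int] = None,  # None means run until interrupted
) -> None:
    """Automate collection and publishing at a fixed interval (step 5)."""
    system_list = list(systems)
    done = 0
    while cycles is None or done < cycles:
        publish_and_archive(collect(system_list, requirements), status_file, archive_file)
        done += 1
        if cycles is None or done < cycles:
            time.sleep(interval_s)
```

The status file always holds the most recent snapshot for a real-time view, while the archive keeps one JSON line per cycle so that availability over a reporting period can be computed later.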

Connecting to existing production systems

• Determine monitoring requirements for systems to be connected

• Write independent tests for the service being provided

• Write information providers to fit tests into existing monitoring frameworks
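
As one possible illustration of the last two points, the Python sketch below keeps the test independent of any framework (a plain TCP reachability check) and adds a thin information provider that hands the result to Ganglia. It assumes Ganglia's `gmetric` command-line tool is installed and configured on the host; the metric name, the choice of a TCP check, and the 1/0 encoding are hypothetical choices for this example.

```python
import socket
import subprocess
from typing import List


def service_is_up(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Independent test: can a TCP connection to the service be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def gmetric_command(metric_name: str, is_up: bool) -> List[str]:
    """Build a gmetric call that reports the result as 1 (up) or 0 (down)."""
    return [
        "gmetric",
        "--name", metric_name,
        "--value", "1" if is_up else "0",
        "--type", "uint8",
        "--units", "up",
    ]


def report_to_ganglia(host: str, port: int, metric_name: str) -> None:
    """Information provider: run the test and publish its result.

    Assumes gmetric is on the PATH; raises CalledProcessError if it fails.
    """
    command = gmetric_command(metric_name, service_is_up(host, port))
    subprocess.run(command, check=True)
```

Keeping `service_is_up` separate from the reporting step means the same test could be wrapped by a different provider for another framework without being rewritten.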

Conclusions

• Monitoring must be based on a well-known set of requirements for admins (both VO and systems) & users

• There are several products available to provide monitoring frameworks, each of which can be extended beyond its initial capabilities

• Life would be made a lot simpler if there were a standard monitoring schema which could then be used to plug grid and system information into all monitoring frameworks!