Nick LeRoy Computer Sciences Department University of Wisconsin-Madison nleroy@cs.wisc.edu Hawkeye.

Post on 18-Jan-2016

Nick LeRoy
Computer Sciences Department
University of Wisconsin-Madison

nleroy@cs.wisc.edu
http://www.cs.wisc.edu/condor/hawkeye

Hawkeye

www.cs.wisc.edu/condor

What is Hawkeye?

› A monitoring and management tool for distributed systems

› That's great, but... What does that mean? What can Hawkeye do for me?

What does that mean?

› Hawkeye is a tool that can be used to monitor various aspects of your computers

› Examples:
    › System load monitoring
    › Watching for run-away processes
    › Monitoring the health of your Condor pool

What can Hawkeye do?

› Hawkeye can alert you when things go wrong. For example, Hawkeye can:
    › Alert you when virtually any condition is found
    › Alert you when various Condor problems are identified
    › Allow you to specify your own custom alerts

Why Hawkeye?

› Make system administration easier

› Make Condor pool maintenance easier

Hawkeye Architecture

[Architecture diagram. Component labels: Hawkeye Module, Hawkeye Monitoring Agent, Condor Pool, Grid, Hawkeye Manager]

Hawkeye Matchmaking

› Hawkeye alerts are done using ClassAd matchmaking.

[Diagram: a MachineAd and a TriggerAd are matched; a match raises an alert]
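
The matchmaking idea can be sketched in a few lines of Python. This is only an illustration, not Hawkeye's implementation: ads are modeled as plain dicts, a trigger's condition as a Python callable, and the names `TriggerAd`, `match_alerts`, and the `Machine` attribute are hypothetical. The `Condor_NumRunaway` attribute is borrowed from the ClassAd examples in this talk.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Ad = Dict[str, Any]  # a machine ad modeled as a plain attribute dictionary


@dataclass
class TriggerAd:
    """Hypothetical stand-in for a trigger ad: a name plus a condition."""
    name: str
    condition: Callable[[Ad], bool]


def match_alerts(machine_ads: List[Ad], triggers: List[TriggerAd]) -> List[str]:
    """Return one alert string for every (machine ad, trigger) pair that matches."""
    alerts = []
    for ad in machine_ads:
        for trigger in triggers:
            try:
                matched = trigger.condition(ad)
            except KeyError:
                # A missing attribute means the condition cannot hold for this ad.
                matched = False
            if matched:
                alerts.append(f"{trigger.name}: {ad.get('Machine', '<unknown>')}")
    return alerts


machine_ads = [
    {"Machine": "node1", "Condor_NumRunaway": 2},
    {"Machine": "node2", "Condor_NumRunaway": 0},
    {"Machine": "node3"},  # schema-free: this ad lacks the attribute entirely
]
triggers = [TriggerAd("Runaway processes", lambda ad: ad["Condor_NumRunaway"] > 0)]

alerts = match_alerts(machine_ads, triggers)
print(alerts)
```

Treating a missing attribute as "no match" is a design choice of this sketch; it keeps ads from different modules comparable even when they do not share a schema.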

Hawkeye ClassAds

› Hawkeye uses ClassAds to represent collected data
    › Schema-free data representation
    › Provides matching mechanism
    › Represent whatever data you gather in a way that works best for you

Hawkeye ClassAds

› Example ClassAd “snippet”:

    RAM_MemFree = 841932800
    RAM_MemShared = 0
    RAM_MemTotal = 1055367168
    RAM_SwapCached = 0
    RAM_SwapFree = 2147483647
    RAM_SwapTotal = 2147483647
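
To show how such attribute lines might be consumed, the following Python sketch parses `Name = value` lines into a dictionary and derives the fraction of free memory. The `parse_attributes` helper is hypothetical and handles only integer, float, and double-quoted string literals, not full ClassAd expressions.

```python
from typing import Dict, Union

Value = Union[int, float, str]


def parse_attributes(text: str) -> Dict[str, Value]:
    """Parse simple `Name = literal` lines into a dict (literals only)."""
    attrs: Dict[str, Value] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        name, _, raw = line.partition("=")
        raw = raw.strip()
        value: Value
        if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
            value = raw[1:-1]  # quoted string: drop the surrounding quotes
        else:
            try:
                value = int(raw)
            except ValueError:
                try:
                    value = float(raw)
                except ValueError:
                    value = raw  # keep anything else as raw text
        attrs[name.strip()] = value
    return attrs


snippet = """
RAM_MemFree = 841932800
RAM_MemTotal = 1055367168
RAM_SwapFree = 2147483647
RAM_SwapTotal = 2147483647
"""

ram = parse_attributes(snippet)
free_fraction = ram["RAM_MemFree"] / ram["RAM_MemTotal"]
print(f"Free memory: {free_fraction:.1%}")
```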

Hawkeye ClassAds

› Example ClassAd “snippet” #2:

    Condor_NumExecs = 2
    Condor_NumMasters = 1
    Condor_NumRunaway = 2
    Condor_NumSchedds = 0
    Condor_NumShadows = 0
    Condor_NumStartds = 1
    Condor_NumStarters = 2
    Condor_RunawayPids = "3214,8753"

Sample Alert Trigger

[
    AlertTrigger = ( MyType == "Pool" && Absent.count > 5 );
    AlertSeverity = ( Absent.count > 5 ) ? 1 : 0;
    Name = "Absent Nodes";
    AlertText = StrCat(Absent.count,
                       " machines are missing in ",
                       Name)
]
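
To make the sample trigger's logic concrete, here is a rough Python rendering. It rests on several assumptions: the pool ad is modeled as a nested dict so that `Absent.count` becomes `pool_ad["Absent"]["count"]`, `evaluate_absent_nodes` is a hypothetical helper, and `Name` is taken from the trigger itself, as the sample is written.

```python
from typing import Any, Dict, Optional

TRIGGER_NAME = "Absent Nodes"  # the trigger's own Name attribute


def evaluate_absent_nodes(pool_ad: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return an alert dict if the trigger fires for this ad, else None."""
    absent = pool_ad.get("Absent", {}).get("count", 0)
    # AlertTrigger = ( MyType == "Pool" && Absent.count > 5 )
    if not (pool_ad.get("MyType") == "Pool" and absent > 5):
        return None
    return {
        "Name": TRIGGER_NAME,
        # AlertSeverity = ( Absent.count > 5 ) ? 1 : 0
        "AlertSeverity": 1 if absent > 5 else 0,
        # AlertText = StrCat(Absent.count, " machines are missing in ", Name)
        "AlertText": f"{absent} machines are missing in {TRIGGER_NAME}",
    }


alert = evaluate_absent_nodes({"MyType": "Pool", "Absent": {"count": 7}})
quiet = evaluate_absent_nodes({"MyType": "Pool", "Absent": {"count": 3}})
print(alert)
print(quiet)
```

Because the trigger and the severity expression both test `Absent.count > 5`, the severity is always 1 whenever this particular alert fires.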

Hawkeye at UW

› Currently at UW, we're using Hawkeye:
    › To monitor our Condor cluster
    › To aid in detecting and correcting cluster problems
    › To monitor the US/CMS testbed health

Customizing Hawkeye

› Hawkeye allows you to run your own custom “modules” to gather data.

› Hawkeye allows you to set your own custom “alerts” on attributes generated by “standard” and “custom” modules.
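
As a rough idea of what a custom module might look like, the Python sketch below gathers disk usage and prints it as `Name = value` lines in the style of the ClassAd snippets shown earlier. The output convention, the `Disk_` attribute prefix, and the function names are assumptions made for illustration; consult the Hawkeye documentation for the actual module interface.

```python
import shutil
from typing import Dict, List


def collect_disk(path: str = "/") -> Dict[str, int]:
    """Gather disk usage (in bytes) for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return {
        "Disk_Total": usage.total,
        "Disk_Used": usage.used,
        "Disk_Free": usage.free,
    }


def format_attributes(attrs: Dict[str, object]) -> List[str]:
    """Render attributes as `Name = value` lines, quoting string values."""
    lines = []
    for name, value in attrs.items():
        if isinstance(value, str):
            value = '"' + value.replace('"', '\\"') + '"'
        lines.append(f"{name} = {value}")
    return lines


if __name__ == "__main__":
    # Assumed convention: the module writes its attributes to standard output.
    print("\n".join(format_attributes(collect_disk())))
```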

What is the status of Hawkeye?

› Hawkeye 1.0 Release Candidate 1 (RC1)

› Current module library includes modules to monitor system load, users, disk space, Condor, and more

› Available from http://cs.wisc.edu/condor/hawkeye