Nick LeRoy
Computer Sciences Department
University of Wisconsin-Madison
nleroy@cs.wisc.edu
http://www.cs.wisc.edu/condor/hawkeye

Hawkeye

www.cs.wisc.edu/condor

What is Hawkeye?

› A monitoring and management tool for distributed systems

› That's great, but... What does that mean? What can Hawkeye do for me?

What does that mean?

› Hawkeye is a tool that can be used to monitor various aspects of your computers

› Examples:
  › System load monitoring
  › Watching for run-away processes
  › Monitoring the health of your Condor pool

What can Hawkeye do?

› Hawkeye can alert you when things go wrong. For example, Hawkeye can:
  › Alert you when virtually any condition is found
  › Alert you when various Condor problems are identified
  › Allow you to specify your own custom alerts

Why Hawkeye?

› Make system administration easier

› Make Condor pool maintenance easier

Hawkeye Architecture

[Figure: architecture diagram with boxes labeled Hawkeye Manager, Hawkeye Monitoring Agent (two), Hawkeye Module (three), Condor Pool, and Grid]

Hawkeye Matchmaking

› Hawkeye alerts are done using ClassAd matchmaking.

[Figure: diagram with labels MachineAd, TriggerAd, Match, and Alert]
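
The matchmaking idea can be sketched in Python, purely as an illustration. Ads are modeled as plain dicts and each trigger carries a Python predicate in place of a ClassAd expression. The function `find_alerts`, the host names, and the "Runaway processes" trigger are hypothetical and are not part of Hawkeye or the ClassAd library.

```python
# Illustrative sketch only: real Hawkeye evaluates ClassAd expressions,
# not Python callables. All names below are hypothetical.

def find_alerts(machine_ads, trigger_ads):
    """Pair every machine ad with every trigger ad whose condition holds."""
    alerts = []
    for machine in machine_ads:
        for trigger in trigger_ads:
            try:
                matched = trigger["condition"](machine)
            except KeyError:
                # The ad lacks an attribute the trigger needs: no match.
                matched = False
            if matched:
                alerts.append((machine["Name"], trigger["Name"]))
    return alerts


machine_ads = [
    {"Name": "node1", "Condor_NumRunaway": 2},
    {"Name": "node2", "Condor_NumRunaway": 0},
    {"Name": "node3"},  # schema-free: this ad omits the attribute entirely
]
trigger_ads = [
    {"Name": "Runaway processes",
     "condition": lambda ad: ad["Condor_NumRunaway"] > 0},
]

alerts = find_alerts(machine_ads, trigger_ads)
print(alerts)
```

Treating a missing attribute as "no match" is a design choice of this sketch; it keeps one incomplete ad from blocking alerts on the others.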

Hawkeye ClassAds

› Hawkeye uses ClassAds to represent collected data
  › Schema-free data representation
  › Provides matching mechanism
  › Represent whatever data you gather in a way that works best for you

Hawkeye ClassAds

› Example ClassAd “snippet”:

    RAM_MemFree = 841932800
    RAM_MemShared = 0
    RAM_MemTotal = 1055367168
    RAM_SwapCached = 0
    RAM_SwapFree = 2147483647
    RAM_SwapTotal = 2147483647

Hawkeye ClassAds

› Example ClassAd “snippet” #2:

    Condor_NumExecs = 2
    Condor_NumMasters = 1
    Condor_NumRunaway = 2
    Condor_NumSchedds = 0
    Condor_NumShadows = 0
    Condor_NumStartds = 1
    Condor_NumStarters = 2
    Condor_RunawayPids = "3214,8753"
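
As a rough illustration of how such attribute lines could be consumed, the Python sketch below reads `Name = value` pairs into a dict. It handles only integers and double-quoted strings, not the full ClassAd language, and `parse_attributes` is a hypothetical helper rather than anything shipped with Hawkeye.

```python
# Minimal sketch: parses only integer and quoted-string attributes,
# which is all the example snippets use. Not a real ClassAd parser.

def parse_attributes(text):
    """Parse simple `Name = value` lines into a dict."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, value = line.partition("=")
        name, value = name.strip(), value.strip()
        if value.startswith('"') and value.endswith('"'):
            attrs[name] = value[1:-1]   # quoted string
        else:
            attrs[name] = int(value)    # plain integer
    return attrs


snippet = '''
Condor_NumRunaway = 2
Condor_NumStartds = 1
Condor_RunawayPids = "3214,8753"
'''

ad = parse_attributes(snippet)
# The PID list arrives as one comma-separated string; split it for use.
runaway_pids = [int(pid) for pid in ad["Condor_RunawayPids"].split(",")]
print(ad["Condor_NumRunaway"], runaway_pids)
```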

Sample Alert Trigger

[
  AlertTrigger = ( MyType == "Pool" && Absent.count > 5 );
  AlertSeverity = ( Absent.count > 5 ) ? 1 : 0;
  Name = "Absent Nodes";
  AlertText = StrCat(Absent.count,
                     " machines are missing in ",
                     Name)
]
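
To make the logic of this trigger concrete, here is a hypothetical Python rendering of it. It is not Hawkeye code, and it makes two assumptions the slide does not confirm: `MyType` and `Absent.count` are read from the ad being tested, while `Name` is taken from the trigger itself. The nested dict used for `Absent` is likewise just a stand-in for the ClassAd reference `Absent.count`.

```python
# Hypothetical Python rendering of the sample trigger; not Hawkeye code.
TRIGGER_NAME = "Absent Nodes"
THRESHOLD = 5


def evaluate_trigger(pool_ad):
    """Return (fired, severity, text) for one ad, mirroring the sample trigger."""
    # Defaulting a missing count to 0 is a simplification of this sketch.
    absent = pool_ad.get("Absent", {}).get("count", 0)
    fired = pool_ad.get("MyType") == "Pool" and absent > THRESHOLD
    severity = 1 if absent > THRESHOLD else 0
    text = f"{absent} machines are missing in {TRIGGER_NAME}"
    return fired, severity, text


healthy = {"MyType": "Pool", "Absent": {"count": 2}}
degraded = {"MyType": "Pool", "Absent": {"count": 7}}

print(evaluate_trigger(healthy))
print(evaluate_trigger(degraded))
```

Only the second ad fires the alert, since its absent count exceeds the threshold of 5.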

Hawkeye at UW

› Currently at UW, we're using Hawkeye:
  › To monitor our Condor cluster
  › To aid in detecting and correcting cluster problems
  › To monitor the US/CMS testbed health

Customizing Hawkeye

› Hawkeye allows you to run your own custom “modules” to gather data.

› Hawkeye allows you to set your own custom “alerts” on attributes generated by “standard” and “custom” modules.
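
A custom module might look something like the Python sketch below. It assumes, without confirmation from these slides, that a module is simply a program that prints `Name = value` lines for the monitoring agent to collect. The attribute names (`Disk_Total`, `Disk_Free`, `Disk_PercentUsed`) and both functions are invented for illustration.

```python
# Hypothetical custom module: the output format (one `Name = value` line
# per attribute on standard output) is an assumption, not a documented API.
import shutil


def collect_disk_attributes(path=".", prefix="Disk"):
    """Gather a few disk-space attributes for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    percent_used = round(100 * usage.used / usage.total) if usage.total else 0
    return {
        f"{prefix}_Total": usage.total,
        f"{prefix}_Free": usage.free,
        f"{prefix}_PercentUsed": percent_used,
    }


def format_attributes(attrs):
    """Render a dict as `Name = value` lines, quoting string values."""
    lines = []
    for name, value in attrs.items():
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        lines.append(f"{name} = {rendered}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_attributes(collect_disk_attributes()))
```

A custom alert could then test an attribute such as the hypothetical `Disk_PercentUsed` in the same way the sample trigger tests `Absent.count`.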

What is the status of Hawkeye?

› Hawkeye 1.0 Release Candidate 1 (RC1)

› Current module library includes modules to monitor system load, users, disk space, Condor, and more

› Available from http://cs.wisc.edu/condor/hawkeye