Patrol Tuning Guide


Page 1: Patrol Tuning Guide

PATROL Tuning Guide for

PATROL Administrators

by David Spuler and Geert De Peuter

Based on material from - Dave Bonnell - Geert De Peuter - Michael Sharpe - Garland Smith - David Spuler - David Stuart

Page 2: Patrol Tuning Guide

INTRODUCTION .............................................................. 4
   IT’S THE DEVELOPER’S FAULT… ........................................... 4
   KM PERFORMANCE IMPACT AREAS ........................................... 5

SOME THINGS WE NEED TO COVER FIRST ........................................ 6
   PATROL AGENT RUN QUEUE MANAGEMENT ..................................... 6
      Agent Main Run-queue and Internal Command Scheduling ............... 6
      PSL Run-Queue ...................................................... 7
   PATROL AGENT TUNING VARIABLES ......................................... 8
      Main Run Queue Tuning Variables (RUNQ_) ............................ 9
      Tuning the Main Run Queue .......................................... 10
      Priority Variables ................................................. 11
      pslInstructionMax and pslInstructionPeriod ......................... 11
      getProcsCycle ...................................................... 12
      applCheckCycle ..................................................... 12
   OTHER USEFUL COMMANDS TO SEE WHAT THE AGENT IS DOING ................. 13
      %DUMP commands ..................................................... 13
      Debugging option ................................................... 14
   PROCESS CACHE EXTRACTION PROCESS ...................................... 15
   PRE-DISCOVERY AND DISCOVERY ........................................... 16
   BASIC AGENT RUN QUEUE MONITOR AND AUTOMATIC AGENT TUNING ............. 17

MAKE SURE THE AGENT IS HEALTHY ........................................... 18
   MAKE SURE PATROL IS BEING STARTED PROPERLY ............................ 18
   CHECK THE LICENSE KEY ................................................. 19
   SCAN THE ERROR LOGS ................................................... 19
   REMOVE/EXAMINE THE CONFIG FILES ....................................... 20
   CHECK KM SPECIFIC FILE ACCESS ......................................... 20
   CHECK THE SYSTEM OUTPUT WINDOW ........................................ 20
   CHECK FOR UNDISCOVERED OR OFFLINE KM’S ................................ 20
   CHECK DEBUG SETTINGS .................................................. 21
   CHECK FOR CUSTOMIZATIONS .............................................. 21
   LARGE HISTORY FILE .................................................... 21
   I GET THE MESSAGE “PATROLAGENT IS RUNNING LOW ON MEMORY” ............. 22

MEASURE AND FIND THE BOTTLENECKS ......................................... 23
   TEST VS PRODUCTION GOTCHAS ............................................ 23
   LOOK FOR COMMON CPU HOGS .............................................. 24
      Printers (UNIX) .................................................... 24
      NFS (UNIX) or Windows File Sharing ................................. 24
      Active Processes KM ................................................ 25
      Incompatible OS .................................................... 25
   PSL PERFORMANCE MEASUREMENT TECHNIQUES ................................ 26
   PSL PROCESSES LIST VIA %PSLPS ......................................... 27
   THE PATROL AGENT QUERY TOOL ........................................... 29

Page 3: Patrol Tuning Guide

   MEASURING OS COMMAND EXECUTION COST ................................... 30
   PSL PROFILER SUBSYSTEM ................................................ 31
      PPV Tool ........................................................... 32
      Profiling Report Formats ........................................... 33
      Profiler PSL Functions ............................................. 34
   TOOLS FOR EASY PERFORMANCE MEASUREMENT ................................ 35
      Deployment KM ...................................................... 35
         PSL Execution Counter Menu ...................................... 35
         PSL Processes Watch Window Menu Commands ........................ 35
         Profiler Console Menu Commands .................................. 36
      PSL Profiler KM .................................................... 37
         Purpose ......................................................... 37
         Overview ........................................................ 37
         OS Platforms .................................................... 37
         AGENT_PROFILER.km - Main application class doing all the work ... 38
         AGENT_PROF_KM.km ................................................ 40
         AGENT_PSL_PROC.km application class ............................. 42
         Limitations ..................................................... 42

SOLVING PERFORMANCE PROBLEMS ............................................. 43
   HISTORY STORAGE ....................................................... 43
   FIXING HISTORY FILES AND HISTORY INDEX POLLUTION ..................... 43
   RECONFIGURE THE PATROL AGENT .......................................... 44
   EVENT STORAGE ......................................................... 45
   CONFIGURATION GUIDELINES FOR DEPLOYING PATROL FOR OPTIMAL PERFORMANCE 46
      Discovery Cycle time ............................................... 47
         Changing discovery cycles ....................................... 47
         More details on discovery cycles ................................ 47
         Targeting discovery cycle changes ............................... 47
      Collection Parameter Polltimes ..................................... 48
         Targeting polltime changes ...................................... 48
      Application Disabling .............................................. 49
         Changing disabled KMs ........................................... 49
         Targeting applications .......................................... 49
      Collector Disabling ................................................ 50
         Targeting collectors ............................................ 50

APPENDIXES ............................................................... 51
   APPENDIX A : PATROL ENVIRONMENT VARIABLE TABLE ....................... 51
   APPENDIX B : PSL OPTIMIZER ............................................ 52
   APPENDIX C : PATROL ARCHITECTURE PERFORMANCE ASPECTS ................. 54
      PATROL 3.0 Performance Aspects ..................................... 54
      PATROL 3.1 Performance Aspects ..................................... 56
      PATROL 3.2 Performance Aspects ..................................... 57

Page 4: Patrol Tuning Guide

Introduction

In general, administrators don’t proactively tune their systems. One of the golden rules of system administration seems to be: if it works with acceptable performance, don’t touch it! However, if you see that a single process occupies 90% or more of your machine’s processor, you might say that the process consumes too many resources. As always, you have to put things in perspective and consider what purpose the process serves. If a virus detection tool on my NT box uses 70% CPU, I would say that it consumes too much. On the other hand, if I’m running a corporate database with 1000 users on my system, I would feel it’s OK (of course the latter example is a bit unreal). If the CPU-consuming process happens to be the PATROL Agent, you might feel challenged to tune it down. The purpose of this paper is to guide you through understanding how an agent uses the system’s CPU. It also covers the steps to take to determine bottlenecks and solve performance problems, or at least to find their exact cause.

It’s the developer’s fault…

You wouldn’t believe how many people intuitively react that way! (Maybe because as long as there is someone else to blame, they will be pardoned.) But people will expect you, the administrator, to find the cause of the problem. It is not necessarily the developer who wrote bad code (of course, there is that slight possibility). There is a chance that the system you are running PATROL on is a bit bigger than an average system… Anyway, before you can point at the developers, you will have to determine which KM is actually causing the problem, and what the reason for the performance problem is. It is well known that people are particularly poor at identifying where performance bottlenecks are. Therefore the first law of performance tuning should probably be:

Measure before you tune!

Only by measuring the true performance characteristics of an application can you see where to focus tuning efforts. The same measurements can save you hours of tweaking a discovery script that only runs once every 45 seconds, while leaving alone a parameter script that runs every 10 seconds for all 50 instances. The main thing to understand is that PATROL will use system resources as needed to perform its monitoring and administrative tasks. The more applications and instances PATROL monitors, the more system resources will be required. PATROL attempts to run in such a way as to not adversely affect the system. Be aware that scaling PATROL back can be self-defeating: if PATROL is not allowed to use the cycles necessary to do its job, it will not be nearly as effective.

Page 5: Patrol Tuning Guide

So before we continue we should know what normal CPU usage is. This is very difficult to say (probably that’s why most system administrators don’t think of tuning unless they feel something is wrong). As a rule of thumb, the PATROL Agent should use less than an average of 10% CPU on most systems. This includes the UNIX KM (3%) and a DB instance (2%). Of course these numbers are just estimates.

KM Performance Impact Areas

The performance of a KM can be measured in terms of many factors in the monitored system and the monitored application. All of these factors should be minimized to achieve an efficient KM in the full sense of the term. System performance is the first area of concern. There are two basic ways that a KM can consume system resources. The first is through the KM’s effect on the PATROL Agent executable itself. The second is through the external child processes launched by the KM. Some of the areas of performance impact of a KM on the system include:

1. CPU usage
2. Sub-process launching
3. Memory
4. Disk usage
5. Network traffic
6. Other system resources (locks, inodes, etc.)

The second main area of impact is the effect of the monitoring actions on the database or application that is monitored. This is a highly KM-specific area that depends on how the application is being monitored. However, there are some common issues regarding application performance, such as the cost and frequency of queries used for monitoring, and other pragmatic issues such as avoiding tying up a connection license. For now let us turn to the general issues of system impact minimization that are common to all KM designs.

Page 6: Patrol Tuning Guide

Some things we need to cover first

PATROL Agent Run Queue Management

The PATROL Agent sits in an infinite loop waiting for action. In fact the Agent has two internal queues:

Agent Main Run-queue and Internal Command Scheduling

The central data structure for PATROL Agent job scheduling is the main run-queue, which holds all executable commands: discovery and pre-discovery procedures, parameters, the process cache cycle, menu commands, etc. The main run-queue holds jobs that are scheduled to run soon, not jobs that are currently running. It is ordered by the time at which a command should next run. This time is only altered by actions such as "Update parameter"; otherwise the command runs at its scheduled time.

After a command has executed, it is re-scheduled according to its scheduling policy. For a simple once-only command, such as a menu command, this re-scheduling does not occur. Periodic processes such as parameters, discovery procedures and the process cache are rescheduled at their next allotted time. Parameter scheduling is the most complicated case because of the various scheduling options (immediately, day-of-week, day-of-month, etc.). Other periodic commands such as discovery or the process cache have a simple polltime. In PATROL 3.0 these are configurable: for an application, via the "Custom Discovery Cycle" entry in the application dialog; for the process cache, by setting the variable "/processCacheRefreshInterval".

Scheduling is also affected by the agent's internal tuning measures. The agent attempts to ensure that commands run at reasonable intervals, rather than all executing at the same time and thereby causing a heavy load on the machine. The Agent does its best to schedule jobs as close as possible to their "ideal" execution time without overburdening the system with PATROL activity. Hence the scheduled time of a command, particularly a parameter, may be modified by these tuning measures, which may delay its execution until the agent is less loaded.

The most common time for this modification due to heavy load is agent startup (whenever the agent is started on a host, or when a new console connects requesting an application that is not already monitored by the agent, which then has to be loaded and quickly started). A command that needs to run immediately, such as a menu command or InfoBox item command, is placed at the front of the run-queue with an immediate execution time. "Refresh Parameters" moves all parameters to the beginning of the Main Run Queue; it does not affect parameters that are currently executing.
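The scheduling behaviour described above can be sketched with a toy model. This is illustrative Python only, not the agent's actual internals: a time-ordered queue where periodic jobs are re-queued after each run and once-only commands are not. All names are hypothetical.

```python
import heapq

class RunQueue:
    """Toy model of the agent's main run-queue, ordered by next run time."""
    def __init__(self):
        self._q = []  # heap of (next_run_time, job_name, poll_time)

    def schedule(self, name, poll_time, when):
        heapq.heappush(self._q, (when, name, poll_time))

    def run_due(self, now):
        """Pop and 'run' every job whose time has come; re-schedule
        periodic jobs at now + poll_time (one-shot jobs have poll_time None)."""
        ran = []
        while self._q and self._q[0][0] <= now:
            when, name, poll = heapq.heappop(self._q)
            ran.append(name)
            if poll is not None:   # periodic: parameter, discovery, proc cache
                self.schedule(name, poll, now + poll)
        return ran

q = RunQueue()
q.schedule("CPUCpuUtil", 60, when=0)   # parameter with a 60 s poll time
q.schedule("menu_cmd", None, when=0)   # once-only menu command
print(q.run_due(now=0))  # → ['CPUCpuUtil', 'menu_cmd']; only the parameter is re-queued
```

Note that a real agent also applies the tuning measures discussed later (RUNQ_DELTA spacing, priority, etc.), which this sketch deliberately omits.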

Page 7: Patrol Tuning Guide

PSL Run-Queue

When a PSL process is scheduled from the main run-queue and executed, it is placed on the PSL run-queue, a queue of currently executing PSL processes. The agent runs multiple PSL processes through its interpreter, one at a time. Each executing PSL process is allotted a time-slice of 50 low-level compiled instructions. The PSL interpreter runs the PSL process until its time-slice expires, or until the PSL process executes one of the built-in functions that would otherwise cause it to block. In PATROL 3.0 the built-in variable "/pslTimeslice" can be set to any number, allowing further agent tuning. There is currently no concept of priority in the PSL run-queue, so the interpreter gives all PSL processes an equal share of execution.

In PATROL 3.0 there is a wide variety of built-in functions that cause a PSL process to suspend, including locking, condition variables, response, etc. When a PSL process is suspended by one of these built-in functions, it appears on neither the main run-queue nor the PSL run-queue. Instead it waits until it is woken by an external event, at which point the agent places it back on the PSL run-queue to continue after completing the particular built-in function. For example, when executing a PSL system() call, the process is suspended until the agent detects that the external process has executed and completed. The agent detects external events such as file descriptor activity for a PSL process blocked on a read or write, and then determines whether the given read/write operation has completed and whether the PSL process is ready to continue. For PSL processes in 3.0 that are blocked on functions such as locking or shared channels, the execution of one PSL process (e.g. the one holding the needed lock) can wake up another (the process waiting for that lock). Another example in 3.0 is the PSL response() function, which completes when a message is received from the console after the user has entered all the data.
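The round-robin, equal-share interpretation can be sketched as follows. This is a hedged toy model in Python, not the real interpreter: each process is modelled simply as a remaining instruction count, and blocking built-ins are ignored.

```python
from collections import deque

# Default PSL time-slice: 50 low-level compiled instructions
# (tunable via /pslTimeslice in PATROL 3.0).
PSL_TIMESLICE = 50

def interpret(processes):
    """processes: dict of name -> total instruction count to execute.
    Runs one time-slice per process, round-robin, with no priorities.
    Returns the order in which processes finish."""
    ready = deque(processes.items())
    finished = []
    while ready:
        name, remaining = ready.popleft()
        remaining -= PSL_TIMESLICE           # run one time-slice
        if remaining <= 0:
            finished.append(name)            # process completed
        else:
            ready.append((name, remaining))  # back to the end of the queue
    return finished

# A 120-instruction script needs three slices; a 40-instruction one finishes first.
print(interpret({"discovery": 120, "collector": 40}))  # → ['collector', 'discovery']
```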

Page 8: Patrol Tuning Guide

Patrol Agent Tuning Variables

The Agent has a number of built-in tuning variables. Some of these variables set the priority at which various Agent activities execute. Others define the criteria that the Agent uses for scheduling the various "jobs" it is responsible for executing. For the most part, the default values work reasonably well to ensure a good balance between monitoring the system and affecting it. All Agent tuning variables begin with /AgentSetup/AgentTuning and map to a macro that allows the value to be set using the %SET command.

For example, /AgentSetup/AgentTuning/runqSchedPolicy defines the policy that the Agent will use to schedule jobs on the Main Run Queue. It maps to the macro RUNQ_SCHED_POLICY and can be set using the command: %SET RUNQ_SCHED_POLICY=<value>.

Other ways to set agent tuning variables are via the PATROLAGENT icon, using MB3 ⇒ Tune Agent. To obtain the current value of the Agent tuning variables, use %SET with no arguments in the Agent OS window. You can also use pconfig, wpconfig (Win32) or xpconfig (UNIX) to set these configuration variables. The default values for the agent tuning variables are:

RUNQ_SCHED_POLICY = 1
RUNQ_DELTA = 8
RUNQ_DELTA_INCREMENT = 2
RUNQ_MAX_DELTA = 40
GET_PROCS_CYCLE = 300
PROC_CACHE_SCHED_PRIORITY = 1
APPL_CHECK_CYCLE = 40
AGENT_PRIORITY = 10
USER_PRIORITY = 0
PROC_CACHE_PRIORITY = 10
PSL_INSTRUCTION_MAX = 500000
PSL_INSTRUCTION_PERIOD = 7200

We will now explain what they mean and how they can be used.
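If you prefer pconfig over %SET, a change file along these lines can be applied with pconfig/wpconfig/xpconfig. This is a sketch of the usual PATROL_CONFIG change-file syntax; the values shown here are arbitrary examples, not recommendations:

```
PATROL_CONFIG
"/AgentSetup/AgentTuning/runqDelta" = { REPLACE = "10" },
"/AgentSetup/AgentTuning/applCheckCycle" = { REPLACE = "60" }
```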

Page 9: Patrol Tuning Guide

Main Run Queue Tuning Variables (RUNQ_)

The runqSchedPolicy (RUNQ_SCHED_POLICY) variable defines the policy used for scheduling jobs on the Main Run Queue. The scheduling policy is defined by summing one or more of the following values:

VALUE  POLICY NAME        STRATEGY
1      SCHED_FROM_END     (SCHED_DEFAULT) next execution = finish time + poll time
2      SCHED_FROM_PREV    next execution = last execution time + poll time
4      SCHED_OPTIMAL      recalculate optimal scheduling upon completion of each job; an expensive calculation that tries to find a slot furthest from its neighbors
8      SCHED_FORCE_DELTA  force an interval of RUNQ_DELTA between executions

Associated scheduling variables:

RunqDelta (RUNQ_DELTA): the gap in seconds between jobs in the Main Run Queue. Must be greater than 0. Default is 8.

RunqDeltaIncrement (RUNQ_DELTA_INCREMENT): the increment (in seconds) used when checking for a gap in the Main Run Queue. This variable affects the granularity of the search for the ideal slot in which to schedule a job: the smaller the value, the finer the search for the perfect time and the greater the processing required. Must be less than or equal to RunqDelta. Default is 2.

RunqMaxDelta (RUNQ_MAX_DELTA): the maximum time (in seconds) that the scheduling of a job will be delayed in order to honor RUNQ_DELTA. Cannot be less than 10. Default is 40.

Page 10: Patrol Tuning Guide

Tuning the Main Run Queue

A runqSchedPolicy of 1 schedules execution of a job based on the ending time of its previous execution. A value of 2 schedules execution based on the start time of the previous execution. Values 1 and 2 are mutually exclusive.

By setting the scheduling policy to SCHED_FROM_END you slow down the rate at which parameters actually execute: a parameter's next execution is scheduled from the time at which it completed, not from the time at which it last started (the SCHED_FROM_PREV behavior). This means that no matter what, there is a one-interval breather between successive executions of a parameter. This is critical on really heavily loaded machines, because if a parameter takes longer than its poll time to run, it would otherwise be executed again immediately after terminating. On a less heavily loaded machine it is still important, because parameters can run for substantial lengths of time (say 10 seconds out of their 60-second interval).

If runqSchedPolicy is set to 1 or 2, the scheduling algorithm forces a spacing of RUNQ_DELTA seconds between execution times of jobs in the Main Run Queue. runqMaxDelta represents the maximum number of seconds that a job will be delayed from its "ideal" execution time.

Adding in a value of 4 (SCHED_OPTIMAL) causes the Agent to use an algorithm that attempts to find the "optimal" execution time by scheduling execution at a time that is furthest from its neighbors. It instructs the Agent to recalculate the run-queue slot after each execution. This calculation is expensive.

Adding in a value of 8 (SCHED_FORCE_DELTA) should be tried only after other measures, such as changing the poll times of various parameters, have failed to reduce Agent load. Including SCHED_FORCE_DELTA means that consecutive jobs must be at least RUNQ_DELTA apart, and RUNQ_MAX_DELTA does not apply. It literally means: force an interval of RUNQ_DELTA between execution times. This affects scheduling of all jobs in the Agent's Main Run Queue and can allow for an indefinite delay in scheduling. With all other scheduling policies the agent may schedule more than one parameter to start executing at the same time. With SCHED_FORCE_DELTA the agent will not schedule more than one parameter for execution at the same time; furthermore, it will not schedule the execution of another parameter until RUNQ_DELTA seconds after the start time of the last parameter. The impact is that if all parameters take less than 8 seconds (the default RUNQ_DELTA) to run, you would never see more than one parameter at a time being run by the agent.
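The difference between the two mutually exclusive base policies is easy to see with a little arithmetic. The Python below is a hedged toy calculation, not agent code, for a parameter with a 60-second poll time that takes 10 seconds to run:

```python
# Compare SCHED_FROM_END vs SCHED_FROM_PREV start times for a parameter
# with a 60 s poll time and a 10 s runtime (illustrative numbers).
POLL, RUNTIME = 60, 10

def next_runs(policy, start=0, count=3):
    """Return the first `count` start times under the given policy."""
    times, t = [], start
    for _ in range(count):
        times.append(t)
        finish = t + RUNTIME
        # FROM_END: reschedule from the finish time; FROM_PREV: from the start time
        t = finish + POLL if policy == "SCHED_FROM_END" else t + POLL
    return times

print(next_runs("SCHED_FROM_PREV"))  # → [0, 60, 120]: fixed cadence
print(next_runs("SCHED_FROM_END"))   # → [0, 70, 140]: a full poll-time breather after each finish
```

Note how SCHED_FROM_END drifts later by one runtime per cycle, which is exactly the "breather" the text describes; if RUNTIME exceeded POLL, SCHED_FROM_PREV would re-run the parameter immediately on completion.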

Page 11: Patrol Tuning Guide

Priority Variables

AgentPriority (AGENT_PRIORITY): controls the priority of the PatrolAgent process. Defaults to 10, which means that it executes at a priority 10 lower than other processes running on the system. On UNIX, the higher the priority number, the lower the priority: if the normal priority is 20, the Agent runs at a priority of 30, which is 10 lower than normal.

ProcCachePriority (PROC_CACHE_PRIORITY): controls the priority of the process that refreshes the process cache.

ProcCacheSchedulePriority (PROC_CACHE_SCHEDULE_PRIORITY): controls the priority of the process cache schedule. Defaults to 1.

UserPriority (USER_PRIORITY): controls the priority of OS processes created by the Agent.

By default USER_PRIORITY is zero, which causes all commands from KMs to be executed at normal scheduling priority. Both the agent process itself and the 'ps' that it executes run at a priority of +10, which means that if a normal user has something to run, the agent takes a back seat for a while and lets them do their work first (up to a limit). You can try setting USER_PRIORITY to 10 as well, so that all external commands executed by the agent behave the same way. This gives more time to users of the system, though the agent's external commands will take a little longer to run.
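These priority values follow ordinary Unix "nice" semantics: a positive increment lowers scheduling priority. A process can demote itself the same way an AGENT_PRIORITY of 10 demotes the agent. A minimal, hedged demonstration in Python (the mapping to the agent's variables is conceptual only):

```python
import os

# Read the current niceness without changing it (incrementing by 0),
# then demote this process by 10, analogous to AGENT_PRIORITY = 10.
base = os.nice(0)
lowered = os.nice(10)
# Unprivileged processes may only raise their nice value (lower priority),
# never lower it back; that is why such demotions are effectively one-way.
print(base, lowered)
```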

pslInstructionMax and pslInstructionPeriod

• pslInstructionMax, macro: PSL_INSTRUCTION_MAX
• pslInstructionPeriod, macro: PSL_INSTRUCTION_PERIOD

These work together to define a rate used to determine whether a PSL script might be in an infinite loop. If a PSL process executes more than pslInstructionMax PSL instructions within pslInstructionPeriod seconds, the PSL process incurs an internal scheduling delay of 200 milliseconds. If this happens again within the same execution cycle, the process is delayed 400 milliseconds, and so on. After the delay, the job is put back on the PSL Run Queue. A value of 0 for pslInstructionMax turns off the delay. A value of 0 for pslInstructionPeriod makes pslInstructionMax the total number of PSL instructions a PSL process can execute without incurring the delay. pslInstructionMax defaults to 500000; pslInstructionPeriod defaults to 7200. The scheduling delay is accompanied by the message: "PSL may be in infinite loop".
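The throttle logic can be sketched as below. This is an illustrative Python model; in particular, the text only states "200 milliseconds ... 400 milliseconds, and so on", so the doubling schedule after the second offence is an assumption of this sketch:

```python
PSL_INSTRUCTION_MAX = 500_000
PSL_INSTRUCTION_PERIOD = 7_200  # seconds

def throttle_delay(executed, elapsed, prior_delays):
    """Return the delay (ms) to impose on a PSL process, or 0 if within budget.
    prior_delays = number of times this process was already throttled this cycle.
    ASSUMPTION: delays double (200, 400, 800, ...) after the first offence."""
    if PSL_INSTRUCTION_MAX == 0:
        return 0  # pslInstructionMax = 0 turns the delay off entirely
    over_budget = (executed > PSL_INSTRUCTION_MAX
                   and (PSL_INSTRUCTION_PERIOD == 0
                        or elapsed <= PSL_INSTRUCTION_PERIOD))
    return 200 * (2 ** prior_delays) if over_budget else 0

print(throttle_delay(600_000, 100, 0))  # → 200: first offence
print(throttle_delay(600_000, 100, 1))  # → 400: second offence in the cycle
print(throttle_delay(100_000, 100, 0))  # → 0: within budget
```

With pslInstructionPeriod set to 0, the `elapsed` check drops out, so pslInstructionMax becomes a total instruction budget, matching the description above.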

Page 12: Patrol Tuning Guide

getProcsCycle

• getProcsCycle, macro: GET_PROCS_CYCLE

The GET_PROCS_CYCLE setting controls how often the agent executes a 'ps' to refresh its process cache (default is 300 seconds). If the machine has a very large number of processes (>= 5000), you might consider increasing this value; otherwise it should be fine.

applCheckCycle

• applCheckCycle, macro: APPL_CHECK_CYCLE

The application discovery cycle APPL_CHECK_CYCLE (default is 40 seconds) is how often the agent runs the discovery scripts of applications that don't have a custom discovery cycle. If the agent is consuming a lot of CPU or is running very slowly, you can raise this variable all the way up to GET_PROCS_CYCLE. The larger this value, the longer it takes the agent to discover changes in the state of applications. A setting of 60 or 120 seconds is probably not unreasonable and will save considerable CPU time.

Page 13: Patrol Tuning Guide

Other useful commands to see what the agent is doing

%DUMP commands

The following reports might be valuable in determining Agent activity. They generate output in the Agent OS window. The output can be significant and might require that the text window size be increased (available in 3.2 only). Remember that increasing the size of text windows affects all text windows, and should be reset back to the default when you're done.

COMMAND          PURPOSE
%DUMP_ALL        Dump of all structures.
%DUMP_ALL_INSTS  Application instances.
%DUMP_APPS       Application descriptions.
%DUMP_ERRORS     Backtrace of registered errors.
%DUMP_GLOBALS    Global variables (registered file descriptors, process cache, etc.).
%DUMP_KM_LIST    KMs currently loaded, version, static (y/n), number of consoles currently attached.
%DUMP_PARAMS     Parameter descriptions and instances.
%DUMP_RTLIST     List of currently executing parameters, tasks, etc.
%DUMP_RUNQ       Queue of runnable processes.
%DUMP_TASKS      List of executing tasks.

Page 14: Patrol Tuning Guide

Debugging options

Debugging options can be set from the command line or dynamically using %SET. Remember to start the Agent/Console in such a way as to capture STDOUT (and STDERR).

For korn shell: PatrolAgent -p <port> -debug <debug options> 1><filename> 2>&1
For c shell:    PatrolAgent -p <port> -debug <debug options> >& <filename>

DEBUG OPTION  %SET VALUE  PURPOSE
PARAM         1           Useful for why a parameter is not running or why it is suspended.
APPLS         2           Application or application instance debugging. Useful for why an application instance is running or not running.
COMM          8           Equivalent to the -c option. Shows messages between console and agent.
IDENTITY      16          Useful for username/password debugging.
RUNQ          32          Useful for scheduling problems or high CPU problems.
PASSWD        64          Useful for username/password debugging.
SELECT        128         Useful for high CPU problems.
PROC          256         Useful for process cache problems.
FDS           512         Useful for debugging I/O problems for subsequent processes.
MAIN          4096        Useful for startup problems (i.e. agent won't start).
SESSION       8192        Useful for connection problems.
PSLA          16384       Useful for determining which PSL process is running. Also useful for high CPU problems.
UDP           32768       Useful for connection problems.
PCFG          65536       Useful for pconfig or xpconfig problems.
MENU          131072      Useful for menu button problems on the console.
EXEC          262144      Useful for determining what processes are created by the agent.
FTP           1048576     Useful for commit debug (probably not useful in 3.2).
KM            4194304     Useful for determining which KM is being loaded.
HISTORY       8388608     Useful for history corruption problems (3.1 only).
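
The %SET values in the table are all powers of two, which suggests they are bit flags that can be combined by adding (OR-ing) them into a single -debug value. A minimal sketch under that assumption (the helper function and its name are ours, not part of PATROL):

```python
# Debug option bit values, taken from the table above.
DEBUG_FLAGS = {
    "PARAM": 1, "APPLS": 2, "COMM": 8, "IDENTITY": 16, "RUNQ": 32,
    "PASSWD": 64, "SELECT": 128, "PROC": 256, "FDS": 512,
    "MAIN": 4096, "SESSION": 8192, "PSLA": 16384,
}

def debug_value(*names):
    """OR together the named flags to build a combined debug value."""
    value = 0
    for name in names:
        value |= DEBUG_FLAGS[name]
    return value

# e.g. to investigate a high-CPU problem: RUNQ + SELECT + PSLA
print(debug_value("RUNQ", "SELECT", "PSLA"))  # 32 + 128 + 16384 = 16544
```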

Process Cache Extraction Process

One of the main underpinnings of the Patrol Agent is its internal extraction of the process cache. It does this using a periodic internal command which forks off some form of the UNIX "ps" command (or an equivalent command on other platforms), performs some manipulation of the "ps" output, and stores it in an internal table. This internal process cache table can be accessed by the PSL process() function; it is also used in simple discovery for process detection, and by the PSL proc_exists() function. The use of "ps" ensures portability of the Patrol Agent across multiple platforms, but trades away some performance, since "ps" is not always the quickest way to get some forms of information. Nevertheless, this method is still used in Patrol 3.0.

Process cache loading can be adjusted using various enhancements, added to work around problems we have had in the past supporting platforms with different "ps" output formats. Each of the Patrol Agent executables we ship has hardcoded format processing for its particular platform (i.e. "ps -ef" on some platforms, "ps -aux" on others, etc.), which solves most platform-dependent problems. There is also the PS_FIELDS text file, which can be used to override the column offsets used; the "pscorrect" executable, which can solve occasional problems; and the environment variable PATROL_PS_COMMAND, which can override the "ps" command executed.

The process cache internal process is a periodic agent process that runs every 300 seconds by default. This value can be changed to any required interval by setting the internal variable "/processCacheRefreshInterval". The process cache cycle also determines whether a given discovery cycle is a "full" or a "partial" discovery cycle. The PSL boolean function full_discovery() can be used in a PSL discovery script to determine if the current application is in a "full" discovery cycle: a full discovery cycle is the first discovery cycle for a given application since the process cache was refreshed (i.e. the first time that application's discovery has executed since the most recent update of the internal "ps" process cache data).
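
The mechanism can be illustrated with a short sketch: fork a "ps", parse its output into a table once per cycle, then answer proc_exists-style lookups from the cached table until the next refresh. The "ps -ef"-like column layout and the matching rule here are assumptions for illustration, not the agent's actual hardcoded parsing:

```python
# Illustrative process-cache sketch: parse "ps"-style output into
# a table once per cycle, then answer lookups from the cached table.
SAMPLE_PS_OUTPUT = """\
UID   PID  PPID CMD
root    1     0 /sbin/init
patrol 42     1 PatrolAgent -p 3181
oracle 77     1 ora_pmon_PROD
"""

def load_process_cache(ps_output):
    """Build the internal table from the ps listing (header skipped)."""
    cache = []
    for line in ps_output.splitlines()[1:]:
        uid, pid, ppid, cmd = line.split(None, 3)
        cache.append({"uid": uid, "pid": int(pid), "cmd": cmd})
    return cache

def proc_exists(cache, name):
    """Rough analogue of PSL proc_exists(): substring match on CMD."""
    return any(name in entry["cmd"] for entry in cache)

cache = load_process_cache(SAMPLE_PS_OUTPUT)
print(proc_exists(cache, "PatrolAgent"))  # True
print(proc_exists(cache, "sendmail"))     # False
```

The point of the design is visible here: one (comparatively expensive) "ps" fork serves many cheap lookups until the cache is refreshed.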

Pre-discovery and Discovery

The concept of pre-discovery scripts applies to PATROL 2.0 but is largely obsolete in PATROL 3.0. In PATROL 2.0, a pre-discovery script executes to determine whether it is likely that the given application exists on an agent. This is basically an optimization that prevents a 2.0 console from having to send all the knowledge (i.e. a large discovery script) to agents that may not need it. Since Patrol 3.0 stores knowledge locally with each agent, this performance issue disappears: knowledge is not sent over the network by a 3.0 console (except when running a development console under some conditions). Once pre-discovery sets the "active" variable to 2, discovery starts.

Basic agent Run Queue monitor and Automatic Agent Tuning

The PATROLAGENT application includes parameters for monitoring the load and work rate of the Patrol Agent. These parameters can be viewed by drilling down into the PATROLAGENT application icon.

PATROLAGENT monitoring parameters:

PARAMETER NAME           SOURCE              PARAMETER FUNCTION
PAWorkRateExecsMin       %{/execsPerMin}     Average number of external processes spawned by the Agent per minute. This value should be as low as possible.
PADeltaBetweenExecsSecs  %{/timeBtnExecs}    Average time between job executions. This value should be as high as possible; ideally, greater than RUNQ_DELTA.
PAOutstandingJobs        %{/executingProcs}  Number of processes started by the Patrol Agent that are still executing. This includes OS processes as well as PSL commands.

PAWorkRateExecsMin has a built-in recovery action that sets RUNQ_DELTA to 300 seconds and RUNQ_SCHED_POLICY to 9 if the parameter goes into alarm (by default when the value reaches 25). RUNQ_SCHED_POLICY is reset when the alarm goes away; this is done in the collection command itself. Setting the policy to 9 has the effect of spreading out the workload by trying to reduce any peaks in the PATROL Agent's run queue. (Policy 9 is in fact "schedule from start" with a forced delta of RUNQ_DELTA.) Note that this form of tuning neither relates to nor takes into consideration the CPU usage of the machine the PATROL Agent is running on. If the PATROL Agent has "n" jobs to execute in the run queue, it will still execute them; however, it will try to schedule the work such that any peaks are reduced or "smoothed out". In some cases, forcing the delta might have an undesirable effect (actually slowing down collection!). Where you don't want this type of rescheduling (such as on development machines where CPU peaks are considered normal), you might want to disable this recovery action. If you disable it, the agent will not reschedule and will not try to spread the load!

It is very important to run only one PATROLAGENT.km on your system. There is also an application PATROL_NT.km; if both KMs are active on your system, this might cause undesired results. For more information read the paper "Concurrency Issues In PATROL" by Geert De Peuter.
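
The recovery action described above amounts to simple threshold logic. A hedged sketch (the function and variable names are illustrative; the real action is PSL inside the PAWorkRateExecsMin collector):

```python
# Sketch of the PAWorkRateExecsMin recovery action: when the agent
# spawns too many external processes per minute, spread the load.
ALARM_THRESHOLD = 25  # default alarm point, from the text above

def tune(execs_per_min, config):
    if execs_per_min >= ALARM_THRESHOLD:
        config["RUNQ_DELTA"] = 300       # force a wide scheduling delta
        config["RUNQ_SCHED_POLICY"] = 9  # "schedule from start", smoothed
    else:
        config.pop("RUNQ_SCHED_POLICY", None)  # reset when the alarm clears
    return config

print(tune(30, {}))  # {'RUNQ_DELTA': 300, 'RUNQ_SCHED_POLICY': 9}
```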

Make sure the agent is healthy

A healthy installation is always the best starting point, and most installations have something a little wrong. The littlest things can cause the most CPU consumption. A good example is a single file that does not have the proper read/write permissions, so that the Patrol Agent continuously tries to write to it. The following steps should help create a healthy environment.

Make sure Patrol is being started properly

If Patrol is configured to start up automatically from the /etc/inittab file, there may be a problem. UNIX executes all programs from the inittab file as "root", and "root" does not have environment variables appropriate for Patrol. It may also change the ownership of some Patrol files, so that when Patrol is later executed as another user, it cannot write to those files. If you want to continue starting Patrol from the inittab file, you'll need to change some things:

a] Make sure the $PATROL_HOME variable is set to the appropriate location.
b] Change the PATROL_ADMIN environment variable to equal the install account for Patrol, for example: export PATROL_ADMIN=patrol. This tells the Patrol Agent to write all of its files as "patrol".
c] If you have the agent started through inittab on UNIX, remove the "respawn" option from the inittab entry that starts the Patrol Agent. That way, if something goes wrong with the agent (typically a demo license that expires), it will not be respawned endlessly.
d] If you are able to execute the provided PatrolAgent startup script, do so. This ensures the correct setting of all the necessary environment variables.
e] If available, switch your script to the rc.d method of startup. This method provides much better control of when and how to start the Patrol Agent, and it also allows for a stop script.

If you are starting the Patrol Agent manually, you should use the provided PatrolAgent startup script. This script sets all the environment variables and starts the PatrolAgent process. If you use your own scripts, pay careful attention to the environment variables that are set before the PatrolAgent process is executed.

Always stop the Patrol Agent in a nice way

Never kill the Patrol Agent using the "-9" option. The "-9" option is an abort option for the kill command that does not allow the Patrol Agent to close its history files. If you have used the "-9" option, there is a good chance that you will need to clean out the history files. In documentation it is often wise to recommend that end users use the "-15" option. This option is the default, but specifying it makes very clear which option should be used (i.e. kill -15 30293). Patrol also provides a way in which non-root users can shut down the agent safely: the pconfig utility, located in the $PATROL_HOME/bin directory, has an option that allows for the safe termination of the Patrol Agent. The syntax is:

pconfig +KILL

Of course, your environment variables must be set up properly. Everyone should use this method, even in scripts. On NT, you might be tempted to use one of the kill commands as well; it is better to stop the service from the Services control panel. If you would like a command-line method of stopping a PATROL agent on NT, the "sc" tool from the NT resource kit can do the job.
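
The reason "kill -9" corrupts history is that SIGKILL can never be caught, so no cleanup code ever runs, whereas SIGTERM (-15) can be trapped to close files first. A small Python illustration of this OS behavior (the cleanup handler is hypothetical, not the agent's actual code; the SIGKILL behavior applies on Unix):

```python
import signal

def on_sigterm(signum, frame):
    pass  # e.g. flush and close history files, then exit cleanly

# SIGTERM (kill -15) can be trapped, so shutdown code gets to run.
signal.signal(signal.SIGTERM, on_sigterm)

# SIGKILL (kill -9) can never be trapped: the OS refuses the handler,
# so a process killed this way gets no chance to close its files.
try:
    signal.signal(signal.SIGKILL, on_sigterm)
    sigkill_trappable = True
except (OSError, ValueError):
    sigkill_trappable = False

print(sigkill_trappable)  # False on Unix
```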

Check the license key

An expired license key will cause the Patrol Agent not to start. Sometimes a new license key received from BMC Software is not applied to all the Patrol Agents, and many things can occur as a result, ranging from "connection denied due to license" to "License Expired" with the Patrol Agent refusing to start. The first message happens when you are using the demo license and try to connect with more than one console; the Patrol Agent will refuse the connection. The license file is located in the $PATROL_HOME/lib directory and should have the name "license.`hostname`". If it does not exist, you are using the demo license key. For convenience the agent will also look in the file "license", without the ".`hostname`" extension. This is an easy method for sending the same license key to all systems using a software distribution package.

Scan the Error Logs

Often, a Patrol Agent having problems will log its problem into an error log. The default error log location for the Patrol Agent is $PATROL_HOME/log, and the file name will be PatrolAgent-`hostname`-`portnum`.errs. You should scan this file for obvious errors. If you do not see any, proceed on; we will look at it again later.

Make sure all the current patches are applied

Patches make a difference. Patrol has patches for the Patrol Agent, Patrol Console, and all the Patrol Knowledge Modules. It is a must to apply all the current patches before proceeding. There is no sense in finding a problem that has been discovered before; no awards will be given. It is best to start at the current patch level and move on. People are very curious as to "Why is it doing that?" and will try to figure it out, but in a production environment it is often better to apply patches, eliminate the known, and move on. You might get a reward for your hastiness. Each patch has a synopsis of the problems it fixes, and your problem could be a direct problem or a side effect of one. For example: the UNIX KM has a fix in the 3.1.9a7 patch that includes an improved "ps" collector for some platforms. You will not find a direct bug listed for the ps collector, but it consumes much less CPU on the affected platforms after the patch.

Make sure file permissions are correct (UNIX)

It is definitely possible for permissions to get changed: applying patches can change them, and I have even seen backup scripts change the permissions on files. If the permissions or ownership of the Patrol Agent files change so that Patrol cannot read or write the files, there will be a problem. The best approach is not to even check the permissions; just change them all to the proper values. Do this by logging in as root and changing directory to the Patrol installation directory (note: this should be one level back from $PATROL_HOME). Perform a "chown -R patrol *", then

execute "chmod -R 755 *". This resets all the permissions to a proper starting value and changes the ownership to the userid "patrol". If you are using a different userid, change the chown statement to fit your needs. After these two commands have been run, execute the "configure" script that was supplied by BMC Software. It should be in the Patrol installation directory.

Remove/Examine the config files

The config files can create problems, too. They sometimes contain old variables, or variables set to wrong values, that cause high CPU in the agent. If you do not have sole control over the configuration variables in the agent, this is also a prime target. For example, if someone configured the agent to load all the NT KMs on a UNIX box, those KMs will cause the agent to consume memory; on a low-memory box, the Patrol Agent could be swapped in and out of memory constantly, causing high CPU. Another example: the Patrol Agent has been configured to monitor an RDBMS which has since been removed, so Patrol constantly schedules a discovery script that looks for the missing application. Starting with a fresh config file can sometimes be desirable. If you want to see what is configured, run the command pconfig +get -port `portnum`. This will list all the customizations that differ from the default settings. You can save this output to a file, edit the file, and then reapply the new configuration to the agent using the command pconfig +set -port `portnum` `filename`. You must remove the old configuration database before reapplying the new configuration. Sometimes the config files get corrupted for the same reasons the history files do; in that case it is best to remove the config files and start over (rm -rf $PATROL_HOME/config). The Patrol Agent will create new ones. The reconfiguration of the Patrol Agent happens at a later step, so for now just save your configuration files.

Check KM specific file access

A loaded knowledge module may be trying to read or write a file that it does not have access to. The Informix KM is notorious for this type of behavior when the Informix error log file is not readable by the username/password that has been placed in the User Name/Password area for the KM. Other KMs that do file operations may exhibit the same behavior. In most cases this will result in parts of a KM remaining OFFLINE or not being discovered. Also, most KMs issue messages on the system output window.

Check the system output window

If a KM finds a serious error, it normally writes the error to the system output window. If nothing is wrong with your installation, the system output window will only show the KM loading lines and perhaps some initialization messages. Even these messages should be read, because they could indicate something is wrong.

Check for undiscovered or offline KMs

KMs that are loaded on your system, and for which you have the application installed, should show up and collect data. If they don't look as shown in the manual, there is probably something wrong, and this can cost extra CPU on your system. It is therefore advisable to compare the KM hierarchy as you see it in your console with the hierarchy or explanation in the manual.

Check Debug Settings

Ensure the PatrolAgent process is not running with the -debug option enabled, as the I/O from debug output can be costly. Also ensure that PslDebug PSL runtime tracing is not enabled: used improperly, it can send large volumes of output over the network to the console and waste CPU.

Check for customizations

Sometimes, during tests, the out-of-the-box KMs get changed for functionality testing. These tests are sometimes exhaustive and certainly not meant for production. You should check the customizations you have made to the out-of-the-box KMs. In particular, check for modified discovery cycles or parameter cycles that cause parameters or discovery to run more often than out of the box.

Large history file

If you have run PATROL for ages and tried multiple KMs on a certain agent, you can suffer from history pollution. For each parameter instance, the agent keeps a record in an index file and a parameter data file. When the agent keeps using the same instance SIDs (as almost all KMs do), history is cleaned up by the history retention period setting. However, KMs like Active Processes use the process name as the instance SID, which means that history records are kept for every process that ever existed on the system. To find out how many index entries you have in your index files, type the following command:

dump_hist -s 00000000 -e 00000000

(The eight zeros are very important!) If you see a lot of ancient instances in the index, you should run fix_hist. This is explained later in the paper.
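
The pollution mechanism is easy to demonstrate: with a stable instance SID the set of history keys stays bounded, while with process names as SIDs every process that ever existed leaves an index entry behind. A toy model (not the real index file format):

```python
# Toy model of history index growth under two SID strategies.
# Each "day" the same logical application runs, but different
# short-lived processes come and go.
days = [
    ["netscape", "xterm"],
    ["xterm", "make"],
    ["gcc", "xterm"],
]

stable_index = set()       # KM uses a fixed instance SID
per_process_index = set()  # KM uses process names as SIDs

for processes in days:
    stable_index.add("CPU_0")          # one entry, reused every cycle
    per_process_index.update(processes)  # one entry per process name, forever

print(len(stable_index))       # 1
print(len(per_process_index))  # 4
```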

I get the message "PatrolAgent is running low on memory"

The amount of resources such as CPU and memory used by Patrol is directly related to the applications that are loaded, the instances being monitored, and the active parameters defined for each application. Resource utilization is also affected by the parameters for which history is being collected, the history retention period, how much history is retained in memory before being flushed to disk (/AgentSetup/prmHistCacheSize), and how often the history is flushed to disk (/AgentSetup/prmHistCacheFlushTimer). On Unix, the agent allocates memory with the standard "malloc" call. The Patrol agent reserves a portion of memory (256K at this time) when it starts. The "PatrolAgent is running low on memory" error message is displayed when malloc fails and the agent starts using the reserved memory. This is an indication that the system is running out of memory resources. Possible causes include:

- Kernel memory parameters need tuning. Specific memory parameters will vary depending on the OS and version.
- Lack of physical memory.
- Not enough swap space.
- A "memory leak" in the Patrol agent or some other process. If this is the problem, the "ps" command can be used to identify processes that continue to increase the amount of memory used.
- Many KMs loaded, or single KMs that have many instances (and therefore also the parameters for each instance).

Measure and find the bottlenecks

First of all we need to define how much we want to tune the agent, and what should be considered "normal" CPU usage. The target CPU level for a normally operating Patrol Agent is an average of 5-6% with one OS KM and one DBMS KM. Application KMs tend to take a little more CPU, and in any case, the more KMs loaded on the machine, the more computer resources will be consumed. When measuring the CPU that the Patrol Agent is consuming, always take into account that instantaneous CPU and average CPU can be extremely different. The Patrol Agent will sometimes execute a command that takes a large amount of CPU for a short time. There is no way around it: the collection must be done to make sure that the machine is healthy, and even the most efficient collection routine can take a large amount of CPU. The goal is simply not to waste any of the CPU that is spent during collection. Also, make sure that you know how to interpret the CPU data. The most common interpretation error occurs when a "ps" statement is executed on a multiprocessor box. The ps command reports the true number of CPU seconds that the process takes, but the %CPU is calculated against one CPU. For example, if you have a 6-CPU system, you have 360 CPU seconds available each minute. A process that takes 60 CPU seconds every minute will therefore show as 100% of a single CPU, but actually consumes only 16.7% of the machine's available CPU.
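
The arithmetic from that example can be checked directly. A sketch of the interpretation (not a measurement tool; the function name is ours):

```python
# Interpreting per-process %CPU on a multiprocessor machine.
def machine_cpu_percent(cpu_seconds, interval_seconds, num_cpus):
    """Fraction of the whole machine's capacity actually consumed."""
    available = interval_seconds * num_cpus
    return 100.0 * cpu_seconds / available

# A process using 60 CPU seconds every minute on a 6-CPU box:
# "ps" reports it against one CPU, but the machine-wide share
# is only a sixth of that.
print(machine_cpu_percent(60, 60, 1))           # 100.0
print(round(machine_cpu_percent(60, 60, 6), 1))  # 16.7
```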

Test vs Production Gotchas

It is important to note that while the cost of retrieving the data from poll to poll is pretty much the same on any one system, the same parameters can cost very different amounts on different-sized systems. A good example of this is the ORACLE7.KM, where the space parameters can be expensive due to the work the RDBMS must do to collect the results. With a small database the resource requirements can be minimal, but for very large databases they can be noticeable.

Look for Common CPU Hogs

If the agent is consuming a lot of CPU, a probable cause is an errant Knowledge Module. Knowledge Modules tell the Patrol Agent what to do; if the Patrol Agent cannot perform a task, or the task it is asked to do takes a lot of CPU, then the Patrol Agent will consume a lot of CPU. Take a look at these steps to see if your installation looks similar:

Printers (UNIX)

If there are many printers on the box being monitored, you might see a CPU increase. The "lpstat" command is very inefficient when trying to get the status of a large number of printers. Some companies manage a single printer definition file and place that file on every box for simplicity. Each box may end up with 100 printers or more that do not even belong on that subnet; they may even be in another town across a slow dial-up line. Patrol will try to retrieve the status of all those printers on all those machines across those slow lines, so it is recommended to turn off printers that do not reside on that machine. In the same scenario, lpstat is also very inefficient if a print queue does not exist: it will return error messages to Patrol for every printer that is defined but has no print queue. Again, it is advisable to turn off printers that do not pertain to that machine. If you do not want to see printers at all, remove the printer KM after you load the UNIX3.km from the Application Classes List; this removes all checking for printers from the agent. It is normal for people not to monitor printers. Printers are usually handled at a very local (departmental) level, and shouldn't necessarily bother the people responsible for enterprise-wide monitoring. It is important to tune Patrol to only flash red when real attention is necessary. A good test is to ask: "If that printer runs out of paper, am I the one responsible for putting more in?" If the answer is no, printer monitoring may be a good candidate for removal. In any case, only monitor the printers that are locally defined on that box, or at least defined on the same subnet.

NFS (UNIX) or Windows File Sharing

On UNIX there have been a lot of problems when running Patrol in an NFS environment. Many of the problems stem from the fact that "root" privileges do not cross an NFS mount. Typical symptoms: the areas being written to do not allow write access by the user trying to write the log files (this could be "bin" if Patrol is not configured correctly), or multiple Patrol Agents are trying to update the same log file at the same time. The last problem is usually isolated to the Patrol Event Manager. If any of these symptoms sound suspicious, make sure the Agent is running as a valid user (configure the agent). Also make sure that PATROL_ADMIN is set to a user that can cross NFS mounts (not root or bin; try patrol), and try setting the TMP_PATROL environment variable to a different location for each agent; this is the environment variable that Patrol uses to place the event manager error log files. An issue you can encounter on both UNIX and NT with file sharing is network speed: if the history files are on a shared disk and the network connection to that disk is slow, you will see this reflected in the agent's operation. For more information, read the white paper "Using Patrol in a NFS environment" by David Spuler.

Active Processes KM

The ACTIVEPROCESSES.km is used for monitoring processes that are potentially heavy users of CPU. By default it identifies and keeps track of the top 10 CPU users. This KM can be useful for identifying potential problem processes; however, it can add a significant load to the system. It should be used for a specific reason and turned off the rest of the time. Besides adding load to the system, this KM is in many cases one of the worst "history polluters", but more about that later. If the Active Processes KM is enabled, simply disable it and see if the CPU problem goes away. If so, you will have to tune the KM itself and decide whether it is really worth the cost of running it.

Incompatible OS

It can happen that the Agent is incompatible with the OS or hardware upon which it is running. This does not happen very often, but it does happen. Usually an incompatible agent will not run at all, or will fail quickly after startup. However, there have been known instances of OSes that cause the agent to "hang", "drop", or consume a lot of CPU after a period of seemingly normal operation. It is very important to notify BMC if this occurs.

PSL Performance Measurement Techniques

There are two logical steps in the measurement procedure. The first is to determine which PSL script is consuming the most resources, so that efforts can be focused on it. The second is to localize the problem area within that PSL script to lines of code or a particular function or loop. Generally speaking, a small area of the PSL script consumes most of the computation time. The PSL profiler is the best method of measuring the top PSL processes; it is available in PATROL 3.2 releases. For PATROL 3.1 and earlier there are a number of less accurate methods of measuring the top processes based on sampling the agent process list via the %PSLPS command. Once a PSL script is chosen for optimization, it is not so easy to determine which areas are heavy. Although the PSL profiler shows built-in function call counts, it does not offer any low-level line-by-line or loop count analysis of the code. This means the only method for measuring time usage within a script is to explicitly annotate blocks of the code with timing snippets of PSL code, which is effective but time-consuming. In PATROL 3.2, this timing code can use the PSL functions that generate profiling reports; in PATROL 3.1 and earlier, the main primitive is the PSL time() function. You can also add a "PslDebug = -1;" line at the top of the PSL script and analyze the output. This can give you a more detailed idea of what the script is actually doing, though it requires some experience in reading the output. A good way to start learning is to add "PslDebug = -1;" to scripts you know very well and see what output is produced. Another possibility for locating bottlenecks within the script is analysis of the quad-codes for the script. The "-q" option of the standalone 'psl' compiler can generate a quad-code listing, and the PSL optimizer options "-OP2" and "-OP3" will also produce useful reports related to quad efficiency.
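
Lacking line-level profiling, the annotation approach brackets suspect blocks with timestamps and prints the deltas. A Python analogue of the PSL time()-based technique (the helper is ours; in PSL you would bracket the block with time() calls and print() the difference):

```python
import time

# Annotate suspect blocks with timestamps to localize the hot spot,
# the same way a PSL script can bracket code with time() calls.
def timed(label, func, *args):
    start = time.time()
    result = func(*args)
    print(f"{label}: {time.time() - start:.3f}s")
    return result

# Example: time one candidate block of work.
total = timed("sum loop", lambda n: sum(i for i in range(n)), 100000)
print(total)  # 4999950000
```

Annotating each suspect loop or function this way quickly narrows the search to the small region that dominates the script's run time.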

PSL Processes List via %PSLPS

In the absence of the PSL profiler (i.e. for PATROL 3.1 and earlier), we can use simpler sampling methods. The list of PSL processes currently running in the agent can be sampled and used to generate a rough picture of the cost of each PSL process. We can view the PatrolAgent's PSL process list via the "%PSLPS" button in the Patrol Debugger GUI, or by executing the command "%PSLPS" as a PSL command or task. Below is an example of the condensed results returned from the %PSLPS command.

============== PSL Process List ==============================
PID NAME                 STATUS TYPE    APPLICATION   INSTANCE  PARAMETER
--- ----                 ------ ----    -----------   --------  ---------
18  PATROL_NT            HALTED DISCOV  PATROL_NT     -         -
19  NT_SYSTEM            HALTED PREDISC NT_SYSTEM     -         -
20  NT_SYSTEM            HALTED DISCOV  NT_SYSTEM     -         -
21  SYSObjectsColl       HALTED PARAM   NT_SYSTEM     NT_SYSTEM SYSObjectsColl
22  SYSSystemColl        HALTED PARAM   NT_SYSTEM     NT_SYSTEM SYSSystemColl
23  NT_SERVER            HALTED PREDISC NT_SERVER     -         -
110 ELMAppEvDriven       IOWAIT PARAM   NT_EVLOGFILES NT_EVAPP  ELMAppEvDriven
112 ELMAppColl           HALTED PARAM   NT_EVLOGFILES NT_EVAPP  ELMAppColl
119 NETNetworkInterface… HALTED PARAM   NT_NETWORK    1         NETNetworkInterfaceColl
126 ELMSysColl           HALTED PARAM   NT_EVLOGFILES NT_EVSYS  ELMSysColl
131 ELMSysEvDriven       IOWAIT PARAM   NT_EVLOGFILES NT_EVSYS  ELMSysEvDriven
132 NETNetworkInterface… HALTED PARAM   NT_NETWORK    2         NETNetworkInterfaceColl
139 ELMSecEvDriven       IOWAIT PARAM   NT_EVLOGFILES NT_EVSEC  ELMSecEvDriven
141 ELMSecColl           HALTED PARAM   NT_EVLOGFILES NT_EVSEC  ELMSecColl
146 NETNetworkInterface… HALTED PARAM   NT_NETWORK    3         NETNetworkInterfaceColl
153 NETNetworkInterface… HALTED PARAM   NT_NETWORK    4         NETNetworkInterfaceColl
163 NETNetworkInterface… HALTED PARAM   NT_NETWORK    5         NETNetworkInterfaceColl
169 NT_CPU               HALTED PREDISC NT_CPU        -         -
170 NT_CPU               HALTED DISCOV  NT_CPU        -         -
171 CPUProcessorColl     HALTED PARAM   NT_CPU        CPU_0     CPUProcessorColl
172 MNUTST               HALTED PREDISC MNUTST        -         -
173 MNUTST               HALTED DISCOV  MNUTST        -         -
==============================================================
TOTAL: 99
==============================================================

The above list is shortened quite considerably here. The PSL Process List shows some key attributes of each PSL process. Each PSL process has a unique numerical process id, shown in the PID column. Associated with the PID is the PSL process name, shown in the NAME column. Each process has a state, shown in the STATUS column. The above example shows that all processes were in the HALTED state at the time the process listing was requested. This is quite normal: they are all currently halted and waiting for their next scheduled execution. If a PSL process had been executing when we requested the listing, the STATUS column would have identified that process with a RUNNING status. If a process is continually in the RUNNING state, that can be an indicator that this process is actually causing the problem.
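
The sampling idea can be sketched as follows: take repeated %PSLPS snapshots and count how often each PID shows up RUNNING; PIDs with high counts are the likely CPU consumers. The snapshot format below is simplified from the listing above, and the sample data is made up for illustration:

```python
from collections import Counter

# Each sample maps PID -> STATUS, as read from repeated %PSLPS output.
samples = [
    {18: "HALTED",  110: "IOWAIT", 171: "RUNNING"},
    {18: "HALTED",  110: "IOWAIT", 171: "RUNNING"},
    {18: "RUNNING", 110: "IOWAIT", 171: "RUNNING"},
    {18: "HALTED",  110: "IOWAIT", 171: "HALTED"},
]

# Count RUNNING sightings per PID across all samples.
running_counts = Counter(
    pid
    for sample in samples
    for pid, status in sample.items()
    if status == "RUNNING"
)

# PID 171 was RUNNING in 3 of 4 samples -> prime suspect.
print(running_counts.most_common(1))  # [(171, 3)]
```

This is only a rough estimator: short-lived bursts between samples are missed, which is why the PSL profiler in 3.2 is preferred when available.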

The Patrol Agent Query Tool

This is another tool that can assist you in understanding how much work the PatrolAgent is performing. The resource consumption of the PatrolAgent is like most things: nothing is free. The more parameters the PatrolAgent is expected to monitor, the more CPU, I/O and memory it will take to do so. The Patrol Agent Query Tool and the PSL Debugger (available from a development console only) can be used to get a good understanding of how many applications, instances and parameters are being monitored. From the Agent Query Tool:

SELECT Parameters FROM PATROL WHERE Computers-Name LIKE 'hostname'

The search will return the number of objects found in the Query Results window. All of these objects will be parameters, but not all will necessarily be active. The parameter summary line at the bottom of the Query Results window will show:

# Total Objects: #=Alarm #=Warn #=Offline #=OK

The number of active parameters = Total Objects - Offline.
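
Applying the formula active = Total Objects - Offline can be shown with a short sketch. The summary string here is a made-up example in the format described above:

```python
import re

# Hypothetical Query Results summary line in the described format.
summary = "99 Total Objects: 2=Alarm 5=Warn 14=Offline 78=OK"

total = int(summary.split()[0])
offline = int(re.search(r"(\d+)=Offline", summary).group(1))

# Active parameters = Total Objects - Offline.
active = total - offline
print(active)  # 85
```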

Page 30: Patrol Tuning Guide

Measuring OS command execution cost

Any KM that calls an OS command causes the Patrol Agent to be charged with the CPU that the executed command consumes, because the executed command is a child process of the Patrol Agent. The best practice is to profile all the OS commands executed by the Patrol Agent. A good start is to use "grep" to search for "system(", "execute(", and "popen(" through all the KM and PSL files on the Patrol Agent machine. (Remember: you may also have to look at the Patrol Developer Console, because those KMs will be downloaded every time a connection is made to the Patrol Agent.) Unfortunately, some OS commands can be hidden within Patrol lib and bin files; we won't worry about those for now. For all the OS commands, create a chart that shows the user, system, and combined CPU time, plus the clock (elapsed) time, of each command. To do this, execute the same command that the Patrol Agent executes on your platform with "time" in front of it (e.g. time ps -efl). You should see output similar to this:

Command      Real Time   User CPU   Sys CPU   Total CPU
sar -A 5 2   18.63 sec   0.10 sec   2.16 sec   2.26 sec
vmstat 5 2    5.21 sec   0.03 sec   0.18 sec   0.21 sec
lpstat -v     0.21 sec   0.12 sec   0.08 sec   0.20 sec
ps -ef        2.07 sec   0.30 sec   0.69 sec   0.99 sec
lsps -a       5.27 sec   0.57 sec   0.60 sec   1.17 sec

On some machines (e.g. HP-UX and AIX), the ps command consumes a lot of CPU.
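A small script can turn a handful of such measurements into the chart described above; the numbers below are simply the example values from the table:

```python
# Example (user CPU, system CPU) timings in seconds, from the table above.
timings = {
    "sar -A 5 2": (0.10, 2.16),
    "vmstat 5 2": (0.03, 0.18),
    "lpstat -v":  (0.12, 0.08),
    "ps -ef":     (0.30, 0.69),
    "lsps -a":    (0.57, 0.60),
}

# Print a USR / SYS / TOTAL CPU chart, most expensive command first.
ranked = sorted(timings.items(), key=lambda kv: -(kv[1][0] + kv[1][1]))
for cmd, (usr, sys_) in ranked:
    print(f"{cmd:12s} usr={usr:5.2f}s  sys={sys_:5.2f}s  total={usr + sys_:5.2f}s")
```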

The ps command is often used in customer-written KMs to look for processes. At BMC, we try to limit the number of such system calls for this reason: we establish a single collector that executes the system call and builds an internal table for all the other collectors to operate on. If the machine has more than 500 processes, even the KMs provided by BMC can cause a problem. Your only solution is then to modify the schedule of the KM to slow it down until an acceptable CPU consumption is reached. (Note: the UNIX KM a7 patch replaces the external ps command with a BMC-written command on platforms where ps uses a lot of CPU.)

If just this command is executed 6 times a minute, it will account for roughly 10% of one CPU: about 1 second of total CPU per run, 6 times in every 60 seconds of wall-clock time.
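The 10% figure can be checked with simple arithmetic:

```python
# CPU fraction = (CPU seconds per run * runs per minute) / 60 seconds.
cpu_per_run = 0.99      # total CPU for one "ps -ef" run, from the table above
runs_per_minute = 6

cpu_fraction = cpu_per_run * runs_per_minute / 60
print(f"{cpu_fraction:.1%}")  # -> 9.9%, i.e. roughly 10% of one CPU
```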

Page 31: Patrol Tuning Guide

PSL Profiler subsystem

The PSL profiler is available in the PATROL 3.2 release of the agent. It is a measurement component built into the agent that measures how much resource each PSL script consumes. This allows you to determine which scripts have the largest cost, with some information about where the cost is distributed within each PSL script. There are a number of interfaces to the PSL profiler:

1. PSL functions — the profiler can be dynamically enabled via PSL, and reports can be printed in PSL.
2. Menu commands — a number of profiler menu commands are available that make use of the PSL primitives to produce useful profiling reports.
3. ppv utility — the ppv executable is an off-line reporting tool that analyzes a profiling data file which the agent can generate during profiling.

In all cases, the output report has a similar format. The PSL profiler reports tell you which PSL scripts have the highest CPU usage, and also measure how much CPU each builtin PSL function has used. The scheduling frequency of the PSL processes clearly affects the results, since they are cumulative timers. However, since the number of executions is also measured, the average cost per execution can easily be generated in reports.

The main agent resource measured by the PSL profiler is CPU time. This is broken into user time and system time, to indicate how much time is spent in system calls within some of the PSL builtin functions. The profiler does not explicitly measure other costs such as the number of child processes, the amount of memory, PSL locks, global channels, and so on. However, some of this information can be gathered from the execution counts and the builtin function call counts. For example, child process launches can be measured reasonably accurately as the number of calls to the PSL system, execute, and popen functions.

The main limitation of the PSL profiler is that it does not look very deep within each script. For example, there is no line-by-line report showing which lines are executed most. The analysis of builtin functions is useful, but there is no similar analysis of user-defined PSL functions.

The profiler can be dynamically enabled and disabled during agent execution. Enabling the profiler turns on measurement for the specified PSL processes; disabling it discards all measured data and returns the agent to the normal non-profiling state. The impact of turning on the PSL profiler is a marginal degradation in PSL execution performance, since each PSL timeslice must also update measurement counters. However, this is mainly a low in-memory cost; the largest costs occur when reports are generated, and reports are infrequent because they are only useful once data has been collected.
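Since the profiler records both cumulative CPU time and execution counts, an average per-execution cost is a one-line calculation; the counter values below are invented examples:

```python
# Hypothetical cumulative profiler counters for one PSL process.
cumulative_cpu_seconds = 1.8
executions = 120

avg_cost = cumulative_cpu_seconds / executions
print(f"{avg_cost * 1000:.0f} ms per execution")  # -> 15 ms per execution
```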

Page 32: Patrol Tuning Guide

PPV Tool

Dynamic control of the PSL profiler via PSL is not the only way to use it. The profiler can also be enabled in a more controlled, batch manner: a command-line option starts the agent in a profiling state, and the agent writes a data file on termination. This data file of profiling measurements can then be used to generate reports with the standalone 'ppv' reporting tool. Since the results of 'ppv' are identical to those of the PSL profiling report functions such as ProfGet and ProfTop, which can be used dynamically, there is usually no reason to use the more complicated process involving agent startup and shutdown. Nevertheless, some occasions are better suited to batch processing and off-line analysis.

The PSL profiler can be activated with the "-profiling" command-line option when the PatrolAgent is started. The environment variable "PSL_PROF_LOG" must also be set to the name of a file in which to store the profiling measurements. This is a binary data file generated by the agent for use with the 'ppv' utility. The standalone interpreter executable 'psl' can also run with profiling via the "-P" command-line option. However, remember that not all PSL builtin functions work in the standalone interpreter, so off-line profiling using 'psl' is not successful for many PSL scripts.

The profiling data is written to a binary file when the agent terminates (or when the PSL standalone interpreter completes). The 'ppv' utility can be used to view these binary files. It displays statistics from the PatrolAgent for each PSL process the PatrolAgent executed. The agent-level statistics displayed by 'ppv' are:

• the number of PSL processes the PatrolAgent profiled
• the elapsed time of the PatrolAgent
• the cumulative CPU time of the PatrolAgent
• the average %CPU load of the PatrolAgent

The per-process statistics include:

• the number of executions of the process
• the real time and the CPU time of each process
• the number of calls, along with the real and CPU times, for each PSL builtin function the process calls

This information is displayed in descending order, with the process using the most CPU first. The format of the 'ppv' output is almost identical to that of the PSL ProfTop function.

Page 33: Patrol Tuning Guide

Profiling Report Formats The format of the PSL profiling reports is similar for the PSL ProfGet and ProfTop functions, and also the ‘ppv’ standalone reporting utility. Below is an example report for a single PSL process.

Process Name: ProfGet.psl
Executions: 1 times

Process Summary:
Code Section     Real Time   User Time  System Time  %Proc  %Total
---------------- ----------- ---------- -----------  -----  ------
PSL              0:00:00.10  0:00:00.0  0:00:00.0     0.00    0.00
Native           0:00:00.0   0:00:00.0  0:00:00.0     0.00    0.00
---------------- ----------- ---------- -----------  -----  ------
TOTAL            0:00:00.10  0:00:00.0  0:00:00.0   100.00    0.00

Per-function Summary:
Name         # Calls  Real Time  User Time  System Time  %Proc  %Total
------------ -------- ---------- ---------- -----------  -----  ------
ProfOptions  1        0:00:00.0  0:00:00.0  0:00:00.0     0.00    0.00
getpid       1        0:00:00.0  0:00:00.0  0:00:00.0     0.00    0.00
getpname     1        0:00:00.0  0:00:00.0  0:00:00.0     0.00    0.00
nthargf      2        0:00:00.0  0:00:00.0  0:00:00.0     0.00    0.00
print        3        0:00:00.0  0:00:00.0  0:00:00.0     0.00    0.00

The "PSL" data row is the time spent interpreting all PSL instructions except for "native" (i.e. built-in) function calls. The "native" functions are the PSL built-in functions, such as ntharg and grep, that are implemented in C code, as opposed to user-defined PSL-coded functions. (Strictly, the "PSL" values include some of the time taken to set up each function call, but not the time spent inside the call itself.) The "Native" row is the time spent executing native (built-in) functions. The first percentage column, "%Proc", is the percentage of the total CPU time for the process. The second, "%Total", is the percentage of the total CPU time for the Agent. The total agent CPU time is available at the top of the report, which shows the elapsed (wall clock) time of the Agent and the total CPU time of the Agent; the percentage CPU load caused by the Agent (CPU/elapsed) is shown in parentheses.

Page 34: Patrol Tuning Guide

Profiler PSL Functions

The PSL profiler has a number of PSL functions to control it. The profiler can be turned on and off via PSL, and reports can be generated. The PSL functions relating to the PSL profiler are:

• ProfOptions — turn profiling on and off for one, many, or all existing PSL processes.
• ProfDefaultOptions — handle profiling defaults for PSL processes not yet created.
• ProfGet — generate reports for one or many PSL processes.
• ProfTop — generate reports for the top "n" PSL processes.

These PSL functions are used by the Deployment and Profiler KMs.

Page 35: Patrol Tuning Guide

Tools for easy performance measurement

Deployment KM

The Patrol Deployment Knowledge Module, as the name suggests, has been designed to help the Patrol consultant with the issues surrounding large-scale deployment. In addition to supplying deployment information, the Knowledge Module also provides a couple of useful tools for gaining a better understanding of what the PatrolAgent is doing, or waiting on, with regard to its PSL processes. These tools are very useful when you need a better understanding of the PSL processes on a PatrolAgent. However, they can be expensive in terms of CPU usage, so take care about when you use them for PSL process analysis.

PSL Execution Counter Menu

Repeatedly running the %PSLPS command can give you useful information about the top PSL processes. This can be automated, and one PSL example of this is the "PSL Execution Counter" menu command, available under the "Development Tools" menu on the computer icon (if you have installed these tools). The PSL Execution Counter is used to understand the number of timeslice executions and the percentage of time each PSL process consumes within the agent. This counter needs to run for a reasonable number of minutes before the results are accurate, because of the cumulative nature of the report and the fact that different parameters have different poll times. The task should run at least as long as the longest parameter poll period on that agent, and the longer it runs, the more accurate it becomes. Note that the PSL Execution Counter depends on two performance variables rather than one: the amount of time each PSL script consumes, and the poll time that controls how frequently the script is rescheduled. Therefore, the most expensive PSL scripts can be "tuned" either by real code tuning or simply by changing the parameter poll time or discovery cycle time.

PSL Processes Watch Window Menu Commands

There are a number of useful menu commands available in the Development Tools menus. The PSL process monitor displays the top 10 currently RUNNING processes, as well as the blocked I/O processes identified by the IOWAIT status type. This tool is good at showing any PSL processes that are continuously in the RUNNING state. It pays to investigate why such processes are always RUNNING: perhaps their poll time, or the collection routine itself, needs looking into.

Page 36: Patrol Tuning Guide

The Deployment Knowledge Module also has a PSL process monitor that shows those PSL processes that are waiting on external system processes, as well as the external system processes themselves. These PSL processes will typically be in the IOWAIT state, which is to be expected. These tools are very useful when you need a better understanding of the PSL processes on a PatrolAgent; however, they can be expensive in terms of CPU usage because of their rapid sampling techniques, so take care about when you use them for PSL process analysis.

Profiler Console Menu Commands

A number of menu commands have been developed to control the PSL profiler functions and to generate more complex reports from the basic PSL reports. These are available in the "Development Tools" menu commands on computer icons, and also in the "Deployment" menu commands, if they are installed. The main sequence of using these menu commands is:

1. Choose "Enable Profiler" to start profiling on all current PSL processes.
2. Wait! The results from profiling reports are not immediately useful. At least 5 minutes is recommended, but longer time periods make the results more accurate.
3. Choose one or more of the reporting menu options. For example, there are reports about the top CPU consumers, child process launches, and PSL builtin function counts.
4. Choose "Disable Profiler" to return the agent to the non-profiling state.

There is nothing magic about these menu commands. They use the PSL interface to the profiler to control it, and string operations to generate summarized or specialized reports. Examining these menu commands can give you ideas for extending the profiling to your specific needs.

Page 37: Patrol Tuning Guide

PSL Profiler KM

Purpose

An automated KM profiler tool that measures various performance metrics about KMs, in order to help KM developers tune their KMs.

Overview

This tool is a KM that uses the PSL Profiler introduced in the 3.2 agent, accessed via the new PSL functions that control the profiler. The tool takes the data returned by the profiler and massages it into useful information by creating a nested view of all the KM and PSL process objects. This makes it easy to determine how much resource a KM is using, and then drill down to see how much resource each PSL process within that KM is using. The primary purpose of this tool is as a KM development aid for the KM performance tuning task. It can also potentially be used as an agent configuration or deployment tool in production systems, to determine which KMs and PSL scripts are taking too much resource, although this is not its main purpose. Out of the box, this KM has alarm ranges that raise warnings and alarms for a total agent CPU usage of more than 5%, and for more than 2% used by a single KM or a single PSL process. There are also alarms if a KM launches more than 5 child processes per minute on average.

OS Platforms

• Supported platforms: UNIX, NT, OS/2, VMS
• Unsupported platforms: MVS, Netware

PATROL Versions:

• PATROL 3.2 and later agents are required.
• PATROL 3.0 and 3.1 agents do not support the PSL profiler PSL API.

Application Classes

• AGENT_PROFILER.km - Main application class doing all the work.
• AGENT_PROF_KM.km - Represents each KM's profiling information.
• AGENT_PSL_PROC.km - Represents each PSL process's profiling information.

Page 38: Patrol Tuning Guide

AGENT_PROFILER.km - Main application class doing all the work.

The AGENT_PROFILER class is the main application class that does the work of the profiler. This class creates all the sub-objects of the AGENT_PROF_KM and AGENT_PSL_PROC classes. The parameters of this class also do the main analysis via the PSL profiler API calls.

Ø AGENT_PROFILER.km Parameters

AP_DetailReport - Consumer text parameter giving a detailed report of the top PSL processes and their profiling statistics. Includes CPU usage and details on builtin function calls.

AP_Collector - Main collector gathering all the profiling measurements. This collector uses the agent's PSL profiler API calls, such as ProfGet and ProfTop, to gather the raw profiling data. This data is then massaged into more useful reports, including per-KM counts, and the results are sorted according to resource usage. This collector parameter is also responsible for creating the nested instances in the AGENT_PROF_KM and AGENT_PSL_PROC classes that represent the various KM and PSL process objects and their performance characteristics.

AP_AgentCPUPercent - Consumer graph parameter that measures the percentage of system CPU used by the agent in total. Since the percentage is based on internal measurements from the PSL profiling API, it may differ from the figures of external CPU% measuring tools, such as Performance Monitor on NT or "top" and "glance" on UNIX. This parameter has a warning range from 5%-10% system CPU and an alarm range for more than 10%. When it warns or alarms, the parameter is annotated with a textual report on the top PSL processes consuming CPU time, which helps identify the PSL process causing the load.

AP_TopProcs - Consumer text parameter that contains a textual report on the top PSL processes.

AP_InternalFailure - Consumer text parameter that reports internal problems with this KM itself, such as the agent's PSL profiler API returning strange results. This is mainly used for debugging the profiler KM itself.

AP_TopKMs - Consumer text parameter that displays a text report of the CPU percentage used on a per-KM basis. This report is built by the AGENT_PROFILER collector by summing the data for each PSL process belonging to the KM class. The attributes reported per KM in this "top KMs" report include: Class Name, System CPU% Used, Agent % Used, and Child Process Launch Count.

AP_ChildProcsReport - Consumer text parameter that displays a report of child process launching statistics. This can be used to measure the impact of using PSL system, execute, and popen in KMs that launch child processes from the agent.

Ø AGENT_PROFILER.km Menu Commands

Refresh Parameters - Refreshes the collectors of the AGENT_PROFILER class to get a new set of profiling data via the PSL APIs to the agent's profiling data. This has the effect of

Page 39: Patrol Tuning Guide

updating all the profiling reports in all the sub-objects, including the AGENT_PROF_KM and AGENT_PSL_PROC classes.

Kill KM objects - Destroys all instances of the AGENT_PROF_KM class. These objects are then re-created by the next discovery cycle of the AGENT_PROFILER class. This menu command is mainly useful for debugging this tool itself.

Clear Profiling Data Counters - Turns off the PSL profiler. It is then turned on again by the next AGENT_PROFILER discovery cycle. This has the effect of resetting all profiling counters, since the agent discards profiling data when the profiler is turned off.

Clear Internal Errors List - Turns off the messages and the alarm (red) icon of the AP_InternalFailure parameter. Mainly used for debugging the KM itself.

Page 40: Patrol Tuning Guide

AGENT_PROF_KM.km

The AGENT_PROF_KM class is used as an iconic representation of each KM. That KM's profiling information is stored in its parameters. All AGENT_PROF_KM objects are nested under the main AGENT_PROFILER object.

Ø AGENT_PROF_KM Parameters

AP_TopProcsCPU - Consumer text parameter recording the top PSL processes attributed to this KM class and how much CPU they used. Useful for determining which PSL processes within a KM are using the most resources.

AP_AgentPercUsed - Consumer graph parameter recording the percentage of the agent's time used by this KM. This can help determine whether this KM is the one taking most of the agent's resources. There are currently no alarm ranges for this parameter, because a KM can use a high percentage of the agent's time while the absolute cost is still small, if the agent's overall time is low. Therefore, this parameter is not that significant a performance metric; the system CPU% measures are more useful. If you do enable an alarm range on this parameter, then whenever it alarms, the parameter is annotated with the full detail list of PSL processes for the current KM, and also a full detail report of all the PSL processes in the agent. This helps determine what is causing the high system CPU% usage.

AP_SysCPUPercent - Consumer graph parameter recording the percentage of the system's time used by this KM. This can help determine how much impact the KM places on the real system. This parameter has a warning range for 2%-5% and an alarm for greater than 5% system CPU usage from this KM alone. Whenever it alarms, the parameter is annotated with the full detail list of PSL processes for the current KM, and also a full detail report of all the PSL processes in the agent. This helps determine what is causing the high system CPU% usage.

AP_ChildPerMinAvg - Consumer graph parameter recording the number of child processes launched by this KM per minute. This parameter can help determine whether the KM is launching too many child processes via PSL system, execute, or popen function calls. It currently warns for 5-10 child processes per minute and alarms for more than 10 per minute. When it alarms, the parameter is annotated with a report on the child process launching characteristics.

AP_ChildProcCount - Consumer graph parameter recording the total number of child processes launched by this KM. This parameter can help determine whether the KM is launching too many child processes via PSL system, execute, or popen function calls. It currently has no active alarm ranges unless customized; however, if it alarms, the parameter is annotated with the full detail list of PSL processes for the current KM and a full detail report of all the PSL processes in the agent.

Ø AGENT_PROF_KM.km Menu Commands

Page 41: Patrol Tuning Guide

Kill KM Profiling Objects - This menu command destroys all instances of the AGENT_PROF_KM class. These objects are then re-created by the next discovery cycle of the AGENT_PROFILER class. This menu command is mainly useful for debugging this tool itself.

Page 42: Patrol Tuning Guide

AGENT_PSL_PROC.km application class

The AGENT_PSL_PROC class is used as an iconic representation of each PSL process. The profiling information about that PSL process is stored in its parameters. All AGENT_PSL_PROC objects are nested under the AGENT_PROF_KM object that represents the KM to which the PSL process belongs.

Ø AGENT_PSL_PROC.km Parameters

AP_PSLSysPercent - Consumer graph parameter recording the percentage of the machine's system CPU used by this particular PSL process. This parameter has alarm ranges for high values, with a warning from 2%-5% and an alarm for more than 5% system CPU used. Whenever it alarms, the parameter is annotated with the full detail list of PSL processes for the current KM, which helps determine what is causing the high system CPU% usage.

AP_NonBuiltinPercent - Consumer graph parameter recording the percentage of this PSL process's time attributable to ordinary PSL statements (e.g. arithmetic operators), rather than to builtin functions such as grep, set, ntharg, etc.

AP_BuiltinPercent - Consumer graph parameter recording the percentage of this PSL process's time attributable to builtin functions such as grep, set, ntharg, etc.

AP_DetailedProfile - Consumer text parameter showing a detailed report of the builtin function call usage in human-readable form.

Ø AGENT_PSL_PROC.km Menu Commands

None

Limitations

Performance: this tool is not intended for use on a production system. Because it does a lot of computation, it can consume significant resources itself. The profiler KM is meant for a KM development environment. For agent CPU tuning reasons it is not recommended on production systems, although in theory it can be briefly turned on in production.

Page 43: Patrol Tuning Guide

Solving Performance Problems

History Storage

The cost of storing parameter values in the history database has some effect on agent performance and disk space usage. However, the effect on runtime performance is usually minimal, and the disk space used is approximately 8 bytes per parameter value for the value and timestamp, with a slight extra overhead for indexing. Tuning history storage therefore yields only a marginal improvement in most cases. Changing parameter poll times is one way to reduce history storage: by reducing the sampling frequency of a parameter, you reduce the number of values stored in the history database over time. This optimization has already been discussed. Disabling history storage for a parameter is another possible tuning mechanism. Each parameter has a history level configuration setting, which can be edited via the Developer Console GUI in the parameter editing dialog. By changing the setting from "Inherited" to "Local", and the number of days to "0", you can ensure that no values are stored in history for that parameter. This is a possible choice for less important parameters, or for parameters for which history has no particular meaning. On the other hand, it is good KM style for parameters to offer history for analysis, so turning off history storage should perhaps be left to a later KM configuration stage, if performance becomes an issue. Annotations are also stored in the history database, so the KM developer should also keep in mind the frequency and size of the text annotations stored for parameters via the PSL annotate function.
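At roughly 8 bytes per stored value, history growth is easy to estimate. The helper below is a sketch; the parameter counts and poll times are invented examples:

```python
# Estimate daily history growth: samples/day * bytes per value * parameter count.
def history_bytes_per_day(poll_seconds, num_parameters, bytes_per_value=8):
    samples_per_day = 86400 // poll_seconds
    return samples_per_day * bytes_per_value * num_parameters

# 100 parameters polled every 60 seconds:
print(history_bytes_per_day(60, 100))   # -> 1152000 bytes, roughly 1.1 MB/day

# The same 100 parameters polled every 5 minutes:
print(history_bytes_per_day(300, 100))  # -> 230400 bytes, roughly 225 KB/day
```

This illustrates why lengthening poll times, discussed above, shrinks history storage proportionally.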

Fixing history files and history index pollution

If you have previously loaded KMs that had a lot of instances (active processes, a badly configured PRINTER KM), you will have a lot of index pointers for those instances in the history files. "Unreachable" annotation data can also remain in the history files. If you don't want to remove the history files, you can try to fix them: stop the Patrol Agent and run the fix_hist command. The fix_hist program attempts to strip out index pointers that no longer have history data attached to them, and also tries to pack the annotation data. If the CPU usage is acceptable after an agent restart, then you were successful. Index pollution of this kind may be another reason why the agent's CPU consumption increases after a period of monitoring.

Page 44: Patrol Tuning Guide

Removing configuration files

If you are going to change the configuration of multiple KMs, you might decide to remove the config file (the config_`hostname`-`port` file in the PATROL config directory). However, the configuration file is agent-wide, so removing it will remove all configuration you have done before. This might be desirable in order to return to a clean situation. You can save the changes you made in the agent configuration database by using the command pconfig +get

Reconfigure the Patrol Agent

If you don't know exactly what your agent is doing, it's time to reconfigure the Patrol Agent. Configuring the Patrol Agent is the number one way to make sure the agent uses only the resources you intend it to use; all too often, during a Patrol trial, the configuration step is left out. If you had a previous configuration file, now is the time to apply it. Before you apply it, make sure that the configuration variables listed below are not already part of the file; you can check this by simply editing the file. (The configuration file is the one created by the pconfig +get -port `portnum` command from a previous step.) If everything is OK, apply the configuration using the command pconfig +set -port `portnum` `filename`. If you don't have a config file, connect to the Patrol Agent with a Patrol Console and reconfigure the Patrol Agent there. Either method should apply the following variables. The recommended Agent variables to set are:

/AgentSetup/defaultAccount = `install account/password` (usually patrol)
/AgentSetup/preloadedKMs = `knowledge modules that are to be loaded automatically during agent startup`
/AgentSetup/staticApplications = `application classes that are to remain resident (static) even if a console disconnects`. All pre-loaded KMs automatically fall into this category. There may be cases where you want a Patrol Console to initiate the load, but the Agent to retain the knowledge after the console disconnects.
/AgentSetup/historyRetentionPeriod = `number of days to store history`. Make sure this number is equal to the number in the Patrol Console Preferences box; it must be the same on all consoles that attach to the Patrol Agent.
/AgentSetup/disabledKMs = `list of application classes (NOT KMs!) that are not to be loaded into the agent`

It is very important to set the disabledKMs variable. The Patrol Console has Knowledge Modules loaded for every machine it connects to, and these machines run different OSs and RDBMSs. If you do not disable the appropriate KMs, every Patrol

Page 45: Patrol Tuning Guide

Agent will receive all the Knowledge Modules that are loaded by the Patrol Console (in the case of a developer console). Even when you work only with an operator console, the agent will be requested to load every KM; we do not want this. After all of these steps have been done, you should have a healthy Patrol installation.

Event storage

The event subsystem and the event database are usually not very costly. The only exception would be a KM that overuses a user-defined KM event catalog for KM-specific events: a KM pumping hundreds of events into the event subsystem can cause a performance bottleneck. Reasonable use of KM-defined events, however, is not usually a cost concern. Here are some AgentSetup variables that define the behaviour of the event manager:

/AgentSetup/pemCacheSize - Size of the cache used by the PatrolAgent for event management. A bigger cache improves the performance of the PatrolAgent, at the expense of memory. Allow 250 bytes per event when determining the event cache size.

/AgentSetup/pemEvMemRetention - Number of events the PatrolAgent keeps cached for each object. The larger this value, the better the performance of the PatrolAgent, at the expense of more memory.
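Using the 250-bytes-per-event rule of thumb, a cache size can be estimated as follows; the event count is an invented example:

```python
# pemCacheSize estimate: number of events to cache * 250 bytes per event.
BYTES_PER_EVENT = 250

def pem_cache_bytes(events_to_cache):
    return events_to_cache * BYTES_PER_EVENT

# Caching 1000 events:
print(pem_cache_bytes(1000))  # -> 250000 bytes, roughly 244 KB
```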

Page 46: Patrol Tuning Guide

Configuration Guidelines for Deploying PATROL for Optimal Performance

PATROL is designed with scalability and performance as foremost considerations. The out-of-the-box configuration of PATROL aims to achieve minimal server performance impact for the majority of machine hardware and monitoring configurations. The agent's various self-tuning mechanisms also help to extend this to less typical configurations without impacting performance. However, the automatic tuning capabilities can only go so far, and the human touch is sometimes needed to achieve optimal performance on some monitored machines. This section addresses some of the considerations involved in deploying PATROL in such circumstances. The performance of PATROL depends on the monitoring load placed on an agent and on the frequency of the data collection and other actions required to achieve that monitoring. This leads to a number of decisions regarding configuration settings:

Page 47: Patrol Tuning Guide

Discovery Cycle time

The default setting is for the agent to run every KM discovery script approximately every 40 seconds, subject to variations due to scheduling smoothing algorithms. Since the discovery cycles are doing work to detect instances and validate the statuses of existing instances, these can be lengthy calculations when there are many objects. Therefore it can be valuable to increase the discovery cycle interval for KMs where the immediate status of an object is not needed. For example, the PRINTER application class in the UNIX KM is a common candidate because it is not as crucial as the FILESYSTEM class. Changing its cycle from the default 40 seconds to 5 minutes might be a worthwhile improvement with negligible monitoring value lost.

Changing discovery cycles

This is achievable via the PATROL Developer Console GUI for each KM. From the applications list, drill into an application to edit the application knowledge. In the application editing dialog, change the “Custom Discovery Cycle” setting to whatever frequency you feel is adequate for status information on this object class.

More details on discovery cycles: The 40 second cycles are actually called partial discovery cycles. Since the full discovery cycle interval is tied to the agent process cache refresh interval, which is 300 seconds by default, a discovery cycle is a full discovery about 1 in every 7 or 8 times. However, in some KMs the partial discovery cycle can involve significant effort, usually to re-validate the status of existing objects rather than discovering new ones. Increasing the value of the cycle through the GUI setting actually affects the partial discovery cycle, but since a cycle will be a full discovery whenever the process cache refreshes, it is possible to change this cycle time to more than 300 seconds if desired.
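The "1 in every 7 or 8" figure follows directly from the two intervals; a quick check of the arithmetic (illustrative Python, not agent code):

```python
partial_cycle = 40    # seconds between (partial) discovery cycles
cache_refresh = 300   # default process cache refresh interval, seconds

# A cycle runs as a full discovery whenever the process cache has
# refreshed since the last cycle, i.e. about once every
# cache_refresh / partial_cycle cycles.
cycles_per_full = cache_refresh / partial_cycle
print(cycles_per_full)  # 7.5, hence roughly 1 full cycle in every 7 or 8
```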

Targeting discovery cycle changes: Generally speaking, the best candidates for changing discovery cycles are those with the largest number of instances. On UNIX this typically includes PRINTER and FILESYSTEM, with the former the most frequent candidate because of its less crucial nature. Of course the best thing to do is gather profile data for your agent (using any of the earlier described tools, like deploy KM or profiler KM)…

Page 48: Patrol Tuning Guide

Collection Parameter Polltimes

By collection parameters we mean every parameter that has a scheduling interval and a script to execute. This recommendation is analogous to that regarding discovery cycles, but in this case refers to the data collection that populates parameter values. Each KM usually has a set of parameters designated as “collectors” that perform data extraction for measurement of the application. These collectors can be in the KM class itself or, in some cases such as the UNIX KM, in a separate application class (the UNIX KM has the COLLECTORS application class separated out for performance reasons).

Targeting polltime changes

Changing poll times involves a tradeoff between performance and monitoring delays. Therefore the first targets should be those that are less crucial to the monitoring requirements. In the UNIX KM, some of the candidates might include USRProcColl (for user process monitoring), NFSColl (for NFS information), PrinterColl (for printer information), and PSColl (for process information). Another way to target collectors is to look for those that are running most frequently according to their polltime setting. Of course, the best way is again to start from your profile information.
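One way to make the tradeoff concrete is to translate a polltime into executions per day (illustrative Python; the function name is ours):

```python
SECONDS_PER_DAY = 24 * 60 * 60

def collections_per_day(polltime_seconds):
    """How many times a collector with the given polltime runs per day."""
    return SECONDS_PER_DAY / polltime_seconds

# Stretching a collector from 40 seconds to 5 minutes cuts its daily
# executions from 2160 to 288:
print(collections_per_day(40))   # 2160.0
print(collections_per_day(300))  # 288.0
```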

Page 49: Patrol Tuning Guide

Application Disabling

Clearly the most effective way to reduce monitoring impact is to disable unnecessary KMs and leave only the crucial requirements. On a mission-critical UNIX server the most crucial applications might be ORACLE, CPU, FILESYSTEM and DISK. Other applications such as USERS, KERNEL or MEMORY might be less crucial and therefore candidates for disabling. Note that a plausible alternative to totally disabling a KM is to increase both its discovery cycle and collector polltime settings to a long interval.

Changing disabled KMs

Each agent has a “disabledKMs” configuration variable that lists the KMs that are not to run under any circumstances on the machine. Adding a KM name to this list ensures that the KM is never loaded by the agent nor accepted from a Developer console download. Another way to disable KMs across the enterprise is to edit the Active setting in the application editing dialog.
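As a sketch of what such a setting might look like in an agent configuration change file (the variable path and the exact file syntax should be verified against your PATROL version and its pconfig documentation; the KM names here are purely illustrative):

```
PATROL_CONFIG
"/AgentSetup/disabledKMs" = { REPLACE = "SYBASE.km,ACTIVEPROCESS.km" }
```

The change file would then be applied to the agent with the pconfig utility.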

Targeting applications

There are three main categories of applications to consider disabling. Firstly, there are applications which are valid for the machine but are not considered important. Targeting these is valuable, particularly for applications with many instance icons. Secondly, there are applications which are not valid on the given machine (e.g. SYBASE on a machine with no Sybase installation). Targeting these is probably less effective in reducing monitoring resources because the agent’s prediscovery mechanism minimizes wasted resources, and disabling them also loses the advantages of auto-discovery in the KMs. However, these scripts do still consume some resources, and because they are clearly not needed, disabling them is a safe way to get a marginal gain. Thirdly, there is the category of KMs that are known to have performance problems. Fortunately, the ACTIVEPROCESS KM is the only such item, and it is currently shipped deactivated because its use is only valuable under certain special circumstances.

Again, I would say: look at the profiler data. This valuable source of information can tell you exactly which KM is the worst offender!

Page 50: Patrol Tuning Guide

Collector Disabling

Disabling collectors is the more extreme version of changing their polltime. Turning off a collector is possible via the PATROL Developer Console GUI: drill down into the application class, edit the parameter, and change its “Active” setting.

Targeting collectors

The choice of collector is based upon its impact on the monitoring environment, in terms of which consumer parameters will no longer be populated. Candidates within the UNIX KM include USRProcColl (for user process monitoring) and PSColl (which populates the PROCESS application). The collectors with the shortest polltime cycles, or those which refer to application classes with the largest number of instances, will often yield the most benefit.

Page 51: Patrol Tuning Guide

APPENDIXES

APPENDIX A : Patrol Environment Variable Table

Environment Variable | Default Setting | Description | Console | Agent
COMSPEC | c:\command.com | COMMAND interpreter used for Windows/NT and OS/2 | No | Yes
DUMP_CORE_ON_XT_ERROR | | Dump core due to malloc errors | Yes | No
ETC_HOSTS_SCRIPT | NULL | Something to do with an alternate method to retrieve hostnames?? | Yes | No
GETHOSTENT_WORKAROUND | NULL | Something to do with an alternate method to retrieve hostnames?? | Yes | No
HOME | $HOME | Environment variable where the Patrol Console writes its customization | Yes | No
PATROL_ADMIN | bin | Set to the user that is to own log files that the Patrol Agent writes | No | Yes
PATROL_ARCHIVE | $PATROL_HOME/../archives | TOC files are stored here | Yes | Yes
PATROL_BIN | $PATROL_HOME/bin | Location of the Patrol binary files | Yes | Yes
PATROL_CACHE | $HOME/patrol | Local cache directory for Agent PSL and KMs | Yes | No
PATROL_CFG | $HOME/patrol/config | Local area where xpconfig stores configuration files | Yes | No
PATROL_CONFIG | $PATROL_HOME/config | Directory where agent configuration databases are stored | No | Yes
PATROL_CORE_ON_ASSERT | FALSE | Instruct the Patrol Agent to core dump if assertion failures are noticed | No | Yes
PATROL_DEBUG | ??? | Sets the debug level when the -debug option is used and a level is not supplied | Yes | Yes
PATROL_DESKTOP | $HOME/patrol/desktop | Local directory where desktop configurations are stored | Yes | No
PATROL_DEVELOPMENT | FALSE | Enable assertions and debugging | Yes | Yes
PATROL_DUMP_CORE | $TMP | Location to place the core dump file | Yes | Yes
PATROL_GLOBAL_LIB | $PATROL_HOME/lib | Global Patrol library directory | Yes | Yes
PATROL_HELP | $PATROL_GLOBAL_LIB/app-defaults/help | Directory containing help files | No | Yes
PATROL_HISTORY | $PATROL_LOG/history | Location of parameter history data files | No | Yes
PATROL_HOME | `patrol install`/`platform` | Main install directory for Patrol | Yes | Yes
PATROL_KM | $PATROL_GLOBAL_LIB/knowledge | Global Knowledge Module directory | Yes | Yes
PATROL_LICENSE | $PATROL_GLOBAL_LIB | Location of the Patrol license file | Yes | Yes
PATROL_LOCAL_CHART | $PATROL_CACHE/chart | Location where the Chart facility stores preferences | Yes | No
PATROL_LOCAL_KM | $PATROL_CACHE/knowledge | Local Knowledge Module directory | Yes | No
PATROL_LOCAL_PSL_APPS | $PATROL_CACHE/psl | Local PSL directory | Yes | No
PATROL_LOG | $PATROL_HOME/log | Patrol Agent log directory | No | Yes
PATROL_MACHINE_TYPE | `platform` | Override for the Patrol machine type response; useful on SVR4 machines that need a different icon | No | Yes
PATROL_MIBFILE | ??? | MIB filename for SNMP support (used only if not set in the agent's configuration) | No | Yes
PATROL_PSL_APPS | $PATROL_GLOBAL_LIB/psl | Global PSL directory | Yes | Yes
PATROL_QRY | $HOME/query | Local directory where query results from Agent Query are stored | Yes | No
PATROL_REMOTE | $PATROL_HOME/remote | Directory where received remote file transfers go | No | Yes
PATROL_SOUNDS | $PATROL_GLOBAL_LIB/sounds | Location of the Patrol sounds directory | Yes | No
PATROL_TMP | $PATROL_CACHE/tmp | Override for $TMP | Yes | No
SPEAKER | $PATROL_HOME/bin/player | Location of the executable used to play sounds | Yes | No
TMP | $TMP | Override for /tmp ?? | Yes | Yes
TMP_PATROL | $TMP | Event Manager error log file location; required if you do not have a writable C: drive | Yes | Yes
XBMLANGPATH | $PATROL_GLOBAL_LIB/images | Images sub-directory | Yes | No
XKEYSYMDB | $PATROL_GLOBAL_LIB/app-defaults/XKeysymDB | File with X defaults | Yes | No
YPCAT_HOSTS_SCRIPT | NULL | Something to do with an alternate method to retrieve hostnames?? | Yes | No

Page 52: Patrol Tuning Guide

APPENDIX B : PSL Optimizer

The PSL Optimizer in PATROL 3.2 is designed to provide a transparent optimization step. It takes the quad-code generated by the PSL compiler and optimizes it where possible. The PSL Quad Optimizer is itself a subsystem that accepts PSL quads as input, optimizes them, and produces PSL quads as output. Some examples of what the PSL Optimizer can do to improve the efficiency of the executed quad-code:
• collapses jump chains
• removes redundant and unreachable code
• removes redundant assignments
• collapses “.” concatenation operator chains — e.g. a.b.c.d or [a,b,c,d] turns into join("",a,b,c,d)
• re-arranges all looping constructs so that they are more efficient
• packs parameters so that there is less overhead associated with function calls
• folds and propagates expression values
The PSL quad optimizer performs two basic functions:
• reduces the number of quads to execute, and
• reduces the number of intermediate values used in evaluation.
The changes performed by the PSL optimizer can be examined via the “-q” option to the “psl” executable. By running “psl” with and without the “-O” option, you can see what changes occur. In preliminary tests, optimized PSL scripts were on average 15-20 percent more efficient in terms of the number of quads executed. This may not be noticeable unless you have a heavily loaded machine with many scripts, in which case you will see improved throughput of PSL scripts. The optimizer “-OP2” and “-OP3” options to “psl” can also be used to generate reports about optimization results. When building PSL binaries or libraries, you choose the optimization level via the “-O” option to the “psl” command. The “-O0” option disables optimization and “-O3” chooses the maximum level. Clearly, in all cases of pre-compilation, the maximum optimization level is worth pursuing. In addition, the optimizer also reduces the total number of quad-codes, and thereby reduces internal agent memory usage. The quad-codes will be different after optimization, but this only really affects the PSL runtime tracing output, which may become a little harder to decipher. This issue is certainly secondary to the runtime performance improvement. For KMs where PSL source files are used, the optimizer is more of a configuration issue than one for KM development. Since the agent compiles all PSL source before execution, and will apply optimization if configured to do so, the benefit

Page 53: Patrol Tuning Guide

depends on what optimization level the agent is configured to use, rather than on anything the KM developer does to the PSL source. The “pragma” statement can be used to override agent or “psl” optimizer settings. Since optimization is a post-compilation step, the settings are not used until after the first phase of compilation, and the last “pragma” statement found in a PSL script is the optimization setting that is applied. Therefore it does not matter where in the script a pragma statement is added. The pragma statement is purely a compilation directive and has no runtime impact. The syntax for the pragma statements is:

# Turn on maximum optimization for this PSL script
pragma "O3"

# Turn off optimization for this PSL script
pragma "O0"

The pragma statement can have either “O3” or “-O3”. The dash is optional. The “P” option is also available and allows printing of optimization information. The option could be “-OP3” to get level 3 optimization with printing. This pragma statement allows the KM developer to have control over optimization even in a pure PSL source KM. It is a matter of style whether to mandate maximum optimization in the scripts, or to leave this to the agent configuration settings for the PATROL administrator to consider.
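As a toy illustration of one of the passes listed above ("folds and propagates expression values"), the sketch below applies constant folding to a quad-like intermediate form. This is illustrative Python only, not the PSL optimizer's actual code or quad representation:

```python
OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}

def fold_constants(quads):
    """One peephole pass: quads whose operands are both integer
    literals are replaced by a plain assignment of the computed value."""
    out = []
    for op, a, b, dest in quads:
        if op in OPS and isinstance(a, int) and isinstance(b, int):
            out.append(("=", OPS[op](a, b), None, dest))  # folded
        else:
            out.append((op, a, b, dest))                  # left unchanged
    return out

# x = 2 * 3 + y: the 2 * 3 can be evaluated at compile time.
quads = [("*", 2, 3, "t1"), ("+", "t1", "y", "x")]
print(fold_constants(quads))
# [('=', 6, None, 't1'), ('+', 't1', 'y', 'x')]
```

A real optimizer would iterate passes like this to a fixed point, combining folding with the jump-chain collapsing and dead-code removal described earlier.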

Page 54: Patrol Tuning Guide

APPENDIX C : PATROL Architecture Performance Aspects

With the focus of PATROL on scalability, performance has been a critical concern for the architecture. The minimal use of resources, in terms of both server CPU and network bandwidth, has been and continues to be a central goal of PATROL design. This appendix discusses the design elements and their relation to performance in the existing PATROL 3.0 and 3.1 releases and the upcoming PATROL 3.2 release.

PATROL 3.0 Performance Aspects

The central aim of PATROL 3.0 was to achieve the scalability goals that allow monitoring of thousands of agents. The console-centric architecture of previous PATROL versions had several performance problems that arose when monitoring large environments, in terms of network bandwidth and console performance. The PATROL 3.0 autonomous agent architecture arose mainly to solve these problems by limiting the network packet requirements and improving the console to enable it to handle many thousands of objects.

a) Autonomous agent — agents run in a detached, autonomous mode in PATROL 3.0 with locally stored knowledge, which removes the need for the initial sending of knowledge by the console. Although the need to send knowledge arises in a KM development environment where the KM is changing, knowledge sending is removed from a production environment. This reduced network traffic requirements at startup to only an initial handshake message, and combined with the new protocol enhancements, dramatically reduced the console-agent bandwidth requirements.

b) Console object optimizations — The PATROL 3.0 console does not store all of the

objects for each of the agents it is connected to. Only those objects that the console must know about are stored, thereby significantly reducing the memory requirements for the console, and the CPU time required for internal algorithms on these objects.

c) Console database optimizations — In addition to reducing the number of objects to

store by a better algorithm, the PATROL 3.0 console also received a new database method for storing those objects that it has to manage.

d) Console-agent protocol extensions — A console centric architecture suffers both

network bandwidth and console performance bottleneck problems due to a continuous stream of messages from agents regarding parameter values and object statuses. The PATROL 3.0 architecture removed this by introducing the autonomous agent architecture: agent history is stored locally, and the console is sent only the statuses of objects it needs to know about, which is only those in alarm or warn state, or those visible on the screen because the console user has drilled down. The network bandwidth becomes minimal and the console is not swamped with messages it must process.

Page 55: Patrol Tuning Guide

e) Global channels — The new architecture supporting global channels improves the infrastructure for KMs collecting data from an application. This has improved the performance of data collection mechanisms such as SQL*Plus and ISQL sessions. The KM can make multiple requests through the global channel architecture leading to significant performance improvement over and above that which would occur if all KM parameters used independent collection mechanisms.

f) Collector-consumer paradigm — The division of PATROL parameters

into collectors and consumers reduces performance needs significantly. All PATROL KMs separate the parameters into collector parameters that collect data from the application and consumer parameters that display data. This separation brings together all the collection requirements of a KM, and combined with global channels, leads to significant savings in the internal scheduling of data collection processes.

g) Agent self tuning — The agent’s internal scheduling mechanism for both PSL and

operating system processes has a number of self-tuning features. The agent responds to variations in the number of PSL and OS processes it must schedule by scheduling them according to its tuning policy. There are a number of self-tuning options available, but the default out-of-box method is that the agent schedules processes according to variables including the poll-time, the number of processes waiting, and the values of tuning variables such as the interval. Internal processes are delayed if necessary beyond their normal tuning period in order to smooth out the load from processes in the agent.

h) One agent process only — The PATROL 2.0 architecture spawned a separate agent

process to accommodate each console. The PATROL 3.0 architecture changed to the far more server friendly architecture of one agent process only. The single agent handles all consoles and threads the requests it may receive from each console for efficient response times.

i) Prediscovery scripts — The support for the concept of “prediscovery” scripts

resolves an important performance concern. There is a conflicting aim between automatic discovery and performance: one would like to load a KM on all agents to auto-discover an application, but it is not desirable to load an agent with work for an application it does not need to monitor. The prediscovery script idiom permits KMs to perform automatic detection efficiently by skipping the more expensive discovery scripts if a number of simple prediscovery checks indicate the application is absent.

j) Disabled KMs — Unfortunately, there are cases where prediscovery steps do not prevent discovery running on a machine where an application is absent, such as when there simply are no cheap prediscovery checks for that application. In this case, the agent can be configured to avoid discovery execution by adding the KM to the agent’s configuration variable listing all “disabled” KMs.

Page 56: Patrol Tuning Guide

PATROL 3.1 Performance Aspects

Whereas PATROL 3.0 was characterized by an architectural paradigm shift that permitted the scaling of PATROL monitoring to install bases never previously possible, the PATROL 3.1 release was a consolidation release building upon the proven 3.0 architecture. Although there were no major changes in the overall architecture, performance improvement is a continuing process, and a number of the new capabilities were aimed at performance. a) TCP protocol support — The addition of TCP protocol support between console

and agent improved the performance of network connections. The performance of the reliable UDP protocol was adequate in most situations, but slower under heavy load, leading to occasional connection time-outs, mainly in development environments where knowledge sending occurred. The TCP protocol showed itself to be superior in performance to reliable UDP because, although our reliability layer and the TCP reliability layer have similar functionality, the TCP layer is implemented at kernel level, whereas the UDP reliability layer must by necessity run at user level. Hence, TCP is faster to respond to messages with acknowledgments should the user process (i.e. the agent) receive less than optimal scheduling from the operating system.

b) PSL builtin function optimizations — Measurements of the performance of the

agent using the Quantify profiling tool showed that a number of PSL functions were heavily used by KMs. Hand optimization of the implementations of PSL grep, ntharg, nthline, and string concatenation was performed to reduce their cost and provide an across-the-board gain for all KMs.

c) XPC support — PATROL 3.1 added a mechanism whereby libraries written in C or

C++ could be more easily incorporated into PSL directly. This allows C functions to be written and called as if they were native PSL functions using the same syntax. This facility has been used by a number of KMs to improve performance of the more computation intensive aspects of monitoring. The use of these libraries allows PSL to be used where flexibility is important, and C/C++ to be used for those algorithms for which PSL is too expensive.

d) PSL profiler — Although not part of the standard product set, a PSL profiler has

been available as a special build of the PATROL 3.1 agent, and has been used internally as part of the standard KM development tool suite. The profiler measures CPU time and other metrics on a per-PSL-process basis and allows KM developers to detect and correct any areas of PSL requiring tuning.

e) PSL scheduling — Although the agent self-tuning mechanisms remain similar to

those in 3.0, there were a few important changes to the scheduling of PSL processes internally within the agent. The changes were mainly aimed at addressing issues with long-running PSL processes, usually the result of an error, by lowering the PSL process priority after a long period of execution.

Page 57: Patrol Tuning Guide

PATROL 3.2 Performance Aspects

The PATROL 3.2 release also has a number of new features related to performance. Notably, there are important enhancements to the PSL tool suite, with a new optimizer and a profiler that is now dynamic. a) PSL optimizer — The PSL optimizer is a post-compilation optimizer based on well-known optimizations from compiler theory. Some of the optimizations performed include peephole optimization, branch elimination, and loop optimizations.

b) PSL profiler enhancements — The PSL profiler, which was a special build of the

agent for PATROL 3.1, is now standard in the 3.2 agent. This has required profiler implementation changes so that PSL execution is not affected unless profiling is enabled. The profiler can be enabled dynamically at any time via PSL during agent execution to measure the CPU usage (and other metrics) on a per-PSL process basis. This can show a “top PSL processes” measurement for ad-hoc performance diagnosis, and also continues to provide the power of profiling for KM tuning during development.

c) Agent history database — The database mechanism used for the storage of the

agent’s parameter values has been reworked for performance. The new database offers improved performance under conditions when the database is full of values after a long period of agent execution.

Page 58: Patrol Tuning Guide