Operational Performance Metrics in a Distributed System: Part II - Metrics and Interpretation

Robert L. Braddock, P.E., Michael R. Claunch, J. Walter Rainbolt
LORAL Space Information Systems

and

Bert N. Corwin
BOOZ*ALLEN & HAMILTON Inc.

Abstract

This paper presents a set of operational performance metrics to characterize operational distributed systems. Two high-level aggregate metrics, System Availability and System Throughput, are described, with subordinate metrics for each of the two groups identified. Definitions and goals are presented for each specific operational performance metric. Guidelines for interpreting these metrics are described and suggested corrective actions are provided. These metrics assist operations personnel in managing large, complex distributed systems as integrated systems.

1.0 Introduction

Monitoring the health and status of a large, distributed system is a complex task. The increasing prevalence of distributed systems in current computer installations has increased the difficulty of performing systems management and operations. An approach is needed to monitor the performance of a distributed system and to provide for future capacity planning. Monitoring the health and status of such a system requires the implementation of operational performance metrics. While operational performance metrics for centralized systems have been studied extensively [1, 2], little work has been reported on monitoring the performance of operational distributed systems in an integrated manner.

This paper proposes a suite of operational performance metrics for distributed systems. Descriptions for each metric include the data being collected for the metric and the goals in monitoring the metric. Examples of the operational performance metrics are presented, further describing interpretation of the metric, corrective actions to be taken based on problems identified by interpreting the metric, and definition of the appropriate target audience.

2.0 Background

The operational distributed system upon which this study is based provides computing resources for Space Shuttle flight trajectory planning at NASA. The system consists of 270 desktop workstations (WS), 27 fileservers (FS), 2 data nodes (DN), and 4 compute nodes (CN). The data nodes serve as the central data repository and database engine, while the compute nodes provide a centralized computational complex. The components are connected by a high-speed Fiber Distributed Data Interface (FDDI) backbone and a number of Ethernet subnets. The backbone provides connectivity between the subnets; it also connects these subnets to the rest of the system. Each subnet is populated with desktop workstations located in office environments, providing the user interface for end-users. Each subnet is anchored by a file server that provides mass storage for the workstations and connects the workstations to the FDDI backbone. A high-level depiction of the architecture for this system is shown in Figure 2.1.

3.0 Operational Performance Metrics

A set of twelve metrics are proposed to measure the operational performance of a distributed system (Figure 3.1). This set of metrics treats the system as a whole, and when used in combination can point to problems or deficiencies in the system. In general, the metrics consist of two major classes: metrics that affect System Throughput, and metrics that affect System Availability. System Throughput is defined as the quantity of jobs completed by the overall system per unit of time [7]. System Availability is the time the system is available to perform work. Although System Throughput and System Availability can be directly measured, subordinate operational performance metrics provide early indications of specific System Throughput and System Availability problems.

As shown in Figure 3.1, measurement of these operational performance metrics spans all nodes in the system.


[Figure 2.1. Architecture. Diagram residue: the FDDI backbone LAN connects CD4680 compute nodes and data nodes, an Epoch archive server, CD4320 and CD4360 file servers and workstations, an external server, a network manager, and JSC workstations in NASA Bldg. 30; peripherals include Epson and Macintosh/Equity systems. Legend: CD4XX0 - Control Data 4000 series.]

The goal of the operational performance metrics is to provide tools to manage a distributed environment as an integrated system. This is accomplished by implementing multiple measurement capabilities on each node, allowing for correlation and comparison among metrics. Other metrics might be added to this suite, depending on unique system needs and experience with the system. The operational performance metrics are intended for system-level diagnosis of problems. In the event of a problem, detailed, low-level metrics would be employed to isolate the cause.

Operational Perf. Metric        Description                                           Node
--------------------------------------------------------------------------------------------
I. System Throughput            Aggregate - jobs per unit time                        all nodes
   Network Utilization          Amount of data being transferred and percent of      Network
                                theoretical maximum rate attained
   CPU Utilization              Percent utilization over a time interval             all nodes
   Swap Space                   Swap space used vs. physical memory available        all nodes
   Batch Queue Length           Number of jobs in queue                              CN
   Turnaround Time              Time for a job to execute                            CN
   Interactive Response         Time for a trivial operation to execute              WS
II. System Availability         Aggregate - system "up-time" and stability           all nodes
   Disk Utilization             Percent volume of disk used                          CN, FS, DN
   Workstation Utilization      Number of users on the system at an instant of       WS
                                time and time of maximum number of users
   Print Volume                 Number of pages printed per printer per unit time    Printers
   Discrepancy Reports (DR's)   Number of total, solved, outstanding DR's and        all nodes
                                the closure rate

Figure 3.1. List of Operational Performance Metrics.


3.1 System Throughput Metrics

System Throughput metrics provide an overall indication of the amount of work an operational computer system is performing, and the sub-factors affecting this overall quantity of work. This metric, although measured directly, is actually an aggregate metric. The purpose of this metric is to measure the overall performance and productivity of the distributed system.

Effects of system tuning and reconfiguration will be observed by monitoring this metric. Although this metric does provide an overall quantification of the work performed by the system, it is insufficient to monitor this metric exclusively. More detailed metrics must be collected and monitored to pro-actively affect System Throughput. These subordinate metrics include Network Utilization, CPU Utilization, Swap Space, Batch Queue Length, Turnaround Time, and Interactive Response Time.

System Throughput, as an aggregate metric, is monitored and reported to Operations Managers as an indicator of the overall performance and productivity of the distributed system. Significant trends in the data reported, however, should have already been observed and action taken based on subordinate System Throughput metrics by either Operations Personnel or Administrators.

3.1.1 Network Utilization

Network Utilization, a System Throughput metric, provides an overview of network activity for a distributed system. The purpose of this metric is to present peak and average network usage for each network type within the distributed system.

Figure 3.2 is an example of a graph used to monitor Network Utilization. Its components include aggregate daily network activity sampled every 5 minutes on both the IEEE 802.3 Ethernet subnets and the FDDI backbone, as well as peak values observed during 5-minute sample periods. Peak activity generally follows an eight-hour workday, with peak average network activity occurring during mid-morning and early afternoon. Assuming there is only one work-force shift from approximately 8:00 am to 5:00 pm, correlation can be made between Network Utilization and the number of active users logged into the distributed system. Network activity may also increase near midnight, as shown in the example, capturing a network load generated by deferred batch programs scheduled to execute during the night (off-peak).
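As a concrete illustration, the sketch below (Python; the counter source and the link speeds are assumptions, since the paper does not specify a collection mechanism) derives percent utilization from two readings of an interface byte counter taken one 5-minute sample period apart, then rolls the samples up into the daily peak and average:

    # Sketch: percent Network Utilization from interface byte counters.
    # Any SNMP or OS-level byte counter could supply the two readings.

    SAMPLE_SECS = 5 * 60          # 5-minute sample period, as in Figure 3.2
    ETHERNET_BPS = 10_000_000     # assumed 10 Mb/s Ethernet subnet maximum
    FDDI_BPS = 100_000_000        # assumed 100 Mb/s FDDI backbone maximum

    def utilization_pct(bytes_before: int, bytes_after: int, link_bps: int) -> float:
        """Percent of the theoretical maximum rate attained in one sample period."""
        bits_moved = (bytes_after - bytes_before) * 8
        return 100.0 * bits_moved / (link_bps * SAMPLE_SECS)

    def daily_summary(samples: list[float]) -> tuple[float, float]:
        """Peak and average utilization over all of a day's sample periods."""
        return max(samples), sum(samples) / len(samples)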

[Figure 3.2. Network Utilization (Backbone and Subnets) - Daily. Sample and average utilization for the Ethernet subnets and the FDDI backbone plotted against hour of day (0-22), with peak subnet and peak backbone values marked.]


Figure 3.2 would be used as a daily observation tool capturing hourly network usage patterns and identifying delay problems. In addition, average Network Utilization is plotted. The backbone and subnet metrics are reported separately due to the very different characteristics and thresholds associated with the FDDI Token Ring backbone versus Ethernet subnets.

The average and peak utilizations from the daily chart are carried over to a weekly chart. In this way, trends in peak and average Network Utilization are monitored and trends projected. Trends may be observed indicating either increased or decreased network activity. Further, the long-term continuance of a trend in either direction requires further investigation. Decreases in average Network Utilization may indicate an overall decrease in system activity or an increase in collisions on an Ethernet network. Peak Network Utilization may also have the same symptoms.

This metric provides useful information to both Administrators and Operations Personnel. For Operations Personnel, periodic sampling provides a frequency chart indicating daily workload trends and heavy data transfer activity at a summary level. For Administrators, this chart is used as a reference for normal daily activity when tracing response time problems in a distributed system. Further, Administrators may generate subnet-level reports, including monitoring packet rates, collisions, and transmission error rates, to provide a snapshot of user and system network activity. As described above, these detailed investigative metrics are used to isolate problems identified with the aggregate metrics.

3.1.2 CPU Utilization

CPU Utilization, a System Throughput metric, provides a measure of time that system CPU's are in service. The purpose of collecting this metric is to report the amount of time the CPU is busy processing during a specified interval.

Nodes in a distributed system dedicated to multi-user processing are key elements requiring monitoring of CPU Utilization. For nodes with user-dedicated CPU's, such as workstations, CPU Utilization is not as valuable and is therefore not monitored at the system level. Note that peak CPU Utilization, in this example distributed system, is of minimum benefit and is therefore not reported, since every node will instantaneously peak at very high CPU Utilization at some time during operations.

Figure 3.3 provides a typical summary of average CPU Utilization for multi-user nodes in the distributed system as an aggregate by node type (i.e., data nodes, compute nodes, and fileservers). When a node approaches an average utilization of 70 to 90%, corrective action must be taken [3], dependent upon the type of node. Corrective actions include 1) supplementing the processing capabilities via upgrade and 2) changing the distribution of work within the overall system. In Figure 3.3, compute node utilization is increasing steadily through the early sample periods. In order to reverse this trend, a re-allocation of work can be performed, allocating small, demand jobs to the workstations. As shown in the figure, such a re-allocation of work is successful in reversing the compute node trend. The data nodes also show a trend of increased CPU Utilization.
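A minimal sketch of the threshold check described above, assuming the weekly averages have already been aggregated by node type (the data values and function names are illustrative):

    # Sketch: flag node types whose weekly average CPU Utilization enters
    # the 70-90% corrective-action band cited from [3].

    ACTION_BAND = (70.0, 90.0)

    def nodes_needing_action(avg_util_by_type: dict[str, float]) -> list[str]:
        lo, hi = ACTION_BAND
        return [node for node, util in avg_util_by_type.items() if util >= lo]

    # Example: compute and data nodes trending up would trigger either a
    # re-allocation of small demand jobs to workstations or a CPU upgrade.
    print(nodes_needing_action({"DN": 74.2, "CN": 81.5, "FS": 43.0}))  # ['DN', 'CN']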

[Figure 3.3. CPU Utilization - Weekly.]


[Figure 3.4. Swap Space Summary.]

For this type of processing, however, re-allocation opportunities are not available. An upgrade is available for the data node CPU, providing significantly more compute power within the existing chassis. By pro-actively monitoring this metric and keeping in mind the "Lead Time to Procure", the CPU upgrade solves the data node CPU Utilization problems, preventing significant delays in overall System Throughput.

CPU Utilization, at a system level, is monitored by Operations Personnel to observe overall trends in compute requirements for each major node type. In addition, Administrators monitor node-level reports of CPU Utilization to ensure individual node trends do not show abnormal increases or decreases.

3.1.3 Swap Space

The Swap Space metric, a significant component of overall System Throughput, provides early indications of insufficient physical memory. The purpose of the Swap Space metric is to report the ratio of swap space in use divided by the amount of physical main memory.

Empirical "warning levels" have been identified for the Swap Space metric. For diskless workstations, this warning level is a ratio value between two and three; the warning level for diskful workstations, compute nodes, data nodes, and fileservers is between three and four (Figure 3.4). Swap Space ratios which consistently fall within these warning levels are likely to have significant negative impact on System Throughput. When these thresholds are exceeded, Interactive Response Time and Turnaround Time will increase and limit the production capability of the system.
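The ratio and its warning bands reduce to a few lines; the sketch below (Python; the memory figures and the diskless/diskful split are illustrative) shows the computation:

    # Sketch: Swap Space ratio against its empirical warning levels.

    WARN_DISKLESS = (2.0, 3.0)   # warning band for diskless workstations
    WARN_DISKFUL  = (3.0, 4.0)   # warning band for diskful WS, CN, DN, FS

    def swap_ratio(swap_in_use_bytes: int, physical_memory_bytes: int) -> float:
        """Swap space in use divided by physical main memory."""
        return swap_in_use_bytes / physical_memory_bytes

    def in_warning_band(ratio: float, diskless: bool) -> bool:
        lo, hi = WARN_DISKLESS if diskless else WARN_DISKFUL
        return lo <= ratio <= hi

    # A diskless workstation with 16 MB of memory and 40 MB of swap in use:
    r = swap_ratio(40 * 2**20, 16 * 2**20)      # ratio = 2.5
    print(in_warning_band(r, diskless=True))    # True -> memory upgrade likely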

In a diskless environment, network traffic will increase to further reduce the production capability of the system.

Although some system tuning and operational policy changes may be possible to reduce the effects of insufficient memory, typical scenarios will require memory upgrades for problem nodes of the system.

Summary metrics are reported for each node type for Operations Personnel. Administrators, however, will pro-actively monitor Swap Space for each node, ensuring that trends are found before overall System Throughput becomes unacceptable.

3.1.4 Batch Queue Length

Batch Queue Length, as a component of the System Throughput metric, provides valuable insight into how well jobs are being serviced by the system and the quantity of "backlog" which is processed during non-peak hours. By reporting this metric, the mix of work on the operational system may be monitored.

Figure 3.5 provides a sample report of the daily Batch Queue Length metric. During prime operations, Batch Queue Length indicates the number of jobs not receiving service. The batch queue begins to build as user load on the system increases, with a noticeable increase occurring near the lunch hour [4]. During lunch, while user inputs are few, the queue length typically is reduced as more jobs are serviced than are received. During the afternoon, however, additional jobs are submitted, with another peak of activity near the end of the day for users. Batch jobs are processed during the night while no new user inputs are received. If this number is not reduced to zero during off-hour operations, turnaround in the system is a problem.

For longer term planning, weekly Batch Queue Length charts are used (Figure 3.6), plotting the number of jobs remaining in the queue at 6:00 am. If the number of days with jobs in the queue becomes more frequent, System Throughput may be affected.


[Figure 3.5. Batch Queue Length - Daily. Actual queue length plotted against time of day (0-23).]

A large number of batch jobs still being serviced at 6:00 am, as shown on the curve in Figure 3.6, may point to other performance problems in the system limiting the servicing of batch jobs, or simply to an increase in the number of jobs. Significant decreases in overall batch jobs in queue may be correlated with reductions in the quantity of batch jobs submitted and a correlating increase in interactive jobs, affecting the interactive response of the system.

Operations Personnel monitor both the daily and weekly Batch Queue Length metric to ensure System Throughput is acceptable. Steadily increasing weekly metrics will be correlated with CPU Utilization, Network Utilization, and Swap Space to determine the cause for incomplete batch queue service.
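The weekly 6:00 am check lends itself to a simple residue test; a sketch follows (Python; the day-to-count mapping and the two-day trigger are illustrative, not from the paper):

    # Sketch: weekly Batch Queue Length check. queue_at_6am maps each day to
    # the number of jobs still in the CN batch queue at 6:00 am; any nonzero
    # residue means overnight processing did not clear the backlog.

    def days_with_residue(queue_at_6am: dict[str, int]) -> list[str]:
        return [day for day, jobs in queue_at_6am.items() if jobs > 0]

    week = {"Mon": 0, "Tue": 0, "Wed": 3, "Thu": 7, "Fri": 12}
    bad_days = days_with_residue(week)
    if len(bad_days) >= 2:
        # Correlate with CPU Utilization, Network Utilization, and Swap Space
        # to find the cause of incomplete batch service (see text above).
        print("Investigate:", bad_days)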

3.1.5 Turnaround Time

Turnaround Time, a System Throughput metric, represents the total time to complete all processing for each job. The purpose of this metric is to define the speed at which jobs are being processed by the system, as well as the relative distribution of job Turnaround Time.

Figure 3.7 provides a plot of the number of jobs versus Turnaround Time for all jobs in the distributed system, as well as cumulative job turnaround time. The lower curve shows the distribution of jobs for specific turnaround times. The cumulative curve allows identification of the percent of jobs complete per turnaround time. Specific percentiles would then be selected for study, to determine the long-term trend of the Turnaround Time metric.

[Figure 3.6. Batch Queue Length - Weekly. Sample queue length and trend plotted over time.]


[Figure 3.7. Turnaround Time - Daily. Distribution of jobs by turnaround time, with the cumulative curve marked at the 25%, 50%, and 95% percentiles.]

In Figure 3.7, three percentile ranges are specified: 25%, 50%, and 95%. A plot is then made of the time it takes 25% of the jobs to complete through time. The Operations Personnel would employ this plot to determine the effects of system upgrade or reconfiguration. Upward trends, suggesting increased time to satisfy batch jobs, would be compared against other System Throughput metrics to determine the causes.
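Extracting those percentile points from a day's job records is straightforward; the sketch below (Python; the job times are illustrative) computes the three values that would be tracked week over week:

    # Sketch: 25th/50th/95th percentile Turnaround Time from one day's jobs.
    import statistics

    def turnaround_percentiles(times_minutes: list[float]) -> dict[int, float]:
        # n=100 yields 99 cut points; index k-1 is the k-th percentile.
        q = statistics.quantiles(times_minutes, n=100, method="inclusive")
        return {25: q[24], 50: q[49], 95: q[94]}

    jobs = [4, 6, 7, 9, 11, 12, 15, 18, 25, 31, 42, 58, 90, 120]
    print(turnaround_percentiles(jobs))   # plot these points through time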

Further, the Turnaround Time metric ties very closely to the workload models. The workload models would be used to project the amount of work specific nodes and the entire system could perform. This is generally characterized as jobs which run on various nodes. A projection of the job distribution is performed during the development of the system [3], and this can then be correlated to the Turnaround Time metric to validate the system design.

3.1.6 Interactive Response Time

Interactive Response Time, the last System Throughput metric, provides insight into the ability of the system to perform interactive user tasks. The purpose of this metric is to measure the time it takes for an interactive process to respond to a user input, such as opening a window or responding to an operating system command [5].
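One plausible way to collect this metric, sketched below (Python; the probe command is a stand-in, as the paper does not name one), is to periodically time a trivial operating system command and keep the minimum, average, and maximum for the system-level report:

    # Sketch: time a trivial command as an Interactive Response Time probe.
    import subprocess
    import time

    def probe_response_seconds(cmd: list[str] = ["ls", "/tmp"]) -> float:
        start = time.perf_counter()
        subprocess.run(cmd, capture_output=True, check=True)
        return time.perf_counter() - start

    # Sample periodically; report aggregate min/average/max response times.
    samples = [probe_response_seconds() for _ in range(10)]
    print(min(samples), sum(samples) / len(samples), max(samples))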

Interactive Response Time is the most noticeable performance metric for the user community. Close monitoring of this metric at both a system and node level is necessary to maintain user productivity in a distributed system. As Interactive Response Time increases, the amount of time between user actions on the system also increases, directly reducing productivity [6]. Increasing trends in this metric could be the result of increased job size, number of concurrent processes, and other workload-related changes, as well as subtle system tuning or reconfiguration changes. Further investigative metrics may be required to accurately determine the cause of the trends. In addition, other System Throughput metrics, such as Network or CPU Utilization, may provide insight into the source of the degradation.

At a node level, Administrators must perform trend assessment on both a daily and weekly basis. At a system level, Operations Personnel should pro-actively monitor aggregate maximum, minimum, and average response times to assist in long-range planning.

3.2 System Availability

System Availability metrics provide an overall indication of the percent of time system components are operational. System Availability, although measured directly, is actually an aggregate metric. The purpose of this metric is to identify system capability to sustain needed operations, providing insight into system capacity and stability.

Effects of system tuning, problem resolution, and reconfiguration will be observed by monitoring this metric. This metric provides an overall indication of the availability of the system. Further, trends observed in the data from this metric are used to indicate the overall health and status of the system. More detailed metrics must be collected and monitored to pro-actively affect System Availability. These metrics include Disk Utilization, Workstation Utilization, Print Volume, and Discrepancy Reports.


The System Availability metric is plotted as the percentage of time the system components are operational over time. As an aggregate metric, it is monitored and reported to Operations Managers as an indicator of the overall operational readiness of the distributed system. As with the aggregate metric System Throughput, significant trends in the data reported should have already been observed and action taken based on the subordinate System Availability metrics by either Operations Personnel or Administrators.
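The paper does not give an aggregation formula; one simple reading, averaging per-component uptime over a scheduled window, is sketched below (Python; the downtime figures and component names are illustrative):

    # Sketch: aggregate System Availability as the percentage of scheduled
    # time that components were operational, averaged across components.

    def system_availability_pct(downtime_hours: dict[str, float],
                                scheduled_hours: float) -> float:
        n = len(downtime_hours)
        up = sum(scheduled_hours - dt for dt in downtime_hours.values())
        return 100.0 * up / (n * scheduled_hours)

    week = {"DN-1": 0.0, "DN-2": 2.5, "CN-1": 0.0, "CN-2": 0.0, "FS-07": 6.0}
    print(round(system_availability_pct(week, 7 * 24), 2))   # 98.99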

3.2.1 Disk Utilization

Disk Utilization is a significant component of the System Availability metric. The purpose of this metric is to report disk usage over time and identify trends which could reduce System Availability.

Operations Personnel pro-actively monitoring the Disk Utilization metric may observe that disk usage has been increasing steadily, exceeding a pre-defined "high-water mark" (Figure 3.8). A decision may be made to change the operational data policies of the system, reducing the time that data is stored on the system before deleting or archiving. This change in operational system policies may effectively reverse the increasing Disk Utilization trend, eliminating the risk to System Availability. As an example, the storage policy on time-sensitive data might be changed to archive files after 3 months rather than after 5 months, resulting in a significant disk savings, as shown in Figure 3.8.
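A sketch of the high-water-mark test, with a naive linear projection for judging whether the "Lead Time to Procure" is at risk (Python; the 85% mark and the monthly series are illustrative assumptions):

    # Sketch: Disk Utilization against a high-water mark, plus a linear
    # projection of months remaining until the disk is full.

    HIGH_WATER_PCT = 85.0

    def months_until_full(used_pct_by_month: list[float],
                          capacity_pct: float = 100.0) -> float:
        """Project from the last two monthly samples; inf if flat or falling."""
        growth = used_pct_by_month[-1] - used_pct_by_month[-2]
        if growth <= 0:
            return float("inf")
        return (capacity_pct - used_pct_by_month[-1]) / growth

    usage = [62.0, 66.5, 71.0, 76.0, 81.5, 87.0]
    if usage[-1] > HIGH_WATER_PCT:
        print(f"Over high-water mark; ~{months_until_full(usage):.1f} months to full")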

In many cases, however, an upward trend observed in the metrics may be positively influenced on a temporary basis, but not alleviated. System Administrators may be tasked to analyze and define the detailed drivers affecting disk increase using further investigative metrics. After this analysis, a new projection based on Disk Utilization metrics may still indicate an increase in disk usage. Having exhausted the options to operationally manage the problem through system policy changes, there may be no other choice than to add more disk space to the system. The most critical focus at this point is the "Lead Time to Procure" additional disk (Figure 3.8).

3.2.2 Workstation Utilization

Workstation Utilization, a System Availability metric, captures the overall usage patterns of the end-user terminal devices for the distributed system. The purpose of this metric is to quantify the number of workstations which contain active login sessions over time. Once a system is operational, this metric provides a realistic view of the number of workstations in use and helps in future workstation capacity planning.
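Counting workstations with active sessions at a sample instant reduces to the sketch below (Python; the session data source is hypothetical, e.g. rwho output in an installation of this era):

    # Sketch: Workstation Utilization as the count and percentage of
    # workstations with at least one active login session.

    def workstations_in_use(sessions: dict[str, list[str]]) -> tuple[int, float]:
        total = len(sessions)
        active = sum(1 for users in sessions.values() if users)
        return active, 100.0 * active / total

    sessions = {"ws001": ["jsmith"], "ws002": [], "ws003": ["alee", "bkim"]}
    print(workstations_in_use(sessions))   # (2, ~66.7)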

Figure 3.9 shows a graph of the number of workstations used during the day. Peak utilization occurs around 10:00 am and 2:00 pm. Operations Personnel use weekly summaries of the daily peak and average values to provide for more long-range planning. Although both peak and average utilization are plotted, of most concern is the general trend associated with the average Workstation Utilization.

[Figure 3.8. Disk Utilization - Monthly. Total disk available, high-water mark, and disk used plotted over time, with the "Lead Time to Procure" window annotated.]


Further, when the peak utilization approaches 100%, there is a need for additional workstations or a change in system operations policies. Flexible or extended work hours could result in a decrease in the peak utilization by allowing users to begin their work early or end late. In addition, increasing average trends may force the addition of more workstations in order to sustain System Availability, independent of changes in operations policies.

3.2.3 Print Volume

Print Volume, another System Availability metric, provides an overall view of the total printer activity for the distributed system. The goal of this metric is to provide the aggregate printer volume, collected from each printer "page meter", divided by the total capacity of all printers.

Aggregate printer volume for a given time period is reported for the Print Volume metric. A theoretical print maximum is calculated as the number of printers times the number of pages per hour times the number of hours per shift. The total number of pages printed is then compared to this theoretical limit. Peak periods every month may drive the printer usage beyond that which can occur in a single operational shift. After review, a decision to absorb this excess printing each month may be made rather than augmenting the system printer capability. At a more detailed level, low print volume for individual printers when compared to average printer volume indicates possible inaccessibility of printers for users or poor logical addressing within the distributed system.
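The theoretical maximum defined above is a simple product; a sketch follows (Python; printer count, page rate, and shift length are illustrative figures):

    # Sketch: Print Volume against the theoretical single-shift maximum.

    def print_capacity_pages(printers: int, pages_per_hour: int,
                             shift_hours: int) -> int:
        return printers * pages_per_hour * shift_hours

    def volume_pct_of_capacity(pages_printed: int, capacity: int) -> float:
        return 100.0 * pages_printed / capacity

    cap = print_capacity_pages(printers=12, pages_per_hour=240, shift_hours=8)
    print(volume_pct_of_capacity(25_000, cap))   # >100% => peak exceeded one shift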

Administrators monitor individual printer volumes on a periodic basis to check for proper physical and network location. Operations Personnel monitor the aggregate printer volume over time to ensure sufficient reserve capacity is present to handle peak workloads.

3.2.4 Discrepancy Reports (DR’s)

Discrepancy Reports are a significant driver in System Availability. The purpose of this metric is to provide an indication of the stability of the system and the speed at which problems in the system are resolved.

Figure 3.10 is an example of the cumulative DR count and a count of the number of DR's resolved per unit time. The cumulative DR count increases rapidly during initial system operations, but begins to stabilize as the months progress. This is typical of system operations, indicating increased system stability as well as improved user familiarity. The number of resolved DR's identifies the rate of problem resolution within the system. Initially in Figure 3.10, problems are reported faster than they can be resolved. After four months of operations the trend has reversed, with problem resolution beginning to reduce the DR backlog, resolving problems at a much faster rate than problems are reported.
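The closure-rate reading of Figure 3.10 can be made explicit; in the sketch below (Python; the weekly counts are illustrative), a sustained rate above 1.0 corresponds to the backlog reversal described above:

    # Sketch: weekly Discrepancy Report closure rate (closed per reported).

    def closure_rate(closed_per_week: list[int],
                     reported_per_week: list[int]) -> list[float]:
        return [c / r if r else float("inf")
                for c, r in zip(closed_per_week, reported_per_week)]

    reported = [40, 35, 30, 22, 18, 12]
    closed   = [20, 24, 28, 26, 22, 15]
    # Rates climb past 1.0 once resolution outpaces new reports.
    print([round(x, 2) for x in closure_rate(closed, reported)])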

[Figure 3.9. Workstation Utilization - Daily. Utilization plotted against time of day from 8:00 am through noon to 5:00 pm.]


[Figure 3.10. Discrepancy Reports (Weekly). Total DR's and closed DR's plotted as percentages (10% to 100%) over time.]

Finally, nearing month twelve, the problem resolution rate slows to reflect the decreased backlog and decreased occurrence of problems in the operational system.

Although Administrators are the actual individuals who investigate and resolve problems, Operations Personnel are the primary monitors of the Discrepancy Reports metric in a distributed system.

5.0 Summary

A set of twelve metrics are proposed to measure the operational performance of a distributed system. The goal of these metrics is to capture the health of the system, determine the appropriateness of the current system configuration, and provide tools for future planning. The proposed metrics are based on an actual operational distributed system in use at NASA. The intent of these metrics is to provide operations personnel the tools to manage a distributed environment as an integrated system. The specific operational performance metrics would be customized for individual distributed systems.

6.0 References

[1] Leung, C. H., Quantitative Analysis of Computer Systems, John Wiley & Sons, 1988, pp. 1-4.
[2] Ferrari, D., Computer Systems Performance Evaluation, Prentice-Hall, 1978, pp. 1-22, 225-232.
[3] Silberschatz, A. and J. L. Peterson, Operating System Concepts, Addison-Wesley, 1988, pp. 158-159.
[4] Lazowska, E. D., J. Zahorjan, G. S. Graham, and K. C. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models, Prentice-Hall, 1984, pp. 4, 14-23.
[5] Cortada, J. W., Managing DP Hardware, Prentice-Hall, 1983, pp. 51-105.
[6] Artis, H. P. and J. Buzen, Capacity Planning, TTI Seminar, 1989, pp. 6.014, 8.28.

Acknowledgements

We would like to thank C. Deiterich, C. Gilbert, J. Powell, and all of the FADS Engineers, who provided encouragement and support during the development and initial operations of the new system. Special thanks are extended to R. Bates, J. Newman, and T. Pelnik for their diligent peer review of this paper; they greatly improved its quality.
