Transcript of Ch24 system administration

Page 1: Ch24 system administration

Performance Analysis

Chapter 24

Page 2: Ch24 system administration

Chapter Goals

• Understand the basic terminology of performance monitoring and analysis.

• Understand proper methods of monitoring a system’s performance.

• Know the tools that allow you to monitor system performance.

• Understand how to analyze the data provided by the monitoring tools.

• Understand how to apply the data to improve system performance.

• Understand what to tune, and why to tune it.

Page 3: Ch24 system administration

General Performance Tuning Rules

• Right-size the system to start with.

– You do not want to start with an overtaxed system with the intention of providing a turbo-charged service. UNIX is very demanding on hardware. UNIX generally provides each process with (the illusion of) unlimited resources. This often leads to problems when system resources are taxed. Windows operating systems and applications often understate system requirements. The OS and/or applications will operate in a sparse environment, but the performance is often abysmal.

Page 4: Ch24 system administration

General Performance Tuning Rules

• Determine the hardware requirements of specific types of servers.
– Generally, e-mail and web servers require high-throughput network links, and medium to large memory capacity. Mail servers typically require significantly more disk space than web servers. Database servers typically require large amounts of memory, high capacity, high-speed disk systems, and significant processing elements. Timeshare systems require significant processing elements, and large amounts of memory.

Page 5: Ch24 system administration

General Performance Tuning Rules

• Monitor critical systems from day one in order to get a baseline of what “normal” job mixes and performance levels are for each system.

• Before making changes to a system configuration, make sure user jobs are not causing problems.
– Check for rabbit jobs, users running too many jobs, or jobs of an inappropriate size on the system.

• A performance problem may be temporary, so you need to think through any changes before you implement them.
– You might also want to discuss proposed changes with other system administrators as a sanity check.

Page 6: Ch24 system administration

General Performance Tuning Rules

• Once you are ready to make changes, take a scientific approach to implementing them.
– You want to ensure that the impact of each change is independently measurable.
– You also want to make sure you have a goal in mind, at which point you stop tuning and move on to other projects.

• Before you begin making changes to the system, consider the following.
– Always know exactly what you are trying to achieve.
– Measure the current system performance before making any changes.
– Make one change at a time.

Page 7: Ch24 system administration

Change Rules

– Once you do make a change, make sure to monitor the altered system for a long enough period to know how the system performs under various conditions (light load, heavy load, I/O load, swapping).

– Do not be afraid to back out of a change if it appears to be causing problems.

• When you back a change out, go back to the system configuration immediately previous to the “bad” configuration. Do not try to back out one change and insert another change at the same time.

– Take copious notes.
• These are often handy when you upgrade the OS and have to start the entire process over.

Page 8: Ch24 system administration

Resource Rules

• Install as much memory as you can afford.
• Disk systems can also have a substantial impact on system performance.
• Network adapters are well-known bottlenecks.
• Eliminate unused drivers, daemons, and processes on the system.
• Know and understand the resources required by the applications you are running.

Page 9: Ch24 system administration

Terminology

• Bandwidth:

– The amount of a resource available. If a highway contains four lanes (two in each direction), each car holds four people, and the maximum speed limit allows 6 cars per second to pass over a line across the road, the “bandwidth” of the road is 24 people per second. Increasing the number of lanes will increase the bandwidth.

• Throughput:
– Percentage of the bandwidth you are actually getting. Continuing with the road example, if the cars only hold one person, the protocol is inefficient (not making use of the available capacity). If traffic is backed up due to an accident and only one or two cars per second can pass the line, the system is congested, and the throughput is impacted. Likewise, if there is a toll booth on the road, the system experiences delays (latency) related to the operation of the toll booth.

Page 10: Ch24 system administration

Terminology

• Utilization:

– How much of the resource was used. It is possible to use 100% of the resource, and yet have 0% throughput (consider a traffic jam at rush hour).

• Latency:
– How long it takes for something to happen. In the case of the road example, how long does it take to pay the toll?

• Response time:
– How long the user thinks it takes for something to occur.

• Knee:
– Point at which throughput starts to drop off as load increases.

Page 11: Ch24 system administration

Terminology

• Benchmark:

– Set of statistics that (hopefully) shows the true bandwidth and/or throughput of a system.

• Baseline:
– Set of statistics that shows the performance of a system over a long period of time.
– Instantaneous data about the system’s performance is rarely useful for tuning the system. But long-term averages alone are not very useful either, as peaks and valleys in the performance graph tend to disappear over time.

– You need to know the long-term performance characteristics, as well as the “spikes” caused by short-lived processes. A good way to obtain long-term (and short-term) information is to run the vmstat command every five seconds for a 24-hour period. Collect the data points, reduce/graph these data points, and study the results.
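As a concrete illustration of building such a baseline, the sketch below collects 24 hours of vmstat samples at five-second intervals; the log location and file naming are arbitrary choices for the example.

    #!/bin/sh
    # Collect a 24-hour vmstat baseline: 5-second samples, 17280 samples total.
    LOGDIR=/var/tmp/perf                    # any scratch area with enough space
    OUT=$LOGDIR/vmstat.`date +%Y%m%d`
    mkdir -p $LOGDIR
    # vmstat <interval> <count>; prefix each sample with a timestamp for later graphing.
    vmstat 5 17280 | while read line
    do
        echo "`date '+%Y-%m-%d %H:%M:%S'` $line"
    done > $OUT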

Page 12: Ch24 system administration

Windows Monitoring

• Task Manager
• The Cygwin package allows the administrator to build and install several UNIX tools to monitor system performance.
• For sites that do not use the Cygwin toolkit, there are several third-party native Windows tools that might be useful when you need to monitor system performance. Among these tools are:
– vtune http://developer.intel.com/software/products/vtune/

– SysInternals http://www.sysinternals.com/

Page 13: Ch24 system administration

UNIX Monitoring

• ps

• top

• vmstat

• iostat

• nfsstat

• netstat

• mpstat

• accounting

Page 14: Ch24 system administration

UNIX Monitoring

• Most versions of UNIX ship with an accounting package that can monitor system performance and record information about commands used.
– Many sites run the detailed system accounting package in order to bill departments/users for the computing resources they consume.

– The accounting packages can also be very useful tools for tracking system performance.

– Although the accounting information is generally most useful as a post-mortem tool (after the process has completed), it is sometimes possible to gather semi real-time information from the system accounting utilities.

– System auditing packages can give a lot of information about the use of the system, but these packages also add considerable load to the system.

• Process accounting utilities will generally add 5% overhead to the system load, and auditing utilities can add up to 20%.

Page 15: Ch24 system administration

Accounting

• Why run accounting?

– Bill for resources used.
• CPU time used
• Memory used
• Disk space used
• Printer page accounting

– Detailed job flow accounting (Banks/Insurance/Stock trading)
• Keep track of every keystroke
• Keep track of every transaction

– Security
• Track every network connection
• Track every local login
• Track every keystroke

Page 16: Ch24 system administration

Accounting

• Two types of accounting
– Process accounting
• Track what commands are used
• Track what system calls are issued
• Track what libraries are used
• Good for security (audit trail)
• Good when multiple users have access to the system
• Good way to track what utilities and applications are being used, and who is using them.

Page 17: Ch24 system administration

Accounting

– Detailed accounting
• Track every I/O operation
– Disk
– Tape
– tty
– Network
– Video
– Audio

• Primarily used for billing

Page 18: Ch24 system administration

Accounting

• Charging for computer use
– Almost unheard of in academia (today).
• Some universities charge research groups for CPU time.
• Some universities charge for printer supplies.
• Some universities charge for disk space and backups.

– Most companies that run accounting have a central computing facility.
• Subsidiaries buy computing time from the central group.
• Accounting is used to pay for support, supplies, …

Page 19: Ch24 system administration

Accounting

• Why avoid accounting?
– Log files are huge
• Must have disk space for them.
– 15 minutes of detailed accounting on a system with one user generated a 20 MB log file!
– 15 minutes of process accounting on a system with one user generated a 10 MB log file!
• Must have (and bill) CPU time for accounting.
– Accounting can require a lot of CPU/disk resources.
– Who will pay for the CPU/disk resources used by accounting?

• Must decide what information to keep, and what to pitch.

Page 20: Ch24 system administration

Accounting

• What can accounting track?
– Some of the common things to track:
• CPU time
• Memory usage
• Disk usage
• I/O usage
• Connect time
• Dial-up/Dial-out usage
• Printer accounting

Page 21: Ch24 system administration

Accounting

• Solaris
– Auditing
• Perform audit trail accounting
• Relies on the Basic Security Module (BSM).
• Can monitor TONS of stuff:
– Processes
– Function/subroutine calls
– System calls
– Ioctls
– Libraries loaded
– File operations (open, close, read, write, create, remove)
– File system operations (stat, chmod, chown, …)
– Can configure to monitor successful/unsuccessful operations
– Can monitor on a per-user basis

Page 22: Ch24 system administration

Accounting

• Solaris
– Audit binaries
• auditconfig
• auditd – the audit daemon
• praudit – print audit information
• auditon – turn on auditing
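A hedged sketch of poking at the BSM audit setup with the binaries listed above; exact options and file locations vary by Solaris release, so treat these as illustrative.

    # Is auditing active? ("audit condition = auditing" when BSM is enabled)
    auditconfig -getcond
    # Which audit classes are collected for new sessions?
    grep '^flags:' /etc/security/audit_control
    # Print the current (not yet terminated) audit trail in human-readable form.
    praudit /var/audit/*.not_terminated.`hostname`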

Page 23: Ch24 system administration

Accounting

• Solaris
– Audit files
• Control files in /etc/security
– audit_class
– audit_control
– audit_data
– audit_event
– audit_startup
– audit_user
– audit_warn
– device_allocate
– device_maps

Page 24: Ch24 system administration

Accounting

• Solaris
– Audit files
• Data files in /var/audit
– YYYYMMDDHHMMSS.YYYYMMDDHHMMSS.hostname
– YYYYMMDDHHMMSS.not_terminated.hostname

Page 25: Ch24 system administration

Accounting

• Solaris
– Accounting
• Daily Accounting
• Connect Accounting
• Process Accounting
• Disk Accounting
• Calculating User Fees

Page 26: Ch24 system administration

Accounting

• Solaris
– Accounting
• /usr/lib/acct/acctdisk
• /usr/lib/acct/acctdusg
• /usr/lib/acct/accton
• /usr/lib/acct/acctwtmp
• /usr/lib/acct/closewtmp
• /usr/lib/acct/utmp2wtmp

Page 27: Ch24 system administration

Accounting

• Solaris
– Accounting binaries
• acctcom – search/print accounting files
• acctcms – generate command accounting from logs
• acctcon – generate connect-time accounting from login records
• acctmerg – merge multiple account forms into a report
• acctprc – programs to generate process accounting logs
• fwtmp – manipulate connect accounting records
• runacct – run daily accounting summary
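A hedged example of starting process accounting and reading the results back with acctcom; the pacct path matches the data-file slide that follows, and the options shown are the commonly documented ones.

    # Start process accounting, writing records to the standard pacct file.
    touch /var/adm/pacct
    /usr/lib/acct/accton /var/adm/pacct
    # ...later: per-command CPU and memory figures, most recent commands first.
    acctcom -b /var/adm/pacct
    # Turn process accounting back off.
    /usr/lib/acct/accton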

Page 28: Ch24 system administration

Accounting

• Solaris
– Accounting
• Data files
– /var/adm/pacct
– /var/adm/acct/fiscal
– /var/adm/acct/nite
– /var/adm/acct/sum

Page 29: Ch24 system administration

Performance Analysis

• UNIX user-interface researchers report that an average user perceives a system to be slow when response times are longer than 0.7 seconds!

Page 30: Ch24 system administration

Performance Analysis

– CPU time –
• How long does the user’s job take to complete?
– Is the job time critical?
• What other jobs are running?
– Context switches are costly.
– Must share CPU cycles with other processes.
• What is the system load average?

– Memory speed –
• Does the job need to be loaded into memory?
• How quickly can memory be filled with pertinent information?
• Is the job swapped out?
– Swapping brings the disk system into the picture.
– Swapping invalidates the cache for this job.
– Swapping is easy to eliminate/minimize!
• Does the job fit into cache?

Page 31: Ch24 system administration

Performance Analysis

– Disk I/O bandwidth –
• Bus speed
• Controller width/speed
• How fast can information be pulled off of disk?
– SCSI vs. IDE vs. RAID
– Rotational latency
– Caching in controller/drive
• Disk system speed will have an effect on memory speed (swapping).

– Network I/O bandwidth –
• Are files stored on a network file system?
• Does the network file system do any caching?
• Shared/switched media?
• Full/half duplex?

Page 32: Ch24 system administration

Performance Analysis

• CPU-bound jobs are difficult to measure.
– Use ps and top to see what is running.
– Use uptime to determine load averages.
• The 1-minute average is good for “spiky” load problems.
• The 5-minute average is a good metric to monitor for “normal” activity.
• The 15-minute average is a good indicator of overload conditions.
– Use sar to determine the system CPU states.
• System accounting can track the amount of time each CPU spends working on idle/system/user jobs.
– Use mpstat to determine what multi-processor systems are doing.
• One busy processor and one idle processor is probably “normal” operation.

– Use vmstat and iostat to determine percentage of time system is running user/kernel processes.

• Less detail than sar, but good general information.
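A hedged sketch of a quick CPU check with the tools above; column layouts differ between UNIX variants, so the comments describe the usual Solaris/Linux output.

    # Load averages for the last 1, 5, and 15 minutes.
    uptime
    # Per-processor utilization, three 5-second samples.
    mpstat 5 3
    # User/system/idle split plus the context-switch rate (cs column).
    vmstat 5 3
    # Historical CPU states, if sar data collection is enabled.
    sar -u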

Page 33: Ch24 system administration

Performance Analysis

• How can you improve CPU performance?
– More CPUs
– Faster CPUs
– Lock jobs to specific CPUs
– Lock CPUs to specific tasks

Page 34: Ch24 system administration

Performance Analysis

• Before you can diagnose performance problems, you must have a good idea of what is reasonable for your system.
– Monitor the system and develop a fingerprint of typical job mixes, load average, memory use, disk use, network throughput, number of users, swapping, and job size.
– If something happens to the performance, use these metrics to determine what has changed.
• Did jobs get larger?
• More disk or network I/O?
• Less free memory?
• More swapping?
• More users?
• More jobs?

Page 35: Ch24 system administration

CPU Performance

• In general, the output of top, vmstat, w, and other utilities that show processor-state statistics can tell you a lot about the performance of the CPU subsystem.
– If the CPU is in user mode more than 90% of the time, with little or no idle time, it is executing application code.
• This is probably what you want it to do, but too many user jobs running concurrently may be detrimental to any one job getting any work done.

– If the CPU is in system mode more than 30% of the time, it is executing system code (probably I/O, or other system calls).

• Context switches are a symptom of high I/O activity (if the interrupt rate is also high).

• If seen in conjunction with high system call activity, it is a sign of poor code (nonlocal data, open, read, close, or loop).

– If the CPU is idle more than 10% of the time, the system is waiting on I/O (disk/network).

• This could be a symptom of poor application code (no internal buffering) or overloaded disk/network subsystems.

Page 36: Ch24 system administration

CPU Performance

– If the system exhibits a high rate of context switches, the system is displaying symptoms of a number of possible problems.

• Context switches occur when one job yields the processor to another job.

• This may occur because the scheduler time slice expired for the running job, because the running job required input/output, or because a system interrupt occurred.

– If the number of context switches is high, and the interrupt rate is high, the system is probably performing I/O.

• If the number of context switches is high, and the system call rate is high, the problem is likely the result of bad application coding practices.

• Such practices include a program loop that repeatedly performs the sequence “open a file, read from the file, close the file.”

Page 37: Ch24 system administration

CPU Performance

• If the system exhibits a high trap rate and few system calls, the system is probably experiencing page faults, experiencing memory errors, or attempting to execute unimplemented instructions.
– Some chips do not contain instructions to perform certain mathematical operations.
– Such systems require that the system generate a trap that causes the system to use software routines to perform the operation.
• An example of this situation occurs when you attempt to run a SPARC V8 binary on a SPARC V7 system.

• The SPARC V7 system contains no integer multiply/divide hardware. SPARC V8 systems contain hardware multiply/divide instructions, so compiling a program on the V8 architecture embeds these instructions in the program.

• When this same program is run on a V7 system, the OS has to trap the instructions, call a software routine to perform the calculation, and then return to the running program with the answer.

Page 38: Ch24 system administration

Performance Analysis

• Memory is a critical system resource.
– UNIX is very good at finding/hoarding memory for disk/network buffers.
• UNIX buffering scheme:
– At boot time, the kernel sizes memory, takes all of it, and hoards it.
– As jobs start, the kernel begrudgingly gives some memory back to them.
– In some versions of UNIX:
• Disk buffers are allocated on a per-file-system (disk partition) basis.
• Network buffers are allocated on a per-interface basis.

Page 39: Ch24 system administration

Performance Analysis

• Memory is a critical system resource.
– Before upgrading the cluster systems, OIT looked at the memory question:
• With 64 MB of memory, jobs took X minutes to run.
• With 128 MB of memory, the same jobs took X/3 minutes to run.
• With 256 MB of memory, the same job did not run any faster, but you could run multiple instances of the same job with no degradation in performance.

– Memory is cheap. Buy lots!

Page 40: Ch24 system administration

Performance Analysis

• Monitoring memory use:
• Use pstat -s to look at swap information on BSD systems.
• Use swap -l to look at swap on System V systems.
• Use sar -r to look at swap information.
• Use vmstat to look at memory statistics.
• Use top to monitor job sizes and swap information.
• If there is any sign of swapping:
– Memory is cheap! Buy lots!
• Can adjust the reclaim rate and other memory system parameters, but it is usually more profitable to add memory.
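A hedged summary of the commands above as they are typically invoked; which ones exist depends on whether the system is BSD- or System V-derived.

    # System V / Solaris: list swap devices and free blocks.
    swap -l
    # Solaris: overall swap allocation summary.
    swap -s
    # BSD: swap and VM statistics.
    pstat -s
    # Watch paging activity; the sr (scan rate) and fr (free rate) columns matter most.
    vmstat 5 5
    # Historical free memory and free swap, if sar collection is enabled.
    sar -r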

Page 41: Ch24 system administration

Memory Performance

• Unlike CPU tuning, memory tuning is a bit more objective. Quantifying CPU performance can be somewhat elusive, but quantifying memory usage is usually pretty straightforward.

• Job Size
– An easy diagnostic for memory problems is to add up the size of all jobs running on the system, and compare this to the size of the system’s physical memory.

– If the size of the jobs is grossly out of proportion to the size of the system memory, you need to do something to change this situation.

• You could use a scheduler that uses job size as one of the criteria for allowing a job to run, remove some processes from the system (for example migrate some applications to another server), or add memory to lessen the disparity in the requested versus available memory.

Page 42: Ch24 system administration

Memory Performance

• Swapping/Paging
– Under BSD operating systems, the amount of virtual memory is equal to the swap space allocated on the system disks plus the size of the shared text segments in memory.

• The BSD VM system required that you allocate swap space equal to or greater than the size of memory. Many BSD environments recommended that you allocate swap space equal to 4x the size of real memory.

– Under System V UNIX kernels, the total amount of virtual memory is equal to the size of the swap space plus the size of memory, minus a small amount of “overhead” space.

• The system does not begin to swap until the job memory requirements exceed the size of the system memory.

Page 43: Ch24 system administration

Memory Performance

• You can estimate the system’s virtual memory requirements on BSD systems by looking at the output of the top and/or ps commands.
– If you add up the values in the RSS column (resident set size), you can get an idea of the real memory usage on the system (see the sketch below).
• Adding up the values in the SZ column gives you an estimate of the VM requirements for the system.
– If the total of all SZ values increases over time (with the same jobs running), one or more applications probably have memory leaks.
– The system will eventually run out of swap space, and hang or crash.

• Some kernels allow you to modify the page scan/reclaim process.

– This allows you to alter how long a page stays in real memory before it is swapped or paged out.

– Such modifications are tricky, and should only be performed if you know what you are doing.
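A hedged sketch of the RSS/SZ bookkeeping described above; ps field names and units (kilobytes on most systems) vary by platform, and vsz is used here as the equivalent of the SZ column.

    # Approximate real memory in use: sum of resident set sizes (usually kilobytes).
    ps -eo rss= | awk '{ total += $1 } END { print total, "KB resident" }'
    # Approximate virtual memory demand: sum of process virtual sizes.
    ps -eo vsz= | awk '{ total += $1 } END { print total, "KB virtual" }'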

Page 44: Ch24 system administration

Memory Performance

• If you see that the scan rate (sr column in vmstat output) value is roughly equal to the free rate (fr column in vmstat output), the system is releasing pages as quickly as they are scanned.
– If you tune the memory scan parameters to increase the period between when the page is scanned and when it is paged out (allow pages to stay in memory for a longer period), the VM system performance may improve.

– On the other hand, if the sr value is greater than the fr value, decreasing the period between scan and paging time may improve VM system performance.

Page 45: Ch24 system administration

Memory Performance

• VM Symptoms
– The following indicators may be useful when tuning the VM system.
• Paging activity may be an indicator of file system activity.
• Swapping activity is usually an indicator of large memory processes thrashing.
• Attach and reclaim activity is often a symptom of a program in a loop performing a “file open, read, and close” operation.
• If the output of netstat -s shows a high error rate, the system may be kernel-memory starved. This often leads to dropped packets and memory allocation (malloc) failures.

Page 46: Ch24 system administration

Memory Performance

• Shared Memory
– Large database applications often want to use shared memory for communications among the many modules that make up the database package.
• By sharing the memory, the application can avoid copying chunks of data from one routine to another, thereby improving system performance and maximizing the utilization of system resources.

• This generally works fine, until the application requests more shared memory than the system has available.

• When this situation occurs, system performance will often nosedive.

– Under Solaris, the /usr/bin/ipcs command may be used to monitor the status of the shared memory, and semaphore system.
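A hedged example of the ipcs check; the flags shown are the common Solaris/Linux ones, and the segment id in the last line is a placeholder.

    # Summary of active shared memory segments, semaphores, and message queues.
    ipcs -a
    # Shared memory segments only (on Solaris, add -b to show segment sizes).
    ipcs -m
    # Remove a segment left behind by a crashed application (use with care).
    ipcrm -m <shmid>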

Page 47: Ch24 system administration

Memory Performance

• mmap
– If an application is running from a local file system, you might want to look into using the mmap() system call to map open files into the process address space.
• The use of mmap replaces the open, malloc, and read cycle with a much more efficient operation for read-only data.

• When the application is using network file systems, this might actually cause a degradation of system performance.

• Using the cachefs file system with NFS will improve this situation, as this allows the system to page to a local disk instead of through the network to an NFS disk.

Page 48: Ch24 system administration

Performance Analysis

• How can you improve the memory system?
– Add memory
• It’s cheap.
– Use limits
• They’re ugly.
• The payoff is not (usually) very good.

Page 49: Ch24 system administration

Performance Analysis

• Disk I/O is one of the most critical factors in system performance.
– Most file access goes through the disk I/O system.
• Multiple “hot” file systems on one disk will be a problem.
• Slow disks will be a problem.
• Narrow controllers will be a problem.
• Partitioning of disks will have an effect on buffering.
• Disk geometry will have an effect on buffering.
– Swapping/paging goes through the disk I/O system.
• Split swap space over multiple spindles to increase interleave.
• If swapping: buy more memory (it’s cheap).
– Use iostat to look at the disk I/O system.
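A hedged look at the iostat invocation; -x (extended device statistics) is available on both Solaris and Linux, while -n and -z are Solaris conveniences.

    # Extended per-device statistics, 5-second samples: look for devices that are
    # close to 100% busy or show long wait/service times while others sit idle.
    iostat -x 5
    # Solaris: descriptive device names (-n) and suppress idle devices (-z).
    iostat -xnz 5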

Page 50: Ch24 system administration

Disk Performance

• Swapping
– In general, if a system is swapping, this is a symptom that it does not have enough physical memory.

• Add memory to the system to minimize the swapping/paging activity before continuing.

• You might also consider migrating some of the load to other systems in order to minimize contention for existing resources.

– If the system contains the maximum memory, and the system is still swapping, there are some things you can do to improve the performance of the swapping/paging subsystem.

• First, try to split the swap partitions across several disk drives and (if possible) disk controllers.

• Most current operating systems can interleave swap writes across several disks to improve performance.

• Adding controllers to the swap system can increase the bandwidth of the swap subsystem immensely.

Page 51: Ch24 system administration

Disk Performance

• Read/Modify/Write
– One major problem for disk systems is the read/modify/write sequence of operations. This sequence is typical of updates to a file (read the file into memory, modify the file in memory, and then write the file out to disk). This sequence is a problem for (at least) the following reasons.

• There is a delay between the read and the write, so the heads have probably been repositioned to perform some other operation.

• The file size may change, requiring the new file to be written to non-contiguous sectors/cylinders on the disk. This causes more head movement when the file has to be written back to the disk.

Page 52: Ch24 system administration

Disk Performance

– It may seem simple to avoid or minimize such operations, but consider the following:
• A typical “make” operation might read in 50 include files. The compiler might create 400 object files for a large make operation.
• File system accesses require an inode lookup, a file system stat, a direct block read, an indirect block read, and a double-indirect block read to access a file. When the file is written to disk, the same operations are required.

Page 53: Ch24 system administration

Disk Performance

• When a database application needs to perform a data insert operation, it needs to write the data to disk.
– It also needs to write the transaction to a log, and read/modify/write an index.
– Databases typically exhibit 50% updates, 20% inserts, and 30% lookups.
– This can lead to 200 (or more) I/O operations per second on medium-size databases!
– Trying to store such a database on a single disk is sure to fail.
– You would probably get by with a four-drive RAID for such applications, but a six- or eight-drive stripe would be a better bet for high performance.

Page 54: Ch24 system administration

Disk Performance

• File Servers
– File servers should be dedicated to one task: storing and retrieving files from system disks.
• Although this might seem like it should be a simple task, lack of planning when creating and populating file systems may cause severe performance problems.

– If a file server seems sluggish, use iostat, vmstat, and other commands available on the system to monitor the disk subsystem.

• You need to determine which disks are experiencing large numbers of transfers, and/or large amounts of information read/written to disks.

Page 55: Ch24 system administration

Disk Performance

• Unbalanced Disks
– Monitor the disk activity on an overloaded system to determine which file systems are being accessed most often.

• If most of the disk activity is centered on one disk drive, while other disk drives sit idle, you probably have an unbalanced disk system.

• A typical disk drive can handle (roughly) 50 I/O operations a second.

• If you are trying to perform 100 I/O operations/second to a single disk drive, system performance will suffer!

– If you see signs that one disk is being heavily accessed, while other disks sit idle, you might consider moving file systems in order to spread high-activity file systems across multiple disks.

• Place one high-activity file system on a disk drive with one or more low activity file systems. This minimizes head/arm movement, and improves the utilization of the on-drive and on-controller caches.

Page 56: Ch24 system administration

Disk Performance

• Unbalanced Disks
– Too many hot file systems on a single disk drive/stripe is another typical problem.
• The tendency is to use all of the space available on the disk drives. Many times the administrator will partition a large disk into two (or more) file systems, and load files on all of the partitions.

• However, when all of the file systems begin to experience high volumes of access requests the disk head-positioner and the bandwidth of the disk drive become bottlenecks.

• It is usually better to waste some disk space and leave partitions empty than to place multiple active file systems on a drive.

• If you must do so, try to place inactive or read-only file systems on one partition, with an active read/write partition on another partition.

Page 57: Ch24 system administration

Disk Performance

– Another way to disperse file system load is to break up large multifunction file systems into smaller, more easily dispersed chunks.

• For example, the UNIX /usr file system often contains the system/site source code, system binaries, window system binaries, and print and mail spool files.

• By breaking the /usr file system into several smaller file systems, the sysadmin can disperse the load across the disk subsystem.

• Some of the more typical ways to break /usr into smaller chunks include making separate partitions for /usr/bin, /usr/lib, /usr/openwin, /usr/local, and /usr/spool.

Page 58: Ch24 system administration

Disk Performance

• RAID
– Some believe that by default RAID provides better performance than Single Large Expensive Disks (SLEDs).
– Others believe that RAID is only useful if you want a redundant, fault-tolerant disk system.
• In reality, RAID can provide both of these capabilities.
• However, a poorly configured RAID can also cause system performance and reliability degradation.

Page 59: Ch24 system administration

Disk Performance

• RAID

– Due to RAID’s flexibility and complexity, RAID subsystems present some tough challenges when it comes to performance monitoring and tuning.

• Most RAID levels have well-known performance characteristics.

• Design your file systems such that high-performance file systems are housed on RAID volumes that provide the best performance (typically RAID 0).

• For improved reliability, RAID level 1, 4, or 5 would be a better choice.

• However, even within the RAID levels there are some general guidelines to keep in mind while designing RAID volumes.

Page 60: Ch24 system administration

Disk Performance

• Disk Stripes

– One of the prime considerations for tuning RAID disk systems is “stripe size.”

• RAID allows you to “gang” several disks to form a “logical” disk drive.

• These logical drives allow you to attain better throughput, and large-capacity file systems.

• However, you need to be careful when you design these file systems.

Page 61: Ch24 system administration

Disk Performance

• Disk Stripes

– The basic unit of storage on a standard disk is the disk sector (typically 512 bytes).

• On a RAID disk system, the basic unit of storage is referred to as the block size, which in reality is the sector size.

• However, RAID disks allow you to have multiple disks ganged such that you stripe the data across all disks.

• The number of disks in a RAID file system is referred to as the interleave factor, whereas the “stripe size” is the block size multiplied by the interleave factor.

• You typically want the size of an access to be a multiple of the stripe size.

Page 62: Ch24 system administration

Disk Performance

• Sequential I/O Optimizations
– When using RAID, each disk I/O request will access every drive in the stripe in parallel.

• The block size of the RAID stripe is equal to the access size divided by the interleave factor.

– For example, a file server that contains a four-drive RAID array that allows 64-kilobyte file system accesses would be best served by reading/writing 16-kilobyte chunks of data to/from each in the array in parallel.

– A file server with an eight-disk stripe that allowed 8-kilobyte file system accesses should be tuned to read/write 1 kilobyte to/from each disk in the stripe.

– Such setups (RAID with a four- to eight-drive interleave) can provide a 3x to 4x improvement in I/O throughput compared to a single disk system.

Page 63: Ch24 system administration

Disk Performance

• Random I/O Optimizations
– When using RAID for random I/O operations, you want each request to hit a different disk in the array.

• You want to force the I/O to be scattered across the available disk resources.

• In this case you want to tune the system such that the block size is equal to the access size.

– For example, a file server that allows 8-kilobyte file accesses across a six-disk RAID stripe should employ a 48-kilobyte stripe size, whereas a database server that allowed 2-kilobyte accesses across a four-drive RAID array should employ an 8-kilobyte stripe size.

Page 64: Ch24 system administration

Disk Performance

• File System Optimizations
– The way an OS manages memory may also impact the performance of the I/O subsystem.
• For example, the BSD kernel allocates a portion of memory as a buffer pool for file system I/O, whereas System V kernels use main memory for file system I/O.
• Under System V, all file system input/output operations result in page-in/page-out memory transactions!
– This is much more efficient than the BSD buffer-pool model.
– You can tune the BSD kernel to use 50% of system memory as a buffer pool to improve file system performance.

Page 65: Ch24 system administration

Disk Performance

• Disk-based File Systems
– File systems stored on local disks are referred to as disk-based file systems (as opposed to network file systems, or memory-based file systems).

– There are several items related to disk-based file systems that the administrator might want to tune to improve the performance of the system.

Page 66: Ch24 system administration

Disk Performance

• Zone Sectoring
– Most modern disk drives employ zone-sectoring technology.

• This means that the drive has a larger number of storage sectors on the outer cylinders than it has on the inner cylinders; hence, the outer cylinders provide “more dense” storage than the inner cylinders.

• As the platter rotates, more sectors are under the read/write heads (per revolution) on the high-density cylinders than on the low-density cylinders.

• In many cases, two thirds of the disk’s storage space is on the outer (high-density) cylinders.

• This implies that you can attain higher performance if you just use the outer two-thirds of the disk drive, and “waste” one-third of the drive’s storage capacity.

Page 67: Ch24 system administration

Disk Performance

• Zone Sectoring
– File systems should be sized with this constraint in mind.

• While wasting one-third of the storage capacity seems counterproductive, in reality system performance will be much better if you waste some space.

• Free Space
– Most modern file systems do not perform well when they are more than 90% filled.
• When the file system gets full, the system has to work harder to locate geographically “close” sectors on which to store the file.
• Fragmentation becomes a performance penalty, and read/modify/write operations become extremely painful, as the disk heads may have to traverse several cylinders to retrieve and then rewrite the file.

Page 68: Ch24 system administration

Disk Performance

– On user partitions (where the users’ files are stored) you can use the quota system to ensure that you never fill the file system to more than 90% capacity.

• This entails calculating how much space each user can have, and checking that you do not allow more total quota space than 90% of the total partition size.

• This can be a tedious process.
• More commonly, the sysadmin watches the file system and, if it approaches 90% full, moves one or two of the space hogs to another partition.
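A hedged sketch of the “watch the file system” approach: warn when any mounted file system passes the 90% mark from the slide, then list the biggest consumers on a flagged partition. The /home path and the top-10 cutoff are arbitrary.

    #!/bin/sh
    # Flag file systems that are more than 90% full (capacity is field 5 of df -k).
    df -k | awk 'NR > 1 && $5 + 0 > 90 { print "WARNING:", $6, "is", $5, "full" }'
    # For a flagged partition, list the ten largest space consumers.
    du -sk /home/* | sort -rn | head -10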

Page 69: Ch24 system administration

Disk Performance

• Linux Ext3 Performance Options
– The Ext2 file system has a reputation for being a rock-solid file system. The Ext3 file system builds on this base by adding journaling features.
• Ext3 allows you to choose from one of three journaling modes at file system mount time: data=writeback, data=ordered, and data=journal.
• To specify a journal mode, you can add the appropriate string (data=journal) to the options section of your /etc/fstab, or specify the -o data=journal command-line option when calling mount from the command line.

• To specify the data journaling method used for root file systems (data=ordered is the default), you can use a special kernel boot option called rootflags.

• To force the root file system into full data journaling mode, add rootflags=data=journal to the boot options.
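A hedged illustration of the two ways of selecting a journaling mode mentioned above; the device names and mount points are made up for the example.

    # /etc/fstab entry selecting ordered-mode journaling for /home.
    /dev/sda3   /home   ext3   defaults,data=ordered   1 2

    # Or pass the mode on the command line for a one-off mount.
    mount -t ext3 -o data=writeback /dev/sdb1 /var/spool/news

    # Root file system: append the rootflags option to the kernel boot line.
    rootflags=data=journal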

Page 70: Ch24 system administration

Disk Performance

• data=writeback Mode
– In data=writeback mode, Ext3 does not do any form of data journaling at all.

– In this mode, Ext3 provides journaling similar to that found in XFS file systems; that is, only the metadata is actually journaled.

– This could allow recently modified files to become corrupt in the event of an unexpected crash or reboot.

– Despite this drawback, data=writeback mode should give the best performance under most conditions.

Page 71: Ch24 system administration

Disk Performance

• data=ordered Mode
– In data=ordered mode, Ext3 only journals metadata, but it logically groups metadata and data blocks into a single unit called a transaction.

• When it is time to write the new metadata out to disk, the associated data blocks are written first.

• The data=ordered mode solves the corruption problem found in data=writeback mode, and does so without requiring full data journaling.

• In general, data=ordered Ext3 file systems perform slightly slower than data=writeback file systems, but significantly faster than their full data journaling counterparts.

Page 72: Ch24 system administration

Disk Performance

• data=ordered Mode

– When appending data to files, data=ordered mode provides all of the integrity guarantees offered by Ext3’s full data journaling mode.

• However, if part of a file is being overwritten and the system crashes, it is possible that the region being written will contain a combination of original blocks interspersed with updated blocks.

• This can happen because data=ordered provides no guarantees as to which blocks are overwritten first, and therefore you cannot assume that just because overwritten block x was updated that overwritten block x-1 was updated as well.

• Instead, data=ordered leaves the write ordering up to the hard drive’s write cache.

– In general, this limitation does not end up negatively impacting system integrity very often, in that file appends are usually much more common than file overwrites.

• For this reason, data=ordered mode is a good higher-performance replacement for full data journaling.

Page 73: Ch24 system administration

Disk Performance

• data=journal Mode
– The Ext3 data=journal mode provides full data and metadata journaling.
• All new data is written to the journal first, and then to the disk.
• In the event of a crash, the journal can be replayed, bringing both data and metadata into a consistent state.
• Theoretically, data=journal mode is the slowest journaling mode, in that data gets written to disk twice rather than once.
• However, it turns out that in certain situations data=journal mode can be blazingly fast.
– Ext3’s data=journal mode is incredibly well suited to situations in which data needs to be read from and written to disk at the same time.
• Therefore, Ext3’s data=journal mode, assumed to be the slowest of all Ext3 modes in nearly all conditions, actually turns out to have a major performance advantage in busy environments where interactive I/O performance needs to be maximized.

Page 74: Ch24 system administration

Disk Performance

• TIP: On busy (Linux) NFS servers, the server may experience a huge storm of disk-write activity every 30 seconds when the kernel forces a sync operation. The following command will cause the system to run kupdate every 0.6 seconds rather than every 5 seconds. In addition, the command will cause the kernel to flush a dirty buffer after 3 seconds, rather than after the default of 30 seconds.

echo 40 0 0 0 60 300 60 0 0 > /proc/sys/vm/bdflush

Page 75: Ch24 system administration

Disk Performance

• BSD Disk System Performance

– The Berkeley file system, when used on BSD-derived operating systems, also provides some methods for improving the performance of the file system.

• For file servers with memory to spare, it is possible to increase BUFCACHEPERCENT.

• That is, it is possible to increase the percentage of system RAM used as file system buffer space.

• To increase BUFCACHEPERCENT, add a line to the kernel configuration similar to the following.

option BUFCACHEPERCENT=30
• You can set the BUFCACHEPERCENT value as low as 5% (the default) or as high as 50%.

Page 76: Ch24 system administration

Disk Performance

• BSD Disk System Performance

– Another method that can be used to speed up the file system is softupdates.

• One of the slowest operations in the traditional BSD file system is updating metainfo, which happens when applications create or delete files and directories.

• softupdates attempts to update metainfo in RAM instead of writing to the hard disk for every metainfo update.

• An effect of this is that the metainfo on disk should always be complete, although not always up to date.

Page 77: Ch24 system administration

Disk Performance

• Network File Systems
– Network file systems are “at the mercy” of two bottlenecks: the disk system on the server and the network link between the server and the client.

• One way to improve performance of NFS file systems is to use the TCP protocol for transport instead of the UDP protocol.

• Some operating systems (Solaris, among others) have already made TCP their default transport for this reason.

– Another way to improve the performance of NFS is to increase the size of the data chunks sent and received.

• In NFS V1 and V2, the chunk size is 8 kilobytes. In NFS v3, the chunk size is up to 32 kilobytes.
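A hedged example combining both suggestions (TCP transport and 32-kilobyte transfers) as client-side NFS mount options; option syntax varies by client OS, and the server and path names are placeholders.

    # Linux-style client mount: NFS v3 over TCP with 32 KB read/write sizes.
    mount -o vers=3,proto=tcp,rsize=32768,wsize=32768 fileserver:/export/home /mnt/home
    # Solaris-style equivalent.
    mount -F nfs -o vers=3,proto=tcp,rsize=32768,wsize=32768 fileserver:/export/home /mnt/home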

Page 78: Ch24 system administration

Disk Performance

• cachefs
– Another way to improve NFS performance is to use a client-side cache.
• By default, the NFS file system does not provide client-side caching.
• Allocating memory or disk space as a cache for network files can improve the performance of the client system at the cost of local disk space or memory for jobs.

• cachefs provides huge improvements for read-only (or “read-mostly”) file systems. A good size for the cachefs is 100 to 200 megabytes.
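A hedged Solaris-style sketch of putting cachefs in front of an NFS mount; the directory names are arbitrary, and cfsadmin accepts additional sizing options not shown here.

    # Create the local cache directory.
    cfsadmin -c /var/cache/nfs
    # Mount the NFS file system through the cache.
    mount -F cachefs -o backfstype=nfs,cachedir=/var/cache/nfs \
        fileserver:/export/docs /docs
    # Check cache hit rates after some use.
    cachefsstat /docs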

Page 79: Ch24 system administration

Performance Analysis

• Network I/O can be very critical in a heavily networked environment.
– NFS/AFS performance relies on the network performance.
• NFS is extremely dependent (no cache by default)
– Large transfer sizes (8 KB in v2, up to 32 KB in v3)
– UDP implementation
• AFS is less dependent (has a disk cache)
• Web servers are very network sensitive
– Lots of small transfers (input)
– Lots of larger transfers (output)
– Also disk system dependent.
• Use netstat to view network statistics.
• Use nfsstat to look at NFS statistics.
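A hedged sketch of the checks suggested above; the flags are the commonly available ones, and output formats differ between UNIX variants.

    # Per-interface packet, error, and collision counts.
    netstat -i
    # Protocol-level statistics; look for retransmissions and errors.
    netstat -s
    # NFS client RPC statistics: retransmissions and timeouts point at the network.
    nfsstat -rc
    # NFS server operation mix.
    nfsstat -s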

Page 80: Ch24 system administration

Performance Analysis

• How can you improve network I/O?
– Change network connections to switched technology.
• Upgrade half duplex to full duplex.
– Change to a faster network technology.
• Upgrade 10 Mbit to 100 Mbit Ethernet.
– Fewer hosts on the network.
• Less traffic on critical links == more headroom.
– Faster network core.
– Off-site caching
• Particularly useful for web services
– Create dense request packets
– Use keepalives

Page 81: Ch24 system administration

Network Performance

• Trunking

– Network adapters are engineered to provide specific bandwidth to the system.

• A 10-megabit Ethernet adapter will typically provide 6 to 10 megabits per second of bandwidth when operating in half-duplex mode.

• You can improve the performance of the network subsystem by configuring the adapter to operate in full duplex mode.

• But what do you do if you need 500 megabits of throughput from a database server to the corporate web server?

– Many vendors allow you to cluster multiple network interfaces to provide improved performance.

• By clustering interfaces, you can also provide redundancy, as the system will continue to operate, albeit at lower performance levels, if one interface fails.

• An example of trunking is the Sun Multipath package.

Page 82: Ch24 system administration

Network Performance

• Trunking

NOTE: Some applications seem to have problems working with systems using network trunking. At least one popular utility that allows UNIX servers to operate as Apple file/print servers experiences difficulties (and horrible performance) when used in a trunk environment!

• Collisions
– Anytime a packet is involved in a collision, network performance suffers.
• The damaged packet(s) will need to be retransmitted, adding load to an already overburdened network.
• Consider transitioning all connections to switched hardware.
• The switched hardware may still experience collisions, but usually at much lower rates than shared-mode hardware.

Page 83: Ch24 system administration

Network Performance

• TCP_NODELAY

– Under most circumstances, TCP sends data when it is “handed” to the TCP stack.

• When outstanding data has not yet been acknowledged, TCP gathers small amounts of output to be sent in a single packet once an acknowledgment has been received.

• For a small number of clients, such as windowing systems that send a stream of mouse events that receive no replies, this process may cause significant delays.

• To circumvent this problem, TCP provides a socket-level option, TCP_NODELAY, which may be used to tune the operation of the TCP stack in regard to these delays.

• Enabling TCP_NODELAY can improve the performance of certain network communications.

Page 84: Ch24 system administration

Network Performance

• HIWAT/LOWAT

– Most operating systems make certain assumptions about the type of network connection likely to be encountered.

• These assumptions are used to set the size and number of network buffers that can be used to hold inbound and outbound packets.

– Systems with several network interfaces, or systems that are connected to very high-speed networks, may realize an improvement in network performance by increasing the number of buffers available to the network stack.

• This is typically accomplished by tuning variables in the TCP stack.

– Under Solaris and HP-UX (among others), you can tune the number of buffers by setting the hiwat and lowat variables via the ndd command.
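A hedged Solaris-style example of inspecting and raising the TCP high-water marks with ndd; the 64-kilobyte value is illustrative, and parameter names differ on other operating systems.

    # Show the current transmit and receive buffer high-water marks.
    ndd /dev/tcp tcp_xmit_hiwat
    ndd /dev/tcp tcp_recv_hiwat
    # Raise both to 64 KB; affects new connections and does not persist across reboots.
    ndd -set /dev/tcp tcp_xmit_hiwat 65536
    ndd -set /dev/tcp tcp_recv_hiwat 65536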

Page 85: Ch24 system administration

Common Sense

– Don’t overload the system.
• UNIX does not deal well when presented with overload conditions. The same is true for NT.

– ALWAYS keep at least 10% free space on disk partitions.
– Keep 35% free bandwidth on network links.
– Eliminate swapping!
– Try to keep 30 to 50% free cycles on CPUs.
– Don’t run accounting/quotas if your goal is peak performance.
– Run system scripts late at night (or other off-hours).
– Don’t run backups during peak hours.

Page 86: Ch24 system administration

Common Sense

– Watch for runaway jobs.
• Users hate “killer” programs, but they do have their place!

– Watch for hardware problems that might aggravate performance problems.
• Disk errors will cause the disk system to retry – this makes the system slower!
• Network errors require retransmission of packets – this makes the system slower!
• Slow (or speed-mismatched) memory DIMMs may cause the system to stall (wait states) – this makes the system slower!

– Try to run “native mode” programs.
• Binary compatibility mode is slow.
• VMware and other “virtual environments” can be very slow.
• Interpreted languages can be slow (sh code vs. C code, compiled Java vs. interpreted Java).

Page 87: Ch24 system administration

Common Sense

• Watch for stupid programmer tricks
• Walking backward through arrays defeats the cache
– Try to optimize loops such that critical data resides in cache.
– Sparse matrix operations can be avoided!
• Single-character reads/writes defeat buffering
– Users should read large blocks into a buffer in their code, then work from this buffer.
• File open/file close operations are slow
• Rabbit jobs
• Background processes
• Zombie processes

Page 88: Ch24 system administration

Summary

• System performance analysis and tuning is an iterative process.

• The sysadmin must use scientific methodology when attempting to tune a system’s performance.