Lec#4 - Types of Workloads
Transcript of Lec#4 - Types of Workloads
4-1©2010 Raj Jain www.rajjain.com
Simulation, Modeling and Analysis of Computer Networks
(ECE 6620)
Dr. M. Hasan Islam
Types of Workloads (Chapter 4)
“Art of Computer Systems Performance Analysis” By R. Jain
Overview
Terminology
Test Workloads for Computer Systems: Addition Instruction, Instruction Mixes, Kernels, Synthetic Programs, Application Benchmarks (Sieve, Ackermann's Function, Debit-Credit, SPEC)
Workload Selection
Computer system performance measurements involve monitoring the system while it is being subjected to a particular workload
In order to perform meaningful measurements, the workload should be carefully selected
To achieve that goal, the performance analyst needs to understand the following before performing measurements:
1. What are the different types of workloads?
2. Which workloads are commonly used by other analysts?
3. How are the appropriate workload types selected?
4. How is the measured workload data summarized?
5. How is the system performance monitored?
6. How can desired workload be placed on the system in a controlled manner?
7. How are the results of the evaluation presented?
Terminology
Test workload: any workload used in performance studies. A test workload can be real or synthetic.
Real workload: one observed on a system being used for normal operations. It cannot be repeated and is generally not suitable for use as a test workload.
Synthetic workload: similar to the real workload, but can be applied repeatedly in a controlled manner. It requires no large real-world data files and no sensitive data, is easily modified without affecting operation, is easily ported to different systems due to its small size, and may have built-in measurement capabilities.
Test Workloads for Computer Systems
1. Addition Instruction
2. Instruction Mixes
3. Kernels
4. Synthetic Programs
5. Application Benchmarks
Addition Instruction
In the early days, processors were the most expensive and most heavily used components of the system.
Addition was the most frequent instruction. Thus, as a first approximation, the computer with the faster addition instruction was considered to be the better performer.
The addition instruction was the sole workload used, and the addition time was the sole performance metric.
Instruction Mixes
Specification of various instructions coupled with their usage frequency
Gibson mix: Developed by Jack C. Gibson in 1959 for IBM 704 systems.
Instruction Mixes (Cont)
Disadvantages: Complex classes of instructions are not reflected in the mixes. Instruction time varies with:
Addressing modes, cache hit rates, pipeline efficiency, interference from other devices during processor-memory access cycles, and parameter values, such as the frequency of zeros as a parameter, the distribution of zero digits in a multiplier, the average number of positions of pre-shift in floating-point add, and the number of times a conditional branch is taken.
Instruction Mixes (Cont)
Performance Metrics:
MIPS = Millions of Instructions Per Second
MFLOPS = Millions of Floating Point Operations Per Second
It must be pointed out that instruction mixes measure only the speed of the processor.
This may or may not determine the total system performance when the system consists of many other components.
System performance is limited by the performance of the bottleneck component; unless the processor is the bottleneck (that is, the workload is mostly compute bound), the MIPS rate of the processor does not reflect the system performance.
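As a sketch of how an instruction mix yields a single figure, the following computes a weighted average instruction time and the corresponding MIPS rate. The instruction classes, frequencies, and times are illustrative values, not the actual Gibson mix.

```python
# Sketch: turning an instruction mix into an average instruction time
# and a MIPS rate. The classes, frequencies, and times below are
# illustrative values, NOT the actual Gibson mix.
mix = {
    # class: (fraction of instructions, time per instruction in microseconds)
    "load/store":      (0.35, 0.8),
    "fixed-point add": (0.25, 0.5),
    "branch":          (0.20, 0.6),
    "floating add":    (0.10, 2.0),
    "multiply":        (0.10, 4.0),
}

# Weighted mean over the mix; 1 microsecond per instruction = 1 MIPS.
avg_time_us = sum(f * t for f, t in mix.values())
mips = 1.0 / avg_time_us

print(f"Average instruction time: {avg_time_us:.3f} us")  # 1.125 us
print(f"MIPS rate: {mips:.2f}")                           # 0.89
```

Note that changing the assumed frequencies changes the MIPS rate, which is exactly why a mix must be representative of the intended workload.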
Kernels
The introduction of pipelining, instruction caching, and various address translation mechanisms made computer instruction times highly variable.
An individual instruction could no longer be considered in isolation.
Instead, it became more appropriate to consider a set of instructions that constitutes a higher level function, a service provided by the processors.
Such a function is called a kernel (the most frequent function or algorithm).
Most of the initial kernels did not make use of the input/output (I/O) devices and concentrated solely on the processor performance; this class of kernels could be called processing kernels.
Commonly used kernels: Sieve, Puzzle, Tree Searching, Ackermann's Function, Matrix Inversion, and Sorting.
Disadvantages: Kernels do not make use of I/O devices or OS services, and thus kernel performance does not reflect the total system performance.
Synthetic Programs
The need to measure I/O performance led analysts to develop simple exerciser loops that make a specified number of service calls or I/O requests.
These loops allow them to compute the average CPU time and elapsed time for each service call.
Exerciser loops are also used to measure operating system services such as process creation, forking, and memory allocation.
In order to remain portable across operating systems, such exercisers are usually written in high-level languages such as FORTRAN or Pascal.
The first exerciser loop was by Buchholz (1969), who called it a synthetic program.
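A minimal exerciser loop in this spirit might look as follows. The request count, record size, and the use of a temporary file are arbitrary illustration choices, not taken from Buchholz's program.

```python
# Sketch of an exerciser loop: issue a fixed number of I/O requests and
# report the average elapsed time per request. N_CALLS, the record
# size, and the temporary-file target are arbitrary choices.
import os
import tempfile
import time

N_CALLS = 200
RECORD = b"x" * 512            # one 512-byte record per request

fd, path = tempfile.mkstemp()
os.close(fd)

start = time.perf_counter()
for _ in range(N_CALLS):
    with open(path, "wb") as f:  # each iteration: open, write, close
        f.write(RECORD)
elapsed = time.perf_counter() - start
os.remove(path)

print(f"Average elapsed time per I/O request: {elapsed / N_CALLS * 1e6:.1f} us")
```

The same loop skeleton can exercise other services (process creation, memory allocation) by swapping the body of the loop.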
Synthetic Programs (Cont)
Advantages: Quickly developed and given to different vendors; no real data files; easily modified and ported to different systems; may have built-in measurement capabilities; the measurement process is automated; easily repeated on successive versions of the operating system.
Disadvantages: Too small; do not make representative memory or disk references; mechanisms for page faults and disk cache may not be adequately exercised; CPU-I/O overlap may not be representative; loops may create synchronizations, resulting in better or worse performance than the real workload.
Synthetic workload generation program
Application Benchmarks
If the computer systems to be compared are to be used for a particular application (e.g., banking or airline reservations), a representative subset of functions for that application may be used. Such benchmarks are generally described in terms of the functions to be performed and make use of almost all resources in the system, including processors, I/O devices, networks, and databases.
Benchmarking: the process of performance comparison of two or more systems by measurements. The workloads used in the measurements are called benchmarks.
Some authors: benchmark = a set of programs taken from real workloads.
Popular benchmarks: Sieve, Ackermann's Function, Whetstone, LINPACK, Dhrystone, Lawrence Livermore Loops, Debit-Credit Benchmark, SPEC Benchmark Suite.
Sieve
The sieve kernel has been used to compare microprocessors, personal computers, and high-level languages
Based on Eratosthenes' sieve algorithm: find all prime numbers below a given number n.
Algorithm: Write down all integers from 1 to n. Strike out all multiples of k, for k = 2, 3, ..., sqrt(n).
Example: Write down all numbers from 1 to 20, marking all as prime:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
Remove all multiples of 2 from the list of primes:
1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19
Sieve (Cont)
The next integer in the sequence is 3. Remove all multiples of 3:
1, 2, 3, 5, 7, 11, 13, 17, 19
The next integer is 5, and 5 > sqrt(20), so stop.
Pascal Program to Implement the Sieve Kernel: see the program listing in Figure 4.2 of the book.
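In place of the Pascal listing, here is a sketch of the same kernel in Python, stopping once k exceeds sqrt(n):

```python
def sieve(n):
    """Return all primes up to n using Eratosthenes' sieve."""
    is_prime = [True] * (n + 1)
    is_prime[0:2] = [False, False]            # 0 and 1 are not prime
    k = 2
    while k * k <= n:                         # stop once k exceeds sqrt(n)
        if is_prime[k]:
            for m in range(k * k, n + 1, k):  # strike out multiples of k
                is_prime[m] = False
        k += 1
    return [i for i in range(2, n + 1) if is_prime[i]]

print(sieve(20))  # [2, 3, 5, 7, 11, 13, 17, 19]
```

As a kernel, the quantity of interest is not the list of primes but the time the loop takes on the system under test.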
Ackermann's Function
Used to assess the efficiency of the procedure-calling mechanism. The function has two parameters and is defined recursively. Ackermann(3, n) is evaluated for values of n from one to six. Metrics:
Average execution time per call, number of instructions executed per call, and stack space per call.
Verification: Ackermann(3, n) = 2^(n+3) - 3
Number of recursive calls in evaluating Ackermann(3, n): (512 * 4^(n-1) - 15 * 2^(n+3) + 9n + 37) / 3. This expression is used to compute the execution time per call.
Depth of the procedure calls = 2^(n+3) - 4, so the stack space required doubles when n is increased by 1.
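The recursion and the closed form for Ackermann(3, n) can be checked with a short program. This is a sketch using the standard two-parameter definition, not the benchmark code itself:

```python
import sys
sys.setrecursionlimit(20000)  # the call depth grows as 2^(n+3) - 4

def ackermann(m, n):
    # Standard two-parameter recursive definition.
    if m == 0:
        return n + 1
    if n == 0:
        return ackermann(m - 1, 1)
    return ackermann(m - 1, ackermann(m, n - 1))

# Verification: Ackermann(3, n) = 2^(n+3) - 3. n is kept small here
# because the number of recursive calls grows very quickly.
for n in range(1, 5):
    assert ackermann(3, n) == 2 ** (n + 3) - 3
print([ackermann(3, n) for n in range(1, 5)])  # [13, 29, 61, 125]
```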
Other Benchmarks
Whetstone, U.S. Steel, LINPACK, Dhrystone, Doduc, TOP, Lawrence Livermore Loops, Digital Review Labs, and the Abingdon Cross Image-Processing Benchmark
Debit-Credit Benchmark
A de facto standard for transaction processing systems.
First recorded in Anon et al. (1985). In 1973, a retail bank wanted to put its 1000 branches, 10,000 tellers, and 10,000,000 accounts online with a peak load of 100 Transactions Per Second (TPS).
Each TPS of capacity requires 10 branches, 100 tellers, and 100,000 accounts.
Debit-Credit (Cont)
Debit-Credit Benchmark (Cont)
Metric: price/performance ratio.
Performance: throughput in terms of TPS such that 95% of all transactions provide one second or less response time.
Response time: measured as the time interval between the arrival of the last bit from the communications line and the sending of the first bit to the communications line.
Cost: total expenses for a five-year period on purchase, installation, and maintenance of the hardware and software in the machine room. Cost does not include expenditures for terminals, communications, application development, or operations.
Debit-Credit Transaction Pseudo-Code
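The pseudo-code figure itself is not reproduced in this transcript. The following Python sketch captures the transaction's structure over the four record types (account, teller, branch, history); the ToyDB class and its method names are illustrative stand-ins, not part of the benchmark definition.

```python
# Hedged sketch of one debit-credit transaction. ToyDB is an invented
# in-memory stand-in for the benchmark's database, not real API.
class ToyDB:
    """Minimal stand-in covering the four record types: account,
    teller, branch, and history."""
    def __init__(self):
        self.tables = {"account": {}, "teller": {}, "branch": {}}
        self.history = []                      # audit trail of records

    def read(self, table, key):
        return self.tables[table].get(key, 0)  # balances start at zero

    def write(self, table, key, value):
        self.tables[table][key] = value

def debit_credit(db, account_id, teller_id, branch_id, delta):
    # One transaction: apply delta to the account, teller, and branch
    # balances, then append an audit record to the history file.
    balance = db.read("account", account_id) + delta
    db.write("account", account_id, balance)
    db.write("teller", teller_id, db.read("teller", teller_id) + delta)
    db.write("branch", branch_id, db.read("branch", branch_id) + delta)
    db.history.append((account_id, teller_id, branch_id, delta))
    return balance

db = ToyDB()
print(debit_credit(db, account_id=7, teller_id=3, branch_id=1, delta=100))  # 100
```

In the real benchmark each of these steps runs inside a database transaction against terminal-generated input; the sketch only shows the record accesses.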
Pseudo-code Definition of Debit-Credit
Four record types: account, teller, branch, and history.
Fifteen percent of the transactions require remote access.
The Transaction Processing Performance Council (TPC) was formed in August 1988.
TPC Benchmark (TM) A is a variant of debit-credit. Metric: TPS such that 90% of all transactions provide two seconds or less response time.
SPEC Benchmark Suite
Systems Performance Evaluation Cooperative (SPEC): a non-profit corporation formed by leading computer vendors to develop a standardized set of benchmarks.
Release 1.0 consists of ten benchmarks: GCC, Espresso, Spice 2g6, Doduc, NASA7, LI, Eqntott, Matrix300, Fpppp, and Tomcatv.
These primarily stress the CPU, the Floating Point Unit (FPU), and to some extent the memory subsystem, and are intended to compare CPU speeds.
Benchmarks to compare I/O and other subsystems may be included in future releases.
SPEC Benchmark Suite (Cont)
1. GCC: The time for the GNU C Compiler to convert 19 preprocessed source files into assembly language output is measured. This benchmark is representative of a software engineering environment and measures the compiling efficiency of a system.
2. Espresso: An Electronic Design Automation (EDA) tool that performs heuristic boolean function minimization for Programmable Logic Arrays (PLAs). The elapsed time to run a set of seven input models is measured.
3. Spice 2g6: Spice, another representative of the EDA environment, is a widely used analog circuit simulation tool. The time to simulate a bipolar circuit is measured.
4. Doduc: A synthetic benchmark that performs a Monte Carlo simulation of certain aspects of a nuclear reactor. Because of its iterative structure and abundance of short branches and compact loops, it tests cache memory effectiveness.
5. NASA7: A collection of seven floating-point-intensive kernels performing matrix operations on double-precision data.
SPEC Benchmark Suite (Cont)
6. LI: The elapsed time for the LISP interpreter to solve the popular nine-queens problem is measured.
7. Eqntott: Translates a logical representation of a boolean equation to a truth table.
8. Matrix300: Performs various matrix operations using several LINPACK routines on matrices of size 300 × 300. The code uses double-precision floating-point arithmetic and is highly vectorizable.
9. Fpppp: A quantum chemistry benchmark that performs two-electron integral derivatives using double-precision floating-point FORTRAN. It is difficult to vectorize.
10. Tomcatv: A vectorized mesh generation program using double-precision floating-point FORTRAN. Since it is highly vectorizable, substantial speedups have been observed on several shared-memory multiprocessor systems.
SPEC (Cont)
The elapsed time to run two copies of a benchmark on each of the N processors of a system (a total of 2N copies) is measured and compared with the time to run two copies of the benchmark on a reference system (the VAX-11/780 for Release 1.0).
For each benchmark, the ratio of the time on the reference system to the time on the system under test is reported as the SPECthruput, using the notation #CPU@Ratio. For example, a system with three CPUs taking 1/15 as long as the reference system on the GCC benchmark has a SPECthruput of 3@15.
This is a measure of the per-processor throughput relative to the reference system.
SPEC (Cont)
The aggregate throughput for all processors of a multiprocessor system can be obtained by multiplying the ratio by the number of processors. For example, the aggregate throughput for the above system is 3 × 15 = 45.
The geometric mean of the SPECthruputs for the 10 benchmarks is used to indicate the overall performance of the suite and is called the SPECmark.
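The SPECmark-style aggregation can be sketched as follows; the per-benchmark ratios here are made-up numbers, not measured SPECthruputs.

```python
# Sketch: combining per-benchmark ratios into one figure via the
# geometric mean, as SPECmark does. The ratios are invented values.
import math

ratios = [15.0, 12.0, 20.0, 18.0, 10.0]

geo_mean = math.prod(ratios) ** (1 / len(ratios))
arith_mean = sum(ratios) / len(ratios)

print(f"geometric mean:  {geo_mean:.2f}")    # ~14.53
print(f"arithmetic mean: {arith_mean:.2f}")  # 15.00
```

The geometric mean is used rather than the arithmetic mean because it treats relative speedups symmetrically: doubling performance on one benchmark and halving it on another leaves the mean unchanged.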