Types of Software Testing



Transcript of Types of Software Testing

Page 1: Types of Software Testing


Functional testing
From Wikipedia, the free encyclopedia

Functional testing is a type of black box testing that bases its test cases on the specifications of the software component under test. Functions are tested by feeding them input and examining the output; internal program structure is rarely considered (unlike white-box testing). [1]

Functional testing differs from system testing in that functional testing "verif[ies] a program by checking it against ... design document(s) or specification(s)", while system testing "validate[s] a program by checking it against the published user or system requirements" (Kaner, Falk, Nguyen 1999, p. 52).

Functional testing typically involves five steps:

1. The identification of functions that the software is expected to perform

2. The creation of input data based on the function's specifications

3. The determination of output based on the function's specifications

4. The execution of the test case

5. The comparison of actual and expected outputs
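As a rough illustration of these five steps (not part of the original article), the sketch below tests a hypothetical discount_price function against an invented specification; the function, its inputs and the expected outputs are all assumptions made purely for the example:

    # Hypothetical specification: discount_price(unit_price, quantity) applies a
    # 10% discount when quantity >= 100, otherwise charges the full price.
    def discount_price(unit_price, quantity):
        total = unit_price * quantity
        return total * 0.9 if quantity >= 100 else total

    def test_discount_price():
        # Steps 2-3: input data and expected outputs derived from the specification.
        cases = [
            ((2.0, 50), 100.0),   # below the threshold: no discount
            ((2.0, 100), 180.0),  # at the threshold: 10% discount applied
        ]
        for (unit_price, quantity), expected in cases:
            actual = discount_price(unit_price, quantity)             # step 4: execute
            assert abs(actual - expected) < 1e-9, (actual, expected)  # step 5: compare

    test_discount_price()
    print("all functional test cases passed")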

See also

Non-functional testing

Acceptance testing

Regression testing

System testing

Software testing

Integration testing

Unit testing

Database testing

References

1. ^ Kaner, Falk, Nguyen. Testing Computer Software. Wiley Computer Publishing, 1999, p. 42. ISBN 0-471-35846-0.

External links

JTAG for Functional Test without Boundary-scan (http://www.corelis.com/blog/index.php/blog/2011/01/10/jtag-for-functional-test-without-boundary-scan)

Retrieved from "http://en.wikipedia.org/w/index.php?title=Functional_testing&oldid=510357783"

Categories: Software testing | Computing stubs

This page was last modified on 2 September 2012 at 00:29. Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. See Terms of Use for details. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.

Page 3: Types of Software Testing

nresult.com/quality-assurance/functionality-testing/

Vision™

Functionality Testing

What is Functionality Testing?

Functionality testing is employed to verify whether your product meets the intended specifications and functional requirements laid out in your development documentation.

What is the purpose of Functionality Testing?

As competition in the software and hardware development arena intensifies, it becomes critical to deliver products that are virtually bug-free. Functionality testing helps your company deliver products with a minimum amount of issues to an increasingly sophisticated pool of end users. Potential purchasers of your products may find honest and often brutal product reviews online from consumers and professionals, which might deter them from buying your software. nResult will help ensure that your product functions as intended, keeping your service and support calls to a minimum. Let our trained professionals find functional issues and bugs before your end users do!

How can nResult help you deliver high quality products that are functionally superior to products offered by your competition?

We offer several types of functional testing techniques:

Ad Hoc – Takes advantage of individual testing talents based upon product goals, level of user capabilities and possible areas and features that may create confusion. The tester will generate test cases quickly, on the spur of the moment.

Exploratory – The tester designs and executes tests while learning the product. Test design is organized by a set of concise patterns designed to assure that testers don’t miss anything of importance.

Combination – The tester performs a sequence of events using different paths to complete tasks. This can uncover bugs related to order of events that are difficult to find using other methods.

Scripted – The tester uses a test script that lays out the specific functions to be tested. A test script can be provided by the customer/developer or constructed by nResult, depending on the needs of your organization.

Let nResult ensure that your hardware or software will function as intended. Our team will check for any anomalies or bugs in your product, through any or all stages of development, to help increase your confidence level in the product you are delivering to market. nResult offers detailed, reasonably priced solutions to meet your testing needs.

Services offered by nResult:

Accessibility Testing – With accessibility testing, nResult ensures that your software or hardware product is accessible and effective for those with disabilities.

Compatibility Testing – Make sure your software applications and hardware devices function correctly with all relevant operating systems and computing environments.

Interoperability Testing – Make sure your software applications and hardware devices function correctly with all other products in the market.

Competitive Analysis – Stack up next to your competitors with a full competitive analysis report.

Performance Testing – Ensure that your software/web application or website is equipped to handle anticipated and increased network traffic with adequate performance testing. Performance Testing includes Load Testing and Benchmarking.

Localization Testing – Make certain that your localized product blends flawlessly with the native language and culture.

Medical Device Testing – nResult provides solutions for complying with challenging and expensive testing requirements for your medical device.

Web Application Testing – Find and eliminate weaknesses in your website’s usability, functionality, performance, and browser compatibility.

Certification Testing – Add instant credibility to your product from one of the most trusted names in testing.

Security Testing – Test your product for common security vulnerabilities; gain peace of mind in an insecure world.


Page 4: Types of Software Testing

www.PerfTestPlus.com

© 2006 PerfTestPlus, Inc. All rights reserved.

Introduction to Performance Testing Page 1

Introduction to Performance Testing

Scott Barber
Chief Technology Officer

PerfTestPlus, Inc.

First Presented for:

PSQT/PSTT Conference
Washington, DC, May 2003

Page 5: Types of Software Testing


Agenda

Why Performance Test?
What is Performance related testing?
Intro to Performance Engineering Methodology
Where to go for more info
Summary / Q&A

Page 6: Types of Software Testing


Why Performance Test?

Speed - Does the application respond quickly enough for the intended users?

Scalability – Will the application handle the expected user load and beyond? (AKA Capacity)

Stability – Is the application stable under expected and unexpected user loads? (AKA Robustness)

Confidence – Are you sure that users will have a positive experience on go-live day?

Page 7: Types of Software Testing


Speed

User Expectations
– Experience
– Psychology
– Usage

System Constraints
– Hardware
– Network
– Software

Costs
– Speed can be expensive!

Page 8: Types of Software Testing


Scalability

How many users…
– before it gets “slow”?
– before it stops working?
– will it sustain?
– do I expect today?
– do I expect before the next upgrade?

How much data can it hold?
– Database capacity
– File Server capacity
– Back-up Server capacity
– Data growth rates

Page 9: Types of Software Testing


Stability

What happens if…
– there are more users than we expect?
– all the users do the same thing?
– a user gets disconnected?
– there is a Denial of Service Attack?
– the web server goes down?
– we get too many orders for the same thing?

Page 10: Types of Software Testing


Confidence

If you know what the performance is…
– you can assess risk.
– you can make informed decisions.
– you can plan for the future.
– you can sleep the night before go-live day.

The peace of mind that it will work on go-live day alone justifies the cost of performance testing.

Page 11: Types of Software Testing


What is Performance Related Testing?

Performance Validation
Performance Testing
Performance Engineering

[Slide diagram, not reproduced: labels include Detect, Diagnose, Resolve, "What?", "Why?", "Not Resolved", and Compare & Contrast]

Page 12: Types of Software Testing


Performance Validation

“Performance validation is the process by which software is tested with the intent of determining if the software meets pre-existing performance requirements. This process aims to evaluate compliance.”

Primarily used for…
– determining SLA compliance.
– IV&V (Independent Validation and Verification).
– validating subsequent builds/releases.
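A rough sketch of this kind of validation, checking measured response times against a hypothetical SLA of a 2-second 95th-percentile response time; the threshold and the timings are invented for the example, not taken from the slides:

    import math

    # Hypothetical SLA: 95% of requests must complete within 2.0 seconds.
    SLA_P95_SECONDS = 2.0

    def percentile(samples, p):
        """Nearest-rank percentile of a list of samples (p in 0-100)."""
        ordered = sorted(samples)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    # Response times (seconds) collected during a validation run -- invented data.
    response_times = [0.8, 1.1, 0.9, 1.7, 2.4, 1.2, 0.7, 1.9, 1.0, 1.5]

    p95 = percentile(response_times, 95)
    verdict = "meets the SLA" if p95 <= SLA_P95_SECONDS else "violates the SLA"
    print("p95 response time = %.2fs -> %s" % (p95, verdict))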

Page 13: Types of Software Testing


Performance Testing

“Performance testing is the process by which software is tested to determine the current system performance. This process aims to gather information about current performance, but places no value judgments on the findings.”

Primarily used for…
– determining capacity of existing systems.
– creating benchmarks for future systems.
– evaluating degradation with various loads and/or configurations.

Page 14: Types of Software Testing


Performance Engineering

“Performance engineering is the process by which software is tested and tuned with the intent of realizing the required performance. This process aims to optimize the most important application performance trait, user experience.”

Primarily used for…
– new systems with pre-determined requirements.
– extending the capacity of old systems.
– "fixing" systems that are not meeting requirements/SLAs.

Page 15: Types of Software Testing


Compare and Contrast

Validation and Testing:
– Are a subset of Engineering.
– Are essentially the same except:
  • Validation usually focuses on a single scenario and tests against pre-determined standards.
  • Testing normally focuses on multiple scenarios with no pre-determined standards.
– Are generally not iterative.
– May be conducted separate from software development.
– Have clear end points.

Page 16: Types of Software Testing


Compare and Contrast

Engineering:
– Is iterative.
– Has clear goals, but ‘fuzzy’ end points.
– Includes the effort of tuning the application.
– Focuses on multiple scenarios with pre-determined standards.
– Heavily involves the development team.
– Occurs concurrently with software development.

Page 17: Types of Software Testing


Intro to PE Methodology

Evaluate System
Develop Test Assets
Baselines and Benchmarks
Analyze Results
Tune
Identify Exploratory Tests
Execute Scheduled Tests
Complete Engagement

Page 18: Types of Software Testing


Evaluate System

Determine performance requirements.
Identify expected and unexpected user activity.
Determine test and/or production architecture.
Identify non-user-initiated (batch) processes.
Identify potential user environments.
Define expected behavior during unexpected circumstances.

Page 19: Types of Software Testing


Develop Test Assets

Create Strategy Document.
Develop Risk Mitigation Plan.
Develop Test Data.
Automated test scripts:
– Plan
– Create
– Validate

Page 20: Types of Software Testing


Baseline and Benchmarks

Most important for iterative testing.
Baseline (single user) for initial basis of comparison and ‘best case’.
Benchmark (15-25% of expected user load) determines actual state at loads expected to meet requirements.
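A toy sketch of the baseline/benchmark idea using Python threads; fetch is a stand-in for one user transaction, the expected load of 50 users is invented, and a real test would use a proper load-testing tool rather than this kind of script:

    import time
    import statistics
    from concurrent.futures import ThreadPoolExecutor

    def fetch():
        """Stand-in for one user transaction; replace with a real request."""
        time.sleep(0.05)  # simulate roughly 50 ms of server work

    def run_load(concurrent_users, requests_per_user):
        """Drive the transaction at a given concurrency; return response times."""
        timings = []
        def one_user():
            for _ in range(requests_per_user):
                start = time.perf_counter()
                fetch()
                timings.append(time.perf_counter() - start)
        with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
            for _ in range(concurrent_users):
                pool.submit(one_user)
        return timings

    expected_load = 50                              # assumed production user load
    baseline = run_load(1, 20)                      # single user: 'best case'
    benchmark = run_load(expected_load // 5, 20)    # ~20% of the expected load

    print("baseline  mean response %.3fs" % statistics.mean(baseline))
    print("benchmark mean response %.3fs" % statistics.mean(benchmark))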

Page 21: Types of Software Testing


Analyze Results

Most important.
Most difficult.
Focuses on:
– Have the performance criteria been met?
– What are the bottlenecks?
– Who is responsible to fix those bottlenecks?
– Decisions.

Page 22: Types of Software Testing


Tune

Engineering only.
Highly collaborative with development team.
Highly iterative.
Usually, the performance engineer ‘supports’ and ‘validates’ while developers/admins ‘tune’.

Page 23: Types of Software Testing


Identify Exploratory Tests

Engineering only.
Exploits known bottleneck.
Assists with analysis & tuning.
Significant collaboration with ‘tuners’.
Not robust tests – quick and dirty, not often reusable/relevant after tuning is complete.

Page 24: Types of Software Testing


Execute Scheduled Tests

Only after Baseline and/or Benchmark tests.
These tests evaluate compliance with documented requirements.
Often are conducted on multiple hardware/configuration variations.

Page 25: Types of Software Testing


Complete Engagement

Document:
– Actual Results
– Tuning Summary
– Known bottlenecks not tuned
– Other supporting information
– Recommendation

Package Test Assets:
– Scripts
– Documents
– Test data

Page 26: Types of Software Testing


Where to go for more information

http://www.PerfTestPlus.com (My site)
http://www.QAForums.com (Huge QA Forum)
http://www.loadtester.com (Good articles and links)
http://www.segue.com/html/s_solutions/papers/s_wp_info.htm (Good articles and statistics)
http://www.keynote.com/resources/resource_library.html (Good articles and statistics)

Page 27: Types of Software Testing


Summary

We test performance to:
– Evaluate Risk.
– Determine system capabilities.
– Determine compliance.

Performance Engineering Methodology:
– Ensures goals are accomplished.
– Defines tasks.
– Identifies critical decision points.
– Shortens testing lifecycle.

Page 28: Types of Software Testing


Questions and Contact Information

Scott Barber
Chief Technology Officer

PerfTestPlus, Inc.

E-mail: [email protected]
Web Site: www.PerfTestPlus.com

Page 29: Types of Software Testing

Software Performance Testing

Xiang Gan

Helsinki 26.09.2006

Seminar paper

University of Helsinki

Department of Computer Science

Page 30: Types of Software Testing

HELSINGIN YLIOPISTO − HELSINGFORS UNIVERSITET − UNIVERSITY OF HELSINKI

Faculty/Section: Faculty of Science
Department: Department of Computer Science
Author: Xiang Gan
Title: Software performance testing
Month and year: 26.9.2006
Number of pages: 9

Abstract: Performance is one of the most important aspects concerned with the quality of software. It indicates how well a software system or component meets its requirements for timeliness. Till now, however, no significant progress has been made on software performance testing. This paper introduces two software performance testing approaches, which are named workload characterization and early performance testing with distributed application, respectively.

ACM Computing Classification System (CCS): A.1 [Introductory and Survey], D.2.5 [Testing and Debugging]

Keywords: software performance testing, performance, workload, distributed application

Page 31: Types of Software Testing

Contents

1 Introduction ... 1
2 Workload characterization approach ... 2
  2.1 Requirements and specifications in performance testing ... 2
  2.2 Characterizing the workload ... 2
  2.3 Developing performance test cases ... 3
3 Early performance testing with distributed application ... 4
  3.1 Early testing of performance ... 5
    3.1.1 Selecting performance use-cases ... 5
    3.1.2 Mapping use-cases to middleware ... 6
    3.1.3 Generating stubs ... 7
    3.1.4 Executing the test ... 7
4 Conclusion ... 8
References ... 9

Page 32: Types of Software Testing

1 Introduction

Although the functionality supported by a software system is apparently important, it is usually not the only concern. The various concerns of individuals and of society as a whole may face significant breakdowns and incur high costs if the system cannot meet the quality of service requirements of those non-functional aspects, for instance performance, availability, security and maintainability, that are expected from it.

Performance is an indicator of how well a software system or component meets its requirements for timeliness. There are two important dimensions to software performance timeliness: responsiveness and scalability [SmW02]. Responsiveness is the ability of a system to meet its objectives for response time or throughput. The response time is the time required to respond to stimuli (events). The throughput of a system is the number of events processed in some interval of time [BCK03]. Scalability is the ability of a system to continue to meet its response time or throughput objectives as the demand for the software function increases [SmW02].

As Weyuker and Vokolos argued [WeV00], usually the primary problems that projects report after field release are not system crashes or incorrect system responses, but rather system performance degradation or problems handling required system throughput. If queried, the fact is often that although the software system has gone through extensive functionality testing, it was never really tested to assess its expected performance. They also found that performance failures can be roughly classified into the following three categories:

– the lack of performance estimates,
– the failure to have proposed plans for data collection,
– the lack of a performance budget.

This seminar paper concentrates upon the introduction of two software performance testing approaches. Section 2 introduces a workload characterization approach which requires a careful collection of data for significant periods of time in the production environment. In addition, the importance of clear performance requirements written in requirement and specification documents is emphasized, since it is the fundamental basis for carrying out performance testing. Section 3 focuses on an approach to test the performance of distributed software applications as early as possible during the software engineering process, since it is obviously a large overhead for the development team to fix performance problems at the end of the whole process. Even worse, it may be impossible to fix some performance problems without sweeping redesign and re-implementation, which can eat up lots of time and money. A conclusion is made at last in Section 4.

Page 33: Types of Software Testing

2 Workload characterization approach

As indicated [AvW04], one of the key objectives of performance testing is to uncover problems that are revealed when the system is run under specific workloads. This is sometimes referred to in the software engineering literature as an operational profile [Mus93]. An operational profile is a probability distribution describing the frequency with which selected important operations are exercised. It describes how the system has historically been used in the field and thus is likely to be used in the future. To this end, a performance requirement is one of the necessary prerequisites which will be used to determine whether software performance testing has been conducted in a meaningful way.

2.1 Requirements and specifications in performance testing

Performance requirements must be provided in a concrete, verifiable manner [VoW98]. They should be explicitly included in a requirements or specification document and might be provided in terms of throughput or response time, and might also include system availability requirements.

One of the most serious problems with performance testing is making sure that the stated requirements can actually be checked to see whether or not they are fulfilled [WeV00]. For instance, in functional testing, it seems to be useless to choose inputs with which it is entirely impossible to determine whether or not the output is correct. The same situation applies to performance testing. It is important to write requirements that are meaningful for the purpose of performance testing. It is quite easy to write a performance requirement for an ATM such as: one customer can finish a single transaction of withdrawing money from the machine in less than 25 seconds. Then it might be possible to show that the time used in most of the test cases is less than 25 seconds, while it only fails in one test case. Such a situation, however, cannot guarantee that the requirement has been satisfied. A more plausible performance requirement should state that the time used in such a single transaction is less than 25 seconds when the server at the host bank is run with an average workload. Assuming that a benchmark has been established which can accurately reflect the average workload, it is then possible to test whether this requirement has been satisfied or not.
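A hedged sketch of how the reworded ATM requirement could be checked once such an average-workload benchmark is in place; the measurement function and its sample data are placeholders invented for the example, not anything from the paper:

    # Hypothetical requirement: a single withdrawal transaction finishes in under
    # 25 seconds while the host-bank server runs at its benchmark average workload.
    REQUIREMENT_SECONDS = 25.0

    def measure_withdrawal_times(runs):
        """Placeholder: would drive the ATM withdrawal workflow against a server
        already loaded with the benchmarked 'average' workload."""
        return [12.3, 14.8, 11.0, 17.6, 13.9][:runs]  # invented sample timings

    times = measure_withdrawal_times(5)
    worst = max(times)
    assert worst < REQUIREMENT_SECONDS, "requirement violated: %.1fs" % worst
    print("all %d transactions under %.0fs (slowest %.1fs)"
          % (len(times), REQUIREMENT_SECONDS, worst))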

2.2 Characterizing the workload

In order to do the workload characterization, it is necessary to collect data for significant periods of time in the production environment. This can help characterize the system workload, and these representative workloads can then be used to determine what the system performance will look like when it is run in production on significantly large workloads.

Page 34: Types of Software Testing

The workload characterization approach described by Alberto Avritzer and Joe Kondek [AKL02] comprises two steps, which are illustrated as follows.

The first step is to model the software system. Since most industrial software systems are usually too complex to handle all the possible characteristics, modeling is necessary. The goal of this step is thus to establish a simplified version of the system in which the key parameters have been identified. It is essential that the model be as close to the real system as possible, so that the data collected from it will realistically reflect the true system’s behavior. Meanwhile, it shall be simple enough that it is feasible to collect the necessary data.

The second step is to collect data while the system is in operation, after the system has been modeled and key parameters identified. According to the paper [AKL02], this activity should usually be done for periods of two to twelve months. Following that, the data must be analyzed and a probability distribution determined. Although the input space, in theory, is quite enormous because of the non-uniform property of the frequency distribution, experience has shown that there are a relatively small number of inputs which actually occur during the period of data collection. The paper [AKL02] showed that it is quite common for only several thousand inputs to correspond to more than 99% of the probability mass associated with the input space. This means that a very accurate picture of the performance that the user of the system tends to see in the field can be drawn by testing only this relatively small number of inputs.
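A small sketch of how a characterized workload (an operational profile) could then drive test input generation: operations are sampled according to their observed probabilities. The operations and probabilities below are invented for illustration, not taken from [AKL02]:

    import random

    # Operational profile: observed probability of each operation (invented numbers;
    # a real profile would come from months of data collected in production).
    operational_profile = {
        "check_balance": 0.55,
        "withdraw": 0.30,
        "deposit": 0.10,
        "transfer": 0.05,
    }

    def sample_operations(n, profile, seed=42):
        """Draw n operations whose frequencies match the profile."""
        rng = random.Random(seed)
        ops = list(profile)
        weights = [profile[op] for op in ops]
        return rng.choices(ops, weights=weights, k=n)

    workload = sample_operations(10000, operational_profile)
    print({op: workload.count(op) for op in operational_profile})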

2.3 Developing performance test cases

After performing the workload characterization and determining the paramount system characteristics that require data collection, we now need to use that information to design performance test cases that reflect field production usage of the system. The following prescriptions were defined by Weyuker and Vokolos [WeV00]. One of the most interesting points in this list of prescriptions is that they also defined how to design performance test cases when detailed historical data is unavailable. Their situation at the time was that a new platform had been purchased but was not yet available, while the software had already been designed and written explicitly for the new hardware platform. The goal of such work is to determine whether there are likely to be performance problems once the hardware is delivered and the software is installed and running with the real customer base.

Typical steps to form performance test cases are as follows:

– identify the software processes that directly influence the overall performance of the system,

– for each process, determine the input parameters that will most significantly influence the performance of the system. It is important to limit the parameters to the essential ones so that the set of test cases selected will be of manageable size,

Page 35: Types of Software Testing

– determine realistic values for these parameters by collecting and analyzing existing usage data. These values should reflect desired usage scenarios, including both average and heavy workloads,

– if there are parameters for which historical usage data are not available, then estimate reasonable values based on such things as the requirements used to develop the system or experience gathered by using an earlier version of the system or similar systems,

– if, for a given parameter, the estimated values form a range, then select representative values from within this range that are likely to reveal useful information about the performance behavior of the system. Each selected value should then form a separate test case.

It is, however, important to recognize that this list cannot be treated as a precise preparation for test cases, since every system is different.
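One simple way to mechanize the last two prescriptions is to pick representative values per parameter and enumerate their combinations. Note that the list above only asks for one test case per selected value, so the full cross-product below is a stricter variant chosen for illustration, and all parameter names and values are invented:

    from itertools import product

    # Representative values for the parameters judged most performance-relevant
    # (invented; a real set comes from analyzing historical usage data).
    parameters = {
        "concurrent_users": [10, 100, 500],   # average, heavy and peak workloads
        "message_size_kb": [1, 64],           # typical and large payloads
        "cache_enabled": [True, False],
    }

    test_cases = [dict(zip(parameters, values))
                  for values in product(*parameters.values())]

    for case in test_cases:
        print(case)  # each combination becomes one performance test case
    print("%d performance test cases generated" % len(test_cases))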

3 Early performance testing with distributed application

Testing techniques are usually applied towards the end of a project. However, most researchers and practitioners agree that the most critical performance problems, as a quality of interest, depend upon decisions made in the very early stages of the development life cycle, such as architectural choices. Although iterative and incremental development has been widely promoted, the situation concerning testing techniques has not changed very much.

With the increasing advance of distributed component technologies, such as J2EE and CORBA, distributed systems are no longer built from scratch [DPE04]. Modern distributed systems are often built on top of middleware. As a result, when the architecture is defined, a certain part of the implementation of a class of distributed applications is already available. It was therefore argued that this enables performance testing to be successfully applied at such early stages.

The method proposed by Denaro, Polini and Emmerich [DPE04] is based upon the observation that the middleware used to build a distributed application often determines the overall performance of the application. However, they also noted that only the coupling between the middleware and the application architecture determines the actual performance. The same middleware may perform quite differently in the context of different applications. Based on this observation, architecture designs were proposed as a tool to derive application-specific performance test cases which can be executed on the early-available middleware platform on which a distributed application is built. This allows measurements of performance to be made at a very early stage of the development process.

Page 36: Types of Software Testing

3.1 Early testing of performance

The approach for early performance testing of distributed component-based applications consists of four phases [DPE04]:

– selection of the use-case scenarios relevant to performance, given a set of architecture designs,

– mapping of the selected use-cases to the actual deployment technology and platform,

– creation of stubs of components that are not available in the early stages of the development, but are needed to implement the use cases, and

– execution of the test.

The detailed contents in each phase are discussed in the following sub-sections.

3.1.1 Selecting performance use-cases

First of all, the design of functional test cases is entirely different from that in performance testing, as already indicated in the previous section. Moreover, for performance testing of distributed applications, the main parameters involved are much more complicated than those described before. Table 1 is excerpted from the paper [DPE04] to illustrate this point.

Table 1: Performance parameters [DPE04].

Apart from traditional concerns about workloads and physical resources, consideration of the middleware configuration is also highlighted in this table (in this case, it describes J2EE-based middleware).

Page 37: Types of Software Testing

The last row of the table classifies the relevant interactions in distributed settings according to the place where they occur. This taxonomy is far from complete; however, it was believed that such a taxonomy of distributed interactions is key to using this approach. The next step is the definition of appropriate metrics to evaluate the performance relevance of the available use-cases according to the interactions that they trigger.

3.1.2 Mapping use-cases to middleware

At the early stages of the development process, software architecture is generally defined at a very abstract level. It usually just describes the business logic and abstracts away many details of deployment platforms and technologies. From this point, it is necessary to understand how abstract use-cases are mapped to possible deployment technologies and platforms.

To facilitate the mapping from abstract use-cases to concrete instances, software connectors might be a feasible solution, as indicated in [DPE04]. Software connectors mediate interactions among components. That is, they establish the rules that govern component interaction and specify any auxiliary mechanisms required [MMP00]. According to the paper [MMP00], four major categories of connectors were identified, communication, coordination, conversion, and facilitation, based on the services provided to interacting components. In addition, major connector types were also identified: procedure call, data access, linkage, stream, event, arbitrator, adaptor, and distributor. Each connector type supports one or more interaction services. The architecturally relevant details of each connector type are captured by dimensions and, possibly, sub-dimensions. One dimension consists of a set of values. Connector species are created by choosing the appropriate dimensions and values for those dimensions from connector types. Figure 1 depicts the software connector classification framework, which might provide a more descriptive illustration of the whole structure.

As a particular element of software architecture, software connectors were studied to investigate the possibility of defining systematic mappings between architectures and middleware. Well-characterized software connectors may be associated with deployment topologies that preserve the properties of the original architecture [DPE04]. As indicated, however, further work is still required to understand the many dimensions and species of software connectors and their relationships with the possible deployment platforms and technologies.

Page 38: Types of Software Testing


Figure 1: Software connector classification framework [MMP00].

3.1.3 Generating stubs

To actually implement the test cases, one needs to solve the problem that not all of the application components which participate in the use-cases are available in the early stages of development. Stubs should be used in place of the missing components. Stubs are fake versions of components that can be used instead of the corresponding components for instantiating the abstract use-cases. Stubs only take care that the distributed interactions happen as specified and that the other components are coherently exercised.

The main hypothesis of this approach is that performance measurements in the presence of the stubs are decent approximations of the actual performance of the final application [DPE04]. This results from the observation that the available components, for instance middleware and databases, embed the software that mainly impacts performance. The coupling between such implementation support and the application-specific behavior can be extracted from the use-cases, while the implementation details of the business components remain negligible.
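A toy sketch of the stub idea: a missing business component is replaced by a fake that only honours the declared interaction and simulates its cost, so that the middleware-level interactions of a use-case can still be exercised and timed. The component, its interface and the latency value are all invented for illustration:

    import time

    class InventoryServiceStub:
        """Fake version of a component that is not yet implemented. It performs no
        real business logic; it only honours the interface and simulates the cost
        of the remote interaction so end-to-end timing can be measured early."""

        def __init__(self, simulated_latency_s=0.02):
            self.simulated_latency_s = simulated_latency_s

        def reserve_items(self, order_id, items):
            time.sleep(self.simulated_latency_s)  # stand-in for the remote call
            return {"order_id": order_id, "reserved": len(items)}

    def place_order_use_case(inventory_service, order_id, items):
        """Use-case under test, exercised end-to-end with the stub in place."""
        start = time.perf_counter()
        result = inventory_service.reserve_items(order_id, items)
        return result, time.perf_counter() - start

    result, elapsed = place_order_use_case(InventoryServiceStub(), "A-17", ["x", "y"])
    print(result, "took %.3fs" % elapsed)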

3.1.4 Executing the test

Building the support for test execution involves mostly technical problems, provided the scientific problems raised in the previous three sub-sections have been solved. In addition, several aspects, for example the deployment and implementation of workload generators and the execution of measurements, can be automated.

Page 39: Types of Software Testing

4 Conclusion

In all, two software performance testing approaches were described in this paper. The workload characterization approach can be treated as a traditional performance testing approach: it requires careful collection of a series of data in the production field and can only be applied at the end of the project. In contrast, the early performance testing approach for distributed software applications is more novel, since it encourages performance testing early in the development process, that is, when the architecture is defined. Although it is still not a very mature approach and more research needs to be conducted on it according to its advocates [DPE04], its future looks promising, since it allows performance problems to be fixed as early as possible, which is quite attractive.

Several other aspects also need to be discussed. First of all, there has been very little research published in the area of software performance testing. For example, with the search facility IEEE Xplore, entering "software performance testing" in the search field returned only 3 results when this paper was written. Such a situation indicates that the field of software performance testing as a whole is only in its initial stage and needs much more emphasis in future. Secondly, the importance of requirements and specifications is discussed in this paper. The fact, however, is that usually no performance requirements are provided, which means that there is no precise way of determining whether or not the software performance is acceptable. Thirdly, a positive trend is that software performance, as an important quality, is increasingly emphasized during the development process. Smith and Williams [SmW02] proposed Software Performance Engineering (SPE), which is a systematic, quantitative approach to constructing software systems that meet performance objectives. It aids in tracking performance throughout the development process and prevents performance problems from emerging late in the life cycle.

Page 40: Types of Software Testing

References

[AKL02] Avritzer A., Kondek J., Liu D., Weyuker E.J., Software performance testing based on workload characterization. Proc. of the 3rd international workshop on software and performance, Jul. 2002, pp. 17-24.

[AvW04] Avritzer A. and Weyuker E.J., The role of modeling in the performance testing of e-commerce applications. IEEE Transactions on Software Engineering, 30, 12, Dec. 2004, pp. 1072-1083.

[BCK03] Bass L., Clements P., Kazman R., Software architecture in practice, second edition. Addison Wesley, Apr. 2003.

[DPE04] Denaro G., Polini A., Emmerich W., Early performance testing of distributed software applications. Proc. of the 4th international workshop on software and performance, 2004, pp. 94-103.

[MMP00] Mehta N., Medvidovic N. and Phadke S., Towards a taxonomy of software connectors. Proc. of the 22nd international conference on software engineering, 2000, pp. 178-187.

[Mus93] Musa J.D., Operational profiles in software reliability engineering. IEEE Software, 10, 2, Mar. 1993, pp. 14-32.

[SmW02] Smith C.U. and Williams L.G., Performance solutions: a practical guide to creating responsive, scalable software. Boston, MA, Addison Wesley, 2002.

[VoW98] Vokolos F.I., Weyuker E.J., Performance testing of software systems. Proc. of the 1st international workshop on software and performance, Oct. 1998, pp. 80-87.

[WeV00] Weyuker E.J. and Vokolos F.I., Experience with performance testing of software systems: issues, an approach and a case study. IEEE Transactions on Software Engineering, 26, 12, Dec. 2000, pp. 1147-1156.

Page 49: Types of Software Testing

Michael R. Lyu received the Ph.D. in computer science from University of California, Los Angeles in 1988. He is a Professor in the Computer Science and Engineering Department of the Chinese University of Hong Kong. He worked at the Jet Propulsion Laboratory, Bellcore, and Bell Labs; and taught at the University of Iowa. He has participated in more than 30 industrial projects, published over 250 papers, and helped to develop many commercial systems and software tools. Professor Lyu is frequently invited as a keynote or tutorial speaker to conferences and workshops in U.S., Europe, and Asia. He initiated the International Symposium on Software Reliability Engineering (ISSRE) in 1990. He also received Best Paper Awards in ISSRE'98 and in ISSRE'2003. Professor Lyu is an IEEE Fellow and an AAAS Fellow, for his contributions to software reliability engineering and software fault tolerance.

Software Reliability Engineering: A Roadmap
Michael R. Lyu

Page 50: Types of Software Testing

Software Reliability Engineering: A Roadmap

Michael R. Lyu Computer Science and Engineering Department

The Chinese University of Hong Kong, Hong Kong [email protected]

Abstract

Software reliability engineering is focused on engineering techniques for developing and maintaining software systems whose reliability can be quantitatively evaluated. In order to estimate as well as to predict the reliability of software systems, failure data need to be properly measured by various means during software development and operational phases. Moreover, credible software reliability models are required to track underlying software failure processes for accurate reliability analysis and forecasting. Although software reliability has remained an active research subject over the past 35 years, challenges and open questions still exist. In particular, vital future goals include the development of new software reliability engineering paradigms that take software architectures, testing techniques, and software failure manifestation mechanisms into consideration. In this paper, we review the history of software reliability engineering, the current trends and existing problems, and specific difficulties. Possible future directions and promising research subjects in software reliability engineering are also addressed.

1. Introduction

Software permeates our daily life. There is probably no other human-made material which is more omnipresent than software in our modern society. It has become a crucial part of many aspects of society: home appliances, telecommunications, automobiles, airplanes, shopping, auditing, web teaching, personal entertainment, and so on. In particular, science and technology demand high-quality software for making improvements and breakthroughs.

The size and complexity of software systems have grown dramatically during the past few decades, and the trend will certainly continue in the future. The data from industry show that the size of the software for various systems and applications has been growing exponentially for the past 40 years [20]. The trend of such growth in the telecommunication, business, defense, and transportation industries shows a compound growth rate of ten times every five years. Because of this ever-increasing dependency, software failures can lead to serious, even fatal, consequences in safety-critical systems as well as in normal business. Previous software failures have impaired several high-visibility programs and have led to loss of business [28].

The ubiquitous software is also invisible, and its invisible nature makes it both beneficial and harmful. From the positive side, systems around us work seamlessly thanks to the smooth and swift execution of software. From the negative side, we often do not know when, where and how software ever has failed, or will fail. Consequently, while reliability engineering for hardware and physical systems continuously improves, reliability engineering for software does not really live up to our expectation over the years.

This situation is frustrating as well as encouraging. It is frustrating because the software crisis identified as early as the 1960s still stubbornly stays with us, and “software engineering” has not fully evolved into a real engineering discipline. Human judgments and subjective favorites, instead of physical laws and rigorous procedures, dominate many decision making processes in software engineering. The situation is particularly critical in software reliability engineering. Reliability is probably the most important factor to claim for any engineering discipline, as it quantitatively measures quality, and the quantity can be properly engineered. Yet software reliability engineering, as elaborated in later sections, is not yet fully delivering its promise. Nevertheless, there is an encouraging aspect to this situation. The demands on, techniques of, and enhancements to software are continually increasing, and so is the need to understand

Page 51: Types of Software Testing

its reliability. The unsettled software crisis poses tremendous opportunities for software engineering researchers as well as practitioners. The ability to manage quality software production is not only a necessity, but also a key distinguishing factor in maintaining a competitive advantage for modern businesses.

Software reliability engineering is centered on a key attribute, software reliability, which is defined as the probability of failure-free software operation for a specified period of time in a specified environment [2]. Among other attributes of software quality such as functionality, usability, capability, and maintainability, etc., software reliability is generally accepted as the major factor in software quality since it quantifies software failures, which can make a powerful system inoperative. Software reliability engineering (SRE) is therefore defined as the quantitative study of the operational behavior of software-based systems with respect to user requirements concerning reliability. As a proven technique, SRE has been adopted either as standard or as best current practice by more than 50 organizations in their software projects and reports [33], including AT&T, Lucent, IBM, NASA, Microsoft, and many others in Europe, Asia, and North America. However, this number is still relatively small compared to the large amount of software producers in the world.
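Written as a formula, and under the common simplifying assumption of a constant failure intensity λ (an assumption added here for illustration, not made by the paper), this definition becomes:

    R(t) = P(\text{no failure in } [0, t]) = e^{-\lambda t},
    \qquad \mathrm{MTTF} = \int_0^{\infty} R(t)\,dt = \frac{1}{\lambda}

For example, a failure intensity of one failure per 1000 operating hours gives R(100 h) = e^{-0.1} ≈ 0.90.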

Existing SRE techniques suffer from a number of weaknesses. First of all, current SRE techniques collect the failure data during integration testing or system testing phases. Failure data collected during the late testing phase may be too late for fundamental design changes. Secondly, the failure data collected in the in-house testing may be limited, and they may not represent failures that would be uncovered under actual operational environment. This is especially true for high-quality software systems which require extensive and wide-ranging testing. The reliability estimation and prediction using the restricted testing data may cause accuracy problems. Thirdly, current SRE techniques or modeling methods are based on some unrealistic assumptions that make the reliability estimation too optimistic relative to real situations. Of course, the existing software reliability models have had their successes; but every model can find successful cases to justify its existence. Without cross-industry validation, the modeling exercise may become merely of intellectual interest and would not be widely adopted in industry. Thus, although SRE has been around for a while, credible software reliability techniques are still urgently needed, particularly for modern software systems [24].

In the following sections we will discuss the past, the present, and the future of software reliability engineering. We first survey what techniques have been proposed and applied in the past, and then describe what the current trend is and what problems and concerns remain. Finally, we propose the possible future directions in software reliability engineering.

2. Historical software reliability engineering techniques

In the literature a number of techniques have been proposed to attack the software reliability engineering problems based on software fault lifecycle. We discuss these techniques, and focus on two of them.

2.1. Fault lifecycle techniques

Achieving highly reliable software from the customer’s perspective is a demanding job for all software engineers and reliability engineers. [28] summarizes the following four technical areas which are applicable to achieving reliable software systems, and they can also be regarded as four fault lifecycle techniques:

1) Fault prevention: to avoid, by construction, fault occurrences.

2) Fault removal: to detect, by verification and validation, the existence of faults and eliminate them.

3) Fault tolerance: to provide, by redundancy, service complying with the specification in spite of faults having occurred or occurring.

4) Fault/failure forecasting: to estimate, by evaluation, the presence of faults and the occurrences and consequences of failures. This has been the main focus of software reliability modeling.

Fault prevention is the initial defensive mechanism against unreliability. A fault which is never created costs nothing to fix. Fault prevention is therefore the inherent objective of every software engineering methodology. General approaches include formal methods in requirement specifications and program verifications, early user interaction and refinement of the requirements, disciplined and tool-assisted software design methods, enforced programming principles and environments, and systematic techniques for software reuse. Formalization of software engineering processes with mathematically specified languages and tools is an aggressive approach to rigorous engineering of software systems. When applied successfully, it can completely prevent faults. Unfortunately, its application scope has been

Page 52: Types of Software Testing

limited. Software reuse, on the other hand, finds a wider range of applications in industry, and there is empirical evidence for its effectiveness in fault prevention. However, software reuse without proper certification could lead to disaster. The explosion of the Ariane 5 rocket, among others, is a classic example where seemingly harmless software reuse failed miserably, in which critical software faults slipped through all the testing and verification procedures, and where a system went terribly wrong only during complicated real-life operations.

Fault prevention mechanisms cannot guarantee avoidance of all software faults. When faults are injected into the software, fault removal is the next protective means. Two practical approaches for fault removal are software testing and software inspection, both of which have become standard industry practices in quality assurance. Directions in software testing techniques are addressed in [4] in detail.

When inherent faults remain undetected through the testing and inspection processes, they will stay with the software when it is released into the field. Fault tolerance is the last defending line in preventing faults from manifesting themselves as system failures. Fault tolerance is the survival attribute of software systems in terms of their ability to deliver continuous service to the customers. Software fault tolerance techniques enable software systems to (1) prevent dormant software faults from becoming active, such as defensive programming to check for input and output conditions and forbid illegal operations; (2) contain the manifested software errors within a confined boundary without further propagation, such as exception handling routines to treat unsuccessful operations; (3) recover software operations from erroneous conditions, such as checkpointing and rollback mechanisms; and (4) tolerate system-level faults methodically, such as employing design diversity in the software development.
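A minimal sketch of techniques (1) and (2) from the paragraph above, i.e. defensive input checking plus an exception-handling boundary that confines the error; the functions are generic examples written for this transcript, not from the paper:

    def safe_divide(numerator, denominator):
        """Defensive programming: reject illegal input before it becomes a failure."""
        if denominator == 0:
            raise ValueError("denominator must be non-zero")
        return numerator / denominator

    def handle_request(payload):
        """Exception handling as a containment boundary: a bad request is reported
        and absorbed here instead of propagating and taking the system down."""
        try:
            return {"ok": True, "result": safe_divide(payload["a"], payload["b"])}
        except (KeyError, TypeError, ValueError) as err:
            return {"ok": False, "error": str(err)}

    print(handle_request({"a": 10, "b": 4}))   # {'ok': True, 'result': 2.5}
    print(handle_request({"a": 10, "b": 0}))   # contained failure, no crash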

Finally, if software failures are destined to occur, it is critical to estimate and predict them. Fault/failure forecasting involves formulation of the fault/failure relationship, an understanding of the operational environment, the establishment of software reliability models, developing procedures and mechanisms for software reliability measurement, and analyzing and evaluating the measurement results. The ability to determine software reliability not only gives us guidance about software quality and when to stop testing, but also provides information for software maintenance needs. It can facilitate the validity of software warranty when reliability of software has been properly certified. The concept of scheduled maintenance with software rejuvenation techniques [46] can also be solidified.

The subjects of fault prevention and fault removal have been discussed thoroughly by other articles in this issue. We focus our discussion on issues related to techniques on fault tolerance and fault/failure forecasting.

2.2. Software reliability models and measurement

As a major task of fault/failure forecasting, software reliability modeling has attracted much research attention in estimation (measuring the current state) as well as prediction (assessing the future state) of the reliability of a software system. A software reliability model specifies the form of a random process that describes the behavior of software failures with respect to time. A historical review as well as an application perspective of software reliability models can be found in [7, 28]. There are three main reliability modeling approaches: the error seeding and tagging approach, the data domain approach, and the time domain approach, which is considered to be the most popular one. The basic principle of time domain software reliability modeling is to perform curve fitting of observed time-based failure data by a pre-specified model formula, such that the model can be parameterized with statistical techniques (such as the Least Square or Maximum Likelihood methods). The model can then provide estimation of existing reliability or prediction of future reliability by extrapolation techniques. Software reliability models usually make a number of common assumptions, as follows. (1) The operation environment where the reliability is to be measured is the same as the testing environment in which the reliability model has been parameterized. (2) Once a failure occurs, the fault which causes the failure is immediately removed. (3) The fault removal process will not introduce new faults. (4) The number of faults inherent in the software and the way these faults manifest themselves to cause failures follow, at least in a statistical sense, certain mathematical formulae. Since the number of faults (as well as the failure rate) of the software system reduces when the testing progresses, resulting in growth of reliability, these models are often called software reliability growth models (SRGMs).
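As one concrete instance of an exponential-class SRGM (the Goel-Okumoto model, used here purely as an illustration; the paper does not single out any particular model), the expected cumulative number of failures by time t and the corresponding failure intensity are

    \mu(t) = a\,(1 - e^{-bt}), \qquad \lambda(t) = \mu'(t) = a\,b\,e^{-bt}

and the reliability over the next interval of length x, after testing up to time t, is

    R(x \mid t) = \exp\{-[\mu(t+x) - \mu(t)]\} = \exp\{-a\,(e^{-bt} - e^{-b(t+x)})\}

where a is the expected total number of faults and b the per-fault detection rate; both are estimated from the observed failure data, for example by least squares or maximum likelihood, as described above.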

Since Jelinski and Moranda proposed the first SRGM [23] in 1972, numerous SRGMs have been proposed in the past 35 years, such as exponential failure time class models, Weibull and Gamma failure time class models, infinite failure category models, Bayesian models, and so on [28, 36, 50].

Page 53: Types of Software Testing

Unified modeling approaches have also been attempted [19]. As mentioned before, the major challenges of these models do not lie in their technical soundness, but in their validity and applicability in real-world projects.

Figure 1 shows an SRE framework in current practice [28]. First, a reliability objective is determined quantitatively from the customer's viewpoint to maximize customer satisfaction, and customer usage is defined by developing an operational profile. The software is then tested according to the operational profile, failure data collected, and reliability tracked during testing to determine the product release time. This activity may be repeated until a certain reliability level has been achieved. Reliability is also validated in the field to evaluate the reliability engineering efforts and to achieve future product and process improvements.

It can be seen from Figure 1 that there are four major components in this SRE process, namely (1) reliability objective, (2) operational profile, (3) reliability modeling and measurement, and (4) reliability validation. A reliability objective is the specification of the reliability goal of a product from the customer's viewpoint. If a reliability objective has been specified by the customer, that reliability objective should be used. Otherwise, we can select the reliability measure which is the most intuitive and easily understood, and then determine the customer's "tolerance threshold" for system failures in terms of this reliability measure.

The operational profile is a set of disjoint alternatives of system operational scenarios and their associated probabilities of occurrence. The construction of an operational profile encourages testers to select test cases according to the system's likely operational usage, which contributes to more accurate estimation of software reliability in the field.
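
A minimal sketch of this idea: an operational profile represented as operations with occurrence probabilities, from which test cases are drawn so that testing effort mirrors expected field usage. The operations and probabilities are hypothetical.

```python
# A minimal operational-profile sketch with hypothetical operations and probabilities.
import random

operational_profile = {
    "login": 0.35,
    "browse_catalog": 0.30,
    "place_order": 0.20,
    "update_account": 0.10,
    "generate_report": 0.05,
}

def draw_operations(profile, n, seed=42):
    """Sample n operations to test, weighted by their field-usage probabilities."""
    rng = random.Random(seed)
    ops = list(profile)
    weights = [profile[o] for o in ops]
    return rng.choices(ops, weights=weights, k=n)

print(draw_operations(operational_profile, 10))
```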

Reliability modeling is an essential element of the reliability estimation process. It determines whether a product meets its reliability objective and is ready for release. One or more reliability models are employed to calculate, from failure data collected during system testing, various estimates of a product's reliability as a function of test time. Several interdependent estimates can be obtained to make equivalent statements about a product's reliability. These reliability estimates can provide the following information, which is useful for product quality management: (1) The reliability of the product at the end of system testing. (2) The amount of (additional) test time required to reach the product's reliability objective. (3) The reliability growth as a result of testing (e.g., the ratio of the value of the failure intensity at the start of testing to the value at the end of testing). (4) The predicted reliability beyond the system testing, such as the product's reliability in the field.
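
For instance, once a growth model has been fitted, estimates (2) and (3) above can be computed directly from its parameters. The sketch below assumes Goel-Okumoto parameters and a hypothetical failure-intensity objective; all numbers are illustrative.

```python
# A minimal sketch, reusing assumed Goel-Okumoto parameters (a, b) estimated during
# testing, of two release-planning quantities: the reliability growth achieved so far
# and the additional test time needed to reach a failure-intensity objective.
import math

a, b = 48.9, 0.31          # assumed fitted model parameters
t_now = 10.0               # test time spent so far (weeks)
lambda_objective = 0.2     # target failure intensity (failures/week)

lambda_start = a * b * math.exp(-b * 0.0)      # intensity at the start of testing
lambda_now = a * b * math.exp(-b * t_now)      # current intensity
growth_ratio = lambda_start / lambda_now       # reliability growth due to testing

# Solve a*b*exp(-b*t) = lambda_objective for t, then subtract the time already spent.
t_objective = math.log(a * b / lambda_objective) / b
extra_test_time = max(0.0, t_objective - t_now)

print(f"reliability growth (start/now intensity ratio): {growth_ratio:.1f}x")
print(f"additional test time to reach objective: {extra_test_time:.1f} weeks")
```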

Despite the existence of a large number of models, the problem of model selection and application is manageable, as there are guidelines and statistical methods for selecting an appropriate model for each application. Furthermore, experience has shown that it is sufficient to consider only a dozen models, particularly when they are already implemented in software tools [28].

[Figure 1. Software Reliability Engineering Process Overview — flowchart: Determine Reliability Objective → Develop Operational Profile → Perform Software Testing (applying software reliability tools) → Collect Failure Data → Select Appropriate Software Reliability Models → Use Software Reliability Models to Calculate Current Reliability → Reliability Objective met? (No: Continue Testing; Yes: Start to Deploy) → Validate Reliability in the Field → Feedback to Next Release]

Using these statistical methods, "best" estimates of reliability are obtained during testing. These estimates are then used to project the reliability during field operation in order to determine whether the reliability objective has been met. This procedure is an iterative process, since more testing will be needed if the objective is not met. When the operational profile is not fully developed, the application of a test compression factor can assist in estimating field reliability. A test compression factor is defined as the ratio of execution time required in the operational phase to execution time required in the test phase to cover the input space of the program. Since testers during testing are quickly searching through the input space for both normal and difficult execution conditions, while users during operation only execute the software with a regular pace, this factor represents the reduction of failure rate (or increase in reliability) during operation with respect to that observed during testing.
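
A back-of-the-envelope application of this definition, with hypothetical numbers:

```python
# A minimal sketch: projecting field failure intensity from the failure intensity
# observed during testing, using a test compression factor C (all numbers hypothetical).
test_failure_intensity = 0.5   # failures per CPU-hour observed in system test
compression_factor = 12.0      # assumed ratio: operational time needed to cover the
                               # input space vs. test time needed to cover it
field_failure_intensity = test_failure_intensity / compression_factor
print(f"projected field failure intensity: {field_failure_intensity:.3f} failures/CPU-hour")
```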

Finally, the projected field reliability has to be validated by comparing it with the observed field reliability. This validation not only establishes benchmarks and confidence levels of the reliability estimates, but also provides feedback to the SRE process for continuous improvement and better parameter tuning. When feedback is provided, SRE process enhancement comes naturally: the model validity is established, the growth of reliability is determined, and the test compression factor is refined.

2.3. Software fault tolerance techniques and models

Fault tolerance, when applicable, is one of the major approaches to achieve highly reliable software. There are two different groups of fault tolerance techniques: single version and multi-version software techniques [29]. The former includes program modularity, system closure, atomicity of actions, error detection, exception handling, checkpoint and restart, process pairs, and data diversity [44]; while the latter, so-called design diversity, is employed where multiple software versions are developed independently by different program teams using different design methods, yet they provide equivalent services according to the same requirement specifications. The main techniques of this multiple version software approach are recovery blocks, N-version programming, N self-checking programming, and other variants based on these three fundamental techniques.
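
The sketch below illustrates two of the design-diversity schemes named above, N-version programming with majority voting and a recovery block guarded by an acceptance test; the "versions" are stand-in toy implementations, not real independently developed programs.

```python
# A minimal design-diversity sketch with hypothetical stand-in versions.
from collections import Counter

def n_version_vote(versions, x):
    """Run all versions on the same input and return the majority result."""
    results = [v(x) for v in versions]
    value, count = Counter(results).most_common(1)[0]
    if count < (len(versions) // 2) + 1:
        raise RuntimeError("no majority agreement among versions")
    return value

def recovery_block(primary, alternates, acceptance_test, x):
    """Try the primary first; on a failed acceptance test, fall back to alternates."""
    for candidate in [primary] + list(alternates):
        result = candidate(x)
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")

# Hypothetical diverse implementations of integer square root.
v1 = lambda x: int(x ** 0.5)
v2 = lambda x: next(i for i in range(x + 2) if (i + 1) * (i + 1) > x)
v3 = lambda x: int(round(x ** 0.5 - 1e-9))

print(n_version_vote([v1, v2, v3], 144))                                  # -> 12
print(recovery_block(v1, [v2], lambda x, r: r * r <= x < (r + 1) ** 2, 145))
```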

Reliability models attempt to estimate the probability of coincident failures in multiple versions. Eckhardt and Lee (1985) [15] proposed the first reliability model of fault correlation in design diversity, observing positive correlations between version failures under the assumption of variation of difficulty over the demand space. Littlewood and Miller (1989) [25] suggested that negative fault correlations may exist on the basis of forced design diversity. Dugan and Lyu (1995) [14] proposed a Markov reward model to compare system reliability achieved by various design diversity approaches, and Tomek and Trivedi (1995) [43] suggested a stochastic reward net model for software fault tolerance. Popov, Strigini et al. (2003) [37] estimated the upper and lower bounds for the failure probability of design diversity based on the subdomain concept on the demand space. A detailed summary of fault-tolerant software and its reliability modeling methods can be found in [29]. Experimental comparisons and evaluations of some of the models are listed in [10] and [11].
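
A small numerical illustration of the Eckhardt-Lee observation, using a hypothetical difficulty profile over the demand space: when some demands are harder for every version, the probability that two independently developed versions fail together exceeds what the independence assumption would predict.

```python
# Hypothetical demand profile: demand -> (usage probability, per-version failure probability).
demands = {
    "easy_1":    (0.40, 0.001),
    "easy_2":    (0.40, 0.002),
    "hard_1":    (0.15, 0.05),
    "very_hard": (0.05, 0.30),
}

p_single = sum(p * theta for p, theta in demands.values())       # one version fails
p_both = sum(p * theta ** 2 for p, theta in demands.values())    # both versions fail
p_independent = p_single ** 2                                    # naive independence estimate

print(f"P(one version fails)                = {p_single:.5f}")
print(f"P(both fail), difficulty-correlated = {p_both:.5f}")
print(f"P(both fail), independence assumed  = {p_independent:.7f}")
```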

3. Current trends and problems

The challenges in software reliability not only stem from the size, complexity, difficulty, and novelty of software applications in various domains, but also relate to the knowledge, training, experience and character of the software engineers involved. We address the current trends and problems from a number of software reliability engineering aspects.

3.1. Software reliability and system reliability

Although the nature of software faults is different from that of hardware faults, the theoretical foundation of software reliability comes from hardware reliability techniques. Previous work has focused on extending classical reliability theories from hardware to software, so that by employing familiar mathematical modeling schemes, we can establish a software reliability framework consistently from the same viewpoints as hardware. The advantages of such modeling approaches are: (1) The physical meaning of the failure mechanism can be properly interpreted, so that the effect of failures on reliability, as measured in the form of failure rates, can be directly applied to the reliability models. (2) The combination of hardware reliability and software reliability to form system reliability models and measures can be provided in a unified theory. Even though the actual mechanisms of the various causes of hardware faults and software faults may be different, a single formulation can be employed from the reliability modeling and statistical estimation viewpoints. (3) System reliability models inherently engage system structure and modular design in block diagrams. The resulting reliability modeling process is not only intuitive (how components contribute to the overall reliability can be visualized), but also informative (reliability-critical components can be quickly identified).

The major drawbacks, however, are also obvious. First of all, while hardware failures may occur independently (or approximately so), software failures do not happen independently. The interdependency of software failures is also very hard to describe in detail or to model precisely. Furthermore, similar hardware systems are developed from similar specifications, and hardware failures, usually caused by hardware defects, are repeatable and predictable. On the other hand, software systems are typically “one-of-a-kind.” Even similar software systems or different versions of the same software can be based on quite different specifications. Consequently, software failures, usually caused by human design faults, seldom repeat in exactly the same way or in any predictable pattern. Therefore, while failure mode and effect analysis (FMEA) and failure mode and effect criticality analysis (FMECA) have long been established for hardware systems, they are not very well understood for software systems.

3.2. Software reliability modeling

Among all software reliability models, SRGM is probably one of the most successful techniques in the literature, with more than 100 models existing in one form or another, through hundreds of publications. In practice, however, SRGMs encounter major challenges. First of all, software testers seldom follow the operational profile to test the software, so what is observed during software testing may not be directly extensible to operational use. Secondly, when the number of failures collected in a project is limited, it is hard to make statistically meaningful reliability predictions. Thirdly, some of the assumptions of SRGM are not realistic, e.g., that faults are independent of each other, that each fault within the same class has the same chance of being detected, and that the correction of a fault never introduces new faults [40]. Nevertheless, these setbacks can be overcome with suitable means. Given proper data collection processes that avoid drastic invalidation of the model assumptions, it is generally possible to obtain accurate estimates of reliability and to know that these estimates are accurate.

Although some historical SRGMs have been widely adopted to predict software reliability, researchers believe they can further improve the prediction accuracy of these models by adding other important factors which affect the final software quality [12, 31, 48]. Among others, code coverage is a metric commonly engaged by software testers, as it indicates how completely a test set executes a software system under test, therefore influencing the resulting reliability measure. To incorporate the effect of code coverage on reliability in the traditional software reliability models, [12] proposes a technique using both time and code coverage measurement for reliability prediction. It reduces the execution time by a parameterized factor when the test case neither increases code coverage nor causes a failure. These models, known as adjusted Non-Homogeneous Poisson Process (NHPP) models, have been shown empirically to achieve more accurate predictions than the original ones.
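
A minimal sketch of the adjustment idea (the discount factor is a hypothetical stand-in for the model's parameter): execution time is discounted whenever a test case neither increases coverage nor reveals a failure, before being fed to a time-based growth model.

```python
# A minimal sketch of coverage-adjusted effective test time (discount value is assumed).
def effective_test_time(test_log, discount=0.3):
    """test_log: list of (execution_time, coverage_increased, failure_observed)."""
    effective = 0.0
    for exec_time, cov_up, failed in test_log:
        if cov_up or failed:
            effective += exec_time               # fully counted: the test did useful work
        else:
            effective += discount * exec_time    # discounted by a parameterized factor
    return effective

log = [(1.0, True, False), (1.0, False, False), (1.0, False, True), (1.0, False, False)]
print(effective_test_time(log))   # 1.0 + 0.3 + 1.0 + 0.3 = 2.6 time units
```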

In the literature, several models have been proposed to determine the relationship between the number of failures/faults and the test coverage achieved, with various distributions. [48] suggests that this relation is a variant of the Rayleigh distribution, while [31] shows that it can be expressed as a logarithmic-exponential formula, based on the assumption that both fault coverage and test coverage follow the logarithmic NHPP growth model with respect to the execution time. More metrics can be incorporated to further explore this new modeling avenue.

Although there are a number of successful SRE models, they are typically measurement-based models which are employed in isolation at the later stages of the software development process. Early software reliability prediction models are often too insubstantial, seldom executable, insufficiently formal to be analyzable, and typically not linked to the target system. Their impact on the resulting reliability is therefore modest. There is currently a need for a credible end-to-end software reliability model that can be directly linked to reliability prediction from the very beginning, so as to establish a systematic SRE procedure that can be certified, generalized and refined.

3.3. Metrics and measurements

Metrics and measurements have been an important part of the software development process, not only for software project budget planning but also for software quality assurance purposes. As software complexity and software quality are highly related to software reliability, the measurements of software complexity and quality attributes have been explored for early prediction of software reliability [39]. Static as well as dynamic program complexity measurements have been collected, such as lines of code, number of operators, relative program complexity, functional complexity, operational complexity, and so on. The complexity metrics can be further included in software reliability models for early reliability prediction, for example, to predict the initial software fault density and failure rate.

In SRGM, the two measurements related to reliability are: 1) the number of failures in a time period; and 2) the time between failures. An important advancement of SRGM concerns the notion of “time” during which failure data are recorded. CPU time has been demonstrated to be more suitable and more accurate than calendar time for recording failures, since it faithfully represents the actual execution time of the software [35]. More recently, other forms of metrics for testing effort have been incorporated into software reliability modeling to improve prediction accuracy [8, 18].

One key problem with software metrics and measurements is that they are not consistently defined and interpreted, again due to the lack of physical attributes of software. The achieved reliability measures may differ for different applications, yielding inconclusive results. A unified ontology to identify, describe, incorporate and understand reliability-related software metrics is therefore urgently needed.

3.4. Data collection and analysis

The software engineering process is described sardonically as a garbage-in/garbage-out process. That is to say, the accuracy of its output is bounded by the precision of its input. Data collection, consequently, plays a crucial role for the success of software reliability measurement.

There is an apparent trade-off between the data collection and the analysis effort. The more accuracy is required for analysis, the more effort is required for data collection. Fault-based data are usually easier to collect due to their static nature. Configuration management tools for source code maintenance can help to collect these data, as developers are required to check in and check out updated versions of code for fault removal. Failure-based data, on the other hand, are much harder to collect and usually require additional effort, for the following reasons. First, the dynamic operating condition where the failures occur may be hard to identify or describe. Moreover, the time when the failures occur must be recorded manually, after the failures are manifested. Calendar time data can be coarsely recorded, but they lack accuracy for modeling purposes. CPU time data, on the other hand, are very difficult to collect, particularly for distributed systems and networking environments where multiple CPUs execute software in parallel. Certain forms of approximation can ease the burden of data collection, but the accuracy of the data is consequently reduced. It is also worth noting that while manual data collection can be very labor intensive, automatic data collection, even where it is unavoidable, may be too intrusive (e.g., online collection of data can cause interruption to the system under test).

The amounts and types of data to be collected for reliability analysis purposes vary between organizations. Consequently, the experiences and lessons so gained may only be shared within the same company culture or at a high level of abstraction between organizations. To overcome this disadvantage, systematic failure data analysis for SRE purposes should be conducted.

Given field failure data collected from a real system, the analysis consists of five steps: 1) preprocessing of data, 2) analysis of data, 3) model structure identification and parameter estimation, 4) model solution, if necessary, and 5) analysis of models. In Step 1, the necessary information is extracted from the field data. The processing in this step requires detailed understanding of the target software and operational conditions. The actual processing required depends on the type of data. For example, the information in human-generated reports is usually not completely formatted. Therefore, this step involves understanding the situations described in the reports and organizing the relevant information into a problem database. In contrast, the information in automatically generated event logs is already formatted. Data processing of event logs consists of extracting error events and coalescing related error events.
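
As an illustration of Step 1 for automatically generated event logs, the sketch below extracts error events and coalesces bursts that occur within a short time window into single error records; the field names and the window length are assumptions.

```python
# A minimal event-coalescing sketch (field names and 60-second window are assumed).
def coalesce_errors(events, window_seconds=60):
    """events: list of dicts with 'timestamp' (seconds), 'severity', 'component'."""
    errors = sorted((e for e in events if e["severity"] == "ERROR"),
                    key=lambda e: e["timestamp"])
    groups = []
    for e in errors:
        if groups and e["timestamp"] - groups[-1][-1]["timestamp"] <= window_seconds:
            groups[-1].append(e)      # same burst: likely one underlying error
        else:
            groups.append([e])        # start a new error record
    return groups

log = [
    {"timestamp": 10,  "severity": "ERROR", "component": "db"},
    {"timestamp": 25,  "severity": "ERROR", "component": "db"},
    {"timestamp": 30,  "severity": "INFO",  "component": "ui"},
    {"timestamp": 500, "severity": "ERROR", "component": "net"},
]
print(len(coalesce_errors(log)))   # 2 coalesced error records
```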

In Step 2, the data are interpreted. Typically, this step begins with a list of measures to evaluate. However, new issues that have a major impact on software reliability can also be identified during this step. The results from Step 2 are reliability characteristics of operational software in actual environments and issues that must be addressed to improve software reliability. These include fault and error classification, error propagation, error and failure distribution, software failure dependency, hardware-related software errors, evaluation of software fault tolerance, error recurrence, and diagnosis of recurrences.

In Step 3, appropriate models (such as Markov models) are identified based on the findings from Step 2. We identify model structures and realistic ranges of parameters. The identified models are abstractions of the software reliability behavior in real environments. Statistical analysis packages and measurement-based reliability analysis tools are useful at this stage.

Step 4 involves either using known techniques or developing new ones to solve the model. Model solution allows us to obtain measures, such as reliability, availability, and performability. The results obtained from the model must be validated against real data. Reliability and performance modeling and evaluation tools such as SHARPE [45] can be used in this step.
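
A minimal example of Step 4 in this spirit: a two-state Markov model with constant failure and repair rates, solved in closed form for steady-state availability. The rates are hypothetical; tools such as SHARPE automate this for far larger models.

```python
# A minimal two-state Markov availability model with assumed rates.
failure_rate = 1.0 / 500.0    # lambda: failures per hour (assumed MTTF = 500 h)
repair_rate = 1.0 / 2.0       # mu: repairs per hour (assumed MTTR = 2 h)

# Steady-state availability of the up/down Markov chain: A = mu / (lambda + mu).
availability = repair_rate / (failure_rate + repair_rate)
print(f"steady-state availability: {availability:.5f}")
print(f"expected downtime per year: {(1 - availability) * 365 * 24:.1f} hours")
```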

In Step 5, “what if” questions are addressed, using the identified models. Model factors are varied and the resulting effects on software reliability are evaluated. Reliability bottlenecks are determined and the effects of design changes on software reliability are predicted. Research work currently addressed in this area includes software reliability modeling in the operational phase, the modeling of the impact of software failures on performance, detailed error and recovery processes, and software error bursts. The knowledge and experience gained through such analysis can be used to plan additional studies and to develop the measurement techniques.

3.5. Methods and tools

In addition to software reliability growth modeling, many other methods are available for SRE. We provide a few examples of these methods and tools.

Fault trees provide a graphical and logical framework for a systematic analysis of system failure modes. Software reliability engineers can use them to assess the overall impact of software failures on a system, or to prove that certain failure modes will not occur. For failure modes that can occur, the occurrence probability can also be assessed. Fault tree models therefore provide an informative modeling framework that can be engaged to compare different design alternatives or system architectures with respect to reliability. In particular, they have been applied to both fault-tolerant and fault-intolerant (i.e., non-redundant) systems. Since this technique originates from hardware systems and has been extended to software systems, it can be employed to provide a unified modeling scheme for hardware/software co-design. Reliability modeling for hardware-software interactions is currently an area of intensive research [42].
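
A minimal sketch of fault tree evaluation, with a hypothetical tree and assumed independent basic-event probabilities:

```python
# A minimal fault tree evaluation: AND/OR gates over independent basic events.
def gate_or(*probs):
    """P(at least one input event occurs), assuming independence."""
    p = 1.0
    for q in probs:
        p *= (1.0 - q)
    return 1.0 - p

def gate_and(*probs):
    """P(all input events occur), assuming independence."""
    p = 1.0
    for q in probs:
        p *= q
    return p

# Hypothetical tree: the service fails if the primary AND the backup replica both fail,
# OR if the shared database fails.
p_primary, p_backup, p_database = 0.01, 0.02, 0.001
p_system_failure = gate_or(gate_and(p_primary, p_backup), p_database)
print(f"top-event probability: {p_system_failure:.6f}")
```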

In addition, simulation techniques can be provided for SRE purposes. They can produce observables of interest in reliability engineering, including discrete integer-valued quantities that occur as time progresses. One simulation approach produces artifacts in an actual software environment according to factors and influences believed to typify these entities within a given context [47]. The artifacts and environment are allowed to interact naturally, whereupon the flow of occurrences of activities and events is observed. This artifact-based simulation allows experiments to be set up to examine the nature of the relationships between software failures and other software metrics, such as program structure, programming error characteristics, and test strategies. It is suggested that the extent to which reliability depends merely on these factors can be measured by generating random programs having the given characteristics, and then observing their failure statistics.

Another reliability simulation approach [28] produces time-line imitations of reliability-related activities and events. Reliability measures of interest to the software process are modeled parametrically over time. The key to this approach is a rate-based architecture, in which phenomena occur naturally over time as controlled by their frequencies of occurrence, which depend on driving software metrics such as number of faults so far exposed or yet remaining, failure criticality, workforce level, test intensity, and software execution time. Rate-based event simulation is an example of a form of modeling called system dynamics, whose distinctive feature is that the observables are discrete events randomly occurring in time. Since many software reliability growth models are also based on rate (in terms of software hazard), the underlying processes assumed by these models are fundamentally the same as the rate-based reliability simulation. In general, simulations enable investigations of questions too difficult to be answered analytically, and are therefore more flexible and more powerful.
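
A minimal rate-based sketch in this spirit: failure events are generated by a rate proportional to the number of faults still remaining, so the simulated failure times exhibit reliability growth; all parameters are hypothetical.

```python
# A minimal rate-based event simulation (parameters are hypothetical).
import random

def simulate_failure_times(initial_faults=50, per_fault_rate=0.002, horizon=2000.0, seed=1):
    """Return simulated failure times; each failure removes one fault, lowering the rate."""
    rng = random.Random(seed)
    t, faults, failure_times = 0.0, initial_faults, []
    while faults > 0:
        rate = faults * per_fault_rate            # total hazard rate at this moment
        t += rng.expovariate(rate)                # time to the next failure event
        if t > horizon:
            break
        failure_times.append(t)
        faults -= 1                               # fault removed after the failure
    return failure_times

times = simulate_failure_times()
print(f"{len(times)} failures simulated; first at {times[0]:.1f}, last at {times[-1]:.1f}")
```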

Various SRE measurement tools have been developed for data collection, reliability analysis, parameter estimation, model application and reliability simulation. Any major improvement in SRE is likely to focus on such tools. We need to provide tools and environments which can assist software developers to build reliable software for different applications. The selection of tools, environments, and techniques to be engaged should reflect proper employment of the best current SRE practices.

3.6. Testing effectiveness and code coverage

As a typical mechanism for fault removal in software reliability engineering, software testing has been widely practiced in industry for quality assurance and reliability improvement. Effective testing is defined as the uncovering of most, if not all, detectable faults. As the total number of inherent faults is not known, testing effectiveness is usually represented by a measurable testing index. Code coverage, as an indicator of how thoroughly the software has been stressed, has been proposed and is widely employed to represent fault coverage.
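
As a toy illustration of the coverage idea, the sketch below records which source lines of a function execute under a test set using Python's trace hook; real projects would use a dedicated tool such as coverage.py, but the principle, adding tests until unexecuted code shrinks, is the same.

```python
# A minimal line-coverage tracker built on sys.settrace (illustrative only).
import sys

def executed_lines(func, inputs):
    """Record which source lines of `func` execute when it is run on `inputs`."""
    hit, code = set(), func.__code__

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is code:
            hit.add(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        for x in inputs:
            func(x)
    finally:
        sys.settrace(None)
    return hit

def classify(x):            # toy function under test
    if x < 0:
        return "negative"
    return "non-negative"

print(len(executed_lines(classify, [5])))        # fewer distinct lines covered
print(len(executed_lines(classify, [5, -3])))    # the extra test covers the other branch
```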


Positive findings:
- Horgan (1994) [17], Frankl (1988) [16], Rapps (1988) [38]: High code coverage brings high software reliability and a low fault rate.
- Chen (1992) [13]: A correlation between code coverage and software reliability was observed.
- Wong (1994): The correlation between test effectiveness and block coverage is higher than that between test effectiveness and the size of the test set.
- Frate (1995): An increase in reliability comes with an increase in at least one code coverage measure, and a decrease in reliability is accompanied by a decrease in at least one code coverage measure.
- Cai (2005) [8]: Code coverage contributes to a noticeable amount of fault coverage.

Negative findings:
- Briand (2000) [6]: The testing result for published data did not support a causal dependency between code coverage and fault coverage.

Table 1. Comparison of Investigations on the Relation of Code Coverage to Fault Coverage

Despite the observations of a correlation between code coverage and fault coverage, a question is raised: Can this phenomenon of concurrent growth be attributed to a causal dependency between code coverage and fault detection, or is it just coincidental due to the cumulative nature of both measures? In one investigation of this question, an experiment involving Monte Carlo simulation was conducted on the assumption that there is no causal dependency between code coverage and fault detection [6]. The testing result for published data did not support a causal dependency between code coverage and defect coverage.
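
The sketch below mimics such a simulation under the no-causality assumption (all parameters hypothetical): faults are detected purely at random while coverage accumulates independently, yet the two cumulative curves still grow together.

```python
# A minimal Monte Carlo sketch: coverage and fault detection grow together even with
# no causal link between them, because both are cumulative. Parameters are hypothetical.
import random

def simulate_run(n_tests=200, n_faults=30, p_detect=0.02, cov_gain=0.01, seed=0):
    rng = random.Random(seed)
    coverage, found = 0.0, 0
    cum_coverage, cum_faults = [], []
    for _ in range(n_tests):
        coverage = min(1.0, coverage + rng.uniform(0, cov_gain))   # coverage only accumulates
        for _ in range(n_faults - found):
            if rng.random() < p_detect:                            # detection independent of coverage
                found += 1
        cum_coverage.append(coverage)
        cum_faults.append(found)
    return cum_coverage, cum_faults

cov, faults = simulate_run()
print(f"final coverage {cov[-1]:.2f}, faults found {faults[-1]} of 30")
# Both curves are monotone non-decreasing, so they correlate even without causality.
```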

Nevertheless, many researchers consider coverage as a faithful indicator of the effectiveness of software testing results. A comparison among various studies on the impact of code coverage on software reliability is shown in Table 1.

3.7. Testing and operational profiles

The operational profile is a quantitative characterization of how a system will be used in the field by customers. It helps to schedule test activities, generate test cases, and select test runs. By allocating development and test resources to functions on the basis of how they are used, software reliability engineering can thus be planned with productivity and economics considerations in mind.

Using an operational profile to guide system testing ensures that if testing is terminated and the software is shipped because of imperative schedule constraints, the most-used operations will have received the most testing, and the reliability level will be the maximum that is practically achievable for the given test time. Also, in guiding regression testing, the profile tends to find, among the faults introduced by changes, the ones that have the most effect on reliability. Examples of the benefits of applying operational profiles can be found in a number of industrial projects [34].

Although significant improvement can be achieved by employing operational profiles in regression or system testing, challenges still exist for this technique. First of all, the operational profiles for some applications are hard to develop, especially for some distributed software systems, e.g., Web services. Moreover, unlike those of hardware, the operational profiles of software cannot be duplicated in order to speed up the testing, because the failure behavior of software depends greatly on its input sequence and internal status. While different software units can be tested at the same time in unit testing, this parallel approach is not applicable in system testing or regression testing. As a result, learning to deal with improper operational profiles and with the dependencies within the operational profile are the two major problems in operational profile techniques.

3.8. Industry practice and concerns

Although some success stories have been reported, there is a lack of wide industry adoption for software reliability engineering across various applications. Software practitioners often see reliability as a cost rather than a value, an investment rather than a return. Often the reliability attribute of a product takes less priority than its functionality or innovation. When product delivery schedule is tight, reliability is often the first element to be squeezed.

The main reason for the lack of industry enthusiasm in SRE is that its cost-effectiveness is not clear. Current SRE techniques incur visible overhead but yield invisible benefits. In contrast, a company's target is to have visible benefit but invisible overhead. The former requires some demonstration in the form of successful projects, while the latter involves avoidance of labor-intensive tasks. Many companies, voluntarily or under compulsion from their quality control policy, collect failure data and make reliability measurements. They are not willing to spend much effort on data collection, let alone data sharing. Consequently, reliability results cannot be compared or benchmarked, and the experiences are hard to accumulate. Most software practitioners only employ some straightforward methods and metrics for their product reliability control. For example, they may use some general guidelines for quality metrics, such as fault density, lines of code, or development or testing time, and compare current projects with previous ones.

As the competitive advantage of product reliability is less obvious than that of other product quality attributes (such as performance or usability), few practitioners are willing to try out emerging techniques on SRE. The fact that there are so many software reliability models to choose from also intimidates practitioners. So instead of investigating which models are suitable for their environments or which model selection criteria can be applied, practitioners tend to simply take reliability measurements casually, and they are often suspicious about the reliability numbers obtained by the models. Many software projects claim to set reliability objectives such as five 9's or six 9's (meaning 0.99999 to 0.999999 availability, or 10^-5 to 10^-6 failures per execution hour), but few can validate their reliability achievement.

Two major successful hardware reliability engineering techniques, reliability prediction by system architecture block diagrams and FME(C)A, still cannot be directly applied to software reliability engineering. This, as explained earlier, is due to the intricate software dependencies within and between software components (and sub-systems). If software components can be decoupled, or their dependencies can be clearly identified and properly modeled, then these popular techniques in hardware may be applicable to software, whereupon wide industry adoption may occur. We elaborate this in the following section.

3.9. Software architecture

Systematic examination of software architectures for a better way to support software development has been an active research direction in the past 10 years, and it will continue to be center stage in the coming decade [41]. Software architectural design not only impacts software development activities, but also affects SRE efforts. Software architecture should be enhanced to decrease the dependency of different software pieces that run on the same computer or platform so that their reliability does not interact. Fault isolation is a major design consideration for software architecture. Good software architecture should enjoy the property that exceptions are raised when faults occur, and module failures are properly confined without causing system failures. In particular, this type of component-based software development approach requires a different framework, quality assurance paradigm [9], and reliability modeling [51] from those in traditional software development.

A recent trend in software architecture is that as information engineering is becoming the central focus for today’s businesses, service-oriented systems and the associated software engineering will be the de facto standards for business development. Service orientation requires seamless integration of heterogeneous components and their interoperability for proper service creation and delivery. In a service-oriented framework, new paradigms for system organizations and software architectures are needed for ensuring adequate decoupling of components, swift discovery of applications, and reliable delivery of services. Such emerging software architectures include cross-platform techniques [5], open-world software [3], service-oriented architectures [32], and Web applications [22]. Although some modeling approaches have been proposed to estimate the reliability for specific Web systems [49], SRE techniques for general Web services and other service-oriented architectures require more research work.

4. Possible future directions

SRE activities span the whole software lifecycle. We discuss possible future directions with respect to five areas: software architecture, design, testing, metrics and emerging applications.

4.1. Reliability for software architectures and off-the-shelf components

Due to the ever-increasing complexity of software systems, modern software is seldom built from scratch. Instead, reusable components have been developed and employed, formally or informally. On the one hand, revolutionary and evolutionary object-oriented design and programming paradigms have vigorously pushed software reuse. On the other hand, reusable software libraries have been a deciding factor regarding whether a software development environment or methodology would be popular or not. In the light of this shift, reliability engineering for software development is focusing on two major aspects: software architecture, and component-based software engineering.

The software architecture of a system consists of software components, their external properties, and their relationships with one another. As software architecture is the foundation of the final software product, the design and management of software architecture is becoming the dominant factor in software reliability engineering research. Well-designed software architecture not only provides a strong, reliable basis for the subsequent software development and maintenance phases, but also offers various options for fault avoidance and fault tolerance in achieving high reliability. Due to the cardinal importance of, and complexity involved in, software architecture design and modeling, being a good software architect is a rare talent that is highly demanded. A good software architect sees widely and thinks deeply, as the components should eventually fit together in the overall framework, and the anticipation of change has to be considered in the architecture design. A clean, carefully laid out architecture requires up-front investments in various design considerations, including high cohesion, low coupling, separation of modules, proper system closure, concise interfaces, avoidance of complexity, etc. These investments, however, are worthwhile since they eventually help to increase software reliability and reduce operation and maintenance costs.

One central research issue for software architecture concerning reliability is the design of failure-resilient architecture. This requires an effective software architecture design which can guarantee separation of components when software executes. When component failures occur in the system, they can then be quickly identified and properly contained. Various techniques can be explored in such a design. For example, memory protection prevents interference and failure propagation between different application processes. Guaranteed separation between applications has been a major requirement for the integration of multiple software services in complicated modern systems. It should be noted that the separation methods can support one another, and usually they are combined to achieve better reliability returns. Exploiting this synergy for reliability assessment is a possibility for further exploration.

In designing failure-resilient architecture, additional resources and techniques are often engaged. For example, error handling mechanisms for fault detection, diagnosis, isolation, and recovery procedures are incorporated to tolerate component failures; however, these mechanisms will themselves have some impact on the system. Software architecture has to take this impact into consideration. On the one hand, the added reliability-enhancement routines should not introduce unnecessary complexity, making them error-prone, which would decrease the reliability instead of increasing it. On the other hand, these routines should be made unintrusive while they monitor the system, and they should not further jeopardize the system while they are carrying out recovery functions. Designing concise, simple, yet effective mechanisms to perform fault detection and recovery within a general framework is an active research topic for researchers.

While software architecture represents the product view of software systems, component-based software engineering addresses the process view of software engineering. In this popular software development technique, many research issues are identified, such as the following. How can reliable general reusable components be identified and designed? How can existing components be modified for reusability? How can a clean interface design be provided for components so that their interactions are fully under control? How can defensive mechanisms be provided for the components so that they are protected from others, and will not cause major failures? How can it be determined whether a component is risk-free? How can the reliability of a component be assessed under untested yet foreseeable operational conditions? How can the interactions of components be modeled if they cannot be assumed independent? Component-based software engineering allows structure-based reliability to be realized, which facilitates design for reliability before the software is implemented and tested. The dependencies among components will thus need to be properly captured and modeled first.

These methods favor reliability engineering in multiple ways. First of all, they directly increase reliability by reducing the frequency and severity of failures. Run-time protections may also detect faults before they cause serious failures. After failures, they make fault diagnosis easier, and thus accelerate reliability improvements. For reliability assessment, these failure prevention methods reduce the uncertainties of application interdependencies or unexpected environments. So, for instance, having sufficient separation between running applications ensures that when we port an application to a new platform, we can trust its failure rate to equal that experienced in a similar use on a previous platform plus that of the new platform, rather than being also affected by the specific combination of other applications present on the new platform. Structure-based reliability models can then be employed with this system aspect in place. With this modeling framework assisted by well-engineered software architecture, the range of applicability of structure-based models can further be increased. Examples of new applications could be to specify and investigate failure dependence between components, to cope with wide variations of reliability depending on the usage environment, and to assess the impact of system risk when components are checked-in or checked-out of the system.
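
A minimal structure-based sketch (all figures hypothetical): system reliability computed from per-component reliabilities and expected execution counts, under the independence assumption that the dependency modeling discussed above is meant to relax.

```python
# A minimal structure-based reliability sketch with hypothetical components.
components = {
    # component: (reliability per execution, expected executions per system run)
    "parser":    (0.9995, 1),
    "core":      (0.9990, 4),
    "storage":   (0.9998, 2),
    "reporting": (0.9999, 1),
}

system_reliability = 1.0
for name, (r, visits) in components.items():
    system_reliability *= r ** visits    # assumes independent component failures

print(f"estimated reliability of one system run: {system_reliability:.4f}")
```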

4.2. Achieving design for reliability

To achieve reliable system design, fault tolerance mechanisms need to be in place. A typical response to system or software faults during operation includes a sequence of stages: Fault confinement, Fault detection, Diagnosis, Reconfiguration, Recovery, Restart, Repair, and Reintegration. Modern software systems pose challenging research issues in these stages, which are described as follows:

1. Fault confinement. This stage limits the spread of fault effects to one area of the system, thus preventing contamination of other areas. Fault-confinement can be achieved through use of self-checking acceptance tests, exception handling routines, consistency checking mechanisms, and multiple requests/confirmations. As the erroneous system behaviours due to software faults are typically unpredictable, reduction of dependencies is the key to successful confinement of software faults. This has been an open problem for software reliability engineering, and will remain a tough research challenge.

2. Fault detection. This stage recognizes that something unexpected has occurred in the system. Fault latency is the period of time between the occurrence of a software fault and its detection. The shorter it is, the better the system can recover. Techniques fall in two classes: off-line and on-line. Off-line techniques such as diagnostic programs can offer comprehensive fault detection, but the system cannot perform useful work while under test. On-line techniques, such as watchdog monitors or redundancy schemes, provide a real-time detection capability that is performed concurrently with useful work.

3. Diagnosis. This stage is necessary if the fault detection technique does not provide information about the failure location and/or properties. On-line, failure-prevention diagnosis is the research trend. When the diagnosis indicates unhealthy conditions in the system (such as low available system resources), software rejuvenation can be performed to achieve in-time transient failure prevention.

4. Reconfiguration. This stage occurs when a fault is detected and a permanent failure is located. The system may reconfigure its components either to replace the failed component or to isolate it from the rest of the system. Successful reconfiguration requires robust and flexible software architecture and the associated reconfiguration schemes.

5. Recovery. This stage utilizes techniques to eliminate the effects of faults. Basic recovery approaches are based on fault masking, retry, and rollback. Fault-masking techniques hide the effects of failures by allowing redundant, correct information to outweigh the incorrect information. To handle design (permanent) faults, N-version programming can be employed. Retry, on the other hand, attempts a second try at an operation and is based on the premise that many faults are transient in nature. A recovery blocks approach is engaged to recover from software design faults in this case. Rollback makes use of the system operation having been backed up (checkpointed) to some point in its processing prior to fault detection, and operation recommences from this point (a minimal checkpoint-and-rollback sketch follows this list). Fault latency is important here because the rollback must go back far enough to avoid the effects of undetected errors that occurred before the detected error. The effectiveness of design diversity as represented by N-version programming and recovery blocks, however, continues to be actively debated.

6. Restart. This stage occurs after the recovery of undamaged information. Depending on the way the system is configured, hot restart, warm restart, or cold restart can be achieved. In hot restart, resumption of all operations from the point of fault detection can be attempted, and this is possible only if no damage has occurred. In warm restart, only some of the processes can be resumed without loss; while in cold restart, complete reload of the system is performed with no processes surviving.

7. Repair. In this stage, a failed component is replaced. Repair can be off-line or on-line. In off-line repair, if proper component isolation can be achieved, the system can continue operating while the failed component is removed for repair. Otherwise, the system must be brought down to perform the repair, and so system availability and reliability depend on how fast a fault can be located and removed. In on-line repair the component may be replaced immediately with a backup spare (in a procedure equivalent to reconfiguration), or operation may continue without the faulty component (for example, through masking redundancy or graceful degradation). With on-line repair, system operation is not interrupted; however, achieving complete and seamless repair poses a major challenge to researchers.

8. Reintegration. In this stage the repaired module must be reintegrated into the system. For on-line repair, reintegration must be performed without interrupting system operation.
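
The checkpoint-and-rollback sketch referenced in stage 5 follows; the workload, checkpoint interval, and injected transient faults are all hypothetical.

```python
# A minimal checkpoint-and-rollback sketch: state is periodically checkpointed, and a
# detected fault causes a rollback to the last checkpoint before recommencing.
import copy, random

def run_with_rollback(n_steps=20, checkpoint_every=5, fault_prob=0.1, seed=7):
    rng = random.Random(seed)
    state = {"step": 0, "total": 0}
    checkpoint = copy.deepcopy(state)
    while state["step"] < n_steps:
        if state["step"] % checkpoint_every == 0:
            checkpoint = copy.deepcopy(state)          # save a consistent snapshot
        if rng.random() < fault_prob:                  # transient fault detected
            state = copy.deepcopy(checkpoint)          # roll back to the last good state
            continue                                   # recommence from the checkpoint
        state["step"] += 1
        state["total"] += state["step"]
    return state

print(run_with_rollback())   # completes despite the injected transient faults
```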

Design for reliability techniques can further be pursued in four different areas: fault avoidance, fault detection, masking redundancy, and dynamic redundancy. Non-redundant systems are fault intolerant and, to achieve reliability, generally use fault avoidance techniques. Redundant systems typically use fault detection, masking redundancy, and dynamic redundancy to automate one or more of the stages of fault handling. The main design consideration for software fault tolerance is cost-effectiveness. The resulting design has to be effective in providing better reliability, yet it should not introduce excessive cost, including performance penalties and unwarranted complexity, which may eventually render the investment unworthwhile.

4.3. Testing for reliability assessment

Software testing and software reliability have traditionally belonged to two separate communities. Software testers test software without referring to how software will operate in the field, as often the environment cannot be fully represented in the laboratory. Consequently they design test cases for exceptional and boundary conditions, and they spend more time trying to break the software than conducting normal operations. Software reliability measurers, on the other hand, insist that software should be tested according to its operational profile in order to allow accurate reliability estimation and prediction. In the future, it will be important to bring the two groups together, so that on the one hand, software testing can be effectively conducted, while on the other hand, software reliability can be accurately measured. One approach is to measure the test compression factor, which is defined as the ratio between the mean time between failures during operation and during testing. This factor can be empirically determined so that software reliability in the field can be predicted from that estimated during testing. Another approach is to ascertain how other testing related factors can be incorporated into software reliability modeling, so that accurate measures can be obtained based on the effectiveness of testing efforts.

Recent studies have investigated the effect of code coverage on fault detection under different testing profiles, using different coverage metrics, and have studied its application in reducing test set size [30]. Experimental data are required to evaluate code coverage and determine whether it is a trustworthy indicator for the effectiveness of a test set with respect to fault detection capability. Also, the effect of code coverage on fault detection may vary under different testing profiles. The correlation between code coverage and fault coverage should be examined across different testing schemes, including function testing, random testing, normal testing, and exception testing. In other words, white box testing and black box testing should be cross-checked for their effectiveness in exposing faults, and thus yielding reliability increase.

Furthermore, evidence for variation between different coverage metrics can also be established. Some metrics may be independent and some correlated. The quantitative relationship between different code coverage metrics and fault detection capability should be assessed, so that redundant metrics can be removed and orthogonal ones can be combined. New findings about the effect of code coverage and other metrics on fault detection can be used to guide the selection and evaluation of test cases under various testing profiles, and a systematic testing scheme with predictable reliability achievement can therefore be derived.

Reducing test set size is a key goal in software testing. Different testing metrics should be evaluated regarding whether they are good filters in reducing the test set size while maintaining the same effectiveness in achieving reliability. This assessment should be conducted under various testing scenarios [8]. If such a filtering capability can be established, then the effectiveness of test cases can be quantitatively determined when they are designed. This would allow the prediction of reliability growth from the creation of a test set before it is executed on the software, thus facilitating early reliability prediction and possible feedback control for better test set design schemes.
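
A minimal sketch of such a filter: a greedy set-cover heuristic that keeps only tests contributing new coverage, preserving the coverage achieved by the full suite. Test names and covered entities are hypothetical.

```python
# A minimal coverage-based test set reduction using a greedy set-cover heuristic.
def reduce_test_set(coverage_map):
    """coverage_map: test name -> set of covered entities (lines, branches, ...)."""
    remaining = set().union(*coverage_map.values())
    selected = []
    while remaining:
        # Pick the test that covers the most still-uncovered entities.
        best = max(coverage_map, key=lambda t: len(coverage_map[t] & remaining))
        gained = coverage_map[best] & remaining
        if not gained:
            break
        selected.append(best)
        remaining -= gained
    return selected

suite = {
    "t1": {"b1", "b2", "b3"},
    "t2": {"b2", "b3"},
    "t3": {"b4"},
    "t4": {"b1", "b4"},
}
print(reduce_test_set(suite))   # a smaller suite covering everything the 4 tests cover
```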

Other than linking software testing and reliability with code coverage, statistical learning techniques may offer another promising avenue to explore. In particular, statistical debugging approaches [26, 52], whose original purpose was to identify software faults with probabilistic modeling of program predicates, can provide a fine quantitative assessment of program codes with respect to software faults. They can therefore help to establish accurate software reliability prediction models based on program structures under testing.

4.4. Metrics for reliability prediction

Today it is almost a mandate for companies to collect software metrics as an indication of a maturing software development process. While it is not hard to collect metrics data, it is not easy to collect clean and consistent data. It is even more difficult to derive meaningful results from the collected metrics data. Collecting metrics data for software reliability prediction purposes across various projects and applications is a major challenge. Moreover, industrial software engineering data, particularly those related to system failures, are historically hard to obtain across a range of organizations. It will be important for a variety of sources (such as NASA, Microsoft, IBM, Cisco, etc.) across industry and academia to make available real-failure data for joint investigation to establish credible reliability analysis procedures. Such a joint effort should define (1) what data to collect by considering domain sensitivities, accessibility, privacy, and utility; (2) how to collect data in terms of tools and techniques; and (3) how to interpret and analyze the data using existing techniques.

In addition to industrial data collection efforts, novel methods to improve reliability prediction are actively being researched. For example, by extracting rich information from metrics data using a sound statistical and probability foundation, Bayesian Belief Networks (BBNs) offer a promising direction for investigation in software engineering [7]. BBNs provide an attractive formalism for different software cases. The technique allows software engineers to describe prior knowledge about software development quality and software verification and validation (SV&V) quality, with manageable visual descriptions and automated inferences. The software reliability process can then be modified with inference from observed failures, and future reliability can be predicted. With proper engagement of software metrics, this is likely to be a powerful tool for reliability assessment of software based systems, finding applications in predicting software defects, forecasting software reliability, and determining runaway projects [1].
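
A hand-rolled miniature of the BBN idea (all probabilities hypothetical): prior beliefs about process and V&V quality are combined with conditional probability tables and updated by enumeration when testing evidence arrives. Real studies would use dedicated BBN tooling rather than code like this.

```python
# A minimal Bayesian-network-style sketch with assumed conditional probability tables.
from itertools import product

p_good_process = 0.7                      # P(development process is good)
p_good_vv = 0.6                           # P(verification & validation is good)
# P(low residual defects | process quality, V&V quality)
p_low_defects = {(True, True): 0.9, (True, False): 0.6,
                 (False, True): 0.5, (False, False): 0.2}
# P(system test passes cleanly | low residual defects)
p_clean_test = {True: 0.8, False: 0.3}

def posterior_low_defects(clean_test_observed=True):
    """Update the belief in low residual defects by enumerating all hidden states."""
    num = den = 0.0
    for proc, vv, low in product([True, False], repeat=3):
        p = ((p_good_process if proc else 1 - p_good_process)
             * (p_good_vv if vv else 1 - p_good_vv)
             * (p_low_defects[(proc, vv)] if low else 1 - p_low_defects[(proc, vv)])
             * (p_clean_test[low] if clean_test_observed else 1 - p_clean_test[low]))
        den += p
        if low:
            num += p
    return num / den

print(f"P(low residual defects) after a clean system test: {posterior_low_defects():.2f}")
```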

Furthermore, traditional reliability models can be enhanced to incorporate some testing completeness or effectiveness metrics, such as code coverage, as well as their traditional testing-time based metrics. The key idea is that failure detection is not only related to the time that the software is under testing, but also what fraction of the code has been executed by the testing.

The effect of testing time on reliability can be estimated using distributions from traditional SRGMs. However, new models are needed to describe the effect of coverage on reliability. These two dimensions, testing time and coverage, are not orthogonal. The degree of dependency between them is thus an open problem for investigation. Formulation of new reliability models which integrate time and coverage measurements for reliability prediction would be a promising direction.

One drawback of the current metrics and data collection process is that it is a one-way, open-loop avenue: while metrics of the development process can indicate or predict the outcome quality, such as the reliability, of the resulting product, they often cannot provide feedback to the process regarding how to make improvement. Metrics would present tremendous benefits to reliability engineering if they could achieve not just prediction, but also refinement. Traditional software reliability models take metrics (such as defect density or times between failures) as input and produce reliability quantity as the output. In the future, a reverse function is urgently called for: given a reliability goal, what should the reliability process (and the resulting metrics) look like? By providing such feedback, it is expected that a closed-loop software reliability engineering process can be informative as well as beneficial in achieving predictably reliable software.

4.5. Reliability for emerging software applications

Software engineering targeted for general systems may be too ambitious. It may find more successful applications if it is domain-specific. In this Future of Software Engineering volume, future software engineering techniques for a number of emerging application domains have been thoroughly discussed. Emerging software applications also create abundant opportunities for domain-specific reliability engineering.

One key industry in which software will have a tremendous presence is the service industry. Service-oriented design has been employed since the 1990s in the telecommunications industry, and it reached the software engineering community as a powerful paradigm for Web service development, in which standardized interfaces and protocols gradually enabled the use of third-party functionality over the Internet, creating seamless vertical integration and enterprise process management for cross-platform, cross-provider, and cross-domain applications. Based on the future trends for Web application development as laid out in [22], software reliability engineering for this emerging technique poses enormous challenges and opportunities. The design of reliable Web services and the assessment of Web service reliability are novel and open research questions. On the one hand, having abundant service providers in a Web service makes the design diversity approach suddenly appealing, as the diversified service design is perceived not as cost, but as an available resource. On the other hand, this unplanned diversity may not be equipped with the necessary quality, and the compatibility among various service providers can pose major problems. Seamless Web service composition in this emerging application domain is therefore a central issue for reliability engineering. Extensive experiments are required in the area of measurement of Web service reliability. Some investigations have been initiated with limited success [27], but more efforts are needed.

Researchers have proposed the publish/subscribe paradigm as a basis for middleware platforms that support software applications composed of highly evolvable and dynamic federations of components. In this approach, components do not interact with each other directly; instead an additional middleware mediates their communications. Publish/subscribe middleware decouples the communication among components and supports implicit bindings among components. The sender does not know the identity of the receivers of its messages, but the middleware identifies them dynamically. Consequently new components can dynamically join the federation, become immediately active, and cooperate with the other components without requiring any reconfiguration of the architecture. Interested readers can refer to [21] for future trends in middleware-based software engineering technologies.

The open system approach is another trend in software applications. Closed-world assumptions do not hold in an increasing number of cases, especially in ubiquitous and pervasive computing settings, where the world is intrinsically open. Applications cover a wide range of areas, from dynamic supply-chain management, dynamic enterprise federations, and virtual endeavors, on the enterprise level, to automotive applications and home automation on the embedded-systems level. In an open world, the environment changes continuously. Software must adapt and react dynamically to changes, even if they are unanticipated. Moreover, the world is open to new components that context changes could make dynamically available – for example, due to mobility. Systems can discover and bind such components dynamically to the application while it is executing. The software must therefore exhibit a self-organization capability. In other words, the traditional solution that software designers adopted – carefully elicit change requests, prioritize them, specify them, design changes, implement and test, then redeploy the software – is no longer viable. More flexible and dynamically adjustable reliability engineering paradigms for rapid responses to software evolution are required.

5. Conclusions

As the cost of software application failures grows and as these failures increasingly impact business performance, software reliability will become progressively more important. Employing effective software reliability engineering techniques to improve product and process reliability is in the industry's best interest as well as one of its major challenges. In this paper, we have reviewed the history of software reliability engineering, the current trends, existing problems, and specific difficulties. Possible future directions and promising research problems in software reliability engineering have also been addressed. We have laid out the current and possible future trends for software reliability engineering in terms of meeting industry and customer needs. In particular, we have identified new software reliability engineering paradigms by taking software architectures, testing techniques, and software failure manifestation mechanisms into consideration. Some thoughts on emerging software applications have also been provided.

References

[1] S. Amasaki, O. Mizuno, T. Kikuno, and Y. Takagi, “A Bayesian Belief Network for Predicting Residual Faults in Software Products,” Proceedings of 14th International Symposium on Software Reliability Engineering (ISSRE2003), November 2003, pp. 215-226.

[2] ANSI/IEEE, Standard Glossary of Software Engineering Terminology, STD-729-1991, ANSI/IEEE, 1991.

[3] L. Baresi, E. Nitto, and C. Ghezzi, “Toward Open-World Software: Issues and Challenges,” IEEE Computer, October 2006, pp. 36-43.

[4] A. Bertolino, “Software Testing Research: Achievements, Challenges, Dreams,” Future of Software Engineering 2007, L. Briand and A. Wolf (eds.), IEEE-CS Press, 2007.

[5] J. Bishop and N. Horspool, “Cross-Platform Development: Software That Lasts,” IEEE Computer, October 2006, pp. 26-35.

[6] L. Briand and D. Pfahl, “Using Simulation for Assessing the Real Impact of Test Coverage on Defect Coverage,” IEEE Transactions on Reliability, vol. 49, no. 1, March 2000, pp. 60-70.

[7] J. Cheng, D.A. Bell, and W. Liu, “Learning Belief Networks from Data: An Information Theory Based Approach,” Proceedings of the Sixth International Conference on Information and Knowledge Management, Las Vegas, 1997, pp. 325-331.

[8] X. Cai and M.R. Lyu, “The Effect of Code Coverage on Fault Detection Under Different Testing Profiles,” ICSE 2005 Workshop on Advances in Model-Based Software Testing (A-MOST), St. Louis, Missouri, May 2005.

[9] X. Cai, M.R. Lyu, and K.F. Wong, “A Generic Environment for COTS Testing and Quality Prediction,” Testing Commercial-off-the-shelf Components and Systems, S. Beydeda and V. Gruhn (eds.), Springer-Verlag, Berlin, 2005, pp. 315-347.

[10] X. Cai, M.R. Lyu, and M.A. Vouk, “An Experimental Evaluation on Reliability Features of N-Version Programming,” in Proceedings 16th International Symposium on Software Reliability Engineering (ISSRE’2005), Chicago, Illinois, Nov. 8-11, 2005.

[11] X. Cai and M.R. Lyu, “An Empirical Study on Reliability and Fault Correlation Models for Diverse Software Systems,” in Proceedings 15th International Symposium on Software Reliability Engineering (ISSRE’2004), Saint-Malo, France, Nov. 2004, pp.125-136.

[12] M. Chen, M.R. Lyu, and E. Wong, “Effect of Code Coverage on Software Reliability Measurement,” IEEE Transactions on Reliability, vol. 50, no. 2, June 2001, pp.165-170.

[13] M.H. Chen, A.P. Mathur, and V.J. Rego, “Effect of Testing Techniques on Software Reliability Estimates Obtained Using Time Domain Models,” In Proceedings of the 10th Annual Software Reliability Symposium, Denver, Colorado, June 1992, pp. 116-123.

[14] J.B. Dugan and M.R. Lyu, “Dependability Modeling for Fault-Tolerant Software and Systems,” in Software Fault Tolerance, M. R. Lyu (ed.), New York: Wiley, 1995, pp. 109–138.

[15] D.E. Eckhardt and L.D. Lee, “A Theoretical Basis for the Analysis of Multiversion Software Subject to Coincident Errors,” IEEE Transactions on Software Engineering, vol. 11, no. 12, December 1985, pp. 1511–1517.

[16] P.G. Frankl and E.J. Weyuker, “An Applicable Family of Data Flow Testing Criteria,” IEEE Transactions on Software Engineering, vol. 14, no. 10, October 1988, pp. 1483-1498.

[17] J.R. Horgan, S. London, and M.R. Lyu, “Achieving Software Quality with Testing Coverage Measures,” IEEE Computer, vol. 27, no.9, September 1994, pp. 60-69.

[18] C.Y. Huang and M.R. Lyu, “Optimal Release Time for Software Systems Considering Cost, Testing-Effort, and Test Efficiency,” IEEE Transactions on Reliability, vol. 54, no. 4, December 2005, pp. 583-591.

[19] C.Y. Huang, M.R. Lyu, and S.Y. Kuo, "A Unified Scheme of Some Non-Homogeneous Poisson Process Models for Software Reliability Estimation," IEEE Transactions on Software Engineering, vol. 29, no. 3, March 2003, pp. 261-269.

[20] W.S. Humphrey, “The Future of Software Engineering: I,” Watts New Column, News at SEI, vol. 4, no. 1, March, 2001.

[21] V. Issarny, M. Caporuscio, and N. Georgantas: “A Perspective on the Future of Middleware-Based Software Engineering,” Future of Software Engineering 2007, L. Briand and A. Wolf (eds.), IEEE-CS Press, 2007.

[22] M. Jazayeri, “Web Application Development: The Coming Trends,” Future of Software Engineering 2007, L. Briand and A. Wolf (eds.), IEEE-CS Press, 2007.

[23] Z. Jelinski and P.B. Moranda, “Software Reliability Research,” in Proceedings of the Statistical Methods for the Evaluation of Computer System Performance, Academic Press, 1972, pp. 465-484.

[24] B. Littlewood and L. Strigini, “Software Reliability and Dependability: A Roadmap,” in Proceedings of the 22nd International Conference on Software Engineering (ICSE’2000), Limerick, June 2000, pp. 177-188.

[25] B. Littlewood and D. Miller, “Conceptual Modeling of Coincident Failures in Multiversion Software,” IEEE Transactions on Software Engineering, vol. 15, no. 12, December 1989, pp. 1596–1614.

[26] C. Liu, L. Fei, X. Yan, J. Han, and S. Midkiff, “Statistical Debugging: A Hypothesis Testing-based Approach,” IEEE Transaction on Software Engineering, vol. 32, no. 10, October, 2006, pp. 831-848.

[27] N. Looker and J. Xu, “Assessing the Dependability of SOAP-RPC-Based Web Services by Fault Injection,” in Proceedings of 9th IEEE International Workshop on Object-oriented Real-time Dependable Systems, 2003, pp. 163-170.

[28] M.R. Lyu (ed.), Handbook of Software Reliability Engineering, IEEE Computer Society Press and McGraw-Hill, 1996.

[29] M.R. Lyu and X. Cai, “Fault-Tolerant Software,” Encyclopedia on Computer Science and Engineering, Benjamin Wah (ed.), Wiley, 2007.

[30] M.R. Lyu, Z. Huang, S. Sze, and X. Cai, “An Empirical Study on Testing and Fault Tolerance for Software Reliability Engineering,” in Proceedings 14th IEEE International Symposium on Software Reliability Engineering (ISSRE'2003), Denver, Colorado, November 2003, pp.119-130.

[31] Y.K. Malaiya, N. Li, J.M. Bieman, and R. Karcich, “Software Reliability Growth with Test Coverage,” IEEE Transactions on Reliability, vol. 51, no. 4, December 2002, pp. 420-426.

[32] T. Margaria and B. Steffen, “Service Engineering: Linking Business and IT,” IEEE Computer, October 2006, pp. 45-55.

[33] J.D. Musa, Software Reliability Engineering: More Reliable Software Faster and Cheaper (2nd Edition), AuthorHouse, 2004.

[34] J.D. Musa, “Operational Profiles in Software Reliability Engineering,” IEEE Software, Volume 10, Issue 2, March 1993, pp. 14-32.

[35] J.D. Musa, A. Iannino, and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, Inc., New York, NY, 1987.

[36] H. Pham, Software Reliability, Springer, Singapore, 2000.

[37] P.T. Popov, L. Strigini, J. May, and S. Kuball, “Estimating Bounds on the Reliability of Diverse Systems,” IEEE Transactions on Software Engineering, vol. 29, no. 4, April 2003, pp. 345–359.

[38] S. Rapps and E.J. Weyuker, “Selecting Software Test Data Using Data Flow Information,” IEEE Transactions on Software Engineering, vol. 11, no. 4, April 1985, pp. 367-375.

[39] Rome Laboratory (RL), Methodology for Software Reliability Prediction and Assessment, Technical Report RL-TR-92-52, volumes 1 and 2, 1992.

[40] M.L. Shooman, Reliability of Computer Systems and Networks: Fault Tolerance, Analysis and Design, Wiley, New York, 2002.

[41] R. Taylor and A. van der Hoek, “Software Design and Architecture: The Once and Future Focus of Software Engineering,” Future of Software Engineering 2007, L. Briand and A. Wolf (eds.), IEEE-CS Press, 2007.

[42] X. Teng, H. Pham, and D. Jeske, “Reliability Modeling of Hardware and Software Interactions, and Its Applications,” IEEE Transactions on Reliability, vol. 55, no. 4, Dec. 2006, pp. 571-577.

[43] L.A. Tomek and K.S. Trivedi, “Analyses Using Stochastic Reward Nets,” in Software Fault Tolerance, M.R. Lyu (ed.), New York: Wiley, 1995, pp. 139–165.

[44] W. Torres-Pomales, “Software Fault Tolerance: A Tutorial,” NASA Langley Research Center, Hampton, Virginia, TM-2000-210616, Oct. 2000.

[45] K.S. Trivedi, “SHARPE 2002: Symbolic Hierarchical Automated Reliability and Performance Evaluator,” in Proceedings International Conference on Dependable Systems and Networks, 2002.

[46] K.S. Trivedi, K. Vaidyanathan, and K. Goseva-Postojanova, "Modeling and Analysis of Software Aging and Rejuvenation", in Proceedings of 33rd Annual Simulation Symposium, IEEE Computer Society Press, Los Alamitos, CA, 2000, pp. 270-279.

[47] A. von Mayrhauser and D. Chen, “Effect of Fault Distribution and Execution Patterns on Fault Exposure in Software: A Simulation Study,” Software Testing, Verification & Reliability, vol. 10, no.1, March 2000, pp. 47-64.

[48] M.A. Vouk, “Using Reliability Models During Testing With Nonoperational Profiles,” in Proceedings of 2nd Bellcore/Purdue Workshop on Issues in Software Reliability Estimation, October 1992, pp. 103-111.

[49] W. Wang and M. Tang, “User-Oriented Reliability Modeling for a Web System,” in Proceedings of the 14th International Symposium on Software Reliability Engineering (ISSRE’03), Denver, Colorado, November 2003, pp.1-12.

[50] M. Xie, Software Reliability Modeling, World Scientific Publishing Company, 1991.

[51] S. Yacoub, B. Cukic, and H Ammar, “A Scenario-Based Reliability Analysis Approach for Component-Based Software,” IEEE Transactions on Reliability, vol. 53, no. 4, 2004, pp. 465-480.

[52] A.X. Zheng, M.I. Jordan, B. Liblit, M. Naik, and A. Aiken, “Statistical Debugging: Simultaneous Identification of Multiple Bugs,” in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006, pp. 1105-1112.


Software Reliability Testing
From Wikipedia, the free encyclopedia

Software reliability testing is a field of testing that deals with checking the ability of software to function under given environmental conditions for a particular amount of time, taking into account the precision of the software. In software reliability testing, problems concerning the software design and functionality are discovered, and assurance is given that the system meets all requirements. Software reliability is the probability that the software will work properly in a specified environment and for a given time, and it can be estimated as

Probability = Number of failing cases / Total number of cases under consideration

Using this formula, the failure probability is calculated by testing a sample of all available input states. The set of all possible input states is called the input space. To find the reliability of the software, we need to find the output space from the given input space and the software.[1]
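A minimal sketch of this sample-based estimate is shown below; the sampled inputs, the pass/fail predicate, and the use of Math.abs as the "system under test" are all illustrative assumptions.

import java.util.List;
import java.util.function.Predicate;

/** Estimates failure probability and reliability from a sample of the input space. */
public class ReliabilityEstimator {
    /** Runs each sampled input through the system and returns the estimated reliability. */
    public static <T> double estimate(List<T> sampledInputs, Predicate<T> passes) {
        int failures = 0;
        for (T input : sampledInputs) {
            if (!passes.test(input)) failures++;              // count failing cases
        }
        double failureProbability = (double) failures / sampledInputs.size();
        return 1.0 - failureProbability;                      // reliability = 1 - failure probability
    }
    public static void main(String[] args) {
        // The "system under test" here is Math.abs; it fails for Integer.MIN_VALUE,
        // whose absolute value cannot be represented as a positive int.
        List<Integer> sample = List.of(-5, 0, 3, Integer.MIN_VALUE, 42);
        double reliability = estimate(sample, n -> Math.abs(n) >= 0);
        System.out.println("Estimated reliability = " + reliability);   // 0.8 for this sample
    }
}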

Contents

1 Overview
2 Objective of reliability testing
  2.1 Secondary objectives
  2.2 Points for defining objectives
3 Need for reliability testing
4 Types of reliability testing
  4.1 Feature test
  4.2 Load test
  4.3 Regression test
5 Tests planning
  5.1 Steps for planning
  5.2 Problems in designing test cases
6 Reliability enhancement through testing
  6.1 Reliability growth testing
  6.2 Designing test cases for current release
7 Reliability evaluation based on operational testing
  7.1 Reliability growth assessment and prediction
  7.2 Reliability estimation based on failure-free working
8 See also
9 References
10 External links

Overview

To perform software reliability testing, it is necessary to design test cases and a test procedure for each software module. Data is gathered from various stages of development, such as the design and operating stages. The tests are limited by restrictions such as the cost of performing them and time constraints. Statistical samples are obtained from the software products to test for the reliability of the software. When sufficient data or information is gathered, statistical studies are done. Time constraints are handled by applying fixed dates or deadlines to the tests to be performed; after this phase, the design of the software is frozen and actual implementation starts. As there are restrictions on cost and time, the data is gathered carefully so that each item of data has some purpose and achieves the expected precision.[2] To achieve satisfactory results from reliability testing, one must take care of some reliability characteristics. For example, Mean Time To Failure (MTTF)[3] is measured in terms of three factors:

1. Operating Time.

2. Number of on-off cycles.

3. Calendar Time.

If the restriction is on operating time, or if the focus is on the first factor, then compressed-time acceleration can be applied to reduce the test time. If the focus is on calendar time (that is, there are predefined deadlines), then intensified stress testing is used.[2]

Software reliability is measured in terms of Mean Time Between Failures (MTBF).[4] MTBF consists of the mean time to failure (MTTF) and the mean time to repair (MTTR): MTBF = MTTF + MTTR. MTTF is the time between two consecutive failures and MTTR is the time required to fix a failure.[5] Reliability, being a probability, always lies between 0 and 1, and it increases when errors or bugs are removed from the program.[6]

For example, if MTBF = 1000 hours for an average piece of software, then the software should work for 1000 hours of continuous operation.
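The following small sketch shows how these quantities relate, computing MTTF, MTTR and MTBF from a hypothetical log of uptimes and repair times; all numbers are invented for illustration.

/** Computes MTTF, MTTR and MTBF from recorded operating and repair periods (hours). */
public class MtbfCalculator {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }
    public static void main(String[] args) {
        // Hypothetical log: uptime before each failure, and time taken to repair it.
        double[] uptimeBeforeFailure = { 920, 1130, 870, 1040 };  // hours of failure-free operation
        double[] repairTime          = { 4.5, 2.0, 6.0, 3.5 };    // hours to fix each failure

        double mttf = mean(uptimeBeforeFailure);   // mean time to failure
        double mttr = mean(repairTime);            // mean time to repair
        double mtbf = mttf + mttr;                 // mean time between failures

        System.out.printf("MTTF = %.1f h, MTTR = %.1f h, MTBF = %.1f h%n", mttf, mttr, mtbf);
    }
}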

Objective of reliability testing

The main objective of reliability testing is to test the performance of the software under given conditions, without any corrective measures, using known fixed procedures and considering its specifications.

Secondary objectives

1. To find the perceptual structure of repeating failures.
2. To find the number of failures occurring in a specified amount of time.
3. To find the mean life of the software.
4. To discover the main cause of failure.
5. To check the performance of different units of the software after taking preventive actions.

Points for defining objectives

1. The behaviour of the software should be defined under the given conditions.
2. The objective should be feasible.
3. Time constraints should be provided.[7]

Need for reliability testing

Nowadays, computer software is applied in a large number of fields, including many critical applications in industry, the military, and commercial systems. Software engineering has been developing for such applications since the last century, yet there is no complete measure to assess them; software reliability measures are used as a tool for this assessment, which makes software reliability a most important aspect of any software.[8]

Assessment of reliability is required to improve the performance of the software product and of the software development process. Reliability testing is of great use for software managers and practitioners; thus, testing the reliability of software is important.[9]

Types of reliability testing

Software reliability testing requires checking the features provided by the software, the load that the software can handle, and regression testing.[10]

Feature test

A feature test for the software is conducted in the following steps:

Each operation in the software is executed once.
Interaction between two operations is reduced.
Each operation is checked for its proper execution.

The feature test is followed by the load test.[10]

Load test

This test is conducted to check the performance of the software under the maximum work load. Any software performs well up to some amount of load, after which its response time starts to degrade. For example, a web site can be tested to see how many simultaneous users it can serve without performance degradation. Load testing mainly helps for databases and application servers. Load testing also requires software performance testing, which checks how well the software performs under a workload.[10]

Regression test

Regression testing is used to check whether any bug fixes in the software have introduced new bugs, and to determine whether one part of the software affects another. Regression testing is conducted after every change in the software's features. This testing is periodic; the period depends on the length and features of the software.[10]
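Regression tests of this kind are normally automated so that they can be re-run after every change. The sketch below uses JUnit 4 and a hypothetical Account class whose overdraft check was once buggy; the class, the method names and the scenario are invented for illustration.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

/** Regression tests for a hypothetical Account class: re-run after every change
 *  to make sure a previously fixed bug does not reappear. */
public class AccountRegressionTest {

    @Test
    public void withdrawReducesBalance() {
        Account a = new Account(100);
        a.withdraw(30);
        assertEquals(70, a.balance());
    }

    /** Guards the earlier bug fix: withdrawing more than the balance must be rejected. */
    @Test(expected = IllegalArgumentException.class)
    public void overdraftIsRejected() {
        new Account(50).withdraw(80);
    }
}

/** Minimal Account class used by the tests (hypothetical example code). */
class Account {
    private int balance;
    Account(int initial) { balance = initial; }
    int balance() { return balance; }
    void withdraw(int amount) {
        if (amount > balance) throw new IllegalArgumentException("overdraft");
        balance -= amount;
    }
}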

Tests planning

Reliability testing costs more than other types of testing, so proper management and planning are required while doing it. The test plan includes the testing process to be implemented, data about the test environment, the test schedule, test points, and so on.

Steps for planning

1. Find the main aim of testing.
2. Know the requirements of testing.
3. Review the existing data and check it against the requirements.
4. Considering the priorities of the tests, find out which tests are necessary.
5. Use the available time, money, and manpower properly.
6. Determine the specifications of the test.
7. Allot responsibilities to the different testing teams.
8. Decide policies for reporting the results of testing.
9. Maintain control over the testing procedure throughout.[7]

Problems in designing test cases

There are some problems when designing these test cases:

Test cases can be selected simply by choosing valid input values for each field of the software, but after changes in a particular module the recorded input values need to be checked again, because they may not test the new features introduced after the older version of the software.
There may be some critical runs in the software which are not handled by any test case, so careful test case selection is necessary.[10]

Reliability enhancement through testing

Studies during the development and design of software help the reliability of the product. Reliability testing is essentially performed to eliminate the failure modes of the software. Life testing of the product should always be done after the design is finished, or at least after the complete design is finalized.[11] Failure analysis and design improvement are achieved through the following kinds of testing.

Reliability growth testing

This testing is used to check new prototypes of the software, which are initially expected to fail frequently.[11] The causes of failure are detected, and actions are taken to reduce the defects. Suppose T is the total accumulated test time for the prototype and n(T) is the number of failures from the start up to time T. In the Duane model, the plot of the cumulative failure rate n(T)/T against T on log-log axes is a straight line (the Duane plot):

ln(n(T)/T) = b - alpha * ln(T)

Solving this for n(T) gives

n(T) = K * T^(1 - alpha), where K = e^b.

From the Duane plot one can see how much reliability can be gained after further cycles of testing and fixing. If alpha is zero, the cumulative failure rate is constant and reliability cannot be improved as expected for the given number of failures; for alpha greater than zero, the failure rate falls as the cumulative test time T increases, which indicates reliability growth.
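A small sketch of fitting the Duane parameters is shown below: it performs a least-squares fit of ln(n(T)/T) against ln(T) to recover alpha and K, using invented failure data purely for illustration.

/** Fits the Duane model n(T)/T = K * T^(-alpha) by linear regression on log-log data. */
public class DuanePlotFit {
    public static void main(String[] args) {
        // Invented observations: cumulative test time T (hours) and cumulative failures n(T).
        double[] T = { 10, 25, 60, 140, 300, 620 };
        double[] n = { 8, 15, 25, 38, 54, 72 };

        int m = T.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < m; i++) {
            double x = Math.log(T[i]);           // ln T
            double y = Math.log(n[i] / T[i]);    // ln(n(T)/T)
            sx += x; sy += y; sxx += x * x; sxy += x * y;
        }
        double slope = (m * sxy - sx * sy) / (m * sxx - sx * sx);   // = -alpha
        double intercept = (sy - slope * sx) / m;                   // = b
        double alpha = -slope;
        double K = Math.exp(intercept);

        System.out.printf("alpha = %.3f, K = %.3f%n", alpha, K);
        System.out.printf("Predicted failures by T = 1000 h: %.1f%n",
                          K * Math.pow(1000, 1 - alpha));
    }
}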

Designing test cases for current release

If a new operation is being added in the current release of the software, then the test cases for that operation are written differently:

First, plan how many new test cases are to be written for the current version.
If the new feature is part of an existing feature, then share the test cases between the new and existing features.


Finally, combine all the test cases from the current version and the previous one, and record all the results.[10]

There is a predefined rule to calculate the count of new test cases for the software: if N is the probability of occurrence of new operations in the new release of the software, R is the probability of occurrence of used operations in the current release, and T is the number of all previously used test cases, then the required number of new test cases is determined from these three quantities.

Reliability evaluation based on operational testing

In reliability testing, the method of operational testing is used to test the reliability of the software: one checks the working of the software in its relevant operational environment. The main problem is constructing such an operational environment. This type of simulation is used in some industries, for example the nuclear industry and aircraft manufacturing. Predicting future reliability is a part of reliability evaluation, and two techniques are used for it:

Steady-state reliability estimation
In this case we use the feedback from delivered software products. Depending on those results, we predict the future reliability of the next version of the product. It is similar to sample testing for physical products.

Reliability growth based prediction
This method uses the documentation of the testing procedure. For example, consider a developed software product for which we create different new versions. We consider the data about the testing of each version, and on the basis of the observed trend we predict the reliability of the software.[12]

Reliability growth assessment and prediction

In the assessment and prediction of software reliability, we use reliability growth models. During the operation of the software, data about its failures is stored in statistical form and is given as input to a reliability growth model, which then evaluates the reliability of the software. Many reliability growth models are available, each with a probability model claiming to represent the failure process, but there is no single model that is best suited for all conditions, so a model has to be chosen according to the circumstances. Today this selection problem is addressed using more advanced techniques.

Reliability estimation based on failure-free working

In this case the reliability of the software is estimated under some assumptions, such as:

If a bug is found, it is certain to be fixed by someone.
Fixing a bug does not affect the reliability of the software.
Each fix in the software is accurate.[12]
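One common way to make such an estimate concrete, under the additional assumption of a constant failure rate, is to bound the failure rate from the observed failure-free period: after t failure-free hours, the failure rate is at most -ln(1 - C)/t with confidence C. The sketch below applies this bound with invented numbers; the constant-rate assumption and the figures are illustrative, not part of the original text.

/** Reliability estimation from failure-free working, assuming a constant failure rate. */
public class FailureFreeEstimate {
    /** Upper confidence bound on the failure rate after t failure-free hours. */
    static double failureRateUpperBound(double failureFreeHours, double confidence) {
        return -Math.log(1.0 - confidence) / failureFreeHours;
    }
    public static void main(String[] args) {
        double t = 5000;        // hours of observed failure-free operation (illustrative)
        double confidence = 0.95;
        double lambda = failureRateUpperBound(t, confidence);
        double mission = 100;   // mission time of interest, in hours
        double reliability = Math.exp(-lambda * mission);
        System.out.printf("With %.0f%% confidence, failure rate <= %.2e per hour%n",
                          confidence * 100, lambda);
        System.out.printf("Reliability over a %.0f h mission >= %.4f%n", mission, reliability);
    }
}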

See also

Software testing

Load testing


Regression testing

Reliability engineering

References

1. ^ Hoang Pham. Software Reliability.
2. ^ E. E. Lewis. Introduction to Reliability Engineering.
3. ^ "MTTF" (http://www.weibull.com/hotwire/issue94/relbasics94.htm).
4. ^ Roger Pressman. Software Engineering: A Practitioner's Approach. McGraw-Hill.
5. ^ "Approaches to Reliability Testing & Setting of Reliability Test Objectives" (http://www.softwaretestinggenius.com/articalDetails.php?qry=963).
6. ^ Aditya P. Mathur. Foundations of Software Testing. Pearson Publications.
7. ^ Dimitri Kececioglu. Reliability and Life Testing Handbook.
8. ^ M. Xie. A Statistical Basis for Software Reliability Assessment.
9. ^ M. Xie. Software Reliability Modelling.
10. ^ John D. Musa. Software Reliability Engineering: More Reliable Software, Faster and Cheaper. McGraw-Hill. ISBN 0-07-060319-7.
11. ^ E. E. Lewis. Introduction to Reliability Engineering. ISBN 0-471-01833-3.
12. ^ "Problem of Assessing Reliability". CiteSeerX: 10.1.1.104.9831 (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.104.9831).

External links

Mean Time Between Failure (http://www.weibull.com/hotwire/issue94/relbasics94.htm/)

Software Life Testing (http://www.weibull.com/basics/accelerated.htm/)

Retrieved from "http://en.wikipedia.org/w/index.php?title=Software_Reliability_Testing&oldid=521833844"

Categories: Software testing



Software performance testing
From Wikipedia, the free encyclopedia

In software engineering, performance testing is in general testing performed to determine how a system performs in terms of responsiveness and stability under a particular workload. It can also serve to investigate, measure, validate or verify other quality attributes of the system, such as scalability, reliability and resource usage.

Performance testing is a subset of performance engineering, an emerging computer science practice which strives to build performance into the implementation, design and architecture of a system.

Contents

1 Performance testing types
  1.1 Load testing
  1.2 Stress testing
  1.3 Endurance testing (soak testing)
  1.4 Spike testing
  1.5 Configuration testing
  1.6 Isolation testing
2 Setting performance goals
  2.1 Concurrency/throughput
  2.2 Server response time
  2.3 Render response time
  2.4 Performance specifications
  2.5 Questions to ask
3 Pre-requisites for Performance Testing
  3.1 Test conditions
  3.2 Timing
4 Tools
5 Technology
6 Tasks to undertake
7 Methodology
  7.1 Performance testing web applications
8 See also
9 External links

Performance testing types

Load testing

Load testing is the simplest form of performance testing. A load test is usually conducted to understand the behaviour of the system under a specific expected load. This load can be the expected concurrent number of users on the application performing a specific number of transactions within the set duration. This test will give out the response times of all the important business critical transactions. If the database, application server, etc. are also monitored, then this simple test can itself point towards any bottlenecks in the application software.

Stress testing

Stress testing is normally used to understand the upper limits of capacity within the system. This kind of test is done to determine the system's robustness in terms of extreme load and helps application administrators to determine if the system will perform sufficiently if the current load goes well above the expected maximum.

Endurance testing (soak testing)

Endurance testing is usually done to determine if the system can sustain the continuous expected load. During endurance tests, memory utilization is monitored to detect potential leaks. Also important, but often overlooked, is performance degradation: ensuring that the throughput and/or response times after some long period of sustained activity are as good as or better than at the beginning of the test. It essentially involves applying a significant load to a system for an extended, significant period of time. The goal is to discover how the system behaves under sustained use.
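A stripped-down sketch of the memory-monitoring side of a soak test is shown below: a workload is applied repeatedly for a fixed period while the used heap is sampled, so that a steadily climbing trend can hint at a leak. The workload, durations and sampling interval are placeholders.

/** Skeleton of a soak test: run a workload for a long period while sampling used heap. */
public class SoakTest {
    public static void main(String[] args) throws InterruptedException {
        long endTime = System.currentTimeMillis() + 60_000;  // placeholder: 1 minute; real soak tests run for hours
        Runtime rt = Runtime.getRuntime();
        while (System.currentTimeMillis() < endTime) {
            exerciseSystem();                                 // placeholder for the real workload
            long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
            System.out.println("used heap: " + usedMb + " MB");
            Thread.sleep(1_000);                              // sampling interval
        }
    }
    /** Placeholder workload; in a real test this would drive the system under test. */
    static void exerciseSystem() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) sb.append(i);
    }
}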

Spike testing

Spike testing is done by suddenly increasing the number of users, or the load generated by users, by a very large amount and observing the behaviour of the system. The goal is to determine whether performance will suffer, the system will fail, or it will be able to handle dramatic changes in load.

Configuration testing

Rather than testing for performance from the perspective of load, tests are created to determine the effects of configuration changes to the system's components on the system's performance and behaviour. A common example would be experimenting with different methods of load-balancing.

Isolation testing

Isolation testing is not unique to performance testing but involves repeating a test execution that resulted in a system problem. It is often used to isolate and confirm the fault domain.

Setting performance goals

Performance testing can serve different purposes.

It can demonstrate that the system meets performance criteria.

It can compare two systems to find which performs better.
Or it can measure what parts of the system or workload cause the system to perform badly.

Many performance tests are undertaken without due consideration to the setting of realistic performance goals. The first question from a business perspective should always be "why are we performance testing?". These considerations are part of the business case of the testing. Performance goals will differ depending on the system's technology and purpose; however, they should always include some of the following:

Concurrency/throughput

If a system identifies end-users by some form of log-in procedure then a concurrency goal is highly desirable. By definition this is the largest number of concurrent system users that the system is expected to support at any given moment. The work-flow of your scripted transaction may impact true concurrency, especially if the iterative part contains the log-in and log-out activity.

If the system has no concept of end-users then the performance goal is likely to be based on a maximum throughput or transaction rate. A common example would be casual browsing of a web site such as Wikipedia.
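The two kinds of goal are related by Little's law: the average number of concurrent users equals the throughput multiplied by the average time each user spends per interaction (response time plus think time). The sketch below simply evaluates that relation with invented figures.

/** Little's law: concurrency = throughput * (response time + think time). */
public class ConcurrencyGoal {
    public static void main(String[] args) {
        double throughputPerSec = 50.0;   // completed transactions per second (illustrative)
        double responseTimeSec  = 0.8;    // average server response time
        double thinkTimeSec     = 9.2;    // average pause between a user's requests
        double concurrentUsers  = throughputPerSec * (responseTimeSec + thinkTimeSec);
        System.out.printf("Expected concurrent users: %.0f%n", concurrentUsers);
    }
}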

Server response time

This refers to the time taken for one system node to respond to the request of another. A simple example would be a HTTP 'GET' request from browser client to web server. In terms of response time this is what all load testing tools actually measure. It may be relevant to set server response time goals between all nodes of the system.

Render response time

A difficult thing for load testing tools to deal with, as they generally have no concept of what happens within a node apart from recognizing a period of time where there is no activity 'on the wire'. To measure render response time it is generally necessary to include functional test scripts as part of the performance test scenario, which is a feature not offered by many load testing tools.

Performance specifications

It is critical to detail performance specifications (requirements) and document them in any performance test plan. Ideally, this is done during the requirements development phase of any system development project, prior to any design effort. See Performance Engineering for more details.

However, performance testing is frequently not performed against a specification, i.e. no one will have expressed what the maximum acceptable response time for a given population of users should be. Performance testing is frequently used as part of the process of performance profile tuning. The idea is to identify the “weakest link” – there is inevitably a part of the system which, if it is made to respond faster, will result in the overall system running faster. It is sometimes a difficult task to identify which part of the system represents this critical path, and some test tools include (or can have add-ons that provide) instrumentation that runs on the server (agents) and report transaction times, database access times, network overhead, and other server monitors, which can be analyzed together with the raw performance statistics. Without such instrumentation one might have to have someone crouched over Windows Task Manager at the server to see how much CPU load the performance tests are generating (assuming a Windows system is under test).

Performance testing can be performed across the web, and even done in different parts of the country, since it is known that the response times of the internet itself vary regionally. It can also be done in-house, although routers would then need to be configured to introduce the lag that would typically occur on public networks. Loads should be introduced to the system from realistic points. For example, if 50% of a system's user base will be accessing the system via a 56K modem connection and the other half over a T1, then the load injectors (computers that simulate real users) should either inject load over the same mix of connections (ideal) or simulate the network latency of such connections, following the same user profile.

It is always helpful to have a statement of the likely peak numbers of users that might be expected to use the system at peak times. If there can also be a statement of what constitutes the maximum allowable 95th percentile response time, then an injector configuration could be used to test whether the proposed system met that specification.
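A short sketch of checking such a statement against measured data is shown below: it computes the 95th percentile (nearest-rank method) of a set of response-time samples and compares it with the allowed maximum; the samples and the threshold are invented.

import java.util.Arrays;

/** Checks measured response times against a 95th-percentile requirement. */
public class PercentileCheck {
    /** Nearest-rank percentile of the samples (p between 0 and 100). */
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);   // nearest-rank method
        return sorted[Math.max(0, rank - 1)];
    }
    public static void main(String[] args) {
        double[] responseMs = { 120, 180, 150, 900, 220, 160, 140, 480, 210, 170 }; // invented samples
        double p95 = percentile(responseMs, 95);
        double allowedMs = 500;                                   // invented requirement
        System.out.printf("95th percentile = %.0f ms -> %s%n",
                          p95, p95 <= allowedMs ? "within specification" : "violates specification");
    }
}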


Questions to ask

Performance specifications should ask the following questions, at a minimum:

In detail, what is the performance test scope? What subsystems, interfaces, components, etc. are in and out of scope for this test?
For the user interfaces (UIs) involved, how many concurrent users are expected for each (specify peak vs. nominal)?
What does the target system (hardware) look like (specify all server and network appliance configurations)?
What is the Application Workload Mix of each system component? (for example: 20% log-in, 40% search, 30% item select, 10% checkout).
What is the System Workload Mix? [Multiple workloads may be simulated in a single performance test] (for example: 30% Workload A, 20% Workload B, 50% Workload C).
What are the time requirements for any/all back-end batch processes (specify peak vs. nominal)?

Pre-requisites for Performance Testing

A stable build of the system, which must resemble the production environment as closely as possible.

The performance testing environment should not be combined with the User Acceptance Testing (UAT) or development environment. This is dangerous because if UAT, integration or other tests are going on in the same environment, then the results obtained from the performance testing may not be reliable. As a best practice it is always advisable to have a separate performance testing environment resembling the production environment as much as possible.

Test conditions

In performance testing, it is often crucial (and often difficult to arrange) for the test conditions to be similar to the expected actual use. This is, however, not entirely possible in actual practice. The reason is that the workloads of production systems have a random nature, and while the test workloads do their best to mimic what may happen in the production environment, it is impossible to exactly replicate this workload variability - except in the most simple system.

Loosely-coupled architectural implementations (e.g.: SOA) have created additional complexities with performance testing. Enterprise services or assets (that share a common infrastructure or platform) require coordinated performance testing (with all consumers creating production-like transaction volumes and load on shared infrastructures or platforms) to truly replicate production-like states. Due to the complexity and financial and time requirements around this activity, some organizations now employ tools that can monitor and create production-like conditions (also referred to as "noise") in their performance testing environments (PTE) to understand capacity and resource requirements and verify / validate quality attributes.

Timing

It is critical to the cost performance of a new system that performance test efforts begin at the inception of the development project and extend through to deployment. The later a performance defect is detected, the higher the cost of remediation. This is true in the case of functional testing, but even more so with performance testing, due to the end-to-end nature of its scope. It is always crucial for the performance test team to be involved as early as possible, because key performance prerequisites, such as acquiring and preparing the performance test environment, are often lengthy and time-consuming.


Tools

In the diagnostic case, software engineers use tools such as profilers to measure which parts of a device or software contribute most to the poor performance, or to establish throughput levels (and thresholds) for maintained acceptable response time.

Technology

Performance testing technology employs one or more PCs or Unix servers to act as injectors – each emulating the presence of numbers of users and each running an automated sequence of interactions (recorded as a script, or as a series of scripts to emulate different types of user interaction) with the host whose performance is being tested. Usually, a separate PC acts as a test conductor, coordinating and gathering metrics from each of the injectors and collating performance data for reporting purposes. The usual sequence is to ramp up the load – starting with a small number of virtual users and increasing the number over a period to some maximum. The test result shows how the performance varies with the load, given as number of users vs. response time. Various tools are available to perform such tests. Tools in this category usually execute a suite of tests which will emulate real users against the system. Sometimes the results can reveal oddities, e.g., that while the average response time might be acceptable, there are outliers of a few key transactions that take considerably longer to complete – something that might be caused by inefficient database queries, pictures, etc.
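A heavily stripped-down sketch of the injector idea follows: it ramps up virtual users over time, each looping over a scripted HTTP GET in its own thread and printing the response time. The target URL, ramp rate and think time are placeholders, and a real tool would add coordination, result collation and error handling.

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.atomic.AtomicLong;

/** Minimal load-injector sketch: ramp up virtual users, each looping over a scripted request. */
public class LoadInjector {
    static final AtomicLong totalRequests = new AtomicLong();

    public static void main(String[] args) throws Exception {
        int maxUsers = 20;            // placeholder ramp target
        for (int u = 1; u <= maxUsers; u++) {
            Thread user = new Thread(LoadInjector::virtualUser);
            user.setDaemon(true);
            user.start();
            Thread.sleep(500);        // ramp-up: add one virtual user every 500 ms
        }
        Thread.sleep(10_000);         // measurement window
        System.out.println("Requests completed: " + totalRequests.get());
    }

    static void virtualUser() {
        try {
            while (true) {
                long start = System.nanoTime();
                HttpURLConnection con =
                    (HttpURLConnection) new URL("http://localhost:8080/").openConnection(); // placeholder host
                con.getResponseCode();                      // scripted interaction: a single GET
                con.disconnect();
                long ms = (System.nanoTime() - start) / 1_000_000;
                totalRequests.incrementAndGet();
                System.out.println(Thread.currentThread().getName() + " response " + ms + " ms");
                Thread.sleep(1_000);                        // think time between requests
            }
        } catch (Exception e) {
            // in a real injector, failures would be counted and reported
        }
    }
}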

Performance testing can be combined with stress testing, in order to see what happens when an acceptable load is exceeded – does the system crash? How long does it take to recover if a large load is reduced? Does it fail in a way that causes collateral damage?

Analytical Performance Modeling is a method to model the behaviour of a system in a spreadsheet. The model is fed with measurements of transaction resource demands (CPU, disk I/O, LAN, WAN), weighted by the transaction mix (business transactions per hour). The weighted transaction resource demands are added up to obtain the hourly resource demands and divided by the hourly resource capacity to obtain the resource loads. Using the response-time formula (R = S/(1-U), where R = response time, S = service time, U = load), response times can be calculated and calibrated with the results of the performance tests. Analytical performance modelling allows evaluation of design options and system sizing based on actual or anticipated business usage. It is therefore much faster and cheaper than performance testing, though it requires a thorough understanding of the hardware platforms.
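A compact sketch of such a spreadsheet-style model is given below: per-transaction CPU demands are weighted by an hourly transaction mix, utilisation is obtained by dividing by capacity, and response times follow from R = S/(1 - U). All demand and capacity figures are invented.

/** Spreadsheet-style analytical model: utilisation from a transaction mix, then R = S / (1 - U). */
public class AnalyticalModel {
    public static void main(String[] args) {
        // Invented workload: transactions per hour and CPU service time per transaction (seconds).
        String[] txn        = { "login", "search", "checkout" };
        double[] perHour    = { 30000,   120000,   15000 };
        double[] cpuSeconds = { 0.05,    0.02,     0.12 };

        double cpuCapacitySecondsPerHour = 2 * 3600;   // invented: two CPU cores fully available

        double demand = 0;
        for (int i = 0; i < txn.length; i++) {
            demand += perHour[i] * cpuSeconds[i];      // weighted hourly CPU demand
        }
        double u = demand / cpuCapacitySecondsPerHour; // CPU utilisation (load)

        System.out.printf("CPU utilisation U = %.2f%n", u);
        for (int i = 0; i < txn.length; i++) {
            double r = cpuSeconds[i] / (1 - u);        // response time R = S / (1 - U)
            System.out.printf("%-9s S = %.2fs  ->  R = %.2fs%n", txn[i], cpuSeconds[i], r);
        }
    }
}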

Tasks to undertake

Tasks to perform such a test would include:

Decide whether to use internal or external resources to perform the tests, depending on inhouse expertise (or lack thereof)
Gather or elicit performance requirements (specifications) from users and/or business analysts
Develop a high-level plan (or project charter), including requirements, resources, timelines and milestones
Develop a detailed performance test plan (including detailed scenarios and test cases, workloads, environment info, etc.)
Choose test tool(s)
Specify test data needed and charter effort (often overlooked, but often the death of a valid performance test)
Develop proof-of-concept scripts for each application/component under test, using chosen test tools and strategies
Develop detailed performance test project plan, including all dependencies and associated time-lines
Install and configure injectors/controller
Configure the test environment (ideally identical hardware to the production platform), router configuration, quiet network (we don't want results upset by other users), deployment of server instrumentation, database test sets developed, etc.
Execute tests – probably repeatedly (iteratively) in order to see whether any unaccounted for factor might affect the results
Analyze the results – either pass/fail, or investigation of critical path and recommendation of corrective action

Methodology

Performance testing web applications

According to the Microsoft Developer Network, the Performance Testing Methodology (http://msdn2.microsoft.com/en-us/library/bb924376.aspx) consists of the following activities:

Activity 1. Identify the Test Environment. Identify the physical test environment and the production environment as well as the tools and resources available to the test team. The physical environment includes hardware, software, and network configurations. Having a thorough understanding of the entire test environment at the outset enables more efficient test design and planning and helps you identify testing challenges early in the project. In some situations, this process must be revisited periodically throughout the project's life cycle.

Activity 2. Identify Performance Acceptance Criteria. Identify the response time, throughput, and resource utilization goals and constraints. In general, response time is a user concern, throughput is a business concern, and resource utilization is a system concern. Additionally, identify project success criteria that may not be captured by those goals and constraints; for example, using performance tests to evaluate what combination of configuration settings will result in the most desirable performance characteristics.

Activity 3. Plan and Design Tests. Identify key scenarios, determine variability among representative users and how to simulate that variability, define test data, and establish metrics to be collected. Consolidate this information into one or more models of system usage to be implemented, executed, and analyzed.

Activity 4. Configure the Test Environment. Prepare the test environment, tools, and resources necessary to execute each strategy as features and components become available for test. Ensure that the test environment is instrumented for resource monitoring as necessary.

Activity 5. Implement the Test Design. Develop the performance tests in accordance with the test design.

Activity 6. Execute the Test. Run and monitor your tests. Validate the tests, test data, and results collection. Execute validated tests for analysis while monitoring the test and the test environment.

Activity 7. Analyze Results, Tune, and Retest. Analyse, consolidate and share results data. Make a tuning change and retest. Improvement or degradation? Each improvement made will return smaller improvement than the previous improvement. When do you stop? When you reach a CPU bottleneck, the choices then are either improve the code or add more CPU.

See also

Stress testing (software)


Benchmark (computing)

Web server benchmarking

Application Response Measurement

External links

Web Load Testing for Dummies (http://www.gomez.com/ebook-web-load-testing-for-dummies-generic/) (Book, PDF Version)
The Art of Application Performance Testing - O'Reilly ISBN 978-0-596-52066-3 (http://oreilly.com/catalog/9780596520670) (Book)
Performance Testing Guidance for Web Applications (http://msdn2.microsoft.com/en-us/library/bb924375.aspx) (MSDN)
Performance Testing Guidance for Web Applications (http://www.amazon.com/dp/0735625700) (Book)
Performance Testing Guidance for Web Applications (http://www.codeplex.com/PerfTestingGuide/Release/ProjectReleases.aspx?ReleaseId=6690) (PDF)
Performance Testing Guidance (http://www.codeplex.com/PerfTesting) (Online KB)
Enterprise IT Performance Testing (http://www.perftesting.co.uk) (Online KB)
Performance Testing Videos (http://msdn2.microsoft.com/en-us/library/bb671346.aspx) (MSDN)
Open Source Performance Testing tools (http://www.opensourcetesting.org/performance.php)
"User Experience, not Metrics" and "Beyond Performance Testing" (http://www.perftestplus.com/pubs.htm)
"Performance Testing Traps / Pitfalls" (http://www.mercury-consulting-ltd.com/wp/Performance_Testing_Traps.html)

Retrieved from "http://en.wikipedia.org/w/index.php?title=Software_performance_testing&oldid=523786276"

Categories: Software testing Software optimization



Systematic software testing

Peter Sestoft
IT University of Copenhagen, Denmark¹

Version 2, 2008-02-25

This note introduces techniques for systematic functionality testing of software.

Contents

1 Why software testing? 1

2 White-box testing 5

3 Black-box testing 10

4 Practical hints about testing 14

5 Testing in perspective 15

6 Exercises 16

1 Why software testing?

Programs often contain errors (so-called bugs), even though the compiler accepts the program as well-formed: the compiler can detect only errors of form, not of meaning. Many errors and inconveniences in programs are discovered only by accident when the program is being used. However, errors can be found in more systematic and effective ways than by “random experimentation”. This is the goal of software testing.

You may think, why don’t we just fix errors when they are discovered? After all, whatharm can a program do? Consider some effects of software errors:

• In the 1991 Gulf war, some Patriot missiles failed to hit incoming Iraqi Scud missiles, which therefore killed people on the ground. Accumulated rounding errors in the control software’s clocks caused large navigation errors.
• Errors in the software controlling the baggage handling system of Denver International Airport delayed the entire airport’s opening by a year (1994–1995), causing losses of around 360 million dollars. Since September 2005 the computer-controlled baggage system has not been used; manual baggage handling saves one million dollars a month.
• The first launch of the European Ariane 5 rocket failed (1996), causing losses of hundreds of million dollars. The problem was a buffer overflow in control software taken over from Ariane 4. The software had not been re-tested — to save money.

¹ Original 1998 version written for the Royal Veterinary and Agricultural University, Denmark.


• Errors in a new train control system deployed in Berlin (1998) caused train cancellations and delays for weeks.
• Errors in poorly designed control software in the Therac-25 radio-therapy equipment (1987) exposed several cancer patients to heavy doses of radiation, killing some.

A large number of other software-related problems and risks have been recorded by the RISKS digest since 1985, see the archive at http://catless.ncl.ac.uk/risks.

1.1 Syntax errors, semantic errors, and logic errors

A program in Java, or C# or any other language, may contain several kinds of errors:

• syntax errors: the program may be syntactically ill-formed (e.g. contain while x {}, where there are no parentheses around x), so that strictly speaking it is not a Java program at all;
• semantic errors: the program may be syntactically well-formed, but attempt to access non-existing local variables or non-existing fields of an object, or apply operators to the wrong type of arguments (as in true * 2, which attempts to multiply a logical value by a number);
• logical errors: the program may be syntactically well-formed and type-correct, but compute the wrong answer anyway.

Errors of the two former kinds are relatively trivial: the Java compiler javac will automatically discover them and tell us about them. Logical errors (the third kind) are harder to deal with: they cannot be found automatically, and it is our own responsibility to find them, or even better, to convince ourselves that there are none.

In these notes we shall assume that all errors discovered by the compiler have been fixed. We present simple systematic techniques for finding logical errors and thereby making it plausible that the program works as intended (when we can find no more errors).

1.2 Quality assurance and different kinds of testing

Testing fits into the more general context of software quality assurance; but what is software quality? ISO Standard 9126 (2001) distinguishes six quality characteristics of software:

• functionality: does this software do what it is supposed to do; does it work as intended?
• usability: is this software easy to learn and convenient to use?
• efficiency: how much time, memory, and network bandwidth does this software consume?
• reliability: how well does this software deal with wrong inputs, external problems such as network failures, and so on?
• maintainability: how easy is it to find and fix errors in this software?
• portability: how easy is it to adapt this software to changes in its operating environment, and how easy is it to add new functionality?

The present note is concerned only with functionality testing, but note that usability testing and performance testing address quality characteristics number two and three. Reliability can be addressed by so-called stress testing, whereas maintainability and portability are rarely systematically tested.


1.3 Debugging versus functionality testing

The purpose of testing is very different from that of debugging. It is tempting to confuse the two, especially if one mistakenly believes that the purpose of debugging is to remove the last bug from the program. In reality, debugging rarely achieves this.

The real purpose of debugging is diagnosis. After we have observed that the program does not work as intended, we debug it to answer the question: why doesn’t this program work? When we have found out, we modify the program to (hopefully) work as intended.

By contrast, the purpose of functionality testing is to strengthen our belief that the program works as intended. To do this, we systematically try to show that it does not work. If our best efforts fail to show that the program does not work, then we have strengthened our belief that it does work.

Using systematic functionality testing we might find some cases where the program does not work. Then we use debugging to find out why. Then we fix the problem. And then we test again to make sure we fixed the problem without introducing new ones.

1.4 Profiling versus performance testing

The distinction between functionality testing and debugging has a parallel in the distinction between performance testing and profiling. Namely, the purpose of profiling is diagnosis. After we have observed that the program is too slow or uses too much memory, we use profiling to answer the question: why is this program so slow, why does it use so much memory? When we have found out, we modify the program to (hopefully) use less time and memory.

By contrast, the purpose of performance testing is to strengthen our belief that the program is efficient enough. To do this, we systematically measure how much time and memory it uses on different kinds and sizes of inputs. If the measurements show that it is efficient enough for those inputs, then we have strengthened our belief that the program is efficient enough for all relevant inputs.

Using systematic performance testing we might find some cases where the program is too slow. Then we use profiling to find out why. Then we fix the problem. And then we test again to make sure we fixed the problem without introducing new ones.

Schematically, we have:

Purpose \ Quality     Functionality            Efficiency
Diagnosis             Debugging                Profiling
Quality assurance     Functionality testing    Performance testing

1.5 White-box testing versus black-box testing

Two important techniques for functionality testing are white-box testing and black-box testing.

White-box testing, sometimes called structural testing or internal testing, focuses on the text of the program. The tester constructs a test suite (a collection of inputs and corresponding expected outputs) that demonstrates that all branches of the program’s choice and loop constructs — if, while, switch, try-catch-finally, and so on — can be executed. The test suite is said to cover the statements of the program.

Black-box testing, sometimes called external testing, focuses on the problem that the program is supposed to solve; or more precisely, the problem statement or specification for the program. The tester constructs a test data set (inputs and corresponding expected outputs) that includes ‘typical’ as well as ‘extreme’ input data. In particular, one must include inputs that are described as exceptional or erroneous in the problem description.

White-box testing and black-box testing are complementary approaches to test case generation. White-box testing does not focus on the problem area, and therefore may not discover that some subproblem is left unsolved by the program, whereas black-box testing should. Black-box testing does not focus on the program text, and therefore may not discover that some parts of the program are completely useless or have an illogical structure, whereas white-box testing should.

Software testing can never prove that a program contains no errors, but it can strengthen one’s faith in the program. Systematic software testing is necessary if the program will be used by others, if the welfare of humans or animals depends on it (so-called safety-critical software), or if one wants to base scientific conclusions on the program’s results.

1.6 Test coverage

Given that we cannot make a perfect test suite, how do we know when we have a reasonably good one? A standard measure of a test suite’s comprehensiveness is coverage. Here are some notions of coverage, in increasing order of strictness:

• method coverage: does the test suite make sure that every method (including function, procedure, constructor, property, indexer, action listener) gets executed at least once?
• statement coverage: does the test suite make sure that every statement of every method gets executed at least once?
• branch coverage: does the test suite make sure that every transfer of control gets executed at least once?
• path coverage: does the test suite make sure that every execution path through the program gets executed at least once?

Method coverage is the minimum one should expect from a test suite; in principle we know nothing at all about a method that has not been executed by the test suite.

Statement coverage is achieved by the white-box technique described in Section 2, and is often the best coverage one can achieve in practice.
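To see why branch coverage is stricter than statement coverage, consider this small method (a sketch invented here for illustration, not taken from the notes):

class CoverageExample {
    // A single test with x = 2 executes every statement (statement coverage),
    // but never exercises the false outcome of the if; branch coverage
    // additionally requires a test with x = 0.
    static int guardedReciprocal(int x) {
        int result = 1;
        if (x != 0)
            result = 100 / x;
        return result;
    }

    public static void main(String[] args) {
        System.out.println(guardedReciprocal(2));  // covers all statements
        System.out.println(guardedReciprocal(0));  // needed for branch coverage
    }
}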

Branch coverage is more demanding, especially in relation to virtual method calls (so-called virtual dispatch) and exception throwing. Namely, consider a single method call statement a.m() where expression a has type A, and class A has many subclasses A1, A2 and so on, that override method m(). Then to achieve branch coverage, the test suite must make sure that a.m() gets executed for a being an object of class A1, an object of class A2, and so on. Similarly, there is a transfer of control from an exception-throwing statement throw exn to the corresponding exception handler, if any, so to achieve branch coverage, the test suite must make sure that each such statement gets executed in the context of every relevant exception handler.
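A minimal sketch of the virtual dispatch case (the class names A, A1, A2 come from the text above; the method bodies are invented only for illustration):

class A             { int m() { return 0; } }
class A1 extends A  { int m() { return 1; } }
class A2 extends A  { int m() { return 2; } }

class DispatchCoverage {
    static int call(A a) { return a.m(); }   // the call site a.m() from the text

    public static void main(String[] args) {
        // For branch coverage of the call site, the test suite must reach it
        // with a bound to an object of each overriding class.
        System.out.println(call(new A()));
        System.out.println(call(new A1()));
        System.out.println(call(new A2()));
    }
}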

Path coverage is usually impossible to achieve in practice, because any program that contains a loop will usually have an infinite number of possible execution paths.


2 White-box testing

The goal of white-box testing is to make sure that all parts of the program have been executed, for some notion of part, as described in Section 1.6 on test coverage. The approach described in this section gives statement coverage. The resulting test suite includes enough input data sets to make sure that all methods have been called, that both the true and false branches have been executed in if statements, that every loop has been executed zero, one, and more times, that all branches of every switch statement have been executed, and so on. For every input data set, the expected output must be specified also. Then, the program is run with all the input data sets, and the actual outputs are compared to the expected outputs.

White-box testing cannot demonstrate that the program works in all cases, but it is a surprisingly efficient (fast), effective (thorough), and systematic way to discover errors in the program. In particular, it is a good way to find errors in programs with a complicated logic, and to find variables that are initialized with the wrong values.

2.1 Example 1 of white-box testing

The program below receives some integers as argument, and is expected to print out the smallest and the greatest of these numbers. We shall see how one performs a white-box test of the program. (Be forewarned that the program is actually erroneous; is this obvious?)

public static void main ( String[] args )
{
  int mi, ma;
  if (args.length == 0)                            /* 1 */
    System.out.println("No numbers");
  else
  {
    mi = ma = Integer.parseInt(args[0]);
    for (int i = 1; i < args.length; i++)          /* 2 */
    {
      int obs = Integer.parseInt(args[i]);
      if (obs > ma) ma = obs;                      /* 3 */
      else if (mi < obs) mi = obs;                 /* 4 */
    }
    System.out.println("Minimum = " + mi + "; maximum = " + ma);
  }
}

The choice statements are numbered 1–4 in the margin. Number 2 is the for statement. First we construct a table that shows, for every choice statement and every possible outcome, which input data set covers that choice and outcome:


Choice   Outcome           Input property                                    Input data set
1        true              No numbers                                        A
1        false             At least one number                               B
2        zero times        Exactly one number                                B
2        once              Exactly two numbers                               C
2        more than once    At least three numbers                            E
3        true              Number > current maximum                          C
3        false             Number ≤ current maximum                          D
4        true              Number ≤ current maximum and > current minimum    E, 3rd number
4        false             Number ≤ current maximum and ≤ current minimum    E, 2nd number

While constructing the above table, we construct also a table of the input data sets:

Input data set   Input contents   Expected output   Actual output
A                (no numbers)     No numbers        No numbers
B                17               17 17             17 17
C                27 29            27 29             27 29
D                39 37            37 39             39 39
E                49 47 48         47 49             49 49

When running the above program on the input data sets, one sees that the outputs are wrong — they disagree with the expected outputs — for input data sets D and E. Now one may run the program manually on e.g. input data set D, which will lead one to discover that the condition in the program's choice 4 is wrong. When we receive a number which is less than the current minimum, then the variable mi is not updated correctly. The statement should be:

else if (obs < mi) mi = obs; /* 4a */

After correcting the program, it may be necessary to reconstruct the white-box test. It may be very time consuming to go through several rounds of modification and re-testing, so it pays off to make the program correct from the outset! In the present case it suffices to change the comments in the last two lines of the table of choices and outcomes, because all we did was to invert the condition in choice 4:

Choice   Outcome           Input property                                    Input data set
1        true              No numbers                                        A
1        false             At least one number                               B
2        zero times        Exactly one number                                B
2        once              Exactly two numbers                               C
2        more than once    At least three numbers                            E
3        true              Number > current maximum                          C
3        false             Number ≤ current maximum                          D
4a       true              Number ≤ current maximum and < current minimum    E, 2nd number
4a       false             Number ≤ current maximum and ≥ current minimum    E, 3rd number

The input data sets remain the same. The corrected program produced the expected output for all input data sets A–E.
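Re-running the data sets is cheap if the comparison of actual and expected outputs is automated. The sketch below is not part of the notes; it assumes the corrected program is placed in a class named MinMax with the main method shown above, and captures its printed output for comparison:

import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class MinMaxTestDriver {
    public static void main(String[] args) {
        String[][] inputs = {
            {},                      // A
            { "17" },                // B
            { "27", "29" },          // C
            { "39", "37" },          // D
            { "49", "47", "48" }     // E
        };
        String[] expected = {
            "No numbers",
            "Minimum = 17; maximum = 17",
            "Minimum = 27; maximum = 29",
            "Minimum = 37; maximum = 39",
            "Minimum = 47; maximum = 49"
        };
        for (int i = 0; i < inputs.length; i++) {
            PrintStream oldOut = System.out;
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            System.setOut(new PrintStream(buf));
            MinMax.main(inputs[i]);               // run the program on data set A..E (assumed class name)
            System.setOut(oldOut);
            String actual = buf.toString().trim();
            char name = (char) ('A' + i);
            System.out.println((actual.equals(expected[i]) ? "PASS " : "FAIL ") + name
                               + ": expected \"" + expected[i] + "\", got \"" + actual + "\"");
        }
    }
}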


2.2 Example 2 of white-box testing

The program below receives some non-negative numbers as input, and is expected to print out the two smallest of these numbers, or the smallest, in case there is only one. (Is this problem statement unambiguous?) This program, too, is erroneous; can you find the problem?

public static void main ( String[] args )
{
  int mi1 = 0, mi2 = 0;
  if (args.length == 0)                            /* 1 */
    System.out.println("No numbers");
  else
  {
    mi1 = Integer.parseInt(args[0]);
    if (args.length == 1)                          /* 2 */
      System.out.println("Smallest = " + mi1);
    else
    {
      int obs = Integer.parseInt(args[1]);
      if (obs < mi1)                               /* 3 */
      { mi2 = mi1; mi1 = obs; }
      for (int i = 2; i < args.length; i++)        /* 4 */
      {
        obs = Integer.parseInt(args[i]);
        if (obs < mi1)                             /* 5 */
        { mi2 = mi1; mi1 = obs; }
        else if (obs < mi2)                        /* 6 */
          mi2 = obs;
      }
      System.out.println("The two smallest are " + mi1 + " and " + mi2);
    }
  }
}

As before we tabulate the program’s choices 1–6 and their possible outcomes:

Choice   Outcome           Input property                                       Input data set
1        true              No numbers                                           A
1        false             At least one number                                  B
2        true              Exactly one number                                   B
2        false             At least two numbers                                 C
3        false             Second number ≥ first number                         C
3        true              Second number < first number                         D
4        zero times        Exactly two numbers                                  D
4        once              Exactly three numbers                                E
4        more than once    At least four numbers                                H
5        true              Third number < current minimum                       E
5        false             Third number ≥ current minimum                       F
6        true              Third number ≥ current minimum and < second least    F
6        false             Third number ≥ current minimum and ≥ second least    G


The corresponding input data sets might be:

Input data set   Contents       Expected output   Actual output
A                (no numbers)   No numbers        No numbers
B                17             17                17
C                27 29          27 29             27 0
D                39 37          37 39             37 39
E                49 48 47       47 48             47 48
F                59 57 58       57 58             57 58
G                67 68 69       67 68             67 0
H                77 78 79 76    76 77             76 77

Running the program with these test data, it turns out that data set C produces wrong results: 27 and 0. Looking at the program text, we see that this is because variable mi2 retains its initial value, namely, 0. The program must be fixed by inserting an assignment mi2 = obs just before the line labelled 3. We do not need to change the white-box test, because no choice statements were added or changed. The corrected program produces the expected output for all input data sets A–H.

Note that if the variable declaration had not been initialized with mi2 = 0, the Java compiler would have complained that mi2 might be used before its first assignment. If so, the error would have been detected even without testing.

This is not the case in many other current programming languages (e.g. C, C++, Fortran), where one may well use an uninitialized variable — its value is just whatever happens to be at that location in the computer's memory. The error may even go undetected by testing, when the value of mi2 equals the expected answer by accident. This is more likely than it may sound, if one runs the same (C, C++, Fortran) program on several input data sets, and the same data values are used in several data sets. Therefore it is a good idea to choose different data values in the data sets, as done above.

2.3 Summary, white-box testing

Program statements should be tested as follows:

Statement           Cases to test
if                  Condition false and true
for                 Zero, one, and more than one iterations
while               Zero, one, and more than one iterations
do-while            One, and more than one, iterations
switch              Every case and default branch must be executed
try-catch-finally   The try clause, every catch clause, and the finally clause must be executed

A conditional expression such as (x != 0 ? 1000/x : 1) must be tested for the condition (x != 0) being true and being false, so that both alternatives have been evaluated.


Short-cut logical operators such as (x != 0) && (1000/x > y) must be tested for all possible combinations of the truth values of the operands. That is,

(x != 0)   (1000/x > y)
false      (not evaluated)
true       false
true       true

Note that the second operand in a short-cut (lazy) conjunction will be computed only if the first operand is true (in Java, C#, C, and C++). This is important, for instance, when the condition is (x != 0) && (1000/x > y), where the second operand cannot be computed if the first one is false, that is, if x == 0. Therefore it makes no sense to require that the combinations (false, false) and (false, true) be tested.

In a short-cut disjunction (x == 0) || (1000/x > y) it holds, dually, that the second operand is computed only if the first one is false. Therefore, in this case too there are only three possible combinations:

(x == 0)   (1000/x > y)
true       (not evaluated)
false      false
false      true
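As a small illustration (the helper method check is invented for this sketch), the three feasible combinations for the short-cut conjunction can be exercised like this; the disjunction is tested dually:

public class ShortCircuitTest {
    static boolean check(int x, int y) { return (x != 0) && (1000 / x > y); }

    public static void main(String[] args) {
        // first operand false: the second operand is never evaluated, so no division by zero
        System.out.println(check(0, 5));      // expected: false
        // first operand true, second operand false
        System.out.println(check(10, 500));   // expected: false (1000/10 = 100, not > 500)
        // first operand true, second operand true
        System.out.println(check(10, 50));    // expected: true  (100 > 50)
    }
}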

Methods   The test suite must make sure that all methods have been executed. For recursive methods one should also test the case where the method calls itself.
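For example (a minimal sketch with an invented factorial method, not taken from the notes), a test of a recursive method should include an input that hits the base case and one that forces at least one recursive call:

public class FactorialTest {
    static long fact(int n) { return n == 0 ? 1 : n * fact(n - 1); }

    public static void main(String[] args) {
        System.out.println(fact(0) == 1);    // base case: no recursive call
        System.out.println(fact(4) == 24);   // recursive case: the method calls itself
    }
}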

The test data sets are presented conveniently by two tables, as demonstrated in this section. One table presents, for each statement, what data sets are used, and which property of the input is demonstrated by the test. The other table presents the actual contents of the data sets, and the corresponding expected output.


3 Black-box testing

The goal of black-box testing is to make sure that the program solves the problem it is supposed to solve; to make sure that it works. Thus one must have a fairly precise idea of the problem that the program must solve, but in principle one does not need the program text when designing a black-box test. Test data sets (with corresponding expected outputs) must be created to cover 'typical' as well as 'extreme' input values, and also inputs that are described as exceptional cases or illegal cases in the problem statement. Examples:

• In a program to compute the sum of a sequence of numbers, the empty sequence will be an extreme, but legal, input (with sum 0).
• In a program to compute the average of a sequence of numbers, the empty sequence will be an extreme, and illegal, input. The program should give an error message for this input, as one cannot compute the average of no numbers.

One should avoid creating a large collection of input data sets, 'just to be on the safe side'. Instead, one must carefully consider what inputs might reveal problems in the program, and use exactly those. When preparing a black-box test, the task is to find errors in the program; thus destructive thinking is required. As we shall see below, this is just as demanding as programming, that is, as constructive thinking.

3.1 Example 1 of black-box testing

Problem: Given a (possibly empty) sequence of numbers, find the smallest and the greatest of these numbers.

This is the same problem as in Section 2.1, but now the point of departure is the above problem statement, not any particular program which claims to solve the problem.

First we consider the problem statement. We note that an empty sequence does not contain a smallest or greatest number. Presumably, the program must give an error message if presented with an empty sequence of numbers.

The black-box test might consist of the following input data sets: An empty sequence (A). A non-empty sequence can have one element (B), or two or more elements. In a sequence with two elements, the elements can be equal (C1), or different, the smallest one first (C2) or the greatest one first (C3). If there are more than two elements, they may appear in increasing order (D1), decreasing order (D2), with the greatest element in the middle (D3), or with the smallest element in the middle (D4). All in all we have these cases:

Input property                           Input data set
No numbers                               A
One number                               B
Two numbers, equal                       C1
Two numbers, increasing                  C2
Two numbers, decreasing                  C3
Three numbers, increasing                D1
Three numbers, decreasing                D2
Three numbers, greatest in the middle    D3
Three numbers, smallest in the middle    D4


The choice of these input data sets is not arbitrary. It is influenced by our own ideas about how the problem might be solved by a program, and in particular how it might be solved the wrong way. For instance, the programmer might have forgotten that the sequence could be empty, or that the smallest number equals the greatest number if there is only one number, etc.

The choice of input data sets may be criticized. For instance, it is not obvious that data set C1 is needed. Could the problem really be solved (wrongly) in a way that would be discovered by C1, but not by any of the other input data sets?

The data sets C2 and C3 check that the program does not just answer by returning the first (or last) number from the input sequence; this is a relevant check. The data sets D3 and D4 check that the program does not just compare the first and the last number; it is less clear that this is relevant.

Input data set   Contents       Expected output   Actual output
A                (no numbers)   Error message
B                17             17 17
C1               27 27          27 27
C2               35 36          35 36
C3               46 45          45 46
D1               53 55 57       53 57
D2               67 65 63       63 67
D3               73 77 75       73 77
D4               89 83 85       83 89

3.2 Example 2 of black-box testing

Problem: Given a (possibly empty) sequence of numbers, find the greatest difference between two consecutive numbers.

We shall design a black-box test for this problem. First we note that if there is only zero or one number, then there are no two consecutive numbers, and the greatest difference cannot be computed. Presumably, an error message must be given in this case. Furthermore, it is unclear whether the 'difference' is signed (possibly negative) or absolute (always non-negative). Here we assume that only the absolute difference should be taken into account, so that the difference between 23 and 29 is the same as that between 29 and 23.

This gives rise to at least the following input data sets: no numbers (A), exactly one number (B), exactly two numbers. Two numbers may be equal (C1), or different, in increasing order (C2) or decreasing order (C3). When there are three numbers, the difference may be increasing (D1) or decreasing (D2). That is:

Input property                           Input data set
No numbers                               A
One number                               B
Two numbers, equal                       C1
Two numbers, increasing                  C2
Two numbers, decreasing                  C3
Three numbers, increasing difference     D1
Three numbers, decreasing difference     D2


The data sets and their expected outputs might be:

Input data set   Contents       Expected output   Actual output
A                (no numbers)   Error message
B                17             Error message
C1               27 27          0
C2               36 37          1
C3               48 46          2
D1               57 56 59       3
D2               69 65 67       4

One might consider whether there should be more variants of each of D1 and D2, in which the three numbers would appear in increasing order (56,57,59), or decreasing (59,58,56), or increasing and then decreasing (56,57,55), or decreasing and then increasing (56,57,59). Although these data sets might reveal errors that the above data sets would not, they do appear more contrived. However, this shows that black-box testing may be carried on indefinitely: you will never be sure that all possible errors have been detected.

3.3 Example 3 of black-box testing

Problem: Given a day of the month day and a month mth, decide whether they determine a legal date in a non-leap year. For instance, 31/12 (the 31st day of the 12th month) and 31/8 are both legal, whereas 29/2 and 1/13 are not. The day and month are given as integers, and the program must respond with Legal or Illegal.

To simplify the test suite, one may assume that if the program classifies e.g. 1/4 and 30/4 as legal dates, then it will consider 17/4 and 29/4 legal, too. Correspondingly, one may assume that if the program classifies 31/4 as illegal, then also 32/4, 33/4, and so on. There is no guarantee that these assumptions actually hold; the program may be written in a contorted and silly way. Assumptions such as these should be written down along with the test suite.

Under those assumptions one may test only 'extreme' cases, such as 0/4, 1/4, 30/4, and 31/4, for which the expected outputs are Illegal, Legal, Legal, and Illegal.


Contents   Expected output   Actual output
0 1        Illegal
1 0        Illegal
1 1        Legal
31 1       Legal
32 1       Illegal
28 2       Legal
29 2       Illegal
31 3       Legal
32 3       Illegal
30 4       Legal
31 4       Illegal
31 5       Legal
32 5       Illegal
30 6       Legal
31 6       Illegal
31 7       Legal
32 7       Illegal
31 8       Legal
32 8       Illegal
30 9       Legal
31 9       Illegal
31 10      Legal
32 10      Illegal
30 11      Legal
31 11      Illegal
31 12      Legal
32 12      Illegal
1 13       Illegal

It is clear that the black-box test becomes rather large and cumbersome. In fact it is just as long as a program that solves the problem! To reduce the number of data sets, one might consider just some extreme values, such as 0/1, 1/0, 1/1, 31/12 and 32/12; some exceptional values around February, such as 28/2, 29/2 and 1/3, and a few typical cases, such as 30/4, 31/4, 31/8 and 32/8. But that would weaken the test a little: it would not discover whether the program mistakenly believes that June (not July) has 31 days.
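A table this size is easier to run and maintain when it is encoded as data and checked in a loop. The sketch below is not part of the notes; the class Dates and its method isLegal(day, mth), returning true for legal dates, are assumed helpers named only for illustration:

public class DateTest {
    public static void main(String[] args) {
        // Each row: { day, month, expected (1 = Legal, 0 = Illegal) }, taken from the table above.
        int[][] cases = {
            {0,1,0}, {1,0,0}, {1,1,1}, {31,1,1}, {32,1,0},
            {28,2,1}, {29,2,0}, {31,3,1}, {32,3,0}, {30,4,1}, {31,4,0},
            {31,5,1}, {32,5,0}, {30,6,1}, {31,6,0}, {31,7,1}, {32,7,0},
            {31,8,1}, {32,8,0}, {30,9,1}, {31,9,0}, {31,10,1}, {32,10,0},
            {30,11,1}, {31,11,0}, {31,12,1}, {32,12,0}, {1,13,0}
        };
        for (int[] c : cases) {
            boolean expected = c[2] == 1;
            boolean actual = Dates.isLegal(c[0], c[1]);   // assumed helper under test
            System.out.println((actual == expected ? "PASS " : "FAIL ")
                               + c[0] + "/" + c[1] + " expected " + (expected ? "Legal" : "Illegal"));
        }
    }
}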


4 Practical hints about testing

• Avoid test cases where the expected output is zero. In Java and C#, static and non-static fields in classes automatically get initialized to 0. The actual output may therefore equal the expected output by accident.
• In languages such as C, C++ and Fortran, where variables are not initialized automatically, testing will not necessarily reveal uninitialized variables. The accidental value of an uninitialized variable may happen to equal the expected output. This is not unlikely, if one uses the same input data in several test cases. Therefore, choose different input data in different test cases, as done in the preceding sections.
• Automate the test, if at all possible. Then it can conveniently be rerun whenever the program has been modified. This is usually done as so-called unit tests. For Java, the JUnit framework from www.junit.org is a widely used tool, well supported by integrated development environments such as BlueJ and Eclipse. For C#, the NUnit framework from www.nunit.org is widely used. Microsoft's Visual Studio Team System also contains unit test facilities. (A minimal sketch of such a unit test follows this list.)
• As mentioned in Section 3 one should avoid creating an excessively large test suite that has redundant test cases. Software evolves over time, and the test suite must evolve together with the software. For instance, if you decide to change a method in your software so that it returns a different result for certain inputs, then you must look at all test cases for that method to see whether they are still relevant and correct; in that situation it is unpleasant to discover that the same functionality is tested by 13 different test cases. A test suite is a piece of software too, and should have no superfluous parts.
• When testing programs that have graphical user interfaces with menus, buttons, and so on, one must describe carefully step by step what actions — menu choices, mouse clicks, and so on — the tester must perform, and what the program's expected reactions are. Clearly, this is cumbersome and expensive to carry out manually, so professional software houses use various tools to simulate user actions.
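A minimal JUnit 4 sketch of such an automated test. The class Diff and its method maxDiff are assumed helpers that would implement the problem from Section 3.2 (greatest absolute difference between consecutive numbers); the expected values come from the table in that section:

import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class DiffTest {
    @Test public void twoEqualNumbers()       { assertEquals(0, Diff.maxDiff(new int[] { 27, 27 })); }
    @Test public void twoIncreasingNumbers()  { assertEquals(1, Diff.maxDiff(new int[] { 36, 37 })); }
    @Test public void twoDecreasingNumbers()  { assertEquals(2, Diff.maxDiff(new int[] { 48, 46 })); }
    @Test public void increasingDifferences() { assertEquals(3, Diff.maxDiff(new int[] { 57, 56, 59 })); }
    @Test public void decreasingDifferences() { assertEquals(4, Diff.maxDiff(new int[] { 69, 65, 67 })); }
}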


5 Testing in perspective

• Testing can never prove that a program has no errors, but it can considerably improve the confidence one has in its results.
• Often it is easier to design a white-box test suite than a black-box one, because one can proceed systematically on the basis of the program text. Black-box testing requires more guesswork about the possible workings of the program, but can make sure that the program does what is required by the problem statement.
• It is a good idea to design a black-box test at the same time you write the program. This reveals unclarities and subtle points in the problem statement, so that you can take them into account while writing the program — instead of having to fix the program later.
• Writing the test cases and the documentation at the same time is also valuable. When attempting to write a test case, one often realizes what information users of a method or class will be looking for in the documentation. Conversely, when one makes a claim ('when n+i>arr.length, then FooException is thrown') about the behaviour of a class or method in the documentation, that should lead to one or more test cases that check this claim.
• If you further use unit test tools to automate the test, you can actually implement the tests before you implement the corresponding functionality. Then you can more confidently implement the functionality and measure your implementation progress by the number of test cases that succeed. This is called test-driven development.
• From the tester's point of view, testing is successful if it does find errors in the program; in this case it was clearly not a waste of time to do the test. From the programmer's point of view the opposite holds: hopefully the test will not find errors in the program. When the tester and the programmer are one and the same person, then there is a psychological conflict: one does not want to admit to making mistakes, neither when programming nor when designing test suites.
• It is a useful exercise to design a test suite for a program written by someone else. This is a kind of game: the goal of the programmer is to write a program that contains no errors; the goal of the tester is to find the errors in the program anyway.
• It takes much time to design a test suite. One learns to avoid needless choice statements when programming, because this reduces the number of test cases in the white-box test. It also leads to simpler programs that usually are more general and easier to understand.2
• It is not unusual for a test suite to be as large as the software it tests. The C5 Generic Collection Library for C#/.NET (http://www.itu.dk/research/c5) implementation has 27,000 lines of code, and its unit test has 28,000 lines.
• How much testing is needed? The effort spent on testing should be correlated with the consequences of possible program errors. A program used just once for computing one's taxes needs no testing. However, a program must be tested if errors could affect the safety of people or animals, or could cause considerable economic losses. If scientific conclusions will be drawn from the outputs of a program, then it must be tested too.

2 A program may be hard to understand even when it has no choice statements; see Exercises 10 and 11.


6 Exercises

1. Problem: Given a sequence of integers, find their average.
   Use black-box techniques to construct a test suite for this problem.

2. Write a program to solve the problem from Exercise 1. The program should take its input from the command line. Run the test suite you made.

3. Use white-box techniques to construct a test suite for the program written in Exercise 2, and run it.

4. Problem: Given a sequence of numbers, decide whether they are sorted in increasing order. For instance, 17 18 18 22 is sorted, but 17 18 19 18 is not. The result must be Sorted or Not sorted.
   Use black-box techniques to construct a test suite for this problem.

5. Write a program that solves the problem from Exercise 4. Run the test suite you made.

6. Use white-box techniques to construct a test suite for the program written in Exercise 5. Run it.

7. Write a program to decide whether a given (day, month) pair in a non-leap year is legal, as discussed in Section 3.3. Run your program with the (black-box) test suite given there.

8. Use white-box techniques to construct a test suite for the program written in Exercise 7. Run it.

9. Problem: Given a (day, month) pair, compute the number of the day in a non-leap year. For instance, (1,1) is number 1; (1,2), which means 1 February, is number 32; (1,3) is number 60; and (31,12) is number 365. This is useful for computing the distance between two dates, e.g. the length of a course, the duration of a bank deposit, or the time from sowing to harvest. The date and month can be assumed legal for a non-leap year.
   Use black-box techniques to construct a test suite for this problem.

10. We claim that this Java method solves the problem from Exercise 9.

static int dayno(int day, int mth)
{
  int m = (mth+9)%12;
  return (m/5*153+m%5*30+(m%5+1)/2+59)%365+day;
}
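As a quick sanity check (a sketch added here for illustration, not part of the exercise), the sample values from Exercise 9 can be verified before running the full test suite:

public class DaynoCheck {
    public static void main(String[] args) {
        System.out.println(dayno(1, 1));    // expected 1
        System.out.println(dayno(1, 2));    // expected 32
        System.out.println(dayno(1, 3));    // expected 60
        System.out.println(dayno(31, 12));  // expected 365
    }

    // The claimed method from Exercise 10, copied verbatim so this sketch is self-contained.
    static int dayno(int day, int mth) {
        int m = (mth + 9) % 12;
        return (m / 5 * 153 + m % 5 * 30 + (m % 5 + 1) / 2 + 59) % 365 + day;
    }
}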

Test this method with the black-box test suite you made above.

11. Use white-box techniques to construct a test suite for the method shown in Exercise 10. This appears trivial and useless, since there are no choice statements in the program at all. Instead one may consider jumps (discontinuities) in the processing of data. In particular, integer division (/) and remainder (%) produce jumps of this sort. For mth < 3 we have m = (mth + 9) mod 12 = mth + 9, and for mth ≥ 3 we have m = (mth + 9) mod 12 = mth − 3. Thus there is a kind of hidden choice when going from mth = 2 to mth = 3. Correspondingly for m / 5 and (m % 5 + 1) / 2. This can be used for choosing test cases for a white-box test. Do that.

12. Consider a method String toRoman(int n) that is supposed to convert a positive integer to the Roman numeral representing that integer, using the symbols I = 1, V = 5, X = 10, L = 50, C = 100, D = 500 and M = 1000. The following rules determine the Roman numeral corresponding to a positive number:


• In general, the symbols of a Roman numeral are added together from left to right, so II = 2, XX = 20, XXXI = 31, and MMVIII = 2008.
• The symbols I, X and C may appear up to three times in a row; the symbol M may appear any number of times; and the symbols V, L and D cannot be repeated.
• When a lesser symbol appears before a greater one, the lesser symbol is subtracted, not added. So IV = 4, IX = 9, XL = 40 and CM = 900. The symbol I may appear once before V and X; the symbol X may appear once before L and C; the symbol C may appear once before D and M; and the symbols V, L and D cannot appear before a greater symbol. So 45 is written XLV, not VL; and 49 is written XLIX, not IL; and 1998 is written MCMXCVIII, not IIMM.

Exercise: use black-box techniques to construct a test suite for the method toRoman. This can be done in two ways. The simplest way is to call toRoman(n) for suitably chosen numbers n and check that it returns the expected string. The more ambitious way is to implement (and test!) the method fromRoman described in Exercise 13 below, and use that to check toRoman.

13. Consider a method int fromRoman(String s) with this specification: The method checks that string s is a well-formed Roman numeral according to the rules in Exercise 12, and if so, returns the corresponding number; otherwise throws an exception. Use black-box techniques to construct a test suite for this method. Remember to include also some ill-formed Roman numerals.


Usability testing
From Wikipedia, the free encyclopedia

Usability testing is a technique used in user-centered interaction design to evaluate a product by testing it on users. This can be seen as an irreplaceable usability practice, since it gives direct input on how real users use the system.[1] This is in contrast with usability inspection methods where experts use different methods to evaluate a user interface without involving users.

Usability testing focuses on measuring a human-made product's capacity to meet its intended purpose. Examples of products that commonly benefit from usability testing are foods, consumer products, web sites or web applications, computer interfaces, documents, and devices. Usability testing measures the usability, or ease of use, of a specific object or set of objects, whereas general human-computer interaction studies attempt to formulate universal principles.

Contents

1 History of usability testing
2 Goals of usability testing
3 What usability testing is not
4 Methods
  4.1 Hallway testing
  4.2 Remote Usability Testing
  4.3 Expert review
  4.4 Automated expert review
5 How many users to test?
6 See also
7 References
8 External links

History of usability testing

Henry Dreyfuss in the late 1940s contracted to design the state rooms for the twin ocean liners "Independence" and "Constitution." He built eight prototype staterooms and installed them in a warehouse. He then brought in a series of travelers to "live" in the rooms for a short time, bringing with them all items they would normally take when cruising. His people were able to discover over time, for example, if there was space for large steamer trunks, if light switches needed to be added beside the beds to prevent injury, etc., before hundreds of state rooms had been built into the ship.[2]

A Xerox Palo Alto Research Center (PARC) employee wrote that PARC used extensive usability testing in creating the Xerox Star, introduced in 1981.[3]

The book Inside Intuit says (page 22, 1984), "... in the first instance of the Usability Testing that later became standard industry practice, LeFevre recruited people off the streets... and timed their Kwik-Chek (Quicken) usage with a stopwatch. After every test... programmers worked to improve the program."[4] Scott Cook, Intuit co-founder, said, "... we did usability testing in 1984, five years before anyone else... there's a very big difference between doing it and having marketing people doing it as part of their... design... a very big difference between doing it and having it be the core of what engineers focus on."[5]

Goals of usability testing

Usability testing is a black-box testing technique. The aim is to observe people using the product to discover errors and areas of improvement. Usability testing generally involves measuring how well test subjects respond in four areas: efficiency, accuracy, recall, and emotional response. The results of the first test can be treated as a baseline or control measurement; all subsequent tests can then be compared to the baseline to indicate improvement.

Efficiency -- How much time, and how many steps, are required for people to complete basic tasks? (For example, find something to buy, create a new account, and order the item.)
Accuracy -- How many mistakes did people make? (And were they fatal or recoverable with the right information?)
Recall -- How much does the person remember afterwards or after periods of non-use?
Emotional response -- How does the person feel about the tasks completed? Is the person confident, stressed? Would the user recommend this system to a friend?

To assess the usability of the system under usability testing, quantitative and/or qualitative usability goals (also called usability requirements[6]) have to be defined beforehand.[7][6][8] If the results of the usability testing meet the usability goals, the system can be considered as usable for the end-users whose representatives have tested it.

What usability testing is not

Simply gathering opinions on an object or document is market research or qualitative research rather than usability testing. Usability testing usually involves systematic observation under controlled conditions to determine how well people can use the product.[9] However, often both qualitative and usability testing are used in combination, to better understand users' motivations/perceptions, in addition to their actions.

Rather than showing users a rough draft and asking, "Do you understand this?", usability testing involves watching people trying to use something for its intended purpose. For example, when testing instructions for assembling a toy, the test subjects should be given the instructions and a box of parts and, rather than being asked to comment on the parts and materials, they are asked to put the toy together. Instruction phrasing, illustration quality, and the toy's design all affect the assembly process.

Methods

Setting up a usability test involves carefully creating a scenario, or realistic situation, wherein the person performs a list of tasks using the product being tested while observers watch and take notes. Several other test instruments such as scripted instructions, paper prototypes, and pre- and post-test questionnaires are also used to gather feedback on the product being tested. For example, to test the attachment function of an e-mail program, a scenario would describe a situation where a person needs to send an e-mail attachment, and ask him or her to undertake this task. The aim is to observe how people function in a realistic manner, so that developers can see problem areas, and what people like. Techniques popularly used to gather data during a usability test include think aloud protocol, Co-discovery Learning and eye tracking.

Hallway testing


Hallway testing (or Hall Intercept Testing) is a general methodology of usability testing. Rather than using an in-house, trained group of testers, just five to six random people are brought in to test the product, or service. The name of the technique refers to the fact that the testers should be random people who pass by in the hallway.[10]

Hallway testing is particularly effective in the early stages of a new design when the designers are looking for "brick walls," problems so serious that users simply cannot advance. Anyone of normal intelligence other than designers and engineers can be used at this point. (Both designers and engineers immediately turn from being test subjects into being "expert reviewers." They are often too close to the project, so they already know how to accomplish the task, thereby missing ambiguities and false paths.)

Remote Usability Testing

In a scenario where usability evaluators, developers and prospective users are located in different countries and time zones, conducting a traditional lab usability evaluation creates challenges both from the cost and logistical perspectives. These concerns led to research on remote usability evaluation, with the user and the evaluators separated over space and time. Remote testing, which facilitates evaluations being done in the context of the user's other tasks and technology, can be either synchronous or asynchronous. Synchronous usability testing methodologies involve video conferencing or employ remote application sharing tools such as WebEx. The former involves real time one-on-one communication between the evaluator and the user, while the latter involves the evaluator and user working separately.[11]

Asynchronous methodologies include automatic collection of user's click streams, user logs of critical incidents that occur while interacting with the application and subjective feedback on the interface by users.[12] Similar to an in-lab study, an asynchronous remote usability test is task-based and the platforms allow you to capture clicks and task times. Hence, for many large companies this allows you to understand the WHY behind the visitors' intents when visiting a website or mobile site. Additionally, this style of user testing also provides an opportunity to segment feedback by demographic, attitudinal and behavioural type. The tests are carried out in the user's own environment (rather than labs) helping further simulate real-life scenario testing. This approach also provides a vehicle to easily solicit feedback from users in remote areas quickly and with lower organisational overheads.

Numerous tools are available to address the needs of both these approaches. WebEx and Go-to-meeting are the most commonly used technologies to conduct a synchronous remote usability test.[13] However, synchronous remote testing may lack the immediacy and sense of "presence" desired to support a collaborative testing process. Moreover, managing inter-personal dynamics across cultural and linguistic barriers may require approaches sensitive to the cultures involved. Other disadvantages include having reduced control over the testing environment and the distractions and interruptions experienced by the participants in their native environment.[14] One of the newer methods developed for conducting a synchronous remote usability test is by using virtual worlds.[15]

Expert review

Expert review is another general method of usability testing. As the name suggests, this method relies on bringing in experts with experience in the field (possibly from companies that specialize in usability testing) to evaluate the usability of a product.

Automated expert review

Similar to expert reviews, automated expert reviews provide usability testing but through the use of programs given rules for good design and heuristics. Though an automated review might not provide as much detail and insight as reviews from people, they can be finished more quickly and consistently. The idea of creating surrogate users for usability testing is an ambitious direction for the Artificial Intelligence community.

How many users to test?

In the early 1990s, Jakob Nielsen, at that time a researcher at Sun Microsystems, popularized the concept of using numerous small usability tests, typically with only five test subjects each, at various stages of the development process. His argument is that, once it is found that two or three people are totally confused by the home page, little is gained by watching more people suffer through the same flawed design. "Elaborate usability tests are a waste of resources. The best results come from testing no more than five users and running as many small tests as you can afford."[10] Nielsen subsequently published his research and coined the term heuristic evaluation.

The claim of "Five users is enough" was later described by a mathematical model[16] which states, for the proportion U of uncovered problems:

U = 1 − (1 − p)^n

where p is the probability of one subject identifying a specific problem and n the number of subjects (or test sessions). This model shows up as an asymptotic graph towards the number of real existing problems.
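A small sketch of what this model predicts; the detection probability p = 0.31 is the average value reported by Nielsen and Landauer and is used here only as an illustrative assumption, not as a property of any particular product:

public class FiveUsersModel {
    public static void main(String[] args) {
        double p = 0.31;  // assumed average probability that one test subject finds a given problem
        for (int n = 1; n <= 15; n++) {
            double found = 1 - Math.pow(1 - p, n);  // U = 1 - (1 - p)^n
            System.out.printf("n = %2d test subjects: %3.0f%% of problems found%n", n, 100 * found);
        }
    }
}

With these numbers, five test subjects are predicted to find roughly 85 percent of the problems, which is the figure usually quoted for the "five users" rule of thumb.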

In later research Nielsen's claim has eagerly been questioned with both empirical evidence[17] and more advanced mathematical models.[18] Two key challenges to this assertion are:

1. Since usability is related to the specific set of users, such a small sample size is unlikely to be representative of the total population, so the data from such a small sample is more likely to reflect the sample group than the population they may represent.

2. Not every usability problem is equally easy to detect. Intractable problems happen to decelerate the overall process. Under these circumstances the progress of the process is much shallower than predicted by the Nielsen/Landauer formula.[19]

It is worth noting that Nielsen does not advocate stopping after a single test with five users; his point is that testing with five users, fixing the problems they uncover, and then testing the revised site with five different users is a better use of limited resources than running a single usability test with 10 users. In practice, the tests are run once or twice per week during the entire development cycle, using three to five test subjects per round, and with the results delivered within 24 hours to the designers. The number of users actually tested over the course of the project can thus easily reach 50 to 100 people.

In the early stage, when users are most likely to immediately encounter problems that stop them in their tracks, almost anyone of normal intelligence can be used as a test subject. In stage two, testers will recruit test subjects across a broad spectrum of abilities. For example, in one study, experienced users showed no problem using any design, from the first to the last, while naive users and self-identified power users both failed repeatedly.[20]

Later on, as the design smooths out, users should be recruited from the target population.

When the method is applied to a sufficient number of people over the course of a project, the objections raised above become addressed: The sample size ceases to be small and usability problems that arise with only occasional users are found. The value of the method lies in the fact that specific design problems, once encountered, are never seen again because they are immediately eliminated, while the parts that appear successful are tested over and over. While it's true that the initial problems in the design may be tested by only five users, when the method is properly applied, the parts of the design that worked in that initial test will go on to be tested by 50 to 100 people.

See also

ISO 9241
Software testing
Educational technology
Universal usability
Commercial eye tracking
Don't Make Me Think
Software performance testing
System Usability Scale (SUS)
Test method
Tree testing
RITE Method
Component-Based Usability Testing
Crowdsource testing
Usability goals

References

1. ^ Nielsen, J. (1994). Usability Engineering, Academic Press Inc, p. 165.
2. ^ NN/G Usability Week 2011 Conference "Interaction Design" Manual, Bruce Tognazzini, Nielsen Norman Group, 2011.
3. ^ http://interactions.acm.org/content/XV/baecker.pdf
4. ^ http://books.google.com/books?id=lRs_4U43UcEC&printsec=frontcover&sig=ACfU3U1xvA7-f80TP9Zqt9wkB9adVAqZ4g#PPA22,M1
5. ^ http://news.zdnet.co.uk/itmanagement/0,1000000308,2065537,00.htm
6. ^ a b International Standardization Organization. Ergonomics of human system interaction - Part 210: Human-centred design for interactive systems (Rep. N°9241-210). 2010, International Standardization Organization.
7. ^ Nielsen, Usability Engineering, 1994.
8. ^ Mayhew. The usability engineering lifecycle: a practitioner's handbook for user interface design. London, Academic Press; 1999.
9. ^ http://jerz.setonhill.edu/design/usability/intro.htm
10. ^ a b "Usability Testing with 5 Users (Jakob Nielsen's Alertbox)" (http://www.useit.com/alertbox/20000319.html). useit.com. 13.03.2000. References Jakob Nielsen, Thomas K. Landauer (April 1993). "A mathematical model of the finding of usability problems" (http://dl.acm.org/citation.cfm?id=169166). Proceedings of ACM INTERCHI'93 Conference (Amsterdam, The Netherlands, 24-29 April 1993).
11. ^ Andreasen, Morten Sieker; Nielsen, Henrik Villemann; Schrøder, Simon Ormholt; Stage, Jan (2007). "What happened to remote usability testing?". Proceedings of the SIGCHI conference on Human factors in computing systems - CHI '07. p. 1405. doi:10.1145/1240624.1240838. ISBN 9781595935939.
12. ^ Dray, Susan; Siegel, David (2004). "Remote possibilities?". Interactions 11 (2): 10. doi:10.1145/971258.971264.
13. ^ http://www.boxesandarrows.com/view/remote_online_usability_testing_why_how_and_when_to_use_it
14. ^ Dray, Susan; Siegel, David (March 2004). "Remote possibilities?: international usability testing at a distance". Interactions 11 (2): 10–17. doi:10.1145/971258.971264.
15. ^ Chalil Madathil, Kapil; Joel S. Greenstein (May 2011). "Synchronous remote usability testing: a new approach facilitated by virtual worlds". Proceedings of the 2011 annual conference on Human factors in computing systems. CHI '11: 2225–2234. doi:10.1145/1978942.1979267. ISBN 9781450302289.
16. ^ Virzi, R.A., Refining the Test Phase of Usability Evaluation: How Many Subjects is Enough? Human Factors, 1992. 34(4): p. 457-468.
17. ^ http://citeseer.ist.psu.edu/spool01testing.html
18. ^ Caulton, D.A., Relaxing the homogeneity assumption in usability testing. Behaviour & Information Technology, 2001. 20(1): p. 1-7.
19. ^ Schmettow, Heterogeneity in the Usability Evaluation Process. In: M. England, D. & Beale, R. (ed.), Proceedings of the HCI 2008, British Computing Society, 2008, 1, 89-98.
20. ^ Bruce Tognazzini. "Maximizing Windows" (http://www.asktog.com/columns/000maxscrns.html).

External links

Usability.gov (http://www.usability.gov/)

A Brief History of the Magic Number 5 in Usability Testing (http://www.measuringusability.com/blog/five-history.php)

Retrieved from "http://en.wikipedia.org/w/index.php?title=Usability_testing&oldid=519424139"

Categories: Usability Software testing Educational technology Evaluation methods Tests

This page was last modified on 23 October 2012 at 17:34.

Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. See Terms of Use for details.

Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.

Usability Testing | Usability.gov
www.usability.gov/methods/test_refine/learnusa/index.html


Introduction


Introduction to Usability Testing

Usability testing is a technique used to evaluate a product by testing it with representative users. In the test, these users will try to complete typical tasks while observers watch, listen and take notes.

Your goal is to identify any usability problems, collect quantitative data on participants' performance (e.g., time on task, error rates), and determine participants' satisfaction with the product.

You should test early and test often. Usability testing lets the design and development teams identify problems before they get coded (i.e., "set in concrete"). The earlier those problems are found and fixed, the less expensive the fixes are.

You DO NOT need a formal usability lab to do testing. You can do effective usability testing in any of these settings:

a fixed laboratory having two or three connected rooms outfitted with audio-visual equipment
a conference room, or the user's home or work space, with portable recording equipment
a conference room, or the user's home or work space, with no recording equipment, as long as someone is observing the user and taking notes
remotely, with the user in a different location

You will learn if participants are able to complete identified routine tasks successfully and how long it takes to do that. You will find out how satisfied participants are with your Web site. Overall, you will identify changes required to improve user performance. And you can match the performance to see if it meets your usability objectives.

Four Things to Keep in Mind

1. Testing the Site NOT the Users

We try hard to ensure that participants do not think that we are testing them. We help them understand that they are helping us test the prototype or Web site.

2. Performance vs. Subjective Measures

We measure both performance and subjective (preference) metrics. Performance measures include: success, time, errors, etc. Subjective measures include: user's self reported satisfaction and comfort ratings.

People's performance and preference do not always match. Often users will perform poorly but their subjective ratings are very high. Conversely, they may perform well but subjective ratings are very low.

3. Make Use of What You Learn

Usability testing is not just a milestone to be checked off on the project schedule. The team must consider the findings, set priorities, and change the prototype or site based on what happened in the usability test.

4. Find the Best Solution

Most projects, including designing or revising Web sites, have to deal with constraints of time, budget, and resources. Balancing all those is one of the major challenges of most projects.

Cost

Cost depends on the size of the site, how much you need to test, how many different types of participants you anticipate having, and how formal you want the testing to be. Remember to budget for more than one usability test. Building usability into a Web site (or any product) is an iterative process.

Consider these elements in budgeting for usability testing:

Time: You will need time to plan the usability test. It will take the usability specialist and the team time to get familiarized with the site and do dry runs with scenarios. Budget for the time it takes to test users and for analyzing the data, writing the report, and discussing the findings.

Recruiting Costs: time of in-house person or payment to a recruiting firm. Developing a user database, either in-house or through a recruiting firm, makes recruiting less time consuming and cheaper. Also allow for the cost of paying or providing gifts for the participants.

Rental Costs: If you do not have equipment, you will have to budget for rental costs for the lab or other equipment.


Research-Based Web Design & Usability Guidelines

18  Usability Testing

There are two major considerations when conducting usability testing. The first is to ensure that the best possible method for testing is used. Generally, the best method is to conduct a test where representative participants interact with representative scenarios. The tester collects data on the participant's success, speed of performance, and satisfaction. The findings, including both quantitative data and qualitative observations, are provided to designers in a test report. Using 'inspection evaluations,' in place of well-controlled usability tests, must be done with caution. Inspection methods, such as heuristic evaluations or expert reviews, tend to generate large numbers of potential usability 'problems' that never turn out to be actual usability problems.

The second major consideration is to ensure that an iterative approach is used. After the first test results are provided to designers, they should make changes and then have the Web site tested again. Generally, the more iterations, the better the Web site.


18:1 Use an Iterative Design Approach

Guideline: Develop and test prototypes through an iterative design approach to create the most useful and usable Web site.

Comments: Iterative design consists of creating paper or computer prototypes, testing the prototypes, and then making changes based on the test results. The 'test and make changes' process is repeated until the Web site meets performance benchmarks (usability goals). When these goals are met, the iterative process ends.

The iterative design process helps to substantially improve the usability of Web sites. One recent study found that the improvements made between the original Web site and the redesigned Web site resulted in thirty percent more task completions, twenty-five percent less time to complete the tasks, and sixty-seven percent greater user satisfaction. A second study reported that eight of ten tasks were performed faster on the Web site that had been iteratively designed. Finally, a third study found that forty-six percent of the original set of issues were resolved by making design changes to the interface.

Sources: Badre, 2002; Bailey, 1993; Bailey and Wolfson, 2005; Bradley and Johnk, 1995; Egan, et al., 1989; Hong, et al., 2001; Jeffries, et al., 1991; Karat, Campbell, and Fiegel, 1992; LeDoux, Connor and Tullis, 2005; Norman and Murphy, 2004; Redish and Dumas, 1993; Tan, et al., 2001.

18:2 Solicit Test Participants' Comments

Guideline: Solicit usability testing participants' comments either during or after the performance of tasks.

Comments: Participants may be asked to give their comments either while performing each task ('think aloud') or after finishing all tasks (retrospectively). When using the 'think aloud' method, participants report on incidents as soon as they happen. When using the retrospective approach, participants perform all tasks uninterrupted, and then watch their session video and report any observations (critical incidents).

Studies have reported no significant difference between the 'think aloud' versus retrospective approaches in terms of the number of useful incident reports given by participants. However, the reports (with both approaches) tended to be positively biased and 'think aloud' participants may complete fewer tasks. Participants tend not to voice negative reports. In one study, when using the 'think aloud' approach, users tended to read text on the screen and verbalize more of what they were doing rather than what they were thinking.

Sources: Bailey, 2003; Bowers and Snyder, 1990; Capra, 2002; Hoc and Leplat, 1983; Ohnemus and Biers, 1993; Page and Rahimi, 1995; Van Den Haak, De Jong, and Schellens, 2003; Wright and Converse, 1992.

18:3 Evaluate Web Sites Before and After Making Changes

Guideline: Conduct 'before and after' studies when revising a Web site to determine changes in usability.

Comments: Conducting usability studies prior to and after a redesign will help designers determine if changes actually made a difference in the usability of the site. One study reported that only twenty-two percent of users were able to buy items on an original Web site. After a major redesign effort, eighty-eight percent of users successfully purchased products on that site.

Sources: John and Marks, 1997; Karat, 1994a; Ramey, 2000; Rehman, 2000; Williams, 2000; Wixon and Jones, 1996.


Guideline: Distinguish between frequency and severity when reporting on usability issues and problems.

Comments: The number of users affected determines the frequency of a problem. To be most useful, the severity of a problem should be defined by analyzing difficulties encountered by individual users. Both frequency and severity data can be used to prioritize usability issues that need to be changed. For example, designers should focus first on fixing those usability issues that were shown to be most severe. Those usability issues that were encountered by many participants, but had a severity rating of ‘nuisance,’ should be given much less priority.
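A minimal sketch of how frequency and severity data might be combined into a ranked list is shown below. The issue names, the four-point severity scale, and the participant counts are invented for illustration; the guideline itself does not prescribe a specific scheme.

```python
# Hypothetical sketch: rank usability issues by severity first, then frequency.
# The severity scale (1 = nuisance ... 4 = blocks task) and the issues are invented
# for illustration; the guideline does not prescribe a specific formula.

issues = [
    # (issue, severity 1-4, number of participants affected out of 8)
    ("Checkout button label unclear", 4, 3),
    ("Search results load slowly",    2, 7),
    ("Footer links too small",        1, 6),
]

def priority_key(issue):
    _, severity, frequency = issue
    # Severe problems come first; frequency breaks ties among equally severe ones.
    return (-severity, -frequency)

for name, severity, frequency in sorted(issues, key=priority_key):
    print(f"severity={severity} affected={frequency}/8  {name}")
```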

Sources: Woolrych and Cockton, 2001.

18:5 Distinguish Between Frequency and Severity


Guideline: Give high priority to usability issues preventing ‘easy’ tasks from being easy.

Comments: When deciding which usability issues to fix first, address the tasks that users believe to be easy but are actually difficult. The Usability Magnitude Estimation (UME) is a measure that can be used to assess user expectations of the difficulty of each task. Participants judge how difficult or easy a task will be before trying to do it, and then make a second judgment after trying to complete the task. Each task is eventually put into one of four categories based on these expected versus actual ratings (see the sketch after this list):

• Tasks that were expected to be easy, but were actually difficult;

• Tasks that were expected to be difficult, but were actually easy;

• Tasks that were expected to be easy and were actually easy; and

• Tasks that were expected to be difficult and were difficult to complete.
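The four-way categorization can be sketched in a few lines of code. The 1-7 difficulty scale and the cutoff below are assumptions made for illustration; UME as described by Rich and McGee uses open-ended magnitude estimates rather than a fixed scale.

```python
# Illustrative sketch of the expected-vs-actual categorization. The 1-7 difficulty
# scale and the cutoff of 4 are assumptions; actual UME uses open-ended magnitude
# estimates collected before and after each task attempt.

def categorize(expected, actual, cutoff=4):
    expected_easy = expected <= cutoff
    actual_easy = actual <= cutoff
    if expected_easy and not actual_easy:
        return "expected easy, actually difficult (fix first)"
    if not expected_easy and actual_easy:
        return "expected difficult, actually easy"
    if expected_easy and actual_easy:
        return "expected easy, actually easy"
    return "expected difficult, actually difficult"

ratings = {"find store hours": (2, 6), "renew a license": (6, 3)}
for task, (before, after) in ratings.items():
    print(task, "->", categorize(before, after))
```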

Sources: Rich and McGee, 2004.

18:4 Prioritize Tasks


Guideline: Select the right number of participants when using different usability techniques. Using too few may reduce the usability of a Web site; using too many wastes valuable resources.

Comments: Selecting the number of participants to use when conducting usability evaluations depends on the method being used:

Inspection evaluation by usability specialists:

– The typical goal of an inspection evaluation is to have usability experts separately inspect a user interface by applying a set of broad usability guidelines. This is usually done with two to five people.

– The research shows that the more experts are involved in evaluating the usability of the product, the greater the number of usability issues that will be identified. However, for every true usability problem identified, there will be at least one usability issue that is not a real problem. Having more evaluators does decrease the number of misses, but it also increases the number of false positives. Generally, the more expert the usability specialists, the more useful the results.

Performance usability testing with users:

– Early in the design process, usability testing with a small number of users (approximately six) is sufficient to identify problems with the information architecture (navigation) and overall design issues. If the Web site has very different types of users (e.g., novices and experts), it is important to test with six or more of each type of user. Another critical factor in this preliminary testing is having trained usability specialists as the usability test facilitator and primary observers.

– Once the navigation, basic content, and display features are in place, quantitative performance testing (measuring times, wrong pathways, failure to find content, etc.) can be conducted to ensure that usability objectives are being met. To measure each usability objective to a particular confidence level, such as ninety-five percent, requires a larger number of users in the usability tests.

– When the performance of two sites is compared (i.e., an original site and a revised site), quantitative usability testing should be employed. Depending on how confident the usability specialist wants to be in the results, the tests could require a larger number of participants (see the sketch after this list).

18:6 Select the Right Number of Participants


– It is best to perform iterative cycles of usability testing over the course of the Web site’s development. This enables usability specialists and designers to observe and listen to many users.
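One common way to reason about these sample sizes is the cumulative problem-discovery model often associated with Nielsen and Landauer (cited in the sources below), in which the proportion of problems found by n participants is estimated as 1 - (1 - p)^n for an average per-participant detection probability p. The sketch below assumes p = 0.31, a frequently quoted average; the right value varies by product and study.

```python
# Sketch of the cumulative problem-discovery model 1 - (1 - p)**n.
# p = 0.31 is a commonly quoted average detection probability, used here only
# as an assumption; real values differ between products and participant pools.
import math

def proportion_found(n_participants, p=0.31):
    return 1 - (1 - p) ** n_participants

def participants_needed(target, p=0.31):
    # Smallest n with 1 - (1 - p)**n >= target.
    return math.ceil(math.log(1 - target) / math.log(1 - p))

for n in (3, 6, 12):
    print(f"{n} participants -> ~{proportion_found(n):.0%} of problems")
print("to reach 85%:", participants_needed(0.85), "participants")
```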

Sources: Bailey, 1996; Bailey, 2000c; Bailey, 2000d; Brinck and Hofer, 2002; Chin, 2001; Dumas, 2001; Gray and Salzman, 1998; Lewis, 1993; Lewis, 1994; Nielsen and Landauer, 1993; Perfetti and Landesman, 2001; Virzi, 1990; Virzi, 1992.


Guideline: Create prototypes using the most appropriate technology for the phase of the design, the required fidelity of the prototype, and skill of the person creating the prototype.

Comments: Designers can use either paper-based or computer-based prototypes. Paper-based prototyping appears to be as effective as computer-based prototyping when trying to identify most usability issues. Several studies have shown that there was no reliable difference in the number of usability issues detected between computer and paper prototypes. However, usability test participants usually prefer interacting with computer-based prototypes. Paper prototypes can be used when it is necessary to view and evaluate many different (usually early) design ideas, or when computer-based prototyping does not support the ideas the designer wants to implement, or when all members of the design team need to be included–even those that do not know how to create computer-based prototypes.

Software tools available to assist in the rapid development of prototypes include PowerPoint, Visio, and other HTML-based tools. PowerPoint can be used to create medium-fidelity prototypes. These prototypes can be both interactive and dynamic, and are useful when the design requires more than a ’pencil-and-paper’ prototype.

Sources: Sefelin, Tscheligi and Giller, 2003; Silvers, Voorheis and Anders, 2004; Walker, Takayama and Landay, 2002.

18:7 Use the Appropriate Prototyping Technology


Guideline: Use inspection evaluation results with caution.

Comments: Inspection evaluations include heuristic evaluations, expert reviews, and cognitive walkthroughs. It is a common practice to conduct an inspection evaluation to try to detect and resolve obvious problems before conducting usability tests. Inspection evaluations should be used cautiously because several studies have shown that they appear to detect far more potential problems than actually exist, and they also tend to miss some real problems. On average, for every hit there will be about 1.3 false positives and .5 misses.
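The ratio above (roughly 1.3 false positives and 0.5 misses per true problem found) can be restated as the validity and thoroughness measures commonly used in this literature. The counts in the sketch below are illustrative only, not taken from any specific study.

```python
# Restating the ratio above (per true problem found: ~1.3 false positives and
# ~0.5 misses) as validity (precision) and thoroughness (recall). Counts are
# illustrative only.
hits, false_positives, misses = 10, 13, 5

validity = hits / (hits + false_positives)   # share of reported issues that are real
thoroughness = hits / (hits + misses)        # share of real problems that were found

print(f"validity ~ {validity:.2f}, thoroughness ~ {thoroughness:.2f}")
```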

Another recent study concluded that the low effectiveness of heuristic evaluations as a whole was worrisome because of the low problem detection rate (p=.09), and the large number of evaluators required (16) to uncover seventy-five percent of the potential usability issues.

Another difficulty when conducting heuristic evaluations is that evaluators frequently apply the wrong heuristic, which can mislead designers that are trying to fix the problem. One study reported that only thirty-nine percent of the heuristics were appropriately applied.

Evaluators seem to have the most success identifying usability issues that can be seen by merely looking at the display, and the least success finding issues that require users to take several steps (clicks) to a target.

Heuristic evaluations and expert reviews may best be used to identify potential usability issues to evaluate during usability testing. To improve somewhat on the performance of heuristic evaluations, evaluators can use the ’usability problem inspector’ (UPI) method or the ’Discovery and Analysis Resource’ (DARe) method.

Sources: Andre, Hartson and Williges, 2003; Bailey, Allen and Raiello, 1992; Catani and Biers, 1998; Cockton and Woolrych 2001; Cockton and Woolrych, 2002; Cockton, et al., 2003; Fu, Salvendy and Turley, 1998; Fu, Salvendy and Turley, 2002; Law and Hvannberg, 2002; Law and Hvannberg, 2004; Nielsen and Landauer, 1993; Nielsen and Mack, 1994; Rooden, Green and Kanis, 1999; Stanton and Stevenage, 1998; Virzi, Sorce and Herbert, 1993; Wang and Caldwell, 2002.


18:8 Use Inspection Evaluation Results Cautiously


Guideline: Use appropriate automatic evaluation methods to conduct initial evaluations on Web sites.

Comments: An automatic evaluation method is one where software is used to evaluate a Web site. An automatic evaluation tool can help find certain types of design difficulties, such as pages that will load slowly, missing links, use of jargon, potential accessibility problems, etc. While automatic evaluation methods are useful, they should not be used as a substitute for evaluations or usability testing with typical users. Many automatic evaluation tools are commercially available for checking a variety of Web site parameters.

Sources: Brajnik, 2000; Campbell and Stanley, 1963; Gray and Salzman, 1998; Holleran, 1991; Ivory and Hearst, 2002; Ramey, 2000; Scholtz, 1998; World Wide Web Consortium, 2001.

18:10 Apply Automatic Evaluation Methods


Guideline: Beware of the ’evaluator effect’ when conducting inspection evaluations.

Comments: The ’evaluator effect’ occurs when multiple evaluators evaluating the same interface detect markedly different sets of problems. The evaluators may be doing an expert review, heuristic evaluation, or cognitive walkthrough. The evaluator effect exists whether evaluators are novices or experts, whether they are detecting cosmetic or severe problems, and whether they are evaluating simple or complex Web sites. In fact, when using multiple evaluators, any one evaluator is unlikely to detect the majority of the ’severe’ problems that will be detected collectively by all evaluators. Evaluators also tend to perceive the problems they detected as more severe than the problems detected by others.

The main cause of the ’evaluator effect’ seems to be that usability evaluation is a complex cognitive activity that requires evaluators to exercise difficult judgments.

Sources: Hertzum and Jacobsen, 2001; Jacobsen, Hertzum and John, 1998; Molich, et al., 1998; Molich, et al., 1999; Nielsen and Molich, 1990; Nielsen, 1992; Nielsen, 1993; Redish and Dumas, 1993; Selvidge, 2000.
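One simple way to see the evaluator effect in your own data is to compare the sets of problems reported by each evaluator, for example with a pairwise overlap (Jaccard) measure. The evaluator names and problem IDs below are invented for illustration.

```python
# Hypothetical sketch: pairwise overlap (Jaccard index) between the problem sets
# reported by different evaluators. Low overlap is the 'evaluator effect'.
from itertools import combinations

reports = {
    "evaluator_A": {"P1", "P2", "P5", "P7"},
    "evaluator_B": {"P2", "P3", "P4"},
    "evaluator_C": {"P1", "P4", "P6", "P7", "P8"},
}

for (a, set_a), (b, set_b) in combinations(reports.items(), 2):
    jaccard = len(set_a & set_b) / len(set_a | set_b)
    print(f"{a} vs {b}: overlap = {jaccard:.2f}")
```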

18:9 Recognize the ‘Evaluator Effect’


Guideline: Use cognitive walkthroughs with caution.

Comments: Cognitive walkthroughs are often conducted to resolve obvious problems before conducting performance tests. The cognitive walkthrough appears to detect far more potential problems than actually exist, when compared with performance usability testing results. Several studies have shown that only about twenty-five percent of the potential problems predicted by the cognitive walkthrough were found to be actual problems in a performance test. About thirteen percent of actual problems in the performance test were missed altogether in the cognitive walkthrough. Cognitive walkthroughs may best be used to identify potential usability issues to evaluate during usability testing.

Sources: Blackmon, et al., 2002; Desurvire, Kondziela and Atwood, 1992; Hassenzahl, 2000; Jacobsen and John, 2000; Jeffries and Desurvire, 1992; John and Mashyna, 1997; Karat, 1994b; Karat, Campbell and Fiegel, 1992; Spencer, 2000.


18:11 Use Cognitive Walkthroughs Cautiously


Guideline: Testers can use either laboratory or remote usability testing because they both elicit similar results.

Comments: In laboratory-based testing, the participant and the tester are in the same physical location. In remote testing, the tester and the participant are in different physical locations. Remote testing provides the opportunity for participants to take a test in their home or office. It is convenient for participants because it requires no travel to a test facility.

Studies have evaluated whether remote testing is as effective as traditional, lab-based testing. To date, they have found no reliable differences between lab-based and remote testing in terms of the number or types of usability issues identified. They also report no reliable differences in task completion rate, time to complete the tasks, or satisfaction scores.

Sources: Brush, Ames and Davis, 2004; Hartson, et al., 1996; Thompson, Rozanski and Rochester, 2004; Tullis, et al., 2002.

18:12 Choosing Laboratory vs. Remote Testing


Guideline: Use severity ratings with caution.

Comments: Most designers would like usability specialists to prioritize design problems that they found either by inspection evaluations or expert reviews. So that they can decide which issues to fix first, designers would like the list of potential usability problems ranked by each one’s ‘severity level’. The research literature is fairly clear that even highly experienced usability specialists cannot agree on which usability issues will have the greatest impact on usability.

One study had 17 expert review and usability test teams evaluate and test the same Web page. The teams had one week to do an expert review, or two weeks to do a usability test. Each team classified each usability issue as a minor problem, serious problem, or critical problem. There was considerable disagreement about which problems the teams judged as minor, serious, or critical, and little agreement on which were the ’top five problems’. Another study reported that heuristic evaluators overestimated severity twenty-two percent of the time, and underestimated severity seventy-eight percent of the time, when compared with usability testing results.

Sources: Bailey, 2005; Catani and Biers, 1998; Cockton and Woolrych, 2001; Dumas, Molich and Jeffries, 2004; Hertzum and Jacobsen, 2001; Jacobsen, Hertzum and John, 1998; Law and Hvannberg, 2004; Molich, 2005.

18:13 Use Severity Ratings Cautiously


Usability Testing Basics

An Overview


Contents

Usability Testing Defined
    Decide What to Test
    Determine When to Test What
    Decide How Many to Test
    Design the Test
        Consider the Where, When, and How
        Scenarios and Tasks
        Prepare to Measure the Experience
        Select Data to Capture
    Recruit Participants
        Recruitment Ideas
        Compensation
    Prepare for Test Sessions
        Setting
        Schedule Participants
        Stakeholders
        Observers
        Script
        Questionnaires and Surveys
    Conduct Test Sessions
        Begin with a Run-through
        At the Test Session
        Facilitation
        After the Session
    Analyze Your Study
        Step 1: Identify exactly what you observed
        Step 2: Identify the causes of any problems
        Step 3: Determine Solutions
    Deliverables
Appendix A: Participant Recruitment Screener
    Recruitment Script

About this Document

This document details usability testing basics—how to apply them with any product or prototype and when to apply them during any point in the development process. It also discusses how to conduct, analyze, and report on usability test findings. Then you can learn about how to do it all in Morae 3.


Usability Testing Defined

Usability tests identify areas where people struggle with a product and help you make recommendations for improvement. The goal is to better understand how real users interact with your product and to improve the product based on the results. The primary purpose of a usability test is to improve a design.

In a typical usability test, real users try to accomplish typical goals, or tasks, with a product under controlled conditions. Researchers, stakeholders, and development team members watch, listen, collect data, and take notes.

Since usability testing employs real customers accomplishing real tasks, it can provide objective performance data, such as time on task, error rate, and task success. There is also no substitute for watching users struggle with, or have great success in, completing a task when using a product. This observation helps designers and developers gain empathy with users and helps them think of alternative designs that better support tasks and workflow.

Decide What to Test

Meet with stakeholders, including members of the development team when possible, to map out the goals for the test and discuss what areas of the system or product you will evaluate. In order to gather all of the information you will need to conduct your test, ask for feedback on:

Background: Product description and reasons for requesting feedback

Participants: The desired qualities of participants and characteristics of users or customers of the product

Usability Goals: What you hope to learn with this test

Key Points: What kinds of actions/features the test tasks should cover—this also may include a list of specific questions the team wants the usability test to answer

Timeline: The timeline for testing—when the product or prototype will be ready for testing, when the team would like to discuss the results, or any other constraints

Additional Information: Anything else that needs to be taken into consideration

Be sure to identify user goals and needs as well. With this information you can then develop scenarios and tasks for participants to perform that will help identify where the team can make improvements.

For example:

• Who uses (or would use) the product?

• What are their goals for using the product?

• What tasks would those people want to or have to accomplish to meet those goals?

• Are there design elements that cause problems and create a lot of support calls?

• Are you interested in finding out if a new product feature makes sense to current users?


Determine When to Test What

Usability testing can employ many methods and work with products at many levels of development. If there is enough of an interface to complete tasks—or even imagine completing a task—it is possible to perform a usability test. You can test a product at various stages of development:

• Low-fidelity or paper prototype: A hand-drawn, mocked-up, or wireframe version of a product or web site that allows for a “paper prototype” style test before work begins or early in development.

• High-fidelity prototype: An interactive system that can be used on a computer, such as a Flash version of a product’s user interface and interactivity. High-fidelity prototypes should include representative data and mimic the experience users would have when using the finished product to accomplish tasks. Testing is usually performed as development progresses.

• Alpha and Beta versions: These not-ready-for-release versions are often stable and rich enough to be sent to, or accessed by, remote participants for a usability test.

• Release version: A product that has been released to customers; especially effective for testing the workflow of the product from beginning to end.

• Comparative or A/B: Multiple versions of a design are used in testing (often alternated between participants) to measure differences in performance and satisfaction.

Decide How Many to Test

The number of participants varies based on the type and purpose of the test. Opinions vary, but at least four participants from each group of user types (user types are determined by stakeholders and development team members when determining testing goals) are usually needed to test a product. Different testing techniques require different numbers of participants, as explained in this table.

Recommended Number of Participants by Testing Technique

Benchmark metrics
How many: 8-24 users
Metrics and measures: Focus on metrics for time, failures, etc.; tests the current process or product
Why: Establish baseline metrics
When: Before a design project begins or early in development
How often: Once

Diagnostic (formative) evaluation
How many: 4-6 users
Metrics and measures: Less formal; increased focus on qualitative data
Why: Find and fix problems
When: During design
How often: Iterative

Summative testing
How many: 6-12+ users
Metrics and measures: More formal; metrics based on usability goals
Why: Measure success of the new design
When: At the end of the process
How often: Once

Source: Ginny Redish


Design the Test

Document your test plan with a “protocol”. You may want to use a test planning checklist to help you track all the details. Examples of each are available on the Morae Resource CD. Scenario and task design is one of the most important factors to consider. Plan to have participants accomplish typical tasks with the product under controlled conditions. The tasks should provide the data that answers your design questions.

Consider the Where, When, and How

You will need to schedule rooms, labs and equipment, and know where your participants will be located. Usability tests can take place in a lab, conference room, quiet office space, or a quiet public space. Morae enables you to capture the product or screen and software data, facial expressions, and verbal comments. UserVue, TechSmith’s remote testing service, lets participants take part in a test from home or work. Recordings from UserVue import seamlessly into Morae.

Scenarios and Tasks

Tasks are the activities you will ask your participant to do during a usability test; scenarios frame tasks and provide motivation for the participant to perform those tasks. You may have one scenario or several, depending on your tasks. Both tasks and scenarios should be adjusted to meet goals and should be part of the conversation you have with stakeholders about the test.

Tips for Writing Scenarios

• You may find it easiest to write tasks first, then scenarios, or vice versa. In our examples, we start with writing a scenario.

Imagine why your users would want to use your product in general, then specifically what would motivate them to encounter the design elements you are evaluating. Scenarios should be a story that provides motivation to your participants.

• Effective tasks often contain scenario information, which give the test participant an understanding of their motivation and context for accomplishing the task and all information needed to complete the task. A scenario could be given to participants before beginning the tasks.

For example, to find out how users use the store on your Web site, a scenario could state, “You have been researching different types of video cameras to buy to record family videos and transfer them on to your computer. You want to use the Web site to find information about video cameras and purchase a camera based on your needs.”

• Another method is to fold scenario information into each task.

For example, a task might state, “Purchase a video camera,” but a task with scenario information would give more detail: “You want to purchase a video camera that is small and lightweight, and can transfer video files on to your computer. You have a budget of $400. Find and purchase a camera that meets your needs.”

Scenario Do’s and Don’ts

• DO: Create a believable scenario

• DON’T: Create long, complex scenarios – consider breaking them up into smaller scenarios for smaller groups of tasks


Tips for Writing Tasks

Tasks can contain as little or as much information as necessary to aid participants and give them context and motivation. Different types of tests require different types of tasks—a paper prototype might call for a more open-ended task, whereas another type of test may need very specific tasks. Tasks fall into three main categories:

• Prescribed tasks – as the test designer you determine what the participant will do.

Example: “You want to enhance your copy of SnagIt. Using TechSmith.com, download and install the “Typewriter” style numbers for SnagIt.”

• Participant defined – ask participants to tell you something they would normally do with your product, and then have them do the task they described.

Example: “Using SnagIt 9, take a capture of something that you normally capture in your work or personal life - or something similar to what you normally would do - and enhance and share it the way you normally would. Please feel free to customize or use SnagIt in any way that meets your needs.”

• Open ended – allow participants to organically explore the product based on a scenario you provide.

Example: “We are giving you $100 to buy software that will capture your screen. Using the internet, find the software you want to buy. When you are done you can keep the software you purchase as well as any remaining funds.”

The order of tasks will often follow a natural flow of product use. When order does not matter for the user, the order of tasks might need to be varied to avoid testing bias. It may be best to begin with a simple task to ease the user into the testing situation and build confidence.

Task Do’s and Don’ts

• DO: Use the language of the participant, and write tasks that the participant might realistically expect to do in his or her use of the product.

• DO: Identify specific activities that represent typical tasks that your users would perform with your product. The tasks should relate back to the goals for the test and relate to your scenario. There are several types of tasks that you might use based on the data you are interested in collecting.

• DO: Provide additional information such as a credit card number for payment transactions, receipts for an expense reporting system, email addresses, etc.

• DON’T: Use any of the terms used in the product – avoid clues about what to do. Avoid terms like “Click on your shopping cart to check out.”

• DON’T: Lead the participant by using directions that are too explicit. Avoid language such as “click on the red button to begin.”

• DON’T: Write so that you describe how to perform the task.

• DON’T: Write dependent tasks that require participants to complete one task before moving on; if data or other artifacts from the first task are needed, provide them in subsequent tasks.


Prepare to Measure the Experience

Usability testing evaluates a product under the most realistic circumstances possible while controlling the conditions. This method of user research lets the researcher collect data measured in numbers (quantitative) and data documented as part of the test (qualitative). Different data are used to measure various aspects of usability.

Key Evaluation Measures for Usability Testing

Task Success
What’s measured: Whether or not the participant was successful, and to what degree (for example, completed with ease, completed with difficulty, failed to complete).
When to use: Critical when effectiveness of the product is a primary goal.

Time on Task
What’s measured: The length of time it takes the participant to complete a task. May be averaged for all participants, and can be compared between tests.
When to use: Critical when efficiency is a primary usability goal, and when efficiency is a primary influence on satisfaction.

Errors
What’s measured: A count of the errors each participant makes in each task. Errors may be categorized or predefined.
When to use: Critical to both efficiency and effectiveness; use this measure when you want to minimize the problems a user may encounter in the product.

Learnability
What’s measured: A task is repeated at least once to determine whether the time on task is shorter, fewer errors are made, or the task is more successful.
When to use: Important to measure whether the interface will be easier to use over time.

Satisfaction
What’s measured: Participants’ overall feelings about a product before, during and/or after a test.
When to use: Allows the participants to quantify and describe their emotional reaction to a product before, during or after a study.

Mouse Clicks
What’s measured: The number of clicks that a participant makes.
When to use: Measures the effectiveness and efficiency of a product; suggests whether a participant was able to accomplish a task with less effort.

Mouse Movement
What’s measured: The distance the mouse travels.
When to use: Measures efficiency; suggests whether a participant was able to accomplish a task with less effort.

Problem/Issue Counts
What’s measured: Records, counts, ranks and/or categorizes problems observed.
When to use: Provides an overview of the issues that may be causing other measures to be less ideal. Allows comparison across studies to determine improvement. These are often weighted by how severe an issue may be.

Optimal Path
What’s measured: The path a participant takes to accomplish a task, compared to a predefined optimal path.
When to use: Measures the variance from the ideal path.

Make Your Own
With Rich Recording Technology™ data, you can design the study that fits your needs; the possible measures are unlimited.


Select Data to Capture

Rich Recording Technology™ will automatically record a wide set of data about user activity and input on the computer. You can set up markers in Recorder that will let your observers log other activity (a small sketch of reducing marker data to metrics follows the list), including:

• Task start and end points (to record time on task)

Places where the participant:

• Reaches a milestone

• Makes an error

• Fails to complete a task

• Accesses help (or asks the facilitator)

• Encounters a problem
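A rough sketch of how such marker data could be reduced to the measures described above is shown below. The log format and field names are invented for illustration; Morae’s own marker export format will differ.

```python
# Hypothetical sketch: reduce a simple marker log to time on task, error count,
# and task success. The log format below is invented; it is not Morae's format.
from statistics import mean

# (participant, task, marker, timestamp in seconds from session start)
markers = [
    ("p1", "task1", "start", 10), ("p1", "task1", "error", 42),
    ("p1", "task1", "end_success", 95),
    ("p2", "task1", "start", 12), ("p2", "task1", "end_fail", 200),
]

tasks = {}
for participant, task, marker, t in markers:
    rec = tasks.setdefault((participant, task), {"errors": 0})
    if marker == "start":
        rec["start"] = t
    elif marker == "error":
        rec["errors"] += 1
    elif marker.startswith("end"):
        rec["time_on_task"] = t - rec["start"]
        rec["success"] = marker == "end_success"

times = [r["time_on_task"] for r in tasks.values()]
successes = [r["success"] for r in tasks.values()]
print("mean time on task:", mean(times), "s")
print("task success rate:", sum(successes) / len(successes))
print("total errors:", sum(r["errors"] for r in tasks.values()))
```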

Capture “Qualitative” Data

Some things are not measured with numbers. Reactions, quotes, facial expressions and participant behaviors (like gesturing, pushing a chair back, and so on) are also important data points that require a human to interpret. Alert your observers to make note when these things happen as well – you’ll be able to highlight them later, as you review your recording. Set markers for quotes and behaviors in the Morae Study Configuration for observers to use.

Identify Success Paths

For each task you write, it’s good practice to have all stakeholders agree on the success paths so everyone has a common understanding about when participants are successful and when they are not. You might decide that there is only one success path or several depending on your product.

Test observers can then help count errors and problems associated with each task and you can identify when participants are able to successfully complete tasks or not.

Recruit Participants

Recruiting is one of the most important components of a usability test. Your participants should adequately reflect your true base of users and the user types you have decided to test, and represent the range of new and experienced users who would actually use your product.

Recruitment Ideas

• Use your own customer databases or contacts

• Hire an outside agency: look for market research firms if none specialize in usability recruiting; good screeners are vital, and there is a cost per candidate

• Post on Craig’s List: don’t identify your company, just qualifications

• Post something on your web site: start a usability testing page where site visitors can sign up to participate

• Place an ad in the paper: good for local audiences

• When doing your own recruiting you should identify criteria that will help you select qualified candidates. Experience with the product or the field, computer experience, age and other demographics may be important to consider. See Appendix A: Participant Recruitment Screener.

When recruiting using an outside recruiting firm, a screener helps you get the right participants. A recruiting screener is used to determine if a potential participant matches the user characteristics defined in the usability test protocol. Include questions about demographics, frequency of use, experience level, etc.


Ask questions that will help you filter out participants who don’t match your target users, and indicate in the screener when to thank people for their time and let them know that they do not qualify. Ask enough questions so that you know you have the right people. For example, qualified participants for a test of an online shopping site should have access to a computer at home or at work and meet the other required demographics (age range, etc.).

Compensation

You will need to think about what kind of compensation you will offer participants. Typically participants get cash or an equivalent, a gift certificate, or even merchandise from your company.

Prepare for Test Sessions

Now that you have your test protocol and your participants, you’re ready to get started.

Setting

The most important factors are that the environment be comfortable for participants, similar to their real-world environment, and reasonably similar between participants.

Schedule Participants

Schedule your participants to have adequate time to work through the test at their own pace. Allow enough time between sessions to reset, debrief and regroup.

Stakeholders

When working with your stakeholders, help them understand how you will conduct your testing. Stakeholders need to understand how you will be interacting with your participant during test sessions. They need to understand that you are there to facilitate the test and observe behavior, not help the participant complete tasks. There are two basic models:

• Facilitator interacts with the participant – you often get more qualitative information, especially when the facilitator is good at asking neutral questions and encouraging participants to find their own answers in the product.

• Facilitator does not interact with the participant – you can get more natural behavior, but participants are left to struggle or quit on their own. You often will not get as much qualitative data as participants may not talk out loud as much. You may get more accurate measures of time on task and failure, however.

Observers

• At least one person can be enlisted to help you log all of your recordings for the data points you’ve set out. By having someone else log the sessions, the facilitator can concentrate on the test. At the same time, the recording will capture a rich set of data for later analysis.

• In addition to a designated person to help you observe and log data, there may be a long list of stakeholders who will benefit from observing a test. They commonly include the developers, managers, product managers, quality testing analysts, sales and marketing staff, technical support and documentation.

Watching users actually struggle with a product is a powerful experience. Observing test sessions helps make the team open to making changes.

Remember to warn your observers not to drive changes until you and the team have had an opportunity to analyze all testing and decide upon changes that will address the root causes.


Script

Create a facilitator script to help you and your facilitators present a consistent set of information to your participants. The script will also serve as a reminder to you to say certain things to your participants and provide them appropriate paperwork at the right times.

In your script, remind participants that the usability test is an evaluation of the product and not of their performance, that all problems they find are helpful and that their feedback is valuable, and let participants know their data will be aggregated with the rest of the participant data and they will not be identified.

Questionnaires and Surveys

A typical usability study has at least two surveys (questionnaires): one administered before the participant starts tasks and one administered at the end of the test, which collects subjective information such as how satisfied participants were with the product and how easy it was to use. You can also administer a survey after each task; these are typically used to measure satisfaction or ease of use for each task.

• Pre-test survey – collects demographic and product usage data about participants such as computer use, online shopping habits, internet usage, age, gender, etc., and should help product teams understand more about typical customers and how they are reflected in the test participants

• Post-task survey – questions that rate subjective ease of use or satisfaction for each task. Ask other questions or other types of questions when appropriate. Limit the number of post-task questions so as not to overwhelm participants.

• Post-test survey – post-test surveys are often used to measure satisfaction; use SUS for a standard satisfaction measure (a scoring sketch follows this list) or use your own questions. See our templates for ideas of questions you might use.
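If you use SUS for the post-test survey, the standard scoring is simple to compute: odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to give a 0-100 score. The responses in the sketch below are made up for illustration.

```python
# Standard SUS scoring: 10 items answered on a 1-5 scale.
# Odd items contribute (response - 1), even items contribute (5 - response);
# the sum is scaled by 2.5 to a 0-100 range. Responses here are invented.

def sus_score(responses):
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))  # -> 85.0
```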

Survey Do’s and Don’ts

• DO: Check with your HR and legal departments to make sure there are no regulations or requirements about the data you can collect.

• DO: Use age ranges rather than specific ages when asking participants for their age.

• DO: Include comment fields for questions where you want to hear more from participants.

• DON’T: Collect gender information; if you want to collect gender note that information separately based on observation.

Conduct Test Sessions

Good practices for conducting your test start before the participant arrives and follow through after he or she leaves.

Begin with a Run-through

• Run through your test yourself or with someone else to make sure the tasks make sense and can be completed with the version of the product you are testing

• Conduct a pilot test with a participant – this participant can be a co-worker or someone you have access to that would be part of the target audience

• Allow enough time before the test session to make changes.

At the Test Session

• Welcome your participant and make them comfortable


• Use the script to help you remember what you need to do and say

• Ask participants to fill out the consent form (include a non-disclosure agreement if your company requires one)

• Remember your facilitation skills and start the test

• Allow enough time between test sessions to set up equipment and prepare for your next participant

• See Usability Testing and Morae for details on using Morae when conducting your test sessions.

Facilitation

Technique matters: An impartial facilitator conducts the test without influencing the participant. The facilitator keeps the test flowing, provides simple directions, and keeps the participant focused. The facilitator may be located near the participant or in another room with an intercom system. Often, participants are asked to keep a running narration (called the “think-aloud” protocol) and the facilitator must keep the participant talking.

Test Session Do’s and Don’ts

• DO: Ensure the participant’s physical comfort.

• DO: Ask open ended questions

o What are you thinking right now?

o What are you trying to do?

o Is there anything else you might try?

o Where would you go?

o What did you expect to happen?

o You seemed surprised or frustrated…?

o Exactly how did that differ from what you expected to happen?

o Would you expect that information to be provided?

o Please keep talking…

• DO: Provide open-ended hints only when asked: “Do you see something that will help you?”

• DON’T: Provide direction or tell the user how to accomplish the task.

• DON’T: Offer approval or disapproval with words, facial expressions or body language.

• DON’T: Crowd the participant physically; allow the participant to move, take a break or quit.

• DON’T: Make notes only when the participant does something interesting…keep the sound of your keyboard or pen consistent so that you avoid giving clues to the participant.

Techniques for Task Failures

Occasionally, a participant will fail to complete or will outright quit trying to complete a task. Indirect hints or encouragement such as “is there anything on the screen to help you?” may be used to encourage the participant to explore, but at some point he or she should be allowed to fail.

If a participant fails a task but needs the information from that task to continue, a recommended technique is to count the failure but have the participant try the required portion of the first task again. Doing this lets you better understand how long it takes participants to “get” a particular interaction. You can then gauge how easy or hard it is to learn to perform the task, and learn more about where users might be confused by your product.

Provide a more direct hint only as a very last resort.

After the Session

• Reset your machine, clear data and save the participant’s work, if appropriate.


• Debrief with observers to note trends and issues

• Clean the environment for the next participant.

Analyze Your Study

Analyzing is a three-step process:

• Step 1: Identify exactly what you observed

• Step 2: Identify the causes of any problems

• Step 3: Determine Solutions

Source: Whitney Quesenbery

Step 1: Identify exactly what you observed

Your analysis following the test lets you find the critical problems and issues that help you design a better product. Review what you’ve seen and note:

• How did people perform? Were they successful?

• How long did it take them to complete a task?

• What mistakes were made?

• What problems did they encounter? Where?

• How often and how many problems did they have?

• How did they feel? What did they say? What significant things did they do?

• What worked well?

You can begin with a review of your recordings for those measures that you selected when you designed the test. Create your project, import the recordings of each participant, and look for the data you defined when you planned your test.

Step 2: Identify the causes of any problems

Ask yourself and the team a series of questions about the problems observed.

• Was there a problem with workflow? The navigation? The terminology?

• How severe was the problem? Did the participant fail or was the participant significantly delayed? Did it present an obstacle? How difficult was the obstacle to overcome?

• “Why?” Why did the participant have a problem? After you ask that question, ask “Why?” again. Repeat that process until you reach the fundamental, underlying problem.

Step 3: Determine Solutions

In some cases, the researcher is tasked to make recommendations. In other environments, solutions are determined by designers and/or development teams. The researcher can guide solutions so that they address the root causes of a usability problem and meet the needs of the user.

One technique is to have all stakeholders meet to review the findings and determine recommendations at a debrief meeting. Diagramming problems, listing them and discussing each can produce a shared understanding of how to address the problems.

Even when working alone, it is essential that you discuss your usability recommendations with your team – developers, marketing, sales – to learn what works and what doesn't work from a business and technical point of view. If you are the person from whom recommendations are expected, solicit other opinions and be prepared to set your ideas aside.


Tips for Great Recommendations

• Use your data to form conclusions and drive design changes.

• Remember to note that good things have happened; mention them first.

• Make sure your recommendations address the original problem noted, and limit the recommendation to that original problem. Create solutions that address the cause of the problem, not the symptoms.

• Keep them short and to the point.

• Make your recommendations specific. For example, rather than recommending a change in a workflow, diagram the optimal workflow based on the test findings.

• Address the needs of as many users (or types of users) as possible.

• Recommend the least possible change, and then recommend a quick usability test to see if you’ve solved the problem. If not, try another tweak, or move on to a larger change.

• A picture or video is worth a thousand words: enhance your recommendations with wireframes, video clips and annotated screenshots.

• Use the language of your audience: executives, developers, etc.

• Show an interest in what happens to your great recommendations. Ask follow-up questions if your great recommendations are not followed. Maybe you can learn something.

Avoid making recommendations that:

• Are based on opinions.

• Are vague or not actionable.

• Only list complaints.

• Create a new set of problems for users.

• Are targeted only to a single type of user, for example, a design targeted for expert users at the expense of other types.

Source: Rolf Molich, et al

Deliverables

Deliverables – reports, presentations, highlight videos and so on – document what was done for future reference. They often detail the usability problems found during the test plus any other data such as time on task, error rate, satisfaction, etc. The Usability Test Report on the Morae Resource CD is one template you might use to report results. Generally speaking, reports or presentations will include:

Summary

Description of the product and the test objectives

Method

• Participants

• Context of the test

• Tasks

• Testing environment and equipment

• Experiment design

• What was measured (metrics)

Results and findings

• How participants fared (Graphs and tables)

• Why they might not have done well (or why they did do well)


Recommendations or Next Steps

Depending on the project objectives and stakeholders, the report can also take the form of a presentation. Morae makes it easy for you to include highlight videos at important points, to illustrate the problem in the participant’s own words.

Resources

A Practical Guide to Usability Testing, Revised (2nd) Edition, by Joe Dumas and Ginny Redish, Intellect, 1999

Usability Testing and Research by Carol Barnum, Longman, 2002

Handbook of Usability Testing: How to Plan, Design and Conduct Effective Tests by Jeff Rubin and Dana Chisnell, Wiley, 2nd Edition, 2008

References

Usability and Accessibility – STEC Workshop 2008, Whitney Quesenbery

Recommendations on Recommendations, Rolf Molich, Kasper Hornbaek, Steve Krug, Jeff Johnson, Josephine Scott, 2008, accepted for publication in User Experience Magazine, issue 7.4, October 2008


Appendix A: Participant Recruitment Screener

The usability test of the X Product requires 12 participants from 2 user groups.

Experienced product users – 6 participants: Current product users/customers who have used X Product for at least 1 year and use it at least 3 times a month; 3 males, 3 females.

New product users – 6 participants: People who have no prior experience with X Product, but do have at least 1 year’s experience using similar products (e.g. data processing tools); 3 males, 3 females.

Participation: All participants will spend about 60 minutes in the usability session. Incentive will be $50 in cash.

Schedule: The usability tests will be conducted from May 5-7, 2008. Use the schedule of available testing time slots to schedule individual participants once they have passed the recruitment screener.

Available time slots (each day, Tues. May 5 through Thurs. May 7): 9-10 am, 10:30-11:30 am, 1-2 pm, 2:30-3:30 pm, 4-5 pm.

Recruitment Script

Introduction

Hello, may I speak with ________. We are looking for participants to take part in a research study evaluating the usability of the X Product. There will be $50 cash in compensation for the hour-long session, which will take place at the X Building located downtown. The session would involve a one-on-one meeting with a researcher where you would sit down in front of a computer and try to use a product while being observed and answering questions about the product.

Would you be interested in participating?

If not: Thank you for taking the time to speak with me. If you know of anyone else who might be interested in participating please have them call me, [Name], at 555-1234.


Screening

I need to ask you a couple of questions to determine whether you meet the eligibility criteria—Do you have a couple of minutes?

If not: When is a good time to call back?

Keep in mind that your answers to these questions do not automatically allow or disallow you to take part in the study—we just need accurate information about your background, so please answer as well as you can.

Have you ever used X product?

If yes: How long have you used it for? [criteria: at least 1 yr.]

And how often do you use it? [criteria: at least 3 times a month]

If no: Have you ever used any data processing products, such as [list competitor or similar products]? [criteria: Yes]

If yes: How long have you used it for? [criteria: at least 1 yr.]

And how often do you use it? [criteria: at least 3 times a month]

Self-identify participant gender via voice and name and other cues.

Scheduling

If participant meets criteria: Will you be able to come to the X Building located downtown for one hour between May 5 and 7? Free parking is available next to the building.

How is [name available times and dates]?

You will be participating in a one-on-one usability test session on [date and time]. Do you require any special accommodations?

I need to have an e-mail address to send specific directions and confirmation information to. Thanks again!

If participant does not meet criteria: Unfortunately, you do not fit the criteria for this particular evaluation and will not be able to participate. Thank you for taking the time to speak with me.

The screener questions in this script can also be used in an email for written recruitment.


Chapter 13

Functional Testing

A functional specification is a description of intended program behavior,[1] distinct from the program itself. Whatever form the functional specification takes — whether formal or informal — it is the most important source of information for designing tests. The set of activities for deriving test case specifications from program specifications is called functional testing.

Functional testing, or more precisely, functional test case design, attempts to answer the question “What test cases shall I use to exercise my program?” considering only the specification of a program and not its design or implementation structure. Being based on program specifications and not on the internals of the code, functional testing is also called specification-based or black-box testing.

Functional testing is typically the baseline technique for designing test cases, for a number of reasons. Functional test case design can (and should) begin as part of the requirements specification process, and continue through each level of design and interface specification; it is the only test design technique with such wide and early applicability. Moreover, functional testing is effective in finding some classes of fault that typically elude so-called “white-box” or “glass-box” techniques of structural or fault-based testing. Functional testing techniques can be applied to any description of program behavior, from an informal partial description to a formal specification, and at any level of granularity, from module to system testing. Finally, functional test cases are typically less expensive to design and execute than white-box tests.

1 In this chapter we use the term "program" generically for the artifact under test, whether that artifact is a complete application or an individual unit together with a test harness. This is consistent with usage in the testing research literature.

Required Background

- Chapters 14 and 15: The material on control and data flow graphs is required to understand Section 13.7, but it is not necessary to comprehend the rest of the chapter.

- Chapter 27: The definition of pre- and post-conditions can be helpful in understanding Section 13.8, but it is not necessary to comprehend the rest of the chapter.

13.1 Overview

In testing and analysis aimed at verification2 — that is, at finding any discrepancies between what a program does and what it is intended to do — one must obviously refer to requirements as expressed by users and specified by software engineers. A functional specification, i.e., a description of the expected behavior of the program, is the primary source of information for test case specification.

Functional testing, also known as black-box or specification-based testing, denotes techniques that derive test cases from functional specifications.

Usually functional testing techniques produce test case specifications that identify classes of test cases and can be instantiated to produce individual test cases.

A particular functional testing technique may be effective only for some kinds of software or may require a given specification style. For example, a combinatorial approach may work well for functional units characterized by a large number of relatively independent inputs, but may be less effective for functional units characterized by complex interrelations among inputs. Functional testing techniques designed for a given specification notation, e.g., finite state machines or grammars, are not easily applicable to other specification styles.

The core of functional test case design is partitioning the possible behaviors of the program into a finite number of classes that can reasonably be expected to be consistently correct or incorrect. In practice, the test case designer often must also complete the job of formalizing the specification far enough to serve as the basis for identifying classes of behaviors. An important side effect of test design is highlighting weaknesses and incompleteness of program specifications.

Deriving functional test cases is an analytical process which decomposes specifications into test cases.

2 Here we focus on software verification as opposed to validation (see Chapter 2). The problems of validating the software and its specifications, i.e., checking the program behavior and its specifications with respect to the users' expectations, are treated in Chapter 12.

Functional versus Structural Testing

Test cases and test suites can be derived from several sources of information, including specifications (functional testing), detailed design and source code (structural testing), and hypothesized defects (fault-based testing). Functional test case design is an indispensable base of a good test suite, complemented but never replaced by structural and fault-based testing, because there are classes of faults that only functional testing effectively detects. Omission of a feature, for example, is unlikely to be revealed by techniques which refer only to the code structure.

Consider a program that is supposed to accept files in either plain ASCII text, or HTML, or PDF formats and generate standard PostScript. Suppose the programmer overlooks the PDF functionality, so the program accepts only plain text and HTML files. Intuitively, a functional testing criterion would require at least one test case for each item in the specification, regardless of the implementation, i.e., it would require the program to be exercised with at least one ASCII, one HTML, and one PDF file, thus easily revealing the failure due to the missing code. In contrast, a criterion based solely on the code would not require the program to be exercised with a PDF file, since all of the code can be exercised without attempting to use that feature. Similarly, fault-based techniques, based on potential faults in design or coding, would not have any reason to indicate a PDF file as a potential input even if "missing case" were included in the catalog of potential faults.

A functional specification often addresses semantically rich domains, and we can use domain information in addition to the cases explicitly enumerated in the program specification. For example, while a program may manipulate a string of up to nine alphanumeric characters, the program specification may reveal that these characters represent a postal code, which immediately suggests test cases based on postal codes of various localities. Suppose the program logic distinguishes only two cases, depending on whether the code is found in a table of U.S. zip codes. A structural testing criterion would require testing of valid and invalid U.S. zip codes, but only consideration of the specification and richer knowledge of the domain would suggest test cases that reveal missing logic for distinguishing between U.S.-bound mail with invalid U.S. zip codes and mail bound to other countries.

Functional testing can be applied at any level of granularity where some form of specification is available, from overall system testing to individual units, although the level of granularity and the type of software influence the choice of the specification styles and notations, and consequently the functional testing techniques that can be used.

In contrast, structural and fault-based testing techniques are invariably tied to program structures at some particular level of granularity, and do not scale much beyond that level. The most common structural testing techniques are tied to fine-grain program structures (statements, classes, etc.) and are applicable only at the level of modules or small collections of modules (small subsystems, components, or libraries).

The myriad of aspects that must be taken into account during functional test case specification makes the process error prone. Even expert test designers can miss important test cases. A methodology for functional test design helps by systematically decomposing the functional test design activity into elementary steps that cope with a single aspect of the process. In this way, it is possible to master the complexity of the process and separate human-intensive activities from activities that can be automated. Systematic processes amplify but do not substitute for the skills and experience of the test designers.

In a few cases, functional testing can be fully automated. This is possible, for example, when specifications are given in terms of some formal model, e.g., a grammar or an extended state machine specification. In these (exceptional) cases, the creative work is performed during specification and design of the software. The test designer's job is then limited to the choice of the test selection criteria, which defines the strategy for generating test case specifications. In most cases, however, functional test design is a human-intensive activity. For example, when test designers must work from informal specifications written in natural language, much of the work is in structuring the specification adequately for identifying test cases.

13.2 Random versus Partition Testing Strategies

With few exceptions, the number of potential test cases for a given program is unimaginably huge — so large that for all practical purposes it can be considered infinite. For example, even a simple function whose input arguments are two 32-bit integers has 2^32 x 2^32 = 2^64 (about 1.8 x 10^19) legal inputs. In contrast to input spaces, budgets and schedules are finite, so any practical method for testing must select an infinitesimally small portion of the complete input space.

Some test cases are better than others, in the sense that some reveal faults and others do not.3 Of course, we cannot know in advance which test cases reveal faults. At a minimum, though, we can observe that running the same test case again is less likely to reveal a fault than running a different test case, and we may reasonably hypothesize that a test case that is very different from the test cases that precede it is more valuable than a test case that is very similar (in some sense yet to be defined) to others.

As an extreme example, suppose we are allowed to select only three test cases for a program that breaks a text buffer into lines of 60 characters each. Suppose the first test case is a buffer containing 40 characters, and the second is a buffer containing 30 characters. As a final test case, we can choose a buffer containing 16 characters or a buffer containing 100 characters. Although we cannot prove that the 100 character buffer is the better test case (and it might not be; the fact that 16 is a power of 2 might have some unforeseen significance), we are naturally suspicious of a set of tests which is strongly biased toward lengths less than 60.

3 Note that the relative value of different test cases would be quite different if our goal were to measure dependability, rather than finding faults so that they can be repaired.

Testing Terms

While the informal meanings of words like "test" may be adequate for everyday conversation, in this context we must try to use terms in a more precise and consistent manner. Unfortunately, the terms we will need are not always used consistently in the literature, despite the existence of an IEEE standard that defines several of them. The terms we will use are defined below.

Independently testable feature (ITF): An ITF is a functionality that can be tested independently of other functionalities of the software under test. It need not correspond to a unit or subsystem of the software. For example, a file sorting utility may be capable of merging two sorted files, and it may be possible to test the sorting and merging functionalities separately, even though both features are implemented by much of the same source code. (The nearest IEEE standard term is "test item.")

As functional testing can be applied at many different granularities, from unit testing through integration and system testing, so ITFs may range from the functionality of an individual Java class or C function up to features of an integrated system composed of many complete programs. The granularity of an ITF depends on the exposed interface at whichever granularity is being tested. For example, individual methods of a class are part of the interface of the class, and a set of related methods (or even a single method) might be an ITF for unit testing, but for system testing the ITFs would be features visible through a user interface or application programming interface.

Test case: A test case is a set of inputs, execution conditions, and expected results. The term "input" is used in a very broad sense, which may include all kinds of stimuli that contribute to determining program behavior. For example, an interrupt is as much an input as is a file. (This usage follows the IEEE standard.)

Test case specification: The distinction between a test case specification and a test case is similar to the distinction between a program and a program specification. Many different test cases may satisfy a single test case specification. A simple test specification for a sorting method might require an input sequence that is already in sorted order. A test case satisfying that specification might be sorting the particular vector ("alpha," "beta," "delta"). (This usage follows the IEEE standard.)

Test suite: A test suite is a set of test cases. Typically, a method for functional testing is concerned with creating a test suite. A test suite for a program, a system, or an individual unit may be made up of several test suites for individual ITFs. (This usage follows the IEEE standard.)

Test: We use the term test to refer to the activity of executing test cases and evaluating their results. When we refer to "a test," we mean execution of a single test case, except where context makes it clear that the reference is to execution of a whole test suite. (The IEEE standard allows this and other definitions.)

Accidental bias may be avoided by choosing test cases from a random distribution. Random sampling is often an inexpensive way to produce a large number of test cases. If we assume absolutely no knowledge on which to place a higher value on one test case than another, then random sampling maximizes value by maximizing the number of test cases that can be created (without bias) for a given budget. Even if we do possess some knowledge suggesting that some cases are more valuable than others, the efficiency of random sampling may in some cases outweigh its inability to use any knowledge we may have.

Consider again the line-break program, and suppose that our budget is one day of testing effort rather than some arbitrary number of test cases. If the cost of random selection and actual execution of test cases is small enough, then we may prefer to run a large number of random test cases rather than expending more effort on each of a smaller number of test cases. We may in a few hours construct programs that generate buffers with various contents and lengths up to a few thousand characters, as well as an automated procedure for checking the program output. Letting it run unattended overnight, we may execute a few million test cases. If the program does not correctly handle a buffer containing a sequence of more than 60 non-blank characters (a single "word" that does not fit on a line), we are likely to encounter this case by sheer luck if we execute enough random tests, even without having explicitly considered this case.
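To make this concrete, here is a minimal sketch (not taken from the text) of such an overnight random-testing harness in Java. The method name breakLines, the 60-character limit, and the buffer alphabet are assumptions standing in for the line-break program described above.

import java.util.Random;

public class RandomLineBreakTest {

    // Placeholder for the program under test: assumed to break a buffer into
    // lines of at most 60 characters, breaking only at blanks.
    static String breakLines(String buffer) {
        return buffer; // replace with the real implementation
    }

    public static void main(String[] args) {
        Random rng = new Random();
        for (int i = 0; i < 1_000_000; i++) {
            String buffer = randomBuffer(rng);
            String output = breakLines(buffer);
            // Automated check: no output line may exceed 60 characters.
            for (String line : output.split("\n", -1)) {
                if (line.length() > 60) {
                    System.out.println("Failure on a buffer of length " + buffer.length());
                    return;
                }
            }
        }
        System.out.println("No failure observed in one million random tests");
    }

    // Uniformly random buffers over a small alphabet plus the blank character,
    // with lengths up to a few thousand characters.
    static String randomBuffer(Random rng) {
        int length = rng.nextInt(3000);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(" abcdefghijklmnopqrstuvwxyz".charAt(rng.nextInt(27)));
        }
        return sb.toString();
    }
}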

Even a few million test cases is an infinitesimal fraction of the complete input space of most programs. Large numbers of random tests are unlikely to find failures at single points (singularities) in the input space. Consider, for example, a simple procedure for returning the two roots of a quadratic equation a*x^2 + b*x + c = 0, and suppose we choose test inputs (values of the coefficients a, b, and c) from a uniform distribution over a fixed range. While uniform random sampling would certainly cover cases in which b^2 - 4ac < 0 (where the equation has no real roots), it would be very unlikely to test the case in which a = 0 and b != 0, in which case a naive implementation of the quadratic formula

    x = (-b ± sqrt(b^2 - 4ac)) / (2a)

will divide by zero (see Figure 13.1).

Of course, it is unlikely that anyone would test only with random values. Regardless of the overall testing strategy, most test designers will also try some "special" values. The test designer's intuition comports with the observation that random sampling is an ineffective way to find singularities in a large input space. The observation about singularities can be generalized to any characteristic of input data that defines an infinitesimally small portion of the complete input data space. If again we have just three real-valued inputs a, b, and c, there is an infinite number of choices for which b^2 = 4ac, but random sampling is unlikely to generate any of them because they are an infinitesimal part of the complete input data space.

/** A class roots that finds the roots of the quadratic
    equation a*x^2 + b*x + c = 0. */
class roots {
    double root_one, root_two;
    int num_roots;

    public roots(double a, double b, double c) {
        double q;
        double r;
        // Apply the textbook quadratic formula:
        //   roots = (-b +- sqrt(b^2 - 4ac)) / (2a)
        q = b*b - 4*a*c;
        if (q > 0 && a != 0) {
            // If b^2 > 4ac there are two distinct roots
            num_roots = 2;
            r = (double) Math.sqrt(q);
            root_one = ((0 - b) + r) / (2 * a);
            root_two = ((0 - b) - r) / (2 * a);
        } else if (q == 0) {
            // The equation has exactly one root when b^2 = 4ac
            // (note: no special treatment of a == 0 in this branch)
            num_roots = 1;
            root_one = (0 - b) / (2 * a);
            root_two = root_one;
        } else {
            // The equation has no real roots if b^2 < 4ac
            num_roots = 0;
            root_one = -1;
            root_two = -1;
        }
    }

    public int num_roots() { return num_roots; }
    public double first_root() { return root_one; }
    public double second_root() { return root_two; }
}

Figure 13.1: The Java class "roots," which finds roots of a quadratic equation. The case analysis in the implementation is incomplete: it does not properly handle the case in which b^2 - 4ac = 0 and a = 0. We cannot anticipate all such faults, but experience teaches that boundary values identifiable in a specification are disproportionately valuable. Uniform random generation of even large numbers of test cases is ineffective at finding the fault in this program, but selection of a few "special values" based on the specification quickly uncovers it.
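As a concrete illustration (a sketch that is not part of the original text and assumes the class and method names of the listing above), the following driver exercises roots with a specification-suggested special value. Note that in Java, dividing 0.0 by 0.0 yields NaN rather than raising an exception, so the check looks for NaN or infinite results.

public class RootsSpecialValueTest {
    public static void main(String[] args) {
        // a = 0 is a boundary case suggested by the specification
        // ("the coefficients of a quadratic equation") but is essentially
        // never produced by uniform random sampling of the coefficients.
        roots r = new roots(0.0, 0.0, 0.0);
        double x = r.first_root();
        if (r.num_roots() > 0 && (Double.isNaN(x) || Double.isInfinite(x))) {
            System.out.println("Fault revealed: the case a = 0 is not handled");
        }
    }
}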

The observation about special values and random samples is by no means limited to numbers. Consider again, for example, breaking a text buffer into lines. Since line breaks are permitted at blanks, we would consider blanks a "special" value for this problem. While random sampling from the character set is likely to produce a buffer containing a sequence of at least 60 non-blank characters, it is much less likely to produce a sequence of 60 blanks.

The reader may justifiably object that a reasonable test designer would not create text buffer test cases by sampling uniformly from the set of all characters, but would instead classify characters depending on their treatment, lumping alphabetic characters into one class and white space characters into another. In other words, a test designer will partition the input space into classes, and will then generate test data in a manner that is likely to choose data from each partition.4 Test designers seldom use pure random sampling; usually they exploit some knowledge of application semantics to choose samples that are more likely to include "special" or trouble-prone regions of the input space.
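For instance, a test designer for the line-break program might enumerate one representative buffer per class of inputs, including classes that uniform random sampling would almost never produce. The sketch below is illustrative only; the class boundaries are assumptions, not part of the text.

import java.util.List;

public class PartitionedBufferTests {

    static String repeat(char c, int n) { return String.valueOf(c).repeat(n); }

    public static void main(String[] args) {
        // One representative buffer per partition class.
        List<String> testBuffers = List.of(
            "",                                           // empty buffer
            repeat('a', 30) + " " + repeat('b', 25),      // fits on a single line
            repeat('a', 61),                              // one "word" longer than 60 characters
            repeat(' ', 61),                              // a long run of blanks
            (repeat('a', 10) + " ").repeat(10).trim()     // several words that must be broken
        );
        // Each buffer would then be passed to the line-break program and checked.
        testBuffers.forEach(b -> System.out.println("buffer of length " + b.length()));
    }
}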

A testing method that divides the infinite set of possible test cases into a finite set of classes, with the purpose of drawing one or more test cases from each class, is called a partition testing method. When partitions are chosen according to information in the specification, rather than the design or implementation, it is called specification-based partition testing, or more briefly, functional testing. Note that not all testing of product functionality is "functional testing." Rather, the term is used specifically to refer to systematic testing based on a functional specification. It excludes ad hoc and random testing, as well as testing based on the structure of a design or implementation.

Partition testing typically increases the cost of each test case, since in addition to generation of a set of classes, creation of test cases from each class may be more expensive than generating random test data. In consequence, partition testing usually produces fewer test cases than random testing for the same expenditure of time and money. Partitioning can therefore be advantageous only if the average value (fault-detection effectiveness) is greater.

If we were able to group together test cases with such perfect knowledge that the outcome of test cases in each class were uniform (either all successes, or all failures), then partition testing would be at its theoretical best. In general we cannot do that, nor even quantify the uniformity of classes of test cases. Partitioning by any means, including specification-based partition testing, is always based on experience and judgment that leads one to believe that certain classes of test case are "more alike" than others, in the sense that failure-prone test cases are likely to be concentrated in some classes. When we appealed above to the test designer's intuition that one should try boundary cases and special values, we were actually appealing to a combination of experience (many failures occur at boundary and special cases) and

4 We are using the term "partition" in a common but rather sloppy sense. A true partition would separate the input space into disjoint classes, the union of which is the entire space. Partition testing separates the input space into classes whose union is the entire space, but the classes may not be disjoint.

knowledge that identifiable cases in the specification often correspond to classes of input that require different treatment by an implementation.

Given a fixed budget, the optimum may not lie in only partition testing or only random testing, but in some mix that makes use of available knowledge. For example, consider again the simple numeric problem with three inputs, a, b, and c. We might consider a few special cases of each input, individually and in combination, and we might also consider a few potentially significant relationships (e.g., b^2 = 4ac). If no faults are revealed by these few test cases, there is little point in producing further arbitrary partitions — one might then turn to random generation of a large number of test cases.

13.3 A Systematic Approach

Deriving test cases from functional specifications is a complex analytical process that partitions the input space described by the program specification. Brute force generation of test cases, i.e., direct generation of test cases from program specifications, seldom produces acceptable results: test cases are generated without particular criteria, and determining the adequacy of the generated test cases is almost impossible. Brute force generation of test cases relies on test designers' expertise and is a process that is difficult to monitor and repeat. A systematic approach simplifies the overall process by dividing it into elementary steps, thus decoupling different activities, dividing brain-intensive from automatable steps, suggesting criteria to identify adequate sets of test cases, and providing an effective means of monitoring the testing activity.

Although suitable functional testing techniques can be found for any granularity level, a particular functional testing technique may be effective only for some kinds of software or may require a given specification style. For example, a combinatorial approach may work well for functional units characterized by a large number of relatively independent inputs, but may be less effective for functional units characterized by complex interrelations among inputs. Functional testing techniques designed for a given specification notation, e.g., finite state machines or grammars, are not easily applicable to other specification styles. Nonetheless we can identify a general pattern of activities that captures the essential steps in a variety of different functional test design techniques. By describing particular functional testing techniques as instantiations of this general pattern, relations among the techniques may become clearer, and the test designer may gain some insight into adapting and extending these techniques to the characteristics of other applications and situations.

Figure 13.2 identifies the general steps of systematic approaches. The steps may be difficult or trivial depending on the application domain and the available program specifications. Some steps may be omitted depending on the application domain, the available specifications and the test designers' expertise. Instances of the process can be obtained by suitably instantiating different steps. Although most techniques are presented and applied as stand-alone methods, it is also possible to mix and match steps from different techniques, or to apply different methods for different parts of the system to be tested.

Identify Independently Testable Features  Functional specifications can be large and complex. Usually, complex specifications describe systems that can be decomposed into distinct features. For example, the specification of a web site may include features for searching the site database, registering users' profiles, getting and storing information provided by the users in different forms, etc. The specification of each of these features may comprise several functionalities. For example, the search feature may include functionalities for editing a search pattern, searching the database with a given pattern, and so on. Although it is possible to design test cases that exercise several functionalities at once, the design of different tests for different functionalities can simplify the test generation problem, allowing each functionality to be examined separately. Moreover, it eases locating faults that cause the revealed failures. It is thus recommended to devise separate test cases for each functionality of the system, whenever possible.

The preliminary step of functional testing consists in partitioning the specifications into features that can be tested separately. This can be an easy step for well designed, modular specifications, but informal specifications of large systems may be difficult to decompose into independently testable features. Some degree of formality, at least to the point of careful definition and use of terms, is usually required.

Identification of functional features that can be tested separately is different from module decomposition. In both cases we apply the divide and conquer principle, but in the former case we partition specifications according to the functional behavior as perceived by the users of the software under test,5 while in the latter we identify logical units that can be implemented separately. For example, a web site may require a sort function, as a service routine, that does not correspond to an external functionality. The sort function may be a functional feature at module testing, when the program under test is the sort function itself, but is not a functional feature at system test, while deriving test cases from the specifications of the whole web site. On the other hand, the registration of a new user profile can be identified as one of the functional features at system-level testing, even if such functionality is implemented with several modules (units at the design level) of the system. Thus, identifying functional features does not correspond to identifying single modules at the design level, but rather to suitably slicing the specifications to be able to attack their complexity incrementally, aiming at deriving useful test cases for the whole system under test.

5 Here the word user indicates whoever uses the specified service. It can be the user of the system, when dealing with specifications at system level; but it can be another module of the system, when dealing with specifications at unit level.

[Figure 13.2 is a flowchart leading from Functional Specifications through Independently Testable Features, Representative Values or a Model, Test Case Specifications, and Test Cases to test instantiation and scaffolding. Its labeled steps are: Identify Independently Testable Features; Identify Representative Values or Derive a Model (e.g., finite state machine, grammar, algebraic specification, logic specification, control/data flow graph); Generate Test-Case Specifications (via semantic constraints, combinatorial selection, exhaustive enumeration, or random selection, guided by test selection criteria); and Generate Test Cases and Instantiate Tests (manual mapping, symbolic execution, a-posteriori satisfaction). A brute-force testing path leads directly from the specifications to test cases.]

Figure 13.2: The main steps of a systematic approach to functional program testing.

Independently testable features are described by identifying all the inputs that form their execution environments. Inputs may be given in different forms depending on the notation used to express the specifications. In some cases they may be easily identifiable. For example, they can be the input alphabet of a finite state machine specifying the behavior of the system. In other cases, they may be hidden in the specification. This is often the case of informal specifications, where some inputs may be given explicitly as parameters of the functional unit, but other inputs may be left implicit in the description. For example, a description of how a new user registers at a web site may explicitly indicate the data that constitutes the user profile to be inserted as parameters of the functional unit, but may leave implicit the collection of elements (e.g., a database) in which the new profile must be inserted.

Trying to identify inputs may help in distinguishing different functions. For example, trying to identify the inputs of a graphical tool may lead to a clearer distinction between the graphical interface per se and the associated callbacks to the application. With respect to the web-based user registration function, the data to be inserted in the database are part of the execution environment of the functional unit that performs the insertion of the user profile, while the combination of fields that can be used to construct such data is part of the execution environment of the functional unit that takes care of the management of the specific graphical interface.

Identify Representative Classes of Values or Derive a Model  The execution environment of the feature under test determines the form of the final test cases, which are given as combinations of values for the inputs to the unit. The next step of a testing process consists of identifying which values of each input can be chosen to form test cases. Representative values can be identified directly from informal specifications expressed in natural language. Alternatively, representative values may be selected indirectly through a model, which can either be produced only for the sake of testing or be available as part of the specification. In both cases, the aim of this step is to identify the values for each input in isolation, either explicitly through enumeration, or implicitly through a suitable model, but not to select suitable combinations of such values, i.e., test case specifications. In this way, we separate the problem of identifying the representative values for each input from the problem of combining them to obtain meaningful test cases, thus splitting a complex step into two simpler steps.

Most methods that can be applied to informal specifications rely on explicit enumeration of representative values by the test designer. In this case, it is very important to consider all possible cases and take advantage of the information provided by the specification. We may identify different categories of expected values, as well as boundary and exceptional or erroneous values. For example, when considering operations on non-empty lists of elements, we may distinguish the cases of the empty list (an error value) and a singleton list (a boundary value) as special cases. Usually this step determines characteristics of values (e.g., any list with a single element) rather than actual values.
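A small sketch of how such classes of values might be recorded, with one concrete representative per class (the names and representatives are illustrative, not taken from the text):

import java.util.List;

public class ListValueClasses {
    // One representative per value class for a parameter specified as
    // "a non-empty list of elements".
    static final List<String> EMPTY     = List.of();                          // error value
    static final List<String> SINGLETON = List.of("alpha");                   // boundary value
    static final List<String> ORDINARY  = List.of("beta", "alpha", "delta");  // interior case

    public static void main(String[] args) {
        for (List<String> representative : List.of(EMPTY, SINGLETON, ORDINARY)) {
            System.out.println(representative.size() + " element(s): " + representative);
        }
    }
}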

Implicit enumeration requires the construction of a (partial) model of the specifications. Such a model may be already available as part of a specification or design model, but more often it must be constructed by the test designer, in consultation with other designers. For example, a specification given as a finite state machine implicitly identifies different values for the inputs by means of the transitions triggered by the different values. In some cases, we can construct a partial model as a means for identifying different values for the inputs. For example, we may derive a grammar from a specification and thus identify different values according to the legal sequences of productions of the given grammar.

Directly enumerating representative values may appear simpler and less expensive than producing a suitable model from which values may be derived. However, a formal model may also be valuable in subsequent steps of test case design, including selection of combinations of values. Also, a formal model may make it easier to select a larger or smaller number of test cases, balancing cost and thoroughness, and may be less costly to modify and reuse as the system under test evolves. Whether to invest effort in producing a model is ultimately a management decision that depends on the application domain, the skills of test designers, and the availability of suitable tools.

Generate Test Case Specifications  Test specifications are obtained by suitably combining values for all inputs of the functional unit under test. If representative values were explicitly enumerated in the previous step, then test case specifications will be elements of the Cartesian product of values selected for each input. If a formal model was produced, then test case specifications will be specific behaviors or combinations of parameters of the model, and a single test case specification could be satisfied by many different concrete inputs. Either way, brute force enumeration of all combinations is unlikely to be satisfactory.

The number of combinations in the Cartesian product of independently selected values grows as the product of the sizes of the individual sets. For a simple functional unit with five inputs, each characterized by six values, the size of the Cartesian product is 6^5 = 7776 test case specifications, which may be an impractical number of test cases for a simple functional unit. Moreover, if (as is usual) the characteristics are not completely orthogonal, many of these combinations may not even be feasible.

Consider the input of a function that searches for occurrences of a complex pattern in a web database. Its input may be characterized by the length of the pattern and the presence of special characters in the pattern, among other aspects. Interesting values for the length of the pattern may be zero, one, or many. Interesting values for the presence of special characters may likewise be zero, one, or many. However, the combination of value "zero" for the length of the pattern and value "many" for the number of special characters in the pattern is clearly impossible.

The test case specifications represented by the Cartesian product of all possible inputs must be restricted by ruling out illegal combinations and selecting a practical subset of the legal combinations. Illegal combinations are usually eliminated by constraining the set of combinations. For example, in the case of the complex pattern presented above, we can constrain the choice of one or more special characters to a positive length of the pattern, thus ruling out the illegal cases of patterns of length zero containing special characters.

Selection of a practical subset of legal combinations can be done by adding information that reflects the hazard of the different combinations as perceived by the test designer, or by following combinatorial considerations. In the former case, for example, we can identify exceptional values and limit the combinations that contain such values. In the pattern example, we may consider only one test for patterns of length zero, thus eliminating many combinations that can be derived for patterns of length zero. Combinatorial considerations reduce the set of test cases by limiting the number of combinations of values of different inputs to a subset of the inputs. For example, we can generate only tests that exhaustively cover all combinations of values for inputs considered pair by pair.
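The following sketch enumerates the Cartesian product of the two pattern characteristics discussed above and applies the constraint that special characters require a pattern of positive length. The characteristic and value-class names are assumptions chosen for illustration.

import java.util.ArrayList;
import java.util.List;

public class PatternSearchCombinations {
    public static void main(String[] args) {
        List<String> patternLength = List.of("zero", "one", "many");
        List<String> specialChars  = List.of("zero", "one", "many");

        List<String[]> testSpecs = new ArrayList<>();
        for (String length : patternLength) {
            for (String special : specialChars) {
                // Constraint: one or more special characters require a
                // positive pattern length, ruling out infeasible combinations.
                if (length.equals("zero") && !special.equals("zero")) continue;
                testSpecs.add(new String[] { length, special });
            }
        }
        // 3 x 3 = 9 raw combinations; the constraint removes 2, leaving 7.
        for (String[] spec : testSpecs) {
            System.out.println("length=" + spec[0] + ", special characters=" + spec[1]);
        }
    }
}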

Depending on the technique used to reduce the space represented by the Cartesian product, we may be able to estimate the number of test cases generated with the approach and modify the selected subset of test cases according to budget considerations. Subsets of combinations of values, i.e., potential special cases, can often be derived from models of behavior by applying suitable test selection criteria that identify subsets of interesting behaviors among all behaviors represented by a model, for example by constraining the iterations on simple elements of the model itself. In many cases, test selection criteria can be applied automatically.

Generate Test Cases and Instantiate Tests  The test generation process is completed by turning test case specifications into test cases and instantiating them. Test case specifications can be turned into test cases by selecting one or more test cases for each item of the test case specification.

13.4 Category-Partition Testing

Category-partition testing is a method for generating functional tests from informal specifications. The main steps covered by the core part of the category-partition method are:

A. Decompose the specification into independently testable features: Test designers identify features to be tested separately, and identify parameters and any other elements of the execution environment the unit depends on. Environment dependencies are treated identically to explicit parameters. For each parameter and environment element, test designers identify the elementary parameter characteristics, which in the category-partition method are usually called categories.

B. Identify Relevant Values: Test designers select a set of representative classes of values for each parameter characteristic. Values are selected in isolation, independent of other parameter characteristics. In the category-partition method, classes of values are called choices, and this activity is called partitioning the categories into choices.

C. Generate Test Case Specifications: Test designers indicate invalid combinations of values and restrict valid combinations of values by imposing semantic constraints on the identified values. Semantic constraints restrict the values that can be combined and identify values that need not be tested in different combinations, e.g., exceptional or invalid values.

Categories, choices, and constraints can be provided to a tool to automatically generate a set of test case specifications. Automating trivial and repetitive activities such as these makes better use of human resources and reduces errors due to distraction. Just as important, it is possible to determine the number of test cases that will be generated (by calculation, or by actually generating them) before investing any human effort in test execution. If the number of derivable test cases exceeds the budget for test execution and evaluation, test designers can reduce the number of test cases by imposing additional semantic constraints. Controlling the number of test cases before test execution begins is preferable to ad hoc approaches in which one may at first create very thorough test suites and then test less and less thoroughly as deadlines approach.
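As an illustration of counting test case specifications before any human effort is spent on them, the toy encoding below (not the category-partition tool's actual input format) applies the convention that each [error] or [single] choice is combined only once with non-error choices of the other categories. The two categories and their choices anticipate the Chipmunk example developed below.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CategoryPartitionCount {
    public static void main(String[] args) {
        // For each category: its ordinary choices and its [error]/[single] choices.
        Map<String, List<String>> ordinary = new LinkedHashMap<>();
        Map<String, List<String>> oneOff   = new LinkedHashMap<>();

        ordinary.put("number of models in database", List.of("1", "many"));
        oneOff.put("number of models in database", List.of("0 [error]"));

        ordinary.put("model number", List.of("valid"));
        oneOff.put("model number", List.of("malformed [error]", "not in database [error]"));

        long unrestricted = 1, combined = 1, singles = 0;
        for (String category : ordinary.keySet()) {
            int n = ordinary.get(category).size();
            int e = oneOff.get(category).size();
            unrestricted *= n + e;   // full Cartesian product
            combined *= n;           // combinations of ordinary choices only
            singles += e;            // each [error]/[single] choice is tried once
        }
        System.out.println("unrestricted combinations: " + unrestricted);              // 3 * 3 = 9
        System.out.println("with the [error]/[single] rule: " + (combined + singles)); // 2 + 3 = 5
    }
}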

We illustrate the category-partition method using a specification of a feature from the direct sales web site of Chipmunk Electronic Ventures. Customers are allowed to select and price custom configurations of Chipmunk computers. A configuration is a set of selected options for a particular model of computer. Some combinations of model and options are not valid (e.g., a digital LCD monitor with an analog video card), so configurations are tested for validity before they are priced. The check-configuration function (Figure 13.3) is given a model number and a set of components, and returns the boolean value True if the configuration is valid or False otherwise. This function has been selected by the test designers as an independently testable feature.

A. Identify Independently Testable Features and Parameter Characteristics  We assume that step A starts by selecting the Check-configuration feature to be tested independently of other features. This entails choosing to separate testing of the configuration check per se from its presentation through a user interface (e.g., a web form), and depends on the architectural design of the software system.

[Figure 13.3 gives the informal functional specification of Check-Configuration. It describes the two parameters of the function: Model, a model identifier that is valid only if it appears in the product database and that determines a set of slots for required components and a set of slots for optional components; and Set of Components, a collection of (slot, selection) pairs that must correspond to the slots of the selected model and contain only selections that appear in the product database and are compatible with the model and with each other.]

Figure 13.3: The functional specification of the feature Check-configuration of the web site of a computer manufacturer.

Step A requires the test designer to identify the parameter characteristics, i.e., the elementary characteristics of the parameters and environment elements that affect the unit's execution. A single parameter may have multiple elementary characteristics. A quick scan of the functional specification would indicate model and components as the parameters of check-configuration. More careful consideration reveals that what is "valid" must be determined by reference to additional information, and in fact the functional specification assumes the existence of a database of models and components. The database is an environment element that, although not explicitly mentioned in the functional specification, is required for executing and thus testing the feature, and partly determines its behavior. Note that our goal is not to test a particular configuration of the system with a fixed database, but to test the generic system, which may be configured through different database contents.

Having identified model, components, and product database as the parameters and environment elements required to test the Check-configuration functionality, the test designer would next identify the parameter characteristics of each.

Model may be represented as an integer, but we know that it is not to be used arithmetically, but rather serves as a key to the database and other tables. The specification mentions that a model is characterized by a set of slots for required components and a set of slots for optional components. We may identify model number, number of required slots, and number of optional slots as characteristics of parameter model.

Parameter components is a collection of (slot, selection) pairs. The size of a collection is always an important characteristic, and since components are further categorized as required or optional, the test designer may identify number of required components with non-empty selection and number of optional components with non-empty selection as characteristics. The matching between the tuple passed to Check-Configuration and the one actually required by the selected model is important and may be identified as category correspondence of selection with model slots. The actual selections are also significant, but for now the test designer simply identifies required component selection and optional component selection, postponing selection of relevant values to the next stage in test design.

The environment element product database is also a collection, so number of models in the database and number of components in the database are parameter characteristics. Actual values of database entries are deferred to the next step in test design.

There are no hard-and-fast rules for choosing categories, and it is not a trivial task. Categories reflect the test designer's judgment regarding which classes of values may be treated differently by an implementation, in addition to classes of values that are explicitly identified in the specification. Test designers must also use their experience and knowledge of the application domain and product architecture to look under the surface of the specification and identify hidden characteristics. For example, the specification fragment in Figure 13.3 makes no distinction between configurations of models with several required slots and models with none, but the experienced test designer has seen enough failures on "degenerate" inputs to test empty collections wherever a collection is allowed.

The number of options that can (or must) be configured for a particular model of computer may vary from model to model. However, the category-partition method makes no direct provision for structured data, such as sets of (slot, selection) pairs. A typical approach is to "flatten" collections and describe characteristics of the whole collection as parameter characteristics. Typically the size of the collection (the length of a string, for example, or in this case the number of required or optional slots) is one characteristic, and descriptions of possible combinations of elements (occurrence of special characters in a string, for example, or in this case the selection of required and optional components) are separate parameter characteristics.

Suppose the only significant variation among (slot, selection) pairs was between pairs that are compatible and pairs that are incompatible. If we treated each (slot, selection) pair as a separate characteristic, and assumed k slots, the category-partition method would generate all 2^k combinations of compatible and incompatible slots. Thus we might have a test case in which the first selected option is compatible, the second is compatible, and the third incompatible, and a different test case in which the first is compatible but the second and third are incompatible, and so on, and each of these combinations could be combined in several ways with other parameter characteristics. The number of combinations quickly explodes, and moreover, since the number of slots is not actually fixed, we cannot even place an upper bound on the number of combinations that must be considered. We will therefore choose the flattening approach and select possible patterns for the collection as a whole.

Should the representative values of the flattened collection of pairs be one compatible selection, one incompatible selection, all compatible selections, all incompatible selections, or should we also include a mix of two or more compatible and two or more incompatible selections? Certainly the latter is more thorough, but whether there is sufficient value to justify the cost of this thoroughness is a matter of judgment by the test designer.

We have oversimplified by considering only whether a selection is compatible with a slot. It might also happen that the selection does not appear in the database. Moreover, the selection might be incompatible with the model, or with a selected component of another slot, in addition to the possibility that it is incompatible with the slot for which it has been selected. If we treat each such possibility as a separate parameter characteristic, we will generate many combinations, and we will need semantic constraints to rule out combinations like there are three options, at least two of which are compatible with the model and two of which are not, and none of which appears in the database. On the other hand, if we simply enumerate the combinations that do make sense and are worth testing, then it becomes more difficult to be sure that no important combinations have been omitted. Like all design decisions, the way in which collections and complex data are broken into parameter characteristics requires judgment based on a combination of analysis and experience.

B. Identify Relevant Values  This step consists of identifying a list of relevant values (more precisely, a list of classes of relevant values) for each of the parameter characteristics identified during step A. Relevant values should be identified for each category independently, ignoring possible interactions among values for different categories, which are considered in the next step.

Relevant values may be identified by manually applying a set of rules known as boundary value testing or erroneous condition testing. The boundary value testing rule suggests selection of extreme values within a class (e.g., maximum and minimum values of the legal range), values outside but as close as possible to the class, and "interior" (non-extreme) values of the class. Values near the boundary of a class are often useful in detecting "off by one" errors in programs. The erroneous condition rule suggests selecting values that are outside the normal domain of the program, since experience suggests that proper handling of error cases is often overlooked.

Table 13.1 summarizes the parameter characteristics and the corresponding relevant values identified for feature Check-configuration.6 For numeric characteristics whose legal values have a lower bound of 1, i.e., number of models in database and number of components in database, we identify 0, the erroneous value, 1, the boundary value, and many, the class of values greater than 1, as the relevant value classes. For numeric characteristics whose lower bound is zero, i.e., number of required slots for selected model and number of optional slots for selected model, we identify 0 as a boundary value, and 1 and many as other relevant classes of values. Negative values are impossible here, so we do not add a negative error choice. For numeric characteristics whose legal values have definite lower and upper bounds, i.e., number of required components with non-empty selection and number of optional components with non-empty selection, we identify boundary and (when possible) erroneous conditions corresponding to both lower and upper bounds.

Identifying relevant values is an important but tedious task. Test designers may improve manual selection of relevant values by using the catalog approach described in Section 13.8, which captures the informal approaches used in this section with a systematic application of catalog entries.

C. Generate Test Case Specifications  A test case specification for a feature is given as a combination of values, one for each identified parameter characteristic. Unfortunately, the simple combination of all possible relevant values for each parameter characteristic results in an unmanageable number of test cases (many of which are impossible) even for simple specifications.

6 At this point, readers may ignore the items in square brackets, which indicate the constraints as identified in step C of the category-partition method.

Parameter: Model

Model number: malformed [error], not in database [error], valid

Number of required slots for selected model (#SMRS): 0, 1 [property RSNE], many [property RSNE]

Number of optional slots for selected model (#SMOS): 0, 1 [property OSNE], many [property OSNE]

Parameter: Components

Correspondence of selection with model slots: omitted slots [error], extra slots [error], mismatched slots [error], complete correspondence

Number of required components with selection ≠ empty: 0 [if RSNE], < number of required slots [if RSNE], = number of required slots

Number of optional components with selection ≠ empty: 0, < number of optional slots, = number of optional slots [if OSNE]

Required component selection: some default [single], all valid, ≥ 1 incompatible with slot, ≥ 1 incompatible with another selection, ≥ 1 incompatible with model, ≥ 1 not in database [error]

Optional component selection: some default [single], all valid, ≥ 1 incompatible with slot, ≥ 1 incompatible with another selection, ≥ 1 incompatible with model, ≥ 1 not in database [error]

Environment element: Product database

Number of models in database (#DBM): 0 [error], 1 [single], many

Number of components in database (#DBC): 0 [error], 1 [single], many

Table 13.1: An example category-partition test specification for the configuration checking feature of the web site of a computer vendor.

For example, in Table 13.1 we find 7 categories with 3 value classes, 2 categories with 6 value classes, and one with 4 value classes, potentially resulting in 3^7 x 6^2 x 4 = 314,928 test cases, which would be acceptable only if the cost of executing and checking each individual test case were very small. However, not all combinations of value classes correspond to reasonable test case specifications. For example, it is not possible to create a test case from a test case specification requiring a valid model (a model appearing in the database) where the database contains zero models.

The category-partition method allows one to omit some combinations by indicating value classes that need not be combined with all other values. The label [error] indicates a value class that need be tried only once, in combination with non-error values of other parameters. When [error] constraints are considered in the category-partition specification of Table 13.1, the number of combinations to be considered is reduced to 2,711. Note that we have treated "component not in database" as an error case, but have treated "incompatible with slot" as a normal case of an invalid configuration; once again, some judgment is required.

Although the reduction from 314,928 to 2,711 is impressive, the number of derived test cases may still exceed the budget for testing such a simple feature. Moreover, some values are not erroneous per se, but may only be useful or even valid in particular combinations. For example, the number of optional components with non-empty selection is relevant to choosing useful test cases only when the number of optional slots is greater than 1. A number of non-empty choices of required components greater than zero does not make sense if the number of required slots is zero.

Erroneous combinations of valid values can be ruled out with the property and if-property constraints. The property constraint groups values of a single parameter characteristic to identify subsets of values with common properties. The property constraint is indicated with label property PropertyName, where PropertyName identifies the property for later reference. For example, property RSNE (required slots non-empty) in Table 13.1 groups values that correspond to non-empty sets of required slots for the parameter characteristic Number of Required Slots for Selected Model (#SMRS), i.e., values 1 and many. Similarly, property OSNE (optional slots non-empty) groups non-empty values for the parameter characteristic Number of Optional Slots for Selected Model (#SMOS).

The if-property constraint bounds the choices of values for a parametercharacteristic once a specific value for a different parameter characteristichas been chosen. The if-property constraint is indicated with label if Proper-tyName, where PropertyName identifies a property defined with the propertyconstraint. For example, the constraint if RSNE attached to values 0 and �number of required slots of parameter characteristic Number of required com-ponents with selection � empty limits the combination of these values withthe values of the parameter characteristics Number of Required Slots for Se-lected Model (#SMRS), i.e., values 1 and many, thus ruling out the illegal com-bination of values 0 or�number of required slots for Number of required com-


Similarly, the if OSNE constraint limits the combinations of values of the parameter characteristics Number of optional components with selection ≠ empty and Number of Optional Slots for Selected Model (#SMOS).

The property and if-property constraints introduced in Table 13.1 further reduce the number of combinations to be considered. (Exercise Ex13.4 discusses the derivation of the exact number.)

The number of combinations can be further reduced by iteratively adding property and if-property constraints and by introducing the new single constraint, which is indicated with label single and acts like the error constraint, i.e., it limits the number of occurrences of a given value in the selected combinations to 1.

Introducing new property, if-property, and single constraints does not rule out erroneous combinations, but rather reflects the judgment of the test designer, who restricts the number of combinations to be considered by identifying single values (single constraint) or combinations (property and if-property constraints) that are less likely to need thorough testing.

The single constraints introduced in Table 13.1 further reduce the number of combinations to be considered, which may be a reasonable tradeoff between cost and quality for the considered functionality. The number of combinations can also be reduced by applying combinatorial techniques, as explained in the next section.

The set of combinations of values for the parameter characteristics can be turned into test case specifications by simply instantiating the identified combinations. Table 13.2 shows an excerpt of test case specifications. The error tag in the last column indicates test case specifications corresponding to the error constraint. Corresponding test cases should produce an error indication. A dash indicates no constraints on the choice of values for the parameter or environment element.

Choosing meaningful names for parameter characteristics and value classes allows (semi)automatic generation of test case specifications.
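
The following Python sketch illustrates one way such generation could work for the error and single constraints only; the SPEC data structure, the value classes shown in it, and the combinations function are hypothetical illustrations, not part of the category-partition method or of any particular tool.

    from itertools import product

    # Hypothetical encoding of a (tiny) category-partition specification:
    # each parameter characteristic maps to a list of (value class, tags),
    # where tags may contain "error" or "single".  Property and if-property
    # constraints are not modeled in this sketch.
    SPEC = {
        "model number": [("malformed", {"error"}),
                         ("not in database", {"error"}),
                         ("valid", set())],
        "number of models in database": [("0", {"error"}),
                                         ("1", {"single"}),
                                         ("many", set())],
    }

    def combinations(spec):
        """Cross product of the unconstrained value classes, plus one extra
        combination for every value class tagged [error] or [single]."""
        names = list(spec)
        unconstrained = [[v for v, tags in spec[n] if not tags] for n in names]
        combos = [dict(zip(names, choice)) for choice in product(*unconstrained)]
        for n in names:
            for value, tags in spec[n]:
                if tags:  # [error] / [single]: combine only once with normal values
                    base = {m: spec[m][-1][0] for m in names}  # last class is untagged here
                    base[n] = value
                    combos.append(base)
        return combos

    for combo in combinations(SPEC):
        print(combo)

Under this scheme, error and single values contribute one combination each instead of multiplying the total, which is the kind of reduction discussed above.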

13.5 The Combinatorial Approach

However one obtains sets of value classes for each parameter characteristic, the next step in producing test case specifications is selecting combinations of classes for testing. A simple approach is to exhaustively enumerate all possible combinations of classes, but the number of possible combinations rapidly explodes.

Some methods, such as the category-partition method described in the previous section, take exhaustive enumeration as a base approach to generating combinations, but allow the test designer to add constraints that limit growth in the number of combinations.


Table 13.2: An excerpt of test case specifications derived from the value classes given in Table 13.1

This can be a reasonable approach when the constraints on test case generation reflect real constraints in the application domain, and eliminate many redundant combinations (for example, the "error" entries in category-partition testing). It is less satisfactory when, lacking real constraints from the application domain, the test designer is forced to add arbitrary constraints (e.g., "single" entries in the category-partition method) whose sole purpose is to reduce the number of combinations.

Consider the parameters that control the Chipmunk web-site display, shown in Table 13.3. Exhaustive enumeration produces 432 combinations, which is too many if the test results (e.g., judging readability) involve human judgment. While the test designer might hypothesize some constraints, such as observing that monochrome displays are limited mostly to hand-held devices, radical reductions require adding several "single" and "property" constraints without any particular rationale.

Exhaustive enumeration of all n-way combinations of value classes for n parameters, on the one hand, and coverage of individual classes, on the other, are only the extreme ends of a spectrum of strategies for generating combinations of classes. Between them lie strategies that generate all pairs of classes for different parameters, all triples, and so on. When it is reasonable to expect some potential interaction between parameters (so coverage of individual value classes is deemed insufficient), but covering all combinations is impractical, an attractive alternative is to generate k-way combinations for k < n, typically pairs or triples.

How much does generating possible pairs of classes save, compared to generating all combinations? We have already observed that the number of all combinations is the product of the number of classes for each parameter, and that this product grows exponentially with the number of parameters. It turns out that the number of combinations needed to cover all possible pairs of values grows only logarithmically with the number of parameters, an enormous saving.


Display Mode: Full-graphics, Text-only, Limited-bandwidth
Language: English, French, Spanish, Portuguese
Fonts: Minimal, Standard, Document-loaded
Color: Monochrome, Color-map, 16-bit, True-color
Screen size: Hand-held, Laptop, Full-size

Table 13.3: Parameters and values controlling Chipmunk web-site display

A simple example may suffice to gain some intuition about the efficiency of generating tuples that cover pairs of classes, rather than all combinations. Suppose we have just the three parameters display mode, screen size, and fonts from Table 13.3. If we consider only the first two, display mode and screen size, the set of all pairs and the set of all combinations are identical, and contain 3 × 3 = 9 pairs of classes. When we add the third parameter, fonts, generating all combinations requires combining each value class from fonts with every pair of display mode × screen size, a total of 27 tuples; extending from n to n + 1 parameters is multiplicative. However, if we are generating pairs of values from display mode, screen size, and fonts, we can add value classes of fonts to existing elements of display mode × screen size in a way that covers all the pairs of fonts × screen size and all the pairs of fonts × display mode without increasing the number of combinations at all (see Table 13.4). The key is that each tuple of three elements contains three pairs, and by carefully selecting value classes of the tuples we can make each tuple cover up to three different pairs.

Table 13.5 shows 17 tuples that cover all pairwise combinations of value classes of the five parameters. The entries not specified in the table ("-") correspond to open choices. Each of them can be replaced by any legal value for the corresponding parameter. Leaving them open gives more freedom for selecting test cases.

Generating combinations that efficiently cover all pairs of classes (or triples, or more) is nearly impossible to perform manually for many parameters with many value classes (which is, of course, exactly when one really needs to use the approach). Fortunately, efficient heuristic algorithms exist for this task, and they are simple enough to incorporate in tools. (Exercise Ex13.12 discusses the problem of computing suitable combinations to cover all pairs.)


Display mode        Screen size   Fonts
Full-graphics       Hand-held     Minimal
Full-graphics       Laptop        Standard
Full-graphics       Full-size     Document-loaded
Text-only           Hand-held     Standard
Text-only           Laptop        Document-loaded
Text-only           Full-size     Minimal
Limited-bandwidth   Hand-held     Document-loaded
Limited-bandwidth   Laptop        Minimal
Limited-bandwidth   Full-size     Standard

Table 13.4: Covering all pairs of value classes for three parameters by extending the cross-product of two parameters

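One such heuristic is a greedy strategy: repeatedly pick the candidate tuple that covers the most not-yet-covered pairs. The sketch below is a simplified illustration of that idea (not the algorithm of any specific tool), using three of the Chipmunk display parameters; PARAMETERS, pairs_of, and greedy_pairwise are invented names.

    from itertools import combinations, product

    PARAMETERS = {
        "Display mode": ["Full-graphics", "Text-only", "Limited-bandwidth"],
        "Screen size": ["Hand-held", "Laptop", "Full-size"],
        "Fonts": ["Minimal", "Standard", "Document-loaded"],
    }

    def pairs_of(values, names):
        """All (parameter, value) pairs contained in one complete tuple."""
        return {((names[i], values[i]), (names[j], values[j]))
                for i, j in combinations(range(len(names)), 2)}

    def greedy_pairwise(parameters):
        names = list(parameters)
        uncovered = set()
        for i, j in combinations(range(len(names)), 2):
            for v, w in product(parameters[names[i]], parameters[names[j]]):
                uncovered.add(((names[i], v), (names[j], w)))
        suite = []
        while uncovered:
            # Brute-force candidate pool: fine for a toy example,
            # not for many parameters with many classes.
            best = max(product(*parameters.values()),
                       key=lambda t: len(pairs_of(t, names) & uncovered))
            suite.append(dict(zip(names, best)))
            uncovered -= pairs_of(best, names)
        return suite

    for test_case in greedy_pairwise(PARAMETERS):
        print(test_case)

For these three parameters the greedy loop ends with a small set of tuples, close to the nine of Table 13.4; forbidden combinations, such as the Monochrome constraints discussed below, could be handled by skipping candidates that contain them.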

The tuples in Table 13.5 cover all pairwise combinations of value choices for parameters. In many cases not all choices may be allowed. For example, the specification of the Chipmunk web-site display may indicate that monochrome displays are limited to hand-held devices. In this case, the tuples covering the pairs ⟨Monochrome, Laptop⟩ and ⟨Monochrome, Full-size⟩, i.e., the fifth and ninth tuples of Table 13.5, would not correspond to legal inputs. We can restrict the set of legal combinations of value classes by adding suitable constraints. Constraints can be expressed as tuples with wild-card characters to indicate any possible value class. For example, the constraints

⟨-, -, -, Monochrome, Laptop⟩
⟨-, -, -, Monochrome, Full-size⟩

indicate that tuples containing the pair ⟨Monochrome, Laptop⟩ or the pair ⟨Monochrome, Full-size⟩ as values for the fourth and fifth parameters are not allowed in the relation of Table 13.3. Tuples that cover all pairwise combinations of value classes without violating the constraints can be generated by simply removing the illegal tuples and adding legal tuples that cover the removed pairwise combinations. Open choices must be bound consistently in the remaining tuples, e.g., tuple

⟨Portuguese, Monochrome, Text-only, -, -⟩

must become

⟨Portuguese, Monochrome, Text-only, -, Hand-held⟩.

Constraints can also be expressed with sets of tables to indicate only the legal combinations, as illustrated in Table 13.6, where the first table indicates that the value class Hand-held for parameter Screen size can be combined with any value class of parameter Color, including Monochrome, while the second table indicates that the value classes Laptop and Full-size for parameter Screen size can be combined with all value classes but Monochrome for parameter Color.


Table 13.5: Covering all pairs of value classes for the five parameters

If constraints are expressed as a set of tables that give only legal combinations, tuples can be generated without changing the heuristic. Although the two approaches express the same constraints, the number of generated tuples can be different, since different tables may indicate overlapping pairs and thus result in a larger set of tuples. Other ways of expressing constraints may be chosen according to the characteristics of the specifications and the preferences of the test designer.

So far we have illustrated the combinatorial approach with pairwise coverage. As previously mentioned, the same approach can be applied for triples or larger combinations. Pairwise combinations may be sufficient for some subset of the parameters, but not enough to uncover potential interactions among other parameters. For example, in the Chipmunk display example, the fit of text fields to screen areas depends on the combination of language, fonts, and screen size. Thus, we may prefer exhaustive coverage of combinations of these three parameters, but be satisfied with pairwise coverage of other parameters. In this case, we first generate tuples of classes from the parameters to be most thoroughly covered, and then extend these with the parameters that require less coverage. (See exercise Ex13.14 for additional details.)


Hand-held devices

Display Mode: Full-graphics, Text-only, Limited-bandwidth
Language: English, French, Spanish, Portuguese
Fonts: Minimal, Standard, Document-loaded
Color: Monochrome, Color-map, 16-bit, True-color
Screen size: Hand-held

Laptop and Full-size devices

Display Mode: Full-graphics, Text-only, Limited-bandwidth
Language: English, French, Spanish, Portuguese
Fonts: Minimal, Standard, Document-loaded
Color: Color-map, 16-bit, True-color
Screen size: Laptop, Full-size

Table 13.6: Pairs of tables that indicate valid value classes for the Chipmunk web-site display controller



13.6 Testing Decision Structures

The combinatorial approaches described above primarily select combinations of orthogonal choices. They can accommodate constraints among choices, but their strength is in generating combinations of (purportedly) independent choices. Some specifications, formal and informal, have a structure that emphasizes the way particular combinations of parameters or their properties determine which of several potential outcomes is chosen. Results of the computation may be determined by boolean predicates on the inputs. In some cases, choices are specified explicitly as boolean expressions. More often, choices are described either informally or with tables or graphs that can assume various forms. When such a decision structure is present, it can play a part in choosing combinations of values for testing.

For example, the informal specification of Figure 13.4 describes outputs that depend on the type of account (educational, business, or individual), the amount of current and yearly purchases, and the availability of special prices. These can be considered as boolean conditions, e.g., the condition educational account is either true or false (even if the type of account is actually represented in some other manner). Outputs can be described as boolean expressions over the inputs, e.g., the output no discount can be associated with the boolean expression

    individual account ∧ current purchase < tier 1 individual threshold
                       ∧ special offer price > individual scheduled price
    ∨ business account ∧ current purchase < tier 1 business threshold
                       ∧ yearly purchase < tier 1 business yearly threshold
                       ∧ special offer price > business scheduled price

When functional specifications can be given as boolean expressions, a good test suite should exercise at least the effects of each elementary condition occurring in the expression. (In ad hoc testing, it is common to miss a bug in one elementary condition by choosing test cases in which it is "masked" by other conditions.) For simple conditions, we might derive test case specifications for all possible combinations of truth values of the elementary conditions. For complex formulas, testing all 2^n combinations of n elementary conditions is apt to be too expensive; we can select a much smaller subset of combinations that checks the effect of each elementary condition. A good way of exercising all elementary conditions with a limited number of test cases is deriving a set of combinations such that each elementary condition can be shown to independently affect the outcome.
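
The idea of a condition independently affecting the outcome can be made concrete by brute force for small formulas: for each elementary condition, search for two assignments that differ only in that condition and give different results. The predicate no_discount below is a simplified stand-in for the expression above, and all names are illustrative.

    from itertools import product

    CONDITIONS = ["individual account", "current purchase above tier 1",
                  "special price below scheduled"]

    def no_discount(v):
        # Simplified stand-in for the "no discount" expression above.
        return (v["individual account"]
                and not v["current purchase above tier 1"]
                and not v["special price below scheduled"])

    def independence_pairs(conditions, predicate):
        """For each elementary condition, a pair of assignments that differ
        only in that condition and produce different outcomes."""
        pairs = {}
        for values in product([True, False], repeat=len(conditions)):
            v = dict(zip(conditions, values))
            for c in conditions:
                flipped = {**v, c: not v[c]}
                if c not in pairs and predicate(v) != predicate(flipped):
                    pairs[c] = (v, flipped)
        return pairs

    for condition, (a, b) in independence_pairs(CONDITIONS, no_discount).items():
        print(condition, "->", no_discount(a), "vs", no_discount(b))

Collecting the assignments from these pairs (duplicates removed) gives an MC/DC-adequate set of four combinations for this small predicate, instead of the eight required by compound condition coverage.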



Pricing: The pricing function determines the adjusted price of a configuration for a particular customer. The scheduled price of a configuration is the sum of the scheduled price of the model and the scheduled price of each component in the configuration. The adjusted price is either the scheduled price, if no discounts are applicable, or the scheduled price less any applicable discounts.

There are three price schedules and three corresponding discount schedules, Business, Educational, and Individual. The Business price and discount schedules apply only if the order is to be charged to a business account in good standing. The Educational price and discount schedules apply to educational institutions. The Individual price and discount schedules apply to all other customers. Account classes and rules for establishing business and educational accounts are described further in [. . . ].

A discount schedule includes up to three discount levels, in addition to the possibility of "no discount." Each discount level is characterized by two threshold values, a value for the current purchase (configuration schedule price) and a cumulative value for purchases over the preceding 12 months (sum of adjusted price).

Educational prices The adjusted price for a purchase charged to an educational account in good standing is the scheduled price from the educational price schedule. No further discounts apply.

Business account discounts Business discounts depend on the size of the current purchase as well as business in the preceding 12 months. A tier 1 discount is applicable if the scheduled price of the current order exceeds the tier 1 current order threshold, or if total paid invoices to the account over the preceding 12 months exceed the tier 1 year cumulative value threshold. A tier 2 discount is applicable if the current order exceeds the tier 2 current order threshold, or if total paid invoices to the account over the preceding 12 months exceed the tier 2 cumulative value threshold. A tier 2 discount is also applicable if both the current order and the 12 month cumulative payments exceed the tier 1 thresholds.

Individual discounts Purchases by individuals and by others without an established account in good standing are based on current value alone (not on cumulative purchases). A tier 1 individual discount is applicable if the scheduled price of the configuration in the current order exceeds the tier 1 current order threshold. A tier 2 individual discount is applicable if the scheduled price of the configuration exceeds the tier 2 current order threshold.

Special-price non-discountable offers Sometimes a complete configuration is offered at a special, non-discountable price. When a special, non-discountable price is available for a configuration, the adjusted price is the non-discountable price or the regular price after any applicable discounts, whichever is less.

Figure 13.4: The functional specification of feature pricing of the Chipmunk web site.


Predicates and conditions

A predicate is a function with a boolean (True or False) value. When the input argument of the predicate is clear, particularly when it describes some property of the input of a program, we often leave it implicit. For example, the actual representation of account types in an information system might be as three-letter codes, but in a specification we may not be concerned with that representation; we know only that there is some predicate educational-account which is either True or False.

An elementary condition is a single predicate that cannot be decomposed further. A complex condition is made up of elementary conditions, combined with boolean connectives.

The boolean connectives include "and" (∧), "or" (∨), "not" (¬), and several less common derived connectives such as "implies" and "exclusive or."

A systematic approach to testing boolean specifications consists in first constructing a model of the boolean specification and then applying test criteria to derive test case specifications.

STEP 1: derive a model of the decision structure We can produce different models of the decision structure of a specification, depending on the original specification and on the technique we use for deriving test cases. For example, if the original specification prescribes a sequence of decisions, either in a program-like syntax or perhaps as a decision tree, we may decide not to derive a different model but rather treat it as a conditional statement. Then we can directly apply the methods described in Chapter 14 for structural testing, i.e., the basic condition, compound condition, or modified condition/decision adequacy criteria. On the other hand, if the original specification is expressed informally as in Figure 13.4, we can transform it into either a boolean expression or a graph or a tabular model before applying a test case generation technique.

Techniques for deriving test case specifications from decision structures were originally developed for graph models, in particular cause-effect graphs, which have been used since the early seventies. Cause-effect graphs are tedious to derive and do not scale well to complex specifications. Tables, on the other hand, are easy to work with and scale well.

A decision structure can be represented with a decision table where rows correspond to elementary conditions and columns correspond to combinations of elementary conditions. The last row of the table indicates the expected outputs. Cells of the table are labeled either true, false, or don't care (usually written -), to indicate the truth value of the elementary condition. Thus, each column is equivalent to a logical expression joining the required values (negated, in the case of false entries) and omitting the elementary conditions with don't care values. (The set of columns sharing a label is therefore equivalent to a logical expression in sum-of-products form.)


Decision tables are completed with a set of constraints that limit the possible combinations of elementary conditions. A constraint language can be based on boolean logic. Often it is useful to add some shorthand notations for common conditions that are tedious to express with the standard connectives, such as at-most-one(C1, . . . , Cn) and exactly-one(C1, . . . , Cn).

Figure 13.5 shows the decision table for the functional specification of feature pricing of the Chipmunk web site presented in Figure 13.4.

The informal specification of Figure 13.4 identifies three customer profiles: educational, business, and individual. The table in Figure 13.5 has only rows educational and business. The choice individual corresponds to the combination false, false for choices educational and business, and is thus redundant. The informal specification of Figure 13.4 indicates different discount policies depending on the relation between the current purchase and two progressive thresholds for the current purchase and the yearly cumulative purchase. These cases correspond to rows 3 through 6 of the table. Conditions on thresholds that do not correspond to individual rows in the table can be defined by suitable combinations of values for these rows. Finally, the informal specification of Figure 13.4 distinguishes the cases in which special offer prices do not exceed either the scheduled or the tier 1 or tier 2 prices. Rows 7 through 9 of the table, suitably combined, capture all possible cases of special prices without redundancy.

Constraints formalize the compatibility relations among the different elementary conditions listed in the table: Educational and Business accounts are exclusive; a current purchase exceeding the tier 2 threshold also exceeds the tier 1 threshold; a yearly purchase exceeding the tier 2 threshold also exceeds the tier 1 threshold; a current purchase not exceeding the tier 1 threshold does not exceed the tier 2 threshold; a yearly purchase not exceeding the tier 1 threshold does not exceed the tier 2 threshold; a special offer price not exceeding the tier 1 threshold does not exceed the tier 2 threshold; and finally, a special offer price exceeding the tier 2 threshold exceeds the tier 1 threshold.

STEP 2: derive test case specifications from a model of the decision structure Different criteria can be used to generate test suites of differing complexity from decision tables.

The basic condition adequacy criterion requires generation of a test case specification for each column in the table, and corresponds to the intuitive principle of generating a test case to produce each possible result. Don't care entries of the table can be filled out arbitrarily, so long as constraints are not violated.

The compound condition adequacy criterion requires a test case specification for each combination of truth values of the elementary conditions. The compound condition adequacy criterion generates a number of cases exponential in the number of elementary conditions, and can thus be applied only to small sets of elementary conditions.
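
A decision table is straightforward to represent directly. The sketch below stores each column as a mapping from elementary conditions to truth values, with missing conditions standing for don't cares (the column excerpt corresponds to the first three columns of Figure 13.5), and generates one test case specification per column, which is the basic condition criterion. The data layout and function name are hypothetical, and a real generator would also have to honor the constraints when filling don't cares.

    # Each column: (truth values of the elementary conditions, expected output).
    # Conditions missing from a column are don't cares.
    COLUMNS = [
        ({"Edu": True, "SP > Sc": False}, "Edu"),   # educational price
        ({"Edu": True, "SP > Sc": True}, "SP"),     # special price
        ({"Edu": False, "Bus": False, "CP > CT1": False, "SP > Sc": False}, "ND"),
    ]
    ALL_CONDITIONS = ["Edu", "Bus", "CP > CT1", "SP > Sc"]

    def basic_condition_tests(columns, conditions, default=False):
        """One test case specification per column (basic condition criterion);
        don't-care entries are filled with an arbitrary default value."""
        tests = []
        for column, expected in columns:
            assignment = {c: column.get(c, default) for c in conditions}
            tests.append((assignment, expected))
        return tests

    for assignment, expected in basic_condition_tests(COLUMNS, ALL_CONDITIONS):
        print(assignment, "->", expected)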



Education: columns 1-2; Individual: columns 3-8; Business: columns 9-20.

Column     1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18   19   20
Edu.       T    T    F    F    F    F    F    F    -    -    -    -    -    -    -    -    -    -    -    -
Bus.       -    -    F    F    F    F    F    F    T    T    T    T    T    T    T    T    T    T    T    T
CP > CT1   -    -    F    F    T    T    -    -    F    F    T    T    F    F    T    T    -    -    -    -
YP > YT1   -    -    -    -    -    -    -    -    F    F    F    F    T    T    T    T    -    -    -    -
CP > CT2   -    -    -    -    F    F    T    T    -    -    F    F    -    -    -    -    T    T    -    -
YP > YT2   -    -    -    -    -    -    -    -    -    -    -    -    F    F    -    -    -    -    T    T
SP > Sc    F    T    F    T    -    -    -    -    F    T    -    -    -    -    -    -    -    -    -    -
SP > T1    -    -    -    -    F    T    -    -    -    -    F    T    F    T    -    -    -    -    -    -
SP > T2    -    -    -    -    -    -    F    T    -    -    -    -    -    -    F    T    F    T    F    T
Out        Edu  SP   ND   SP   T1   SP   T2   SP   ND   SP   T1   SP   T1   SP   T2   SP   T2   SP   T2   SP

Constraints

at-most-one(Edu, Bus)
at-most-one(YP < YT1, YP > YT2)      YP > YT2 ⇒ YP > YT1
at-most-one(CP < CT1, CP > CT2)      CP > CT2 ⇒ CP > CT1
at-most-one(SP < T1, SP > T2)        SP > T2 ⇒ SP > T1

Abbreviations

Edu.       Educational account                                  Edu   Educational price
Bus.       Business account                                     ND    No discount
CP > CT1   Current purchase greater than threshold 1            T1    Tier 1
YP > YT1   Year cumulative purchase greater than threshold 1    T2    Tier 2
CP > CT2   Current purchase greater than threshold 2            SP    Special Price
YP > YT2   Year cumulative purchase greater than threshold 2
SP > Sc    Special Price better than scheduled price
SP > T1    Special Price better than tier 1
SP > T2    Special Price better than tier 2

Figure 13.5: The decision table for the functional specification of feature pricing of the Chipmunk web site of Figure 13.4.


For the modified condition/decision adequacy criterion (MC/DC), each column in the table represents a test case specification. In addition, for each of the original columns, MC/DC generates new columns by modifying each of the cells containing True or False. If modifying a truth value in one column results in a test case specification consistent with an existing column (agreeing in all places where neither is don't care), the two test cases are represented by one merged column, provided they can be merged without violating constraints.

The MC/DC criterion formalizes the intuitive idea that a thorough test suite would not only test positive combinations of values, i.e., combinations that lead to specified outputs, but also negative combinations of values, i.e., combinations that differ from the specified ones and thus should produce different outputs, in some cases among the specified ones, in some other cases leading to error conditions.

Applying MC/DC to column 1 of the table in Figure 13.5 generates two additional columns: one for Educational Account = false and Special Price better than scheduled price = false, and the other for Educational Account = true and Special Price better than scheduled price = true. Both columns are already in the table (columns 3 and 2, respectively) and thus need not be added.

Similarly, from column 2 we generate two additional columns, corresponding to Educational Account = false and Special Price better than scheduled price = true, and Educational Account = true and Special Price better than scheduled price = false, also already in the table.

The generation of a new column for each possible variation of the boolean values in the columns, varying exactly one value for each new column, produces 78 new columns, 21 of which can be merged with columns already in the table. Figure 13.6 shows a table obtained by suitably joining the generated columns with the existing ones. Many don't care cells from the original table are assigned either true or false values, to allow merging of different columns or to obey constraints. The few don't care entries left can be set randomly to obtain a complete test case specification.

There are many ways of merging columns, and they generate different tables. The table in Figure 13.6 may not be the optimal one, i.e., the one with the fewest columns. The objective in test design is not to find an optimal test suite, but rather to produce a cost-effective test suite with an acceptable tradeoff between the cost of generating and executing test cases and the effectiveness of the tests.

The table in Figure 13.6 fixes the entries as required by the constraints, while the initial table in Figure 13.5 does not. Keeping constraints separate from the table corresponding to the initial specification increases the number of don't care entries in the original table, which in turn increases the opportunity for merging columns when generating new cases with the MC/DC criterion. For example, if business account = false, the constraint at-most-one(Edu, Bus) can be satisfied by assigning either true or false to entry educational account. Fixing either choice prematurely may later make merging with a newly generated column impossible.


Column     1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26
Edu.       T    T    F    F    F    F    F    F    F    F    F    F    F    F    F    F    F    F    F    F    T    T    T    T    F    -
Bus.       F    F    F    F    F    F    F    F    T    T    T    T    T    T    T    T    T    T    T    T    F    F    F    F    F    F
CP > CT1   T    T    F    F    T    T    T    T    F    F    T    T    F    F    T    T    T    T    F    F    F    F    T    -    -    F
YP > YT1   F    -    F    -    -    F    T    T    F    F    F    F    T    T    T    T    F    F    T    T    T    -    -    -    T    T
CP > CT2   F    F    F    F    F    F    T    T    F    F    F    F    F    F    F    F    T    T    F    F    F    F    T    T    F    F
YP > YT2   -    -    -    -    -    -    -    -    -    -    -    -    F    F    F    F    -    -    T    T    F    -    -    -    T    F
SP > Sc    F    T    F    T    F    T    -    -    F    T    F    -    F    T    -    T    -    T    -    T    F    T    -    -    -    -
SP > T1    F    T    F    T    F    T    F    T    F    T    F    T    F    T    F    T    F    T    F    T    F    -    -    T    T    T
SP > T2    F    -    F    -    F    -    F    T    F    -    F    -    F    -    F    T    F    T    F    T    F    F    F    T    T    -
Out        Edu  SP   ND   SP   T1   SP   T2   SP   ND   SP   T1   SP   T1   SP   T2   SP   T2   SP   T2   SP   Edu  SP   Edu  SP   SP   SP

Abbreviations

Edu.       Educational account                                  Edu   Educational price
Bus.       Business account                                     ND    No discount
CP > CT1   Current purchase greater than threshold 1            T1    Tier 1
YP > YT1   Year cumulative purchase greater than threshold 1    T2    Tier 2
CP > CT2   Current purchase greater than threshold 2            SP    Special Price
YP > YT2   Year cumulative purchase greater than threshold 2
SP > Sc    Special Price better than scheduled price
SP > T1    Special Price better than tier 1
SP > T2    Special Price better than tier 2

Figure 13.6: The set of test cases generated for feature pricing of the Chipmunk web site, applying the modified condition/decision adequacy criterion.


13.7 Deriving Test Cases from Control and Data Flow Graphs

Functional specifications are seldom given as flow graphs, but sometimes they describe a set of mutually dependent steps to be executed in a given (partial) order, and can thus be modeled with flow graphs.

For example, the specification of Figure 13.7 describes the Chipmunk functionality that processes shipping orders. The specification indicates a set of steps to check the validity of fields in the order form. The type and validity of some of the values depend on other fields in the form. For example, shipping methods are different for domestic and international customers, and the allowed methods of payment depend on the kind of customer.

The informal specification of Figure 13.7 can be modeled with a control flow graph, where the nodes represent computations and branches model flow of control consistently with the dependencies among computations, as illustrated in Figure 13.8. Given a control or a data flow graph model, we can generate test case specifications using the criteria originally proposed for structural testing and described in Chapters ?? and ??.

Control flow testing criteria require test cases that exercise all the elements of a particular type in a graph. The node testing adequacy criterion requires each node to be exercised at least once and corresponds to the statement testing structural adequacy criterion. It is easy to verify that test T-node causes all nodes of the control flow graph of Figure 13.8 to be traversed and thus satisfies the node adequacy criterion.

T-node

Case   Too small   Ship where   Ship method   Cust type   Pay method   Same addr   CC valid
TC-1   No          Int          Air           Bus         CC           No          Yes
TC-2   No          Dom          Air           Ind         CC           -           No (abort)

Abbreviations:
Too small     CostOfGoods < MinOrder?
Ship where    Shipping address, Int = international, Dom = domestic
Ship method   Air = air freight, Land = land freight
Cust type     Bus = business, Edu = educational, Ind = individual
Pay method    CC = credit card, Inv = invoice
Same addr     Billing address = shipping address?
CC valid      Credit card information passes validity check?

The branch testing adequacy criterion requires each branch to be exercised at least once, i.e., each edge of the graph to be traversed by at least one test case. Test T-branch covers all branches of the control flow graph of Figure 13.8 and thus satisfies the branch adequacy criterion.


Process shipping order: The Process shipping order function checks the validity of orders and prepares the receipt.

A valid order contains the following data:

cost of goods If the cost of goods is less than the minimum processable order (MinOrder), then the order is invalid.

shipping address The address includes name, address, city, postal code, and country.

preferred shipping method If the address is domestic, the shipping method must be either land freight, or expedited land freight, or overnight air. If the address is international, the shipping method must be either air freight or expedited air freight; a shipping cost is computed based on address and shipping method.

type of customer which can be individual, business, or educational.

preferred method of payment Individual customers can use only credit cards, while business and educational customers can choose between credit card and invoice.

card information If the method of payment is credit card, fields credit card number, name on card, expiration date, and billing address, if different than shipping address, must be provided. If the credit card information is not valid, the user can either provide new data or abort the order.

The outputs of Process shipping order are

validity Validity is a boolean output which indicates whether the order can be processed.

total charge The total charge is the sum of the value of goods and the computed shipping costs (only if validity = true).

payment status If all data are processed correctly and the credit card information is valid or the payment is by invoice, payment status is set to valid, the order is entered, and a receipt is prepared; otherwise validity = false.

Figure 13.7: The functional specification of feature process shipping order of the Chipmunk web site.


Figure 13.8: The control flow graph corresponding to functionality Process shipping order of Figure 13.7


T-branch

Case   Too small   Ship where   Ship method   Cust type   Pay method   Same addr   CC valid
TC-1   No          Int          Air           Bus         CC           No          Yes
TC-2   No          Dom          Land          -           -            -           -
TC-3   Yes         -            -             -           -            -           -
TC-4   No          Dom          Air           -           -            -           -
TC-5   No          Int          Land          -           -            -           -
TC-6   No          -            -             Edu         Inv          -           -
TC-7   No          -            -             -           CC           Yes         -
TC-8   No          -            -             -           CC           -           No (abort)
TC-9   No          -            -             -           CC           -           No (no abort)

Abbreviations: (as above)

In principle, other test adequacy criteria described in Chapter 14 can be applied to control structures derived from specifications, but in practice a good specification should rarely result in a complex control structure, since a specification should abstract away details of processing.
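
Claims such as "T-node satisfies the node adequacy criterion" are easy to check mechanically if the control flow graph is written down as an edge list and each executed test case is recorded as a path. The miniature graph and paths below are hypothetical and much smaller than Figure 13.8; only the checking function matters.

    # Hypothetical miniature control flow graph, given as (source, target) edges.
    EDGES = [
        ("start", "too small?"),
        ("too small?", "invalid order"),      # yes branch
        ("too small?", "shipping address"),   # no branch
        ("shipping address", "domestic"),
        ("shipping address", "international"),
        ("domestic", "total charge"),
        ("international", "total charge"),
    ]

    def uncovered(edges, paths):
        """Nodes and branches (edges) not exercised by any of the given paths."""
        nodes = {n for edge in edges for n in edge}
        covered_edges = {pair for path in paths for pair in zip(path, path[1:])}
        covered_nodes = {n for path in paths for n in path}
        return nodes - covered_nodes, set(edges) - covered_edges

    paths = [
        ["start", "too small?", "shipping address", "domestic", "total charge"],
        ["start", "too small?", "invalid order"],
    ]
    missing_nodes, missing_edges = uncovered(EDGES, paths)
    print("uncovered nodes:", missing_nodes)       # {'international'}
    print("uncovered branches:", missing_edges)    # the two edges through 'international'

Adding a test whose path passes through the international branch would satisfy both the node and the branch criterion for this toy graph.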

13.8 Catalog Based Testing

The test design techniques described above require judgment in deriving value classes. Over time, an organization can build experience in making these judgments well. Gathering this experience in a systematic collection can speed up the process and routinize many decisions, reducing human error. Catalogs capture the experience of test designers by listing all cases to be considered for each possible type of variable that represents logical inputs, outputs, and status of the computation. For example, if the computation uses a variable whose value must belong to a range of integer values, a catalog might indicate the following cases, each corresponding to a relevant test case:

1. The element immediately preceding the lower bound of the interval

2. The lower bound of the interval

3. A non-boundary element within the interval

4. The upper bound of the interval

5. The element immediately following the upper bound

The catalog would in this way cover the intuitive cases of erroneous conditions (cases 1 and 5), boundary conditions (cases 2 and 4), and normal conditions (case 3).
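
A catalog entry of this kind can be instantiated mechanically. The helper below is a hypothetical illustration of the range entry, assuming an integer interval with at least two elements; it is not part of any published catalog.

    def range_catalog_values(lower, upper):
        """The five cases of the range entry: just below the interval, the lower
        bound, an interior value, the upper bound, and just above the interval."""
        assert lower < upper, "sketch assumes a non-degenerate interval"
        return [lower - 1, lower, (lower + upper) // 2, upper, upper + 1]

    print(range_catalog_values(1, 10))   # [0, 1, 5, 10, 11]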

The catalog based approach consists in unfolding the specification, i.e., decomposing the specification into elementary items, deriving an initial set of test case specifications from pre-conditions, post-conditions, and definitions, and completing the set of test case specifications using a suitable test catalog.



STEP 1: identify elementary items of the specification The initial specification is transformed into a set of elementary items that have to be tested. Elementary items belong to a small set of basic types:

Pre-conditions represent the conditions on the inputs that must be satisfied before invocation of the unit under test. Preconditions may be checked either by the unit under test (validated preconditions) or by the caller (assumed preconditions).

Post-conditions describe the result of executing the unit under test.

Variables indicate the elements on which the unit under test operates. They can be input, output, or intermediate values.

Operations indicate the main operations performed on input or intermediate variables by the unit under test.

Definitions are shorthand used in the specification.

As in other approaches that begin with an informal description, it is not possible to give a precise recipe for extracting the significant elements. The result will depend on the capability and experience of the test designer.

Consider the informal specification of a function for converting URL-encoded form data into the original data entered through an HTML form. An informal specification is given in Table 13.7. (The informal specification is ambiguous and inconsistent; it is the kind of specification one is most likely to encounter in practice.)

The informal description of cgi decode uses the concepts of hexadecimal digit, hexadecimal escape sequence, and element of a cgi-encoded sequence. This leads to the identification of the following three definitions:

DEF 1 hexadecimal digits are: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'a', 'b', 'c', 'd', 'e', 'f'

DEF 2 a CGI-hexadecimal is a sequence of three characters: '%xy', where x and y are hexadecimal digits

DEF 3 a CGI item is either an alphanumeric character, or the character '+', or a CGI-hexadecimal

In general, every concept introduced in the description as a support for defining the problem can be represented as a definition.

The description of cgi decode mentions some elements that are inputs and outputs of the computation. These are identified as the following variables:


cgi decode: Function cgi decode translates a cgi-encoded string, as transmitted by a web form using the common gateway interface (CGI), back into the corresponding plain ASCII string.

CGI translates spaces to '+' and most other non-alphanumeric characters to hexadecimal escape sequences. cgi decode reverses that encoding: alphanumeric characters are copied unchanged, each '+' is mapped back to a space, and each escape sequence '%xy' (where x and y are hexadecimal digits) is mapped to the corresponding ASCII character.

INPUT: Encoded, a null-terminated string of ASCII characters. A well-formed Encoded is a sequence of CGI items, where each item is an alphanumeric character, the character '+', or a CGI-hexadecimal, i.e., a three-character sequence '%xy' with x and y hexadecimal digits.

OUTPUT: Decoded, a string of ASCII characters containing the translation of Encoded: alphanumeric characters are copied to the corresponding positions, each '+' is replaced by an ASCII SPACE character, and each CGI-hexadecimal '%xy' is replaced by the corresponding ASCII character.

OUTPUT: return value. cgi decode returns 0 for success, 1 if Encoded contains a malformed CGI-hexadecimal, and a positive value if Encoded contains any other illegal character.

Table 13.7: An informal specification of function cgi decode.


VAR 1 Encoded: string of ASCII characters

VAR 2 Decoded: string of ASCII characters

VAR 3 return value: Boolean

Note the distinction between a variable and a definition. Encoded and Decoded are actually used or computed, while hexadecimal digits, CGI-hexadecimal, and CGI item are used to describe the elements but are not objects in their own right. Although not strictly necessary for the problem specification, explicit identification of definitions can help in deriving a richer set of test cases.

The description of cgi decode indicates some conditions that must be satisfied upon invocation, represented by the following preconditions:

PRE 1 (Assumed) the input string Encoded is a null-terminated string of characters.

PRE 2 (Validated) the input string Encoded is a sequence of CGI items.

In general, preconditions represent all the conditions that should be true for the intended functioning of a module. A condition is labeled as validated if it is checked by the module (in which case a violation has a specified effect, e.g., raising an exception or returning an error code). Assumed preconditions must be guaranteed by the caller, and the module does not guarantee a particular behavior in case they are violated.

The description of cgi decode indicates several possible results. These can be represented as a set of postconditions:

POST 1 if the input string Encoded contains alphanumeric characters, they are copied to the corresponding positions in the output string.

POST 2 if the input string Encoded contains characters '+', they are replaced by ASCII SPACE characters in the corresponding positions in the output string.

POST 3 if the input string Encoded contains CGI-hexadecimals, they are replaced by the corresponding ASCII characters.

POST 4 if the input string Encoded is a valid sequence, cgi decode returns 0.

POST 5 if the input string Encoded contains a malformed CGI-hexadecimal, i.e., a substring '%xy', where either x or y is absent or is not a hexadecimal digit, cgi decode returns 1.

POST 6 if the input string Encoded contains any illegal character, cgi decode returns a positive value.

The postconditions should, together, capture all the expected outcomes of the module under test. When there are several possible outcomes, it is possible to capture them all in one complex postcondition or in several simple postconditions; here we have chosen a set of simple contingent postconditions, each of which captures one case.


PRE 1 (Assumed) the input string Encoded is a null-terminated string of characters
PRE 2 (Validated) the input string Encoded is a sequence of CGI items
POST 1 if the input string Encoded contains alphanumeric characters, they are copied to the output string in the corresponding positions
POST 2 if the input string Encoded contains characters '+', they are replaced in the output string by ASCII SPACE characters in the corresponding positions
POST 3 if the input string Encoded contains CGI-hexadecimals, they are replaced by the corresponding ASCII characters
POST 4 if the input string Encoded is well-formed, cgi decode returns 0
POST 5 if the input string Encoded contains a malformed CGI-hexadecimal, i.e., a substring '%xy', where either x or y is absent or is not a hexadecimal digit, cgi decode returns 1
POST 6 if the input string Encoded contains any illegal character, cgi decode returns a positive value
VAR 1 Encoded: a string of ASCII characters
VAR 2 Decoded: a string of ASCII characters
VAR 3 return value: a boolean
DEF 1 hexadecimal digits are ASCII characters in range ['0' .. '9', 'A' .. 'F', 'a' .. 'f']
DEF 2 CGI-hexadecimals are sequences '%xy', where x and y are hexadecimal digits
DEF 3 a CGI item is an alphanumeric character, or '+', or a CGI-hexadecimal
OP 1 Scan Encoded

Table 13.8: Elementary items of specification cgi decode


Although the description of cgi decode does not mention explicitly how the results are obtained, we can easily deduce that it will be necessary to scan the input sequence. This is made explicit in the following operation:

OP 1 Scan the input string Encoded.

In general, a description may refer either explicitly or implicitly to elementary operations which help to clearly describe the overall behavior, like definitions help to clearly describe variables. As with variables, they are not strictly necessary for describing the relation between pre- and postconditions, but they serve as additional information for deriving test cases.

The result of step 1 for cgi decode is summarized in Table 13.8.


STEP 2: derive a first set of test case specifications from preconditions, postconditions, and definitions The aim of this step is to explicitly describe the partition of the input domain:

Validated Preconditions: A simple precondition, i.e., a precondition that is expressed as a simple boolean expression without and or or, identifies two classes of input: values that satisfy the precondition and values that do not. We thus derive two test case specifications (see the sketch after this list).

A compound precondition, given as a boolean expression with and or or, identifies several classes of inputs. Although in general one could derive a different test case specification for each possible combination of truth values of the elementary conditions, usually we derive only a subset of test case specifications using the modified condition/decision coverage (MC/DC) approach, which is illustrated in Section 13.6 and in Chapter ??. In short, we derive a set of combinations of elementary conditions such that each elementary condition can be shown to independently affect the outcome of each decision. For each elementary condition C, there are two test case specifications in which the truth values of all conditions except C are the same, and the compound condition as a whole evaluates to True for one of those test cases and False for the other.

Assumed Preconditions: We do not derive test case specifications for cases that violate assumed preconditions, since there is no defined behavior and thus no way to judge the success of such a test case. We also do not derive test cases when the whole input domain satisfies the condition, since test cases for these would be redundant. We generate test cases from assumed preconditions only when the MC/DC criterion generates more than one class of valid combinations (i.e., when the condition is a logical disjunction of more elementary conditions).

Postconditions: In all cases in which postconditions are given in a conditional form, the condition is treated like a validated precondition, i.e., we generate a test case specification for cases that satisfy and cases that do not satisfy the condition.

Definitions: Definitions that refer to input or output variables are treated like postconditions, i.e., we generate a set of test cases for each definition given in conditional form, with the same criteria used for validated preconditions. The test cases are generated for each variable that refers to the definition.
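
A minimal sketch of the simple-condition rule mentioned above follows; the function and labels are hypothetical, and compound conditions and the MC/DC refinement are not modeled.

    def specs_from_condition(item_id, condition, validated=True):
        """Two test case specifications for a simple condition: one that
        satisfies it and one that violates it.  Assumed preconditions yield
        no specifications, per the rules above."""
        if not validated:
            return []
        return [(f"TC-{item_id}-1", f"satisfies: {condition}"),
                (f"TC-{item_id}-2", f"violates: {condition}")]

    print(specs_from_condition("PRE2", "Encoded is a sequence of CGI items"))
    print(specs_from_condition("PRE1", "Encoded is null-terminated", validated=False))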

The elementary items of the specification identified in step 1 are scanned sequentially, and a set of test cases is derived by applying these rules. While scanning the specification, we generate test case specifications incrementally. When new test case specifications introduce a finer partition than an existing case, or vice versa, the test case specification that creates the coarser partition becomes redundant and can be eliminated. For example, if an existing test case specification requires a non-empty set, and we have to add two test case specifications that require a size that is a power of two and one which is not, the existing test case specification can be deleted.



Scanning the elementary items of the cgi decode specification given in Table 13.7, we proceed as follows:

PRE 1: The first precondition is a simple assumed precondition, thus, according to the rules, we do not generate any test case specification. The only condition would be Encoded: a null-terminated string of characters, but this matches every test case and thus it does not identify a useful partition.

PRE 2: The second precondition is a simple validated precondition, thus we generate two test case specifications, one that satisfies the condition and one that does not:

TC-PRE2-1 Encoded: a sequence of CGI items

TC-PRE2-2 Encoded: not a sequence of CGI items

postconditions: all postconditions in the cgi decode specification are given in a conditional form with a simple condition. Thus, we generate two test case specifications for each of them. The generated test case specifications correspond to a case that satisfies the condition and a case that violates it.

POST 1:

TC-POST1-1 Encoded: contains one or more alphanumeric characters

TC-POST1-2 Encoded: does not contain any alphanumeric characters

POST 2:

TC-POST2-1 Encoded: contains one or more characters '+'

TC-POST2-2 Encoded: does not contain any character '+'

POST 3:

TC-POST3-1 Encoded: contains one or more CGI-hexadecimals

TC-POST3-2 Encoded: does not contain any CGI-hexadecimal

POST 4: we do not generate any new useful test case specifications, because the two specifications are already covered by the specifications generated from PRE 2.


POST 5: we generate only the test case specification that satisfies the condition; the test case specification that violates the condition is redundant with respect to the test case specifications generated from POST 3.

TC-POST5-1 Encoded: contains one or more malformed CGI-hexadecimals

POST 6: as for POST 5, we generate only the test case specification that satisfies the condition; the test case specification that violates the condition is redundant with respect to most of the test case specifications generated so far.

TC-POST6-1 Encoded: contains one or more illegal characters

definitions: none of the definitions in the specification of cgi decode is given in conditional terms, and thus no test case specifications are generated at this step.

The test case specifications generated from the postconditions refine test case specification TC-PRE2-1, which can thus be eliminated from the checklist. The result of step 2 for cgi decode is summarized in Table 13.9.

STEP 3: complete the test case specifications using catalogs The aim of this step is to generate additional test case specifications from variables and operations used or defined in the computation. The catalog is scanned sequentially. For each entry of the catalog we examine the elementary components of the specification and we add test case specifications as required by the catalog. As when scanning the test case specifications during step 2, redundant test case specifications are eliminated.

Table 13.10 shows a simple catalog that we will use for the cgi decode example. A catalog is structured as a list of kinds of elements that can occur in a specification. Each catalog entry is associated with a list of generic test case specifications appropriate for that kind of element. We scan the specification for elements whose type is compatible with the catalog entry, then generate the test cases defined in the catalog for that entry. For example, the catalog of Table 13.10 contains an entry for boolean variables. When we find a boolean variable in the specification, we instantiate the catalog entry by generating two test case specifications, one that requires a True value and one that requires a False value.
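
One hypothetical way to make such a catalog executable is to encode each entry as a function from a variable description to generic test case specifications. The two entries below mirror the Boolean and Enumeration entries of Table 13.10; CATALOG and apply_entry are invented names.

    # Hypothetical encoding of two catalog entries.  Each entry yields
    # (description, label) pairs, where the label "in" marks cases that make
    # sense only for input variables and "in/out" marks cases for both.
    CATALOG = {
        "boolean": lambda var, values=None: [
            (f"{var} = True", "in/out"),
            (f"{var} = False", "in/out"),
        ],
        "enumeration": lambda var, values: [
            (f"{var} = {v}", "in/out") for v in values
        ] + [(f"{var} = some value outside the enumerated set", "in")],
    }

    def apply_entry(kind, var, values=None, is_input=True):
        """Instantiate one catalog entry for one variable, keeping the
        'in'-only cases only when the variable is an input."""
        cases = CATALOG[kind](var, values)
        return [text for text, label in cases if is_input or label == "in/out"]

    print(apply_entry("boolean", "return value", is_input=False))
    print(apply_entry("enumeration", "CGI item",
                      values=["alphanumeric character", "'+'", "CGI-hexadecimal"]))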

Each generic test case in the catalog is labeled in, out, or in/out, meaning that a test case specification is appropriate if applied to either an input variable, or to an output variable, or in both cases. In general, erroneous values should be used when testing the behavior of the system with respect to input variables, but are usually impossible to produce when testing the behavior of the system with respect to output variables. For example, when the value of an input variable can be chosen from a set of values, it is important to test the behavior of the system for all enumerated values and some values outside the enumerated set, as required by entry ENUMERATION of the catalog.


PRE 2 (Validated) the input string Encoded is a sequence of CGI items
    [TC-PRE2-2] Encoded: not a sequence of CGI items

POST 1 if the input string Encoded contains alphanumeric characters, they are copied to the output string in the corresponding positions
    [TC-POST1-1] Encoded: contains alphanumeric characters
    [TC-POST1-2] Encoded: does not contain alphanumeric characters

POST 2 if the input string Encoded contains characters '+', they are replaced in the output string by ASCII SPACE characters in the corresponding positions
    [TC-POST2-1] Encoded: contains '+'
    [TC-POST2-2] Encoded: does not contain '+'

POST 3 if the input string Encoded contains CGI-hexadecimals, they are replaced by the corresponding ASCII characters
    [TC-POST3-1] Encoded: contains CGI-hexadecimals
    [TC-POST3-2] Encoded: does not contain a CGI-hexadecimal

POST 4 if the input string Encoded is well-formed, cgi decode returns 0

POST 5 if the input string Encoded contains a malformed CGI-hexadecimal, i.e., a substring '%xy', where either x or y is absent or is not a hexadecimal digit, cgi decode returns 1
    [TC-POST5-1] Encoded: contains malformed CGI-hexadecimals

POST 6 if the input string Encoded contains any illegal character, cgi decode returns a positive value
    [TC-POST6-1] Encoded: contains illegal characters

VAR 1 Encoded: a string of ASCII characters

VAR 2 Decoded: a string of ASCII characters

VAR 3 return value: a boolean

DEF 1 hexadecimal digits are ASCII characters in range ['0' .. '9', 'A' .. 'F', 'a' .. 'f']

DEF 2 CGI-hexadecimals are sequences '%xy', where x and y are hexadecimal digits

DEF 3 a CGI item is either an alphanumeric character, or '+', or a CGI-hexadecimal

OP 1 Scan Encoded

Table 13.9: Test case specifications for cgi decode generated after step 2.


Boolean
    [in/out]  True
    [in/out]  False

Enumeration
    [in/out]  Each enumerated value
    [in]      Some value outside the enumerated set

Range L .. U
    [in]      L - 1 (the element immediately preceding the lower bound)
    [in/out]  L (the lower bound)
    [in/out]  A value between L and U
    [in/out]  U (the upper bound)
    [in]      U + 1 (the element immediately following the upper bound)

Numeric Constant C
    [in/out]  C (the constant value)
    [in]      C - 1 (the element immediately preceding the constant value)
    [in]      C + 1 (the element immediately following the constant value)
    [in]      Any other constant compatible with C

Non-Numeric Constant C
    [in/out]  C (the constant value)
    [in]      Any other constant compatible with C
    [in]      Some other compatible value

Sequence
    [in/out]  Empty
    [in/out]  A single element
    [in/out]  More than one element
    [in/out]  Maximum length (if bounded) or very long
    [in]      Longer than maximum length (if bounded)
    [in]      Incorrectly terminated

Scan with action on elements P
    [in]      P occurs at beginning of sequence
    [in]      P occurs in interior of sequence
    [in]      P occurs at end of sequence
    [in]      PP occurs contiguously
    [in]      P does not occur in sequence
    [in]      pP, where p is a proper prefix of P
    [in]      Proper prefix p occurs at end of sequence

Table 13.10: Part of a simple test catalog.


However, when the value of an output variable belongs to a finite set of values, we should derive a test case for each possible outcome, but we cannot derive a test case for an impossible outcome, so entry ENUMERATION of the catalog specifies that the choice of values outside the enumerated set is limited to input variables. Intermediate variables, if present, are treated like output variables.
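To make the mechanics of step 3 concrete, the fragment below sketches how a catalog like that of Table 13.10 could be encoded as data and instantiated for a variable; the entry names, the in/out tags, and the instantiate function are illustrative assumptions, not part of the method as defined in this chapter.

# A minimal, hypothetical encoding of part of a test catalog.
# Each entry lists generic test values tagged "in", "out", or "in/out".
CATALOG = {
    "boolean": [("in/out", "True"), ("in/out", "False")],
    "enumeration": [("in/out", "each enumerated value"),
                    ("in", "some value outside the enumerated set")],
}

def instantiate(entry_kind, variable, role):
    """Generate test case specifications for `variable`.
    `role` is 'in' for input variables and 'out' for output variables;
    catalog items tagged only 'in' are skipped for output variables."""
    return [f"{variable}: {value}"
            for tag, value in CATALOG[entry_kind]
            if tag == "in/out" or tag == role]

# Applying the Boolean entry to the output variable 'return value' (VAR 3):
print(instantiate("boolean", "return value", "out"))
# -> ['return value: True', 'return value: False']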

Entry Boolean of the catalog applies to return value (VAR 3). The catalog requires a test case that produces the value True and one that produces the value False. Both cases are already covered by test cases TC-PRE2-1 and TC-PRE2-2 generated for precondition PRE 2, so no test case specification is actually added.

Entry Enumeration of the catalog applies to any variable whose values are chosen from an explicitly enumerated set of values. In the example, the values of CGI item (DEF 3) and of improper CGI-hexadecimals in POST 5 are defined by enumeration. Thus, we can derive new test case specifications by applying entry Enumeration to POST 5 and to any variable that can contain CGI items.

The catalog requires creation of a test case specification for each enumerated value and for some excluded values. For encoded, which uses DEF 3, we generate a test case specification where a CGI item is an alphanumeric character, one where it is the character '+', one where it is a CGI-hexadecimal, and some where it is an illegal value. We can easily ascertain that all the required cases are already covered by test case specifications TC-POST1-1, TC-POST1-2, TC-POST2-1, TC-POST2-2, TC-POST3-1, and TC-POST3-2, so any additional test case specifications would be redundant.

From the enumeration of malformed CGI-hexadecimals in POST 5, we derive the following cases: %y, %x, %ky, %xk, %xy (where x and y are hexadecimal digits and k is not). Note that the first two cases, %x (the second hexadecimal digit is missing) and %y (the first hexadecimal digit is missing), are identical, and %x is distinct from %xk only if %x are the last two characters in the string. A test case specification requiring a correct pair of hexadecimal digits (%xy) is a value outside the range of the enumerated set, as required by the catalog.

The added test case specifications are:

TC-POST5-2 encoded: terminated with %x, where x is a hexadecimal digit

TC-POST5-3 encoded: contains %ky, where k is not a hexadecimal digit and y is a hexadecimal digit.

TC-POST5-4 encoded: contains %xk, where x is a hexadecimal digit and k is not.

The test case specification corresponding to the correct pair of hexadecimal digits is redundant, having already been covered by TC-POST3-1. The test case TC-POST5-1 can now be eliminated because it is more general than the combination of TC-POST5-2, TC-POST5-3, and TC-POST5-4.
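The three remaining specifications translate directly into concrete inputs. The sketch below is only an illustration: the cgi_decode interface shown here (returning a success flag and the decoded string) is an assumption, and the literal strings are one arbitrary choice among many that satisfy the specifications.

# Hypothetical concrete inputs for TC-POST5-2, TC-POST5-3, and TC-POST5-4.
# '5' stands in for a hexadecimal digit and 'k' for a non-hexadecimal character.
MALFORMED = {
    "TC-POST5-2": "adate%5",      # terminated with %x: the second digit is missing
    "TC-POST5-3": "a%k5date",     # %ky: the first digit is not hexadecimal
    "TC-POST5-4": "a%5kdate",     # %xk: the second digit is not hexadecimal
}

def run_malformed_cases(cgi_decode):
    """Each malformed input should be rejected by cgi_decode; the
    (ok, decoded) return convention is an assumption of this sketch."""
    for name, encoded in MALFORMED.items():
        ok, _ = cgi_decode(encoded)
        assert not ok, f"{name}: malformed input {encoded!r} was accepted"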


Entry Range applies to any variable whose values are chosen from a finite range. In the example, ranges appear three times in the definition of hexadecimal digit. Ranges also appear implicitly in the reference to alphanumeric characters (the alphabetic and numeric ranges from the ASCII character set) in DEF 3. For hexadecimal digits we will try the special values '/' and ':' (the characters that appear before '0' and after '9' in the ASCII encoding), the values '0' and '9' (lower and upper bounds of the first interval), some value between '0' and '9', and similarly '@', 'G', 'A', 'F', and some value between 'A' and 'F' for the second interval, and '`', 'g', 'a', 'f', and some value between 'a' and 'f' for the third interval.

These values will be instantiated for variable encoded, and result in 30 additional test case specifications (5 values for each subrange, giving 15 values for each hexadecimal digit and thus 30 for the two digits of a CGI-hexadecimal). The full set of test case specifications is shown in Table 13.11. These test case specifications are more specific than (and therefore replace) test case specifications TC-POST3-1, TC-POST5-3, and TC-POST5-4.
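The 30 values follow mechanically from the three sub-ranges of DEF 1; the short computation below reproduces the boundary, interior, and excluded values listed above (the particular interior values chosen here are arbitrary).

# Boundary, interior, and excluded values for the sub-ranges of
# 'hexadecimal digit' (DEF 1): ['0'..'9'], ['A'..'F'], ['a'..'f'].
def range_values(lo, hi):
    interior = chr((ord(lo) + ord(hi)) // 2)   # any value strictly inside will do
    return [chr(ord(lo) - 1), lo, interior, hi, chr(ord(hi) + 1)]

values = [v for lo, hi in (("0", "9"), ("A", "F"), ("a", "f"))
          for v in range_values(lo, hi)]
print(values)        # ['/', '0', '4', '9', ':', '@', 'A', 'C', 'F', 'G', '`', 'a', 'c', 'f', 'g']
print(len(values))   # 15 values per digit, hence 30 specifications for a CGI-hexadecimal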

For alphanumeric characters we will similarly derive boundary, interior, and excluded values, which result in 15 additional test case specifications, also given in Table 13.11. These test cases are more specific than (and therefore replace) TC-POST1-1, TC-POST1-2, and TC-POST6-1.

Entry Numeric Constant does not apply to any element of this specification.

Entry Non-Numeric Constant applies to '+' and '%', occurring in DEF 3 and DEF 2 respectively. Six test case specifications result, but all are redundant.

Entry Sequence applies to encoded (VAR 1), decoded (VAR 2), and CGI-hexadecimal (DEF 2). Six test case specifications result for each, of which only five are mutually non-redundant and not already in the list. From VAR 1 (encoded) we generate test case specifications requiring an empty sequence, a sequence containing a single element, and a very long sequence. The catalog entry requiring more than one element generates a redundant test case specification, which is discarded. We cannot produce reasonable test cases for incorrectly terminated strings (the behavior would vary depending on the contents of memory outside the string), so we omit that test case specification.

All test case specifications that would be derived for decoded (VAR 2) would be redundant with respect to the test case specifications derived for encoded (VAR 1).

From CGI-hexadecimal (DEF 2) we generate two additional test case specifications for variable encoded: a sequence that terminates with '%' (the only way to produce a one-character subsequence beginning with '%') and a sequence containing '%xyz', where x, y, and z are hexadecimal digits.

Entry Scan applies to Scan encoded (OP 1) and generates 17 test case specifications. Three test case specifications (alphanumeric, '+', and CGI item) are generated for each of the first five items of the catalog entry. One test case specification is generated for each of the last two items of the catalog entry when Scan is applied to CGI item. The last two items of the catalog entry do not apply to alphanumeric characters and '+', since they have no non-trivial prefixes.


Seven of the 17 are redundant. The ten generated test case specifications are summarized in Table 13.11.

Test catalogs, like other check-lists used in test and analysis (e.g., inspection check-lists), are an organizational asset that can be maintained and enhanced over time. A good test catalog will be written precisely and suitably annotated to resolve ambiguity (unlike the sample catalog used in this chapter). Catalogs should also be specialized to an organization and application domain, typically using a process such as defect causal analysis or root cause analysis. Entries are added to detect particular classes of faults that have been encountered frequently or have been particularly costly to remedy in previous projects. Refining check-lists is a typical activity carried out as part of process improvement. When a test reveals a program fault, it is useful to make a note of which catalog entries the test case originated from, as an aid to measuring the effectiveness of catalog entries. Catalog entries that are not effective should be removed.

13.9 Deriving Test Cases from Finite State Machines

Finite state machines are often used to specify sequences of interactions between a system and its environment. State machine specifications in one form or another are common for control and interactive systems, such as embedded systems, communication protocols, menu-driven applications, and threads of control in a system with multiple threads or processes.

In several application domains, specifications may be expressed directly as some form of finite state machine. For example, embedded control systems are frequently specified with Statecharts, communication protocols are commonly described with SDL diagrams, and menu-driven applications are sometimes modeled with simple diagrams representing states and transitions. In other domains, the finite-state essence of the systems is left implicit in informal specifications. For instance, the informal specification of the feature Maintenance of the Chipmunk web site given in Figure 13.9 describes a set of interactions between the maintenance system and its environment that can be modeled as transitions through a finite set of process states. The finite-state nature of the interaction is made explicit by the finite state machine shown in Figure 13.10. Note that some transitions appear to be labeled by conditions rather than events, but they can be interpreted as shorthand for an event in which the condition becomes true or is discovered (e.g., “lack component” is shorthand for “discover that a required component is not in stock”).

Many control or interactive systems are characterized by an infinite set of states. Fortunately, the non-finite-state parts of the specification are often simple enough that finite state machines remain a useful model for testing as well as specification. For example, communication protocols are frequently specified using finite state machines, often with some extensions that make


TC-POST2-1   encoded contains character '+'
TC-POST2-2   encoded does not contain character '+'
TC-POST3-2   encoded does not contain a CGI-hexadecimal
TC-POST5-2   encoded terminates with %x
TC-VAR1-1    encoded is the empty sequence
TC-VAR1-2    encoded is a sequence containing a single character
TC-VAR1-3    encoded is a very long sequence
TC-DEF2-1    encoded contains '%/y'
TC-DEF2-2    encoded contains '%0y'
TC-DEF2-3    encoded contains '%xy', with x in [1..8]
TC-DEF2-4    encoded contains '%9y'
TC-DEF2-5    encoded contains '%:y'
TC-DEF2-6    encoded contains '%@y'
TC-DEF2-7    encoded contains '%Ay'
TC-DEF2-8    encoded contains '%xy', with x in [B..E]
TC-DEF2-9    encoded contains '%Fy'
TC-DEF2-10   encoded contains '%Gy'
TC-DEF2-11   encoded contains '%`y'
TC-DEF2-12   encoded contains '%ay'
TC-DEF2-13   encoded contains '%xy', with x in [b..e]
TC-DEF2-14   encoded contains '%fy'
TC-DEF2-15   encoded contains '%gy'
TC-DEF2-16   encoded contains '%x/'
TC-DEF2-17   encoded contains '%x0'
TC-DEF2-18   encoded contains '%xy', with y in [1..8]
TC-DEF2-19   encoded contains '%x9'
TC-DEF2-20   encoded contains '%x:'
TC-DEF2-21   encoded contains '%x@'
TC-DEF2-22   encoded contains '%xA'
TC-DEF2-23   encoded contains '%xy', with y in [B..E]
TC-DEF2-24   encoded contains '%xF'
TC-DEF2-25   encoded contains '%xG'
TC-DEF2-26   encoded contains '%x`'
TC-DEF2-27   encoded contains '%xa'
TC-DEF2-28   encoded contains '%xy', with y in [b..e]
TC-DEF2-29   encoded contains '%xf'
TC-DEF2-30   encoded contains '%xg'
TC-DEF2-31   encoded contains '%$'
TC-DEF2-32   encoded contains '%xyz'
TC-DEF3-1    encoded contains '/'
TC-DEF3-2    encoded contains '0'
TC-DEF3-3    encoded contains 'c', with c in ['1'..'8']
TC-DEF3-4    encoded contains '9'
TC-DEF3-5    encoded contains ':'
TC-DEF3-6    encoded contains '@'
TC-DEF3-7    encoded contains 'A'
TC-DEF3-8    encoded contains 'c', with c in ['B'..'Y']
TC-DEF3-9    encoded contains 'Z'
TC-DEF3-10   encoded contains '['
TC-DEF3-11   encoded contains '`'
TC-DEF3-12   encoded contains 'a'
TC-DEF3-13   encoded contains 'c', with c in ['b'..'y']
TC-DEF3-14   encoded contains 'z'
TC-DEF3-15   encoded contains '{'
TC-OP1-1     encoded contains '^a'
TC-OP1-2     encoded contains '^+'
TC-OP1-3     encoded contains '^%xy'
TC-OP1-4     encoded contains 'a$'
TC-OP1-5     encoded contains '+$'
TC-OP1-6     encoded contains '%xy$'
TC-OP1-7     encoded contains 'aa'
TC-OP1-8     encoded contains '++'
TC-OP1-9     encoded contains '%xy%zw'
TC-OP1-10    encoded contains '%x%yz'

where x, y, z, and w are hexadecimal digits, a is an alphanumeric character, ^ represents the beginning of the string, and $ represents the end of the string.

Table 13.11: Summary table: test case specifications for cgi decode generated with a catalog.


Maintenance: [Summary of the informal specification, Figure 13.9: Maintenance can be requested by phone or web (only by US or EU residents, who must supply a contract number) or directly at a maintenance station; requests with an invalid contract number are rejected. Items not covered by warranty or a maintenance contract receive a cost estimate that the customer may accept or reject. Repairs begin at a maintenance station and are escalated to the regional and then the main headquarters when a station is unable to repair the item; a repair may be suspended while a lacking component is awaited. Repaired items are returned to the customer or picked up at the station.]

Figure 13.9: The functional specification of feature Maintenance of the Chipmunk web site.


[Figure 13.10 (state-transition diagram): states 0 to 9 comprise NO Maintenance, Maintenance (no warranty), Wait for pick up, Wait for acceptance, Repair (maintenance station), Repair (regional headquarters), Repair (main headquarters), Wait for component, Wait for returning, and Repaired; transitions are labeled request by phone or web [US or EU resident] (contract number), invalid contract number, request at maintenance station (no warranty), request at maintenance station or by express courier (contract number), estimate costs, accept estimate, reject estimate, lack component (a/b/c), component arrives (a/b/c), successful repair, unable to repair, unable to repair (not (US or EU resident)), repair completed, return, and pick up.]

Figure 13.10: The finite state machine corresponding to functionality Maintenance specified in Figure 13.9.


T-Cover
    TC-1   0 – 2 – 4 – 1 – 0
    TC-2   0 – 5 – 2 – 4 – 5 – 6 – 0
    TC-3   0 – 3 – 5 – 9 – 6 – 0
    TC-4   0 – 3 – 5 – 7 – 5 – 8 – 7 – 8 – 9 – 7 – 9 – 6 – 0

Table 13.12: A set of test specifications in the form of paths in a finite-state machine specification. States are indicated by the numbers given in Figure 13.10. For example, TC-1 is a test specification requiring transitions (0,2), (2,4), (4,1), and (1,0) to be traversed, in that order.

them not truly finite-state. A state machine that simply receives a message on one port and then sends the same message on another port is not really finite-state unless the set of possible messages is finite, but is often rendered as a finite state machine, ignoring the contents of the exchanged messages.

State machine specifications can be used both to guide test selection and in construction of an oracle that judges whether each observed behavior is correct. There are many approaches for generating test cases from finite state machines, but most are variations on a basic strategy of checking each state transition. One way to understand this basic strategy is to consider that each transition is essentially a specification of a precondition and postcondition; e.g., a transition from state S to state T on stimulus i means “if the system is in state S and receives stimulus i, then after reacting it will be in state T.” For instance, the transition labeled accept estimate from state Wait for acceptance to state Repair (maintenance station) of Figure 13.10 indicates that if an item is on hold waiting for the customer to accept an estimate of repair costs, and the customer accepts the estimate, then the maintenance station begins repairing the item.

A faulty system could violate any of these (precondition, postcondition) pairs, so each should be tested. For instance, the state Repair (maintenance station) can be reached through three different transitions, and each should be checked.
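Each (precondition, postcondition) pair can thus be turned into a small scenario test. The sketch below illustrates the idea for the accept estimate transition; the MaintenanceProcess class, its state attribute, and its fire method are stand-ins for whatever interface the real implementation exposes, and the tiny transition table transcribes just this one transition.

import unittest

# A dict-driven stand-in for the system under test; the names are hypothetical.
TRANSITIONS = {
    ("Wait for acceptance", "accept estimate"): "Repair (maintenance station)",
}

class MaintenanceProcess:
    def __init__(self, state):
        self.state = state
    def fire(self, event):
        self.state = TRANSITIONS[(self.state, event)]

class TestAcceptEstimate(unittest.TestCase):
    def test_accept_estimate_starts_repair(self):
        process = MaintenanceProcess("Wait for acceptance")   # establish the precondition
        process.fire("accept estimate")                       # apply the stimulus
        # check the postcondition: repair begins at the maintenance station
        self.assertEqual(process.state, "Repair (maintenance station)")

if __name__ == "__main__":
    unittest.main()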

Details of the approach taken depend on several factors, including whether system states are directly observable or must be inferred from stimulus/response sequences, whether the state machine specification is complete as given or includes additional, implicit transitions, and whether the size of the (possibly augmented) state machine is modest or very large.

A basic criterion for generating test cases from finite state machines is transition coverage, which requires each transition to be traversed at least once. Test case specifications for transition coverage are often given as sets of

state sequences or transition sequences. For example, T-Cover in Table 13.12 is a set of four paths, each beginning at the initial state, which together cover all transitions of the finite state machine of Figure 13.10. T-Cover thus satisfies the transition coverage criterion.
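Once the transition relation of Figure 13.10 has been transcribed, checking that a candidate suite such as T-Cover achieves transition coverage is mechanical. The sketch below derives the set of traversed (from, to) pairs from the paths of Table 13.12 and reports any transition the paths miss; the small transition set passed in the example is a hypothetical fragment, since the full relation must be read off the figure.

# T-Cover, transcribed from Table 13.12 as sequences of state numbers.
T_COVER = [
    [0, 2, 4, 1, 0],
    [0, 5, 2, 4, 5, 6, 0],
    [0, 3, 5, 9, 6, 0],
    [0, 3, 5, 7, 5, 8, 7, 8, 9, 7, 9, 6, 0],
]

def traversed(paths):
    """The set of (from, to) state pairs exercised by a set of paths."""
    return {(p[i], p[i + 1]) for p in paths for i in range(len(p) - 1)}

def missed_transitions(paths, transitions):
    """Transitions of the machine that no path in the suite traverses."""
    return set(transitions) - traversed(paths)

# Example with a hypothetical fragment of the transition relation:
print(missed_transitions(T_COVER, {(0, 2), (2, 4), (4, 1), (1, 0)}))   # -> set()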

The transition coverage criterion depends on the assumption that the


finite-state machine model is a sufficient representation of all the “important” state, e.g., that transitions out of a state do not depend on how one reached that state. Although it can be considered a logical flaw, in practice one often finds state machines that exhibit “history sensitivity,” i.e., the transitions from a state depend on the path by which one reached that state. For example, in Figure 13.10, the transition taken from state Wait for component when the component becomes available depends on how the state was entered. This is a flaw in the model: there really should be three distinct Wait for component states, each with a well-defined action when the component becomes available. However, sometimes it is more expedient to work with a flawed state machine model than to repair it, and in that case test suites may be based on more than the simple transition coverage criterion.

Coverage criteria designed to cope with history sensitivity include single state path coverage, single transition path coverage, and boundary interior loop coverage. The single state path coverage criterion requires each path that traverses states at most once to be exercised. The single transition path coverage criterion requires each path that traverses transitions at most once to be exercised. The boundary interior loop coverage criterion requires each distinct loop of the state machine to be exercised the minimum, an intermediate, and the maximum number of times.11 These criteria may be practical for very small and simple finite state machine specifications, but since the number of even simple paths (without repeating states) can grow exponentially with the number of states, they are often impractical.

Specifications given as finite state machines are typically incomplete, i.e., they do not include a transition for every possible (state, stimulus) pair. Often the missing transitions are implicitly error cases. Depending on the system, the appropriate interpretation may be that these are don't care transitions (since no transition is specified, the system may do anything or nothing), self transitions (since no transition is specified, the system should remain in the same state), or (most commonly) error transitions that enter a distinguished state and possibly trigger some error-handling procedure. In at least the latter two cases, thorough testing includes the implicit as well as the explicit state transitions. No special techniques are required; the implicit transitions are simply added to the representation before test cases are selected.
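Completing the machine before test selection can itself be automated. The sketch below fills in every unspecified (state, stimulus) pair of a transition relation, represented as a dictionary, either as a self transition or as a transition to a distinguished error state; the state and stimulus names are illustrative.

def complete_machine(transitions, states, stimuli, policy="error"):
    """Add implicit transitions for every unspecified (state, stimulus) pair.
    policy='self' keeps the machine in the same state; policy='error'
    routes the pair to a distinguished 'Error' state."""
    completed = dict(transitions)
    for state in states:
        for stimulus in stimuli:
            if (state, stimulus) not in completed:
                completed[(state, stimulus)] = state if policy == "self" else "Error"
    return completed

# Example with a two-state fragment and two stimuli:
explicit = {("Idle", "request"): "Busy", ("Busy", "done"): "Idle"}
full = complete_machine(explicit, ["Idle", "Busy"], ["request", "done"])
print(full[("Idle", "done")])   # -> 'Error'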

The presence of implicit transitions with a don't care interpretation is typically an implicit or explicit statement that those transitions are impossible, e.g., because of physical constraints. For example, in the specification of the maintenance procedure of Figure 13.10, the effect of event lack of component is specified only for the states that represent repairs in progress. Sometimes it is possible to test such sequences anyway, because the system does not prevent such events from occurring. Where possible, it may be best to treat don't care transitions as self transitions (allowing the possibility of imperfect translation from physical to logical events, or of future physical layers

11 The boundary interior path coverage was originally proposed for structural coverage of program control flow, and is described in Chapter 14.


Advanced search: The Advanced search function allows for searching elements in the website database.

The key for searching can be:

a simple string, i.e., a simple sequence of characters;

a compound string, i.e.,
- a string terminated with the character *, used as a wildcard character, or
- a string composed of substrings included in braces and separated with commas, used to indicate alternatives;

a combination of strings, i.e., a set of strings combined with the boolean operators NOT, AND, OR, and grouped within parentheses to change the priority of operators.

Examples:

laptop The routine searches for string “laptop”

{DVD*,CD*} The routine searches for strings that start with substring “DVD” or “CD” followed by any number of characters

NOT (C2021*) AND C20* The routine searches for strings that start with substring “C20” followed by any number of characters, except those that start with substring “C2021”

Figure 13.11: The functional specification of feature Advanced search of the Chipmunk web site.

with different properties), or as error transitions (requiring that unanticipated sequences be recognized and handled). If it is not possible to produce test cases for the don't care transitions, then it may be appropriate to pass them to other validation or verification activities, for example, by including explicit assumptions in a requirements or specification document that will undergo inspection.

13.10 Deriving Test Cases from Grammars

Sometimes, functional specifications are given in the form of grammars or regular expressions. This is often the case in descriptions of languages, such as specifications of compilers or interpreters. More often, syntactic structures are described with natural or domain-specific languages, such as simple scripting rules and complex document structures.

The informal specification of the advanced search functionality of the Chipmunk web site shown in Figure 13.11 defines the syntax of the search pattern. Not surprisingly, this specification can easily be expressed as a grammar. Figure 13.12 expresses the specification as a grammar in Backus-Naur Form (BNF).


<search>  ::= <search> <binop> <term> | not <search> | <term>
<binop>   ::= and | or
<term>    ::= <regexp> | ( <search> )
<regexp>  ::= Char <regexp> | Char | { <choices> } | *
<choices> ::= <regexp> | <regexp> , <choices>

Figure 13.12: The BNF description of functionality Advanced search.

A second example is given in Figure 13.13, which specifies a product configuration of the Chipmunk web site. In this case, the syntactic structure of a product configuration is described by an XML schema, which defines an element Model of type ProductConfigurationType. XML schemata are essentially a variant of BNF, so it is not difficult to render the schema in the same BNF notation, as shown in Figure 13.14.

In general, grammars are well suited to represent inputs of varying and unbounded size, boundary conditions, and recursive structures, none of which can be easily captured with fixed lists of parameters, as required by most methods presented in this chapter.

Generating test cases from grammar specifications is straightforward and can easily be automated. To produce a string, we start from a non-terminal symbol and progressively substitute the non-terminals occurring in the current string with substrings, as indicated by the applied productions, until we obtain a string composed only of terminal symbols. In general, at each step several rules can be applied. A minimal set of test cases can be generated by requiring each production to be exercised at least once. Test cases can be generated by starting from the start symbol and applying all productions. The number and complexity of the generated test cases depend on the order of application of the productions. If we first apply productions with non-terminals on the right-hand side, we generate a smaller set of test cases, each one tending to be a large test case. On the contrary, by first applying productions with only terminals on the right-hand side, we generate larger sets of smaller test cases. An algorithm that favors non-terminals, applied to the BNF for Advanced Search of Figure 13.11, generates the test case

not Char {*, Char} and (Char or Char)

which exercises all productions. The derivation tree for this test case is given

in Figure 13.15. It shows that all productions of the BNF are exercised at least once.
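The derivation procedure is easy to automate. The sketch below is one possible reading of it, not the book's algorithm: the grammar of Figure 13.12 is transcribed as a dictionary (with Char treated as a single terminal token), each derivation prefers productions that have not yet been applied, and strings are generated until every production has been exercised at least once.

import random

# The grammar of Figure 13.12; "Char" stands for an arbitrary literal character.
GRAMMAR = {
    "<search>":  [["<search>", "<binop>", "<term>"], ["not", "<search>"], ["<term>"]],
    "<binop>":   [["and"], ["or"]],
    "<term>":    [["<regexp>"], ["(", "<search>", ")"]],
    "<regexp>":  [["Char", "<regexp>"], ["Char"], ["{", "<choices>", "}"], ["*"]],
    "<choices>": [["<regexp>"], ["<regexp>", ",", "<choices>"]],
}
# For each non-terminal, a production leading to a terminal string,
# used to close off a derivation once the size budget is exhausted.
BASE = {"<search>": 2, "<binop>": 0, "<term>": 0, "<regexp>": 1, "<choices>": 0}

def derive(symbol, used, budget):
    """Expand `symbol` into terminal tokens, recording applied productions
    in `used` and preferring productions that have not been applied yet."""
    if symbol not in GRAMMAR:
        return [symbol]                       # terminal token
    if budget <= 0:
        index = BASE[symbol]
    else:
        unused = [i for i in range(len(GRAMMAR[symbol])) if (symbol, i) not in used]
        index = random.choice(unused) if unused else random.randrange(len(GRAMMAR[symbol]))
    used.add((symbol, index))
    tokens = []
    for part in GRAMMAR[symbol][index]:
        tokens.extend(derive(part, used, budget - 1))
    return tokens

used, suite = set(), []
while len(used) < sum(len(p) for p in GRAMMAR.values()):
    suite.append(" ".join(derive("<search>", used, budget=10)))
print(suite)    # a small suite in which every production is applied at least once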

The minimal set of test cases can be enriched by considering boundary conditions. Boundary conditions apply to recursive productions. To generate test cases for boundary conditions we need to identify the minimum and maximum number of recursive applications of a production and then generate a test case for the minimum, the maximum, one greater than the minimum, and


[Figure 13.13 (XML schema, in summary): an xsd:schema that declares an element Model of type ProductConfigurationType; the complex type ProductConfigurationType is composed of a required modelNumber attribute of type string, a sequence of Component elements (each with ComponentType and ComponentValue of type string) that may be empty, and a sequence of OptionalComponent elements (each with a ComponentType of type string) that may also be empty.]

XML is a parenthetical language: descriptions of items are either enclosed in angular brackets (< >) or terminated with “/item” clauses. Schema and annotation (<xsd:schema ...> and <xsd:annotation> ... </xsd:annotation>) give information about the XML version and the authors. The first clause (<xsd:element ...>) describes a Model as an instance of type ProductConfigurationType. The clause <xsd:complexType> ... </xsd:complexType> describes type ProductConfigurationType as composed of
- a field modelNumber of type String. Field modelNumber is required.
- a possibly empty set of Components, each characterized by fields ComponentType and ComponentValue, both of type string.
- a possibly empty set of OptionalComponents, each characterized by a ComponentType of type string.

Figure 13.13: The XML schema that describes a product configuration of the Chipmunk web site.


<Model>             ::= <modelNumber> <compSequence> <optCompSequence>
<compSequence>      ::= <Component> <compSequence> | empty
<optCompSequence>   ::= <OptionalComponent> <optCompSequence> | empty
<Component>         ::= <ComponentType> <ComponentValue>
<OptionalComponent> ::= <ComponentType>
<modelNumber>       ::= string
<ComponentType>     ::= string
<ComponentValue>    ::= string

Figure 13.14: The BNF description of Product Configuration.

[Figure 13.15 (derivation tree, in summary): the root <search> expands to <search> <binop> <term>; the left <search> expands to not <search>, whose <search> yields <term>, then <regexp>, then Char <regexp>, with the inner <regexp> expanding to { <choices> }, <choices> to <regexp> , <choices>, the first <regexp> to *, and the final <choices> to <regexp> and then Char; <binop> expands to and; the right <term> expands to ( <search> ), whose <search> expands to <search> <binop> <term>, yielding Char, or, and Char. The frontier of the tree is the test case "not Char {*, Char} and (Char or Char)".]

Figure 13.15: The derivation tree of a test case for functionality Advanced search, derived from the BNF specification of Figure 13.12.


Model                  <Model> ::= <modelNumber> <compSequence> <optCompSequence>
compSeq1 limit=16      <compSequence> ::= <Component> <compSequence>
compSeq2               <compSequence> ::= empty
optCompSeq1 limit=16   <optCompSequence> ::= <OptionalComponent> <optCompSequence>
optCompSeq2            <optCompSequence> ::= empty
Comp                   <Component> ::= <ComponentType> <ComponentValue>
OptComp                <OptionalComponent> ::= <ComponentType>
modNum                 <modelNumber> ::= string
CompTyp                <ComponentType> ::= string
CompVal                <ComponentValue> ::= string

Figure 13.16: The BNF description of Product Configuration extended with production names and limits.

one smaller than the maximum number of applications of each production.

To apply boundary condition grammar-based criteria, we need to add

limits to the recursive productions. Names and limits are shown in Figure 13.16, which extends the grammar of Figure 13.14. Compound productions are decomposed into their elementary components. Production names are used for reference purposes. Limits are added only to recursive productions. In the example of Figure 13.16, the limit of both productions compSeq1 and optCompSeq1 is set to 16, i.e., we assume that each model can have at most 16 required and 16 optional components.

The boundary condition grammar-based criteria would extend the minimal set by adding test cases that cover the following choices (a sketch of the corresponding inputs follows the list):

- zero required components (compSeq1 applied 0 times)
- one required component (compSeq1 applied 1 time)
- fifteen required components (compSeq1 applied 15 times)
- sixteen required components (compSeq1 applied 16 times)
- zero optional components (optCompSeq1 applied 0 times)
- one optional component (optCompSeq1 applied 1 time)
- fifteen optional components (optCompSeq1 applied 15 times)
- sixteen optional components (optCompSeq1 applied 16 times)
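Under this reading of the grammar, the corresponding inputs are obtained by applying compSeq1 and optCompSeq1 the chosen number of times, with the competing dimension held at some small value; a small sketch follows, where the model number, component types, and values are placeholder strings, since only the counts matter for the boundary cases.

def make_model(n_required, n_optional):
    """A product-configuration input with compSeq1 applied n_required times
    and optCompSeq1 applied n_optional times; the strings are placeholders."""
    required = [("CompType%d" % i, "Value%d" % i) for i in range(n_required)]
    optional = ["OptCompType%d" % i for i in range(n_optional)]
    return {"modelNumber": "Mod000", "components": required, "optional": optional}

LIMIT = 16
boundary_cases = [make_model(r, 1) for r in (0, 1, LIMIT - 1, LIMIT)] + \
                 [make_model(1, o) for o in (0, 1, LIMIT - 1, LIMIT)]
print(len(boundary_cases))   # the eight boundary choices listed above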


weight Model        1
weight compSeq1    10
weight compSeq2     0
weight optCompSeq1 10
weight optCompSeq2  0
weight Comp         1
weight OptComp      1
weight modNum       1
weight CompTyp      1
weight CompVal      1

Figure 13.17: A sample seed that assigns probabilities to the productions of the BNF specification of Product Configuration.

Additional boundary condition grammar-based criteria can be defined by also requiring specific combinations of applications of productions, e.g., requiring all productions to be simultaneously applied the minimum or the maximum number of times. This additional requirement, applied to the example of Figure 13.16, would require additional test cases corresponding to the cases of (1) no required and no optional components (both compSeq1 and optCompSeq1 applied 0 times), and (2) 16 required and 16 optional components (both compSeq1 and optCompSeq1 applied 16 times).

Probabilistic grammar-based criteria assign probabilities to productions, thus indicating which production to select at each step to generate test cases. Unlike names and limits, probabilities are attached to grammar productions as a separate set of annotations, called a seed. In this way, we can generate several sets of test cases from the same grammar with different seeds. Figure 13.17 shows a sample seed for the grammar that specifies the product configuration functionality of the Chipmunk web site presented in Figure 13.16.

Probabilities are indicated as weights that determine the relative occurrence of a production in the sequence of applications that generates a test case. The same weight for compSeq1 and optCompSeq1 indicates that test cases are generated by balancing the applications of these two productions, i.e., they contain the same number of required and optional components. Weight 0 disables a production, which is then applied only when the application of competing productions reaches the limit indicated in the grammar.
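One way to read the weights operationally is that, at each expansion of a recursive non-terminal, the competing productions are chosen with probability proportional to their weights, except that a zero-weight production fires only when its competitor has reached its limit. A minimal sketch of that reading, using the seed of Figure 13.17 and the limit of Figure 13.16 (the function and its interpretation are assumptions for illustration):

import random

def applications(weight_recurse, weight_stop, limit):
    """How many times the recursive production is applied before the
    terminating one fires; a zero stop-weight means 'stop only at the limit'."""
    count = 0
    while count < limit:
        total = weight_recurse + weight_stop
        if total == 0 or random.random() < weight_stop / total:
            break
        count += 1
    return count

# Seed of Figure 13.17: compSeq1 has weight 10, compSeq2 has weight 0, limit 16.
print(applications(10, 0, 16))   # always 16 required components
print(applications(10, 5, 16))   # between 0 and 16, with short lists most likely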

13.11 Choosing a Suitable Approach

We have seen several approaches to functional testing, each applying to different kinds of specifications. Given a specification, there may be one or more techniques well suited for deriving functional test cases, while some other


techniques may be hard or even impossible to apply, or may lead to unsatisfactory results. Some techniques can be interchanged, i.e., they can be applied to the same specification and lead to similar results. Other techniques are complementary, i.e., they apply to different aspects of the same specification or at different stages of test case generation. In some cases, approaches apply directly to the form in which the specification is given; in other cases, the specification must be transformed into a suitable form.

The choice of approach for deriving functional test cases depends on several factors: the nature of the specification, the form of the specification, the expertise and experience of test designers, the structure of the organization, the availability of tools, the budget and quality constraints, and the costs of designing and implementing the scaffolding.

Nature and form of the specification. Different approaches exploit different characteristics of the specification. For example, the presence of several constraints on the input domain may suggest the category partition method, while lack of constraints may indicate a combinatorial approach. The presence of a finite set of states could suggest a finite state machine approach, while inputs of varying and unbounded size may be tackled with grammar-based approaches. Specifications given in a specific format, e.g., as finite state machines or decision structures, suggest the corresponding techniques. For example, functional test cases for SDL specifications of protocols are often derived with finite state machine based criteria.

Experience of test designers and organization. The experience of testers and company procedures may drive the choice of testing technique. For example, test designers expert in category partition may prefer this technique over a catalog-based approach when both are applicable, while a company that works in a specific domain may require the use of catalogs suitably produced for the domain of interest.

Tools. Some techniques may require the use of tools, whose availability and cost should be taken into account when choosing a specific testing technique. For example, several tools are available for deriving test cases from SDL specifications. The availability of one of these tools may suggest the use of SDL for capturing a subset of the requirements expressed in the specification.

Budget and quality constraints. Different quality and budget constraints may lead to different choices. For example, the need to quickly check a software product without stringent reliability requirements may lead to choosing a random test generation approach, while a thorough check of a safety-critical application may require the use of sophisticated methods for functional test case generation. When choosing a specific approach, it is important to


evaluate all cost-related aspects. For example, the generation of a large number of random tests may require the design of sophisticated oracles, which may raise the costs of testing over an acceptable threshold; the cost of a specific tool and the related training may go beyond the advantages of adopting a specific approach, even if the nature and the form of the specification suggest the suitability of that approach.

Many engineering activities require carefully trading off different aspects. Functional testing is no exception: successfully balancing the many aspects is a difficult and often underestimated problem that requires highly skilled designers. Functional testing is not an exercise of choosing the optimal approach, but a complex set of activities for finding a suitable combination of models and techniques that can lead to a set of test cases that satisfy cost and quality constraints. This balancing extends beyond test design to software design for test. Appropriate design not only improves the software development process, but can greatly facilitate the job of test designers, and thus lead to substantial savings.

Too often test designers make the same mistake as non-expert programmers, that is, starting to generate code in one case, and test cases in the other, without prior analysis of the problem domain. Expert test designers carefully examine the available specifications, their form, and domain and company constraints to identify a suitable framework for designing test case specifications before even starting to consider the problem of test case generation.

Open research issues

Functional testing is by far the most popular way of deriving test cases in industry, but both industrial practice and research are still far from general and satisfactory methodologies. Key reasons for the relative shortage of results are the intrinsic difficulty of the problem and the difficulty of working with informal specifications. Research in functional testing is increasingly active and progresses in many directions.

A hot research area concerns the use of formal methods for deriving test cases. In the past three decades, formal methods have been mainly studied as a means for formally proving software properties. Recently, a lot of attention has moved towards the use of formal methods for deriving test cases. There are three main open research topics in this area:

- definition of techniques for automatically deriving test cases from particular formal methods. Formal methods present new challenges and opportunities for deriving test cases. We can both adapt existing techniques borrowed from other disciplines or research areas and define new techniques for test case generation. The formal nature can support fully automatic generation of test cases, thus opening additional problems and research challenges.


- adaptation of formal methods to be more suitable for test case generation. As illustrated in this chapter, test cases can be derived in two broad ways, either by identifying representative values or by deriving a model of the unit under test. The possibility of automatically generating test cases from different formal methods makes a large set of models available for testing. The research challenge lies in identifying a tradeoff between the costs of generating formal models and the savings from automatically generating test cases. The possibility of deriving simple formal models that capture only the aspects of interest for testing has already been studied in some specific areas, like concurrency, where test cases can be derived from models of the concurrency structure while ignoring other details of the system under test, but the topic presents many new challenges if applied to wider classes of systems and models.

- identification of a general framework for deriving test cases from any particular formal specification. Currently, research is moving towards the study of techniques for generating test cases for specific formal methods. The unification of methods into a general framework will constitute an additional important result that will allow the interchange of formal methods and testing techniques.

Another hot research area is fed by the increasing interest in different specification and design paradigms. New software development paradigms, such as the object-oriented paradigm, as well as techniques for addressing increasingly important topics, such as software architectures and design patterns, are often based on new notations. Semi-formal and diagrammatic notations offer several opportunities for systematically generating test cases. Research is active in investigating different possibilities of (semi-)automatically deriving test cases from these new forms of specifications and in studying the effectiveness of existing test case generation techniques.12

Most functional testing techniques do not satisfactorily address the problem of testing increasingly large artifacts. Existing functional testing techniques do not take advantage of test cases available for parts of the artifact under test. Compositional approaches that derive test cases for a given system by taking advantage of test cases available for its subsystems are an important open research problem.

Further Reading

Functional testing techniques, sometimes called “black-box testing” or “specification-based testing,” are presented and discussed by several authors. Ntafos [DN81] makes the case for random, rather than systematic, testing; Frankl, Hamlet,

12 Problems and state-of-the-art techniques for testing object-oriented software and software architectures are discussed in Chapters ?? and ??.


Littlewood, and Strigini [FHLS98] is a good starting point to the more recent literature considering the relative merits of systematic and statistical approaches.

Category partition testing is described by Ostrand and Balcer [OB88]. The combinatorial approach described in this chapter is due to Cohen, Dalal, Fredman, and Patton [CDFP97]; the algorithm described by Cohen et al. is patented by Bellcore. Myers' classic text [Mye79] describes a number of techniques for testing decision structures. Richardson, O'Malley, and Tittle [ROT89] and Stocks and Carrington [SC96] are among more recent attempts to generate test cases based on the structure of (formal) specifications. Beizer's Black Box Testing [Bei95] is a popular presentation of techniques for testing based on the control and data flow structure of (informal) specifications.

Catalog-based testing of subsystems is described in depth in Marick's The Craft of Software Testing [Mar97].

Test design based on finite state machines has been important in the domain of communication protocol development and conformance testing; Fujiwara, von Bochmann, Khendek, Amalou, and Ghedamsi [FvBK+91] is a good introduction. Gargantini and Heitmeyer [GH99] describe a related approach applicable to software systems in which the finite state machine is not explicit but can be derived from a requirements specification.

Test generation from context-free grammars is described by Celentano et al. [CCD+80] and apparently goes back at least to Hanford's test generator for an IBM PL/I compiler [Han70]. The probabilistic approach to grammar-based testing is described by Sirer and Bershad [SB99], who use annotated grammars to systematically generate tests for Java virtual machine implementations.

Related topics

Readers interested in the complementarities between functional and structural testing, as well as readers interested in testing decision structures and control and data flow graphs, may continue with the next chapters, which describe structural and data flow testing. Readers interested in finite state machine based testing may go to Chapters 17 and ??, which discuss testing of object-oriented and distributed systems, respectively. Readers interested in the quality of specifications may go to Chapters 25 and ??, which describe inspection techniques and methods for testing and analysis of specifications, respectively. Readers interested in other aspects of functional testing may move to Chapters 16 and ??, which discuss techniques for testing complex data structures and GUIs, respectively.


Exercises

Ex13.1. In the “Extreme Programming” (XP) methodology [?], a written description of a desired feature may be a single sentence, and the first step to designing the implementation of that feature is designing and implementing a set of test cases. Does this aspect of the XP methodology contradict our assertion that test cases are a formalization of specifications?

Ex13.2. Compute the probability of selecting a test case that reveals the fault inserted in line 25 of program Root of Figure 13.1 by randomly sampling the input domain, assuming that type double has range ... . Compute the probability of selecting a test case that reveals a fault, assuming that both lines 18 and 25 of program Root contain the same fault, i.e., missing condition ... . Compare the two probabilities.

Ex13.3. Identify independently testable units in the following specification.

Desk calculator: The desk calculator performs the following algebraic operations: sum, subtraction, product, division, and percentage on integers and real numbers. Operands must be of the same type, except for percentage, which allows the first operand to be either integer or real, but requires the second to be an integer that indicates the percentage to be computed. Operations on integers produce integer results. Program Calculator can be used with a textual interface that provides the following commands:

Mx=N, where Mx is a memory location, i.e., M0 .. M9, and N is a number. Integers are given as non-empty sequences of digits, with or without sign. Real numbers are given as non-empty sequences of digits that include a dot “.”, with or without sign. Real numbers can be terminated with an optional exponent, i.e., the character “E” followed by an integer. The command displays the stored number.

Mx=display, where Mx is a memory location and display indicates the value shown on the last line.

operand1 operation operand2, where operand1 and operand2 are numbers, memory locations, or display, and operation is one of the following symbols: “+”, “-”, “*”, “/”, “%”, each indicating a particular operation. Operands must follow the type conventions. The command displays the result or the string Error.

or with a graphical interface that provides a display with 12 characters and the following keys:

0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 : the 10 digits

+ , - , * , / , % : the operations

= : to display the result of a sequence of operations


C : to clear the display

M , followed by a digit and one of four memory-operation keys, where M is pressed before a digit to indicate the target memory, 0...9, and the memory-operation keys, pressed after M and a digit, indicate the operation to be performed on the target memory: add the display to memory, store the display in memory, restore memory (i.e., move the value in memory to the display), and clear memory.
Example: a key sequence that stores the value 15 in memory cell 3 and then retrieves it to compute 80 - 15 prints 65.

Ex13.4. Assume we have a set of parameter characteristics (categories) and value classes (choices) obtained by applying the category partition method to an informal specification. Write an algorithm for computing the number of combinations of value classes for each of the following restricted cases:

- (Case 1) Parameter characteristics and value classes are given without constraints.
- (Case 2) Only constraints error and single are used (without constraints property and if-property).
- (Case 3) Constraints are used, but constraints property and if-property are not used for value classes of the same parameter characteristic, i.e., only one of these two types of constraint can be used for the value classes of the same parameter characteristic. Moreover, constraints are not nested, i.e., if a value class of a given parameter characteristic is constrained with if-property with respect to a set of different parameter characteristics S, then S cannot be further constrained with if-property.

Ex13.5. Given a set of parameter characteristics (categories) and value classes (choices) obtained by applying the category partition method to an informal specification, explain, either with a deduction or with examples, why unrestricted use of constraints property and if-property makes it difficult to compute the number of derivable combinations of value classes. Write heuristics to compute a reasonable upper bound for the number of derivable combinations of value classes when constraints can be used without limits.

Ex13.6. Consider the following specification, which extends the specification of the feature Check-configuration of the Chipmunk web site given in Figure 13.3. Derive a test case specification using the category partition method and compare the test specification you obtain with the specification of Table 13.1. Try to identify a procedure for deriving the test specifications of the new version of the functional specification from the former version. Discuss the suitability of category partition test design for incremental development with evolving specifications.


Check-Configuration: the Check-configuration function checks the validity of a computer configuration. The parameters of check-configuration are:

Product line: A product line identifies a set of products sharing several components and accessories. Different product lines have distinct components and accessories.
Example: Product lines include desktops, servers, notebooks, digital cameras, printers.

Model: A model identifies a specific product and determines a set of constraints on available components. Models are characterized by logical slots for components, which may or may not be implemented by physical slots on a bus. Slots may be required or optional. Required slots must be assigned a suitable component to obtain a legal configuration, while optional slots may be left empty or filled depending on the customer's needs.
Example: The required “slots” of the Chipmunk C20 laptop computer include a screen, a processor, a hard disk, memory, and an operating system. (Of these, only the hard disk and memory are implemented using actual hardware slots on a bus.) The optional slots include external storage devices such as a CD/DVD writer.

Set of Components: A set of (slot, component) pairs, which must correspond to the required and optional slots associated with the model. A component is a choice that can be varied within a model, and which is not designed to be replaced by the end user. Available components and a default for each slot are determined by the model. The special value “empty” is allowed (and may be the default selection) for optional slots.
In addition to being compatible or incompatible with a particular model and slot, individual components may be compatible or incompatible with each other.
Example: The default configuration of the Chipmunk C20 includes 20 gigabytes of hard disk; 30 and 40 gigabyte disks are also available. (Since the hard disk is a required slot, “empty” is not an allowed choice.) The default operating system is RodentOS 3.2, personal edition, but RodentOS 3.2 mobile server edition may also be selected. The mobile server edition requires at least 30 gigabytes of hard disk.

Set of Accessories: An accessory is a choice that can be varied within a model, and which is designed to be replaced by the end user. Available choices are determined by a model and its line. Unlike components, an unlimited number of accessories may be ordered, and the default value for accessories is always “empty.” The compatibility of some accessories may be determined by the set of components, but accessories are always considered compatible with each other.


Example: Models of the notebook family may allow accessories including removable drives (zip, cd, etc.), PC card devices (modem, lan, etc.), additional batteries, port replicators, carrying case, etc.

Ex13.7. Update the specification of feature Check-configuration of the Chipmunk web site given in Figure 13.3 by using information from the test specification provided in Table 13.1.

Ex13.8. Derive test specifications using the category partition method for the following Airport connection check function:

Airport connection check: The airport connection check is part of an (imaginary) travel reservation system. It is intended to check the validity of a single connection between two flights in an itinerary. It is described here at a fairly abstract level, as it might be described in a preliminary design before concrete interfaces have been worked out.

Specification Signature: Valid Connection (Arriving Flight: flight, Departing Flight: flight) returns Validity Code

Validity Code 0 (OK) is returned if Arriving Flight and Departing Flight make a valid connection (the arriving airport of the first is the departing airport of the second) and there is sufficient time between arrival and departure according to the information in the airport database described below. Otherwise, a validity code other than 0 is returned, indicating why the connection is not valid.

Data types

Flight: A "flight" is a structure consisting of

- A unique identifying flight code, three alphabetic characters followed by up to four digits. (The flight code is not used by the valid connection function.)
- The originating airport code (3 characters, alphabetic)
- The scheduled departure time of the flight (in universal time)
- The destination airport code (3 characters, alphabetic)
- The scheduled arrival time at the destination airport

Validity Code: The validity code is one of a set of integer values with the following interpretations:

0: The connection is valid.
10: Invalid airport code (airport code not found in database).
15: Invalid connection, too short: there is insufficient time between arrival of first flight and departure of second flight.
16: Invalid connection, flights do not connect: the destination airport of Arriving Flight is not the same as the originating airport of Departing Flight.
20: Another error has been recognized (e.g., the input arguments may be invalid, or an unanticipated error was encountered).
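As a quick reality check on these data types, here is a minimal Python sketch of the Flight record and the validity codes. The class and field names, and the encoding of times as minutes, are assumptions made for illustration; the specification above fixes only the information content, not a concrete interface.

    from dataclasses import dataclass
    from enum import IntEnum

    @dataclass
    class Flight:
        """Flight record as described above; names and time encoding are illustrative."""
        flight_code: str     # three alphabetic characters followed by up to four digits
        origin: str          # originating airport code, 3 alphabetic characters
        departure_time: int  # scheduled departure, minutes in universal time (assumed encoding)
        destination: str     # destination airport code, 3 alphabetic characters
        arrival_time: int    # scheduled arrival at the destination airport

    class ValidityCode(IntEnum):
        OK = 0                # the connection is valid
        INVALID_AIRPORT = 10  # airport code not found in database
        TOO_SHORT = 15        # insufficient time between arrival and departure
        NO_CONNECTION = 16    # destination of first flight differs from origin of second
        OTHER_ERROR = 20      # invalid arguments or unanticipated error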

Airport Database

The Valid Connection function uses an internal, in-memory table of airports which is read from a configuration file at system initialization. Each record in the table contains the following information:

- Three-letter airport code. This is the key of the table and can be used for lookups.
- Airport zone. In most cases the airport zone is a two-letter country code, e.g., "us" for the United States. However, where passage from one country to another is possible without a passport, the airport zone represents the complete zone in which passport-free travel is allowed. For example, the code "eu" represents the European countries which are treated as if they were a single country for purposes of travel.
- Domestic connect time. This is an integer representing the minimum number of minutes that must be allowed for a domestic connection at the airport. A connection is "domestic" if the originating and destination airports of both flights are in the same airport zone.
- International connect time. This is an integer representing the minimum number of minutes that must be allowed for an international connection at the airport. The number -1 indicates that international connections are not permitted at the airport. A connection is "international" if any of the originating or destination airports are in different zones.
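Pulling the pieces together, the sketch below shows one possible reading of the connection rules as executable logic, reusing the Flight and ValidityCode declarations sketched above. The table layout (AIRPORTS), the ordering of the checks, the treatment of day boundaries, and the code returned when international connections are not permitted are all assumptions; the specification leaves them open.

    # Illustrative airport table: code -> (zone, domestic connect time, international connect time).
    AIRPORTS = {
        "PVG": ("cn", 45, 90),
        "FRA": ("eu", 30, 60),
        "MXP": ("eu", 25, -1),   # -1: international connections not permitted here
    }

    def valid_connection(arriving: Flight, departing: Flight) -> ValidityCode:
        try:
            # Check ordering is a design choice; the specification does not fix it.
            if arriving.destination != departing.origin:
                return ValidityCode.NO_CONNECTION
            connect_airport = AIRPORTS.get(arriving.destination)
            origin_airport = AIRPORTS.get(arriving.origin)
            dest_airport = AIRPORTS.get(departing.destination)
            if connect_airport is None or origin_airport is None or dest_airport is None:
                return ValidityCode.INVALID_AIRPORT
            zone, domestic_min, international_min = connect_airport
            # Domestic only if all originating and destination airports share one zone.
            same_zone = origin_airport[0] == zone == dest_airport[0]
            required = domestic_min if same_zone else international_min
            if required == -1:
                # Assumption: no dedicated code is specified for this case.
                return ValidityCode.OTHER_ERROR
            # Time arithmetic in minutes; day boundaries are ignored in this sketch.
            if departing.departure_time - arriving.arrival_time < required:
                return ValidityCode.TOO_SHORT
            return ValidityCode.OK
        except (AttributeError, TypeError):
            return ValidityCode.OTHER_ERROR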

Ex13.9. Derive test specifications using the category partition method for the function SUM of Excel from the following description taken from the Excel manual:

SUM: Adds all the numbers in a range of cells.

Syntax

SUM(number1, number2, ...)

Number1, number2, ... are 1 to 30 arguments for which you want the total value or sum.

- Numbers, logical values, and text representations of numbers that you type directly into the list of arguments are counted. See the first and second examples following.

- If an argument is an array or reference, only numbers in that array or reference are counted. Empty cells, logical values, text, or error values in the array or reference are ignored. See the third example following.

- Arguments that are error values or text that cannot be translated into numbers cause errors.

Examples

SUM(3, 2) equals 5

SUM("3", 2, TRUE) equals 6 because the text values are translated into numbers, and the logical value TRUE is translated into the number 1.

Unlike the previous example, if A1 contains "3" and B1 contains TRUE, then:

SUM(A1, B1, 2) equals 2 because references to nonnumeric values in references are not translated.

If cells A2:E2 contain 5, 15, 30, 40, and 50:

SUM(A2:C2) equals 50

SUM(B2:E2, 15) equals 150
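The coercion rules quoted above are easy to misread, so here is a small Python model of them. It is an assumption made for illustration, not Excel's implementation, and it ignores many Excel details (error values, empty cells, the 30-argument limit); lists stand in for ranges and references. The assertions restate the worked examples above.

    def sum_model(*args):
        """Rough model of the SUM rules quoted above."""
        total = 0
        for arg in args:
            if isinstance(arg, list):                     # a range or reference
                # Only values that are already numbers are counted.
                total += sum(v for v in arg
                             if isinstance(v, (int, float)) and not isinstance(v, bool))
            elif isinstance(arg, bool):                   # logical value typed directly
                total += 1 if arg else 0
            elif isinstance(arg, (int, float)):
                total += arg
            elif isinstance(arg, str):                    # text typed directly must be numeric
                total += float(arg)                       # non-numeric text raises an error
            else:
                raise ValueError(f"cannot translate argument {arg!r}")
        return total

    assert sum_model(3, 2) == 5
    assert sum_model("3", 2, True) == 6
    assert sum_model(["3", True], 2) == 2         # references to nonnumeric values are ignored
    assert sum_model([15, 30, 40, 50], 15) == 150 # SUM(B2:E2, 15)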

Ex13.10. Eliminate from the test specifications of the feature check configuration given in Table 13.1 all constraints that do not correspond to infeasible tuples, but have been added for the sake of reducing the number of test cases. Compute the number of test cases corresponding to the new specifications. Apply the combinatorial approach to derive test cases covering all pairwise combinations. Compute the number of derived test cases.

Ex13.11. Consider the value classes obtained by applying the category partition approach to the Airport Connection Check example of Exercise Ex13.8. Eliminate from the test specifications all constraints that do not correspond to infeasible tuples and compute the number of derivable test cases. Apply the combinatorial approach to derive test cases covering all pairwise combinations, and compare the number of derived test cases.

Ex13.12. Given a set of parameter characteristics and value classes, write a heuristic algorithm that selects a small set of tuples that cover all possible pairs of the value classes using the combinatorial approach. Assume that parameter characteristics and value classes are given without constraints.
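One possible shape for such a heuristic, offered only as a hint and not as the book's reference solution, is a greedy construction: pick a still-uncovered pair, fix those two parameters, and fill the remaining parameters with the values that cover the most additional uncovered pairs. The sketch below assumes the value classes are given simply as a list of lists; the example values at the end are made up.

    from itertools import combinations

    def greedy_pairwise(value_classes):
        """Greedily build tuples covering every pair of value classes.

        value_classes: one list of values per parameter characteristic.
        Returns a list of tuples (one value per parameter) covering all pairs.
        """
        n = len(value_classes)
        # All pairs ((i, vi), (j, vj)) with i < j that still need to be covered.
        uncovered = {((i, vi), (j, vj))
                     for i, j in combinations(range(n), 2)
                     for vi in value_classes[i] for vj in value_classes[j]}
        suite = []
        while uncovered:
            # Seed the tuple with one uncovered pair, so every iteration makes progress.
            (i, vi), (j, vj) = next(iter(uncovered))
            tup = [None] * n
            tup[i], tup[j] = vi, vj
            # Fill the remaining parameters, preferring values that cover new pairs.
            for k in range(n):
                if tup[k] is not None:
                    continue
                def gain(v):
                    return sum(1 for m in range(n)
                               if tup[m] is not None
                               and tuple(sorted(((k, v), (m, tup[m])))) in uncovered)
                tup[k] = max(value_classes[k], key=gain)
            # Mark every pair appearing in the completed tuple as covered.
            for a, b in combinations(range(n), 2):
                uncovered.discard(((a, tup[a]), (b, tup[b])))
            suite.append(tuple(tup))
        return suite

    print(len(greedy_pairwise([["en", "fr", "de"], ["small", "large"], ["serif", "sans"]])))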

Ex13.13. Given a set of parameter characteristics and value classes, compute a lower bound on the number of tuples required for covering all pairs of values according to the combinatorial approach.
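As a hint toward one standard bound: any single tuple covers exactly one pair drawn from two fixed characteristics, so the two characteristics with the most value classes already force at least the product of their sizes. The snippet below is an illustrative sketch of that bound only; the exercise may ask for a tighter analysis.

    def pairwise_lower_bound(class_counts):
        """Lower bound on a pairwise-covering suite: product of the two largest class counts."""
        a, b = sorted(class_counts, reverse=True)[:2]
        return a * b

    print(pairwise_lower_bound([3, 2, 2, 4]))  # -> 12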

Ex13.14. Generate a set of tuples that cover all triples of language, screen-size, and font and all pairs of other parameters for the specification given in Table 13.3.

Ex13.15. Consider the following columns that correspond to educational and individual accounts of feature pricing of Figure 13.4:

             Education  Individual
Edu.         T    T    F    F    F    F    F    F
CP > CT1     -    -    F    F    T    T    -    -
CP > CT2     -    -    -    -    F    F    T    T
SP > Sc      F    T    F    T    -    -    -    -
SP > T1      -    -    -    -    F    T    -    -
SP > T2      -    -    -    -    -    -    F    T
Out          Edu  SP   ND   SP   T1   SP   T2   SP

write a set of boolean expressions for the outputs and apply the modified condition/decision adequacy criterion (MC/DC) presented in Chapter 14 to derive a set of test cases for the derived boolean expressions. Compare the result with the test case specifications given in Figure 13.6.
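To illustrate what "a boolean expression for an output" might look like, here is one possible reading of a single column, with dashes treated as don't-cares; it is only a hint, restricted to the columns shown above, and the full exercise requires an expression for each output.

    def out_is_T1(edu, cp_gt_ct1, cp_gt_ct2, sp_gt_t1):
        # Column 5 above: Edu = F, CP > CT1 = T, CP > CT2 = F, SP > T1 = F  ->  Out = T1.
        # SP > Sc and SP > T2 are don't-cares in that column and are omitted.
        return (not edu) and cp_gt_ct1 and (not cp_gt_ct2) and (not sp_gt_t1)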

Ex13.16. Derive a set of test cases for the Airport Connection Check example of Exercise Ex13.8 using the catalog based approach. Extend the catalog of Table 13.10 as needed to deal with specification constructs.

Ex13.17. Derive sets of test cases for functionality Maintenance applying Transition Coverage, Single State Path Coverage, Single Transition Path Coverage, and Boundary Interior Loop Coverage to the FSM specification of Figure 13.9.

Ex13.18. Derive test cases for functionality Maintenance applying Transition Coverage to the FSM specification of Figure 13.9, assuming that implicit transitions are (1) error conditions or (2) self transitions.

Ex13.19. We have stated that the transitions in a state-machine specification can be considered as precondition, postcondition pairs. Often the finite-state machine is an abstraction of a more complex system which is not truly finite-state. Additional "state" information is associated with each of the states, including fields and variables that may be changed by an action attached to a state transition, and a predicate that should always be true in that state. The same system can often be described by a machine with a few states and complicated predicates, or a machine with more states and simpler predicates. Given this observation, how would you combine test selection methods for finite-state machine specifications with decision structure testing methods? Can you devise a method that selects the same test cases regardless of the specification style (more or fewer states)? Is it wise to do so?
