Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


Guide to Evaluating the Effectiveness of Strategies for

Preventing Work Injuries:

How to Show Whether a Safety Intervention Really Works

Lynda S. Robson, Harry S. Shannon, Linda M. Goldenhar, Andrew R. Hale

DEPARTMENT OF HEALTH AND HUMAN SERVICES
Public Health Service

Centers for Disease Control and Prevention
National Institute for Occupational Safety and Health

April 2001


DISCLAIMER

Mention of any company name or product does not constitute endorsement by the National Institute for Occupational Safety and Health.

This document is in the public domain and may be freely copied or reprinted.

Copies of this and other NIOSH documents are available from NIOSH.

For information about occupational safety and health topics contact NIOSH at:

1-800-35-NIOSH (1-800-356-4674)
Fax: 513-533-8573

E-mail: [email protected]
Web site: www.cdc.gov/niosh

National Institute for Occupational Safety and Health
Publications Dissemination

4676 Columbia Parkway
Cincinnati, OH 45226-1998

For information about the Institute For Work & Health and its research:

416-927-2027
Fax: 416-927-2167

E-mail: [email protected]

DHHS (NIOSH) Publication No. 2001-119


Table of Contents

Acknowledgements ................................................................................................................................. ix

Preface ........................................................................................................................................................ xi

Information About Authors....................................................................................................................... xiii

Chapter 1 Introduction: Safety Intervention Effectiveness Evaluation ................................... 1

1.1 What is a safety intervention? ........................................................................................ 1

1.2 Effectiveness evaluation .................................................................................................. 2

1.3 Overview of the evaluation process and the guide .................................................... 2

1.4 Other types of evaluations .............................................................................................. 3

Chapter 2 Planning Right from the Start ....................................................................................... 5

2.1 Introduction ...................................................................................................................... 6

2.2 Defining the scope of the evaluation ............................................................................. 6

2.3 Who should be involved with the evaluation? ............................................................ 6
2.3.1 Evaluation committee ......................................................................................................... 6
2.3.2 Internal vs. external evaluators .......................................................................................... 7
2.3.3 Technical or methodological expertise ................................................................................. 7
2.4 Models to assist planning ...................................................................................................... 8
2.4.1 Conceptual models .............................................................................................................. 8
2.4.2 Program logic models .......................................................................................................... 10
2.5 Quantitative vs. qualitative methods for collecting evaluation data ........................ 11

2.6 Choosing the evaluation design ........................................................................................... 12
2.6.1 Strength of evidence provided by different evaluation designs ........................................... 13
2.6.2 Ethical considerations ......................................................................................................... 14
2.7 Practical tips ...................................................................................................................... 14
2.7.1 Time management ............................................................................................................... 14
2.7.2 Dealing with reaction to interim results ............................................................................. 14
2.7.3 Intervention diary ............................................................................................................... 14
2.7.4 Getting cooperation of workplace parties ............................................................................ 15
2.8 Summary ........................................................................................................................... 15

Chapter 3 Before-and-after design: A simple evaluation design ............................................. 17

3.1 Introduction ...................................................................................................................... 18

3.2 Design terminology ......................................................................................................... 18

3.3 Non-experimental designs .............................................................................................. 18

3.4 Before-and-after design ................................................................................................... 19


3.5 Threats to internal validity of before-and-after designs ......................................... 19
3.5.1 History threat ...................................................................................................................... 20
3.5.2 Instrumentation/reporting threat ....................................................................................... 22
3.5.3 Regression-to-the-mean threat ............................................................................................ 23
3.5.4 Testing threat ...................................................................................................................... 24
3.5.5 Placebo and Hawthorne threats .......................................................................................... 25
3.5.6 Maturation threat ............................................................................................................... 26
3.5.7 Dropout threat ..................................................................................................................... 26
3.6 Summary ........................................................................................................................... 27

Chapter 4 Quasi-experimental and experimental designs: more powerful evaluation designs ........................................................................................................... 29

4.1 Introduction ...................................................................................................................... 30

4.2 Quasi-experimental designs ........................................................................................... 30
4.2.1 Strategy #1: Add a control group (e.g., pre-post with non-randomized control) ............... 30
4.2.2 Strategy #2: Take more measurements (time series designs) ............................................... 32
4.2.3 Strategy #3: Stagger the introduction of the intervention (e.g., multiple baseline design across groups) ............ 33
4.2.4 Strategy #4: Reverse the intervention ............................................................................... 35
4.2.5 Strategy #5: Measure multiple outcomes .......................................................................... 35
4.3 Experimental designs ...................................................................................................... 37
4.3.1 Experimental designs with “before” and “after” measurements ....................................... 37
4.3.2 Experimental designs with “after”-only measurements .................................................... 39
4.4 Threats to internal validity in designs with control groups ....................................... 40
4.4.1 Selection threats .................................................................................................................. 40
4.4.2 Selection interaction threats ................................................................................................ 40
4.4.3 Diffusion or contamination threat .................................................................................... 41
4.4.4 Rivalry or resentment threat ............................................................................................... 41
4.5 Summary ........................................................................................................................... 42

Chapter 5 Study sample: Who should be in your intervention and evaluation? ................... 43

5.1 Introduction ...................................................................................................................... 44

5.2 Some definitions ............................................................................................................... 44

5.3 Choosing people, groups or workplaces for the study sample ................................ 44
5.3.1 How to choose a (simple) random sample ........................................................................... 45
5.3.2 How to choose a stratified random sample .......................................................................... 47
5.4 Randomization - forming groups in experimental designs ....................................... 48
5.4.1 Why randomize? ................................................................................................................. 48
5.4.2 Randomized block design and matching ............................................................................. 49
5.5 Forming groups in quasi-experimental designs .......................................................... 49

5.6 Summary ........................................................................................................................... 50


Chapter 6 Measuring outcomes ....................................................................................................... 51

6.1 Introduction ...................................................................................................................... 52

6.2 Reliability and validity of measurements ..................................................................... 52

6.3 Different types of safety outcome measures ................................................................ 54
6.3.1 Administrative data collection - injury statistics ............................................................... 54
6.3.2 Administrative data collection - other statistics ................................................................. 58
6.3.3 Behavioral and work-site observations ................................................................................ 59
6.3.4 Employee surveys ................................................................................................................ 60
6.3.5 Analytical equipment measures .......................................................................................... 62
6.3.6 Workplace audits ................................................................................................................. 62
6.4 Choosing how to measure the outcomes ...................................................................... 62
6.4.1 Evaluation design and outcome measures .......................................................................... 62
6.4.2 Measuring unintended outcomes ........................................................................................ 64
6.4.3 Characteristics of measurement method ............................................................................. 64
6.4.4 Statistical power and measurement method ....................................................................... 65
6.4.5 Practical considerations ....................................................................................................... 65
6.4.6 Ethical aspects ..................................................................................................................... 65
6.5 Summary ........................................................................................................................... 65

Chapter 7 Qualitative methods for effectiveness evaluation: When numbers are not enough ..................................................................................... 67

7.1 Introduction ...................................................................................................................... 68

7.2 Methods of collecting qualitative data .......................................................................... 68
7.2.1 Interviews and focus groups ............................................................................................... 68
7.2.2 Questionnaires with open-ended questions ........................................................................ 70
7.2.3 Observations ....................................................................................................................... 70
7.2.4 Document analysis .............................................................................................................. 70
7.3 Ways to use qualitative methods in effectiveness evaluation .................................... 71
7.3.1 Identifying implementation and intermediate outcomes .................................................... 71
7.3.2 Verifying and complementing quantitative outcome measures .......................................... 71
7.3.3 Eliminating threats to internal validity .............................................................................. 72
7.3.4 Identifying unintended outcomes ....................................................................................... 72
7.3.5 Developing quantitative measures ...................................................................................... 72
7.4 Selecting a sample for qualitative purposes .................................................................. 73

7.5 Qualitative data management and analysis ................................................................. 74

7.6 Ensuring good quality data ............................................................................................ 75

7.7 Summary ................................................................................................................................. 76


Chapter 8 Statistical Issues: Are the results significant? ........................................................... 77

8.1 Introduction ...................................................................................................................... 78

8.2 Why statistical analysis is necessary ............................................................................. 78

8.3 P-values and statistical significance .............................................................................. 79

8.4 Statistical power and sample size ................................................................................. 80

8.5 Confidence intervals ........................................................................................................ 81

8.6 Choosing the type of statistical analysis ....................................................................... 82
8.6.1 Type of data ......................................................................................................................... 82
8.6.2 Evaluation design ............................................................................................................... 83
8.6.3 Unit of analysis ................................................................................................................... 84
8.7 Avoiding pitfalls in data analysis .................................................................................. 84

8.8 Summary ........................................................................................................................... 84

Chapter 9 Summary of recommended practices ........................................................................... 85

9.1 Introduction ....................................................................................................................... 85

9.2 Summary of recommended practices............................................................................. 86

Glossary ...................................................................................................................................................... 89

Appendix A Some models to assist in planning ............................................................................... 93

A.1 A model for interventions in the technical sub-system .............................................. 93

A.2 Models for interventions in the human sub-system ................................................... 95

A.3 Models for interventions in the safety management system ..................................... 97

Appendix B Examples of statistical analyses .................................................................................... 101

B.1 Analyses for before-and-after designs .......................................................................... 102
B.1.1 Before-and-after design with injury rate data ..................................................................... 102
B.1.2 Before-and-after design with continuous data .................................................................... 104
B.2 Analyses with pre-post measures and a control group .............................................. 105
B.2.1 Pre-post with control group and rate data .......................................................................... 105
B.2.2 Pre-post with control group and continuous data .............................................................. 108
B.3 Analyses for designs with after-only measures and a control group ....................... 108
B.3.1 After-only measurements with two groups and rate data .................................................. 108
B.3.2 After-only measurements with several groups and rate data ............................................. 108
B.3.3 After-only measurements with two groups and continuous data ...................................... 109
B.3.4 After-only measurements with several groups and continuous data ................................. 109
B.4 Multiple measurements over time ................................................................................. 110


Appendix C Reporting your evaluation results ................................................................................ 113

C.1 Introduction ...................................................................................................................... 113

C.2 Evaluation report .............................................................................................................. 113
C.2.1 Structure of the report ......................................................................................................... 113
C.2.2 Audience specificity ............................................................................................................ 115
C.2.3 Clear language .................................................................................................................... 115
C.3 Communication beyond the report ............................................................................... 116

C.4 Summary ........................................................................................................................... 116

Bibliography .....................................................................................................................................................117


Acknowledgements

The idea for this guide arose in a meeting of the Scientific Committee on Accident Prevention (SCOAP) at the International Commission on Occupational Health (ICOH) meeting in Tampere, Finland, July 1997.

The co-authors thank the many individuals and organizations which gave valuable feedback to draft versions of the Guide. These include: other members of the NORA Intervention Effectiveness Team (in particular, Larry Chapman, Catherine Heaney, Ted Katz, Paul Landsbergis, Ted Scharf, Marjorie Wallace); other SCOAP Committee members (Tore Larsson, Jorma Saari); additional academic colleagues (Donald Cole, Xavier Cuny, Michel Guillemin, Richie Gun, Gudela Grote, Per Langaa Jensen, Urban Kjellén, Richard Wells); individuals within Ministries or Departments of Labour (Brian Zaidman, Barry Warrack), the Workplace Safety Insurance Board of Ontario (Richard Allingham, Marian Levitsky, Kathryn Woodcock), Workers’ Compensation Board of BC (Jayne Player), and Workplace Health, Safety and Compensation Commission of New Brunswick (Susan Linton); representatives of Ontario Safe Workplace Associations (Dave Snaith (Construction), Linda Saak and R. Stahlbaum (Electrical & Utilities), James Hansen (IAPA), Louisa Yue-Chan (Services), Mark Diacur (Transportation)); Irene Harris and members of the Occupational Health and Safety Committee of the Ontario Federation of Labour; Mary Cook, Occupational Health Clinics for Ontario Workers. Institute members who shared reference materials are also thanked - John Lavis and Dorcas Beaton - as is Linda Harlowe (IWH), who constructed all of the detailed graphical figures.

The final word of acknowledgement goes to the existing body of evaluation expertise that we drew upon and adapted to a safety application. We are especially indebted to the classic works of Cook and Campbell [1979] and Patton [1987, 1990].


Preface

Our aim in this book is to provide students, researchers and practitioners with the tools and concepts required to conduct systematic evaluations of injury prevention initiatives and safety programs. Successful evaluations will advance the discipline of occupational safety by building a body of knowledge, based on scientific evidence, that can be applied confidently in the workplace. This knowledge will provide a solid foundation for good safety practice, as well as inform the development of standards and regulations. Building such a knowledge base will help practitioners avoid the temptation of adopting safety procedures simply because they appear “intuitively obvious” when no scientific evidence actually exists for those practices.

Users of the guide are encouraged to demonstrate the strongest level of evidence available for an intervention by measuring the effect on safety outcomes in an experimental design. Even when this level of evidence is not obtained, much useful information can still be gained by following the recommendations in the book. In doing so, the safety field will become current with other disciplines, such as clinical medicine, where evaluation information is increasingly available and allows for evidence-based decision-making.

We hope that this guide will assist safety specialists to meet the challenge of effectiveness evaluations. Please let us know if you found it useful by completing the evaluation form provided at the end of the document.


Information About Authors

Linda M. Goldenhar

Linda Goldenhar is a Research Psychologist at the National Institute for Occupational Safety and Health. She received her Ph.D. in Health Behavior at the University of Michigan. Her interests include intervention research, quantitative and qualitative data collection methods, work organization, job stress, and women’s health. She is currently the team leader for the NIOSH NORA Intervention Effectiveness Research team. The mission of the team is to educate occupational researchers about intervention research issues and to encourage the development, implementation, and evaluation of occupational safety and health interventions. Linda has published numerous peer-reviewed articles and delivered conference presentations covering a wide array of occupational health-related and behavioral topic areas. She is on the editorial boards of Health Education and Behavior and the Journal of Safety Research.

Andrew R. Hale

Andrew Hale is professor of Safety Science at the Delft University of Technology in the Netherlands and editor of the journal Safety Science. His background is as an occupational psychologist. He worked for 18 years in the UK at the National Institute of Industrial Psychology and later at the University of Aston in Birmingham. His experience in research in safety and health began with studies of accidents and human error in industry and moved later into studies of risk perception, safety training and, more recently, safety management and regulation. Since moving to Delft from Aston he has expanded his area of research applications from industry to include transport (road, water and air), hospitals and public safety.

As editor of Safety Science and as a reviewer for both this journal and a number of others he is responsible for assessing the articles presented for publication. In that capacity he has seen at close quarters the need for improvement in the quality and application of the research and analysis methods used in safety science.

Lynda S. Robson

Lynda Robson is a Research Associate at the Institute for Work & Health, Toronto, Canada. She obtained her Ph.D. in Biochemistry from the University of Toronto, working afterwards in research labs. More recent experience includes graduate-level training in epidemiology and evaluation methods, as well as collaboration in economic evaluation projects. Her current areas of interest are the measurement of “healthy workplace” performance and safety intervention effectiveness evaluation methodology. She is the non-management co-chair of the Institute’s Joint Health & Safety Committee.


Harry S. Shannon

Harry Shannon is currently a Professor in the Department of Clinical Epidemiology and Biostatistics and Director of the Program in Occupational Health and Environmental Medicine at McMaster University, Hamilton, Canada. He is also seconded to the Institute for Work & Health, Toronto, Canada, as a Senior Scientist.

Harry holds a B.A. in Mathematics from Oxford University, an M.Sc. from Birmingham University and a Ph.D. in Applied Statistics from the University of London, U.K. He has authored over 75 peer-reviewed scientific papers, mainly in occupational health.

Harry’s current research focus is on work-related injuries and musculoskeletal disorders, and psychosocial conditions at work. Recent research includes studies of back pain in auto workers and upper extremity disorders (“RSIs”) in newspaper workers. He has also studied the relationship between workplace organizational factors and injury rates. Harry is a member of the Scientific Committee on Accident Prevention (SCOAP) of the International Commission on Occupational Health (ICOH). He is on the editorial board of Safety Science and is a member of the NIOSH NORA group on the Social and Economic Consequences of Occupational Illness and Injury.


1.1 What is a safety intervention?

A safety intervention is defined very simply as an attempt to change how things are done in order to improve safety. Within the workplace it could be any new program, practice, or initiative intended to improve safety (e.g., engineering intervention, training program, administrative procedure).

Safety interventions occur at different levels of a workplace safety system (Figure 1.1), including the level of safety management and various human and technical sub-system levels in the organization that management can influence. An additional level of the system in Figure 1.1, above the organizational, pertains to the laws, regulations, standards and programs put in place by governments, industries, professional bodies, and others. Examples of interventions at this level include the Safe Communities Incentive Program (Safe Communities Foundation, Ontario, Canada), the Safety Achievers Bonus Scheme (South Australian Workcover Corporation) and small business insurance pooling (CRAM, France). This guide does not deal with interventions at the community level, although some of the issues discussed are applicable.

1.1 What is a safety intervention?

1.2 Effectiveness evaluation

1.3 Overview of the evaluation process and the guide

1.4 Other types of evaluation

Figure 1.1 Levels of intervention in the workplace safety system

Introduction: Safety Intervention Effectiveness Evaluation

Chapter 1


1.2 Effectiveness evaluation

We are focusing here on effectiveness evaluation (also known as outcome evaluation or summative evaluation), which determines whether a safety initiative has had the intended effect. For example, such an evaluation might answer the question, does the new incident investigation process instituted three years ago (for the purpose of decreasing injuries) actually prevent injuries in subsequent years? This type of evaluation is the “CHECK” portion of the PLAN-DO-CHECK-ACT (PDCA) continuous quality improvement cycle.4

Although injuries are often measured in an effectiveness evaluation to determine whether the initiative has had an effect or not, there are two situations where this might not be the case. One of them arises from injury outcome data that is unreliable or invalid (e.g., while evaluating an initiative in a small workplace). In this case a surrogate measure of safety could be used (e.g., a checklist of safety conditions), if shown to be valid for the circumstances in which it will be used. The other situation is where the program’s explicit objective is not to decrease injury incidence, but rather, some other objective such as improving worker or management competence or attitudes.

However, if the purpose of the program is to ultimately affect injury incidence by targeting competence or attitudes, it would be beneficial to include a measure of injuries or a valid surrogate.

1.3 Overview of the evaluation process and the guide

Figure 1.2 provides an overview of the effectiveness evaluation process. Much of the activity in evaluation precedes the point where the intervention or initiative is introduced. Although evaluations can be done retrospectively, appropriate data are typically unavailable. The guidelines contained in this guide assume that a safety need in the workplace and the nature of the intervention have been identified, but you are at a point prior to introducing a new intervention in the workplace. Even if this is not the case, and you have already implemented your intervention, you should find the guide relevant to the evaluation of your intervention.

Chapter 2 identifies the decisions required before people can work out the details of designing an evaluation. They include the available resources, time lines and, of course, the purpose of the evaluation. A few of the broader methodological issues, such as the use of qualitative and quantitative methods and choice of outcomes, are also introduced.

Chapters 3 and 4 introduce several evaluation designs (i.e., methods of conducting evaluations in conjunction with delivering an intervention). These designs specify the groups of people or workplaces to be evaluated, as well as what will be measured and when it will occur.

Chapter 5 explains in more detail who to include in the evaluation - a choice which affects the generalizability of the evaluation results and the types of statistical analyses to be used.

The next two chapters (6 and 7) consider quantitative (Chapter 6) and qualitative (Chapter 7) data collection methods. Quantitative methods yield the numeric information necessary to determine the size of an intervention’s effect and its trustworthiness - determined through statistical testing (Chapter 8). Qualitative methods yield conceptual information which can inform the evaluation design at the beginning, and the interpretation of the results at the end. The guide ends with a summary of the practices recommended in previous chapters (Chapter 9).

4 PDCA cycle mentioned in many references on quality improvement concepts. For example: Dennis P [1997]. Quality, safety, and environment. Milwaukee, Wisconsin: ASQC Quality Press.


1.4 Other types of evaluations

Other types of evaluation, besides effectiveness evaluation, are useful in the process of improving safety in the workplace. They will only be described briefly here. A needs assessment can be carried out to determine exactly what type of intervention is required in a workplace. Analyses of injury statistics, incident reports or employee surveys, as well as interviews with key workplace personnel (e.g., safety manager, disability manager, union representative, etc.) can identify particular safety issues. This determines what type of intervention(s) should be chosen or designed to address an identified need.

After choosing and introducing a new safety initiative to a workplace, a process evaluation (also known as a formative evaluation) can be used to determine whether the new initiative is being implemented as planned. It assesses to what extent new processes have been put in place and the reactions of people affected by the processes. Furthermore, it allows the refinement of a new initiative and its implementation before its effectiveness is measured. If the process evaluation determines that the initiative was not implemented as planned, the time and trouble of conducting an effectiveness evaluation might be spared, or at least delayed until it becomes more meaningful.

Finally, economic analyses can be used to evaluate workplace interventions, including cost-outcome, cost-effectiveness and cost-benefit analyses. They also depend on effectiveness information.

Figure 1.2 Overview of the effectiveness evaluation process


Table 1.1 Types of intervention evaluations

The first two analyses estimate the net cost of an intervention (i.e., the cost of the intervention minus the monetary saving derived from the intervention) relative to the amount of safety improvement achieved. (Monetary savings include reductions in workers’ compensation premiums, medical costs, absenteeism, turnover, etc.) This yields a ratio such as net cost per injury prevented.

In a cost-benefit analysis, monetary values are assigned to all costs and outcomes resulting from an intervention, including health outcomes. A net (monetized) benefit or cost of the intervention is then calculated.
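To make the arithmetic concrete, here is a minimal worked sketch with entirely hypothetical figures (the costs, savings and number of injuries prevented below are invented for illustration and are not drawn from the guide):

```python
# Worked example of cost-outcome and cost-benefit arithmetic.
# All figures are hypothetical and for illustration only.

intervention_cost = 60_000      # cost of delivering the intervention ($)
monetary_savings = 35_000       # reduced premiums, medical costs, absenteeism, etc. ($)
injuries_prevented = 10         # estimated from the effectiveness evaluation

# Cost-outcome / cost-effectiveness: net cost per unit of safety improvement
net_cost = intervention_cost - monetary_savings
print(f"Net cost per injury prevented: ${net_cost / injuries_prevented:,.0f}")

# Cost-benefit: additionally assign a monetary value to each prevented injury
value_per_injury = 8_000        # hypothetical monetized value of one prevented injury ($)
net_benefit = monetary_savings + injuries_prevented * value_per_injury - intervention_cost
print(f"Net (monetized) benefit of the intervention: ${net_benefit:,.0f}")
```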

Drummond et al. [1994], Haddix et al. [1996] and Gold et al. [1996] are useful introductions to economic evaluations.


2.1 Introduction

2.2 Defining the scope of the evaluation

2.3 Who should be involved with the evaluation?
2.3.1 Evaluation committee
2.3.2 Internal vs. external evaluators
2.3.3 Technical or methodological expertise

2.4 Models to assist planning
2.4.1 Conceptual models
2.4.2 Program logic models

2.5 Quantitative vs. qualitative methods for collecting data

2.6 Choosing the evaluation design
2.6.1 Strength of evidence provided by different evaluation designs
2.6.2 Ethical considerations

2.7 Practical tips
2.7.1 Time management
2.7.2 Dealing with reaction to interim results
2.7.3 Intervention diary
2.7.4 Getting the cooperation of workplace parties

2.8 Summary

Planning Right from the Start

Chapter 2


2.1 Introduction

One golden rule of intervention evaluation is that the intervention and its evaluation should be planned simultaneously. Decide on your evaluation design and methods before the intervention is introduced. If you do not, you risk missing the only opportunity for collecting important data before the intervention.

This chapter gives some information about organizing and carrying out an intervention evaluation, including who to involve to oversee the evaluation and how to use models to assist with planning. It also highlights key issues to consider during the planning stage.

2.2 Defining the scope of the evaluation

Some basic decisions are required before a detailed evaluation strategy can be specified. They should be made through a collaborative process among those who will use the evaluation results, and others either implementing the intervention and evaluation project or funding the evaluation. In a workplace setting, these decisions should involve senior management. The types of things to be determined at the outset include: 1) the overall purpose of the evaluation; 2) the main questions that the evaluation should answer; 3) the available resources (financial, personnel, in-kind assistance); and 4) the deadline for the evaluation results. These have to be in place early on, since they will influence the methodological deliberations. In particular, the rationale for the evaluation will influence the strength of the evidence sought. For example, you might want a more rigorous evaluation design if the result of the evaluation is meant to inform a decision with larger resource or policy implications.

2.3 Who should be involved with the evaluation?

2.3.1 Evaluation committee

In all but the simplest evaluation, it is advisable to assemble a committee to oversee the evaluation. Ideally, this same group would also interact with the evaluators in the selecting or designing of the safety intervention. The larger the potential resource and personnel impacts of any conclusion drawn from the evaluation, the greater is the need to have an evaluation committee. The following factors can be considered in deciding whether or not to set up a committee.

People with different backgrounds and skill-sets will be included in an evaluation committee and any individual might play more than one role. Key are those who will use, or could influence, the use of the evaluation results. Their inclusion will ensure the following: the evaluation will address decision-makers’ concerns; the evaluation and its results will be legitimized; and communication of the evaluation results will be initiated early in the evaluation process. You need at least one person who can understand and critique the methodological issues of both the intervention and the evaluation.

Also important is the expertise of those directly affected by the intervention, who have special insight into the reality of the workplace conditions. Senior management and labor representatives enhance the decision-making capacity of the committee, as well as facilitate required changes in workplace practices and mobilize the cooperation of all involved. Intervention skeptics are put to good use, as their input will likely result in a more rigorous evaluation. On the other hand, individuals involved in either choosing or developing the intervention should also be included; otherwise they might be reluctant to accept any evidence that their intervention is ineffective.


There is a wide range of capacities in which evaluation committees can function. At one extreme, committee members can serve in a relatively passive fashion, giving or denying approval to what the core evaluation team develops. At the other end of the spectrum would be an action research model: all workplace representatives and methodological experts collaborate to formulate evaluation questions, design the evaluation and interpret results. An example of this model is the article by Hugentobler et al. [1992]. The choice of where to place the committee between the two extremes involves weighing the complexity of the intervention and how widespread the buy-in has to be, as well as the time and resources available.

2.3.2 Internal vs. external evaluators

From a scientific point of view, it is preferable that an intervention is evaluated by an independent party with no vested interest in showing that the intervention is effective (or not effective). In spite of all efforts to be open-minded, it is simply human nature to put more effort into finding and drawing conclusions from information which confirms our expectations than contradicts them. However, practitioners often find they have to choose between carrying out the evaluation themselves or having no evaluation at all. Although the bias inherent in an “in-house” evaluation is never entirely removed, it can be diminished. This is achieved by inviting others, especially those with opposing views, to comment on analyses and reports of the evaluation, and by being very explicit in advance about what will be measured and regarded as evidence of success or failure.

2.3.3 Technical or methodological expertise

It is quite possible that you might have to look outside your own organization to tap into some of the specialized skills required for certain evaluations. Common areas requiring assistance are questionnaire development and statistical analysis. Consider contacting local academic institutions for expertise in one of several departments: biostatistics, occupational health & safety, management, (social or occupational) psychology, public health, education. Some universities even have consulting services geared to assisting community residents. Other means of contacting experts would be through safety research organizations, safety consultants or safety professional organizations. In any case, the rule is as before: involve these people early in the evaluation.

Evaluation committee representation ideally includes:

• Stakeholders key to utilization and dissemination of results

• Key management representatives (e.g., relevant decision-maker)

• Key worker representatives (e.g., union representatives, opinion leaders, intervention participants)

• Evaluation expertise

• Diversity of disciplinary perspectives (e.g., engineering, safety, human resources, etc.)

• Diversity of workplace divisions/departments

• An intervention critic, as well as intervention proponents


2.4 Models to assist planning

Certain types of models can assist in planning all but the simplest intervention and evaluation. They diagrammatically depict the important relationships among the workplace conditions, interventions and outcomes, as well as identify what should be measured in the evaluation. The process of constructing models will often reveal critical gaps in thinking and identify critical issues. They also provide a means of generating common understanding and communicating among those involved in the intervention and evaluation, as well as serving as an efficient aid to communicating with others.

One type of model is a conceptual model; another is a program logic model. The two are somewhat different and complementary: the conceptual model tends to be more comprehensive in its inclusion of factors which affect the safety outcomes of interest, and the program logic model often includes more information on the intervention itself. Conceptual models are used by researchers in many disciplines, while program logic models are used by program evaluators - principally those evaluating social or public service programs.

2.4.1 Conceptual models

A conceptual model typically uses arrows and boxes to represent the causal relationships (arrows) among important concepts (boxes) relevant to an intervention. An example follows (Figure 2.1).

The arrows indicate that structural management commitment affects participative supervisory practices, which in turn affect the workgroup’s propensity for undertaking safety initiatives; and these initiatives affect the likelihood of injuries. The relationship represented by an arrow in a model can be either positive or negative. We expect a greater structural management commitment will lead to more participative supervisory practices - a positive relationship; while greater workgroup propensity to undertake safety initiatives would lead to fewer injuries - a negative relationship. More complex models show multiple relationships, involving several arrows and boxes. Examples of more complex models are included in an appendix to this guide.

We recommend that a conceptual model relevant to the intervention be developed6 while the intervention is being chosen or designed. Ideally, any model would be based as much as possible on existing research and theory. You might adapt an existing model to your own situation. The conceptual model helps clarify what the intervention hopes to change and the mechanism by which that should happen. This clarification could even reveal certain causal factors not yet addressed by the intervention.

Figure 2.1: An illustration of a simple conceptual model5 (structural management commitment → participative supervisory practices → workgroup propensity to undertake safety initiatives → injuries)

5 This model is based on some of the relationships reported in Simard M, Marchand A [1995]. A multilevel analysis of organizational factors related to the taking of safety initiatives by work groups. Safety Sci 21:113-129.
6 For guidance in constructing conceptual models, see Earp and Ennett [1991].
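One optional way to keep track of a conceptual model during planning is to write it down as a simple list of signed cause-effect links; every concept that appears then corresponds to a variable needing a measurement plan. The sketch below is only an illustrative convention (the data structure and names are ours, not the guide's), using the chain of Figure 2.1:

```python
# Minimal sketch: the Figure 2.1 conceptual model written as a list of signed
# cause-effect links. The structure and naming are illustrative conventions
# only; they are not part of the guide.

conceptual_model = [
    ("structural management commitment",
     "participative supervisory practices", "+"),
    ("participative supervisory practices",
     "workgroup propensity to undertake safety initiatives", "+"),
    ("workgroup propensity to undertake safety initiatives",
     "injuries", "-"),
]

# Every concept appearing in the model is a candidate for measurement,
# i.e., it needs a corresponding variable in the evaluation plan.
concepts_to_measure = {concept for cause, effect, _sign in conceptual_model
                       for concept in (cause, effect)}
for concept in sorted(concepts_to_measure):
    print(concept)
```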


The conceptual model tells us what should be measured in an evaluation. What we measure using quantitative methods are known as variables; i.e., attributes, properties or events which can take on different values and correspond to concepts in the model. Using the conceptual model above, the numerical score from supervisors completing questionnaires on safety practices could be the variable corresponding to the concept “participatory supervisory practices”.

Independent and dependent variables

As demonstrated, the conceptual model does not describe the safety intervention itself, but rather, depicts the variables expected to change during an intervention. For example, the above model might be applicable in an intervention where structural management commitment was going to be changed by establishing a joint management-labor health-and-safety committee. The variable being manipulated as part of the intervention, i.e., presence or absence of a joint health-and-safety committee in this case, is known as the independent variable.

In contrast, the variables affected by the intervention, i.e., the variables corresponding to the other three concepts in the model, are known as dependent variables. These include the variable(s) corresponding to the final outcome, “injuries” in this case. They also include the variables corresponding to the concepts which mediate an intervention’s effect, e.g., “participatory supervisory practices” and “workgroup propensity to take safety initiatives”. The latter are known as mediating, intervening or intermediate variables.

Effect-modifying and confounding variables

Effect-modifying variables7 are sometimes important to include in a conceptual model. While these variables are not the focus of the intervention, they often need to be considered when data is collected and interpreted. An effect-modifying variable is one which modifies the size and direction of the causal relationship between two variables. For example, “participative supervisory practices” might have a greater effect on “workgroup propensity to take safety initiatives” if the workgroup has more training in safety or is younger. An effect-modifying variable is depicted by an arrow extending from it to the relationship which it modifies, as Figure 2.2 shows.

Another type of variable to include in a conceptual model is a confounding variable. This is a variable which is related to the independent variable (i.e., presence of intervention or not), as well as the dependent variable of interest, but is not a mediating variable. To build on the earlier illustration, “industrial sector” could be a confounding variable if one is looking at the effect of changing “structural management commitment” through the legislated introduction of joint health-and-safety committees (see Figure 2.3).

Suppose that you decided to evaluate the effect of the legislation by comparing the injury rate of organizations with established committees to those without them. Suppose too, that the industrial sectors with the higher injury rates are more likely to have committees because of greater inspector activity. If you were to then compare injury rates of companies with committees (intervention group) versus those without committees (control group), you would find that the companies with the committees had the higher injury rates.

Figure 2.2: Depiction of an effect-modifying variable in a conceptual model

7 In some disciplines, such variables would be referred to as moderator variables.


However, the result is most likely due to the specific industrial sector, rather than the intervention. Thus, a conclusion that committees result in higher injury rates would be inaccurate. You would form a different conclusion if you took the sector into account in the study design or analysis. To limit the influence of confounding factors, take them into account in your study design (preferable) or your analysis.
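The sketch below illustrates the committee example numerically. All counts are invented for illustration; the point is only that a crude comparison which ignores sector can point one way, while sector-stratified comparisons point the other:

```python
# Hypothetical illustration of confounding by industrial sector.
# All counts are invented; they are not data from the guide.

# (sector, has_committee) -> (injuries, number of workers)
observed = {
    ("high-risk sector", True):  (90, 1000),  # committees common where inspectors are active
    ("high-risk sector", False): (12, 100),
    ("low-risk sector",  True):  (2, 100),
    ("low-risk sector",  False): (30, 1000),
}

def rate_per_100(injuries, workers):
    return 100 * injuries / workers

# Crude comparison, ignoring sector: committees appear to have higher injury rates.
for has_committee in (True, False):
    inj = sum(v[0] for (_s, c), v in observed.items() if c == has_committee)
    wrk = sum(v[1] for (_s, c), v in observed.items() if c == has_committee)
    print(f"Crude rate (committee={has_committee}): {rate_per_100(inj, wrk):.1f} per 100 workers")

# Stratified comparison, within each sector: committees no longer look worse.
for sector in ("high-risk sector", "low-risk sector"):
    for has_committee in (True, False):
        inj, wrk = observed[(sector, has_committee)]
        print(f"{sector} (committee={has_committee}): {rate_per_100(inj, wrk):.1f} per 100 workers")
```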

2.4.2 Program logic models

Program logic models are highly regarded by evaluators as a means of assisting evaluation planning8. They distinguish short-, intermediate- and long-term outcome objectives. Figure 2.4 depicts a generic program logic model. Implementation objectives are concerned with the desired outputs of the intervention (e.g., provide training sessions). They are distinct from outcome objectives, which are concerned with the effects of the program. Continuing the training example, a short-term objective could be improved knowledge; an intermediate objective, changed behavior; and a long-term objective, decreased injuries.

A program logic model, developed for a particular safety intervention, could have more or fewer boxes than shown in Figure 2.4, depending on the number of intervention components and objectives. Also, the boxes could be linked in a different pattern of arrows.

Like a conceptual model, the program logic model gives insight into what should be measured in an evaluation. Ideally, one selects a means of measuring the achievement of each of the objectives identified in the model. Figure 2.5 depicts an example of a program logic model for a hypothetical ergonomic intervention.

Figure 2.3 Depiction of a confounding variable in a conceptual model

Figure 2.4 Generic program logic model

8 Program logic models are explained by Rush and Ogborne [1991].


A limitation of program logic models is that they often do not depict additional variables, such as confounding and effect-modifying variables. The advantage is that they more explicitly indicate the components of the intervention, and link them with the intervention objectives. Both types of models can be used to identify potential unintended outcomes of the intervention. These outcomes are not intended to result from the intervention; but they nevertheless might. As such, these outcomes can be difficult to anticipate, but consideration of the models can help.

Using the models, one looks at the changes that are supposed to happen following the intervention. You then try to think about what other effects could possibly result from the changes. For example, an intervention to reduce needle injuries by eliminating recapping prior to disposal into containers might not only have the intended effect of decreasing recapping injuries, but also an unintended effect of increasing disposal-related injuries if needles are disposed into poorly designed containers.

2.5 Quantitative vs. qualitative methods for collecting evaluation data

Tradeoffs are made in choosing which outcomes to measure. These include consideration of the quality of evidence required, available resources and the quality of data potentially available. In a very complete effectiveness evaluation, you collect pertinent data on all concepts represented in the intervention models, some using qualitative methods and others quantitative. Quantitative methods involve measuring indicators or variables related to the concepts in the models, such that they yield numerical data.

Figure 2.5 Example of a program logic model for an ergonomic program


Considerations in choosing an evaluation design

1) What is the strength of evidence required to address the purpose of the evaluation?
• What evaluation design will convince the decision-makers that the intervention has succeeded or failed?

2) Are there any ethical and legal considerations?
• Cannot assign people to a situation if it is likely to be more harmful
• Some forms of data collection might require consent of individuals

3) What data collection and analysis is possible with the resources (financial, personnel, expertise) available?
• Some forms of data collection are more expensive than others.

4) Has the intervention been introduced already? If so, has some data collection already taken place?
• Circumstances may limit choices regarding the evaluation design

5) What is the time frame required and/or available for evaluation?
• Does demand for results preclude long-term measurement?
• What does preliminary research predict regarding the timing of the maximum program effect, as well as any erosion of that effect?

6) What does the conceptual model for the intervention suggest should be measured and when?

7) How does the organization of the workplace influence the design?
• Is randomization of workers/workplaces or the use of comparison groups possible?
• Can an “intervention” group receive an intervention without other groups being aware?
• How many workers/workplaces are available for the study?

8) Does the design offer sufficient statistical power?

Qualitative methods, on the other hand, do not yield numerical data and rely instead on interviews, observation, and document analysis.

To clearly demonstrate intervention effectiveness, it is almost mandatory to use quantitative techniques and measure an outcome variable (e.g., injury rate). A demonstration of a statistically significant change or difference in this variable (e.g., a significant decrease in injury rate), and its unambiguous attribution to the presence of the intervention, provides very good evidence of intervention effectiveness. However, when both qualitative and quantitative methods are used in an evaluation, an especially rich source of information is generated, because the two can provide a check and a complement for each other. Whereas quantitative methods are used to answer “How big an effect did the intervention have on the outcome(s) of interest?” and “Was the effect statistically significant?”, qualitative methods help answer “How did the intervention have that effect?” and “What was the reaction of participants to the intervention?”. Qualitative methods can also be used at an earlier stage, when developing the quantitative methods for the study, by answering “What would be important to measure when determining the size of the effect of the intervention?”
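To illustrate the kind of quantitative analysis referred to above, the sketch below tests whether a drop in injury rate between the “before” and “after” periods is statistically significant. It is not part of the guide: the counts and hours are hypothetical, and the test (a normal approximation to the log rate ratio, treating injury counts as Poisson) is only one of several reasonable choices.

```python
# Minimal sketch (not from the guide): is an observed drop in injury rate
# statistically significant? Uses a normal approximation to the log rate ratio,
# treating injury counts as Poisson. All numbers are hypothetical.
from math import log, sqrt
from statistics import NormalDist

def rate_ratio_test(injuries_before, hours_before, injuries_after, hours_after):
    """Return the two rates, the rate ratio and a two-sided p-value."""
    rate_before = injuries_before / hours_before
    rate_after = injuries_after / hours_after
    log_rr = log(rate_after / rate_before)
    se = sqrt(1 / injuries_before + 1 / injuries_after)  # approx. SE of log rate ratio
    z = log_rr / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_before, rate_after, rate_after / rate_before, p

# Hypothetical example: 44 injuries in 1,000,000 hours before, 30 in 1,000,000 after.
rb, ra, rr, p = rate_ratio_test(44, 1_000_000, 30, 1_000_000)
print(f"rate before = {rb * 1e5:.1f}/100,000 h, after = {ra * 1e5:.1f}/100,000 h")
print(f"rate ratio = {rr:.2f}, two-sided p = {p:.2f}")
```

With these hypothetical counts the p-value is roughly 0.11: a sizeable relative drop can still fall short of statistical significance when injury counts are small. Whether any significant change can be attributed to the intervention is then a separate question of evaluation design.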

2.6 Choosing the evaluation design

Using the conceptual and/or program logic models as a guide for measuring, you will then choose an evaluation design. This is the general protocol for taking measurements; i.e., from how many groups of workers/workplaces will measurements be taken, and when will they be taken? Choosing the design involves a compromise among a number of considerations. Some are summarized in the box “Considerations in choosing an evaluation design” and will be discussed throughout the guide.


2.6.1 Strength of evidence provided by different evaluation designs

The goal of occupational safety intervention research is to be able to say that a specific intervention either enhanced, or did not enhance, worker safety. The degree of confidence with which you can legitimately draw this type of conclusion depends on the strength of evidence provided by the study. Usually, an experimental design provides the strongest evidence of a causal link between the intervention implementation and observed effects. This design strength arises from: 1) the use of a control group, which does not participate in the intervention and is compared with the group which does participate; and 2) the assignment of people or workplaces to either the intervention or the control group through an unbiased process (i.e., by randomization).

However, the logistical requirements of experimental designs often make them unfeasible, especially for single, smaller work-sites. In such cases, quasi-experimental designs should be considered. They represent a compromise between the practical restrictions of workplaces and the rigour required for demonstrating intervention effectiveness. They often include a control group, albeit one created through a non-random process, and in all cases yield more information than a non-experimental design. This last group of designs, including the common before-and-after design, is often necessary due to time and circumstances. Although a before-and-after design provides weaker evidence, using one is better than no evaluation at all.

Table 2.1: Characteristics of different types of evaluation designs


2.6.2 Ethical considerations

Potential for harm vs. likelihood of benefit

Ethical considerations might be another reason for not choosing an experimental design. If there is preliminary evidence that a certain type of intervention is beneficial, or if it could be presumed to be beneficial, you might not want to plan an evaluation in which some workers are purposely put into a control group where there is a chance of great harm. For example, the addition of machine guarding to a cutting machine could promise to be very beneficial, especially if there have been past severe injuries that probably could have been prevented by guarding. You might want to use only a before-and-after design in this situation.

Individual consent

Different cultures have various views on what types of research are ethically acceptable. Modern Western societies tend to emphasize individual autonomy and demand “fully informed consent” from study subjects. For example, in medical research, such consent is required to enroll a patient in a study.

The situation in North American workplaces is somewhat different than in the medical field. Management is considered to have certain prerogatives, as well as legislated responsibilities regarding safety, which allow it to make decisions about what happens in the workplace. Thus, employers can impose a new safety intervention upon their work force, and people might be assigned to an intervention group or control group without first being asked for their consent. This is less likely to occur in some European countries, particularly in Scandinavia, where more worker participation exists in work-related decisions. In addition, workplace research carried out under the aegis of an academic or federal research institution usually requires approval by an ethics committee, which will certainly insist upon individual consent to obtain data of a more personal nature (e.g., health records, employee opinions).

2.7 Practical tips

2.7.1 Time management

The evaluator has to be careful to protect the time set aside to properly conduct the evaluation. Often, the need to implement the intervention and resolve any unexpected problems can end up monopolizing whatever hours are available for the evaluation. The less obvious, but nevertheless important, demands of consulting with the evaluation committee and communicating with workplace parties regarding the evaluation can too easily fall by the wayside unless time is allocated for them.

2.7.2 Dealing with reaction to interim results

Be prepared for the reaction of workers, employers, etc. to interim results. If the results are encouraging, the group designated as a control may apply strong pressure to be allowed to receive the intervention immediately, rather than continue to incur possible risk until scientific proof has been accumulated. Your evaluation committee may come up with all sorts of interim suggestions about ways to modify the intervention, which could destroy the scientific purity of the study. In the face of proposed alterations to the intervention or evaluation design, always keep in mind what effect the proposed changes will have on your ability to analyse, interpret and draw conclusions from results. There is no easy single answer to these dilemmas; the decision made will depend partly on the importance of the evaluation compared to the value of the change the intervention is designed to produce.

2.7.3 Intervention diary

We strongly recommend keeping a diary or log-book of the intervention and the evaluation, in order to supplement formal data collection methods. It can be exceptionally valuable to look back after the evaluation has revealed unexpected results. The diary provides a convenient place to record information about any potential influences on the intervention outcomes that were not taken into account by the experimental design, including what is taking place in the workplace - such as a change in senior personnel or work processes. Evaluators can also track decisions made over the course of the evaluation which might not be documented elsewhere.

2.7.4 Getting cooperation of workplace parties

A critical step in a successful intervention and evaluation is obtaining the cooperation of all involved. This includes the people in the workplace participating in the intervention and evaluation, their supervisors or managers, those using the results of the evaluation, and anybody else interacting with the individuals or activities just mentioned. Evaluation can be threatening. It implies someone is checking, observing, and questioning. It is vital to explain to those involved what the evaluation is for, what will happen to the information collected (particularly if this is personal or confidential) and what will be done with the results.

Often, it is necessary to stress that the evaluation aims to learn about and improve the intervention, not to criticize or find fault. In particular, individuals with a stake in the success of the intervention will be sensitive, as will those who chose or designed it, or invested time in implementing it. Ongoing, repeated, clear and, ideally, interactive communication with all involved is recommended. Communication can help allay fears, reduce resistance to change, increase acceptance of the evaluation results and encourage adoption of the intervention, if it is shown to be successful.

2.8 Summary

This chapter has introduced some of the initial decisions and considerations that must be taken into account in planning an evaluation. These include the evaluators’ terms of reference and the selection of an evaluation committee. It was also shown how conceptual models and program logic models can help identify which concepts should be assessed for the purpose of an evaluation. Such assessment can take place using quantitative or qualitative methods. In addition, there are a number of ethical considerations in an evaluation.

Once you have made the broader decisions outlined in this chapter, you will be ready to choose the evaluation design. Options for designs will be discussed further in Chapters 3 and 4.

Key points of Chapter 2

• Decide the objectives and scope of the evaluation.

• Involve all parties relevant to the intervention (expertise, influence, etc.) in planning and at subsequent stages of the evaluation.

• Make a conceptual model relevant to the intervention, identifying concepts which must be measured or accounted for in the design.

• Make a program logic model.

• Start designing the evaluation, considering several aspects.

• Consider using a quasi-experimental or experimental design, if permitted by feasibility and ethical considerations.

• Keep an intervention diary.


Chapter 3
Before-and-after design: A simple evaluation design

3.1 Introduction
3.2 Design terminology
3.3 Non-experimental evaluation designs
3.4 Before-and-after design
3.5 Threats to internal validity of before-and-after designs
3.5.1 History threat
3.5.2 Instrumentation/reporting threat
3.5.3 Regression-to-the-mean threat
3.5.4 Testing threat
3.5.5 Placebo and Hawthorne threats
3.5.6 Maturation threat
3.5.7 Dropout threat
3.6 Summary


3.1 Introduction

Chapters 3 and 4 will discuss three basic types of evaluation designs: non-experimental, quasi-experimental and experimental.9 These designs differ in the strength of evidence they provide for intervention effectiveness. We can be more confident in a result from an evaluation based on an experimental or quasi-experimental design than in one based on a non-experimental design; but often we have no choice. We either have to use a non-experimental design or not attempt any evaluation. In this case, we advise that a non-experimental design is better than none at all.

Chapter 3 therefore focuses on the before-and-after design, a type of non-experimental design commonly used in safety studies. We outline some of the reasons the before-and-after design must be used with caution. These are called threats to internal validity; i.e., circumstances which threaten our ability to correctly infer whether or not the intervention had the desired effect. Chapter 4 will cover quasi-experimental and experimental designs.

3.2 Design terminology

In design terminology, “before” refers to a measurement being made before an intervention is introduced to a group, and “after” refers to a measurement being made after its introduction. Equivalent terms for “before” and “after” are “pre” and “post”.

3.3 Non-experimental designs

The before-and-after design is the most important non-experimental design for our purposes, since it is a reasonable option for an evaluation. Although it suffers from many threats to internal validity, it can, in many cases, provide preliminary evidence of intervention effectiveness, especially when supplemented with complementary information. We will consider it in detail in Sections 3.4 and 3.5. There are also two other types of non-experimental designs: after-only and after-only-with-a-non-randomized-control-group. As implied, measurement is made only after the intervention’s introduction in both of these designs and, hence, they are less desirable.

An example of an after-only design is where you measure safety knowledge in a test only after you train a work group. The weakness of this approach is that you cannot be sure whether the test score would have been any different without training. Even if the average score after training was 90%, the training might actually have been ineffective, since the group might also have scored 90% on the test if it had been given before the training. Thus, the only acceptable use for this type of design is to ascertain whether a certain standard has been met. It is not useful for effectiveness evaluation.

An example of the after-only-with-a-non-randomized-control-group design is a case where you measure knowledge in one group following training, and also measure it in another group which did not have the training. If the score is higher in the first group than in the second - for example, 90% compared with 80% - you might be tempted to think that the training had been effective. However, once again, you cannot tell if the training had any impact, because you do not know what the “before” values would have been. If they had been 90% and 80% respectively, your conclusion regarding program effectiveness would have been wrong. Thus, the after-only-with-a-non-randomized-control-group design is also not useful for effectiveness evaluation.

9 We follow the classification described by Cook and Campbell [1979] and Cook et al. [1990].


3.4 Before-and-after design

O   X   O

The before-and-after design offers better evidence about intervention effectiveness than the other non-experimental designs just discussed. Consider our example of training. Suppose a group had been given a test of knowledge in the morning and scored 50%, and following a day of training, the group repeated the same test and scored 80%. [We illustrate this in the above figure, with each “O” representing a measurement (in this case, the test) and the “X” representing the introduction of the intervention (training).] In this situation, the evidence would be strong that the training caused the increase in test score. Another way of saying this is that the evidence of causality would be strong. Besides the training, little else over the course of the day could have caused the observed increase in knowledge (provided, of course, that we do not use an identical test on the two occasions and give the group the answers in the meantime).

The before-and-after design is most useful in demonstrating the immediate impacts of short-term programs. It is less useful for evaluating longer-term interventions. This is because over the course of a longer period of time, more circumstances can arise that may obscure the effects of an intervention. These circumstances are collectively called threats to internal validity.

3.5 Threats to internal validity of before-and-after designs

Threats to internal validity are possible alternative explanations for observed evaluation results. The more threats to internal validity, the less confident we are that the results are actually due to the intervention. A good evaluation identifies the possible threats to its internal validity and considers them one by one as to the degree of threat they pose.

One way to minimize the threats to validity is to consider other data or theory. In doing so, you might be able to render a particular threat to validity highly unlikely. Or, you might be able to show that it poses a minimal threat and thus does not change the conclusions of the evaluation. Some of the later examples illustrate this approach.

The ideal way of dealing with internal validity threats is by using a quasi-experimental or experimental design. Chapter 4 will show that you can eliminate most of the threats we are about to discuss by including a good “non-randomized control group” in your evaluation design. However, in the following we will assume that it has only been possible to undertake a before-and-after design. Our suggestions for dealing with threats to internal validity are made with this limitation in mind.


Table 3.1: Threats to internal validity

History - Some other influential event(s), which could affect the outcome, occurs during the intervention.

Instrumentation/Reporting - The validity of the measurement method changes over the course of the intervention.

Regression-to-the-mean - A change in the outcome measure might be explained by a group with a one-time extreme value naturally moving back towards a more typical value.

Testing - Taking a measurement (e.g., a test) could itself have an effect on the outcome.

Placebo - The intervention could have a non-specific effect on the outcome, independent of the key intervention component.

Hawthorne - The involvement of outsiders could have an effect on the outcome, independent of the key intervention component.

Maturation - The intervention group develops in ways independent of the intervention (e.g., aging, increasing experience), possibly affecting the outcome.

Dropout - The overall characteristics of the intervention group change because some participants drop out, possibly affecting the outcome.

3.5.1 History threat

A “history threat” occurs when one or more events, which are not part of the intervention but could affect the outcome, take place between the “before” and “after” measurements. Common history threats include changes in the following: management personnel; work processes, structure or pace; legislation; and management-labor relations. Clearly, the longer the time between the “before” and “after” measurements, the more opportunity there is for an extraneous, interfering event to happen.

There are two history threats in the accompanying example (below), one from outside the company and the other from inside. Either the community campaign or the human resource initiatives - or both - are alternative explanations for the observed decrease in injury rate.

You are trying to evaluate a new ergonomic intervention for nurses in a hospital. An educational program about back health and lifting techniques was provided, a program of voluntary stretches introduced and lifting equipment purchased. It was found that the injury rate for the two years before the intervention was 4.4 lost-time back injuries per 100,000 paid hours and for the two years following it was 3.0. Thus, you conclude that the ergonomic intervention has been effective.

But what if one month after the education program, a government ministry launched a year-long public awareness campaign aimed at reducing back injury? And what if the president of the hospital was replaced two months after the in-house ergonomic program and her replacement introduced human resource initiatives to improve communication among staff? This would make you less confident about concluding that it was the intervention alone that made the difference in back injuries.

Example of a history threat


How to deal with history threats

The opportunities for history threats to arise in safety intervention evaluations are considerable, because of the complex nature of the workplace and its environment. Careful consideration should be given to the events which could affect the safety outcome of interest. Earlier, an intervention diary was recommended as a means of keeping track of such events throughout the intervention and evaluation. Interviews with key personnel, a form of qualitative investigation, can also identify significant events that have taken place. [Qualitative methods are discussed in Chapter 7.] Even if no such events have occurred in the broader environment or the workplace itself, it is important to be able to confidently state this in the report of findings from a before-and-after design.

If you do identify a history threat, try to estimate how large an effect the threat would likely have had. The following example illustrates how you can use other data to estimate these possible effects.

Consider the preceding example of the hospital ergonomic program which appeared to have decreased back injury rates. To reduce the threat of the community campaign, you could try to obtain statistics on changes in back injury rates for other hospitals in the same community. If the injury rate in the other hospitals had remained constant or increased, you could conclude that changes in your hospital (i.e., either the ergonomic intervention or the human resource (HR) initiatives) had an effect on injury rates beyond any effect of the community education program.

As for the effect that the HR initiatives might have had on injury rates, you could look at other HR-related outcomes, e.g., non-back-related absenteeism, non-back-related health care claims or turnover. You reason that if the HR initiatives were powerful enough to have an effect on injury rates, then they should also affect other employee-related health indicators. If these other outcomes show little change, you could be more confident that the observed decrease in injury rates was due to the ergonomic program alone and not the president’s initiatives.

Example of dealing with a history threat


3.5.2 Instrumentation/reporting threat

An instrumentation threat to validity occurs whenever the method of measuring the safety outcome changes between the “before” and “after” measurements. There are many ways this could happen, depending on the type of outcome measured, as the following examples illustrate (Exhibit 3.1).

An instrumentation threat of special concern in safety evaluations using injury statistics is any change in injury reporting over the course of the evaluation. Sometimes this arises simply through changes in administrative policies or procedures: e.g., the definition of a recordable injury changes, or the way injuries are recognized as work-related changes. A particularly tricky situation arises when the intervention itself affects the reporting of incidents, as the following examples portray (Exhibit 3.2).

1) You are evaluating a new, redesigned workstation by taking electromyographic measurements of muscle activity in workers using their old workstations and then again in the same workers with the new workstations. You did not realize that at the time the measurements were taken at the old workstations, the equipment was malfunctioning, leading to an underestimate of the true values.

2) You give a multiple-choice pre-test of knowledge with four possible choices: unlikely, somewhat unlikely, somewhat likely and likely. Afterwards, someone comments to you that they had a hard time choosing among the four choices and would have liked a “don’t know” option. In the post-test you decide to give five possible choices: the four above, plus “don’t know”.

3) You ask workers to fill out a safety climate questionnaire before and after the introduction of a joint health and safety committee in the plant, to evaluate the committee’s effectiveness in changing the safety climate. The first time, employees filled out the questionnaire on their own time; the second time, they were given time off during the day to complete it.

Exhibit 3.1 Examples of instrumentation threats

1) You are evaluating the effect of undergoing a voluntary safety audit on the work-site by looking at injury statistics before and after introduction of the audit. You realize, however, that the audit includes a section on data collection procedures which could affect the reporting of incidents on the work-site. The increased reporting could cancel out any reduction in actual injuries. Thus, any change in injury statistics following the audit might be due to the new reporting process alone, and not to a true change in the injury rate.

2) A mandatory experience rating program was introduced by the government. This gives companies financial incentives and penalties based on past injury statistics. Following the introduction of the program, it was found that lost-time injury frequency for the jurisdiction had declined. Critics of the program cite anecdotal evidence of workers with serious injuries being pressured to accept modified work. No one can be certain whether the decline in injury frequency results from enhanced injury prevention or simply from a suppression of reporting.

Exhibit 3.2 Examples of reporting threats arising from the intervention


Dealing with instrumentation and reporting threats

The best way to deal with most instrumentation threats is to avoid them in the first place. Keep “before” and “after” measurement methods constant. Make sure all measuring equipment is functioning properly. Give questionnaires in the same format and under the same conditions for all measurements. Keep processes for generating and collecting injury statistics constant.

In cases where the intervention itself might have affected the reporting of injuries, especially minor injuries, take a close look at the ratio of major to minor injuries. The greater the suppression of reporting, the higher this ratio will be, because minor injuries are easier not to report than major injuries. If the ratio is constant, then the likelihood of an instrumentation threat is reduced. You can also interview key informants to learn whether the intervention affected reporting.
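To make the ratio check concrete, here is a minimal sketch, not from the guide, with hypothetical injury counts. It simply computes the major-to-minor ratio in each period; a marked rise after the intervention would be a warning sign that minor injuries are going unreported.

```python
# Minimal sketch (not from the guide): comparing the major-to-minor injury ratio
# before and after an intervention. All counts are hypothetical.
before = {"major": 12, "minor": 96}   # hypothetical pre-intervention counts
after = {"major": 11, "minor": 55}    # hypothetical post-intervention counts

ratio_before = before["major"] / before["minor"]
ratio_after = after["major"] / after["minor"]

print(f"major:minor ratio before = {ratio_before:.2f}, after = {ratio_after:.2f}")
# Here the ratio rises from about 0.12 to 0.20 even though major injuries barely
# changed, which suggests suppressed reporting of minor injuries rather than a
# true decline in injuries.
```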

3.5.3 Regression-to-the-mean threat

Regression-to-the-mean is an issue when the basis for choosing the intervention group is a greater apparent need for the intervention (e.g., higher injury rates). This situation is not unusual, as the following examples illustrate (Exhibit 3.3).

An alternative explanation for the apparent success of the safety initiatives is “regression-to-the-mean”. This concept can be understood as follows. From year to year a group’s or company’s injury rates will fluctuate - sometimes they will be higher and sometimes lower. Any group with a lower than usual injury rate in a given year is therefore more likely to have its rate increase than decrease in the following year, assuming workplace conditions do not change. Similarly, the odds are that any group with a higher than usual injury rate in a given year is more likely to have its rate decrease than increase in the subsequent year. Thus, if the intervention includes only groups with high injury rates, part of any decrease observed may have nothing to do with the intervention itself. Rather, the rate is simply moving closer to the average rate.

Exhibit 3.3 Examples of regression-to-the-mean threats

1) Division A of the manufacturing company had an unusually high rate of slip and fall injuries last year. The president is concerned, and an enhanced inspection schedule is therefore implemented. Rates are lower the following year, and so it appears that the intervention was successful.

2) The government labor ministry has decided to implement a new educational initiative with companies whose injury rate in the previous year was twice as high as the average for their industrial sector. Personal contact was made with officials of those companies, at which time penalty-free on-site inspection and advice was offered. The group of companies had, on average, a lower injury rate the following year. Thus, it appears that the ministry program was successful.
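The regression-to-the-mean effect in examples like these is easy to reproduce with a small simulation, sketched below. It is not from the guide; all numbers are hypothetical, and the point is only that a group selected because of a bad year improves the next year with no intervention at all.

```python
# Minimal sketch (not from the guide): regression-to-the-mean in simulated data.
# 500 workplaces share the same stable underlying risk; yearly counts just fluctuate.
import random

random.seed(1)

def yearly_count():
    return random.randint(5, 15)   # hypothetical stand-in for year-to-year fluctuation

year1 = [yearly_count() for _ in range(500)]
year2 = [yearly_count() for _ in range(500)]

# Select the workplaces with the worst year-1 records, as a targeted program might.
high = [i for i, c in enumerate(year1) if c >= 13]

mean_y1 = sum(year1[i] for i in high) / len(high)
mean_y2 = sum(year2[i] for i in high) / len(high)
print(f"targeted group: year-1 mean = {mean_y1:.1f}, year-2 mean = {mean_y2:.1f}")
# The year-2 mean falls back towards the overall average (about 10) even though
# nothing changed, so a before-and-after comparison in this group would overstate
# the effect of any intervention given to it.
```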


Dealing with regression-to-the-mean threats

There is nothing that can be done to deal with a regression-to-the-mean threat if you have a single measure of frequency or rate before the intervention and one after, for a single workplace. However, with historical data, you can see whether the “before” measurement is typical of recent data or is an outlier. If it is the latter, then regression-to-the-mean does threaten the validity of the conclusions, and you might want instead to use more of the historical data to calculate the pre-intervention frequency or rate. Hauer [1980, 1986, 1992] has developed statistical approaches that correct for this phenomenon in data from multiple work-sites. However, an alternative approach would be a quasi-experimental or experimental design, in which some high-injury groups receive the intervention and others are kept under control conditions for comparative purposes.
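The check suggested above takes very little arithmetic. The following sketch is not from the guide; the historical rates are hypothetical, and the z-score cut-off is only a rough rule of thumb.

```python
# Minimal sketch (not from the guide): is the single "before" year an outlier
# relative to recent history, and what does a pooled baseline look like?
from statistics import mean, stdev

history = [3.1, 3.4, 2.9, 3.2, 3.0]   # hypothetical rates for earlier years
before_year = 4.4                     # the year used as the "before" measurement

z = (before_year - mean(history)) / stdev(history)
print(f"z-score of the 'before' year relative to history: {z:.1f}")

# If the before year is extreme (here z is well above 2), part of any later drop may
# simply be regression-to-the-mean; pooling it with earlier years gives a more
# conservative pre-intervention baseline.
pooled = mean(history + [before_year])
print(f"pooled pre-intervention rate: {pooled:.2f}")
```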

3.5.4 Testing threat

A testing threat to internal validity is a concern when the act of taking the “before” measurement might itself affect the safety outcome used to evaluate the intervention. This threat is only an issue for outcomes such as worker safety knowledge, attitudes or practices. Any of these could be affected by the act of taking initial measurements by methods involving questionnaires, interviews or observations. This contrasts with injury outcomes, which can usually be measured without interacting with workers.

Dealing with the testing threat

If you always plan to give a pre-test before giving the intervention, you do not really need to know whether any observed effect was due to the pre-test, the intervention, or the combination of both. However, you then must continue to include the pre-test as part of the intervention package; removing the pre-test risks decreasing the overall effect of the intervention. If you want to do away with the pre-test, you should at first continue to include a post-test. You can then check whether the post-test results are affected by the pre-test’s removal. If not, and if the groups with which you are intervening are similar over time (so that similar pre-test scores can be assumed), then you can conclude that a testing effect was unlikely. However, a truly definitive investigation of a testing effect requires a quasi-experimental or experimental design.

You want to evaluate a training intervention designed to increase worker participation in plant safety. You use a questionnaire to assess pre-intervention worker attitudes, beliefs and practices concerning participation. You administer a second, post-intervention questionnaire after a three-month program of worker and supervisor training. Comparison of the questionnaire results shows a significant change in scores, indicating that participation has increased.

Upon reflection you are not really sure what accounts for the improvement in the score. You reason that it could be any of the following: a) an effect of the training program alone; b) an effect of having awareness raised by completing the first questionnaire; or c) a combined effect of completing the questionnaire and then experiencing training. Either of the latter two possibilities involves a testing threat to the internal validity of the evaluation.

Example of a testing threat


3.5.5 Placebo and Hawthorne threats

The “placebo effect” is a concept from clinical studies of medical treatments. It has been observed that a percentage of study subjects treated with a placebo (i.e., an inactive substance), instead of a medical treatment, will show an improvement of symptoms beyond that expected from the normal course of their medical condition. It seems that the placebo operates through a psychological mechanism which results in an alleviation of symptoms: the patients believe that the treatment will be successful, and this belief has an effect in itself on the outcome.

The “Hawthorne effect”10 usually refers to an effect of the involvement of researchers or other outsiders upon the measured outcome. The term arose from a famous study of factory workers at Western Electric’s Hawthorne Works in the 1920s. A subset of workers was moved to a different section of the factory, their working conditions were manipulated and the effect of this on their productivity was observed. It turned out that the workers were more productive under any of the work conditions tried - even uncomfortable ones like low lighting. It is believed that the effect of the new psychosocial working conditions (i.e., increased involvement of workers) in the experimental situation actually overshadowed any effect of the changes in the physical environment.

Due to an increasing prevalence of “repetitive strain injury” in a telecommunications firm, the management agreed to purchase new keyboards for one division. A survey of employee upper extremity symptoms was conducted the week before the keyboards were introduced and then three weeks afterwards. Everyone was pleased to find a significant decrease in reported symptoms between the “before” and “after” measurements. Management was on the verge of purchasing the same keyboards for a second division, but there was concern about a “placebo effect” of the new keyboard.

Example of a placebo threat

A work-site decides to implement and evaluate a new training program focused on changing safety practices by providing feedback to employees. A consultant examines injury records and, with the help of workers and supervisors, develops a checklist of safety practices. The list will be used by the consultant to observe the work force and provide feedback to the employees about their practices. The consultant realizes that his presence (and the taking of observations) could make workers change their normal behavior. To avoid this potential Hawthorne effect, he makes baseline observations on a daily basis until his presence seems to no longer create a reaction and the observations become constant.

Example of a Hawthorne threat and one way to deal with it.

Dealing with Hawthorne or placebo effects requires somehow “separating” them from the effect of changing an injury risk factor as part of the intervention. In the Hawthorne example above, the effect of the consultant (and that of taking observations) was separated from the effect of providing feedback by having the consultant take baseline measurements prior to starting the feedback.

10 Some discourage the continued use of this term. See Wickström G, Bendix T [2000]. The “Hawthorne effect” - what did the original Hawthorne studies actually show? Scand J Work Environ Health 26:363-367.


3.5.6 Maturation threat

A maturation threat to internal validity occurs when the apparent change in the safety outcome could be due more to the intervention group changing naturally (e.g., employees growing older, becoming more knowledgeable or more experienced) than to the intervention itself.

A shipping company instituted annual medical screening for its dock-workers in order to identify and treat musculoskeletal problems early. Injury statistics after four years of the program indicated that the incidence of injuries remained about the same, but the length of time off work per injury had increased. It appeared that the program had been detrimental. But an occupational health nurse pointed out that there is a tendency for older workers to return to work more slowly than younger workers following an injury. Because there had been few new hires at the company, this maturation threat was a real possibility.

Example of a maturation threat

Dealing with maturation threats

In the example above, we need to consider aging of the work force as a maturation threat. A statistician might eliminate this maturation threat by using appropriate statistical techniques in the analysis (for example, adjusting for the age distribution of the work force). With such a correction in the above example, we might find that the program actually had either no effect or even a positive effect, instead of the apparent detrimental effect.

3.5.7 Dropout threat

The dropout threat to internal validity arises when enough people drop out of the study to alter the characteristics of the intervention group, and these characteristics are statistically related to the safety outcome of interest. This would not matter if you could still measure the safety outcome for all people who started the program. For instance, some individuals might drop out of a company back belt program. Yet, if they continue to work for the company, you could still evaluate the intervention if company injury rate data for the entire intervention group are available. Not surprisingly, safety outcome data on drop-outs are not always accessible, as the following example illustrates.

An evaluation of an intervention to reduce farmer injuries takes “before” and “after” measurements using a questionnaire. The questionnaire includes questions about injuries requiring hospital treatment experienced over the past year. Unfortunately, over the course of the intervention, a large number of farmers withdraw and do not fill out the questionnaire again. Thus, “after” measurements are not possible for a large percentage of participants.

You find that the average self-reported injury rate for the group decreased, and so it appears that the intervention had an effect. But you cannot be sure whether this was actually due to the intervention or to those with higher injury rates dropping out of the study earlier.

Example of a dropout threat


Dealing with a dropout effect

If you have access to final outcome data for the individuals or groups who dropped out of the intervention, be sure to include their data with that of the others assigned to the intervention at the start of the study (as long as this does not contravene the conditions described in their consent form). This will give a minimal, conservative estimate of the program’s potential effect, since not everyone was exposed to the program for the full length of time.

If you do not have access to final outcome measures for intervention dropouts, an important thing to do is compare the “before” measurement of dropouts, as well as their other characteristics (e.g., age), with those of the people who continued with the intervention. If those who continued and those who dropped out are similar in these measurements, then the threat of dropout to the outcome is reduced. You can assume that if the dropouts and the others were the same before the intervention, they would also be similar afterwards, with the exception of having completed the intervention. If the “before” measurements or other characteristics of the dropouts are different, then the threat of dropout persists. You could confine the estimate of the program’s effectiveness to those individuals who participated for the entire length of the program. However, the results would not be generalizable to the entire target population and would likely overestimate the program’s effectiveness for that population.
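A minimal sketch of the suggested baseline comparison is shown below. It is not from the guide: the scores are hypothetical, and it assumes scipy is available for the t-test (any comparable test, or a simple descriptive comparison, would serve).

```python
# Minimal sketch (not from the guide): comparing the "before" measurements of
# dropouts with those of participants who completed the intervention.
from statistics import mean
from scipy.stats import ttest_ind

completers_before = [52, 48, 55, 60, 47, 53, 58, 50]   # hypothetical baseline scores
dropouts_before = [38, 44, 35, 41, 39, 43]             # hypothetical baseline scores

t_stat, p_value = ttest_ind(completers_before, dropouts_before, equal_var=False)
print(f"completers mean = {mean(completers_before):.1f}, "
      f"dropouts mean = {mean(dropouts_before):.1f}, p = {p_value:.3f}")
# A clear baseline difference, as here, means the dropout threat persists: an
# estimate based only on completers may overstate the program's effectiveness.
```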

3.6 Summary

This chapter focused on the before-and-after design, which is considered a type of non-experimental design. A before-and-after evaluation design was shown to suffer from several threats to internal validity: history, instrumentation, regression-to-the-mean, testing, placebo, Hawthorne, maturation and dropout. Fortunately, as the illustrations showed, these threats to internal validity can be handled to some extent by additional data collection or analysis.

We also showed the inherent vulnerability of a before-and-after design to internal validity threats, especially over long evaluation periods. The longer your intervention and evaluation, the more you will want to consider a quasi-experimental design as an alternative. We discuss both quasi-experimental and experimental evaluation designs in the next chapter.


Key points of Chapter 3

• If you have no choice but to use a before-and-after design, try to eliminate the threats to internal validity.

• Identify other changes in the workplace or community that could have an effect on the outcome (history threats) and estimate their possible effect.

• Ensure that before and after measurements are carried out using the same methodology (to avoid instrumentation or reporting threats).

• Avoid using high-injury-rate groups, or other such extreme groups, as the intervention group in a before-and-after study (to avoid regression-to-the-mean threats).

• Allow for the fact that taking a test can have an effect of its own (testing threat).

• Identify possible placebo or Hawthorne threats and try to minimize them.

• Identify any natural changes in the population over time which could obscure the effect of the intervention (maturation threat), and possibly correct for their effect during the statistical analysis.

• Identify the effects of intervention participants dropping out and allow for this in the analysis.


Chapter 4
Quasi-experimental and experimental designs: more powerful evaluation designs

4.1 Introduction
4.2 Quasi-experimental designs
4.2.1 Strategy #1: Add a control group
4.2.2 Strategy #2: Take more measurements (time series designs)
4.2.3 Strategy #3: Stagger the introduction of the intervention
4.2.4 Strategy #4: Reverse the intervention
4.2.5 Strategy #5: Measure multiple outcomes
4.3 Experimental designs
4.3.1 Experimental designs with “before” and “after” measurements
4.3.2 Experimental designs with “after”-only measurements
4.4 Threats to internal validity in designs with control groups
4.4.1 Selection threat
4.4.2 Selection interaction threats
4.4.3 Diffusion or contamination threat
4.4.4 Rivalry or resentment threats
4.5 Summary


4.1 Introduction

In Chapter 3 we described the simplest type of evaluation design for intervention effectiveness evaluation, the before-and-after or pre-post design. We showed how its strength is inherently limited by several threats to internal validity.

In this chapter, we discuss several types of quasi-experimental and experimental designs. All offer some advantages over the simple before-and-after design, because some of the threats to internal validity are eliminated. In the first section we show how a quasi-experimental design evolves from the addition of one or more design elements to a before-and-after design. After this, we describe experimental designs. Although the latter offer the greatest strength of evidence, quasi-experimental designs are often more feasible in workplace situations. We close this chapter with a discussion of the various threats to internal validity that arise with a control or comparison group.

4.2 Quasi-experimental designs

There are five basic strategies for improving upon a before-and-after design. This section describes common approaches to adopting one or more of these strategies.

4.2.1 Strategy #1: Add a control group (e.g., pre-post with non-randomized control)

The pre-post with non-randomized control design mimics a simple experimental design. Like the experimental design, there is at least one group which receives the intervention (intervention group) and one group which does not (control group)11. The difference lies in the way participants are assigned to groups for the purpose of intervention implementation and evaluation. In an experiment participants are randomly assigned;12 in quasi-experimental designs, they are not. Often the assignment of participants to a group is predetermined by the work organization. For example, you might deliver an intervention to one company division. Another division, which is similar, acts as a non-randomized control group by not receiving the intervention. In the example below, the assignment of reindeer herders to intervention and control groups was determined by geographical location.


Design strategies which change a before-and-after design into a quasi-experimental design

Strategy 1: add a control group

Strategy 2: take more measurements before and after the intervention implementation

Strategy 3: stagger the introduction of the intervention among groups

Strategy 4: add a reversal of the intervention

Strategy 5: use additional outcome measures

11 The terminology varies regarding the use of the term “control group”. Some use it only in the context of experimental designs, in which the intervention and control groups are formed through randomization. Others, including ourselves, also use the term control group in the context of quasi-experimental designs, in which groups are formed through a non-random process. In this case, the quasi-experimental control group is referred to as a “non-randomized control group”. “Comparison group” is sometimes a synonym for “control group”, but in other cases is reserved to describe the non-intervention group in a quasi-experimental design.

12 Random assignment of participants to groups is discussed in Section 5.4.


Advantages of the “pre-post with non-randomized control group” design

By adding a non-randomized control group to the simple before-and-after design, you automatically reduce some of the threats to internal validity discussed in Chapter 3. In particular, interference by external circumstances (i.e., history effects) is reduced, because such circumstances will often apply to both the control group and the intervention group. The design therefore allows a separation of the effect of the intervention from that of other circumstances. The following example illustrates this.

Example of a pre-post with non-randomized control group design

Due to the high rate of injuries among reindeer herders, preventive measures were developed. In intervention group A, letters describing possible preventive measures were sent to district leaders and contacts, who were asked to pass on the information to herders in their district. In intervention group B, occupational health personnel trained in prevention measures passed on the information during medical examinations. There was also a control group C, which received no intervention. Pre-post statistics for the three groups are shown below.

[Number of accidents/working days for the reindeer herder groups13]

13 Data from Pekkarinen et al. [1994] with permission of the Arctic Institute of North America.

Statistical analysis confirmed that the groups did not differ in terms of a decrease in accident rate. The authors had to conclude that the intervention efforts were ineffective.

The example demonstrates that it is possible to conclude that an intervention is ineffective, even though fewer accidents are seen after the intervention. The control group showed the evaluators how much change to expect in the absence of the intervention. These changes were likely due to history, and possibly testing and Hawthorne effects, according to the original report by Pekkarinen et al.13 Thus, we see how the presence of the control group allowed one to examine the intervention effect free from the influence of internal validity threats.

On the other hand, a new threat to validity - selection effects - arises from using a non-randomized control group. This threat occurs when the intervention and control groups differ with respect to the characteristics of group participants, and these differences influence the measures used to determine an intervention effect. Selection effects will be discussed further at the end of the chapter.
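To show the arithmetic that a pre-post design with a control group supports, here is a minimal difference-in-differences style sketch. It is not from the guide (and not the Pekkarinen analysis); all rates are hypothetical.

```python
# Minimal sketch (not from the guide): a difference-in-differences style summary of
# a pre-post design with a non-randomized control group. All rates are hypothetical
# (injuries per 100 workers per year).
intervention = {"before": 8.0, "after": 5.5}
control = {"before": 7.8, "after": 6.9}

change_intervention = intervention["after"] - intervention["before"]   # -2.5
change_control = control["after"] - control["before"]                  # about -0.9
effect = change_intervention - change_control                          # about -1.6

print(f"change in intervention group:  {change_intervention:+.1f}")
print(f"change in control group:       {change_control:+.1f}")
print(f"estimated intervention effect: {effect:+.1f}")
# The control group's change estimates what would have happened anyway (history,
# Hawthorne effects, etc.); only the remaining difference is attributed to the
# intervention. A formal analysis would add a statistical test and consider
# selection effects.
```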


4.2.2 Strategy #2: Take more measurements (time series designs)

A simple time series design differs from the simple before-and-after design by taking additional measurements before and after the intervention. A baseline time trend is first established by taking several outcome measurements before implementing the intervention. Similarly, in order to establish a second time trend, several of the same measurements are made after introducing the intervention. If the intervention is effective, we expect to find a difference in outcome measures between the two time trends.
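Analytically, the two trends are often compared with a segmented (interrupted time series) regression. The sketch below is not from the guide; the monthly values are hypothetical and numpy is assumed to be available.

```python
# Minimal sketch (not from the guide): fitting a simple interrupted time series
# (common trend plus a level shift at the intervention) by least squares.
# The values are hypothetical safety-outcome measurements.
import numpy as np

before = [14, 15, 13, 16, 15, 14]          # baseline measurements
after = [10, 11, 9, 10, 11, 10]            # measurements after the intervention
y = np.array(before + after, dtype=float)

t = np.arange(len(y))                       # time index
step = (t >= len(before)).astype(float)     # 0 before the intervention, 1 after
X = np.column_stack([np.ones_like(t), t, step])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, trend, level_shift = coef
print(f"underlying trend per period: {trend:.2f}")
print(f"estimated level shift at the intervention: {level_shift:.2f}")
# An abrupt level shift against a flat or gradual trend is the pattern that argues
# against maturation and regression-to-the-mean; a fuller analysis would also model
# a change in slope and check for autocorrelation.
```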

Advantages of simple time series design

Figure 4.1 illustrates how much easier it is to interpret the results of a time series evaluation design than a simple before-and-after design. In the first panel we see that there has been a drop in our safety measure from the period before the intervention to the one afterwards. As discussed in Chapter 3, several possible alternative explanations for this come to mind, e.g., history, maturation, instrumentation or Hawthorne effects. By adding measurements, as shown in the second panel, we can reduce the likelihood of some of these alternative explanations.

The maturation threat is eliminated because we observe that the change between the baseline time trend and the second time trend is abrupt. In contrast, changes due to maturation, such as increasing age or experience, are more gradual. Regression-to-the-mean and testing effects have also been eliminated as possible threats, because we can see that safety outcomes are repeatedly high before and repeatedly low afterwards. Placebo and Hawthorne effects are less likely explanations because they tend not to be sustained once people have adapted to a change in their conditions. The threat of a history effect is somewhat lessened because the window of opportunity for a coincidental event is narrowed by the more frequent measures taken. Dropout and instrumentation both remain as threats, unless additional information is considered.

How many measurements are needed for a time series design?

The number of measurements you need for a time series design depends on the amount of random fluctuation (noise) in the outcome being measured and on how much of an impact the intervention is expected to have. Somewhere between 6 and 15 measurements to establish the baseline, and the same number again to establish the trend afterwards, are typically required.14
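The dependence on noise and effect size can be illustrated with a small simulation, shown below. It is not from the guide; the shift, noise level and crude detection rule are hypothetical choices, meant only to show why a handful of noisy measurements per phase is rarely enough.

```python
# Minimal sketch (not from the guide): how the number of measurements per phase and
# the amount of noise affect whether a level shift is detected. All parameters are
# hypothetical.
import random

random.seed(2)

def detects_shift(n_per_phase, shift, noise_sd, n_sims=2000):
    """Fraction of simulated evaluations in which the drop clearly exceeds noise."""
    hits = 0
    for _ in range(n_sims):
        before = [random.gauss(10.0, noise_sd) for _ in range(n_per_phase)]
        after = [random.gauss(10.0 - shift, noise_sd) for _ in range(n_per_phase)]
        diff = sum(before) / n_per_phase - sum(after) / n_per_phase
        se = noise_sd * (2 / n_per_phase) ** 0.5   # approx. SE of the difference in means
        if diff > 2 * se:                          # crude "clearly lower" criterion
            hits += 1
    return hits / n_sims

for n in (3, 6, 15):
    print(n, "measurements per phase:", round(detects_shift(n, shift=2.0, noise_sd=2.0), 2))
# With noisy data, a few measurements per phase detect the shift only some of the
# time; more measurements (or a less noisy outcome) are needed for a reliable answer.
```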

Because of the necessity for many measurements, the time series design is suitable for only some situations. For example, a time series design using injury rate as the outcome measure would likely not be suitable for a small workplace. It simply takes too long - a year or more - for a small workplace to establish a reliable injury rate.

Figure 4.1 Comparison of before-and-after and time series designs [two panels - a before-and-after design and a simple time series design - each plotting the safety outcome against time, with the point at which the intervention commences marked]

14 Several workplace examples can be found in Komaki and Jensen [1986].

On the other hand, the design could be quite suitable for a group of small workplaces, for a bigger workplace, or if observed work-site conditions were measured instead of injury rate. These situations permit more frequent and reliable measurement.

Even when it is not possible to take as many measurements as are needed for a time series analysis, taking additional measurements over time is still a good idea. It gives you a better sense of the pattern of variability over time, and of whether the last “before” measurement is typical of the ones preceding it and the first “after” measurement is typical of the ones following it. You are better informed about potential threats to internal validity and the sustainability of the intervention’s effect. It may also allow you to estimate the effect of the intervention more accurately by pooling data.

Multiple time series designs

Even better than using basic strategy #1 or #2 alone, you can strengthen the before-and-after design further by combining both approaches: adding a control group and taking more measurements.

4.2.3 Strategy #3: Stagger the introduction of the intervention (e.g., multiple baseline design across groups)

A special type of multiple time series design is known as the “multiple baseline design across groups”. With this design, all groups eventually receive the intervention, but at different times. As a result, all groups also serve as comparison groups for each other.

Advantages of the multiple baseline across groups design

The advantage of the multiple baseline across groups design is that it markedly reduces the threat of history effects. When an intervention is given to only one group, you can never really be sure that something else did not coincidentally occur at the same time to cause the measured effect. Even when you are using a control group, something could still happen to only the intervention group (besides the intervention itself) that affects the outcome.

When the intervention’s introduction is staggered, with the apparent effects correspondingly staggered, history effects are an unlikely explanation for the result. This is because one coincidence of the intervention and an extraneous event happening close together in time is plausible, but two or more such coincidences are much less likely.

Whenever a workplace or jurisdiction has more than one division or group, a staggered introduction of the intervention should be considered as an alternative to introducing it to all divisions or groups at the same time. This staggered arrangement can also allow an interim assessment and, if appropriate, modification of the intervention or its implementation before it is introduced into other divisions (though such modifications should be considered in the analysis and interpretation of results).


15 Example from Komaki J, Barwick KD, Scott LR [1978]. A behavioral approach to occupational safety: pinpointing and reinforcing safety performance in a food manufacturing plant. Journal of Applied Psychology 63:434-445. Copyright © 1978 by the American Psychological Association. Adapted with permission.

Example of a multiple baseline across groups design15

A safety behavior training intervention was undertaken at a food manufacturing plant. The intervention was first introduced in the wrapping department and then in the make-up department. The intervention started with an educational session on safety behaviors, after which a list of safety behaviors was posted. From then on the group was given feedback by posting the results of behavioral observations.

Safety behaviors were measured by a trained observer (three to four times a week). The observer used a checklist which gave an estimate of the percentage of incidents performed safely. Baseline measurements of safety behaviors were taken prior to introduction of the intervention.

You can see how, in each department, the change in safety behaviors followed implementation of the intervention. Having this sequence of events happen not only once, but twice, bolsters the causal link between intervention and behavior change. Further, because implementation occurred at different times, we really end up with two separate estimates of the amount of change caused by the intervention. [The reversal part of the intervention will be discussed in Section 4.2.4.]


4.2.4 Strategy #4: Reverse the intervention

One way of strengthening a before-and-after or even a time series design is to follow the introduction of an intervention with another phase of the project in which the intervention is removed. In the simplest case, you end up with three phases: a baseline phase, an intervention phase, and a reversal or withdrawal phase. The rationale here is that if you remove the intervention conditions, you should correspondingly see a change in the outcome back towards the baseline condition.

Of course, this design is clearly not suitable for all situations, because it is hoped that the effect of an intervention will last and therefore not be easily reversed. However, as the figure in Section 4.2.3 shows, it has been found useful when behavior is the safety outcome being measured. In that case, the intervention was “reversed” by no longer giving the posted feedback.

Advantages and disadvantages of designswith a reversal phase

If you can demonstrate the effect of a reversalphase, you will have markedly reduced several ofthe internal validity threats discussed in Chapter4 - in particular history, maturation, testing,dropout and Hawthorne (assumingresearchers/outsiders are still present duringreversal phase). Instrumentation and placeboeffects may still remain as issues and should beconsidered. After demonstrating the effect ofintervention reversal, you are then free toreinstate the intervention.

The downside to the reversal design feature is that repeated changes in safety programming could create confusion, stress and resentment among those affected. As well, if an intervention has looked promising following its introduction, subsequent removal could be considered unethical. Thus, use this design feature with caution.

4.2.5 Strategy #5: Measure multiple outcomes

The final strategy for increasing the strength of an evaluation design is to use more than one type of outcome measure. We describe two approaches to doing this.

4.2.5.1 Add intervening outcome measures

We pointed out, using models in Chapter 2, that there can be a number of outcomes intervening between an intervention and the final outcome. We should ideally try to measure as many of these different intervening outcomes as is feasible, in order to bolster the strength of evidence provided by the evaluation design. This includes measurement of the intervention's implementation, as well as short- and intermediate-term effects of the intervention.

Measures of intervention implementation, such as the documentation of equipment purchases and work task modification in the following example, are especially important. In instances where a program has failed, you want to be able to distinguish between an inherently ineffective program and a flawed implementation. If an intervention has not been implemented as intended, measuring effectiveness by measuring changes in outcome will likely underestimate the intervention's potential impact. Thus, if inadequate implementation is found by the evaluation, you might try first to improve this part of the intervention, instead of discarding the intervention altogether.


Page 51: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

4.2.5.2 Add a related but untargeted outcome measure

The second approach to adding outcome measures involves measuring an outcome which is similar to the main outcome measure, but not targeted by the intervention. The additional outcome measure should be similar enough to the main outcome measure so that it is susceptible to the most important threats to internal validity. However, it also needs to be different enough that it should be unaffected by the intervention. The following examples show how this approach works.


A company plans to implement a participatory ergonomics program. Plans involve forming a labor-management committee, assessing employee needs, purchasing new equipment, modifying work tasks and providing worker education. The health and safety coordinator plans to measure the ultimate impact of the program by comparing self-reported symptoms and injuries before and after the intervention is implemented.

However there are concerns that a change in symptom and injury rates could have a number of alternative explanations, such as staffing changes, the business cycle, management changeover and Hawthorne effects, etc. To deal with this concern, the health and safety coordinator plans some additional measurements: records of equipment purchases; and self-reports of work tasks, practices and stressors. These all measure outcomes intervening between the intervention and the final outcome of changes in symptoms and injuries.

Example of adding intervening outcome measures

Mason [1982] tried to evaluate the effectiveness of a train-the-trainer kinetic handling training course, by looking at the change in the rate of back and joint injuries in the companies of instructors who had taken the course. When practically no change was found after a year, it was valuable to know that this was probably because few of the instructors had organized and carried out in-company courses based on their own training during that year. Furthermore, those who did run courses had failed to retain most of their training and therefore could not pass on the handling techniques. The lack of any measurable effect of the intervention on injuries was therefore no proof that the kinetic handling technique itself was not effective, but rather that an improvement in the training methods for trainers was needed.

Illustration of the value of measuring intervention implementation

Page 52: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

4.3 Experimental designs

Two key features of an experimental design are 1) the use of a control group and 2) the assignment of evaluation participants to either intervention or control groups through randomization, a process in which participants are assigned to groups in an unbiased manner.18 Thus, an experimental design uses an approach similar to strategy #1 in quasi-experimental designs (Section 4.2.1).

The use of randomization gives the experimental design greater strength. We can be more certain that any differences between the intervention group and the control group, with respect to the apparent effect of the intervention, can be attributed to the intervention, and not to group differences. Although it is often not feasible to use an experimental design, it has been used in several occupational safety situations.

4.3.1 Experimental designs with "before" and "after" measurements

Earlier, three types of quasi-experimental designs were discussed that use non-randomized control groups: pre-post with non-randomized control group (Section 4.2.1), multiple time series (4.2.2) and multiple baseline across groups (4.2.3). These same design approaches can be turned into experimental designs by using randomization to create the groups.

The first design shown in Figure 4.3, "pre-post-with-randomized-control," has been used in the subsequent examples. The first example involves randomizing work-sites into groups, and the second, randomizing individuals into groups.


1)16 The effect of new equipment on oil-drilling platforms was primarily evaluated by changes in the rate of tong-related injuries, a type of injury which should have been reduced by using the new equipment. The rate of non-tong-related injuries, a related but untargeted outcome measure, was also tracked. Although this second type of injury should have been unaffected by the intervention, it would likely be similarly susceptible to any history or reporting effects threatening the internal validity of the evaluation. Thus, including this untargeted injury measure in the evaluation reduced these threats, since any history or reporting effects on tong-related injuries would also be detected by changes in the non-tong-related injuries.

2)17 An ergonomic intervention among grocery check stand workers was primarily evaluated by measuring self-reported changes in musculoskeletal discomfort. The intervention appeared successful because of significant change in reported symptoms in the neck/upper back/shoulders and lower back/buttocks/legs, the two areas predicted to benefit from the ergonomic changes. This conclusion was bolstered by a finding of no significant changes in symptoms in the arm/forearm/wrist, which were not targeted by the intervention. This made history, maturation, instrumentation, placebo and Hawthorne effects a less likely explanation for the improvement in the targeted areas.

Examples of adding related but untargeted outcomes

16 Based on Mohr and Clemmer [1989]
17 Based on Orgel et al. [1992]
18 Randomization is discussed in Section 5.4.

Page 53: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


Figure 4.3: Experimental designs with “before” and “after” measurements

Example of an experimental design (1)19

An intervention for principal farm operators and their farms consisted of an on-site farm safety check with feedback and a one-day educational seminar. Potential participants in the intervention were identified from a list of all farms in the Farmers Association, using a random selection process. Of these, 60% of farm operators agreed to participate in the study. They were then assigned to either an intervention or control group, using a randomization procedure. To evaluate the intervention, these groups were compared on measures taken before and after the intervention: self-reported injuries and near-injuries (final outcome) and safety perceptions, practices and attitudes (intermediate outcomes).

19 Adaptation of intervention described in Glassock et al. [1997]

Table 4.3 Example of an evaluation of a farm safety intervention using an experimental design

Page 54: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


Example of an experimental design (2) 20

Two interventions for the prevention of back injury were evaluated with an experimental design involving warehouse workers for a grocery distribution center. Ninety workers with the same job classification were randomly selected from among the 800 employees at a warehouse. The ninety workers were then randomly assigned to one of three groups. One group was given one hour of training on back injury prevention and body mechanics on the job. A second group was also given the training, as well as back belts to wear. The third group served as a control group, receiving neither training, nor back belts. Both "before" and "after" measurements were taken: knowledge (short-term outcome); injuries and days absent as reported in health records (final outcomes). Abdominal strength was also measured in case it decreased as a result of wearing the belt (unintended outcome).

20 Based on Walsh and Schwartz [1990]

Table 4.4 Example of an evaluation of back belt and training interventions using an experimental design

4.3.2 Experimental designs with "after"-only measurements

One advantage of randomization is that in some situations it may allow for not having "before" measurements. This can be especially advantageous if you are worried about the measurement influencing the outcome of interest ("testing effect", Section 3.5.4). It is also advantageous if taking a before measurement is costly (e.g., the administration of a questionnaire).

The disadvantage of not obtaining "before" measurements is that it will not be possible to see if the groups differed initially with respect to the outcome measure. You would therefore not be able to make any allowance in the analysis for these group differences.

Page 55: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

4.4 Threats to internal validity in designs with control groups

We discussed how designs that use control groups can markedly reduce the threats to internal validity discussed in Chapter 3. However, using control groups introduces some new threats to internal validity which we consider below. In spite of these, control groups are still strongly recommended. On balance, they strengthen the evaluation design far more than they weaken it.

4.4.1 Selection threats

A selection threat occurs when the apparent effect of the intervention could be due to differences in the participants' characteristics in the groups being compared, rather than the intervention itself. For this reason, control and intervention groups should be similar, especially with respect to any variables that can affect the measured outcome(s).21

Whenever you compare groups created through a non-random process, as in the quasi-experimental designs, you must consider how selection could affect your results. In what way do the people in the groups differ? Do they differ in their initial value of the safety outcome measure or other characteristics (e.g., age, level of experience, level of education, etc.) which could influence the way groups respond to the intervention? If so, you need to collect information on these differences and make allowances for them in your statistical analysis.

Even by using a randomization procedure to create groups, as in a true experiment, you can have a selection threat, since chance alone can leave small groups with different characteristics.

4.4.2 Selection interaction threats

We just described how it is important for groups to be similar in their characteristics at the outset of an evaluation. It is also important that they remain similar and are treated similarly over the course of the evaluation. Otherwise, selection interaction effects threaten the legitimacy of your evaluation conclusions. Recall that there are a variety of threats to internal validity in before-and-after designs, e.g., history, instrumentation, dropout, etc. In many cases having a control group - especially a randomized control group - can reduce or eliminate these threats to internal validity. The exception to this situation is when something happens to one group (e.g., history, instrumentation, maturation, etc.) and not to the other, resulting in selection interaction threats; i.e., selection-history, selection-instrumentation, selection-maturation, etc.

For example, a selection-history effect could occur if you are comparing two different divisions in a "pre-post with non-randomized control group" design. What if the supervisor of only one of these divisions changed during the course of the evaluation? You could not be sure whether between-group differences in the "before" to "after" changes were due to the effect of the intervention on the intervention group - or due to a change in the leader of one group. Selection-history interaction threats to internal validity are often beyond the evaluator's control, as in the example above. If they should arise, they are dealt with as was described for history threats (Section 3.5.1).

A regression-to-the-mean interaction threat to internal validity arises if you deliver an intervention to units with high injury rates and compare their results to units with lower injury rates. Even if there was no intervention effect, the high injury group would tend to have a decrease in rates, and the others might have even shown an increase. The proper control group


21 Depending on the type of evaluation design and the context, these characteristics or variables are sometimes called confounders; other times they are called effect modifiers or moderating variables.

Page 56: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

would be a second group with similarly high injury rates.

A dropout interaction threat arises if one group has a greater rate of dropout than the other, especially if it results in the two groups having different characteristics. Characteristics of particular concern are those which could affect how the study participants respond to the intervention (e.g., age, level of experience, level of education), as well as differences in the initial value of the safety indicator used to measure outcome. While these differences are sometimes taken into account in the statistical analysis, it is preferable to avoid selection-dropout threats to internal validity altogether by taking steps to ensure that people continue participating in the intervention project and its evaluation.

Most other selection interactions, i.e., selection-instrumentation, -testing, -placebo, -Hawthorne, or -maturation effects, can be minimized by treating the control group as similarly as possible to the intervention group, with the exception of the intervention itself. Ideally, the evaluators should have just as much contact with individuals in the control group as those in the intervention group. In practice, such an arrangement may not be feasible.

4.4.3 Diffusion or contamination threat

A diffusion threat to internal validity (also known as a contamination threat) occurs when the intervention delivered to one group "diffuses" to the other. This can easily happen when the intervention is educational in nature, since workers naturally share information with one another. It is even possible for new equipment given to the intervention group to be shared with the control group. Diffusion is most likely to occur when the intervention is perceived as beneficial. It is undesirable for an evaluation because it reduces the differences observed between groups in their "before" to "after" changes. Thus, you might conclude that an intervention was ineffective when it really was not. The best way to reduce the threat of diffusion is by keeping the intervention and control groups as separate as possible.

4.4.4 Rivalry or resentment threat

Finally, threats to validity can arise when people in the control group react to not receiving the intervention. Suppose a safety incentive program has been introduced to encourage safe behaviors. The control group could react by not reporting injuries so its safety performance ends up looking good compared to the intervention group. Or the opposite might be done. Injuries could be "over-reported" to demonstrate that the group needs an incentive program as well. In both cases we could say that the control group has changed its behavior due to rivalry. Resentment effects are also possible. The control group, for example, could resent not being given the opportunity to participate in an incentive program. This souring of labor-management relations in the division could cause an increase in injury rates.

Rivalry or resentment threats can affect the evaluation's conclusions in either direction. Depending on the situation, they can either increase or decrease the differences between groups in "before" to "after" changes. The effects just described can sometimes be avoided by communicating well with groups or promising that if the intervention is shown to be effective, the control group will receive the intervention afterwards. If interventions are conceived and introduced through a participatory process, unexpected reactions are less likely. However, it is impossible to anticipate every reaction to a program. This is one area where qualitative investigation can be very helpful. Interviews with a few knowledgeable people in the control group should give insight into whether rivalry or resentment dynamics are an issue. As with the diffusion threat, the rivalry or resentment threats might be avoided if groups in different locations are compared and communication between the groups does not occur.


Page 57: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

4.5 Summary

A quasi-experimental or experimental design is more likely to give a truer estimate of the effect of an intervention than a non-experimental design. You can change a (non-experimental) before-and-after design into a quasi-experimental one through one or more of the following design strategies: adding a control group; taking more measurements; staggering the introduction of the intervention; reversing the intervention; or using additional outcome measures. By adding these design elements you can strengthen the design and reduce or eliminate the threats to internal validity discussed in Chapter 3.

Experimental designs differ from quasi-experimental designs by always involving a control group and by assigning subjects to intervention and control groups under a randomization scheme. Otherwise, many of the elements of quasi-experimental and experimental designs are the same. Although some new threats to internal validity need to be considered when using designs with control groups - selection, selection interactions, diffusion, rivalry, resentment - the use of control groups is almost always recommended whenever feasible.


• Improve upon a simple before-and-after design, and use a quasi-experimental design, through one or more of five strategies:

• adding a control group

• taking more measurements

• staggering introduction of the intervention among groups

• adding a reversal of the intervention

• using additional outcome measures.

• Improve upon a quasi-experimental design, and use an experimental design, by assigning participants to intervention and control groups through randomization.

• Check that intervention and control groups receive similar treatment throughout the evaluation period, apart from the intervention itself.

• Avoid (but check for) diffusion, rivalry or resentment effects.

Key points of Chapter 4

Page 58: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


Study sample: Who should be in your intervention and evaluation?

Chapter 5

5.1 Introduction

5.2 Some definitions

5.3 Choosing people, groups or workplaces for the study sample
5.3.1 How to choose a (simple) random sample
5.3.2 How to choose a stratified random sample

5.4 Randomization - forming groups in experimental designs
5.4.1 Why randomize?
5.4.2 Randomized block design and matching

5.5 Forming groups in quasi-experimental designs

5.6 Summary

Page 59: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

5.1 Introduction

Having decided on the evaluation design, you should choose which individuals, work groups or workplaces will be included in the evaluation project. They comprise the study sample. This chapter will discuss how to select the study sample from a larger group of possible participants, and how to form different comparison groups within the study sample, through randomization and other means.

5.2 Some definitions

Let us start by distinguishing three terms:

1) The target population22 consists of the people, groups or workplaces which might benefit from the safety intervention. For example, if you identify a safety need for construction workers and conduct an intervention among a participating group of workers, you want it to apply to all construction workers. The target population is "all construction workers" in this case.

2) The sampling frame is a segment of the target population - e.g., construction workers in a given company, union, or city.

3) A further subset, the study sample, includes those people, work groups or workplaces chosen from the sampling frame.

In summary, the study sample is a sub-group of the sampling frame, which in turn is a sub-group of the target population.

5.3 Choosing people, groups or workplaces for the study sample

The people or workplaces included in the evaluation may be determined by circumstances. For instance, if your concern is a single workplace, and a decision has been made to introduce the intervention to all its 50 employees, then your study sample has been pre-determined. However, in dealing with a large workplace, it might not be necessary to include everyone in the evaluation, if lower numbers (i.e., a smaller sample size) provide you sufficient statistical power (see Chapter 8) to detect an intervention effect. This is especially worth considering if a measurement on larger numbers of people increases the cost of data collection or otherwise makes it unfeasible. Thus, situations will arise where you need to select a study sample for evaluation purposes from among all participants in the intervention.

The study sample should be representative of the sampling frame and/or target population. This is because more is required than just knowing whether the intervention or program worked only for the particular group selected for the study and their particular circumstances (e.g., time, place, etc.). A bigger concern is whether the intervention is generally effective. In other words, you want the results to have generalizability - also known as external validity. Safety managers will want to know that the evaluation results, obtained with a sample selected from among their employees, apply to their workplace as a whole. Someone with a multiple workplace perspective - i.e., a researcher, corporate safety director or policy-maker - will want the results to apply to the whole target population.

How can the sample be made representative of the sampling frame? Suppose a training intervention is being implemented for safe work practices, and you will evaluate it by observing workers doing their job before and after the intervention. You have determined it is not feasible to observe all of them going through the program; so you limit yourself to a smaller sample.

You could ask for volunteers, but they will


22 Note that the term target population is not always defined this way. Some people define it as what we are calling the sampling frame.

Page 60: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

probably be unrepresentative of all the workers, since they are more likely to have particularly good work practices. Another alternative is to choose everyone who works in one area (making it easier to do the observations). But again, this group is unlikely to represent the whole workplace, given the wide variety of jobs.

The best method is to choose a random sample, which increases the chance of a representative sample from the target population. This avoids choosing subjects on any other basis (e.g., volunteering).

Using a control group, as well as an intervention group, in your evaluation design will increase the required sample size.

5.3.1 How to choose a (simple) random sample

Random selection involves choosing your sample where each person in the sampling frame has a known probability of being selected. With simple random sampling, the most common type of random sampling, the probability of being selected is equal for everybody. The process is similar in many ways to tossing a coin for each person in the sampling frame and choosing those people for whom the coin comes up heads. But with coin-tossing the probability of getting heads is 0.5 and of getting tails is also 0.5. We may only want to choose 50 out of 200 people, so the probability of selection in this case is one in four.

There are several different ways to choose a random sample. One of the simplest is to use random number tables, which typically show many thousands of randomly selected digits. This means that when the tables are generated, there is exactly a one-in-ten chance (probability) that each of the digits 0,1,...,9 would be selected for any position on the table. The digits are usually shown in groups of 5 - this has no particular meaning - it simply makes the table easy to use. Random number tables are often included in the appendices of statistics textbooks. Alternatively, many statistical or spreadsheet software packages have a function which generates random numbers. We used one to generate Table 5.1.


Considerations when choosing the study sample

• To what target population will the results be generalized?

• How many participating individuals, workgroups or workplaces are potentially available for the intervention and evaluation?

• What sample size will give sufficient statistical power?

• What is the marginal cost per participant of data collection?

• Will a control group be used in the evaluation design?

• Will the sample be stratified?

Page 61: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

5 8 5 0 1 4 2 5 4 9 8 1 2 0 0 7 9 0 5 6 1 1 0 7 0 7 6 0 8 1 9 9 3 7 1 6 1 6 9 0 8 6 9 5 5 6 4 2 0 4 5 7 2 0 9 8 5 8 5 14 6 0 0 6 2 9 0 9 7 9 6 1 2 8 6 3 3 5 6 5 8 2 1 9 6 3 7 4 39 8 0 6 3 0 0 9 1 3 1 3 5 1 7 4 4 4 5 2 4 4 1 9 6 9 2 4 0 5 1 9 0 0 0 6 5 3 9 1 7 8 9 6 2 1 7 4 0 0 1 7 3 7 3 5 2 6 9 0

5 8 0 5 8 7 3 5 2 6 4 4 7 9 9 0 2 3 5 5 6 8 1 3 4 5 5 1 2 12 4 3 4 8 5 0 1 3 4 2 6 7 3 4 4 0 0 5 0 4 6 3 6 0 5 3 4 5 72 4 4 3 3 2 2 1 2 3 0 6 2 9 6 3 0 4 4 2 1 2 3 5 3 5 0 5 0 43 7 2 6 7 6 8 2 8 5 0 9 7 1 7 1 9 8 6 8 4 8 3 6 6 0 3 1 6 70 9 3 0 4 1 7 6 9 9 4 6 3 6 5 7 1 5 9 0 8 5 5 7 7 0 7 1 9 3

2 1 4 7 7 3 9 8 0 5 4 9 8 1 2 5 2 9 0 0 5 4 7 6 9 5 3 4 1 1 8 6 2 4 4 6 6 9 5 1 4 6 3 3 1 7 6 1 2 4 2 6 8 2 5 4 5 5 1 8 0 7 5 1 0 0 1 8 3 9 9 5 7 2 5 2 2 1 3 4 4 5 7 5 2 8 8 2 0 39 5 8 4 0 5 3 0 1 7 8 2 1 3 1 7 4 4 8 7 4 2 2 8 3 6 8 6 3 70 5 4 8 4 6 4 9 6 8 4 0 2 9 8 7 1 9 1 8 3 4 5 5 3 3 2 4 8 5

8 6 0 7 0 8 3 1 2 7 0 1 1 2 3 0 2 1 3 3 0 8 4 6 9 6 8 2 9 08 7 5 7 5 3 0 3 7 4 3 3 7 3 0 0 9 4 4 1 9 2 5 1 9 4 1 6 6 56 8 5 4 4 7 6 7 4 6 3 4 0 6 3 5 5 2 1 9 4 5 7 6 5 4 7 2 3 03 3 6 7 9 7 8 4 7 6 4 7 8 6 7 1 9 4 4 8 9 4 2 1 8 2 9 5 3 68 5 1 4 3 9 6 1 2 2 0 5 7 4 5 7 7 2 6 0 7 4 0 9 2 4 8 7 5 3

3 1 8 9 4 6 7 5 2 2 8 2 2 8 6 7 7 4 1 4 1 5 3 7 2 3 5 7 7 97 8 6 8 3 4 9 3 2 9 4 5 4 8 2 5 7 8 2 6 5 5 1 4 2 8 6 0 7 6 7 0 1 3 5 6 1 5 6 3 7 1 8 8 5 3 8 8 1 5 5 1 2 7 5 7 1 4 1 0 7 7 3 1 0 9 9 0 9 6 9 4 5 5 6 2 7 8 7 5 0 6 9 3 9 6 7 1 2 53 6 9 6 8 5 1 9 9 1 9 7 2 7 4 2 9 2 7 0 8 0 4 8 6 2 4 4 5 6

3 8 2 8 7 8 0 7 5 4 5 1 4 2 2 4 1 3 9 0 4 9 8 4 3 5 1 9 3 99 6 3 6 7 3 1 2 7 7 5 5 9 5 8 5 1 1 7 5 4 7 0 9 6 1 3 5 7 46 0 4 8 8 3 5 6 1 9 9 5 3 7 4 2 6 4 4 4 0 3 7 9 3 6 4 2 8 45 7 9 7 5 8 3 6 9 8 8 0 5 2 1 7 0 3 5 3 2 6 2 5 1 5 7 2 6 2 0 7 3 6 4 3 7 1 9 4 9 9 1 5 6 3 5 1 7 0 9 0 9 4 1 2 9 5 5 8


This table was generated using spreadsheet software with a random number generation function. It can be used for selecting a random sample from a sampling frame and for randomizing sample subjects to intervention and control groups. The groups of three digits indicated above are used in the illustration of random sampling in Section 5.3.1.

Table 5.1 Random number table

Page 62: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

5.3.2 How to choose a stratified random sample

Random sampling is not guaranteed to produce a sample representative of the sampling frame, although it does work "on average." That is, if we repeat the procedure many times, the average proportion of women in the samples is the same as their proportion in the sampling frame as a whole. When a fairly large sample is chosen, representativeness is more likely. However, with small samples, this may not be the case. To avoid the problem of lack of representativeness, select a stratified random sample. This allows you to deliberately select a sample with the same proportion of men and women as in the total group.

Do this by stratifying the group into the two sexes. Then choose a sample of men, and a sample of women, applying the same process as in simple random sampling within each stratum. The first sample has the number of men you want; the second, the number of women. Opinion polls typically use some form of stratified sampling, though one that is rather more complex than has been described.

(Caution: with stratified sampling, the statistical approach you use to analyze the data must be modified.)

Another reason for stratifying would be if there are important differences in the reaction of sub-groups to an intervention, and one of the sub-groups in the sampling frame is quite small. For instance, suppose you want to look at the effect of flexible work hours on both men and women in a manufacturing environment by means of a survey, yet women comprise only 5% of the working population. You would end up with about 10 women, if 200 workers were selected at random from the total work force, making your estimate of the effect of the intervention on females in your sample imprecise.

However, the precision could be greatly improved if you first stratify by sex and then choose 50 women and 150 men, using random sampling from each stratum.
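A minimal sketch of this stratified selection follows, written in Python only for illustration (the guide does not prescribe any particular software, and the stratum sizes below are hypothetical, chosen to match the 5%-women example). A simple random sample is drawn separately within each stratum.

    import random

    def stratified_random_sample(strata, sizes, seed=None):
        # strata: dict mapping stratum name -> list of member IDs
        # sizes:  dict mapping stratum name -> number to select from that stratum
        rng = random.Random(seed)
        return {name: rng.sample(members, sizes[name])
                for name, members in strata.items()}

    # Hypothetical frame: 1,900 men and 100 women (women are 5% of the workforce)
    men = ["M%04d" % i for i in range(1, 1901)]
    women = ["W%03d" % i for i in range(1, 101)]

    sample = stratified_random_sample({"men": men, "women": women},
                                      {"men": 150, "women": 50}, seed=1)
    print(len(sample["men"]), len(sample["women"]))   # 150 50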


How do you randomly select, say 50 people (the study sample) from a group of 839 workers (the sampling frame)? To do this, you can use a random number table, such as the small one provided here (Table 5.1). Typically, you start using the table at an arbitrary point, determined by, for example, rolling a die. You can roll to see which of the large groups of rows to select, and then roll three more times to similarly select the exact row, the group of columns and the exact column. If you rolled 3, 4, 5, 4, you would start at the third group of rows, 4th row, 5th group of columns and 4th column, ending up at the number 8.

Since the sampling frame contains 839 workers, number them from 1 (or rather 001) to 839. The number 839 is less than 1000, so you can go through the table, using three digits at a time. Reading from the table then, you would read off the following sequence of numbers to start: 836, 863, 705, 484, 649, 684, 029, 871, 918, 345, 533, 248, 586. Ignore digit triplets lying between 840 and 999, as well as 000. This means you would select for your sample the workers numbered 836, 705, 484, 649, 684, 029, 345, 533, 248, 586. You could continue until you have the 50 people required. If the random number table should yield a repeated triplet of digits, ignore this and use the next valid triplet.

How to select a random sample using a random number table
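The same selection can be carried out with a short script instead of a printed table. The following is a minimal sketch in Python (the language and function names are illustrative only; any statistical or spreadsheet package's random-number function, as mentioned above, serves the same purpose). It draws 50 of 839 numbered workers without replacement, which plays the same role as reading off valid three-digit triplets.

    import random

    def select_simple_random_sample(frame_size, sample_size, seed=None):
        # Workers are assumed to be numbered 1..frame_size, as in the 839-worker example.
        # random.sample draws without replacement, so repeats never need to be skipped.
        rng = random.Random(seed)
        return sorted(rng.sample(range(1, frame_size + 1), sample_size))

    # Example: choose 50 of 839 workers (the seed is fixed only to make the run repeatable)
    study_sample = select_simple_random_sample(839, 50, seed=1)
    print(study_sample)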

Page 63: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

5.4 Randomization - forming groups in experimental designs

You have seen how to randomly select a single sample from a bigger group. Suppose you do an experiment, with one intervention group and one control group. In this case, you randomize study subjects (i.e., workers, work groups, or workplaces) to either group and make sure they all have the same chance (probability) of being assigned to each one. Typically, these probabilities will be one half, or equal for each group. It is rather like tossing a "fair" coin - heads you go in the intervention group, tails you become a control.

5.4.1 Why randomize?

Why randomize subjects into intervention and control groups? The primary purpose is to avoid selecting only particular types of people. For example, we do not want only volunteers for the intervention group, leaving other people for the control group. Volunteers differ in many ways from non-volunteers. Similarly, we do not want all the men in one group and women in the other. We even want to avoid the tendency for men to be in one group rather than another.

You might argue that you can certainly tell if subjects are men or women, and thus check for any imbalance of the two sexes in the treatment and control groups. But what about factors you do not measure, do not even know about or cannot measure? The answer is that with randomization it does not matter! This is because on average these factors balance out if you randomize. When we say on average, we mean: if we repeat the randomization many times, and each time calculate the resulting proportion of men in the treatment and control groups, the average of all these proportions for the intervention group would be the same as that for the control group. Similarly, the average proportion of women in the intervention and control groups would be equal, following many randomizations. This is true even of variables we have not measured.


Suppose you want to randomize people into two groups, with an equal probability of selection into either. As with random selection, there are several ways we can proceed. Using the random number table (Table 5.1), you could start at the point where you left off in choosing your random sample and read off the sequence: 070 83127 01123 02133 08... Taking single digits, if the digit is even (including 0) you allocate the person to the intervention group. If the digit is odd you allocate to the control group. Alternatively, our rule could be that if the digit is between 0 and 4, the subject goes in the intervention group; if between 5 and 9, the subject becomes part of the control group.

We will illustrate this using the odd/even rule and allocate 20 people into two groups, 10 per group. First, number the 20 people 01 to 20. The first digit in the random number sequence is 0, so subject 01 is assigned to the control group; the second digit is 7, so subject 02 is in the intervention group. Continuing, you see that subjects 02, 05, 06, 08, 10, 11, 13, 14, 16, 17 are put into the intervention group. Since you now have ten people in this group, you can stop the randomization and put the three remaining subjects in the control group. Sometimes, you will decide in advance to randomly allocate people to groups without guaranteeing equal numbers in each group. If you did this here, you would keep selecting single digits so that subject 18 would also go into the intervention group and subjects 19 and 20 would go in the control group. This means that out of 20 people, eleven are in the intervention group and nine are in the control group. There is always the risk of an imbalance like this, particularly with small samples.

How to randomize
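For readers using software rather than a printed table, the following Python sketch (illustrative only; the function name and seed are assumptions) randomizes 20 numbered subjects into two equal groups. Shuffling the list and splitting it in half serves the same purpose as the odd/even digit rule above, and it guarantees 10 per group.

    import random

    def randomize_two_groups(subjects, seed=None):
        # Shuffle a copy of the subject list, then split it in half:
        # the first half becomes the intervention group, the rest the control group.
        rng = random.Random(seed)
        shuffled = list(subjects)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        return {"intervention": sorted(shuffled[:half]),
                "control": sorted(shuffled[half:])}

    # Example: 20 subjects numbered 1-20, 10 per group
    groups = randomize_two_groups(range(1, 21), seed=1)
    print(groups)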

Page 64: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

Sometimes, you might see a study where, even though proper randomization techniques have not been used, it seems that there is no biased pattern of selection into the treatment or control group. Why is this still not as good a design? The problem is that there may in fact still be some selection bias. For example, someone may have deliberately (or even sub-consciously) put into the intervention group people considered more interested in safety. This will mean the groups are not truly comparable.

5.4.2 Randomized block design and matching

You may want to ensure that a characteristic, such as sex, is balanced between the groups, in order to avoid a selection threat to internal validity. Thus, in this case you want equal numbers of men in both intervention and control groups; and, similarly, equal numbers of women in each group. How can you guarantee this?

The answer is to stratify subjects and randomize within the strata (or "block" in the jargon of experimental design). What you do is list all the men to be randomized and assign them in equal numbers to intervention and control groups. Then do the same for women, and you will have a similar distribution of the sexes in each of the groups.

Another possibility is to match. First, pair up (match) subjects according to characteristics like sex, age, duration of employment and so on. You can then (randomly) allocate one member of the pair to the intervention group, and the other to the control group. (This process is really like randomizing within blocks, with each block reduced to just two people.) In practice, it can be difficult to get exact matches. So instead of taking people with the same year of birth, you may have to match pairs to within two or three years in age.
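A minimal sketch of randomizing within blocks follows, using hypothetical lists of men and women (Python used for illustration only). Each block is shuffled and split evenly between the intervention and control groups, so the blocking characteristic stays balanced across groups.

    import random

    def block_randomize(blocks, seed=None):
        # blocks: dict mapping block name (e.g., "men", "women") -> list of subject IDs.
        # Each block is assumed to have an even number of subjects; half of each block
        # goes to each group, so the blocking characteristic stays balanced.
        rng = random.Random(seed)
        allocation = {"intervention": [], "control": []}
        for members in blocks.values():
            shuffled = list(members)
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            allocation["intervention"] += shuffled[:half]
            allocation["control"] += shuffled[half:]
        return allocation

    # Hypothetical example: 12 men and 8 women -> 6 men + 4 women in each group
    alloc = block_randomize({"men": ["M%d" % i for i in range(1, 13)],
                             "women": ["W%d" % i for i in range(1, 9)]}, seed=1)
    print(len(alloc["intervention"]), len(alloc["control"]))   # 10 10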

5.5 Forming groups in quasi-experimental designs

It might be difficult or even impossible to match or randomize subjects to one group or another, given the requirements of a particular organization of people at a workplace. Or a group (e.g., department, work-site, etc.) might have already been chosen to participate in the intervention, thereby preventing any randomizing of participants to intervention and control groups. In such cases, you can still choose another group, which will serve as the control group, as in a number of the quasi-experimental designs discussed in Chapter 4.

The overriding principle in choosing non-randomized groups for a quasi-experimental design is to make them truly comparable. They should be similar in all respects apart from the intervention. In comparing two departments, you want them to be similar in their work activities. You would not compare an accounts department with a maintenance department. Of course, within workplaces there may be no directly comparable groups - so aim to select ones that closely resemble each other. You might even try similar departments in other workplaces, preferably from the same type of industry.

The actual choice you make depends on your local situation. We cannot say specifically what group would be best, but several characteristics can be considered.


• worker characteristics (e.g., age, sex, experience)

• nature of job tasks

• work environment (i.e., exposure to hazards, safety controls)

• workplace organization (e.g., structures for decision-making and safety, work flow)

• contextual factors (e.g., health & safety culture; management support for safety)

• past safety record

Characteristics to consider when choosing a non-randomized control group

Page 65: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

5.6 Summary

We have described how to randomly select participants from a sampling frame. This is done so that the study sample is representative of the sampling frame and the intervention results will be applicable to the larger group. We also described how the process of randomization can be used to create intervention and control groups in experimental designs. For situations in which groups are formed non-randomly, some considerations were given.


• Choose your sampling frame so that it is typical of the target population to which you want to generalize your evaluation results.

• Select a study sample from your sampling frame using random sampling.

• Whenever possible, use an experimental design with randomization to assign participants to intervention and control groups.

• In quasi-experimental designs, select intervention and control groups so that they are similar.

Key points of Chapter 5

Page 66: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


Measuring outcomes

Chapter 6

6.1 Introduction

6.2 Reliability and validity of measurements

6.3 Different types of safety outcome measures
6.3.1 Administrative data collection - injury statistics
6.3.2 Administrative data collection - other statistics
6.3.3 Behavioral and work-site observations
6.3.4 Employee surveys
6.3.5 Analytical equipment measures
6.3.6 Workplace audits

6.4 Choosing how to measure the outcomes
6.4.1 Evaluation design and outcome measures
6.4.2 Measuring unintended outcomes
6.4.3 Characteristics of measurement method
6.4.4 Statistical power and measurement method
6.4.5 Practical considerations
6.4.6 Ethical aspects

6.5 Summary

Page 67: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

6.1 Introduction

Chapters 3 and 4 described the various study designs used in evaluation. In those chapters, we referred to taking measurements. This chapter will discuss those measurements. We will first introduce the concepts of reliability and validity - two key characteristics to consider when choosing a measurement technique. We will then review several common ways of measuring safety outcomes, examining reliability and validity. Finally, we will list a wider range of considerations in choosing your measurement method(s).

Here, we will be discussing only quantitative methods; i.e., those which yield numerical information. The next chapter deals with methods which yield qualitative information. A comprehensive evaluation should include both types of data.

6.2 Reliability and validity of measurements

A measured value consists of two parts: the true value plus a certain amount of measurement error (i.e., the error we make in measuring). It is this measurement error which makes a particular measured value either higher or lower than the true value. Thus, measurements are more accurate when measurement error is minimized. Imagine using a ruler to measure the legs of a table. Since we do not measure them perfectly, each estimate of leg length will consist of the true value plus or minus a small amount of error.

Measurement error, in fact, consists of two parts. One part is called systematic error, also known as bias. This type of error exists when we consistently make an error in the same direction. This would happen, for example, if we always looked at the ruler from an angle, causing us to consistently underestimate the table leg length. The other part of measurement error is random error. As the name implies, it fluctuates randomly, sometimes leading to overestimation, and sometimes to underestimation. These two types of measurement error affect the reliability and validity of a measurement method. While evaluating the effectiveness of your intervention, apply measurement methods which minimize both types of measurement error. In other words, these methods should be valid and reliable.

Measurements that are valid have a low degree of systematic error and measurements that are reliable have a low degree of random error. In other words, a valid method means we are measuring what we had hoped to measure. A reliable method gives us consistent answers (while measuring the same thing) on numerous occasions. If a measurement method is both valid and reliable, it is considered to be accurate.

Figure 6.2 illustrates these concepts of reliability and validity with the analogy of a shooting target. Consider the center of the target as the true value and the bullet holes as the actual measured values. Reliable measurement has a low degree of scatter to the values (as in the left-hand panels in Figure 6.2). Valid measurement is centered on the true value (as in the top panels).
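The two error components can also be illustrated with a small simulation. In the sketch below (Python, with hypothetical numbers extending the table-leg example), each measurement is the true value plus a fixed bias (systematic error) plus random noise (random error): a biased but reliable method gives tightly clustered values centered in the wrong place, while an unbiased but unreliable one scatters widely around the true value.

    import random
    import statistics

    def simulate_measurements(true_value, bias, noise_sd, n=10, seed=1):
        # measured value = true value + systematic error (bias) + random error (noise)
        rng = random.Random(seed)
        return [true_value + bias + rng.gauss(0, noise_sd) for _ in range(n)]

    true_length = 75.0   # hypothetical "true" table-leg length, in cm

    valid_and_reliable   = simulate_measurements(true_length, bias=0.0,  noise_sd=0.1)
    biased_but_reliable  = simulate_measurements(true_length, bias=-2.0, noise_sd=0.1)
    valid_but_unreliable = simulate_measurements(true_length, bias=0.0,  noise_sd=2.0)

    for label, values in [("valid and reliable", valid_and_reliable),
                          ("biased but reliable", biased_but_reliable),
                          ("valid but unreliable", valid_but_unreliable)]:
        print(label, round(statistics.mean(values), 2), round(statistics.stdev(values), 2))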


Figure 6.1: Types of error in measurement

Page 68: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

Why reliability and validity are important

Poor reliability is a problem for evaluation because it makes it harder to detect any effect of a safety intervention, even if it is truly effective. This is because it is hard to tell whether any changes in critical safety outcome measures are due to the intervention or simply to random fluctuation. Unfortunately, lost-time injury rate data, except in larger workplaces, often has low reliability. For this reason, alternative or additional outcome measures must often be used to measure the effect of an injury reduction intervention.

When methods are reliable, but have poor validity, your conclusions drawn from evaluation results might be wrong, especially if you are measuring a concept different from the one you thought you were measuring.

Specific types of reliability and validity

There are several specific types of reliability which may be of concern to evaluators: analytical equipment reliability or precision, inter-rater reliability, test-retest reliability and internal consistency of questionnaires. We will elaborate on these in our discussion about particular types of outcome measures.

Similarly, several types of validity can be at issue. Major types are criterion, content and construct.

• Criterion validity is the extent to which the measurement predicts or agrees with some criterion of the "true" value or "gold standard" of the measure. For example, you would establish the criterion validity of work-site observation measurements by showing a correlation of these measurements with injury rates in a given workplace.


Figure 6.2: Illustration of the effects of reliability and validity on measurement

Page 69: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

• Content validity is applicable only to instruments like questionnaires or checklists. It is concerned with whether or not they measure all aspects of the concept(s) of interest. This type of validity can be established by having experts review the instrument for its completeness of content.

• Construct validity is the hardest to establish, but the most important. It pertains to how well the measure accurately indicates the concept (or construct) of interest. This is not an issue when you are dealing with a concrete concept like injuries. You already know that lost-time injury rates are accepted indicators - assuming the data are free from biases. On the other hand, construct validity becomes more of an issue whenever you are measuring abstract concepts. For example, if the objective of the intervention is to change the safety climate of the workplace, you might want to measure the outcome (safety climate) with a safety climate questionnaire. This is a good idea, but you need to determine how the construct validity of the questionnaire was established. In other words, how was it established that the questionnaire truly measures "safety climate" and not something more limited like safety policy and procedures.

6.3 Different types of safety outcome measures

In the following, we discuss some of the more common outcome measurement methods, with a focus on their reliability and validity.

6.3.1 Administrative data collection - injury statistics

Several types of injury statistics have become standard reporting measures for companies. They are often determined by legislative requirements. The most common measure, injury frequency rate, is equal to the number of injuries per unit of exposure. There are different categories of injuries for which rates can be calculated: e.g., lost-time or disabling injuries; recordable injuries (i.e., those required by law to be recorded); medical treatment injuries; and first-aid only injuries. Although a less commonly accepted standardized measure, near-injury rates can also be calculated.

Various units of exposure are used to calculate frequency rates. Worker-hour units of exposure - typically 100,000 worker-hours or 1,000,000 worker-hours - yield relatively precise frequency rate estimates. However, the number of workers, a cruder measure of exposure, is also used. The choice between the two depends on the state of record-keeping, with the former requiring good records of worker-hours, including lay-offs, overtime and lost time. The number of injuries can also be used to compare two time periods of equal length, but the equivalence of the worker-hours of exposure during the time periods must be confirmed.

Severity rate is another widely used injury statistic. It is calculated by taking a ratio of lost-time hours over the corresponding units of exposure - the higher ratios corresponding to greater severity. Severity rate is a useful complement to frequency rate, since some interventions can have an impact on severity, but not on frequency. This could result, in some cases, from the interventions affecting severity more than frequency. It could also result from it being easier to (statistically) detect an effect on severity than frequency, for a given effect size.
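As a worked illustration of these two statistics, both are simple ratios. The Python sketch below is illustrative only; the injury counts, lost-time hours, worker-hours and the 100,000 worker-hour unit are hypothetical numbers, not drawn from any study cited in this guide.

    def injury_frequency_rate(injuries, worker_hours, per=100_000):
        # Number of injuries per `per` worker-hours of exposure
        return injuries * per / worker_hours

    def severity_rate(lost_time_hours, worker_hours, per=100_000):
        # Lost-time hours per `per` worker-hours of exposure
        return lost_time_hours * per / worker_hours

    # Hypothetical year: 12 recordable injuries, 480 lost-time hours, 1,000,000 worker-hours
    print(injury_frequency_rate(12, 1_000_000))   # 1.2 injuries per 100,000 worker-hours
    print(severity_rate(480, 1_000_000))          # 48.0 lost-time hours per 100,000 worker-hours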


1. Administrative data collection - injury statistics

2. Administrative data collection - other statistics

3. Behavioral and work-site observations

4. Employee surveys

5. Analytical equipment measurement

6. Workplace audits

Common safety outcome measurement methods

Page 70: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

Claims data collected by workers' compensation systems are useful for evaluating interventions delivered to multiple workplaces in a jurisdictional area.

Validity and reliability concerns with injury statistics

The major concern in using injury and claims data involves the reporting biases that may exist and stem from systematic errors which cause injury records to be consistently different from the true injury occurrence. Such biases can enter at any point during the transfer of information - from the time a worker experiences an incident to when that incident becomes part of national statistics. On the one hand, certain compensation or work environments may encourage over-reporting of injuries by workers. On the other hand, incentives to individuals or organizations to minimize injuries may encourage underreporting of injuries or a reclassification to a lower level of severity. In particular, incentives for early return-to-work might result in the recording of a medical-aid only incident, which in the past or in a different location would have been considered a lost-time injury.

The degree of underreporting can be a great source of bias. One study of hospital workers' data from self-report questionnaires showed that 39% of those who had experienced one or more injuries did not report them.23 Although the main reason for not reporting was that they considered the injuries too minor, in fact, 64% of them involved medical treatment and 44% lost work time. In another study in the U.S.,24 OSHA 200 forms (U.S. workplaces are required to record injuries and illnesses meeting certain severity criteria for the Occupational Safety & Health Administration) from several companies were compared with company clinic records. This showed that the OSHA logs had captured only 60% of the reportable injuries.

A filter model of the injury reporting process25 has been developed that can help identify the places at which biases can influence injury reporting and the reasons for these biases (Table 6.1). A filter in this model is anything which prevents some of the reportable injury data at one reporting level from passing to the next level of reporting. For example, level one is considered to represent the true injury rate, with the first filter being the worker's decision-making process about whether to report the injury or not. The second, third and fourth filters operate at the workplace; the fifth at the transmission of company-level data into aggregate data at the jurisdictional level. The filters operate differently for injuries of differing severity. The less severe the injury the more effective and less consistent are the filters. Accordingly, near-injuries are especially prone to biased and inconsistent reporting.

Often the presence of some reporting biases can be tolerated in an evaluation - if they continuously affect the data in the same way. A problem arises if they operate differently, either over time or between groups, for any of the measurements being compared. Just how differently they operate has to be estimated and taken into account when interpreting results.

One special evaluation challenge in using injury statistics is those cases where the intervention itself has an impact on the way the filters operate. Management audits, for example, tend to improve reporting, while safety incentive programs that reward workers or managers for low injury rates discourage reporting. In such situations it is important to include methods for verifying the injury data, as well as to incorporate supplementary outcome measurement methods not subject to the same type of biases.


23 Weddle MG [1996]. Reporting occupational injuries: the first step. J Safety Res 27(4):217-223.
24 McCurdy SA, Schenker MB, Samuels SJ [1991]. Reporting of occupational injury and illness in the semiconductor manufacturing industry. Am J Public Health 81(1):85-89.
25 Webb et al. [1989]

Page 71: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


Table 6.1: The filter model for work injury reporting: six levels and five filters26

26 Reprinted from Accident Analysis & Prevention 21, Webb et al., Filtering effects in reporting work injuries, 115-23, Copyright 1989, with permission from Elsevier Science.

Page 72: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

Misclassification errors can cause a problem in injury statistics. Misclassification can be attributed to errors or disagreement in the judgement of people involved in abstracting information from incident or clinical records, and depends in part on the individual coders, as well as on the classification scheme used. A common method of classifying injuries and illnesses, the International Classification of Diseases, 9th revision (ICD-9), has been shown to perform poorly for soft-tissue disorders, since several different ICD-9 codes can be assigned for the same condition.27 As well, the ability to code incident descriptions consistently has been shown to depend on which aspect of the incident is being coded. Table 6.2 shows the percent agreement in coding for two trained people coding the same 100 injury reports using a standardized coding scheme28.

Coding is considered unreliable when scores are less than 0.7. As you can see from Table 6.2, many of the important items being coded fell below this cut-off point. To improve the inter-rater reliability, the coders may need more training or the coding scheme may need revising. In addition, you may want to maintain the same coders through the data collection phase of the evaluation, because a change in personnel could have a large impact on the results. You should also check for any changes in classification practices over time or differences in the coding practices between intervention and control groups.
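The agreement figures in Table 6.2 are proportions of records coded identically by the two coders. A minimal sketch of that calculation, in Python with hypothetical codes for a single item (the item name and code values below are invented for illustration):

    def percent_agreement(coder_a, coder_b):
        # Fraction of records on which two coders assigned the same code for an item
        if len(coder_a) != len(coder_b):
            raise ValueError("both coders must have coded the same records")
        matches = sum(a == b for a, b in zip(coder_a, coder_b))
        return matches / len(coder_a)

    # Hypothetical 'agent of injury' codes assigned to the same ten incident reports
    coder_1 = ["ladder", "knife", "floor", "machine", "knife",
               "floor", "ladder", "machine", "floor", "knife"]
    coder_2 = ["ladder", "knife", "machine", "machine", "knife",
               "floor", "ladder", "machine", "floor", "hand tool"]
    print(percent_agreement(coder_1, coder_2))   # 0.8 - above the 0.7 rule of thumb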

Checking the validity of injury statistics before and during data collection

If you use injury statistics as an outcome measure to evaluate the intervention, consider the potential biases at the outset. Investigations of data validity can be made beforehand, and, if necessary, corrective steps taken to improve data collection. However, if that is done, wait until their impact on the resulting statistics has stabilized.

Also, check the validity of the injury statistics by comparing them with data obtained at a lower level of reporting - e.g., a comparison of the frequency of incidents in summary statistics with the medical or safety records on which the statistics are based. Sometimes, several sources are used to identify the complete universe of incidents, after which you can determine how well any single source captures them. You might use supervisor, clinic and claims reports to identify the universe of incidents in a workplace and then see what percentage is captured by each type of report.

Checking the validity of injury statistics after data collection

Even after ensuring that the collected statisticsare consistent with respect to any biases, realitymight differ from expectation. Thus, it is a goodidea to check the data after it has been collectedas well. A good indicator of underreporting is

Chapter 6 Quantitative Measurement

57

Table 6.2: Reliability of coding injury descriptions

Item being coded                    Reliability
Sex                                 0.98
Year of birth                       0.89
Industry classification             0.64
Injury location                     0.64
Type of injury                      0.92
Part of body injured                0.92
Injury legally notifiable or not    0.84
Agent of injury                     0.79
Event precipitating injury          0.44
Contributory factors                0.61

27 Buchbinder R, Goel V, Bombardier C, Hogg-Johnson S [1996]. Classification systems of soft tissue disorders of the neck and upper limb: do they satisfy methodological guidelines? J Clin Epidemiol 49:141-149.
28 Adapted from Glendon AI and Hale AR [1984]. A study of 1700 accidents on the youth opportunities programme. Sheffield: Manpower Services Commission.

Page 73: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

the ratio of minor to major injuries. Since minor injuries are more likely to be “filtered out” than major ones, a constant ratio indicates the stability of any underreporting biases (or lack of them). This method does depend on major injuries occurring frequently enough in the period measured that they perform as reliable measures.
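As a small illustration of this check, the following Python sketch computes the minor-to-major ratio for several reporting periods. The counts and period labels are hypothetical, invented only to show the calculation.

    # Hypothetical counts of reported injuries by period (not real data).
    minor = {"2019": 48, "2020": 52, "2021": 21}
    major = {"2019": 6,  "2020": 7,  "2021": 6}

    for period in minor:
        ratio = minor[period] / major[period]
        print(f"{period}: minor/major ratio = {ratio:.1f}")

    # A sharp drop in the ratio (as in 2021 above) suggests that minor injuries
    # are being filtered out of the reports, rather than that safety suddenly improved.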

Small numbers issue and injury statistics

Lost-time injuries and other severe injuries mightnot be useful in effectiveness evaluations insmaller workplaces. The small number ofinjuries found in a smaller workplace leads tostatistical tests with low power. This means it willbe more difficult to detect an intervention effect,even if the intervention is truly effective.

For this reason, people sometimes use thefrequency of less severe injuries, e.g., first-aid-only injuries, to evaluate safety interventioneffectiveness. However, less severe injuries aremore susceptible to underreporting biases thanmore severe injuries. Thus, the opposingconsiderations of statistical power versus validitymust be weighed when choosing measurementmethods. Examine both severe and minorinjuries if the data are available.

Another alternative or supplementary approachto the small numbers issue, is to measureintermediate outcomes, e.g., work practiceobservations, work-site observations or safetyclimate. Power will likely be greater in thestatistical tests on these data than on injury data.Another advantage of this approach lies with thelocation of the intermediate outcomes, which aresituated closer to the intervention than injuriesin the causal chain. The intervention effect istherefore less “diluted” or attenuated by otherfactors. The challenge in using upstream proxymeasures for injury outcomes is that a goodcorrelation has to be established between thisproxy measure and a more established measureof safety such as injuries, in order to demonstratethe validity of the proxy measure. We willdiscuss some other measures below.

6.3.2 Administrative data collection - other statistics

As discussed, there are some problems with relying upon injury statistics for evaluating workplace interventions. For this reason you might choose other types of data found in administrative records. In doing so you can adopt the strategy, referred to earlier, of using an intermediate outcome as a proxy for the final outcome in the evaluation. Even when the injury statistics serve as a final outcome measure, additional administrative data can provide insight on the way the intervention brought about change. Finally, these data are often useful for demonstrating that the intervention is being implemented as planned.

As described in the above section, considerwhether any biases are entering the process ofcollecting these data; and conduct validity checksif needed.

Quantitative Measurement Chapter 6

58

29 Example from Menckel and Carter [1985]

Example of the use of other administrative data29

Consider an example in which an incidentinvestigation intervention is being evaluated. Anew consultative group has been formed to assistsupervisors investigating incidents, with an aim toimproving the quality of actions resulting fromthe investigation. The following quantitativemeasures derived from administrative records areincluded in the intervention evaluation: 1) timebetween incident occurrence and formal reportingof it; 2) the number of near-incidents reported; 3)percentage of incidents for which correctivemeasures were suggested.

Change is seen in all these intermediate outcomesin the direction that suggests that the interventionhas been effective. It is also found that the numberand severity of incidents (final outcomes) showeda decrease, although only the latter is statisticallysignificant. Thus, even though the injuryfrequency did not show statistical significance,changes in the intermediate outcome measures (aswell as some qualitative evidence not discussedhere) together suggested that the intervention hadbeen effective.

Page 74: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

6.3.3 Behavioral and work-site observations

Observations of worker safety-related behaviorare increasingly being used to measure the effectof safety interventions, especially behavior-basedinterventions. To use this method, you firstdevelop a checklist of behaviors that theintervention is trying to influence. Targetedbehaviors are identified from incidentinvestigation reports, as well as from the opinionsof supervisors and workers. Be prepared to firstdevelop a trial checklist and make adjustmentsaccording to feedback. This will ensure that thelist is comprehensive (improving its validity) andthe items are unambiguous (improving itsreliability).

The checklist is then used by an observer (trained supervisor, evaluator or worker) who visits the work area at a randomly selected time of day and makes observations for approximately half an hour. For each behavioral item on the list, the observer marks either “performed safely”, “performed unsafely” or “not observed”. Following the observation period, the proportion of safe behaviors is calculated. It consists of the ratio of the number of items performed safely over the number of items observed.30
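A minimal Python sketch of scoring one observation round follows. The checklist items are hypothetical; only the three marking categories and the safe/observed ratio come from the description above.

    # Hypothetical results for one observation round with a behavioral checklist.
    observations = {
        "eye protection worn": "safe",
        "machine guard in place": "safe",
        "two-hand lift used": "unsafe",
        "walkway housekeeping": "not observed",
    }

    # Only items actually observed enter the denominator.
    observed = [v for v in observations.values() if v != "not observed"]
    percent_safe = 100 * observed.count("safe") / len(observed)
    print(f"Percent safe = {percent_safe:.0f}%")   # 2 of 3 observed items -> 67%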

Another checklist approach uses observations onthe work-site conditions (e.g., housekeeping,emergency equipment). Details of such a methodfor the manufacturing industry can be found at asite of the Finnish Institute of OccupationalHealth:

www.occuphealth.fi/e/dept/t/wp2000/Elmeri1E.html.

Advantages and disadvantages of behavioral and work-site observational measures

Observational measures offer several advantages. First, they are “leading indicators” instead of “trailing indicators”, meaning you do not have to wait until incidents happen to get a measure of safety. Rather, measurement is “upstream” of incidents in the causal pathway. Second, you can take observations frequently, as often as several times a week. This yields data sensitive to changes caused by interventions and can be analyzed for time trends. Third, there is some evidence that behavior serves as a valid proxy for injuries as a final outcome measure. Reviews of injury records show that the majority of injuries are associated with unsafe acts. [This is not to say that responsibility for such injuries lies exclusively with the worker carrying out these unsafe acts. Conditions leading to these unsafe acts are typically the responsibility of management.]

Further, evaluations of behavioral interventions- at least the ones that have been published - tendto find that an improvement in behaviors iscorrelated with a decrease in injury rate.Validation of work-site checklists using injuryrates as a criterion has also been achieved[Laitinen et al. 1999a,b].

A drawback with behavioral observations is thatthey are sometimes regarded unfavorably bythose being observed, and in some environmentscould be considered unethical or otherwiseunacceptable. This is less of a problem withwork-site observations, because there is lessemphasis on the behavior of individuals. Aswell, observations of the work-site can be carriedout in a less intrusive manner, thereby interferingless in the measurement of the intervention’seffect.

Chapter 6 Quantitative Measurement

59

30 See Krause [1995] for more details on this methodology.

Page 75: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

Additional validity and reliability issues when using behavioral and work-site checklists

There are additional methodological issues to consider when using observational checklist techniques. One is the inter-rater reliability of the list. This is determined by having more than one person observe the same area simultaneously and independently record their observations. The amount of agreement between the two raters can then be assessed. Typical ways of reporting inter-rater reliability are the percent agreement on items (the percentage of list items categorized the same) or, better, a special statistic known as the Kappa coefficient.31 This statistic adjusts the observed agreement for the agreement that would be expected simply by chance. Kappa values between 0.40-0.59 indicate moderate agreement, 0.60-0.79 substantial and 0.80-1.00 very good. A high reliability ensures that there will be little variation in the data as a result of having different observers carry out the observations. If this reliability is low during the piloting of the checklist, you should be able to improve it by removing any confusing language or better specifying criteria.
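For readers who compute these statistics themselves, here is a minimal Python sketch of percent agreement and the Kappa coefficient for two raters coding the same items. It assumes the scikit-learn library is available; the codes and items are invented for illustration.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical codes assigned to the same six incident reports by two raters.
    rater_a = ["fall", "struck-by", "fall", "caught-in", "fall", "struck-by"]
    rater_b = ["fall", "struck-by", "strain", "caught-in", "fall", "fall"]

    percent_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
    kappa = cohen_kappa_score(rater_a, rater_b)   # chance-corrected agreement
    print(f"Percent agreement = {percent_agreement:.2f}, kappa = {kappa:.2f}")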

In order for the checklist to be able to measurechange, avoid what are called ceiling effects. Thisrefers to the situation where you are unable todetect an improvement of the index score,because the initial measured values were alreadyhigh and have limited room for improvement.Thus, you ideally want your pre-interventionindex scores to be around 50%. If they are higherthan 80% during the testing phase, see if you canmodify or substitute items so that they will bemore difficult to achieve and still be true to theintervention’s aims.

Before accepting observations as the definitive final outcome measure in evaluating an intervention, you would want to determine if a statistically significant correlation exists between the behavioral index scores and injury rates. If this is not possible beforehand, then collect some kind of injury data during the evaluation, along with the observations, in order to calculate such a correlation. To get data yielding sufficient statistical power during analysis, you might require a measure of minor injuries, provided that any underreporting biases remained constant throughout the study.
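A minimal sketch of such a correlation check, assuming monthly behavioral index scores and minor injury counts are available (the numbers below are illustrative only):

    from scipy.stats import pearsonr

    behavior_index = [0.55, 0.61, 0.66, 0.72, 0.78, 0.83]   # proportion safe, by month
    minor_injuries = [14, 12, 11, 9, 8, 6]                   # reported minor injuries, by month

    r, p_value = pearsonr(behavior_index, minor_injuries)
    print(f"r = {r:.2f}, p = {p_value:.3f}")
    # A strong negative correlation would support using the behavioral index
    # as a proxy for injuries; a weak or inconsistent one would not.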

6.3.4 Employee surveys

Employee surveys often measure what cannot otherwise be observed. They examine the knowledge, attitudes, beliefs, or perceptions held by individuals. Occasionally, they assess group phenomena such as climate or culture. They can also be used to measure (self-reported) safety practices and injuries, both of which can also be quantified by the methods discussed above.

Surveys of knowledge are an appropriate finaloutcome measure if the intervention is onlydesigned to change knowledge. A similarstatement could be made about surveys ofattitudes, beliefs, perceptions, practices, cultureor climate. However, if the intervention isdesigned to ultimately affect worker safety andinjury rates, then one must be cautious aboutusing surveys which measure only knowledge,attitudes, beliefs or perceptions as a proxy forinjury rates as a final outcome measure. Suchquestionnaires have not usually been sufficientlyvalidated through correlation with injury ratesto justify their use in this way.

Tips for better questionnaire data

When choosing from a number of possible questionnaires, the questions in Exhibit 6.1 will assist you in the selection. They are based on an assessment of the validity and reliability of the proposed questionnaire. The more questions to which you can answer “yes”, the more likely the questionnaire is suitable for

Quantitative Measurement Chapter 6

60

31 Percent agreement and Kappa statistics are used with categorical data; correlation coefficients are used to measure the inter-rater reliability of continuous data.

Page 76: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

the effectiveness evaluation.

If you want to develop a questionnaire, consultsome specialized books on the subject, such asthe one by Aday [1996]. Better yet, consult witha specialist in the area (e.g., an organizationalpsychologist) to assist you. Developing a goodquestionnaire requires a significant investmentof resources; so whenever possible we suggestyou use existing questionnaires, scales or items.

Administering an employee survey

Devise a method of distributing and collecting questionnaires so that you will know your response rate; i.e., how many have been returned out of the number given to potential respondents. It is important to achieve high response rates in surveys so that the results can be considered representative of the entire group to which the questionnaires were given. Another check on representativeness - especially important in the case of a poor response rate - involves looking for any differences (e.g., age, department, etc.) between those who responded to the questionnaire and those who did not. The greater the difference between the two groups, the more cautious you must be in drawing conclusions from the survey. Sending potential respondents one or two follow-up reminders about completing the questionnaire is a good idea. Participation is also more likely if you can assure people of the confidentiality of their responses, publicize the survey through the workplace and obtain support from influential management and worker representatives.
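As a small illustration, the following Python sketch computes a response rate and compares respondents with non-respondents on one characteristic (department). All counts are hypothetical.

    # Hypothetical distribution and return counts.
    distributed = 120
    returned = 78
    print(f"Response rate = {100 * returned / distributed:.0f}%")

    # Compare response rates across departments as a rough representativeness check.
    dept_all = {"production": 80, "maintenance": 25, "office": 15}
    dept_returned = {"production": 45, "maintenance": 20, "office": 13}
    for dept, n in dept_all.items():
        print(f"{dept}: {100 * dept_returned[dept] / n:.0f}% responded")

    # A markedly lower response in one group (production above) calls for caution
    # when generalizing the survey results to the whole workforce.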

Chapter 6 Quantitative Measurement

61

1) Do the questions seem applicable to your environment?

2) Was the questionnaire developed on a similar population in a similar environment or, alternatively,has its validity in a diverse range of populations/environments been demonstrated?

3) Was a process followed for ensuring that the content of the questionnaire was complete? (i.e., were all important aspects of what is being measured included in the questionnaire’s content?)

4) Do the questions measure what your intervention is designed to change?

5) If you are measuring an intermediate outcome with a questionnaire as a proxy for a final outcome,has a statistically significant correlation between the proxy and the final outcome measure beenestablished?

6) Has the sensitivity of the questionnaire been shown by its detection of changes over time, ideallyin response to an intervention similar to yours?

7) Has good “test-retest reliability” of the questionnaire been demonstrated?32

8) Is there a high “internal consistency” among groups of questions forming scales?33

Exhibit 6.1 Questions for assessing questionnaire suitability

32 Test-retest reliability, usually measured by a reliability coefficient, which ranges between 0 and 1, is the consistency of questionnaire scores on the same group of people given the same questionnaire on more than one occasion. The period between occasions should be long enough that people forget the answers they gave on the first test and short enough that what is being measured has not changed. This depends on what is being measured, but is typically 2 weeks to 3 months. A reliability coefficient of 0.8 is considered good; most consider a value of 0.7 to be minimally acceptable.
33 Internal consistency, usually measured by “Cronbach's alpha”, is a measure of the reliability of scales. Scores on scales are made up by combining the scores of answers to a group of related questions. Higher values of alpha show that the questions in the scale are in fact related and can be considered a scale for measuring a construct. Alpha values between 0.70 and 0.90 are optimal.
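For those computing the internal consistency described in footnote 33 themselves, here is a minimal Python sketch of Cronbach's alpha for one scale; the item scores are invented for illustration.

    import numpy as np

    # Hypothetical answers: rows are respondents, columns are the items forming one scale.
    scores = np.array([
        [4, 5, 4, 4],
        [2, 2, 3, 2],
        [5, 4, 5, 5],
        [3, 3, 2, 3],
        [4, 4, 4, 5],
    ])

    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()     # sum of item variances
    total_variance = scores.sum(axis=1).var(ddof=1)       # variance of respondents' total scores
    alpha = (k / (k - 1)) * (1 - item_variances / total_variance)
    print(f"Cronbach's alpha = {alpha:.2f}")   # footnote 33 calls 0.70-0.90 optimal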

Page 77: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

6.3.5 Analytical equipment measures

Data collected with analytical equipment will not serve as final outcome measures for evaluating a safety intervention, but they might serve well as intermediate outcome measures. For example, to evaluate an intervention involving a workstation or task redesign to decrease musculoskeletal injury, you could indirectly measure changes in biomechanical exposures through electromyography of muscle activity or videotaped task performance.

Validity and reliability of analytical equipment measures

The major issues related to the use of analyticalequipment are largely those related toexperimental control of the study conditions. Youneed to ask: is the instrument being used underthe conditions for which it was intended (e.g.,temperature)? Are proper calibration andmaintenance procedures being followed? Is thereanything present in the environment that couldinterfere with the equipment so that it gives falsereadings? Is the equipment operator properlytrained and using standard procedures? Any ofthese sources of error could potentially affectresults in either a systematic or random way.

The reliability and validity of measurementstaken with analytical equipment can be improvedby minimizing variation in the operation of theequipment - both in the environment and in thoseoperating the equipment. Reliability can also beimproved by taking multiple measurements andusing the average as the data point. However,the additional cost of taking multiplemeasurements has to be balanced with the gainsin reliability realized, especially if the equipmentis not a major source of unreliability in the study.

6.3.6 Workplace audits

Workplace safety audits are another way to assesssafety interventions. They focus on safetyelements upstream of injuries, such as safety

policy, management practices, safety programs,and, sometimes, workplace conditions. Auditshave been developed by both commercial andnon-profit organizations; and large companieshave even developed in-house company-specificaudits. Sector-specific audits also exist. Often,they are designed to give qualitative information,but some yield a quantitative measure or score.These summary scores can then be used as anoutcome measure in evaluating certaininterventions at the organizational level.

Validity and reliability considerations when using workplace audits

Before using an audit, consider the same questions already raised regarding employee questionnaires (section 6.3.4). In particular, make sure that the content of the audit is appropriate, given the intervention's goals. If the findings from the audit are being used as a proxy for a final outcome measure, such as injuries, you will need data that validate its use in that manner. Data could be from similar workplaces and show a statistically significant correlation between audit scores and injuries.

6.4 Choosing how to measure the outcomes

Your choice of outcome measures will depend on many things, including the nature of the intervention and its objectives, the setting, and your resources. While injury rates might be a suitable choice in one case, they might not be in another.

6.4.1 Evaluation design and outcome measures

Final outcome measures

Consideration of the safety intervention’sobjectives should help in deciding what is theideal type of outcome to assess the intervention’seffect. If the intervention is ultimately meant toreduce injuries in the workplace, then the idealoutcome measurement is an (unbiased) measure

Quantitative Measurement Chapter 6

62

Page 78: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

of injuries. Unfortunately, the frequency of lost-time events in a given month or year in manyworkplaces is too low to show clearly anintervention’s effect. Sometimes, inconsistentbiases in the injury statistics are of concern. Ifyou cannot collect useful injury data, then youneed a good substitute - with evidence that it isindeed a good substitute. For example, if youwant to evaluate your intervention usingobservations as a substitute for injury rates, youneed to show that a strong correlation existsbetween the observational measure and(unbiased) injury statistics.

Choosing intervening outcome measures

The choice of intervening outcome measures willdepend on an understanding of how theintervention works, as shown in the conceptualmodel or program logic model relevant to yourintervention. We already discussed how you canstrengthen your evaluation by includingmeasurements of the steps between theintervention and the final outcome (Section4.2.5.1). This provides insight into how theintervention worked (or did not work), which isuseful for planning future interventions. It canalso bolster any evidence of an effect on the finaloutcome. For example, you might find that adecrease in injuries followed a trainingintervention. There is a temptation to think thatthe intervention had been successful. However,

Chapter 6 Quantitative Measurement

63

Evaluation design and outcome measures

1. Which measures should be included to address the objectives of the safety intervention (finaloutcome)?

2. Which, if any, measures should be included to provide an understanding of how the interventionworks or bolster the strength of the design (intermediate and implementation outcomes)?

Measuring unintended outcomes

3. Which measures should be included to detect possible unintended outcomes of the intervention?

Characteristics of measurement

4. Do the methods really measure the outcomes they are intended to measure, from a conceptualpoint of view (construct validity)?

5. Is the outcome measurement method free of systematic biases (validity)?

6. Is the measurement method reliable?

7. Have the measurement methods been used on a group similar to the one under study before?

Statistical power and measurement method

8. Will there be sufficient statistical power during analysis with the method chosen and the numberof evaluation participants?

Practical considerations

9. Is the measurement method feasible (i.e., cost, administrative requirements)?

Ethical aspects

10. Can the measurements be carried out in an ethical manner (i.e., fully informed consent)?

Considerations when choosing the outcome measures

Page 79: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

if you also measured the practices targeted by the intervention and found that they did not change, the question arises as to whether the change in injury rate was due to something else besides the intervention. On the other hand, if both injuries and unsafe practices showed a decrease following the intervention, you could be more confident that the intervention indeed caused the decrease in injuries.

A thorough effectiveness evaluation determinesthe extent to which the intervention wasimplemented as planned. This information willbe especially valuable if the intervention appearsto have no effect. You want to distinguishbetween the two possibilities: 1) the interventionis inherently ineffective, even when implementedas planned; or 2) the intervention is potentiallyeffective, but was poorly implemented. Suchinformation is valuable to those who might wantto repeat the intervention or who designinterventions. Program implementation can beassessed using both qualitative and quantitativemeasures.

6.4.2 Measuring unintended outcomes

Unintended consequences of the intervention, bytheir very nature, are difficult to anticipate, andhence, difficult to measure. It is possible that anintervention successfully decreases one type ofinjury but increases another type. This increasecould occur in the same group of personnel or itcould involve another group within the sameworkplace. Safer conditions, for example, forequipment operators might mean more dangerfor maintenance workers. The basic principle inmeasuring unintended outcomes is to includemeasurements apart from the ones most directlyrelated to the intervention.

Other unintended outcomes may arise throughcompensatory behavior in response to theintervention. A particular engineeringintervention to reduce exposure to a hazard, for

instance, could result in a decrease in the use of personal protective equipment because peoplefeel safer. By including a measure of personalprotective equipment use in your evaluation, onecould see if this was happening. If so, it couldexplain why an intervention which lookedpromising failed.

6.4.3 Characteristics of measurement method

We already discussed the very importantconsiderations of reliability and validity at thebeginning of this chapter. You also need toconsider these characteristics of the measurementmethod within the context of the group andsetting where it is applied. For example, aquestionnaire developed to be reliable and validwith a white-collar, native-speaking workingpopulation could perform poorly with a blue-collar, immigrant population. Thus,measurement methods developed with adifferent work population might needmodification before they can work well inanother situation.

Consider also the conditions under which the

Quantitative Measurement Chapter 6

64

An example of unintended outcomes measurement

Interventions to decrease needlestick injuriesin hospitals have typically involvedrecommendations to avoid recapping. Theprimary indication of success of thisintervention is a decrease in the frequency ofrecapping injuries. However, it has also beenimportant to track other types of needlestickinjuries. In some cases an increase in injuriesduring disposal has been detected, which hasled to replacing disposal receptacles withones of a safer design. You also need toconfirm that the decrease in needlestickinjuries in health personnel has not beenachieved at the expense of an increase ininjuries to those who handle disposalreceptacles and garbage.

Page 80: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

measurement method is used. If they are not the same as those for which the method was designed, then determine whether the method performs well under the new conditions. A delicate instrument performing well under laboratory conditions might not do so well in a workplace where there is vibration, dust, etc. The results from a self-administered questionnaire could be quite misleading if, because of workplace politics, you are not allowed to ensure that everyone has actually received the questionnaire. Such issues of appropriateness and the implications for reliability and validity of data might lead you to choose a different measurement method than what might be used under other conditions.

6.4.4 Statistical power and measurement method

In choosing an outcome measure, consider whether there will be sufficient statistical power during analysis (see section 8.4). Thus, a power calculation should be carried out before the intervention is implemented. Calculations might show that the outcome measure initially chosen will not yield sufficient power, given the planned number of evaluation participants. You might then choose a different measurement method or measure a different outcome in order to increase power. For instance, you might decide to use a self-reported symptom survey or reported minor injuries instead of reported lost-time injuries.
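A minimal sketch of such a power calculation follows, assuming the statsmodels library is available. The figures are illustrative only: a drop in the proportion of workers reporting a minor injury from 30% to 20%, compared across an intervention and a control group of 150 workers each.

    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    effect = proportion_effectsize(0.30, 0.20)   # standardized effect (Cohen's h) for the two proportions
    power = NormalIndPower().power(effect_size=effect, nobs1=150, alpha=0.05, ratio=1.0)
    print(f"Power with 150 workers per group: {power:.2f}")

    # If the result falls well below 0.80, consider a more frequent outcome
    # (e.g., minor injuries or survey measures) or more participants.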

6.4.5 Practical considerations

It is also important to be practical when choosingmeasurement methods. What is the cost oftaking these measurements in terms of time andmaterial resources? How much disruption ofwork processes is involved in taking thesemeasurements? Is the necessary expertiseavailable to carry out these measurementsproperly? Are data already being collected forother reasons available for the purposes of theevaluation?

6.4.6 Ethical aspects

There might be ethical issues about the use ofsome measurement methods. For instance,behavioral observations may be inappropriate insome environments - at least without the consentof those being observed. It is customary, actuallyrequired, that researchers in an academicenvironment obtain written consent fromparticipants before using personal data (e.g.,health records, employee opinions).

6.5 Summary

This chapter highlighted two importantmeasurement concepts: reliability and validity.Several methods of measuring safety outcomes,including administrative data collection,behavioral and work-site observations, employeesurveys, analytical equipment measures andworkplace audits were reviewed with a focus onreliability and validity issues. Additional issueswhich influence the choice of evaluationmeasurement methods besides measurementproperties were also discussed: outcomesindicated by the evaluation design; detectingunintended outcomes; statistical power;practicality; and ethics.

Chapter 6 Quantitative Measurement

65

Page 81: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

Quantitative Measurement Chapter 6

66

Key points of Chapter 6

• Choose implementation, intermediate and final outcome measures based on the intervention’sconceptual model or program logic model, as well as the objectives of the intervention.

• Consider injury statistics, other administrative records, behavioral/work-site observations,employee surveys, analytical equipment and workplace audits as means of measuring outcomes.

• Use measurement methods which are valid, reliable and practical.
• Try to eliminate bias in injury statistics or keep such bias constant. If there is change, report on its estimated effect on the results.
• Choose measures which will yield sufficient statistical power during analysis.
• Consider the ethical aspects of measurement methods.
• Anticipate and try to measure unintended outcomes of the intervention.

Page 82: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

67

Qualitative methods for effectiveness evaluation:
When numbers are not enough

Chapter 7

7.1 Introduction

7.2 Methods of collecting qualitative information
7.2.1 Interviews and focus groups
7.2.2 Questionnaires with open-ended questions
7.2.3 Observations
7.2.4 Document analysis

7.3 Using qualitative methods in evaluation
7.3.1 Identifying implementation and intermediate outcomes
7.3.2 Verifying and complementing quantitative outcome measures
7.3.3 Eliminating threats to internal validity
7.3.4 Identifying unintended outcomes
7.3.5 Developing quantitative measures

7.4 Selecting a sample for qualitative studies

7.5 Qualitative data management and analysis

7.6 Ensuring good quality data

7.7 Summary

Page 83: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

7.1 Introduction

Qualitative methods play an important role insafety intervention evaluation. Although in mostsituations, numbers are necessary to proveeffectiveness, qualitative methods can yieldinformation with a breadth and depth notpossible with quantitative approaches.

We first describe four methods of gatheringqualitative information: 1) interviews and focusgroups; 2) questionnaires with open-endedquestions; 3) observations; and 4) documentanalysis. We identify and illustrate severaldifferent ways in which these types of data can beused in an effectiveness evaluation. We followwith some details of how to select study subjects,analyze the collected data, and ensure goodquality data.

7.2 Methods of collecting qualitative data

7.2.1 Interviews and focus groups

A major means of gathering qualitativeinformation is through in-depth interviewing.This involves open-ended questions, whereinterviewees can answer questions on their ownterms and in as much detail as they like. This isin contrast to the typical questions found onemployee surveys, that prompt for yes/no,multiple choice or very short answers. Forexample, a truly open-ended question asks “whatdo you think about the new safety program?”.In contrast, only a limited range of answers isallowed if you ask, “how useful was the newsafety program?” or “was the new programuseful?”

The types of questions used in interviews will depend on the purpose of the data-gathering. They could be about any of the following:
• knowledge (e.g., What did you learn about in the training?)
• experience (In what ways, if any, have things changed in the way safety is done around here, since the program began?)
• practices (In what way, if any, has the training program influenced your safety practices on the job?)
• opinions (What do you think of the program?)
• beliefs (What do you think the company's goals are in providing you this program?)
• feelings (How do you feel about participating in the program?).

A good interviewer is sensitive to the mood andfeelings of the interviewee(s), listens well, andencourages them to elaborate on the topicdiscussed. Better interviews will result fromsomeone who has been trained to conductinterviews and has practiced with the interviewquestions. There are a number of approaches forcollecting interview data.

Structured interviews

Structured interviews contain a standardized means of getting information. The same set of carefully worded and ordered questions is used with each respondent. This technique reduces the level of skill required for the interviewer to do a good job and curtails the influence of any particular interviewer on the results. Structured interviews are useful where several people are conducting the interviews or if the interviewers are inexperienced. On the other hand, there is less opportunity to learn about individual subject differences and circumstances while using the structured approach.

Qualitative Methods Chapter 7

68

Methods of collecting qualitative data

1) Interviews and focus groups
2) Questionnaires with open-ended questions
3) Observations
4) Document analysis

Page 84: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

Semi-structured interview

A semi-structured approach to interviewingrepresents a compromise betweenstandardization and flexibility. Here, aninterview guide is used, which is basically achecklist of the issues explored during theinterviews. There is no set order to the topics,and specific questions are not necessarily workedout ahead of time. However, before ending theinterview, the interviewer makes sure all theitems have been covered, through the naturalcourse of the conversation. Any topics not yetcovered can then be addressed. As with thestructured interview, this method ensures thatthe same type of interview data are gatheredfrom a number of people.

Unstructured interview

The unstructured interview is more like aninformal conversation, where questions aregenerated during the natural flow ofconversation. Although certain topics arecovered, there are no predetermined questions.The data varies with each person interviewed.This makes the analysis more challenging. Aswell, more opportunity exists for an interviewer’sbias to influence the results. The strength of thisapproach though is that the interviewer can tailorthe approach and line of questioning to eachindividual.

Focus group interview

A focus group is like an interview with a smallgroup of people rather than only one person. Asemi-structured approach is most useful. Aboutsix to ten people can be interviewed together andthe interviews usually last from one-and-one-halfto two hours. This allows time for participants todiscuss about eight to ten questions.

The focus group technique is a highly efficientway to collect data. You receive the opinions ofseveral people at the same time. The socialsetting provides a measure of validation for the

information, since extreme or false views tend tobe challenged by others in the group. A skilledfacilitator can guide the group’s dynamics so thatthe participants stay on topic and people who areeither shy or have less popular opinions areencouraged to speak.

Exert some caution in selecting individuals for afocus group. First, this format is not advisable ifsensitive information of either a personal ororganizational nature is sought. People might bereluctant to speak up and could be vulnerable torepercussions if they do. For similar reasons, and,depending on the subject of the interview, youshould probably group people together withsimilar positions within the organizationalhierarchy. In particular, consider separating laborand management; and supervisors and thosethey supervise. In some cases, you might want togroup men and women separately.

Chapter 7 Qualitative Methods

69

1. Let the subject(s) know at the outset how long the interview will last, its purpose and general outline. Explain how confidentiality will be observed.

2. Obtain consent (preferably by signing a consent form) for participating before starting the interview.

3. Start off the interview with non-controversial questions that require minimal recall. More sensitive topics, including questions on knowledge, should be covered once a rapport has been established.

4. Create an atmosphere of having a conversation. You do not want people to feel as if they are being examined.

5. Ask clear, truly open-ended, questions.

6. Be nonjudgmental.

7. Be attentive. Indicate interest through your actions and offer verbal feedback.

8. Tape record the interview in order to have a detailed record for analysis. Record important points in your notes.

Guidelines for obtaining good interview data

Page 85: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

7.2.2 Questionnaires with open-ended questions

Researchers do not consider structured questionnaires - even with truly open-ended questions - to be the most effective way to gather qualitative information. It is assumed that many people do not want to take the time to write out a response. As well, a questionnaire cannot be sensitive to interviewee differences, since everyone gets the same questions. The depth of responses is limited because there is no opportunity to follow up on an interviewee's statement with other questions.

On the other hand, if you are using aquestionnaire to measure the quantitativeobjectives in the project, you can then quiteeconomically expand the breadth of the resultsby including a few open-ended questions. Thesecan be useful for gauging participant reactions,identifying program barriers, bringing outunintended consequences of the intervention,and verifying the picture obtained fromquantitative measures. Furthermore, the resultsof this initial screen can help you decide on thenature and extent of any follow-up qualitativestudies.

7.2.3 Observations

Another way of collecting qualitative data is toactually go on-site and observe what is going on.Depending on your needs for the evaluation,everything can be captured, including thephysical environment, social organization,program activities, as well as behaviors andinteractions of people. Or you can take a morenarrow focus. The type of observational dataused in qualitative analyses can be different thanthat used in quantitative analyses. In the latter,specific observations are always being sought:e.g., whether a particular procedure is beingdone correctly or if a particular work-sitecondition is observed. In contrast, for thepurpose of qualitative analysis, specific types ofobservations might not be defined beforehand.

Observational data is especially helpful inevaluating safety programs as an externalevaluator. An understanding of the physical andsocial environment will be increased. You willcatch issues that might go unreported during theinterviews because the insiders are too close totheir situations. As well, people might not speakfreely during interviews in fear of reprisal fromco-workers or management. Finally, an on-sitevisit can be the best way to verify thatintervention activities are occurring as described.

If you are an internal evaluator planning to useobservations, be aware that one’s view of thingsis influenced by one’s background and positionwithin the organization. Thus, if observationsare going to play a large role in an evaluation,consider bringing in an external, more neutralobserver. Similarly, you might have to choosebetween being an observer or a participant, orsomething in between. The more you participate,the more first-hand your knowledge will be. Thedisadvantage is that it becomes more difficult tomaintain “objectivity” and your presence couldinfluence those around you.

Tailor the length and frequency of observations toyour requirements. This can range from a singletwo-hour site visit to verify programimplementation to a full-time, year-long presenceto fully understand, for example, a change insafety climate. Field notes are the primary meansof recording observational information. This canbe supplemented with photographs or videos,although such methods are often obtrusive.Good field notes require a selectivity that canfocus on the important details, yet not severelybias the notes.

7.2.4 Document analysis

Documents of interest in workplace safetyintervention evaluations can include materialcontaining policies or procedures related to theintervention, safety records, committee minutes,correspondence, memoranda, or reports. Theycan suggest topics to include in interviews or

Qualitative Methods Chapter 7

70

Page 86: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

questionnaires and offer evidence of interventionimplementation, barriers to implementation, orother events in the workplace that could threatenthe evaluation’s internal validity.

Be aware that documents are never more than apartial reflection of reality. Some are normative;such as procedures documents. They tell whatshould be done, not whether it actually is done.Some documents are descriptive - e.g., minutesof meetings. However, they can reflect oneperson’s view (e.g., the minute-taker or chair ofthe meeting), more than the collective view.

7.3 Ways to use qualitative methods in effectiveness evaluation

Interviews, questionnaires, observations anddocuments are used alone or in combinationtowards several purposes in safety interventionevaluations. Here, we elaborate on five ways inwhich they can contribute to an effectivenessevaluation.

7.3.1 Identifying implementation and intermediate outcomes

Qualitative data can help elucidate the stepsbetween the intervention and the final outcome,including implementation and intermediateoutcomes. They can identify results not capturedin the quantitative measures. This can be an

important addition to an evaluation, since it isnot usually possible to quantitatively measureevery pertinent intermediate effect of theintervention. It can be difficult to anticipate themall and measure them quantitatively. Youespecially want to find out the extent to whichthe intervention was implemented as planned.Document analysis, observations and interviewscan be used to check on program activities.

7.3.2 Verifying and complementing quantitative outcome measures

Qualitative measures are used to verifyquantitative measures. Through an approach of“triangulation”, two or more differentmethodological approaches can measure thesame thing in order to establish consistency. You

Chapter 7 Qualitative Methods

71

Ways to use qualitative methods in effectiveness evaluation

1. Identifying implementation and intermediate outcomes
2. Verifying and complementing quantitative outcome measures
3. Eliminating threats to internal validity
4. Identifying unintended outcomes
5. Developing quantitative outcome measures

Example of how qualitative methods can be used to identify intermediate outcomes

Let us return to an earlier example34 where an intervention consisted of a workplace-based incident investigation team assisting supervisors in their investigation of incidents. Quantitative data included final outcome measures (frequency and severity of injuries) and some intermediate outcome measures (length of time between an incident and its report and percentage of incidents generating corrective action proposals). Interviews helped fill in the picture further of how the intervention could have led to the observed decrease in injuries and their severity. The interviews revealed that supervisors and safety representatives found the incident investigation teams helpful and felt that better corrective actions were conceived. Thus, better quality corrective actions - an intermediate outcome - were identified as a plausible means by which the frequency and severity of injuries were decreased.

34 Menckel and Carter [1985]

Page 87: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

might undertake a broad-based safety initiativeto change the “safety climate” in the workplace.Certainly, you could use a safety climatequestionnaire, which typically consists of close-ended questionnaire items, to assess a change insafety climate. Also valuable are open-endedquestionnaire items or interviews completed bykey personnel regarding observed changes in theworkplace atmosphere concerning safety. If themethods are consistent in their portrayal ofchange in safety climate, then a “cross-validation” of the methods has been achievedand you can present your conclusions with moreconfidence.

Sometimes the methods are complementary inthat they might measure different aspects of thesame concept. Open-ended questions orinterviews might detect aspects of change missedby a questionnaire containing only close-endeditems.

7.3.3 Eliminating threats to internal validity

Interviews with key officials can provideinformation crucial for addressing potentialthreats to internal validity.

7.3.4 Identifying unintended outcomes

Interviews and, possibly, observations are usefulways to identify unintended outcomes.Although some unintended outcomes can beassessed quantitatively, such as an increase in anuntargeted type of injury, others would be betterdetected through qualitative inquiry.

Interviews are especially good at gauging the reactions of intervention participants and others involved in the intervention, including supervisors, union leaders and managers. Their reactions are important, since a poor response by an influential individual or group of individuals at a work-site could have a big effect on the program. It might explain the lack of success of a promising intervention. Unintended outcomes can also be more positive. In one evaluation, for example, interviews with workers and foremen showed that several people believed that the recent decrease in the number of labor grievances could be attributed to the improved industrial relations resulting from the participatory ergonomic program.

7.3.5 Developing quantitative measures

Data collected using qualitative methods in theplanning stage of the evaluation can provide thebasis for the subsequent development of relevantquantitative measurement instruments. Here arethree examples.

Qualitative Methods Chapter 7

72

Example of how qualitative information helps reduce threats to internal validity

In the evaluation example just discussed on theprevious page, interviews and analysis of safetycommittee minutes revealed the followinginformation which helped eliminate threats tointernal validity. The workplace physical plan,products, production techniques and activities, aswell as the safety-related policies, purchases andactivities (apart from the creation of the incidentinvestigation committee) had remained constantover the six-year evaluation period - suggestingno evidence of history threats. There was also noevidence for an instrumentation or reportingthreat, since there were no changes in the incidentreporting criteria, nor in safety-related policies,purchases and activities (apart from the creation ofthe committee).

Page 88: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

7.4 Selecting a sample for qualitative studies

Once you have decided to use qualitative datacollection methods as part of the programevaluation, you need to decide from whom, orabout what, data should be collected. This mightinclude collecting data from specific employeework groups, manager groups, female or maleworkers, or different classifications of workers.Additionally, you might want to collect dataabout a particular event, occurrence, or incident.

Rather than surveying the entire work force, usewhat is called purposeful sampling. Here, oneselects information-rich cases to study in-depth.They are purposefully selected so that theinvestigator can learn, in detail, about issues ofcentral importance to the program. For example,you might want to ask different employeeworkgroups about their experience in a particular

occupational safety program. Then comparequotes across groups to see if there are differencesin experiences which might influence theintended goals of the program. Furthermore, youmight separately ask male and female workersabout any problems in participating in theprogram. Again, comparisons can be made tosee if both females and males similarly receivedthe program as intended.

We describe eight different purposeful samplingstrategies that may be used.

Extreme or deviant case sampling

Identify unusual or special cases. It is possiblethat much can be learned from extremeconditions (good or bad) rather than the manypossibilities which fall in the middle. Forexample, survey data collected after a safetyprogram is over might show one or two people

Chapter 7 Qualitative Methods

73

Examples of how qualitative studies can help develop quantitative instruments

1) Interviews, observations and document analysis can lead to the development and inclusion of certain items on questionnaires. For example, say that opinions expressed in interviews had a repeating theme that safety is for sissies. If your intervention is in part designed to change this attitude, then it would be a good idea to develop a questionnaire that includes questions which measure such safety attitudes.

2) People have used the details of incident records, a qualitative information source, to develop workplace-specific checklists of work practices or work-site conditions used in quantitative assessment. They review records to find which unsafe practices and conditions are associated with incidents. Interventions are then developed which encourage the practice of safer alternatives. As well, checklists of these safe practices and work-site conditions are developed and used in evaluation. Quantitative measurement consists of making (random) observations and recording whether the safe or unsafe version of the practice or work-site condition was observed.

3) Menckel and Carter35 described a new safety initiative in which a group assisted workplace supervisors in their investigation of incidents within their division. Preliminary interviews and document analysis showed that there was often a long delay between incident occurrence and its formal reporting. As a result, corrective safety measures were correspondingly delayed in their implementation. Thus, one of the ways evaluators chose to measure the effect of a new workplace incident investigation group was by how long it took for incidents to be reported.

35 Menckel and Carter [1985]

Page 89: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

who have made big changes. A follow-up withan interview could validate the responses as wellas discover what in the program motivated themto make such big changes. By limiting the focusto extreme cases, this approach to sampling iseconomical in time and resources.

Heterogeneity sampling/maximum variation sampling

Identify cases with differing characteristics (e.g.,age, gender, education, job classification) toprovide diverse points of view. Any commonpatterns emerging from the variant cases cancapture the core experiences and shared aspectsof a program or event.

Homogenous sampling

Identify a small group of information-rich cases- similar in terms of background, employmentlevel, experiences, etc. and explore the issue ofinterest in depth. It might be of interest toseparate groups of management and then laborand compare their opinions about a safetyprogram.

Typical case sampling

Identify “typical” individuals to describe thebenefits of the program. Cases are selected withthe co-operation of key informants such asprogram staff. This information can be used tohelp “sell” the program to others reluctant toparticipate.

Critical case sampling

Find individuals who could dramatically make apoint about the program. They may be identifiedby asking a number of people involved with theprogram. A good bet are the leaders in the groupwho could provide suggestions about how toimprove the program.

Criterion sampling

Identify and study cases that meet somepredetermined important criterion. Even if allemployees at the work-site receive the training,you might interview only those most exposed tothe particular hazard targeted by the training.They may reveal major system weaknesses thatcould be targeted for improvement.

Politically important case sampling

Identify, and select (or not) politically sensitiveindividuals. You might want to interview aunion steward who supports the program, andthereby can enrich the information obtained.

Convenience sampling

The most common method of selecting participants for qualitative data collection is picking the cases that are easiest to obtain and most likely to participate. This is also the least desirable method. The problem is that, in the end, it is difficult to know exactly who was interviewed and whether their opinions are consistent with others possibly affected by the program.

7.5 Qualitative data management and analysis

A variety of methods are used to analyzequalitative data. The process is described herein very general terms and appears as a sequenceof steps, which in actual practice can occursimultaneously or may even be repeated. First,all raw information, if not already in a writtenform, is converted to text. Thus, taped interviewsare transcribed and visual material issummarized using words, etc. This body oftextual material is reviewed to identify importantfeatures and, possibly, summarize them. Acoding system of keywords, or some other datareduction technique, is developed to facilitate thisprocess. The data, either in summarized form ornot, is then reviewed to identify patterns. These

Qualitative Methods Chapter 7

74

Page 90: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

patterns are concerned with the following:similarities or differences among groups orevents; repeated themes; and relationshipsamong people, things or events.

Identification of patterns leads to somegeneralizations or tentative conclusionsregarding the data. Depending on the scope ofthe investigation, you might examine thetrustworthiness of these generalizations bytesting them with the results of further datacollection or comparing them with existingtheory.

Success at the data analysis stage requires that good data management practices are observed from the beginning of data collection. Use systematic methods for collecting, storing, retrieving and analyzing data. People have developed various techniques to help highlight, organize or summarize data. A useful reference in this regard is the book by Miles and Huberman [1994]. This reference also reviews the various software developed to assist in both the data reduction and pattern recognition stages of analysis.
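As one small illustration of the keyword-based data reduction mentioned above, the following Python sketch tags interview excerpts with codes. Both the codes and the excerpts are hypothetical, and real coding schemes are normally developed and refined by the analysts rather than fixed in advance.

    # Hypothetical interview excerpts and a simple keyword-based coding scheme.
    excerpts = [
        "The investigation team got back to us much faster than before.",
        "Nobody told maintenance about the new lockout steps.",
        "Reporting is quicker now, but the paperwork is still heavy.",
    ]
    codes = {
        "timeliness": ["faster", "quicker", "delay"],
        "communication": ["told", "informed", "heard"],
    }

    # Tag each excerpt with every code whose keywords appear in it.
    for text in excerpts:
        tags = [c for c, words in codes.items() if any(w in text.lower() for w in words)]
        print(tags, "-", text)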

7.6 Ensuring good quality data

Concerns about reliability and validity apply toqualitative data, just as they do to quantitativedata. Thus, anyone reading a report of aqualitative investigation wants to know that thestated methods have been used consistentlythroughout the study (reliability concerns). Theyalso want to know that there are no hidden biasesin the data collection, the data analysis nor theconclusions drawn (validity concerns).

The following contains considerations andsuggestions for ensuring that good quality data iscollected.

Minimizing evaluator bias

The product of a study no doubt bears thepersonal mark of the people conducting it.

However, researchers generally try to reducetheir effect on their research by using conceptsand methods agreed upon by other researchers.Ways to guard against bias include the following:outlining explicit methods for data collection anddata analyses; adhering to these methods;having more than one researcher collect data;having a second, non-biased person summarizeand/or draw conclusions from the data; andletting the data speak for themselves and notforcing them into a framework designed by theresearcher.

Appropriate sampling

Someone reading your evaluation wants to besure that the right sample has been selected forthe stated purpose. For example, you could notclaim to be truly representing workplaceperceptions of the effectiveness of anintervention, if either management or employeerepresentatives are not represented. Thus, therationale and method of sampling must beexplicit and justified with respect to the study’saims.

Validation by subjects

One of the best ways to determine whether or not you “got it right” in your study is to check with the subjects you are studying. This involves confirming the accuracy of the data collected, the reasonableness of the method used to summarize it, and the soundness of the conclusions. Of course the potential biases of the subjects consulted must be kept in mind when weighing their opinions.

Thorough methods of drawing conclusions

Avoid drawing conclusions too soon. This can be caused by researcher bias or pressure to come up with answers quickly. In contrast, well-grounded conclusions require time for at least some of the following activities: 1) reviewing collected data to identify anything which has been overlooked; 2) searching for evidence which


Page 91: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

contradicts preliminary conclusions, either by reviewing data already collected or by gathering new data; 3) confirming important data or conclusions through “triangulation”, i.e., finding agreement when using a different data source, methodology or researcher; and 4) exploring alternative explanations for patterns observed in the data.

Conduct a pilot study

Conducting a pilot study or trial run with your proposed research methods is often of great value. Feedback from those involved in the pilot study can be used to refine a sampling strategy, interview guide, other data collection procedures, and even procedures for data management.

7.7 Summary

We have reviewed four major methods for gathering qualitative information: interviews; questionnaires with open-ended questions; observations; and document analysis. Qualitative data can be used in several ways to complement quantitative methods: identifying implementation and intermediate outcomes; verifying and complementing quantitative outcomes; eliminating threats to internal validity; identifying unintended outcomes; and developing quantitative measures. In contrast to quantitative methodology, qualitative methods usually employ one of several purposeful sampling strategies. We briefly discussed methods of analysis and methods to ensure good quality data.


Key points from Chapter 7

• Use interviews and focus groups, questionnaires with open-ended questions, observations, and document analysis to enrich your evaluation.

• Use qualitative methods for one or more purposes:

• identify implementation and intermediate outcomes

• verify and complement quantitative measures

• eliminate threats to internal validity

• identify unintended outcomes

• develop better quantitative measures.

• Use one of several purposeful sampling strategies.

• Collect and analyze data in ways which enhance their reliability and validity.


Page 92: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


Page 93: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries
Page 94: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


Statistical Issues: Are the results significant?

Chapter 8

8.1 Introduction

8.2 Why statistical analysis is necessary

8.3 P-values and statistical significance

8.4 Statistical power and sample size

8.5 Confidence intervals

8.6 Choosing the type of statistical analysis
8.6.1 Type of data
8.6.2 Evaluation design
8.6.3 Unit of analysis

8.7 Avoiding pitfalls in data analysis

8.8 Summary

Page 95: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

8.1 Introduction

This chapter will not answer all your questions on statistics. It cannot cover all the possible approaches that intervention evaluations might require. Statisticians will maintain they should be consulted right from the design stage of a project. This is not just self-interest. Certainly, they want your business - including the consultation fee! But even the statistician among us (HSS), who typically provides free consultations for faculty colleagues, makes the same point: there are often aspects of the way the evaluation is designed, or the type of data collected, among other factors, that mean the statistical analysis will not be entirely straightforward. You can avoid ending up with “messy data” by discussing the study in advance. The data could still turn out to be complicated and it may be best to find a statistician to do the analysis. After all, you have probably spent a lot of resources on ensuring the highest quality of intervention and data collection; so you do not want a second-rate evaluation.

This chapter provides an overview of the statistical concepts applicable to intervention evaluation. We start by explaining the need for statistical methods, followed by a discussion of the meaning of p-values. If you have ever read a scientific paper with quantitative information, then you have seen p-values. They show an expression like “p<0.05” along with a comment stating whether the result is “statistically significant”. Another way of presenting statistical results is confidence intervals. We also introduce the notion of statistical power and how it relates to the sample size.

Later in the chapter, we discuss what issues to consider in choosing a statistical technique. Two have already been mentioned - the type of data you have and the study design being used. No calculations are shown in this chapter, but some simple examples are included in Appendix B. They correspond to the evaluation designs outlined in Chapters 3 and 4.

8.2 Why statistical analysis is necessary

Surely, if the change in injury rate in an intervention group is greater than the change in injury rate in the comparison group, doesn’t that prove that the intervention has worked? Why are statistics needed? The answer is that real-life data are subject to random variability. Suppose you have a perfectly balanced (“fair”) coin. (Statisticians love using examples about coin-tossing to explain the principles they use.) Toss it ten times, and you can expect five heads and five tails. But it is also reasonably likely that you could get six tails and four heads, or four tails and six heads simply as a result of chance (random variability). You might also get a 7-3 split. As will be seen in the next section, the question from a statistical viewpoint becomes: how far from a 5-5 split do I have to get, before I become suspicious about the coin and question if it really is fair?
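You can see this random variability for yourself. The short sketch below is an illustration we have added, assuming Python with the NumPy package is available; it simulates many sets of ten tosses of a fair coin and shows how often a split at least as uneven as 7-3 turns up purely by chance.

    # Illustrative sketch: how often does a fair coin give a 7-3 split (or worse) in ten tosses?
    import numpy as np

    rng = np.random.default_rng(seed=1)
    n_tosses, n_repeats = 10, 100_000

    heads = rng.binomial(n=n_tosses, p=0.5, size=n_repeats)   # heads in each set of 10 tosses
    uneven = np.mean(np.abs(heads - 5) >= 2)                  # proportion with a 7-3 split or worse

    print(f"Proportion of repeats with a 7-3 split or worse: {uneven:.2f}")   # roughly 0.34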

The analogy with the safety situation can be seen if we think about a study investigating whether back belts reduce back pain. You randomize half the people to receive back belts (the intervention group), while the other half (the control group) is left alone. After the intervention group has had back belts for a while, everyone is asked about levels of back pain. [This is an example of an experimental design with “after”-only measurements, Section 4.3.2] For each person, there is a pain score.

The average score in the group given back belts may be somewhat better than the average in the control group. Does this mean that back belts work - or is it that simply by chance, the randomization led to more people who have a lot of back pain ending up in the control group? And what if it is the opposite? Suppose the back belt group does a little bit worse than the control group? Does that mean that these belts are actually harmful - or is it because there happened to be more people who get pain in the intervention group? Statistical analyses can


Page 96: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

indicate how likely (probable) these possibilities are.

Statistical techniques can also address a selection threat to internal validity, i.e., when differences in participants’ characteristics between the intervention and control group could be responsible for the measured intervention effect (Section 4.4.1). You notice, for instance, that people given back belts are on average somewhat older than those who do not get them. Furthermore, because older workers are less likely to use the belts, and the intervention group itself is older, you might not see much change in comparison with the control group. To reduce this type of threat, statisticians have developed techniques that account for or control the difference in ages (or other variables) between the two groups.

8.3 P-values and statistical significance

A leading philosopher of science in the 20th Century, Sir Karl Popper, argued that science proceeds by the method of refutation. At any time, scientists have only the best theory (or hypothesis) at the moment to describe how the real world works. This hypothesis can always change based on new observations. What scientists must do, argued Popper, is to devise experiments aimed at disproving (refuting) their latest theory. As long as the experiments fail to do so, we continue to regard the theory as at least a reasonable approximation to the real world. If an experiment disproves the theory, then a new one must be adopted. Classical statistical reasoning works in a similar fashion, basing the rejection of an initial hypothesis on probabilistic grounds.

With our example of back belts: start from a position of skepticism about their value, i.e., hypothesize that the intervention has no effect (null hypothesis). If the program is useless, then expect no difference in the back pain between the intervention and control groups. However, in


What do we mean by a “true” effect?

Several times in this chapter we refer to a “true” or “real effect” of a program or intervention. Surely, you may think, the true effect is what we find. How could the effect be considered not true?

Part of the answer is that the estimate of the effect is subject to variability. The groups studied may be comparable, but cannot be identical even if they are created through randomization. So if you repeat your evaluation study elsewhere, the size of effect might be larger or smaller.

If you repeat the study many times, you could take the average (mean) of the effect sizes. This would balance out the times when the effect just happens to be larger with those when it happens to be smaller. In practice, studies are rarely repeated even once, let alone many times. But you can do this as a “thought experiment”. In fact, statistically we think about a (hypothetical) infinite number of replications. If we could actually do them, the average effect size over the replications would be what we have called the true effect.

Caution: Having emphasized the vital importance of statistical analysis, we warn you about indiscriminately using statistical packages. Inexperienced researchers are sometimes tempted to feed their data into a standard software package and ask for everything to be compared with everything else. This is not good practice. Statistical testing should follow a clear definition of hypotheses; and the testing should not determine the theory. Instead, the hypotheses should come from the chosen intervention models.

Page 97: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

practice, the intervention group sometimes does better and sometimes worse than the control group, i.e., the average scores show less pain or more pain. The question then becomes: how big a difference must exist between the groups before you start to doubt the null hypothesis and accept that the change comes from a real effect of the program?

Typically, the statistical approach takes into account the difference in the mean (average) levels of pain between the two groups, as well as how much variability exists between the scores of the people in the study and the sample size (how many people are included in the study). The analysis produces a p-value, which can be interpreted as:

The probability that a difference at least as large as the one seen could have occurred simply by chance, if there really is no effect of the intervention.

When this probability is small enough, you reject the hypothesis of no difference and start to believe (at least for practical purposes) that the back belts have changed the level of pain. When the p-value is larger, you cannot reject the hypothesis, so you would conclude the belts do not “work”, at least in the way they were used. The cut point for the p-value, which represents the probability you are willing to allow of concluding the intervention works - or does harm - when it is really ineffective,36 is known as α (the Greek letter alpha). When the p-value is less than this, the result is declared statistically significant.

How small does the probability have to be for you to reject the hypothesis and claim that the intervention works? Strictly speaking, there is no right or wrong answer here. It depends on your willingness to draw the incorrect conclusion that the belts work when they really do not. This in turn depends on the resource and policy implications of the evaluation results. The more expensive the intervention, the less you want to risk making this type of mistake, so you want the probability to be very low; and vice versa - if the intervention is cheap and aimed at very severe injuries, you may be willing to apply it even if the evidence of its value is less strong. In practice, though, alpha is usually taken as 0.05 (or 5%).
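To make the back-belt example concrete, the illustrative sketch below (our addition, assuming Python with the SciPy package; the pain scores are invented) compares the mean scores of the two groups with a two-sample t-test and reads the p-value off against alpha = 0.05.

    # Illustrative sketch: comparing hypothetical back-pain scores in the two groups
    from scipy import stats

    belt_group    = [3, 5, 2, 4, 6, 3, 4, 2, 5, 3]   # hypothetical pain scores (lower is better)
    control_group = [5, 6, 4, 7, 5, 6, 8, 4, 5, 6]

    t_stat, p_value = stats.ttest_ind(belt_group, control_group)

    alpha = 0.05
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
    if p_value < alpha:
        print("Reject the null hypothesis of no difference (statistically significant).")
    else:
        print("Cannot reject the null hypothesis at alpha = 0.05.")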

8.4 Statistical power and sample size

Now there is another side to all this. We have discussed the p-value based on the condition that the program is useless. But what if it works? If the intervention is truly effective, you want to be reasonably sure to reject the initial null hypothesis. Just as you can get six heads and four tails even with a fair coin, you could also get five heads and five tails even with a coin biased toward heads. Similarly, even if the program truly has a moderate effect, you might be unlucky in your study and only observe a very small difference, which may not be statistically significant. If you fail to reject (i.e., you accept) your initial hypothesis of no difference, and there really is one, the mistake is known as a Type II error. The probability of such a mistake, i.e., the probability that you fail to reject the hypothesis when it is false, is known as β (the Greek letter beta).

This means that the probability that you correctly reject a false hypothesis, i.e. you detect the program’s effectiveness, is 1-β, and this value is known as the power of the study. The importance of this is that you obviously do not want to do a



36 This type of mistake is known as a Type I error

Important note: In the interpretation of the p-value, the phrase “if there really is no effect of the program” is crucial. Ignoring it has led to many a misinterpretation of p-values. We now discuss the situation where there is an effect.

Page 98: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

study that has little chance of demonstrating a real effect of an intervention. You want to be reasonably sure that you will conclude there is a difference if one really exists. Thus, it is important to consider power before undertaking the intervention and evaluation.

You can do this in two ways. You could set the power you want and then calculate the sample size needed; or you may have a certain number of workers who could be involved in a study, and you can estimate the power the study would have with that number.

The first approach is actually a preferable way to conduct an evaluation - indeed, clinical trials of new drugs do it this way round. Typically, researchers design evaluations so that the power is 80% (sometimes 90%); that is, if the intervention is truly effective, there is an 80% (90%) chance that the data you gather and the statistical test you use allow you to conclude that the intervention is effective.

In practice, workplace interventions usually involve a fixed number of employees, for example, all workers in a plant or a department. So you can’t set power in advance - rather, you should check what power you will have. Several components go into the calculation of power: the effect size - how much effect you think the intervention will have (or should have in order to be worth replicating elsewhere); sample size (the number of evaluation participants or, more formally, experimental units); how much variability there is between the outcome measurements within the sample; and the values you set for α and β. The formula you use to calculate power, like your choice of statistical test, depends on the experimental design and type of data collected.

All other things being equal, the larger the sample size, the larger is the power. Similarly, the less variation in the outcome measure, the larger the power. If you should find that the intended plan would likely yield power much lower than 80-90%, you might want to change your evaluation design, choice of outcome measures, or number of people included in the evaluation. Cohen [1988] shows how to do power calculations and you can also use the statistical packages mentioned in Appendix B.
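As an added illustration (not taken from the guide), the sketch below uses the statsmodels Python package, assuming it is installed, to work the calculation both ways round: solving for the sample size that gives 80% power, and finding the power available with a fixed workforce. The effect size of 0.5 is a standardized (Cohen's d) value chosen purely for illustration.

    # Illustrative sketch: power and sample size for a two-group comparison
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # 1) Set power = 0.80 and alpha = 0.05, and solve for the sample size per group
    n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
    print(f"Required sample size per group: about {n_per_group:.0f}")   # roughly 64

    # 2) A fixed workforce of 30 workers per group: what power would the study have?
    power = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=30)
    print(f"Power with 30 workers per group: about {power:.2f}")        # roughly 0.47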

8.5 Confidence intervals

Another way of showing the degree of uncertainty in an estimate of the effect of an intervention is through confidence intervals. Suppose that you do a study where people taking a program improve their knowledge score by an average of five points. It could be that the program is actually useless but, just by chance, this apparent benefit is found. Alternatively, the program may really be worth more than a five-point improvement; but by chance you happen to underestimate the real benefit. In other words, the real benefit of the program might be higher or lower than the observed value of five points. An obvious question is: how much higher or lower? You can often construct a confidence interval (as illustrated in Appendix B), within which you are reasonably sure the “true” level of benefit lies. (The interval is calculated, taking into account how much variability exists in individual knowledge scores.) Typically, you see 95% confidence intervals, which means that you are 95% sure (i.e., there is a probability of 95%) that the size of the true effect is within the interval.37

In many ways, the confidence interval is more useful than a p-value, which simply indicates whether a difference between groups is or is not statistically significant. With a confidence interval, you get a sense of just how high or low the benefit might reasonably be.

The narrower the interval the better, since the range of plausible values is smaller. Suppose the confidence interval shows that the observed

Chapter 8 Statistical issues

83

37 95% confidence intervals are based on α = 0.05; 99% confidence intervals are based on α = 0.01; etc.

Page 99: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

value of the average benefit of the program (five points) is from a one point benefit to a nine point benefit. (Confidence intervals can be, but are not always, symmetrical about the estimate of effect.) Although your best estimate of the program effect is still five points, it is also quite plausible that the benefit could be as low as one point - which we might consider trivial - or as high as nine points, a very useful improvement. Thus, we would be quite uncertain about the value of the program. A smaller interval, of between four and six points, would be preferable. All other things being equal, a narrower interval can be obtained with a larger sample size.

As a general rule, if a 95% confidence interval excludes the value of zero, you will know that if you tested the null hypothesis (i.e., no (zero) difference between the values being compared), you would be able to reject the hypothesis, using alpha = 0.05.
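The illustrative sketch below (our addition, assuming Python with SciPy; the individual improvement scores are invented, though they average five points as in the example above) computes a 95% confidence interval for the mean improvement and checks whether it excludes zero.

    # Illustrative sketch: 95% confidence interval for a mean improvement in knowledge score
    import numpy as np
    from scipy import stats

    improvements = np.array([2, 7, 4, 6, 9, 3, 5, 8, 4, 2])   # hypothetical point gains (mean = 5)

    mean_gain = improvements.mean()
    sem = stats.sem(improvements)                              # standard error of the mean
    ci_low, ci_high = stats.t.interval(0.95, len(improvements) - 1,
                                       loc=mean_gain, scale=sem)

    print(f"Mean improvement: {mean_gain:.1f} points, 95% CI {ci_low:.1f} to {ci_high:.1f}")
    if ci_low > 0:
        print("The interval excludes zero, so the result is significant at alpha = 0.05.")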

8.6 Choosing the type of statistical analysis

Up to this point we have simply referred to the interpretation of the p-value and confidence interval that result from a statistical analysis. Before you get there, you need to think about what type of statistical analysis to use. There are a number of issues to consider. Some are discussed in detail in this section; some later in the chapter.

8.6.1 Type of data

The type of data being analyzed is important. If you determine whether or not someone had any injuries, you have what is called categorical or discrete data; and if you use the actual number of injuries occurring to each person or the counts for an entire workplace, you still have categorical data. Continuous data can take essentially any value in a range (at least in principle). Age is a continuous variable, although in adults we usually simply round off the number of years and say that someone is 42, rather than 42.752 years old. In practice, the boundary between categorical and continuous variables can be a little fuzzy. Thus a measure of behaviors, which might take an integer value from 0 to 100, is typically considered to be continuous. Some statistical methods are robust, which means that taking such variables as continuous is acceptable. Any variable that takes at least ten values can reasonably be considered continuous. Another situation in which the boundary between categorical and continuous is fuzzy is when you analyze injury rates. Although the rates are continuous data, statistical tests intended for categorical data are sometimes used. This is because the analysis in such cases uses the actual numbers of injuries, which are categorical variables.


Things to consider when choosing the type of statistical analysis

• What type of data is it - categorical or continuous?
• What type of evaluation design is used?
• What is the unit of analysis?
• What is the study sample size?
• Is correction for different group characteristics needed to avoid selection effects?

Page 100: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

8.6.2 Evaluation design

The choice of statistical analysis must also take account of the evaluation design. A simple comparison would be between the post-intervention scores of those who have experienced an intervention and those who have not. Similarly, scores before and after an intervention can be compared. The simplicity of these two types of studies makes them useful for distinguishing two situations which require different analyses. In the first case, the scores in the two groups all come from different individuals. In the second case, in contrast, each individual received two scores - before and after the intervention. This is an example of a repeated measures design. The two scores tend to be related, since those who score higher before the intervention will likely score high (or at least relatively high) after the intervention.

In analyzing repeated measures we can take advantage of this relationship. The techniques usually reduce the “noise” in the data, because they remove much of the variability between people. This allows the “signal”, the differences over time within each person’s scores, to become clearer.
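The illustrative sketch below (our addition, assuming Python with SciPy; the scores are invented) analyzes the same before-and-after data with a paired test and, for contrast, with a two-sample test that ignores the pairing. The paired analysis gives the smaller p-value because the between-person variability has been removed.

    # Illustrative sketch: repeated measures (paired) analysis versus an unpaired comparison
    from scipy import stats

    before = [60, 72, 55, 80, 65, 70, 58, 75]   # hypothetical safety-knowledge scores
    after  = [66, 75, 61, 84, 70, 76, 63, 79]   # same people, after the intervention

    paired = stats.ttest_rel(before, after)      # appropriate for a repeated measures design
    unpaired = stats.ttest_ind(before, after)    # ignores the pairing (for comparison only)

    print(f"Paired t-test:      p = {paired.pvalue:.4f}")
    print(f"Two-sample t-test:  p = {unpaired.pvalue:.4f}")   # larger p when the pairing is ignored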

Table 8.1 is a guide to statistical tests appropriate for the designs that we have discussed. As indicated, illustrations of some of these tests can be found in Appendix B.


Table 8.1: Choice of statistical test based on evaluation design and type of data

Type of design | Type of outcome data | Statistical test | Section number
Before-and-after | Rate | Chi-squared test for comparing rates | B.1.1
Before-and-after | Continuous | Paired t-test | B.1.2
Pre-post with randomized or non-randomized control group | Rate | z-test for comparing rate ratios or rate differences | B.2.1
Pre-post with randomized or non-randomized control group | Continuous | Two sample t-test (groups similar) or multiple regression (groups different) | B.2.2
Experimental designs with "after"-only measurements | Rate | Chi-squared tests for comparing rates | B.3.1-B.3.2
Experimental designs with "after"-only measurements | Continuous | Two sample t-test (two similar groups), ANOVA (two or more similar groups), or ANCOVA (two or more different groups) | B.3.3-B.3.4
Simple or multiple time series; multiple baseline design across groups | Categorical, rate or continuous | Time series analysis techniques (e.g., ARIMA) | B.4

Page 101: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

8.6.3 Unit of analysis

You also need to think about the unit of analysis. If an intervention is targeted at changing a workplace, rather than individuals, then each workplace might count as just one unit. Another possibility is that you conduct a randomized study aimed at changing individual behavior, but you randomize work groups, rather than individuals. For example, if you have a company with a number of small but geographically separated units, you might randomize work groups to receive or not receive the intervention. Then in any given work group either everyone gets the intervention or no one does. This sampling of “clusters” rather than individuals must be taken into account in the analysis. In essence, the issue is that individuals may display similar behavior within a given unit (i.e., a group effect). The greater this tendency, the lower the effective sample size. The concern about “cluster sampling” is real and relatively common - but often it is not accounted for in the analysis.
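As a rough illustration of the idea (the design-effect formula is a standard textbook quantity, not taken from the guide, and the intra-cluster correlation value below is hypothetical), the sketch shows how even modest similarity within work groups shrinks the effective sample size.

    # Illustrative sketch: effect of cluster sampling on the effective sample size
    n_workers = 200       # total workers in the evaluation
    cluster_size = 20     # workers per work group (cluster)
    icc = 0.05            # hypothetical intra-cluster correlation (similarity within groups)

    design_effect = 1 + (cluster_size - 1) * icc      # standard design-effect formula
    effective_n = n_workers / design_effect

    print(f"Design effect: {design_effect:.2f}")                                   # 1.95
    print(f"Effective sample size: about {effective_n:.0f} of {n_workers} workers")  # about 103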

8.7 Avoiding pitfalls in data analysis

Data exploration

It is good practice before jumping into a (relatively) sophisticated analysis to look at the data in a fairly simple fashion. You can look at the means of groups, the proportion of subjects having injuries, frequency distributions or the range of values observed. Graphs or diagrams can often be helpful. These approaches give you a “feel” for the data, as well as help find errors arising from the data collection or processing. Failure to do these things may lead to the application of a technique inappropriate for the data, or even worse, analyzing incorrect data.
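A few lines of code are often enough for this first look. The sketch below is illustrative only; it assumes Python with the pandas package (and matplotlib for the histogram), and the data file and column names are hypothetical.

    # Illustrative sketch: a simple first look at the data before formal testing
    import pandas as pd

    df = pd.read_csv("evaluation_data.csv")        # hypothetical file of injury records

    print(df[["group", "injuries", "age"]].describe())       # means, ranges, quartiles
    print(df.groupby("group")["injuries"].mean())            # average injury counts by study group
    print(df["injuries"].value_counts().sort_index())        # frequency distribution of injury counts

    # A quick histogram (requires matplotlib) can reveal data-entry errors or surprising values
    df["age"].plot(kind="hist", bins=20, title="Age distribution of participants")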

Changes to the designs

It is often tempting, for practical purposes, to “tweak” the intervention or evaluation designs, to make what at face value might seem to be minor changes. The result will generally have important implications for the type of analyses done. If you had originally planned the study with a particular type of statistical approach, you may not be able to use it. This is not to suggest that you should always be rigid in following a pre-planned design, but rather that you should make changes with caution.

Other pitfalls

We have already mentioned a few pitfalls and how to deal with them: choosing the right unit of analysis - especially when we engage in “cluster” sampling; ensuring studies are large enough to have adequate power; and ensuring we do not simply press buttons on our computer to produce an answer based on an incorrect analysis.

8.8 Summary

In this chapter, you have seen some basic concepts in statistical inference, including p-values, statistical power and confidence intervals. We pointed out some of the things to consider when undertaking an analysis: type of data; evaluation design; unit of analysis; sample size; and correction for group characteristics. Some examples of analyses, corresponding to the designs discussed in Chapters 3 and 4, can be found in Appendix B.


Key points from Chapter 8

• Discuss the analysis with a friendly statistician while designing the evaluation - do not wait until after you have collected the data.

• Check statistical power while designing the evaluation.

• Do an initial data exploration to get a “feel” for the data.

• Choose the type of statistical test according to the type of evaluation design and the type of data.

• If in doubt, discuss with a friendlystatistician.

Page 102: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


Summary of recommended practices

Chapter 9

9.1 Introduction

9.2 Summary of recommended practices

9.1 Introduction

We have discussed the various methods of effectiveness evaluation in the context of evaluating safety interventions. The following section gives an overview of some of the key messages from the previous chapters. You likely will not be able to follow all of the recommended practices. As a whole, they represent an ideal. Even if you are not able to follow all of the practices outlined in this guide, it does not mean you should not proceed with your chosen intervention and some level of its evaluation.

You will no doubt need to summarize and report on the results of your evaluation. Some guidance on this aspect has been included in Appendix C.

Page 103: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

9.2 Summary of recommended practices

Planning and development

• Identify resources available and information needs of the end-users of the evaluation results
• Involve all parties relevant to the intervention and evaluation in the planning stage, as well as at subsequent stages of the evaluation
• Seek external expertise on evaluation design, methodology, and analysis, if necessary
• Review relevant theory, research literature, methodology, historical data
• Develop a conceptual model and/or program logic model
• Keep an intervention diary

Method development

• Determine reliability and validity of measurement methods if not already known; pilot test when necessary
• Use qualitative methods to inform the use and design of quantitative methods
• Pilot test any qualitative or quantitative methods that are new
• Estimate statistical power based on planned methods - if insufficient, choose a new study sample size, evaluation design or measurement method

Study sample

• Choose a study sample representative of the target population
• Use random sampling methods in selecting a sample from a sampling frame
• Choose a sample size large enough to give sufficient statistical power
• Consider using randomized block or matching designs to avoid selection effects
• In quasi-experimental designs (non-randomized), choose intervention and control groups so that they are very similar
• In experimental designs, use randomization to assign participants to intervention and control groups
• In qualitative studies, choose a purposeful sampling strategy suitable for the evaluation purpose and intervention circumstances



Page 104: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

Evaluation design

• If you have no choice but to use a before-and-after design, for reasons of feasibility or ethics, try to eliminate the threats to internal validity:
  • identify other changes in the workplace or community which could affect the outcome (history threat) and measure their effect
  • ensure that before and after measurements are carried out using the same methodology, to avoid instrumentation or reporting threats
  • avoid using high-injury-rate groups as the intervention group in a before-and-after study, to avoid regression-to-the-mean threats
  • allow for the fact that taking a test can have an effect of its own (testing threat)
  • try to minimize Hawthorne threats by acclimatizing workplace parties to researchers before measuring the intervention’s effect
  • identify any natural changes in the population over time which could obscure the effect of the intervention (maturation threat), and try to allow for them in the statistical analysis
  • investigate whether the intervention participants’ dropping out could have an effect (dropout threat)

• Use a quasi-experimental design whenever possible instead of a before-and-after design by using one or more of the following strategies:
  • include a control group
  • take additional measurements both before and after the intervention
  • stagger the introduction of the intervention to different groups
  • add a reversal of the intervention
  • use multiple outcome measures

• Use an experimental design whenever possible instead of a quasi-experimental design by assigning participants to intervention and control groups through randomization

• When using control groups, check that intervention and control groups receive similar treatment throughout the evaluation period, apart from the intervention itself; avoid, but check for, diffusion, rivalry or resentment effects

• Plan a measurement timetable to capture maximum intervention effect and characterize longer-term effects

• Collect additional data to address threats to internal validity not addressed in the primary experimental design

• Try to triangulate and complement data collection methods by using multiple methodologies, especially qualitative and quantitative


Page 105: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

Measuring intervention implementation

• Use both qualitative and quantitative methods to assess
  • degree of intervention implementation
  • problems with intervention implementation

Measuring intervention outcomes

• Measure intermediate and final outcomes
• Use both quantitative and qualitative methods
• Select reliable and valid methods, appropriate for the study sample and intervention
• Consider injury statistics, other administrative records, behavioral/work-site observations, employee surveys, analytical equipment and workplace audits as means of measuring outcomes
• Select quantitative methods which give sufficient statistical power during analysis
• Consider the ethical aspects of measurement methods

Measuring unintended outcomes

• Use both qualitative and quantitative methods to assess any unintended outcomes

Statistical analysis

• Decide on the statistical methodology prior to undertaking the evaluation
• Calculate power before data gathering begins - modify the design or measurement methods if power is inadequate
• Use appropriate techniques for analysis based on type of data and experimental design

Interpretation

• Try to use the results of qualitative enquiry to enhance the understanding of the quantitative results
• Identify all likely alternative explanations for the observed results apart from the true effect of the intervention (i.e., threats to internal validity)
• Examine the feasibility of alternative explanations, using a quantitative approach whenever possible and collecting additional data if necessary

Conclusions

• Address evaluation questions in your conclusions

Recommended Practice Chapter 9

90

Page 106: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

Glossary1

Alpha (α): in statistical analysis, the probability you are willing to allow of concluding that the intervention works, or does harm, when it is really ineffective

Before-and-after design: (syn. pre-post design) a research design where measurements are taken both before and after the introduction of an intervention to measure its effect; permits less confident causal inferences than a quasi-experimental or experimental design

Conceptual model: diagram which represents the causal relationships among important concepts relevant to an intervention

Confidence interval: interval surrounding a point estimate, where the true value of the estimated parameter is found with a probability of (1-α)

Confounding variable: variable which affects both the independent variable (presence of intervention or not) and the dependent variable of interest; it is not a mediating variable

Control group: group for which there is no intervention; group which is compared to the group undergoing the intervention and the difference in group outcomes attributed to the effect of the intervention; created through randomization in experimental designs; created using non-random means in quasi-experimental designs

Effect modifying variable: variable which modifies the size and direction of the causal relationship between two variables

Effectiveness evaluation: (syn. outcome evaluation; summative evaluation) evaluation which determines whether a safety initiative had the effect (e.g., a decrease in injuries) it was intended to have

Evaluation design: the general plan for taking measurements during an evaluation; i.e., from how many group(s) of workplaces/workers are measurements taken and when

Experimental design: a research design with both intervention and control groups created through a randomization process

History threat (to internal validity): when some other influential event happens during the intervention and evaluation period

Human sub-system (in the workplace): human knowledge, competencies, attitudes, perceptions, motivations, behaviors

Implementation: putting the intervention in place


1 This glossary is not intended to be used as a general reference for evaluation terms. In many cases, terms have been expressed in the same context in which they are used in the guide. The definitions appearing here might therefore be more restrictive than those found in a more general reference.

Page 107: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

Instrumentation threat (to internal validity): when the measurement method changes during the intervention and evaluation period

Inter-rater reliability: degree of agreement in scores between two different people rating the same phenomenon

Intervening outcomes: outcomes which result from the intervention but precede the final outcome; includes implementation, short-term outcomes, and intermediate outcomes

Intervention: see Safety intervention

Intervention group: group which undergoes the intervention; not a control group

Moderating variable: see effect modifying variable

P-value: in statistical analysis, the probability that a difference at least as large as the one seen could have occurred simply by chance, if there really is no effect of the intervention

Power: see Statistical power

Program logic model: diagram depicting the linkage of intervention components to implementation objectives to short-, intermediate- and long-term outcome objectives

Qualitative methods: research methodology which yields non-numerical data; includes interviews, document analysis, observations

Quantitative methods: research methodology which yields numerical data

Quasi-experimental design: research design which permits more confident causal inference than a before-and-after design; often includes a non-randomized control group

Random number tables: tables consisting of randomly generated digits, 0 to 9, with each digit having a probability of 1 in 10 of being selected; used to select random samples or to randomize participants to intervention and control groups

Random sampling: technique of selecting a study sample so that the choice is made randomly (using random number tables, etc.) and each participant has a known probability of being selected

Randomization: method of selecting participants for intervention and control groups such that the probability of being selected into one or the other group is the same for all participants; method of forming intervention and control groups in experimental designs

Regression-to-the-mean threat (to internal validity): when a pre-intervention measurement of safety for a group is atypical and later measurements over the course of the intervention and evaluation are more similar to mean values

Reporting threat (to internal validity): when something changes the validity of (injury) reporting over the course of the intervention and evaluation


Page 108: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

Reliability: degree to which the values measured for a certain concept are consistent

Safety intervention: any attempt to change how things are done in order to improve safety (e.g., engineering intervention, training program, administrative procedure)

Sample: see study sample

Sample size: the number of experimental units (people, workplaces, etc.) in the study sample

Sampling frame: the group within the target population from which you draw the study sample

Selection threat (to internal validity): when the apparent effect of the intervention could be due to differences in the participants’ characteristics in the groups being compared

Selection interaction threats (to internal validity): when the apparent effect of the intervention could be due to something happening to only one of the groups being compared in an experimental or quasi-experimental design

Statistical power: likelihood of detecting a meaningful effect if an intervention is truly effective

Study sample: participants selected to undergo either intervention or control conditions in a research design

Target population: larger group from which the study sample is selected; larger group to which evaluation results should be generalizable

Technical sub-system (in the workplace): the organization, design and environment of work, including hardware, software, job procedures, etc.

Testing threat (to internal validity): when the taking of the (pre-intervention) safety measurement for a group has an effect on the subsequent measurements of safety for the group

Threats to internal validity: possible alternative explanations for observed evaluation results; typically, experimental designs have fewer threats than quasi-experimental designs, which have fewer than a before-and-after design

Unintended outcomes: outcomes of the intervention besides the intended ones; can be desirable or undesirable

Validity: degree to which we measure the concept we intend to measure

Variable: any attribute, phenomenon or event that can have different quantitative values


Page 109: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


Page 110: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


Some models to assist in planning

Appendix A

Chapter 2 emphasized the importance of having an explicit conceptual model and/or program logic model related to the intervention. Since there are three levels of interventions (organization of safety management, technical sub-system and human sub-system), we present different models corresponding to each of these three levels, which can either be applied or adapted for your own use.

A.1 A model for interventions in the technical sub-system

The first type of model is one most applicable to interventions in the technical sub-system; i.e., interventions concerned with the organization, design or environment of work, or with secondary safety or emergency measures. Here, the harm process is seen as a deviation, which, if unchecked, develops into an exposure to dangerous energies because of a loss of control of the work process (see Figure A.1). Another model of this nature is by Kjellén [1984].

As an illustration of how models can assist in designing an evaluation, consider the application of the first one to an engineering intervention where one style of machine guarding is replaced by a new one. Machine guarding is a specific example of a “hazard control measure” depicted in the model. (It would also be considered an independent variable in the context of the proposed intervention.) A typical evaluation measures the frequency of damage processes or injuries. (Injuries are a dependent variable and a final outcome measure.) By referring to the model, one can identify intermediate variables to measure (i.e., deviations from the normal situation), which here could be the percentage of times during a given observation schedule the guard

A.1 A model for interventions in the technical sub-system

A.2 Models for interventions in the human sub-system

A.3 Models for interventions in the safety management system

Page 111: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


was not in place. The distinction in the model between “damage process” and “reporting” reminds us that injuries are not necessarily reported, and so any changes in other variables affecting reporting should be accounted for during the evaluation.

In general, when dealing with interventions in the technical sub-system, one also needs to think about the possibility of compensatory behavior on the part of managers or workers and, if necessary, account for this in the evaluation. For example, interventions to reduce the levels of noise emission from machines could be evaluated by measuring the actual noise emissions. However, there may be no effect of this intervention on audiometric measures of hearing loss if one result of quieter machines is that workers use less hearing protection. Ideally, one wants to include measures of noise emission, protection equipment use and audiometric measures in the evaluation.

2 Adapted from Hale and Glendon [1987]. Horizontal arrows from prevention or recovery activities indicate at which point they can curtail or stop the harm development processes, which are represented by the vertical arrows running from the top to the bottom in the figure. If the prevention/recovery activities are successful, movement down a vertical line shifts with a right angle to a horizontal prevention/recovery pathway; if not, movement continues in a vertical direction.

Figure A.1: Deviation model2

Page 112: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

A.2 Models for interventions in the human sub-system

When interventions are planned that do not directly intervene in the work process, but are designed to modify human knowledge, competence, attitudes or behavior, it is valuable to use a specific model to help guide the research. Even when an intervention in the technical sub-system is planned, behavioral models could be relevant for the steps in the deviation which involve human activity. We present two behavioral models. One is concerned with errors made without the intention to take risk; the other is concerned with intentional risk taking.

Model relevant to unintentional risk-taking

Figure A.2 shows a model of three types of error mechanisms that can occur without the intention of taking risk. “Slips” (of action) and “lapses” (of memory) can happen in the absence of a problem in the environment and result in the failure to execute a plan as intended. When operators realize that a problem exists, theory suggests that people will most likely search for familiar patterns and try to apply a known problem-handling rule. Only if the “rule-based level” approach fails do people then resort to a “knowledge-based level” approach. This involves a wider range of cognitive strategies. In this model, errors arising at the rule-based and knowledge-based levels are called “mistakes”. They result from carrying out an inadequate plan in the face of a problem. A model from Hale and Glendon [1987] develops the problem-solving aspects of Reason’s model into a number of specific steps which can also be useful for evaluation planning. Both models indicate which intervening variables are possibly relevant to interventions aimed at modifying normal working behavior.

Model relevant to intentional risk-taking

The other type of model relevant to the human sub-system is one concerned with intentional risk-taking. The Health Belief Model4 (Figure A.3) has been frequently used in the health promotion field in relation to health-related behaviors, such as smoking. However, the underlying theory is likely relevant to decisions and behaviors in safety and occupational health contexts. It can be applied to observing safety rules, using personal protective equipment and dismantling safety guards, etc. by workers; and, with modification, to observing safety rules and regulations by managers designing, planning and monitoring work processes.

The model shows that the likelihood of undertaking the recommended health (or safety) action depends on the individual’s perceptions of their susceptibility to disease/injury, its seriousness, the benefits of taking the preventive action and the barriers to taking such action. Benefits are things like saving time and effort, and approval from a production-oriented supervisor, etc. Barriers are things like inconvenience, lack of knowledge or skill in undertaking the new behavior, fear of countering local norms, etc. All of these categories provide ideas about the intervening attitude measures that can be taken in evaluations of behavioral interventions.


3 Figure from Reason [1990] is reprinted with the permission of Cambridge University Press.
4 Becker MH, Haefner KP, Kasl SV, Kirscht JP, Maiman LA, Rosenstock IM [1977]. Selected psychosocial models and correlates of individual health-related behaviors. Med Care 15:27-46. With permission of Lippincott Williams & Wilkins.

Page 113: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


Figure A.2: Generic error-modeling system [Reason 1990]

Page 114: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

A.3 Models for interventions in the safety management system

When we move to interventions at the level of the organization (i.e., interventions to change workplace safety policies, procedures, structures, organization), the causal chain between the intervention and the final outcome measures of, for example, injury, becomes even longer. It is therefore much more difficult to find evidence for this link in a convincing way. Thus, there is an even greater need to measure intermediate outcomes as well. Few comprehensive organizational models exist, linking aspects of management structure all the way through to the injury process. The model in Figure A.4 is one which attempts to do this.

The model shows a management system as an interacting set of tasks, controls and resources, linked by communications and feedback loops, that develops, operates, monitors and improves a risk control and monitoring system (RCMS). The RCMS carries out risk analysis and plans all of the control functions and activities, including the system for monitoring the performance of the risk control system. The outputs to the technical and human sub-systems in the management model can be seen as the links to the earlier models (Sections A.1 and A.2). The management policy-making system sets up the RCMS, alongside the other aspect systems of the company (such as quality, environment or productivity), reviews it and gives it signals to improve. These loops make it a dynamic, learning system.


Figure A.3: Health belief model5

5 Becker MH, Haefner KP, Kasl SV, Kirscht JP, Maiman LA, Rosenstock IM [1977]. Selected psychosocial models and correlates of individual health-related behaviors. Med Care 15:27-46. With permission of Lippincott Williams & Wilkins.

Page 115: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


Figure A.4: Model of a safety management system6

6 Model from Hale et al. [1999]

Page 116: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


The eight delivery systems, referred to in the figure, deliver for each primary business function the following generic controls and resources to the requisite safety-critical tasks:

• people available when required to perform safety-critical tasks
• competent in performing them and
• committed to safety;
• clear output goals, procedures, rules and plans for safety;
• hardware resources of well designed work interfaces, tools and equipment that can be worked safely;
• spares, replacements and modifications to plant and equipment that maintain safety;
• communication and coordination channels for ensuring that information about safety is disseminated to the right people, and that tasks performed by different people are safely coordinated;
• mechanisms for resolving conflicts among safety and other criteria.

Each delivery system actually involves a number of steps or sub-tasks. For example, the first delivery system mentioned - delivering people - involves analysing their tasks; specifying the personnel requirements; selecting and training the right people; and allocating them to the work at the appropriate times to perform their safety-critical functions.

Interventions at the level of the management system can be done to introduce or improve the functioning of any one of the elements of the model, or a combination of them. Examples include involving operators in writing safety procedures; adopting a new safety audit system to review the RCMS; appointing a review committee to check proposed plant modifications for safety and health implications; and introducing ergonomic standards for purchasing tools and equipment. This model therefore provides possible sets of intervening variables to link the management interventions to the ultimate change in injury experience in the deviation model.

Page 117: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


Page 118: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries


Examples of statistical analyses

Appendix B

B.1 Analyses for before-and-after designs
B.1.1 Before-and-after design with injury rate data
B.1.2 Before-and-after design with continuous data

B.2 Analyses with pre-post measures and a control group
B.2.1 Pre-post with control group and rate data
B.2.2 Pre-post with control group and continuous data

B.3 Analyses for designs with after-only measures and a control group
B.3.1 After-only measurements with two groups and rate data
B.3.2 After-only measurements with several groups and rate data
B.3.3 After-only measurements with two groups and continuous data
B.3.4 After-only measurements with several groups and continuous data

B.4 Multiple measurements over time

Page 119: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

This appendix shows some simple statistical analyses. We assume that you have a computer with a statistical software package, or if not, can get a book that explains statistical calculations. Some well-liked basic texts are Altman [1991]; Armitage and Berry [1994]; Clarke [1980]; Colton [1974]; Freedman et al. [1998]; Healey [1984]; Norman and Streiner [1994]; Siegel and Castellan [1988]; Swinscow [1978]; Weinberg and Goldberg [1990]. Some of the techniques you might need are available in spreadsheet software packages - but they have their limitations. We demonstrate how to do some analyses not typically found in most packages.

Reasonably good statistical applications are fairly cheap (many universities have site licences) and some can be downloaded free from the Internet. Two products, available at the time of writing, are: the widely used Epi Info, a DOS-based word processing (questionnaire design), database and statistical program (http://www.cdc.gov/epo/epi/epiinfo.html); and PEPI (http://www.usd-inc.com/pepi.html), a statistical package with many useful statistical procedures.

Keep in mind that most statistical tests make certain assumptions about the real world, which may or may not be true in your situation. We mention a few of these when they are particularly relevant.

The examples included in this appendix are organized according to the evaluation designs presented in Chapters 3 and 4. For some, we show how to do the calculations. For others we show examples from published papers. In some of these cases (e.g. paired t-test, two-sample t-test), any statistical package will do the computations.

B.1 Analyses for before-and-after designs

B.1.1 Before-and-after design with injury rate data7

Calculating a p-value

Much injury data is in the form of rates. Typically they are expressed in the form:

    rate = (number of injuries / number of hours worked) x 100,000

The denominator could also be the number of people working in a plant in a given year. You may want to compare two rates in one workplace, before and after an intervention. The hypothesis here is that the intervention has no effect, i.e., no difference between the two measurements, apart from chance variability. The data are in Table B.4. (Note that it is important to do the calculations as accurately as possible, so do not round off numbers on your calculator until the very end of the analysis. For ease of reading we round off below, but we have already done the calculations more precisely.)

Are the rates 70 (per 100,000 hours worked) and 37 significantly different? The approach is to see how many injuries are expected to occur if the total rate, in both time periods combined, is applied to the Before and After groups - as if there is no true difference between groups. You then compare the observed (actual) numbers of injuries with those expected. In this case, the overall rate is 50 (per 100,000 hours). So the expected number of injuries in the Before group is (50/100,000) x 40,000 = 20, and in the After group is (50/100,000) x 60,000 = 30.


7 The method described here assumes that the risks of an injury before-and-after are “independent.” Strictly speaking, with many of the same people working that may not be true, but we are reasonably safe in using this approach. If most injuries were occurring to the same few people before and after the intervention, we would need to use a more sophisticated statistical technique.

Page 120: Guide to Evaluating the Effectiveness of Strategies for Preventing Work Injuries

Now you calculate the test statistic (X²):

X² = Σ (Observed - Expected)² / Expected,

where Σ (the Greek letter sigma) means you add up the quantities (Observed - Expected)²/Expected for all the groups. For the data here,

X² = [(28 - 20)²/20] + [(22 - 30)²/30] = 5.33.

You compare the calculated value of 5.33 with critical values of something called the chi-squared (χ²) distribution with one degree of freedom8. For 5% significance (i.e., α = 0.05), the critical value of χ² is 3.84, and the calculated value 5.33 is larger than this. This means that our result is statistically significant, i.e., the probability of getting a difference in rates as large or larger than was found is less than 5%, if there really is no effect of the intervention. (A computer program gives the precise p-value: p = 0.021.)
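
If you prefer to let software do the arithmetic, the following minimal sketch in Python reproduces the calculation for Table B.4 (the use of the scipy library is an assumption; any statistical package can do the same):

# Before-and-after comparison of injury rates (data from Table B.4)
from scipy.stats import chi2

injuries = {"before": 28, "after": 22}          # observed counts
hours    = {"before": 40_000, "after": 60_000}  # employee hours

overall_rate = sum(injuries.values()) / sum(hours.values())   # 50 per 100,000 hours

x2 = 0.0
for period in injuries:
    expected = overall_rate * hours[period]     # 20 before, 30 after
    x2 += (injuries[period] - expected) ** 2 / expected

p_value = chi2.sf(x2, df=1)                     # upper-tail probability
print(f"X2 = {x2:.2f}, p = {p_value:.3f}")      # X2 = 5.33, p = 0.021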

Calculating rate ratios and rate differences

One limitation of this method is that it does not indicate the strength of the effect of the intervention. How can you measure that? We describe two obvious ways. The first is to look at the relative injury rate. Take the rate after the intervention and divide by the rate before. Call the result RR (for Rate Ratio), which in this case is 36.7 / 70 = 0.52, or 52%. You could also say that the rate has dropped by 100 - 52 = 48%. The second measure is simply the difference in the rates, RD. Here RD is 70 - 36.7 = 33.3 per 100,000 hours worked.

Calculating a confidence interval for a rate ratio

You can calculate a confidence interval (CI) for these estimates. As described in Section 8.5, a confidence interval is a range of values within which the true value of the parameter lies, with a probability of 100(1 - α)%, usually 95%. Let us start with the RR. For reasons we will not go into, the analysis uses natural logarithms, abbreviated as “ln” on your calculator. The CI for ln(RR) is given by the limits:

ln(RR) ± Z x SE,

where ± means plus or minus. Z is a number you can look up in a statistical table of critical values from the normal distribution. Its value depends on whether you want a 95% CI or 90% or some other value. [We use the conventional 95%, for which the appropriate value of Z is 1.96.]


Table B.4: Injury rate data from a before-and-after evaluation design

                                        Before Intervention   After Intervention     Total
Number of injuries                               28                    22               50
Employee hours                               40,000                60,000          100,000
Number of injuries per 100,000 hours             70                    37               50

Note of caution: This method works well when the numbers of injuries are not too small. A reasonable rule here is that the number of injuries in each group should be at least five. If it is less, you can use a different statistical approach known as an exact method, which would likely require a computer.

8 Degrees of freedom are used a lot in statistics. We will not go into any detail, but simply point out that in this situation the number of degrees of freedom is one less than the number of groups.


SE is something called the Standard Error and in this case it refers to the Standard Error of ln(RR).

The Rate Ratio = 36.7 / 70 = 0.52, and a calculator shows ln(0.52) = -0.65. You now need to calculate the Standard Error (SE). It is not too complicated. Take the reciprocal (1 divided by the number) of the number of injuries in each time period, add the reciprocals up, and take the square root of the sum. The reciprocals are 1/28 and 1/22. Adding them gives 0.08, and the square root of that is 0.28. With our data, the 95% CI for ln(RR) is then:

-0.65 ± (1.96 x 0.28) = -1.2 to -0.09.

Since this is the CI for ln(RR), you now need to take antilogs to get the CI for the Rate Ratio (RR) itself. (On your calculator, you get antilogs with the button that reads “e^x”.) This gives the CI for RR as 0.30 to 0.92. In other words, you are 95% sure that the true value for the Rate Ratio is between 0.30 and 0.92.
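
As a check on the hand calculation, here is a minimal sketch of the same confidence interval in Python (standard library only):

# 95% confidence interval for the rate ratio (data from Table B.4)
from math import log, sqrt, exp

rate_before, rate_after = 70.0, 36.7      # injuries per 100,000 hours
n_before, n_after = 28, 22                # injury counts

rr = rate_after / rate_before             # 0.52
se = sqrt(1 / n_before + 1 / n_after)     # 0.28
lo = log(rr) - 1.96 * se
hi = log(rr) + 1.96 * se
print(f"RR = {rr:.2f}, 95% CI = {exp(lo):.2f} to {exp(hi):.2f}")   # 0.30 to 0.92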

Calculating a confidence interval for a rate difference

Now let us work out the CI for the RD, the difference in rates. The CI for RD is given by the limits:

RD ± (Z x SE).

You earlier calculated rates per 100,000 hours (not per hour) to avoid lots of zeros after the decimal point. This means you can use time units of 100,000 hours, which makes things a little easier on your calculator. You found earlier that the RD is 33.3 per 100,000 hours. You again need to get the SE - the Standard Error of RD in this case. This time you calculate # injuries / (# time units)² for each of the time periods, add them up and take the square root of the sum.

Thus,

SUM = [28 / 0.4²] + [22 / 0.6²] = 236.11
SE = √SUM = √236.11 = 15.37.

For the 95% CI,

RD ± (1.96 x SE) = 33.33 ± (1.96 x 15.37) = 3.22 to 63.45.

The CIs for RR and RD show that you cannot rule out that the effect could be quite small or very large. Your best estimate of each, though, is still the value you actually found from the data, e.g., 33.3 is the best estimate of RD.
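
A similar sketch, again in Python with the standard library only, reproduces the confidence interval for the rate difference:

# 95% confidence interval for the rate difference, in time units of 100,000 hours
from math import sqrt

injuries   = {"before": 28, "after": 22}
time_units = {"before": 0.4, "after": 0.6}   # 40,000 and 60,000 hours / 100,000

rates = {k: injuries[k] / time_units[k] for k in injuries}          # 70 and 36.7
rd = rates["before"] - rates["after"]                               # 33.3
se = sqrt(sum(injuries[k] / time_units[k] ** 2 for k in injuries))  # 15.37
print(f"RD = {rd:.1f}, 95% CI = {rd - 1.96*se:.2f} to {rd + 1.96*se:.2f}")   # 3.22 to 63.45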

What is the appropriate “before” measurement when you have historical data?

In our example above, we have a situation where the periods of time before and after the introduction of the intervention are similar in length. What do you do in a situation where, for example, you have injury rate data for several years before the intervention, as well as for one year after the intervention’s introduction? Some people would propose calculating the “before” measure from the data of several years. This is generally not recommended, since many things can change over such a long period of time.

On the other hand, the historical data are useful for judging whether or not the most recent rate measurement for the period just before the intervention - your candidate “before” measurement - is typical of the preceding years. If it is atypical, then consider regression-to-the-mean as a possible threat to internal validity. If there is any suggestion from the historical data that a trend over time is occurring, you would be well-advised to use the time series methods described in Section B.4 for your analysis instead of the above test.

B.1.2 Before-and-after design with continuous data

The next illustration is also concerned with before-and-after designs, but in this case the data are continuous. We will refer to a paper by Robins et al. [1990] which used several statistical methods.


The study was of a large U.S. manufacturing firm with 50 plants. The intervention was designed to bring the company into compliance with the U.S. Federal Hazard Communication Standard (HCS), informing workers about work-related hazardous substances, the potential consequences of exposure, detection and protection. The ultimate goal of the HCS is to reduce chemically related occupational illnesses and injuries.

Trainers were trained, and worked with union and management, who jointly developed and implemented the programs at each plant. The evaluation was designed to assess the planning and implementation, attitudes and knowledge, work practices, working conditions and organizational impacts. (The authors did examine changes in rates of illness and injury, but because of a change in the classification system, decided they could draw no conclusions.) Five plants, representing the variety of manufacturing processes, were chosen as sites for data collection. At each plant, data were collected in three phases - phase one, as the training was being completed; phase two, one year later; and phase three, a further year later, i.e., two years after the training.

Although the design had three measurement times, the data were analysed by comparisons of the measures at two times - e.g., phase one with phase two. Information was collected from various sources, including semi-structured interviews, feedback sessions, observations and questionnaires. The questionnaires were given to about 50 hourly paid employees at each of the five plants. One hundred and twenty-five employees answered questions on work practices at both phases one and two. A composite score was calculated as the average response to the questions, which were rated from 1 = never to 4 = always. The hypothesis being tested involved whether the program was effective in changing work practices from phase one to phase two. Statistically, you start with a null hypothesis that the difference in practices between the phases should be zero. The observed mean score for this group was 2.80 at phase one and 2.92 at phase two.

Thus, the observed difference between phases was not zero but rather 0.12. The next step was to see if this difference was statistically significant or not. Since each person’s composite score was the average of responses to several questions, it was reasonable to consider the data as continuous. As well, since the data were paired - each individual provided a pair of scores, with one at each phase - the appropriate statistical test was the paired t-test. This method produces a t-statistic, which can be looked up in a table to determine the p-value - computer printouts give this automatically. The authors reported the p-value to be 0.07. Using a cut point (significance level) of 0.05, the result was thus not statistically significant. Therefore, the null hypothesis could not be rejected.
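
If you have paired scores of your own, a statistical package will run this test directly. The following sketch uses Python with the scipy library (an assumption) and purely illustrative scores; the Robins et al. data are not reproduced here:

# Paired t-test on hypothetical phase-one and phase-two composite scores
from scipy.stats import ttest_rel

phase_one = [2.6, 2.9, 2.8, 3.0, 2.7, 2.8, 2.9, 2.7]   # illustrative values only
phase_two = [2.8, 3.0, 2.9, 3.1, 2.8, 2.9, 3.0, 2.9]

t_stat, p_value = ttest_rel(phase_two, phase_one)       # pairs each person's two scores
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")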

B.2 Analyses with pre-post measures and a control group

B.2.1 Pre-post with control group and rate data

The wrong way to compare pre-post rate changes in two groups

As noted in earlier chapters, you should always use a control group - randomized or non-randomized - if at all possible, since rates before and after an intervention can differ for reasons that have nothing to do with the intervention. Suppose you collect the data in Table B.5. You might think you could do a before-after comparison for each group, using the method shown in Section B.1.1, and see if there was a significant drop in the rate in the intervention group, but a non-significant one in the controls.

The problem is, it is WRONG! To see why, look at Table B.5. In the intervention group, the “before” rate is 7.3 injuries per 100 FTE workers. The “after” rate is 4.1. Calculations, as shown in Section B.1.1, might find p = 0.048, i.e., just statistically significant. In the controls, the “before” rate is 7.7, and the “after” rate is 4.6, giving p = 0.052, not quite significant. This could lead you to think that the intervention is effective.


In fact, the appropriate statistical test compares the before-after difference in the intervention group (3.2) with the difference in the controls (3.1). (You are examining a difference of differences!) If you do the calculations (we show you how in another example below), you would find that it is NOT significant, showing you cannot claim the intervention works - something else likely led to the drop in rates in both groups.

Now there is another way to think of these data. You could look at the relative (or proportional) values of the rates as we did in Section B.1.1. The rate ratios are 56% in the intervention group and 60% in the controls. (You could also say that there was a 44% drop in rate in the intervention group and 40% in the controls.) Some statisticians view this as a better way to consider the data. One reason for this is that if the groups have quite different initial rates, then one of the groups can improve a lot more than the other based on the difference in rates. For example, if Group One has a Before rate of 4.3 and an After rate of 1.2, a difference of 3.1, then if Group Two’s Before rate is 2.9, its rate cannot possibly be reduced as much as Group One’s, even if it has no injuries at all! This problem does not apply to relative changes, which cannot exceed 100%.

Comparing the pre-post rate changes in two groups using rate ratios

We now show you the two methods for comparing the change in an intervention group versus the change in a control group. The statistical methods for analyzing rate differences and rate ratios are not the same. First, let us look at rate ratios. We will use the data in Table B.6 and show you how to calculate the test statistic, z, using rate ratios.

We start by going through some calculations similar to those in Section B.1.1. Get the ln(RR) for each group. (We use subscripts 1 and 2 to represent the two groups.)

ln(RR1) = ln(5.74/6.00) = -0.043,
ln(RR2) = ln(2.03/6.40) = -1.149.

Then calculate the difference between these, D:

D = ln(RR1) - ln(RR2) = -0.043 - (-1.149) = 1.106.

You need the Standard Error (SE) of this difference as well. Simply take the reciprocal of the number of injuries in each of the four pre/post, control/intervention categories, add them up and take the square root of the sum. The reciprocals are 1/49, 1/46, 1/26, and 1/8. Adding them gives 0.206, and the square root of that is 0.453.

Now calculate the test statistic, z:

z = D / SE = 1.106 / 0.453 = 2.44.

When z is bigger than 1.96 - the critical value from the normal distribution, when α = 0.05 - as it is here, the difference in (the ln of) the rate ratios is statistically significant. Thus, the data suggest the intervention works.
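
The following Python sketch (standard library only) reproduces this rate ratio comparison for the data in Table B.6:

# Comparing the change in rate ratios: control (Group 1) vs. intervention (Group 2)
from math import log, sqrt

# (injuries, rate per 100,000 hours) for each period
control      = {"pre": (49, 6.00), "post": (46, 5.74)}
intervention = {"pre": (26, 6.40), "post": (8, 2.03)}

ln_rr1 = log(control["post"][1] / control["pre"][1])            # about -0.043
ln_rr2 = log(intervention["post"][1] / intervention["pre"][1])  # about -1.149
d = ln_rr1 - ln_rr2

se = sqrt(1/49 + 1/46 + 1/26 + 1/8)                             # about 0.453
print(f"z = {d / se:.2f}")      # close to the hand-calculated value of 2.44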


Table B.5: Injury rate data from a pre-post with control group evaluation design
(injuries per 100 FTE workers)

Period of Measurement     Intervention Group   Control Group
Pre-intervention                  7.3                7.7
Post-intervention                 4.1                4.6


Comparing the pre-post rate changes in two groups using rate differences

Let’s do the calculations based on the difference in rates, rather than the ratio. You used rates per 100,000 hours so you can use time units of 100,000 hours for the rate differences.

RD1 = 6.00 - 5.74 = 0.26,
RD2 = 6.40 - 2.03 = 4.37.

And the difference between them is calculated by subtracting the RD for the control group from the RD for the intervention group.

D = RD2 - RD1 = 4.37 - 0.26 = 4.11.

Again you need the SE of the difference. In this case, calculate # injuries / (# time units)² for each of the four categories, add them up and take the square root of the sum:

SUM = 49/8.17² + 46/8.01² + 26/4.06² + 8/3.94² = 3.54
SE = √SUM = √3.54 = 1.88.

As before, calculate the test statistic, z:

z = D / SE = 4.11 / 1.88 = 2.19.

This z is bigger than 1.96, so this analysis also provides evidence that the intervention works.
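
The corresponding sketch for the rate difference comparison, again using the Table B.6 data (Python standard library only), is:

# Comparing the change in rate differences, in time units of 100,000 hours
from math import sqrt

# (injuries, time units) for each group and period
control      = {"pre": (49, 8.17), "post": (46, 8.01)}
intervention = {"pre": (26, 4.06), "post": (8, 3.94)}

rd1 = 6.00 - 5.74                 # control group change: 0.26
rd2 = 6.40 - 2.03                 # intervention group change: 4.37
d = rd2 - rd1                     # 4.11

se = sqrt(sum(inj / t**2 for inj, t in list(control.values()) + list(intervention.values())))
print(f"z = {d / se:.2f}")        # close to the hand-calculated value of 2.19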

Notice that the z values calculated by the rate ratio and rate difference methods are not quite the same. This is because the analyses have actually tested different hypotheses. If the pre-intervention values are different, then it is possible for the hypothesis about rate ratios to be true, but that for rate differences to be false; or for the rate difference hypothesis to be true, but the rate ratio one to be false. For example, the pre and post rates for controls could be 12 and 9 respectively; those for the intervention group might be 6 and 3. The rate difference is the same, but the ratios are quite different - 75% and 50%. Likewise, suppose the control group’s rates pre and post were 12 and 6, respectively, compared with 6 and 3 for the intervention group. The rate ratios are the same, but the rate differences are not - they are 6 and 3.

In practice, if the pre-intervention rates are very different, you should be concerned about a potential selection threat to internal validity. When you have chosen the controls well, the pre-intervention rates in the two groups should be similar, in which case the two analyses should give reasonably similar conclusions. If you are concerned about moderate differences between the groups, consult a statistician - there are ways to make adjustments to your analysis through multiple regression analyses.


Table B.6: Injury rate data from a pre-post with control group evaluation design

                                           Pre-intervention   Post-intervention
Control        Injuries                           49                  46
(Group 1)      Hours                         817,000             801,000
               Rate per 100,000 hrs.            6.00                5.74

Intervention   Injuries                           26                   8
(Group 2)      Hours                         406,000             394,000
               Rate per 100,000 hrs.            6.40                2.03



B.2.2 Pre-post with control group and continuous data

Section B.1.2 demonstrated how to compare before and after scores on a continuous scale for a single group. Essentially, you take the difference score (pre - post) for each person and work with those values in the analysis. With two groups (intervention and control), you can do the same thing: each person is reduced to a single difference score, and each group then has a set of these difference scores. If the intervention is ineffective, then there should be no difference in the means of these difference scores between the groups. The way to see if any difference is statistically significant or not is to do another type of t-test. You have two groups; so this is called the two-sample t-test. It is in any computer package or textbook.
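
For example, a minimal sketch in Python with the scipy library (an assumption), using purely illustrative difference scores, would be:

# Two-sample t-test on pre-minus-post difference scores, one list per group
from scipy.stats import ttest_ind

diff_intervention = [0.3, 0.1, 0.4, 0.2, 0.5, 0.3, 0.2]    # illustrative values only
diff_control      = [0.1, 0.0, 0.2, -0.1, 0.1, 0.0, 0.1]

t_stat, p_value = ttest_ind(diff_intervention, diff_control)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")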

What do you do when there are differences in the characteristics of participants in the intervention and control groups that might influence how they respond to an intervention? In such a situation, apply a more sophisticated technique that allows a correction for these differences, such as some form of multiple regression.

B.3 Analyses for designs with after-only measures and a control group

When you only obtain measurements after the intervention and not beforehand, you can have a problem if the subjects are not randomized to intervention or control group. This is because the groups may have been quite different at the start. For this reason, in Chapter 4, we recommended “after”-only designs only when using groups formed through randomization. We therefore expect the statistical methods of this section to be used primarily in such situations.

However, they can sometimes be used - with caution - in the absence of randomization, especially if there is information on potential confounders post-intervention. If so, statistical techniques are available that can use this extra data. As you can imagine, they are too complicated for this appendix - you will have to talk to a statistician - but we mention some of them here.

B.3.1 After-only measurements with two groups and rate data

The same statistical test, χ², is used as in the before-and-after design with rate data (Section B.1.1), but because you have two different groups of workers, the cautionary footnote no longer applies.

B.3.2 After-only measurements with several groups and rate data

Sometimes you might have several groups, for example one control and others in which different interventions are carried out (Table B.7). This time our hypothesis is that none of the interventions work; so we expect the rates in all the groups to be the same. Notice that if even one of the interventions works, our hypothesis is contradicted. A simple approach might be to compare each intervention group with the control group, using the test indicated in the preceding section (B.3.1). However, this would be invalid because it involves multiple testing.9 Instead, we use an approach similar to the one for two groups.

Again, use the overall rate to estimate the expected number of injuries in each group, assuming the overall rate applies to all groups. For example, for the control group, the expected number of injuries = (150 / 250,000) x 60,000 = 36.0. The equivalent values for the other three groups are 42.0, 30.0, and 42.0.


9 Multiple testing means you are doing more than one test at the same time. If you do each one using an alpha of 0.05, the probability that at least one of the tests is significant is more than 0.05, sometimes considerably more. You have to make an adjustment or use a different method, as is done here, to keep the overall alpha at 0.05.


Again, calculate X²:

X² = Σ (Observed - Expected)² / Expected.

This time there are 4 quantities to add up, each corresponding to one of the four groups:

X² = [(43 - 36)²/36] + [(47 - 42)²/42] + [(36 - 30)²/30] + [(24 - 42)²/42] = 10.87.

Again, compare this with the chi-squared distribution. The number of degrees of freedom is one less than the number of group measures being compared, i.e., 4 - 1 = 3. A chi-squared table shows the 5% significance point is 7.81. Our X² is bigger than this, so again, it is statistically significant, i.e., p < 0.05. (The actual p-value is 0.012.) Note that this method does not work properly if the number of injuries is small (less than 5) in one or more of the groups. In such cases, you need to use an exact method.
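
The whole calculation for Table B.7 can be reproduced with a short Python sketch (assuming the scipy library is available):

# Several-group comparison of injury rates (data from Table B.7)
from scipy.stats import chi2

observed = [43, 47, 36, 24]                    # control plus three intervention groups
hours    = [60_000, 70_000, 50_000, 70_000]

overall_rate = sum(observed) / sum(hours)      # 60 per 100,000 hours
expected = [overall_rate * h for h in hours]   # 36, 42, 30, 42

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
p_value = chi2.sf(x2, df=len(observed) - 1)    # 3 degrees of freedom
print(f"X2 = {x2:.2f}, p = {p_value:.3f}")     # X2 = 10.87, p = 0.012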

B.3.3 After-only measurements with two groups and continuous data

If there are two groups with after-only measures, you have the same situation as described in Section B.2.2, with a series of scores for each of the two groups. You may want to see if any difference is statistically significant. Again do a two-sample t-test.

B.3.4 After-only measurements with several groups and continuous data

Survey questions in the Robins et al. study described earlier also asked about the extent to which workers followed certain work practices. The responses for each item ranged from 1 = “never”/“almost never” to 4 = “always”, and an average was calculated to give a scale score. For each plant, the mean scores of individuals were calculated. These means ranged from 2.42 to 3.06.

The hypothesis tested was that there was no difference in work practices between plants. Once again, the data were treated as continuous, but this time with the means of several groups compared. Whereas the two-sample t-test applied to a comparison of the means of two groups, the generalization of this to several groups is called a one-way Analysis of Variance (ANOVA). ANOVA is actually a generic name for a range of procedures, of which this is a particular example. The method produces an F-statistic, and a subsequent p-value. In this case, p = 0.004. Thus, the result is statistically significant, and we conclude that the differences in observed means are not due simply to chance.
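
Any statistical package will run a one-way ANOVA. As an illustration only (the Robins et al. data are not reproduced here), a Python sketch using the scipy library looks like this:

# One-way ANOVA comparing mean scale scores across several plants
from scipy.stats import f_oneway

plants = [
    [2.4, 2.5, 2.3, 2.6],    # illustrative scores for plant 1
    [2.9, 3.1, 3.0, 2.8],    # plant 2
    [2.7, 2.6, 2.8, 2.7],    # plant 3
    [3.0, 3.1, 2.9, 3.2],    # plant 4
    [2.5, 2.4, 2.6, 2.5],    # plant 5
]
f_stat, p_value = f_oneway(*plants)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")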

As with the 2 x 5 table discussed earlier, the method does not tell us which means are different from others.


Table B.7: Example injury data from an experimental design with post-only measurements and multiple intervention groups

                              Control   Intervention 1   Intervention 2   Intervention 3     Total
Injuries                         43            47               36               24            150
Hrs.                         60,000        70,000           50,000           70,000        250,000
Rate per 100,000 hrs.          71.7          67.1               72             34.3             60
Expected number of injuries      36            42               30               42


In fact, with five groups, there are ten possible comparisons between pairs of groups. This raises the issue of multiple testing, described earlier. You should only consider testing individual pairs if the “overall” F-test is significant. Various textbooks demonstrate how these pairwise comparisons, allowing for multiple testing, are made10.

In a quasi-experimental (and sometimes in a truly experimental) setting, it is important to remove the effects of any characteristics of individuals that may affect the outcome measure. For example, older workers may behave more safely on the job. If this is true, and the intervention group contains more older workers, it will likely have better outcomes following the intervention than the comparison group, regardless of the value of the intervention. Likewise, if the comparison group is older, we may fail to see a real effect of the intervention. Statistically, you can allow for the age effect, to obtain a more accurate estimate of the impact of the intervention. (The technical term for these adjustment variables is confounders.)

Robins et al., in the example above, noted known differences in the people surveyed in each plant. The respondents differed on several variables: occupation, degree of job hazard, age, education level, and number of years in the plant. There was obvious concern that if work practices varied by age and more older people worked at some plants, then this disparity (and others like it) - rather than the training program - could have accounted for the significant differences. This would threaten internal validity through a selection bias. A statistical approach that combines ANOVA with allowance for these confounders (covariates) is needed. The answer is a technique called analysis of covariance (often abbreviated as ANCOVA). The authors reported that in their study the new analysis did not change the basic conclusions about differences in work practices.
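
As an illustration of the kind of model involved (not the authors’ actual analysis), here is a sketch of an ANCOVA in Python using the statsmodels library, which is an assumption; the columns “score”, “plant” and “age” and all the values are hypothetical:

# ANCOVA: compare plant means on a work-practice score, adjusting for age
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "score": [2.4, 2.6, 3.0, 2.9, 2.7, 2.8, 3.1, 2.5],   # illustrative data only
    "plant": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "age":   [25, 40, 31, 52, 45, 29, 38, 50],
})
model = ols("score ~ C(plant) + age", data=df).fit()      # plant effect adjusted for age
print(sm.stats.anova_lm(model, typ=2))                     # F-tests for plant and age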

B.4 Multiple measurements over time

Komaki et al. [1980] reported the effect of an intervention involving training and feedback. Concerned that training alone was not sufficient to produce behavioral change, a “multiple baseline design across groups” was applied with four sections of a city’s vehicle maintenance division. (This design was described in Section 4.2.3.) Five “conditions”, each lasting for several weeks, were used: 1) Baseline; 2) Training only I; 3) Training and Feedback I; 4) Training only II; 5) Training and Feedback II. In other words, after a Baseline period, there was a period when only training was given (Training only I), followed by one that also included feedback (Training and Feedback I). Feedback was then dropped (Training only II), before being reinstated (Training and Feedback II). Since the study was done in four sections of the workplace, the times for the changeover from one condition to another differed. (This allowed the authors to check on whether or not other changes at the work-site unrelated to the intervention might have influenced behavior.)

The main outcome measure was the safe performance of tasks, measured by behavioral observations. Observations were made several times a week and plotted to give a weekly average of the percentage of incidents performed safely in each maintenance section. Since the conditions each lasted from five to eleven weeks, there were multiple measures before and during each condition, with approximately 45 observations for each section over the course of the evaluation. Such data are a form of repeated measures known as time series data.

The authors wanted to allow for general trends in safe behavior, as well as see if the change from one condition to the next led to a switch in behavior. The appropriate method for this form of time series is known as ARIMA - autoregressive integrated moving averages.


10 For example, Kleinbaum et al. [1988].


Now the name might seem enough to put you off statistics for life. What is important is that it takes account of correlations from measurement to measurement (autocorrelation) - if the percentage was high in one week, it was likely to have been relatively high the next week. The data are too complex to describe in detail here. Nevertheless, the general message from the evaluation was clear. Without feedback as well, training showed relatively little effect.

Time series analyses are appropriate in a situation where there are repeated measurements on the same subjects, e.g., when taking behavioral measurements. They are also appropriate when there are trends in workplace data due to changes in the business cycle, weather, etc.
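
To give a flavour of what such an analysis can look like in software (a simplified sketch, not the Komaki et al. analysis), an interrupted time series can be fitted in Python with the statsmodels library, which is an assumption, using an indicator for the intervention period as an explanatory variable:

# Interrupted time-series sketch: AR(1) model with an intervention indicator
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

weekly_pct_safe = np.array([62, 60, 65, 63, 61, 64, 75, 78, 80, 77, 82, 79], dtype=float)  # illustrative
intervention    = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=float)              # 0 = baseline weeks

model = ARIMA(weekly_pct_safe, exog=intervention, order=(1, 0, 0))   # AR(1) handles autocorrelation
results = model.fit()
print(results.summary())    # the coefficient on the indicator estimates the shift after the change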

There are occasions, perhaps after an intervention is in place and running well, when the injury rate is expected to remain stable over time. Yet because a single plant may experience only a few injuries per month, the monthly rate may vary considerably simply because of random variability. To check if the results for a single month are significantly out-of-line with previous experience, you can use control charts. They are used in quality control settings, perhaps to make sure that the size of ball bearings is within a small range. They can be readily adapted to workplace safety.

As an example, suppose that on average there are three injuries per month. (We assume the number of hours worked is constant; if not, you can use the average rate with its standard deviation, a measure of month-to-month variability in the rate.) Sometimes, there will be only one or two injuries in a month, or maybe 4 or 5. In fact, the probability of only one in any month is about 10%, while the probability of five is about 10%. Even 6 in a month will happen about 5% of the time, so might well occur at some point in a two-year period. This means you shouldn’t be too quick to push the panic button when one month’s figures are somewhat higher than normal. But by the same token, you shouldn’t be smug if in one month there are no injuries at all.

Control chart methodology will alert you when the number of injuries in a month is so high that there seems to be a real problem, or when the pattern over two or three months is a cause for concern.
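
As a sketch of the underlying arithmetic, assuming monthly injury counts follow a Poisson distribution with a long-run mean of three (an assumption the scipy library, also assumed here, makes easy to work with), you can compute the relevant tail probabilities yourself:

# How surprising is a high monthly injury count, if the long-run mean is 3?
from scipy.stats import poisson

mean_per_month = 3

# probability of seeing k or more injuries in a month, for a few values of k
for k in range(5, 10):
    print(k, round(poisson.sf(k - 1, mean_per_month), 3))   # sf(k-1) = P(count >= k)

# smallest monthly count whose chance of occurring by bad luck alone is under 1%
upper_limit = int(poisson.ppf(0.99, mean_per_month)) + 1
print("investigate if a month reaches", upper_limit, "injuries")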


Reporting your evaluation results

Appendix C

C.1 Introduction

C.2 Evaluation report
C.2.1 Structure of the report
C.2.2 Audience specificity
C.2.3 Clear language

C.3 Communicating beyond the report

C.4 Summary

C.1 Introduction

Most of this guide has focused on the methodology required to do a good intervention effectiveness evaluation. This appendix focuses on what to do with the results of the evaluation. Written reports are the usual way to summarize them, even when an evaluation is done in-house. Not only does this provide a record for the organization, the process of writing the report also encourages a critical examination and synthesis of the evaluation activities and results. We will describe the sections that people typically include in a report. We will also discuss how your communication strategy should extend beyond the report itself.

C.2 Evaluation report

C.2.1 Structure of the report

Table C.1 lists what you would typically include in a report. First is the abstract/executive summary, which incorporates the main points of the introduction, methods, results, discussion and conclusion sections. This is typically one or two pages in length. This summary is important since, for many readers, it might be the only section they read in its entirety.

• The introduction presents the goals of the intervention, the intervention itself and the general approach taken in the evaluation.

• Methods/procedures then describe the evaluation methods in detail.

• Results present the data gathered through the evaluation which address the evaluation questions. This section should present the results not only in text, but also through figures, graphs and tables. These visual summaries facilitate uptake of the information for many readers.

• Many reports include a discussion section, which should lead the reader from the results to the conclusion. Whereas the results section gives a logical presentation of the results, the discussion synthesizes and interprets them with reference to current theory and understanding. The discussion section is also the place to consider threats to the internal validity of the evaluation, including any reasoning based on theory or data from outside of the evaluation.

• The conclusions summarize what is concluded from the data and, possibly, any resulting recommendations. Conclusions should address the main evaluation questions.

• In fact, as much as possible, the entire report should be constructed so that the relationship of the various methods and results sub-sections to the evaluation questions and conclusions is clear.


Table C.1 What to include in the evaluation report11

Sections of report              Content of sections

Abstract/executive summary      • Overview of the program and evaluation
                                • General results, conclusions and recommendations

Introduction                    • Purpose of the evaluation
                                • Program and participant description (including staff, materials, activities, procedures, etc.)
                                • Goals and objectives
                                • Evaluation questions

Methods/procedures              • Design of the evaluation
                                • Target population
                                • Instruments (e.g., questionnaire)
                                • Sampling procedures
                                • Data collection procedures
                                • Validity and reliability
                                • Limitations
                                • Data analyses procedures

Results                         • Description of findings from data analyses
                                • Answers to evaluation questions
                                • Charts and graphs of findings

Discussion                      • Explanation of findings
                                • Interpretation of results
                                • Consideration of threats to internal validity

Conclusions/recommendations     • Conclusions about program effectiveness
                                • Program recommendations

11 Table from McKenzie and Smeltzer, Planning, Implementing, and Evaluating Health Promotion Programs: A Primer, 2nd ed. Copyright (c) 1997 by Allyn & Bacon. Adapted by permission.


C.2.2 Audience specificity

One of the key principles in communicating a report is to tailor it to the audience. It should be written with the audience’s level of education and interests in mind. Key messages should be formulated in the conclusion and abstract so that they answer the questions most pertinent to the audience. Conceivably you might have more than one report - preparing both technical and lay person versions is common.

C.2.3 Clear language

The report should be written in clear language if it is intended for a non-technical audience. This means it will be quite different from the style found in many academic publications. Guidelines for clear language have been developed by many organizations12. The following is a compilation from some of these.


12 For example: Baldwin R [1990]. Clear writing and literacy. Toronto: ON Literacy Coalition; Canadian Labour Congress [1999]. Making it clear: Clear language for union communications. Ottawa: Canadian Labour Congress; Gowers E (revised by Greenbaum S, Whitcut J) [1986]. The complete plain words. London: HMSO; Ministry of Multiculturalism and Citizenship [1991]. Plain language clear and simple. Ottawa: Ministry of Supply and Services.

Guidelines for writing in clear language

Overall
• Write with your audience’s needs, knowledge and abilities in mind

Document organization
• Include a table of contents for longer documents
• Divide the document into sections of related information, using headings and sub-headings
• Include detailed or technical material in an Appendix

Paragraphs
• Limit each paragraph to one idea
• Avoid paragraphs of more than five sentences
• Consider using point form for a list of related items
• Use left justification, but not right justification; i.e., leave a ragged right margin

Sentences
• Limit each sentence to one point
• Sentences should be no more than 20 words on average and, typically, not exceed 25 to 30 words
• Use a subject-verb-object order for most sentences

Words
• Avoid jargon and technical words; explain them when used
• Eliminate unnecessary words (e.g., replace “in view of the fact” with “because”)
• Use the active voice instead of the passive voice (e.g., replace “The requirement of the workplace was that employees...” with “The workplace required employees...”)
• Avoid chains of nouns (e.g., resource allocation procedures)

Font
• Use a serif style of font (with hooks on the end of characters) instead of a sans serif style
• Do not use all upper case (i.e., all capital) letters for anything longer than a brief statement
• 12 point type is recommended for the main text


C.3 Communication beyond the report

Communicating the evaluation results involves more than producing the evaluation report. Relationships between would-be users of the evaluation results and the evaluators should be established early on, because the successful uptake of results often depends on a few key individuals who understand and support the evaluation. For this reason we recommend forming an evaluation committee at the outset that includes key stakeholders (Chapter 2). This committee should be involved at all stages of the evaluation: development of questions; selection of design and methodology; and interpretation of results. An ongoing engagement of stakeholders fosters trust, understanding and ownership of the results. It also helps ensure that the results are appropriate to their needs.

When the final results are released, you will ideally include several interactive means of presenting them, such as larger verbal presentations or small group meetings. Interaction of the audience with the presenters should be encouraged in both cases. Make sure the key messages of the report are emphasized and give people the opportunity to voice any doubts or lack of understanding. A variety of written, verbal and visual presentations might be needed for various audiences.

C.4 Summary

Communication of the evaluation results involves, at the very least, a clear, well-organized, audience-specific evaluation report. Other strategies, including the ongoing engagement of an appropriately structured evaluation committee, can further the use of an evaluation’s results.


Bibliography

Aday LA [1996]. Designing and conducting health surveys: a comprehensive guide. 2nd ed. San Francisco: Jossey-Bass.

Altman DG [1991]. Practical statistics for medical research. London, New York: Chapman & Hall.

Armitage P, Berry G [1994]. Statistical methods in medical research. 3rd ed. Oxford, Boston: Blackwell Scientific Publications.

Becker MH, Haefner KP, Kasl SV, Kirscht JP, Maiman LA, Rosenstock IM [1977]. Selected psychosocial models and correlates of individual health-related behaviors. Med Care 15:27-46.

Cherry N [1995]. Evaluation of preventive measures. In: McDonald C, ed. Epidemiology of work related diseases. London: BMJ.

Clarke GM [1980]. Statistics & experimental design. London: Arnold.

Cohen J [1988]. Statistical power analysis for the behavioral sciences. New York: Academic Press.

Colton T [1974]. Statistics in medicine. Boston: Little, Brown & Company.

Conrad KM, Conrad KJ, Walcott-McQuigg J [1991]. Threats to internal validity in worksite health promotion program research: common problems and possible solutions. Am J Health Promot 6:112-122.

Cook TD, Campbell DT [1979]. Quasi-experimentation: design & analysis issues for field settings. Chicago: Rand McNally College Publishing Co.

Cook TD, Campbell DT, Peracchio L [1990]. Quasi experimentation. In: Dunnette JD, Hough LM, eds. Handbook of industrial and organizational psychology. 2nd ed., vol. 1. Palo Alto, CA: Consulting Psychologists Press, Inc., pp. 491-576.

Cresswell JW [1994]. Research design. Qualitative & quantitative approaches. Thousand Oaks, CA, London, New Delhi: Sage Publications.

Dignan MB [1995]. Measurement and evaluation of health education. 3rd ed. Springfield, Illinois: Charles C. Thomas.

Drummond MF, Stoddart GL, Torrance GW [1994]. Methods for the economic evaluation of health care programmes. Oxford, Toronto, New York: Oxford University Press.

Earp JA, Ennett ST [1991]. Conceptual models for health education research and practice. Health Educ Res 6:163-171.

Fleiss JL [1981]. Statistical methods for rates and proportions. New York: John Wiley & Sons.


Freedman D, Pisani R, Purves R [1998]. Statistics. 3rd ed. WW Norton & Co.

Glassock DJ, Hansen ON, Rasmussen K, Carstensen O, Lauritsen J [1997]. The West Jutland study of farm accidents: a model for prevention. Safety Sci 25:105-112.

Glendon I, Booth R [1995]. Risk management for the 1990s: Measuring management performance in occupational health and safety. J Occup Health Safety - Aust NZ 11:559-565.

Gold RG, Siegel JE, Russell LB, Weinstein MC, eds. [1996]. Cost-effectiveness in health and medicine. New York: Oxford University Press.

Goldenhar LM, Schulte PA [1994]. Intervention research in occupational health and safety. J Occ Med 36:763-775.

Goldenhar LM, Connally LB, Schulte PA, eds. [1996]. Intervention research in occupational health and safety: science, skills and strategy. Special Issue of Am J Ind Med 29(4).

Green LW, Lewis FM [1986]. Measurement and evaluation in health education and health promotion. California: Mayfield Publishing Co.

Greene JC, Caracelli VJ, Graham WF [1989]. Toward a conceptual framework for mixed-method evaluation designs. Educ Eval & Policy Analysis 11:255-274.

Guastello SJ [1993]. Do we really know how well our occupational accident prevention programs work? Safety Sci 16:445-463.

Haddix AC, Teutsch SM, Shaffer PA, Dunet DO [1996]. Prevention effectiveness: a guide to decision analysis and economic evaluation. Oxford, New York: Oxford University Press.

Hale AR [1984]. Is safety training worthwhile? J Occup Accid 6:17-33.

Hale AR, Glendon AI [1987]. Individual behavior in the control of danger. Amsterdam, New York: Elsevier.

Hale AR, Guldenmund F, Bellamy L [1999]. Annex 2: management model. In: Bellamy LJ, Papazoglou IA, Hale AR, Aneziris ON, Ale BJM, Morris MI & Oh JIH. I-Risk: Development of an integrated technical and management risk control and monitoring methodology for managing and quantifying on-site and off-site risks. Den Haag, Netherlands, Report to Ministry of Social Affairs and Employment, European Union, Contract ENVA-CT96-0243.

Hauer E [1980]. Bias-by-selection: Overestimation of the effectiveness of safety countermeasures caused by the process of selection for treatment. Accid Anal Prev 12:113-117.

Hauer E [1986]. On the estimation of the expected number of accidents. Accid Anal Prev 18:1-12.

Hauer E [1992]. Empirical Bayes approach to the estimation of “unsafety”: the multivariate regression method. Accid Anal Prev 24:457-477.


Hawe P, Degeling D, Hall J [1990]. Evaluating health promotion: a health worker’s guide. Sydney, Philadelphia, London, Australia: MacLennan & Petty.

Healey JF [1984]. Statistics: a tool for social research. Belmont: Wadsworth.

Hugentobler MK, Israel BA, Schurman SJ [1992]. An action research approach to workplace health: integrating methods. Health Educ Q 19:55-76.

Johnston JJ, Cattledge GTH, Collins JW [1994]. The efficacy of training for occupational injury control. Occup Med 9:147-158.

Kjellén U [1984]. The role of deviations in accident causation and control. J Occup Accidents 6:117-126.

Kjellén U [2000]. Prevention of accidents through experience feedback. London, New York: Taylor & Francis.

Kleinbaum DG, Kupper LL, Muller KE [1988]. Applied regression analysis and other multivariable methods. 2nd ed. Boston: PWS-Kent.

Komaki JL, Jensen M [1986]. Within-group designs: an alternative to traditional control-group designs. In: Cataldo MF, Coates TJ, eds. Health and industry: a behavioral medicine perspective. New York, Toronto, Chichester, Brisbane, Singapore: John Wiley & Sons.

Komaki J, Barwick KD, Scott LR [1978]. A behavioral approach to occupational safety: pinpointing and reinforcing safe performance in a food manufacturing plant. J Appl Psychol 63:434-445.

Komaki J, Heinzmann AT, Lawson L [1980]. Effect of training and feedback: component analysis of a behavioral safety program. J Appl Psychol 65:261-270.

Krause TR [1995]. Employee-driven systems for safe behavior: integrating behavioral and statistical methodologies. New York: Van Nostrand Reinhold.

Laitinen H, Marjamaki M, Paivarinta K [1999a]. The validity of the TR safety observation method on building construction. Accid Anal Prev 31:463-472.

Laitinen H, Rasa P-L, Resanen T, Lankinen T, Nykyri E [1999b]. The ELMERI observation method for predicting the accident rate and the absence due to sick leaves. Am J Ind Med 1(Suppl):86-88.

Lipsey MW [1990]. Design sensitivity. Newbury Park, CA: Sage Publications.

Mason ID [1982]. An evaluation of kinetic handling methods and training [PhD thesis]. Birmingham: University of Aston.

McAfee RB, Winn AR [1989]. The use of incentives/feedback to enhance work place safety: a critique of the literature. J Safety Res 20:7-19.


McKenzie JF, Smeltzer JL [1997]. Planning, implementing, and evaluating health promotion programs: a primer. 2nd ed. Boston: Allyn and Bacon.

Menckel E, Carter N [1985]. The development and evaluation of accident prevention routines: a case study. J Safety Res 16:73-82.

Miles MB, Huberman AM [1994]. Qualitative data analysis. 2nd ed. Thousand Oaks, CA, London, New Delhi: Sage Publications.

Mohr DL, Clemmer DI [1989]. Evaluation of an occupational injury intervention in the petroleum drilling industry. Accid Anal Prev 21:263-271.

Morgan DL [1998]. Practical strategies for combining qualitative and quantitative methods: applications to health research. Qual Health Res 8:362-376.

Morgan DL, Krueger RA [1998]. The focus group kit. Thousand Oaks, CA, London, New Delhi: Sage Publications.

Needleman C, Needleman ML [1996]. Qualitative methods for intervention research. Am J Ind Med 29:329-337.

Norman GR, Streiner DL [1994]. Biostatistics: the bare essentials. St. Louis: Mosby.

Orgel DL, Milliron MJ, Frederick LJ [1992]. Musculoskeletal discomfort in grocery express checkstand workers. J Occup Med 34:815-818.

Patton MQ [1986]. Utilization-focused evaluation. 2nd ed. Beverly Hills: Sage.

Patton MQ [1987]. How to use qualitative methods in evaluation. Newbury Park, CA, London, New Delhi: Sage Publications.

Patton MQ [1990]. Qualitative evaluation and research methods. 2nd ed. Newbury Park, CA: Sage Publications.

Pekkarinen A, Anttonen H, Pramila S [1994]. Accident prevention in reindeer herding work. Arctic 47:124-127.

Reason J [1990]. Human error. Cambridge: Cambridge University Press.

Rivara FP, Thompson DC, eds. [2000]. Systematic reviews of strategies to prevent occupational injuries. Special issue of Am J Prev Med 18(4)(Suppl).

Robins TG, Hugentobler MK, Kaminski M, Klitzman S [1990]. Implementation of the federal hazard communication standard: does training work? J Occup Med 32:1133-1140.


Rossi PH, Berk RA [1991]. A guide to evaluation research theory and practice. In: Fisher A, Pavlova M, Covello V, eds. Proceedings of the Evaluation and Effective Risk Communication Workshop. Interagency Task Force on Environmental Cancer and Heart and Lung Disease Committee on Public Education, EPA/600/9-90-054, pp. 205-254.

Rush B, Ogborne A [1991]. Program logic models: expanding their role and structure for program planning and evaluation. Can J Program Eval 6:95-106.

Siegel S, Castellan NJ Jr. [1988]. Nonparametric statistics for the behavioral sciences. 2nd ed. New York: McGraw-Hill.

Sizing up safety: how to measure where your organization has been, where it’s at and where it’s going [1995]. Occup Health Safety 11(Mar/Apr):54-60.

Steckler A, McLeroy KR, Goodman RM, Bird ST, McCormick L [1992]. Toward integrating qualitative and quantitative methods: an introduction. Health Educ Q 19:1-8.

Stewart AL [1990]. Psychometric considerations in functional status instruments. In: Lipkin M Jr., ed. Functional status measurement in primary care. New York: Springer-Verlag.

Streiner DL, Norman GR [1989]. Health measurement scales: a practical guide to their development and use. Oxford: Oxford University Press.

Swinscow TDV [1978]. Statistics at square one. London: British Medical Association.

Tarrants WE [1987]. How to evaluate your occupational safety and health program. In: Slote L, ed. Handbook of occupational safety and health. New York: John Wiley & Sons, ch. 8.

Vojtecky MA, Schmitz MF [1986]. Program evaluation and health and safety training. J Safety Res 17:57-63.

Walsh NE, Schwartz RK [1990]. The influence of prophylactic orthoses on abdominal strength and low back injury in the workplace. Am J Phys Med Rehabil 69:245-250.

Webb GR, Redman S, Wilkinson C, Sanson-Fisher RW [1989]. Filtering effects in reporting work injuries. Accid Anal Prev 21:115-123.

Weinberg SL, Goldberg RP [1990]. Statistics for the behavioral sciences. Cambridge: Cambridge University Press.

Weiss CH [1988]. Evaluation for decisions: is anybody there? Does anybody care? Eval Practice 9:5-19.

Yin RK [1984]. Case study research: design and methods. Applied social research methods series. Vol. 5. Beverly Hills, London, New Delhi: Sage Publications.

Zwerling C, Daltroy LH, Fine LJ, Johnston JJ, Melius J, Silverstein BA [1997]. Design and conduct of occupational injury intervention studies: a review of evaluation strategies. Am J Ind Med 32:164-179.
