
DOCUMENT RESUME

ED 332 636 HE 024 615

AUTHOR        Klitzner, Michael; Stewart, Kathryn
TITLE         Evaluating Faculty Development and Clinical Training Programs in Substance Abuse: A Guide Book.
INSTITUTION   Pacific Inst. for Research and Evaluation, Walnut Creek, CA.
SPONS AGENCY  National Inst. on Alcohol Abuse and Alcoholism (DHHS), Rockville, Md.; National Inst. on Drug Abuse (DHHS/PHS), Rockville, Md.
REPORT NO     RP0778
PUB DATE      Jun 90
NOTE          33p.
PUB TYPE      Guides - Non-Classroom Use (055) -- Reports - Descriptive (141)
EDRS PRICE    MF01/PC02 Plus Postage.
DESCRIPTORS   Alcohol Abuse; Allied Health Occupations Education; *Clinical Teaching (Health Professions); Drug Abuse; *Evaluation Methods; *Faculty Development; Graduate Medical Education; Higher Education; Medical Education; Medical School Faculty; Mental Health; *Program Evaluation; Qualitative Research; *Research Methodology; Sampling; Statistical Bias; *Substance Abuse

ABSTRACT

Intended to provide an overview of program evaluation as it applies to the evaluation of faculty development and clinical training programs in substance abuse for health and mental health professional schools, this guide enables program developers and other faculty to work as partners with evaluators in the development of evaluation designs that meet the specialized needs of faculty development and clinical training programs. Section I discusses conceptual issues in program evaluation, including the uses of evaluation (management and monitoring, program description, program improvement, accountability, and creating new knowledge); the major options (formative/summative, process/outcome/impact, quantitative/qualitative); and the benefits and risks of conducting evaluation studies. Section II, an introduction to research methods, includes the following discussions: sampling with known sampling errors (simple random, systematic, multistage random, stratified, cluster, stratified cluster, and sequential sampling); sampling without known sampling errors (convenience, quota, modal, purposive, and snowball sampling); sample size and generalizability and sample size and statistical power; the validity of evaluations and potential sources of bias, including issues related to internal validity (history, maturation, testing, instrumentation, statistical regression, selection, mortality, interactions with selection, and ambiguity about the direction of causal influence); comparison and control groups; measurement of outcomes; and qualitative evaluation methods and analysis. (5 references) (JB)


EVALUATING FACULTY DEVELOPMENT AND
CLINICAL TRAINING PROGRAMS IN SUBSTANCE ABUSE:

A GUIDE BOOK

Michael Klitzner, Ph.D.
Kathryn Stewart, M.S.

Pacific Institute for Research and Evaluation
Bethesda, Maryland

June, 1990

Development of this Guide was supported by the National Institute on Alcohol Abuse and Alcoholism and the National Institute on Drug Abuse.


ACKNOWLEDGMENTS

This Guide Book is based on materials originally assembled for A Guide to First DUI Offender Program Evaluation prepared by the authors for the California Office of Traffic Safety. The authors wish to acknowledge the contribution of Paul Moberg, Ph.D., with whom the authors originally developed the approach to program evaluation reflected in this Guide Book.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS
TABLE OF CONTENTS
ABOUT THIS GUIDE BOOK

SECTION I: ISSUES IN PROGRAM EVALUATION

    USES OF PROGRAM EVALUATION
        Evaluation for Management and Monitoring
        Evaluation for Program Description
        Evaluation for Program Improvement
        Evaluation for Accountability
        Evaluation for Creating New Knowledge

    MAJOR OPTIONS
        The Formative/Summative Dimension
        The Process/Outcome/Impact Dimension
        The Quantitative/Qualitative Dimension

    BENEFITS AND RISKS
        Benefits
        Risks

SECTION II: INTRODUCTION TO RESEARCH METHODS

    SAMPLING
        Samples with Known Sampling Errors
            Simple Random Sampling
            Systematic Sampling
            Multistage Random Sampling
            Stratified Sampling
            Cluster Sampling
            Stratified Cluster Sampling
            Sequential Sampling
        Samples Without Known Sampling Errors
            Convenience Sampling
            Quota Sampling
            Modal Sampling
            Purposive Sampling
            Snowball Sampling

    ISSUES RELATED TO SAMPLE SIZE
        Sample Size and Generalizability
        Sample Size and Statistical Power

    THE VALIDITY OF EVALUATIONS AND POTENTIAL SOURCES OF BIAS
        Issues Related to External Validity
            Sampling Biases
            Measurement Biases
        Issues Related to Internal Validity
            History
            Maturation
            Testing
            Instrumentation
            Statistical Regression
            Selection
            Mortality
            Interactions with Selection
            Ambiguity About the Direction of Causal Influence

    COMPARISON AND CONTROL GROUPS
        The True No-Treatment Control
        The Waiting List Control
        The "Traditional Treatment" Control
        The Minimal Treatment Control
        The Placebo Control
        The Non-Eligible Comparison Group

    MEASURING OUTCOMES
        Self-Report Measures
        Other-Reports
        Behavioral Observation
        Records Data

    QUALITATIVE EVALUATION METHODS
        Data Collection Techniques
            Observation
            Participant Observation
            Qualitative Interviews

    QUALITATIVE ANALYSIS

    CONCLUSION

    NOTES


ABOUT THIS GUIDE BOOK

There are many text books on program evaluation and this Guide Book is not intended to be one of them. Instead it is intended to provide an overview of program evaluation as it applies to the evaluation of faculty development and clinical training programs in substance abuse for health and mental health professional schools. The Guide Book references other resources that can be consulted for more detailed and technical information.

The Guide Book is divided into two major sections. Section I discusses conceptual issues in program evaluation, including the uses of evaluation, the major options, and the benefits and risks of conducting evaluation studies. Section II is intended as an introduction to research methods. It includes discussions of sampling, statistical power, the validity of evaluations and potential sources of bias, comparison and control groups, and measurement of outcomes. A section on data analysis is not included because this aspect of evaluation has become so complex in recent years that no simple treatment is possible. However, all universities have departments of statistics that may be called upon for assistance in data analysis.

It is important to note that it is not expected that an individual unfamiliar with program evaluation will be able to plan an evaluation based on the material presented in this Guide Book. Rather, the purpose of the Guide is to enable program developers and other faculty to work as partners with evaluators in the development of evaluation designs that meet the specialized needs of faculty development and clinical training programs.


SECTION I: ISSUES IN PROGRAM EVALUATION

USES OF PROGRAM EVALUATION

In many ways, "program evaluation" is a misleading phrase. "Evaluation" implies judgement of what is good and what is not good. Although determination of whether or not a program is "good" is the goal of some types of program evaluation, it is by no means the only -- or even the major -- goal of program evaluation in general. Unfortunately, the "valuative" dimension of program evaluation is what most often comes to mind for many people. Thus, individuals unfamiliar with the wide range of activities subsumed under program evaluation may approach this topic with some trepidation -- especially if it is their programs that are to be evaluated.

A much more general goal of program evaluation -- and one which underlies almost all evaluation activity -- is the production of systematic information to be used in making decisions about programs. The complexity and scope of these decisions can vary widely, as can the complexity and scope of the evaluation, but the principle is always the same:

Evaluations are implemented to improve decision-making.

All the diverse, specific purposes of a given evaluation derive from this principle. With this general principle in mind, we now examine some of the specific purposes of program evaluation. In each case, the discussion will be tied to the kinds of decisions that underlie each specific purpose.

Before discussing the specific purposes of program evaluation, it is important to define clearly what is meant by a "program." In the area of faculty development and clinical training in substance abuse, a program may be as broad as an entire institution's attempt to improve its instructional program or as narrow as the techniques used by an individual practitioner to confront substance abusing patients or counsel adolescents. A program may be the introduction of new content into an existing course, or the implementation of an innovative teaching strategy. All of these "programs" may be evaluated on many different levels and for a variety of purposes. Of course, some kinds of evaluations will be more appropriate for some programs than for others. In general, however, all of the types of evaluation described below are applicable, to some degree, to any aspect of faculty development or clinical training programs.


Evaluation for Management and Monitoring

One common use of program evaluation is the monitoring and management of programs. Especially in large programs such as the implementation of a new substance abuse curriculum or the opening of a new substance abuse unit in a teaching hospital, so much is going on that it is often difficult to keep track of the program on a day-to-day basis.

Management information or monitoring systems are designed to meet ongoing information needs (i.e., those information needs that can be expected to exist as long as the program is in operation). In general, the questions addressed by such systems relate to events that are recurrent and routine. Examples of questions related to an institution's instructional program might include:

How many hours of didactic and clinical instruction on substance abuse are offered each quarter or semester?

How many students are exposed each quarter or semester to a given course or clinical experience?

What are the trends in student exam scores in the area of substance abuse?

Similarly, questions concerning the provision of clinical services might include:

How many patients are diagnosed each month with substance-related problems? On what services are they diagnosed? Where are they referred?

How many beds are filled at any given time on the substance abuse unit? From where were patients referred? What are the intake diagnoses?

Regular review of the data generated by a management information or monitoring system provides an up-to-date picture of the program as a whole or of specific program elements (e.g., a specific course or practicum) which can be used both to make short-term corrections and to plan future program directions.
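Tallies of this kind are simple to automate. As an illustration (not part of the original Guide), the following Python sketch computes one of the example questions above -- monthly counts of substance-related diagnoses by service -- from a hypothetical encounter log; the file name and column names are assumptions.

import csv
from collections import Counter

# Hypothetical encounter log with one row per patient encounter;
# the "date", "service", and "diagnosis" columns are illustrative assumptions.
monthly_counts = Counter()
with open("encounter_log.csv", newline="") as f:
    for row in csv.DictReader(f):
        if "substance" in row["diagnosis"].lower():
            month = row["date"][:7]  # e.g. "1990-06" from an ISO-formatted date
            monthly_counts[(month, row["service"])] += 1

for (month, service), n in sorted(monthly_counts.items()):
    print(month, service, n)

A management information system would simply rerun such a tally each month and file the results for regular review.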

Evaluation for Program Description

Often, there is a need to capture the overall operation of a program or program element at a given point in time -- that is, to develop a comprehensive program description. Such a need might arise, for example, in developing a manual to allow the replication of a substance abuse curriculum or curriculum element. Systematic program descriptions are a major vehicle for communicating with the field, and, as discussed later, a comprehensive program description is a prerequisite for any evaluation activities designed to assess program effectiveness. In all cases, the need is to answer the question:

What does this program look like as it actually operates?

Many of the evaluation activities undertaken for the purpose of program description are similar to those undertaken for the purposes of management and monitoring. The major differences between evaluation for the purpose of management and monitoring and evaluation for the purpose of description concern the time-frame of the data collection and the depth of the data collection.

As suggested, management or monitoring systems are ongoing and provide a dynamic picture of changes over time. By contrast, a descriptive evaluation usually attempts to capture a "still picture" of a program or program element as it operates at a given point in time. Of course, a program description may contain historical or developmental elements, but the emphasis is on the program as it operates now.

Descriptive evaluations strive for a greater depth than can be captured in the numbers generated by a management information system. For example, a management information system might reveal that in a given semester, X number of learners attended a newly offered selective course on anticipatory guidance for adolescent and pre-adolescent patients and their parents. In addition, a descriptive evaluation would seek information concerning the learners' experiences in the course, how they reacted, and whether or not learners felt they could or would apply the course content in clinical practice. Gathering such information might require observation of course sessions, interviews with learners, debriefing of instructors, and so on. Such in-depth data collection strategies are the hallmark of descriptive evaluations. It is the fine-grained detail they provide that often makes them so useful in management decision-making and in enhancing assessments of program effectiveness.

Evaluation for Program Improvement

All professional schools are eager to improve the instructional program offered to learners. Program evaluation provides a mechanism for developing the information needed to make such improvements in a rational and planful manner.

Program improvement can take many forms. In some cases, it might mean attracting more learners to important courses or finding more appropriate clinical settings for clinical placements. In other cases it might mean making changes in the policies of the department or institution. In still other cases, program improvement might mean more effective or efficient learning, or increased impact on clinical practice behavior.

The questions that might be raised relevant to program improvement are numerous, but they all take the general form:

How can this program or program element be made more effective?

Sometimes the answers to these questions derive from management or monitoring systems or descriptive evaluations. Sometimes they require complex experimental designs. In either case, the emphasis is on identifying strengths upon which to capitalize and weaknesses in need of correction. In this sense, program evaluations for program improvement tend to fit with the traditional "valuative" connotations of program evaluation discussed earlier. However, the emphasis is on the use of data internally and in the overall service of the program.

Evaluation for Accountability

Evaluation for the purpose of accountability -- that is, evaluation for the purpose of determining whether or not a program is meeting its stated objectives -- also informs decision making. Often, however, the decision maker is someone outside the program (e.g., funders or university administration), and sometimes the decision is one of program continuation or termination. Thus, accountability evaluations can be threatening to those involved.

Accountability evaluations are similar to the evaluation activities described thus far, and may address many of the same questions. For example, an accountability evaluation of a clinical training program might address issues of course availability, course content, and course attendance (management or monitoring evaluation), learner experiences with and reactions to the courses (descriptive evaluation), and/or strengths and weaknesses (program improvement evaluation). Thus, a program that has these evaluative elements in place is generally in a good position to respond to requests for at least some kinds of accountability information.

Simply demonstrating an evaluation-based approach to program development, implementation, and improvement is one indication that a program is well run and rationally managed. Thus, programs in which evaluation is an ongoing and integral component of program operations tend to fare well when required to be accountable to outside decision makers.

Evaluation for Creating New Knowledge

A final purpose of evaluation is to develop and/or test hypotheses about programs and their effects. In the broadest sense, the goal of such evaluations is to contribute to the general knowledge base about what does or does not lead to improvements in professional education, clinical practice behavior, and/or in patient outcomes regarding substance abuse.

If they are to be accepted by the scientific community, evaluations that seek to contribute to the general knowledge base must generally meet rigorous standards of research design. Many of these design issues are discussed in Section II of this Guide Book. It is important to note, however, that even the most rigorous evaluation design is greatly enhanced if other evaluative elements are in place (good program monitoring, a capacity for detailed program description, etc.).

It is also important to note that evaluations designed to produce new knowledge can have direct, practical implications for the institutions that conduct them. For example, an institution considering the adoption of interactive laser disc technology as a pedagogical strategy, or the use of standardized patients (actors trained to present a standardized complaint) as an assessment strategy, may wish to compare these strategies to traditional, but less expensive strategies to determine if the improvement in outcomes is worth the extra cost.

MAJOR OPTIONS

Once the purpose of a given evaluation initiative has been determined, the evaluator must choose among available evaluation options. Many of these options are methodological and are discussed in detail in Section II. However, there are conceptual options that should be addressed before methodological options are considered. These conceptual options determine the overall direction and scope of the evaluation. They also guide, to some degree, the methodological decisions to follow. It is these conceptual options that are discussed here.

Program evaluation activities may be seen as varying along three major conceptual dimensions:

The overall goal of the evaluation (Formative, Summative)

The level of information gathered (Process, Outcome, Impact)

The type of information gathered (Quantitative, Qualitative).

Before discussing these dimensions in more detail, it is important to emphasize that they do not represent "either/or" options. For example, while an evaluation may be largely formative, it may have summative components. Similarly, a largely quantitative data collection effort may include qualitative data collection as well. Thus, evaluation activities are most often combinations of several components from each dimension, depending upon the overall purpose of the evaluation initiative.

The Formative/Summative Dimension

The terms "summative" and "formative" refer to the uses to which evaluation information is put. Formative evaluation information is used for program improvement and is usually reviewed on a regular basis as it is collected. A management or monitoring evaluation system in which data are reviewed each quarter or semester would be a relevant example. Summative evaluation information is most often used to assess the efficiency or effectiveness of a program that has reached a stable operational state. In contrast to formative information, summative evaluation information is often not analyzed until all data collection is completed in order to avoid contamination of the program under consideration. A controlled experimental study comparing an innovative curriculum module to a standard module is an example of a summative evaluation.

There is a tendency to associate formative evaluation with program descriptive information and summative evaluation with assessment of program outcomes. In fact, any type of evaluation may be either summative or formative, depending on the purpose of the evaluation and the uses to which the information is put. For example, a descriptive evaluation implemented in order to develop a dissemination manual for an innovative program component (e.g., the use of standardized patients in a training) would be a summative evaluation because the purpose of the evaluation is to provide a finalized description of the component, and because the program component is stable at the point at which it is described. Similarly, an evaluation which provides an assessment of the effectiveness of a promising new educational technique would be considered formative if the purpose of the study were to aid in the further development of the technique.

In general, summative evaluation, whether oriented to description or outcome assessment, will require a higher level of scientific rigor than will formative evaluation, and will, therefore, be more difficult and costly to implement. Keep in mind that it is inappropriate to carry out a summative evaluation when a program is still in a state of evolution, and when program components have not yet reached the level of stability required for a summative assessment.

The Process/Outcome/Impact Dimension

Some evaluators have found it useful to distinguish three levels of evaluation: process, outcome, and impact.

The process level is concerned with the quality and quantity of program inputs and activities, and with the socio-political context in which the program operates. Examples of the kinds of information subsumed by the process level include faculty and learner demographics, course descriptions, learner evaluations of courses, and context variables such as institutional support for curriculum changes.

Process data are the basis for program management and monitoring systems and for evaluations designed for the purpose of program description. Process data also play several key roles in evaluations designed to assess program effectiveness. At the simplest level, process data allow a determination of the actual program that is being evaluated. Programs as implemented are almost always different from programs as planned, and it is the effectiveness of the program as implemented that is of primary interest.

Process data also aid in unraveling the usually complex outcomes of programs and program components. For example, answers to the question "What pedagogical approaches work best with different categories of learners (e.g., undergraduates, post-graduates, fellows, faculty, etc.)?" are derived from an examination of differential effectiveness data in light of process information.

The preceding discussion suggests a general rule: Assessments of program effectiveness should not be undertaken in the absence of a well developed process evaluation.

The outcome level subsumes assessments of the accomplishment of specific program objectives concerned with changes in knowledge, attitudes, clinical behavior, patient outcomes, and so on. This simple conceptual definition belies the technical complexities involved in implementing adequate evaluations at the outcome level.

The technical complexities associated with outcome assessments derive from the necessity of meeting two challenges. First, measurement techniques must be located or developed that provide adequate representation of the outcome objectives of the program. Because substance abuse is a relatively new field in professional education, few standardized measures are available. Second, an evaluation design must be developed that supports the inference that it is the program or program component under study that has caused changes in attitudes, knowledge, clinical behavior, or patient outcomes -- that is, a design that rules out alternative explanations of the observed changes. A large proportion of the methodological discussions in Section II are concerned with strategies for meeting these two challenges of outcome assessments.

Evaluation at the impact level assesses the broader effects of a program or the aggregate effects of several programs operating over time. Examining the extent to which substance abuse training is institutionalized over a period of years would be one example of an impact evaluation. Impact evaluations also look for unintended program effects, both positive and negative. For example, a well regarded substance abuse training program in one professional school might stimulate the development of similar programs in other professional schools in the university. Generally, such unintended effects are discovered through descriptive data collection, and only the most complex and costly impact evaluations seek to study such effects systematically.

The Quantitative/Qualitative Dimension

The distinction between quantitative and qualitative evaluation is often made in terms of the methods used to collect data. Thus, knowledge tests, closed-ended interviews, and counts of patients referred for treatment are considered quantitative methods, while observation, open-ended interviews, and case histories are considered qualitative methods. The former are characterized as systematic and quantifiable while the latter are considered less systematic and less quantifiable. This distinction rapidly breaks down, however, as it becomes clear that almost any data collection can be made systematic and all data can be quantified. For example, observational methods may use standardized protocols and sophisticated time samples that yield precise data; open-ended interviews may be coded to produce reliable categories, and so on.

A more useful distinction is to consider the general paradigm that underlies quantitative and qualitative analyses. Quantitative analyses derive from a hypothetico-deductive paradigm in which a hypothesis or series of hypotheses are tested by relating variations in independent variables to variation in dependent variables. Such analyses require a substantial body of knowledge about the phenomenon (program) under study and the mechanisms that are thought to underlie its operation and/or effects. Here, precise definitions of variables and precision in measurement are key to the hypothesis testing process.

Qualitative analyses derive from a holistic-inductive paradigm which works backwards from the phenomenon (program) under study in an attempt to develop hypotheses about its operation and effects. Although data collection may still be highly systematic, the evaluator casts a much wider net than in quantitative analyses, attempting to establish patterns and regularities on which to develop insights into the mechanisms underlying operations and effects.

The choice of options along the qualitative/quantitative dimension will depend upon the level of existing knowledge concerning the program or program component under consideration and the nature of the questions being asked. Where there is sufficient knowledge to develop specific hypotheses and to define relevant variables precisely, a quantitative approach is indicated. By contrast, when relatively little is known about a program or component (as might be the case in a newly initiated, novel pedagogical practice), a more qualitative approach will generally provide a richer yield of useful information.

Frequently, evaluations evolve from qualitative to quantitative. For example, a given department might be interested in exploring the mentoring component of a faculty development program. Very little may be known at first about how the component operates, its relevant features, even its feasibility. The evaluation might start, therefore, by exploring the component qualitatively through semi-structured interviews with participating faculty, their mentors, and other key informants in the department (e.g., administrative personnel). Once the specific variables of interest have been identified qualitatively, they might be measured quantitatively through a closed-ended survey with quantitative rating scales.

Most successful evaluation efforts comprise a blend of both quantitative and qualitative data. For example, a study of learner attitudes towards substance abusers might rely on a paper-and-pencil assessment amenable to quantitative analysis supplemented by in-depth, open-ended debriefing of a smaller sample of learners. Here, the more precise, but necessarily narrow, data from the quantitative analysis are supplemented by the descriptive depth offered by the qualitative analysis.

Quantitative data supplemented by qualitative anecdotes are also a powerful tool for communicating evaluation results. Evaluation practitioners have long known that, although quantitative rigor is the sine qua non of program evaluation, most individuals are more compelled by anecdotes and case examples than they are by statistical graphs and tables. This is not to suggest that evaluations should therefore be based solely on anecdotes and case examples. Rather, the point is that such qualitative data are useful communication tools when they are consistent with and support quantitative findings.

BENEFITS AND RISKS

Benefits

Many of the benefits of evaluation have already been described or alluded to in the preceding discussions. As suggested, the primary benefit of evaluation is that it provides a rational basis for program decision making -- including decisions concerning the most effective ways for integrating substance abuse content into professional education -- by providing systematic input into the decision making process. To this may be added at least four other benefits:

Evaluation as an early warning system

Evaluation as positive reinforcement

Evaluation as burnout prevention

Evaluation as legitimization

As already suggested, evaluation activities can provide regular feedback on the operation of a program or program component. One of the most important functions of such activities is their ability to serve as an early warning system, uncovering problems as they arise. A regular review of management or monitoring system data allows the identification of problem areas which might otherwise go unnoticed, and to take corrective action before these problems become serious.

Everyone is interested in feedback about job performance, and, in professional education, faculty are always interested in the impact their work is having on learners. Similarly, clinicians are extremely concerned about the impact they are having on the health and well being of patients or clients. An evaluation system can be of direct benefit in meeting these needs and can provide positive reinforcement to individuals who might otherwise have little systematic way of assessing their own performance.

Related to the reinforcing value of evaluation is its role in preventing "burnout." It has often been observed that the introduction of evaluation activities into a program provides a new and interesting focus for individuals who may be somewhat jaded or discouraged. It is also the case that the self-examination that accompanies the introduction of evaluation activities into a program provides a positive forum for discussion of operational issues that might otherwise remain hidden. Repeated observation of this self-examination phenomenon has led some evaluators to refer to evaluation planning activities as the "evaluation therapy" process.

Educating the public and other important constituents about programs -- i.e., public relations -- is another important function of evaluation data. As noted, descriptive evaluations are often used as communication tools to answer the common question, "What exactly does this program do?" Similarly, management or monitoring systems allow ready answers to the four journalistic "W's" (who, what, where, and when), issues often raised by outside individuals interested in program operations.

Finally, the legitimizing function of evaluation must be considered. All else being equal, a program that is actively engaged in examining its activities and its impact will be viewed as more legitimate than a program that is not. The legitimacy that derives from an active program of evaluation is well warranted -- programs that include evaluation as part of their ongoing activities are also more likely to be those programs that are effectively managed and whose growth and development are rational and planful.

Risks

The most commonly discussed risk of evaluation is that the data will reveal something negative about the program or its impact. This perspective derives from the view of evaluation as largely a "valuative" activity. As has been repeatedly emphasized, the "valuative" function of evaluation is by no means its only, or even its primary function. Even when evaluations are designed to directly assess program worth or efficacy, evaluation results are almost always a mixture of positives and negatives. Rarely, if ever, do evaluation results deliver the programmatic coup de grace.

This is not to imply that evaluation activities are without risks. However, these risks tend to have little to do with what the results of the evaluation reveal about the program. Rather, they involve the practicalities of introducing evaluation activities into the ongoing operations of a program. Specifically:

Evaluations can be threatening to those who have an investment in program success.

Evaluation activities can interfere with other activities.

Evaluations can take away resources from program activities.

Despite all arguments to the contrary (including those presented in this manual), some people involved with a program will be initially threatened by the prospect of initiating evaluation activities. Even when it is understood that the evaluation will not be used to make judgments about the adequacy of performance (i.e., it is the program, not the people, that is being evaluated), evaluation can still be threatening because evaluation may portend change and a threat to the status quo. In addition, many people harbor the suspicion (sometimes justified by previous experience with evaluators) that the tools of evaluation are ill-suited to measure the quality of certain program activities or to assess "intangible" outcomes.

Experience suggests that the threat of evaluation can be reduced by including the individuals whose program is to be evaluated in the evaluation planning process and by providing them with an active role in the interpretation of evaluation results. Moreover, certain evaluation components (e.g., management or monitoring systems) can be of benefit to faculty and other staff by providing regular feedback as discussed earlier.

Experience also suggests that autocratic attempts to impose evaluations on resistant participants will usually fail. The ways in which an unwanted evaluation can be undermined are too numerous to list -- it suffices to say that without full cooperation, most evaluation activities will produce little in the way of useful information.

Almost all evaluation activities obtrude, in some way, into ongoing program or institutional activities. Such interference can be minimized by careful selection of evaluation methods, but the fact remains that additional activities will be required that are not always consistent with programmatic objectives. For example, an evaluation design may include the observation of interactions with patients or clients, and it may be argued that the presence of the observer negatively affects the provider/patient relationship. Similarly, time devoted to an extensive knowledge or attitudes pre-test is time taken away from instruction. There is no "magic bullet" to resolve these conflicts between the evaluation agenda and the program agenda. Rather, a balance must be struck between compromise in the program and compromise in the evaluation design given the probable benefits of implementing a given evaluation activity.

One final risk of evaluation bears consideration -- the risk of misusing evaluation results. In some instances, slipshod or equivocal evaluations are used to support claims of program effectiveness that are not justified by the evaluation design or by the data. Here, the loss of credibility can be sufficiently damaging to the program or institution as to overshadow any benefits that might derive from the "positive" evaluation results.

Similarly, design or implementation weaknesses in the program itself or in the evaluation may lead to negative evaluation findings that are not a true reflection of the effectiveness or potential effectiveness of the program. An otherwise sound program idea may be prematurely rejected.

Both promotion of an unsound program and the rejection of an effective one are serious risks -- they can best be avoided by using the most rigorous methodology feasible and by interpreting evaluation results carefully.


SECTION II: INTRODUCTION TO RESEARCH METHODS

This section provides an overview of the methodological issues involved in evaluating faculty development and clinical training programs in substance abuse. The overview is intended to raise key issues in program evaluation research methods, and to relate these issues to the specific problems faced in the design and implementation of the various kinds of evaluation activities described earlier (monitoring and management systems, descriptive studies, accountability studies, etc.). Where appropriate, references are provided to more detailed, technical descriptions for exploration by the interested reader.

SAMPLING

Sampling involves strategies for selecting a portion of cases from or about which data will be collected, with the goal of making generalizations about the larger population from which the cases are drawn. For example, in evaluations focused on learner outcomes, a case might be a single learner, the enrollees in a course, or all the learners in a department. Similarly, the population to which generalizations are to be made might be all learners enrolled in a given course, all learners in a given clinical specialty, all departments within a professional school, and so on.

In some instances, an evaluation design may call for data collection from or about all cases in a population. This might be the case, for example, in a management or monitoring system that tracks learner outcomes from year to year based on standardized test scores. More often, however, it is not practical, or even desirable, to collect data from an entire population. Under these circumstances, a sampling strategy is required.

Sampling strategies may be divided into two broad categories -- those that yield samples with known or measurable sampling errors, and those that do not. Sampling error is a measure of the precision with which the parameters of a sample represent the parameters in the larger population. From the standpoint of methodological rigor, samples with known sampling errors are clearly preferred. However, such samples may not always be practically or conceptually feasible, and other sampling strategies must, out of necessity, be employed.

All samples, with or without a known or measurable sampling error, may be further categorized according to their variability. In general, samples with lower variability are preferred, owing to the inverse relationship between variability and statistical power (i.e., the ability to detect effects if they are present). It is important to note that the preference for samples with lower variability applies primarily to quantitative studies; the situation in qualitative studies may be quite different -- in qualitative studies, samples may be constructed in such a way as to increase variability in order to gather as diverse a data base as possible.

The following sections describe the major sampling strategies used in program evaluation. Strategies yielding samples with and without known sampling errors are discussed separately.

Samples with Known Sampling Errors

Simple Random Sampling: In simple random sampling, each population member (case) is assigned a unique number. The sample is then selected via use of random numbers. Simple random sampling has three advantages: 1) it requires only minimum knowledge of the population a priori, 2) it avoids classification errors, and 3) it facilitates analysis of data and computation of errors. Disadvantages of simple random sampling include that it does not make use of knowledge of the population which might be possessed by the evaluator, and that it yields larger errors (for the same sample size) than does stratified sampling.
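As an illustration (not part of the original Guide), a minimal Python sketch of the procedure, assuming a hypothetical roster of 200 learners and a sample of 30:

import random

# Hypothetical population roster; each case already has a unique number (its position).
roster = [f"learner_{i:03d}" for i in range(1, 201)]

random.seed(42)                       # fixed seed only so the draw can be reproduced
sample = random.sample(roster, k=30)  # every learner has an equal chance of selection
print(sample[:5])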

Systematic Sampling: Systematic sampling exploits the natural ordering of a population. A random starting point is selected between the number one and the whole number nearest to the sampling interval -- defined as the ratio (N/n) between the size of the population (N) and the size of the sample (n). Items are then selected at every sampling interval thereafter. If the population is ordered with respect to some pertinent property (e.g., department, clinical specialization), then systematic sampling yields a stratification effect (e.g., insures roughly equal representation of departments or specialties). This stratification reduces variability compared to a simple random sample -- the major advantage of systematic sampling. In addition, systematic sampling facilitates both the drawing and checking of the sample.

However, if the sampling interval is related to a periodic ordering of the population (e.g., substance abuse-related emergency room admissions by time of day or day of the week), increased variability may be introduced. This is the major disadvantage of systematic sampling. In addition, when a stratification effect is present, estimates of error are likely to be high.
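A minimal sketch of the drawing procedure (not part of the original Guide), with a hypothetical ordered roster and an assumed sample size:

import random

roster = [f"patient_{i:04d}" for i in range(1, 1001)]  # hypothetical ordered population, N = 1000
n = 50                                                 # desired sample size

interval = round(len(roster) / n)          # sampling interval: whole number nearest N/n
random.seed(7)
start = random.randint(1, interval)        # random starting point between 1 and the interval
sample = roster[start - 1::interval][:n]   # then every interval-th case thereafter
print(len(sample), sample[:3])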

Multistage Random Sampling: Multistage random sampling involves stages, all of which are a form of random sampling. For example, in a follow-up study of patient outcomes, clinical sites might be randomly selected (stage one), and then patients or clients randomly selected within sites (stage two). A major advantage is that sampling lists, identification, and numbering are required only for units belonging to subgroups (e.g., clinical sites) actually selected. If sampling units are geographically dispersed (e.g., satellite clinics), multistage random sampling cuts down on the time and trouble needed to collect data.

On the negative side, errors are likely to be larger for multistage random sampling than for simple random or systematic sampling for the same sample size. Errors increase as the number of sampling units selected decreases -- if, for example, only a small number of satellite clinics can feasibly be sampled.

In a major variant of multistage random sampling, sampling units are selected with probability proportionate to their size. This procedure has the advantage of reducing variability. Its major disadvantage is that lack of a priori knowledge of the size of each sampling unit increases the variability.
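A minimal two-stage sketch (not part of the original Guide), assuming hypothetical satellite clinics and patient rosters:

import random

# Stage-one units: hypothetical satellite clinics, each with its own patient roster.
clinics = {f"clinic_{c}": [f"clinic_{c}_patient_{i:02d}" for i in range(1, 81)] for c in "ABCDEFGH"}

random.seed(11)
selected_clinics = random.sample(list(clinics), k=3)       # stage one: randomly select clinics
sample = [p for clinic in selected_clinics                 # stage two: randomly select patients
          for p in random.sample(clinics[clinic], k=10)]   #            within each selected clinic
print(selected_clinics, len(sample))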

Stratified Sampling: Stratified sampling strategies make use of knowledge about characteristics of the population (e.g., age, sex, and ability to pay of patients in a clinic) which may be related to dependent variables (e.g., patient referrals). There are three major variants: 1) proportionate sampling, 2) optimum allocation sampling, and 3) disproportionate sampling. Each of these three variations is discussed in turn.

In proportionate stratified sampling, selection from every sampling stratum is random with probability proportionate to size. This assures representation with respect to the property or properties which define the strata and, therefore, yields less variability than simple random sampling or multistage random sampling. Proportionate stratified sampling also decreases the chance of failure to include members of the population because the classification process helps insure that a wider range on the variables that define the strata will be included. In addition, the stratified sample facilitates comparisons between or among strata (e.g., adult patients vs. adolescent patients). On the negative side, proportionate stratified sampling requires accurate information on the proportion of the population in each stratum. Otherwise, error will be increased. If stratified lists are not available, these may be costly to prepare. There is also the possibility of faulty classification and, hence, increased variability.
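The allocation arithmetic is straightforward. A minimal sketch (not part of the original Guide), assuming a hypothetical patient list stratified by age group:

import random
from collections import defaultdict

# Hypothetical population: (patient, age_group) pairs -- 300 adults and 100 adolescents.
population = [(f"patient_{i:04d}", "adult" if i % 4 else "adolescent") for i in range(1, 401)]
n = 60  # total sample size

strata = defaultdict(list)
for patient, age_group in population:
    strata[age_group].append(patient)

random.seed(3)
sample = []
for age_group, members in strata.items():
    # Allocate the sample in proportion to each stratum's share of the population
    # (rounding can make the total differ slightly from n).
    k = round(n * len(members) / len(population))
    sample.extend(random.sample(members, k))
print(len(sample))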

Optimum allocation sampling procedures are the same as those in proportionate sampling, except that the sample is proportionate to the variability within strata as well as to their size. This assures that there will be less variability for the same sample size than in proportionate stratified sampling. The major disadvantage is that optimum allocation requires knowledge of the variability of pertinent characteristics within each stratum. For example, it may be useful to classify patients within clinics according to variability in the nature and extent of substance abuse problems. However, accurate assessments of such variability may be difficult to obtain.

Disproportionate stratified sampling proceeds as in the proportionate and optimum allocation variants, except that the size of the sample is not proportionate to the size of the sampling units, but is determined rather by analytic considerations or convenience. For example, one might choose to oversample a cell within a stratified sampling design if the proportion of this cell within the general population is so low as to yield unanalyzable results (for example, oversampling patients of a particular sex, age, and usage pattern, such as cocaine-addicted pregnant teenagers).

Disproportionate stratified sampling is more efficient than proportionate stratified sampling for comparison of strata or where different errors are optimum for different strata. The major disadvantage is that disproportionate stratified sampling is less efficient than proportionate sampling for determining population characteristics. That is, it yields higher variability for the same sample size.

Cluster Sampling: In cluster sampling, sampling units are selected via some form of random procedure. The ultimate units are groups (e.g., a clinic, enrollees in a course). Cluster sampling has several advantages: 1) if clusters are geographically defined, it yields the lowest field costs, 2) it requires listing only units in selected clusters, 3) characteristics of clusters as well as those of the population can be estimated, and 4) it can be used for subsequent samples because clusters, rather than units, are selected, and substitution of units may be possible. The disadvantages are larger errors for comparable sample sizes than with other probability samples, and the requirement that each member of the population be uniquely assigned to a cluster -- inability to do so may result in duplication or omission of units.
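A minimal sketch (not part of the original Guide) in which hypothetical course sections serve as the clusters and every enrollee in a selected section enters the sample:

import random

# Hypothetical clusters: course sections, each with its full enrollment list.
sections = {f"section_{s:02d}": [f"section_{s:02d}_learner_{i:02d}" for i in range(1, 26)]
            for s in range(1, 13)}

random.seed(5)
selected_sections = random.sample(list(sections), k=4)  # randomly select whole clusters
sample = [learner for s in selected_sections            # then take every member of each
          for learner in sections[s]]                   # selected cluster
print(selected_sections, len(sample))

Note the contrast with the multistage sketch above: here no second-stage sampling occurs within the selected groups.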

Stratified Cluster Sampling: Stratified cluster sampling is to cluster sampling what stratified sampling is to simple random sampling. That is, sampling strata are defined based on known characteristics of the clusters, and clusters are selected at random from every sampling stratum. Stratification reduces the variability of ordinary cluster sampling, but combines the disadvantages of stratified sampling with those of cluster sampling. In addition, because cluster properties may change, the advantage of stratification may be reduced, and the sample may not be usable for subsequent research.

Sequential Sampling: Sequential sampling is a procedure whereby two or more samples of any of the types discussed above are taken, with results from earlier samples used to design later ones (or to determine if they are necessary). Sequential sampling provides estimates of population characteristics that facilitate efficient planning of succeeding samples, and thereby reduces errors in the final estimate. In the long run, sequential sampling also reduces the number of observations required. It has several disadvantages: 1) it complicates the administration of field work, 2) it requires more computation and analysis than does non-repetitive sampling, and 3) it can be used only where a very small sample can approximate representativeness and where the number of observations can be increased conveniently at any stage of the research.

Samples Without Known Sampling Errors

Convenience Sampling: The simplest sampling technique is convenience sampling. Here, the evaluator collects data on those individuals who are most readily available. An example would be an assessment of changes in learners' patient interviewing skills using learners assigned to a certain clinic chosen on the basis of geographic proximity (i.e., it is easy to get to) or timing (the clinic hours fit the schedule of the data collectors). Although the convenience sample is one of the weakest available sampling options, it is usually sufficient for pilot studies or for studies aimed at very general assessments.

Quota Sampling: Quota sampling attempts to overcome some of the disadvantages of convenience sampling by introducing stratification based on a priori knowledge of a population. For example, in a study of case notes entered by learners exposed to an educational innovation, clinic records might be used to determine the proportion of male and female patients in different age groups, and "quotas" set for male and female patients of different ages that represent the population proportions. Quota samples reduce variability and increase representativeness, although the selection of cases within categories (e.g., male, female) may still be highly biased.

Modal Sampling: Modal sampling involves the selection of a sample that is judged to be representative of a population based on knowledge of the general characteristics of the population of interest. For example, experience might suggest that patients in a given clinic tend to fall into categories characterized by a set of related characteristics (e.g., young males who are heavy drinkers but not alcoholics; older males who are either light drinkers or alcoholics, etc.). The modal sample is selected to include individuals from each of the various categories which, taken together, are thought to represent the range of patients in that population. Modal sampling is particularly useful in evaluations where a large amount of resources is to be devoted to the study of a small number of cases, as might apply when a study requires lengthy interviews with patients.

In general, the quality of a modal sample will depend upon the extent and sophistication of the evaluator's understanding of the characteristics of the population. Like quota sampling, modal sampling has the advantage of increasing representativeness and decreasing variability, and like quota sampling, the possibility of biased selection within modal categories is a major disadvantage.

Purposive Sampling: Purposive sampling is used in qualitative evaluation to ensure that information is obtained from each group or individual of interest. For example, in order to describe the changes that have occurred as the result of the introduction of a new curriculum, key informants (such as the department chair and some professors) might be selected from each participating department.

Snowball Sampling: Snowball sampling is a variation on purposive sampling. Key informants are selected, and each of these informants is asked to suggest other people from whom data should be collected. For example, investigators interested in the opinions of students most interested in alcohol and other drug abuse issues might ask the professors of key courses to suggest students to be interviewed. Those students might in turn be asked to suggest other students who share their interest in alcohol and other drug abuse topics. This process could continue until a sample of sufficient size has been identified or until most of the suggested informants have already been named by others.
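
A minimal sketch of the nomination process just described. The informant names and nominations are invented; in a real study they would be gathered during interviews.

```python
def snowball_sample(seeds, nominate, max_size):
    """Expand a sample by asking each informant to nominate others, until the
    target size is reached or no new names are suggested."""
    sample = list(seeds)
    frontier = list(seeds)
    while frontier and len(sample) < max_size:
        informant = frontier.pop(0)
        for name in nominate(informant):       # ask the informant for suggestions
            if name not in sample:             # keep only newly named people
                sample.append(name)
                frontier.append(name)
    return sample[:max_size]

# Hypothetical nomination data.
nominations = {"Prof. X": ["Student 1", "Student 2"],
               "Student 1": ["Student 3"],
               "Student 2": ["Student 1", "Student 4"],
               "Student 3": [], "Student 4": ["Student 2"]}

print(snowball_sample(["Prof. X"], nominate=lambda p: nominations.get(p, []), max_size=5))
```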


ISSUES RELATED TO SAMPLE SIZE

One of the most commonly asked questions in planning an evaluation is: "How large a sample is required?" Actually, there are two different questions related to sample size. The first question is: "How big a sample is required in order to make reasonable generalizations from the sample to the larger population?" The second question is: "How big a sample is needed in order to make meaningful inferences about effects?"

Sample Size and Generalizability

The first question relates to the issues of sampling error discussed earlier -- that is, how well the sample represents the population. In general, the larger the sample size, the smaller the sampling error, and hence, the better the representation of the population of interest. Specific formulae for calculating sampling error for different sample types and strategies can be found in Kish (1967). As in most things, a law of diminishing returns applies to sample sizes. As sample size increases, the relative increase in precision derived from each additional case decreases. Thus, evaluators sometimes refer to the "net gain from sampling," a comparison between the increased cost in terms of data collection, data analysis, etc., of each additional case and the concomitant benefit in terms of increased precision.
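
The diminishing returns described above can be illustrated with a minimal sketch using the familiar standard error of a proportion from a simple random sample; the 0.5 proportion is simply the worst case for illustration, and other sample types would use the formulae in Kish (1967).

```python
import math

def standard_error(p, n):
    """Approximate sampling error of a proportion from a simple random sample."""
    return math.sqrt(p * (1 - p) / n)

# Illustrative worst case: a true proportion of 0.5 maximizes the sampling error.
p = 0.5
for n in (25, 50, 100, 200, 400, 800):
    se = standard_error(p, n)
    print(f"n={n:4d}  standard error={se:.3f}  approx. 95% margin={1.96 * se:.3f}")
```

Doubling the sample repeatedly shrinks the error by smaller and smaller amounts, which is the "net gain from sampling" trade-off in numerical form.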

When sampling strategies are employed that yield samples for which sampling error cannot be measured, the sample size is largely a practical judgment based on the available resources for the evaluation, and a consensus among those who will use the evaluation results. Here, the issue becomes one of negotiating the number of cases needed to instill confidence in those individuals who will be using the information the evaluation produces.

Sample Size and Statistical Power

The second question relates to statistical power, that is, the ability of inferential statistics to detect effects when such effects exist. The power of a statistical test represents a complex relationship among three factors:

The probability of mistakenly rejecting the hypothesis that there is no effect when, in fact, no effect exists (referred to as alpha) -- as alpha increases, power increases

The magnitude of the true effect size in the population -- as the magnitude of the effect increases, power increases

The sample size -- as the sample size increases, power increases.

For a given level of power, knowledge of two of these three factors allows a calculation of the third. Alpha is set by the evaluator. Thus, knowledge of the true effect size allows a calculation of the sample size needed to obtain a given level of power. Conversely, for a given sample size, the minimum detectable effect can be estimated. Procedures for making these calculations are found in Cohen (1969) and Cohen and Cohen (1975).

Unfortunately, the true effect size is generally not known. Cohen and Cohen (1975) suggest three strategies which allow a calculation of sample size in the absence of a priori knowledge of the true effect size. First, other similar studies may be consulted in order to ascertain the probable effect size in a given situation. For example, in order to estimate the expected change in learner knowledge scores as a result of participation in a given course, one might examine changes observed in other studies of similar courses.

Second, one might consider the size of effect that would make a practical difference -- e.g., how big a change in clinical practice behavior would be required to make it worth implementing a given strategy to improve a given clinical skill?

Finally, Cohen (1969) has suggested conventions for "small," "medium," and "large" effect sizes based on the effects generally observed in the social sciences. In the absence of other information, sample sizes may be calculated for all three "conventional" effect sizes in order to establish a range of sample sizes for a given evaluation study.
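
A minimal sketch of such a calculation for a two-group comparison of means, using the normal approximation rather than the exact tables in Cohen (1969); alpha of .05, power of .80, and the conventional effect sizes (d = 0.2, 0.5, 0.8) are assumptions chosen for illustration.

```python
from statistics import NormalDist   # Python 3.8+ standard library
from math import ceil

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-group comparison
    of means, using the normal approximation to the t-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the chosen alpha
    z_power = z.inv_cdf(power)           # quantile corresponding to desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Cohen's conventional "small", "medium", and "large" effect sizes.
for label, d in (("small", 0.2), ("medium", 0.5), ("large", 0.8)):
    print(f"{label:6s} (d={d}): about {n_per_group(d)} cases per group")
```

Running the three calculations establishes the range of sample sizes described above; an evaluator would still verify the figures against Cohen's tables or power-analysis software before committing resources.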

THE VALIDITY OF EVALUATIONS AND POTENTIAL SOURCES OF BIAS

The overall goal of the various social science methods used in evaluation is simply to ensure, to the greatest extent possible, that the information produced is accurate enough to yield valid conclusions useful in decision making. Stated another way, the goal of these methods is to reduce, to the greatest extent possible, the biases which may affect the validity of evaluation conclusions.

Issues Related to External Validity

All evaluations seek to maximize external validity -- that is, the extent to which the results of a given evaluation are generalizable beyond the highly specific conditions under which the results are produced. External validity will be of concern whether the evaluation is a management or monitoring system, a descriptive evaluation, or a complex assessment of program outcomes or impact. In general, the biases that impair external validity are to be found in the ways that samples are drawn and measurement is conducted.

Sampling Biases: Several issues related to sampling have already been discussed. As suggested, various sampling techniques yield samples which estimate population parameters with varying degrees of precision, and hence, will yield evaluation results with varying degrees of external validity. When samples are not systematic, and often they are not, biases will be introduced that limit the meaningfulness of results beyond the specific population from whom data are gathered. At the very least, these biases should be acknowledged in reporting evaluation results -- better still, attempts should be made to estimate and/or correct for their effects.

Two of the most common sample biases that arise in program evaluation -- and these obtain no matter how the initial sample is drawn -- are refusal bias and attrition bias. Refusal biases may occur whenever a study population is given the option to refuse to participate in some or all of the data collection activities. The ethics of evaluation (and the rules governing all Federally funded research) require that subjects be given the option to refuse cooperation, to terminate cooperation at any time, and to refuse to answer individual questions. Although careful attention to the rights and concerns of study participants can reduce refusal, some refusal is inevitable in most evaluation data collection. A common example of refusal bias is non-return of mail surveys. The question then arises: "How are subjects who refused to provide data different from those who did not refuse?" That is, what biases are introduced by refusal?

There are a number of important ways in which those who refuse to cooperate may differ from those who do not. For example, it might be assumed that the patients most likely to refuse to participate in a survey of the incidence of alcohol and other drug problems in a given clinic population are those with the most serious alcohol or other drug problems. Estimates of problem rates will be biased if those with the worst problems refuse to cooperate. A completely different sort of bias might be introduced if those with the least education refuse to cooperate to avoid the embarrassment of admitting that they cannot read or do not understand the data collection instrument.

Refusal biases may also arise in system level studies. For example, one or more departments might refuse to participate in a study of departmental support for the introduction of substance abuse-related curricular content. Is such a refusal the result of a desire to prevent the discovery of existing problems? Alternately, does the department chair believe that the department is already doing an adequate job in this area and that no benefit will accrue to the department from the study? In either case, the study would be biased by the lack of data from the non-cooperative department(s).

Related to refusal biases are attrition biases -- those biases that arise when study participants do not complete the study. Like refusal, attrition raises the concern that those who do not complete the study may be systematically different from those who do. So, for example, a learner satisfaction survey taken at the end of a course may show inflated satisfaction because it does not include those highly dissatisfied individuals who dropped out. Similarly, a comparison might be made of lifestyle changes observed in patients counseled by clinicians who have or have not been exposed to an educational innovation. This comparison will be biased if attrition of heavy drug or alcohol users is differential between the two clinician groups.

Assessment of the impact of refusal or attrition biases generally relies on comparisons of the characteristics of cases from which complete data are available and those cases from which they are not. Of course, such a strategy requires some minimum level of data on lost participants. Thus, even patients who refuse to complete a questionnaire or interview might be asked if they would be willing to provide basic demographic information. Sometimes, assessment of the effects of refusal or attrition biases must be based on very rudimentary data such as race, sex, and approximate age. Such data might be gathered from case records or even through observation.
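
A minimal sketch of such a comparison using rudimentary data; the participant records below are invented, and a real assessment would tabulate whatever basic characteristics are actually available for completers and non-completers.

```python
from collections import Counter

# Hypothetical rudimentary data on study participants: (sex, completed_study).
participants = [("F", True), ("M", True), ("F", True), ("M", False),
                ("F", True), ("M", False), ("M", True), ("F", False),
                ("M", True), ("F", True), ("M", False), ("F", True)]

def completion_rates(participants):
    """Tabulate completion rates by a basic characteristic (here, sex) so that
    completers and non-completers can be compared for obvious differences."""
    totals, completed = Counter(), Counter()
    for sex, done in participants:
        totals[sex] += 1
        if done:
            completed[sex] += 1
    return {sex: completed[sex] / totals[sex] for sex in totals}

# Markedly different rates would suggest that attrition (or refusal) is not random.
print(completion_rates(participants))
```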

Measurement Biases: Accurate generalizations are possible only to the extent that accurate measures are used in the evaluation. In general, the external validity of evaluation studies is threatened when measures either fail to accurately represent the phenomenon of interest (so-called failures of construct validity), or systematically over- or underestimate the phenomenon of interest. Of course, measures may also be "noisy" or unreliable. However, low reliability is generally considered to be a threat to internal, rather than external validity.

The issue of construct validity is a major concern in evaluations of substance abuse programming. How does one measure (or even define) such key constructs as alcoholism, addiction, dependency, denial, social drinking, and so on? To say that a given intervention is successful in identifying "social drinkers" but not alcoholics requires a consensus about what is meant by these terms, and a consensus that the measurement techniques used to assess them are adequate representations.

Even as simple a construct as whether or not a learner completed a course raises definitional and measurement issues. Does the learner have to attend all sessions? If not, how many may he or she miss? Does the learner need to do all the required reading in order to be categorized as a course completer?

Assuming that a satisfactory definition of a given variable is available, and assuming that one can adequately measure the variable in some practical way, concern may still be raised about systematic over- or underestimation. For example, if a learner's diagnostic classifications are being compared to the results of a written assessment tool, does the assessment tool over- or underestimate the number of patients with a given diagnosis?

Issues Related to Internal Validity

Internal validity refers to the extent to which a given evaluation design allows one to rule out alternative explanations for the effects observed. In general, then, internal validity is of concern when one is attempting to derive causal relationships among variables (e.g., participation in a given practicum caused improvements in clinical skills). The various threats to internal validity may be defined in terms of various challenges to the assertion that "exposure to X caused effect Y." Discussions of the common threats to internal validity follow.

History: Historical threats to internal validity refer to specific events or conditions -- in addition to the treatment or experimental influence -- which occur between first and subsequent measurements of the dependent or outcome variable, and which might influence the magnitude of observed differences.


For example, an observed change in attitudes regarding alcohol and other drug abusers might be taken as evidence of the success of an educational intervention. However, such changes might actually have resulted from a widely publicized incident such as the overdose death of a famous athlete or a major oil spill attributed to alcohol abuse. Similarly, increased identification of drug abusers in the emergency room might result from the increased diagnostic skills of E.R. staff, but might also be attributable to actual increases in the number of drug abusers who are present at the E.R. during the study period.

Maturation: Maturation refers to changes in study participants between pretest and subsequent testing which influence the observed outcome, but are not a part of the program or treatment of interest. Participants may grow older, wiser, stronger, or undergo biological change.

For example, it is well known that drug and alcohol use generally increases throughout the teenage years, peaks in early adulthood, and then declines. Thus, depending upon the age of patients in a study, use rates may either increase or decrease over time independent of the exposure to any clinical intervention. Similarly, most psychopathology (including alcohol and other drug problems) is associated with a "spontaneous recovery rate" -- i.e., some individuals will improve in the absence of any treatment.

Testing: Sometimes, simply taking a test affects participants' subsequent performance on the same test. In pretest/post-test designs, the concern is with the influence of the pretest experience on the post-test score. For example, pretesting learners' attitudes concerning alcohol and other drug abusers may sensitize learners to the fact that attitude change is desired. The potential influence of testing is even more likely in time series or repeated measurement designs where multiple testing is required.

Instrumentation: Instrumentation artifacts arise when changes occur in measurement instruments or procedures over repeated observations. For example, measures derived from observations of clinician-patient encounters may change in some unknown way as the study progresses and repeated observations are made. Observers may become more skilled, bored, or inferential as they rate samples of clinician behavior. Similarly, if measures are based on interview data, the quality of the data may change as interviewers gain experience and confidence, become more familiar with the interview schedule, or gain insight into the content of the interviews.

"Objective" data sources are also subject to instru-mentation artifacts. For example, patient recordkeeping systems may be altered or improved, newor refined diagnostic categories might be added tostandardized nosologies, laboratory tests maybecome more sensitive, and so on.

Statistical Regression: Statistical regression artifacts may arise when study participants are classified into or selected from extreme groups on the basis of pretest scores, correlates of pretest scores, or some other basis. When participants are assigned to groups on the basis of high or low pretest scores, the high groups will tend to score lower, and the lower groups higher, at the post-test. For example, if patients are assigned to a specialized intervention based on the severity of their alcohol or other drug problems, these individuals will tend to exhibit fewer problems in the future due simply to statistical regression. Similarly, the often observed effect in psychotherapy that very disturbed individuals tend to improve more than less disturbed individuals is probably due to statistical regression effects rather than to a relationship between the extent of psychopathology and the effectiveness of psychotherapy.

Statistical regression effects are especially likely if the pre- and post-tests are unreliable, or include a considerable amount of measurement error. Under these conditions, changes from pre- to post-test are very likely due to statistical regression, and attributing such changes to program influences would almost certainly be incorrect.
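
A small simulation makes the artifact concrete. The true-score and error values below are arbitrary; the point is that participants selected for extreme pretest scores on an unreliable measure drift back toward the mean at post-test even though no intervention occurs.

```python
import random
from statistics import mean

random.seed(0)

# Each participant has a stable "true" severity score; observed scores add
# measurement error, which makes the pre- and post-tests unreliable.
true_scores = [random.gauss(50, 10) for _ in range(2000)]
pretest  = [t + random.gauss(0, 10) for t in true_scores]
posttest = [t + random.gauss(0, 10) for t in true_scores]   # no treatment applied

# Select the "most severe" cases on the basis of high pretest scores.
cutoff = sorted(pretest)[-200]                      # roughly the top 10 percent
selected = [i for i, score in enumerate(pretest) if score >= cutoff]

print("mean pretest of selected group :", round(mean(pretest[i] for i in selected), 1))
print("mean posttest of selected group:", round(mean(posttest[i] for i in selected), 1))
# The post-test mean falls back toward 50 even though nothing was done.
```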


Selection: As used here, selection refers to a treatment effect due to a lack of initial equivalence between study groups. For example, the effects of an innovative pedagogical strategy might be tested by recruiting volunteers for the new strategy from all learners enrolled in a particular course. Under these conditions, differences observed between the two groups of learners might be due to initial differences between those who volunteer and those who do not, rather than to any particular benefit of the new strategy.

Mortality: Mortality artifacts arise when participants drop out as the evaluation progresses from inception to completion. Here, differences between pretest assessments and subsequent assessments may reflect changes in the study population rather than any effects of the intervention.

In comparisons among strategies, there is always a concern that mortality will be differential -- that is, different types of participants will be lost in the different treatment conditions. For example, a comparison might be made between two methods for improving learner attitudes towards abusers. The first might involve a series of lectures, while the second might include lectures plus guided discussion of the learners' personal experiences with alcohol and other drug abuse and with abusers. The latter strategy might cause learners with the most negative attitudes or most serious personal problems sufficient discomfort that they drop out of the course. Thus, differential changes in attitudes in the two groups would be observed, but these changes could not reasonably be attributed to differences between the two educational methods.

Interactions with Selection: When cases are not randomly assigned to conditions, the possibility exists that certain characteristics of study groups will interact with other threats to internal validity and produce additional spurious treatment effects. For example, the influence of maturation on a treatment group might be different from that on a comparison group if the treatment and comparison groups are not initially equivalent. This might be the case, for example, if patients exposed to a treatment are drawn from two clinics that differ in their age distributions.

Ambiguity About the Direction of Causal Influence: The direction of causal influence is a threat to internal validity in correlational studies, especially when an equally plausible argument can be developed for the conclusion that A causes B, or B causes A. For example, a study might assess the association between learner satisfaction with a course and the degree to which the learners apply the course content in clinical practice. One might conclude from such a study that learners who liked the course are more likely to apply it than learners who do not. Equally plausible, however, is the conclusion that learners who apply the content are more likely to report that they liked the course than are learners who do not apply the content.

COMPARISON AND CONTROL GROUPS

Imagine the following scenario:

Learners entering an innovative clinical practicum are given a battery of assessments, including alcohol and other drug knowledge and attitudes, accuracy in diagnosing dependency, and frequency of referrals of patients to treatment. Following the practicum, the assessment battery is repeated. Comparisons of the test scores prior to and after the practicum reveal that knowledge and attitudes improved, diagnostic accuracy remained about the same, and frequency of referrals went down.

What can be concluded about the effects of the practicum from the results obtained? Based on the material discussed thus far, the answer is: "Not much!"

The reason that not much can be concluded from this hypothetical study is that there is no basis for ruling out a variety of other explanations of the observed results -- that is, no way of assessing threats to internal validity that may have caused the observed changes (or lack of change) in the absence of any program effects. For example, the observed improvement in knowledge and attitudes might be a result of the learners' exposure to articles appearing in widely read journals, while the observed reductions in referrals might result from the fact that a major treatment facility stopped accepting new patients (both are historical artifacts). Even the lack of change in diagnostic accuracy is suspect -- perhaps diagnostic accuracy would have gotten worse without exposure to the practicum as learners were assigned increasingly more difficult patients (an instrumentation artifact).

The single major solution to the problems raised by threats to internal validity is the use of control or comparison groups -- that is, the testing of individuals who are identical to the treatment population in every way, except for exposure to the treatment. The results obtained for the control or comparison group are those which are attributable to history, maturation, testing effects, etc. The residual differences observed in the treatment group are those attributable to the treatment. So, for example, if a control group were used in the study discussed, and if these individuals showed decreased diagnostic accuracy over the study period, the practicum would be deemed successful by virtue of having stemmed this rate of decrease.

The difference between a control and a comparison group is that control groups are developed by randomly assigning cases to treatment, while comparison groups are developed in any other way. The great majority of evaluation studies rely on comparison groups rather than randomly assigned controls. Unfortunately, the ability of comparison groups to rule out certain threats to internal validity is often weak, and statistical methods for adjusting initial group differences, although commonly employed, can never approximate the inferential power that derives from true control groups. On the other hand, a non-randomly created comparison group is much better than no comparison at all, and may be the only practical option in many evaluations.
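
A minimal sketch of how the hypothetical practicum results might be summarized once a control or comparison group is in hand: the control group's pre-to-post change stands in for history, maturation, and testing effects, and the remaining change in the treatment group is the part plausibly attributable to the practicum. The scores are invented, and a real analysis would also apply an appropriate statistical test.

```python
from statistics import mean

# Hypothetical pre/post knowledge scores for practicum participants and controls.
treatment = {"pre": [61, 58, 65, 70, 55, 63], "post": [74, 70, 78, 80, 66, 75]}
control   = {"pre": [60, 62, 57, 68, 59, 64], "post": [63, 66, 60, 70, 61, 66]}

def mean_change(group):
    """Average pre-to-post change for one group."""
    return mean(post - pre for pre, post in zip(group["pre"], group["post"]))

treatment_change = mean_change(treatment)
control_change = mean_change(control)     # change attributable to history, testing, etc.

print("treatment group change:", round(treatment_change, 1))
print("control group change  :", round(control_change, 1))
print("difference roughly attributable to the practicum:",
      round(treatment_change - control_change, 1))
```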

A major roadblock to the use of either control or comparison groups is the fact that, for some individuals, the opportunity to participate in a given program (i.e., the one under study) must be withheld. In studies of patient outcomes, withholding a treatment seen as potentially beneficial may cause ethical or legal problems. However, a number of options exist for constructing control and/or comparison groups which may be acceptable and feasible within a given setting. These are discussed below.

The term "control group" is used throughout thediscussion that follows to refer to both control(randomly created) and comparison (non-randomlycreated) groups. Where no distinction is made,either a random or non-ran lom assignment strat-egy is possible. -- i.e., either a control or compari-son group is possible. In cases where only a non-random assignment strategy is possible, the term"comparison group" is used.

The True No-Treatment Control

In some evaluations, the most desirable control condition is one in which some cases simply do not receive the treatment under study. For example, participants in an innovative substance abuse practicum might be randomly selected from among those learners who express an interest in participating. Those not selected to participate serve as the control group. Such a strategy is often justifiable on practical grounds when the number of interested learners exceeds the number of available slots.

The Waiting List Control

The waiting list control is implemented by assigning some cases to receive the program now and some cases to receive the program later. All cases are pretested before the "now" group receives the program and are post-tested after the "now" group completes the program. The "later" group thus serves as a non-treated control during the waiting period. As is the case with the no-treatment control, the waiting list control provides a practical solution when demand for a program exceeds supply. The waiting list control has the disadvantage that the waiting list may not truly provide a "no treatment" comparison in that individuals may receive some "stop gap" services. In addition, it precludes any long-term comparisons. As soon as those on the waiting list enter the program, all comparisons between the groups are nullified.

The "Traditional Treatment" Control

The "traditional treatment" control compares aninnovative strategy to an existing strategy (tradi-tional treatment). For example, an innovativetraining program might be compared to an existingtraining program. Although this design does notallow an assessment of the absolute effects ofeither the "traditional" or the innovative strategy,an assessment of the relative effectiveness is avail-able.

The Minimal Treatment Control

The minimal treatment control is one in which minimally acceptable treatment (defined by law, licensing guidelines, program standards, or ethical considerations) is used as a comparison. For example, in-depth anticipatory guidance for parents of adolescents could be compared to a condition in which parent education pamphlets are simply available in a clinic waiting room.

The Placebo Control

When a control treatment is so minimal that it is expected to have no effect whatever, it is referred to as a placebo control. Sometimes, placebo controls are used to determine the potential operation of so-called "Hawthorne Effects" -- that is, the effects of simply paying attention to study participants.

The Non-Eligible Comparison Group

Some evaluation studies use individuals who are not, for one reason or another, eligible to receive the treatment as a comparison group. The term "comparison group" is used here because assignment to such a group is necessarily non-random. For example, learners whose schedules preclude them from taking an elective course on patient interviewing might constitute a group to which learners who take the course may be compared. This strategy presents a number of conceptual problems since the criteria that determine eligibility or non-eligibility may be related in a variety of ways to the outcomes of interest.

MEASURING OUTCOMES

Faculty development and clinical training programs seek to attain a variety of outcomes, including improvements in knowledge, attitudes, and clinical skills, organizational change, and improvements in patient outcomes. The specific outcomes measured in an evaluation of any aspect of a faculty development or clinical training program will be determined by three factors: 1) the specific objectives of the program, 2) the questions the evaluation is designed to answer, and 3) the scope and time frame of the evaluation effort.

The relationship between program objectives and program outcome measures cannot be emphasized too strongly. Perhaps the most common error in all outcome evaluations is the use of measures that do not relate directly to what the program is attempting to accomplish. For example, improved awareness of learners' own use of alcohol and other drugs is a worthwhile goal, and the evaluator may be motivated to measure changes in this area as part of the outcome evaluation of an educational innovation. But if the program is not specifically designed to increase awareness (for example, if the innovation is designed to teach pharmacology), the choice to measure awareness may result in an artificial finding of program ineffectiveness -- i.e., awareness does not increase because there is no particular reason why it should.

Evaluators must also consider the use to which outcome information is to be put. In some cases, such as a summative evaluation of a well established program or program component, highly precise (and probably expensive) measures of outcome may be required. Indeed, as discussed below, multiple, independent outcome measures would be appropriate in such an evaluation. In other cases, however, less precise (and less expensive) measures may be sufficient. For example, a preliminary assessment of the effectiveness of an innovative educational strategy might be accomplished by asking for student feedback about whether the innovation seemed useful and appropriate. Such measures may be all that is required to evaluate whether or not the program element is worth developing further.

Finally, the scope and the time frame of the evaluation must be considered. For example, long-term patient outcomes might be a very important indicator of the success of an effort to improve clinical skills. However, such an assessment may require studying large numbers of patients or following patients for significant periods of time. This is especially the case if the incidence of alcohol and other drug problems in the patient population is low -- e.g., adolescents or pre-adolescents.

It is also the case that many outcome measures (e.g., the percentage of patients for whom alcohol and drug histories are taken) might show significant changes immediately following exposure to a seminar on history taking. However, these changes may decay within a short period of time. Therefore, if a reasonable follow-up period cannot be accommodated, the evaluator may choose to deemphasize such measures in order to avoid basing program decisions on highly unstable outcomes.

Generally, the outcomes of faculty development and clinical training programs are measured in one of four ways: 1) self-reports, 2) other-reports, 3) observation of behavior, and 4) review of records. The strengths and weaknesses of each of these measurement techniques follow.

Before proceeding to a discussion of each of these types of outcome measures, it is important to note that the most conclusive evaluations of outcome rarely rely on a single type of outcome measure. This is because all outcome measures have weaknesses that limit the strength of conclusions based on them. Thus, when multiple measures are employed, the strengths of one measure may balance the weaknesses of another. Moreover, to the extent that the various measures coincide (i.e., suggest the same conclusion), the inferential strength of the evaluation increases. The use of multiple measures is an example of "triangulation" in social science research.

Self-Report Measures

Self-report measures include those in which respondents are asked to assess the quantity or quality of their own behavior, or to provide a variety of other information that may be used to assess knowledge, attitudes, demographics, and so on. The usual methods for gathering self-report data are paper-and-pencil questionnaires and interviews.

The major advantages of self-report data are their ease of collection and relatively low cost, and their generally high reliability -- i.e., freedom from internal measurement error. In addition, self-reports are the only practical way to assess certain outcomes -- such as changes in attitudes -- which are not directly observable.

The major disadvantage of self-reports is their questionable validity -- i.e., the extent to which they provide accurate information. The validity of self-reports is the topic of considerable debate, especially when used to assess value-laden behaviors. However, self-reports constitute a useful and acceptably valid measurement methodology if a number of precautions are taken.

First, the confidentiality of responses must be closely guarded, and the procedures for protecting confidentiality emphasized to respondents. Second, the importance of the data collection and the respondent's role in the data collection should be emphasized. If the respondent feels that he or she is a key "partner" in the evaluation effort, the probability of providing valid data will be increased.


Finally, self-report data collection instruments should be carefully pilot-tested on respondents similar or identical to those who will participate in the evaluation. As part of the pilot-testing procedure, respondents should be debriefed concerning their willingness to answer all sensitive items truthfully, and suggestions should be solicited for ways that offensive items may be improved. This last may sound simplistic, but experience suggests that pilot respondents are a key source of ideas for ways to improve the validity of self-reports.

Other-Reports

Other-reports are identical in all respects to self-reports, except that data are gathered about a target individual from someone other than the target himself or herself.

One source of other-reports is individuals who have no stake in the performance of the target individual. Such other-reports may be particularly useful in assessing changes in clinical practice behavior. For example, rather than relying entirely on clinicians' reports regarding whether or not they questioned patients about alcohol and other drug use, the evaluator may conduct exit interviews with patients.

Other-reports may also be gathered from individuals who know the target individual well (spouse, other relative, close friend, etc.). Such reports may be useful in studies of patient outcomes, where patient self-reports may provide biased estimates of alcohol or other drug use.

Behavioral Observations

Observations may be designed to measure some of the same behaviors assessed through self-reports or other-reports. However, observational data collection relies on the ratings or assessments of trained and unbiased observers. For example, observations of provider/patient interactions can be used to assess interviewing or counseling skills, or to assess patient reactions to the practitioner. Similarly, observation of a lecture or seminar can assess the skill of the instructor, the extent to which the sessions accomplish the objectives set forth in a curriculum guide, the nature of learner interaction, and so on. The major disadvantage of observational techniques is that the presence of the observer may change or inhibit the behavior of those being observed. Thus, the use of observers is most advantageous when it is unobtrusive.

One variant of observational data collection which overcomes some of its disadvantages is the use of simulated or standardized patients. As discussed earlier, this technique employs actors trained to present a standardized complaint. Depending on the evaluation design employed, the clinicians being assessed may either know that they are interacting with a simulated or standardized patient, or they may simply be told that such patients will be intermingled with actual patients seen by the clinicians.

Records Data

Health and mental health settings keep a variety of records on patients which may be used to assess the outcomes of faculty development and clinical training programs. Admissions records, patient charts, and insurance claims records all can be used to indicate the extent to which alcohol and other drug issues are being addressed with patients.

For example, the International Classification of Diseases, 9th Edition, Clinical Modification (ICD-9-CM) is one system used in many medical settings for the preparation of third-party reimbursement documentation or in medical record keeping. ICD-9-CM provides a standardized, detailed classification of mortality and morbidity information, as well as classification of diagnostic, therapeutic, and surgical procedures. The ICD-9 diagnostic codes relating to substance abuse cover a wide range of morbidity. Separate codes are allotted to abuse and dependence, and sub-codes within these classifications include alcohol abuse and dependence as well as abuse of and dependence on various classifications of drugs, including tobacco and various poly-drug combinations. Separate categories are allotted to acute intoxication, withdrawal, medical and psychiatric sequelae, fetal complications, and personal history of alcoholism. Still other codes are used to indicate substance abuse as a contributing factor to other morbidity. ICD-9 procedure codes related to substance abuse include alcoholism and drug counseling, and referral to alcoholism and drug addiction rehabilitation.
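
A minimal sketch of how charts might be screened for substance-abuse-related diagnoses. The patient records are invented, and the code prefixes shown (291, 292, 303, 304, 305) are only the commonly cited alcohol- and drug-related ICD-9 categories; a real study would verify the complete code list against the ICD-9-CM manual before coding.

```python
# Hypothetical patient records: (patient_id, list of ICD-9-CM diagnosis codes).
records = [
    ("P001", ["401.9", "303.90"]),        # includes alcohol dependence
    ("P002", ["250.00"]),
    ("P003", ["305.00", "V11.3"]),        # alcohol abuse, unspecified
    ("P004", ["304.20", "291.81"]),       # cocaine dependence, alcohol withdrawal
    ("P005", ["486"]),
]

# Illustrative alcohol- and drug-related ICD-9 category prefixes.
SUBSTANCE_PREFIXES = ("291", "292", "303", "304", "305")

def flags_substance_issue(codes):
    """True if any diagnosis code falls in a substance-abuse-related category."""
    return any(code.startswith(SUBSTANCE_PREFIXES) for code in codes)

flagged = [pid for pid, codes in records if flags_substance_issue(codes)]
rate = len(flagged) / len(records)
print(f"{len(flagged)} of {len(records)} charts ({rate:.0%}) document a substance abuse issue")
```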

A general disadvantage of records data is that they are usually gathered for purposes other than evaluation. Thus, the data they contain may be incomplete or inaccurate, or the record keeping methods may change during the course of an evaluation. Insurance records appear to be more carefully maintained than other medical records in terms of fiscal information, but are probably subject to the same reporting errors and biases as other patient records.

QUALITATIVE EVALUATION METHODS

Data Collection Techniques

As discussed previously, certain types of evaluation questions, especially more general, exploratory questions, are most appropriately evaluated using qualitative methods. The qualitative methods most often used are observation, participant observation, and qualitative interviewing.

Observation. Behavioral observation was discussed in the previous section. Such observation can be highly structured and quantified. For example, observers might count the number of times per minute a clinician makes eye contact with a patient in order to test hypotheses concerning the quality of clinician-patient interaction. On the other hand, if the evaluator has not developed specific hypotheses, more general observations might be carried out: The evaluator might want to observe clinician-patient interactions as a way of developing a categorization of different types of intervention approaches or different types of patient reactions to interventions. Or the evaluator might want to learn something about the general tone of interactions, or get an idea of whether and how learners are putting into practice the techniques suggested in training. Here, a structured observation protocol listing the general issues to be addressed in the observation sessions could be developed to guide the data collection.

Participant Observation. The examples of observation techniques described above relate to observers who sit quietly in the background and do not become involved. Participant observers, on the other hand, become immersed in the situation, experiencing it, at least to some extent, as an actual participant would. For example, the participant observer might actually go through an educational intervention along with the other learners in order to experience first hand what the learning experience is like and to hear the reactions of other participants. Such observations would be useful in evaluating the quality of implementation of an intervention, developing hypotheses about the outcome of the intervention, etc.

Observation and participant observation can be impressionistic and even casual in initial exploratory phases of evaluation. However, these observation techniques can also be quite systematic, with the periods of observation precisely selected and the observations carefully documented. The usual form of documentation is detailed field notes describing the setting, the participants, conversations and events, as well as the observer's reactions. The notes should be written immediately after the event. Ample time should be allocated to the preparation of notes: it may take as much time to write the notes as it did to observe the events.

Qualitative Interviews. Qualitative interviews seek in-depth data from the respondent's perspective. The interviewer usually works from a list of general topic areas to be covered, but the questions used are open-ended. The interviewer does not structure the responses or constrain the information the respondent provides.


QUALITATIVE ANALYSIS

The purpose of qualitative research is to understand rather than to enumerate. Therefore, data collected using qualitative techniques yield a narrative analysis that can take a number of forms. For example, responses to an open-ended question can be reproduced entirely, without comment. Responses can be organized by topic (e.g., all comments dealing with learners' apprehensiveness about discussing alcohol and other drug issues with patients). Qualitative data also lend themselves very well to the development of descriptive case studies. When more than one case is described, they can be compared and contrasted. Frequently, the results of a qualitative study lead to more specific hypotheses and typologies that can be tested using quantitative data collection and analysis. The qualitative data can then be used to enrich quantitative reports with illustrations and examples.
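
A minimal sketch of the "organized by topic" option, using simple keyword matching to sort open-ended responses. The responses and keyword lists are invented, and real qualitative coding rests on analyst judgment rather than keyword matching alone.

```python
from collections import defaultdict

# Hypothetical open-ended responses from learner interviews.
responses = [
    "I still feel apprehensive raising drinking with my patients.",
    "The role plays made it easier to ask about drug use.",
    "I worry that patients will be offended if I bring up alcohol.",
    "More time on referral resources would help.",
]

# Illustrative topic keywords; a human coder would refine these iteratively.
topics = {
    "apprehensiveness": ["apprehensive", "worry", "offended"],
    "skills practice":  ["role play", "easier", "ask"],
    "resources":        ["referral", "resources"],
}

def group_by_topic(responses, topics):
    """Attach each response to every topic whose keywords it mentions."""
    grouped = defaultdict(list)
    for text in responses:
        lowered = text.lower()
        for topic, keywords in topics.items():
            if any(keyword in lowered for keyword in keywords):
                grouped[topic].append(text)
    return dict(grouped)

for topic, comments in group_by_topic(responses, topics).items():
    print(topic, "->", comments)
```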


CONCLUSION

The recent interest in further integrating alcohol and other drug issues into the professional training of health and mental health professionals promises to improve the health care system's response to a major cause of mortality and morbidity. Careful evaluations of faculty development and clinical training programs in alcohol and other drug abuse can help ensure that the growth of these programs is planned and rational, and that the strategies shown to be successful in one institution can be exported to others.

It is our hope that the materials presented in this Guide Book contribute to an understanding of what evaluations can and cannot accomplish, and that interested readers will examine the sourcebooks we have recommended in their further explorations of evaluation issues.

NOTES

French, J., and Kaufman, N. (Eds.). Handbook of Prevention Evaluation: Prevention Evaluation Guidelines. Rockville, MD: National Institute on Drug Abuse, 1981. (The discussions of sampling and of threats to internal validity are adapted in part from this source.)

Kish, L. Survey Sampling. New York: Wiley, 1967.

Cohen, J. Statistical Power Analysis for the Behavioral Sciences. New York: Academic Press, 1969.

Cohen, J., and Cohen, P. Applied Multiple Regression Analysis for the Behavioral Sciences. New York: John Wiley and Sons, 1975.

Webb, E. Unconventionality, triangulation, and inference. In N. Denzin (Ed.), Sociological Methods: A Sourcebook. New York: Aldine Press, 1970.