
Applying the Concepts of Validity, Reliability, and other Standards for Educational and Psychological Testing to Assessment of Program Student Learning Outcomes

Jeremy Penn, Ph.D., Director, University Assessment and Testing

Faculty Certificate Program / Graduate Student Endorsement in Program Outcomes Assessment

• Participate in 8 out of 10 workshops
  – 8 offered this year
  – Some limited substitutions allowed
• Create or modify an assessment plan for a program (can be hypothetical)
• Certificate and $500 (faculty) / $100 (graduate student) award upon completion

[Image slides: Pfungst (1911), p. viii]

Inference Quality

[Diagram: students, faculty, curriculum, and assessment; sample inference of “poor writing skills” from a grade of “A”]

Standards for Educational and Psychological Testing

• Reliability
• Validity
• Test development
• Scales, scores, comparability
• Fairness in testing and test use


APA, AERA, NCME (1999)

Why pay attention to the Standards?

• Help avoid drawing misleading or inappropriate inferences
• Provide guidance on test selection / development
• Protect students’ rights
• Fairness in testing and test use
• Guidance on assessment development and score reporting
• Implications for public policy


Using the Standards
• Reasonable to expect high-stakes testing, admissions testing, licensure, critical decisions, etc. to carefully follow the Standards
• Some specific Standards are more salient in some contexts than others
• May be unreasonable to expect every quiz, test, interview, or portfolio to follow every element of the Standards
• May not expect substantial evidence of validity and reliability for a 10-point quiz
  – However, the principles should be considered when the quiz is developed and used


Using the Standards
• Standards are not a checklist to be marked off
  – “Evaluating the acceptability of a test or test application does not rest on the literal satisfaction of every standard in this document, and acceptability cannot be determined by using a checklist” (p. 4)
• Professional judgment is critical
  – Consideration of the intent of the standard
  – Consideration of alternatives
  – Feasibility of meeting the standard
• Standards under revision
  – Updated version possibly in 2012?


Reliability

“Consistency”

Could be consistently good…


or consistently bad…


Reliability

“Reliability refers to the consistency of such measurements when the testing procedure is repeated on a population of individuals or groups.”

• Multiple raters
• Multiple forms
• Multiple administrations (test-retest)
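For instance, test-retest reliability can be estimated as the correlation between two administrations of the same test. A minimal Python sketch, assuming invented scores for ten students:

```python
import numpy as np

# Hypothetical scores for the same ten students on two administrations
# of the same exam (all values invented for illustration).
first_administration = np.array([78, 85, 62, 90, 71, 88, 65, 79, 93, 70])
second_administration = np.array([80, 83, 60, 92, 69, 90, 68, 77, 95, 72])

# Test-retest reliability is commonly estimated as the Pearson
# correlation between the two sets of scores.
r = np.corrcoef(first_administration, second_administration)[0, 1]
print(f"Test-retest reliability estimate: r = {r:.2f}")
```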


Reliability

Careful consideration should be given to the training of reviewers / scorers.

Inter-rater reliability should be examined and a quality control process implemented.
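One common index of inter-rater reliability is Cohen's kappa, which measures agreement beyond chance. A minimal sketch using scikit-learn's cohen_kappa_score, assuming two raters have scored the same eight portfolios on a 1–4 rubric (data invented):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical rubric scores (1-4) that two trained raters assigned
# to the same eight student portfolios (data invented for illustration).
rater_a = [3, 2, 4, 3, 1, 2, 4, 3]
rater_b = [3, 2, 3, 3, 1, 2, 4, 2]

# Kappa near 1 indicates strong agreement; values near 0 indicate
# agreement no better than chance, a signal that more rater training
# or rubric revision is needed.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```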


Reliability

Efforts taken to control error in exam design:
• Items not prone to multiple interpretations
• Carefully proofread
• Similar instructions given to all students
• Equal difficulty of multiple forms of the same exam


Issues in Reliability

FAIL: Committee asks faculty members to evaluate students using the same rubric.

FIX: Faculty members are trained on the rubric so they will score students consistently.


Validity

“the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests” (p. 9)

“It is the interpretations of test scores required by proposed uses that are evaluated, not the test itself” (p. 9)


Validity

Sources of evidence:
– Test content
– Students’ response processes (related to content)
– Internal structure (factor analysis)
– Relationships to other variables
– The consequences of testing (intended and unintended consequences of score use)
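To make the “relationships to other variables” source concrete: evidence can come from correlating test scores with a relevant external criterion. A minimal Python sketch with invented exam scores and GPAs:

```python
import numpy as np

# Hypothetical data: program exit-exam scores and cumulative GPA for
# the same ten students (all values invented for illustration).
exam_scores = np.array([72, 85, 90, 64, 78, 88, 70, 95, 81, 67])
gpa = np.array([2.8, 3.4, 3.7, 2.5, 3.0, 3.6, 2.9, 3.9, 3.2, 2.6])

# A substantial positive correlation with a relevant criterion supports
# the intended interpretation of the exam scores.
r = np.corrcoef(exam_scores, gpa)[0, 1]
print(f"Correlation with external criterion: r = {r:.2f}")
```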


Validity Implications

• When selecting / designing a test, must consider the possible uses of the test and how scores will be interpreted
• When sharing results, must consider how different audiences may be tempted to misinterpret findings
• Should gather evidence to support findings (multiple measures)


Validity Implications

• Clearly identify the construct (or concepts) the test is intended to measure
• The higher the stakes, the more important it is that test-based inferences are supported with strong evidence of technical quality


Item Analysis (briefly)
• Item difficulty
  – Percentage of students who get the item correct
  – Can indicate a poorly worded / developed item, a poorly taught concept, or actual low ability
• Item discrimination
  – Ability of an item to correctly separate “true” high achievers from “true” low achievers
  – Problem if low achievers get an item correct but high achievers get it incorrect


Item analysis with Excel (see Harnisch, 1983, Journal of Educational Measurement)
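The same computations are straightforward in a short script. A minimal Python sketch, assuming an invented 0/1 scoring matrix, that computes difficulty as the proportion correct and discrimination as a corrected item-total correlation:

```python
import numpy as np

# Hypothetical 0/1 scoring matrix: rows are students, columns are items
# (data invented; a real matrix would come from a gradebook export).
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
])

# Item difficulty: proportion of students answering each item correctly.
difficulty = responses.mean(axis=0)

# Item discrimination: correlation between each item and the total score
# on the remaining items (corrected item-total correlation). Low or
# negative values flag items that separate achievers incorrectly.
totals = responses.sum(axis=1)
discrimination = np.array([
    np.corrcoef(responses[:, j], totals - responses[:, j])[0, 1]
    for j in range(responses.shape[1])
])

for j, (p, d) in enumerate(zip(difficulty, discrimination), start=1):
    print(f"Item {j}: difficulty = {p:.2f}, discrimination = {d:.2f}")
```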

Issues in Validity

FAIL: Assessment Committee selects a standardized test because it is inexpensive and used by many other institutions.

FIX: The test is selected because it is a good match for the intended construct, matches the curriculum at the institution, and has evidence that it supports the inferences the committee wants to draw.


Test Development

• Need a systematic process for developing a local rubric / test / portfolio
  – Define the content or construct
  – Clear instructions for administrators and examinees
  – Careful item / rubric development
  – Training for scorers and quality checking of scorers


Test Development

FAIL: Assessment committee asks faculty members to submit items for a test.

FIX: Assessment committee clearly defines the domain for the test. A committee of faculty members evaluates a large number of items for relevance and suitability for the test.


Scales, Scores, Comparability
• Scales (methods for developing a total score), cut scores, norms (comparison groups), and comparability are developed to assist in interpreting scores
  – Scale calculation and interpretation should be clearly described
  – Norm-referenced and criterion-referenced scoring carry different interpretations
• Norming groups (if used) should be relevant and updated
  – A reasonable process should be used to establish cut scores (e.g., Angoff, Bookmark, etc.; see the sketch after this list)
  – Before scores are compared across alternate forms or other settings, comparability needs to be established
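As an illustration of a systematic standard-setting process, here is a minimal Python sketch of a basic Angoff procedure (judges and ratings invented): each judge estimates, per item, the probability that a minimally competent student answers correctly, and the panel cut score is the average of the judges' totals.

```python
import numpy as np

# Hypothetical Angoff ratings: each judge's estimated probability that a
# minimally competent student answers each item correctly (all invented).
ratings = np.array([
    # item1 item2 item3 item4 item5
    [0.70, 0.55, 0.80, 0.60, 0.65],  # judge 1
    [0.75, 0.50, 0.85, 0.55, 0.60],  # judge 2
    [0.65, 0.60, 0.75, 0.65, 0.70],  # judge 3
])

# Each judge's implied cut score is the sum of their item ratings;
# the panel's recommended cut score is the mean across judges.
judge_cuts = ratings.sum(axis=1)
cut_score = judge_cuts.mean()
print(f"Recommended cut score: {cut_score:.2f} of {ratings.shape[1]} points")
```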


Scales, Scores, Comparability

FAIL: Assessment committee decides students must score above 60% on a test in order to graduate.

FIX: Assessment committee implements a systematic process to develop a cut score on the graduation exam. Evidence is gathered to show that students scoring below the cut score do not have the skills needed.


Fairness in Testing and Test Use

• Fairness as lack of bias
  – Bias occurs when deficiencies in a test or its use result in different meanings for scores earned by students from identifiable groups; avoid prompts or items that may be interpreted differently
• Fairness as equitable treatment in the testing process
  – All examinees should be afforded appropriate testing conditions, including equal access to materials provided by the test developer
  – Respect for confidentiality (protect small-n groups that could be identified in reporting)


Fairness in Testing and Test Use

• Fairness as equality in outcomes of testing
  – Examine the test for comparable pass rates across groups; they are not required to be equal, but differences might reveal possible bias or inequitable treatment that should be investigated (see the sketch after this list)
• Fairness as opportunity to learn
  – Low test scores may result from the examinee not having had the opportunity to learn the material tested (not generally relevant for employment, credentialing, or admissions testing)
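A minimal sketch of that pass-rate comparison, assuming invented group labels and pass/fail outcomes:

```python
from collections import defaultdict

# Hypothetical (group, passed) records for examinees; group labels and
# outcomes are invented for illustration.
results = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", True),
]

# Tally pass rates by group; large gaps do not prove bias, but they flag
# where equitable treatment should be investigated further.
counts = defaultdict(lambda: [0, 0])  # group -> [passed, total]
for group, passed in results:
    counts[group][0] += int(passed)
    counts[group][1] += 1

for group, (passed, total) in sorted(counts.items()):
    print(f"{group}: pass rate = {passed / total:.0%} ({passed}/{total})")
```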


Activity

• Application of the Standards to common higher education scenarios


Upcoming Workshops

Spring Schedule TBA
