Me? Create a Selection Test?
Dr. Dave Whitney, California State University, Long Beach

Our Plan
- Test development
- Item analysis
- Developing cut scores

Test Development: Test Specification
- Specify the type of measure: maximal or typical performance?
- Define the test domain:
  - What construct are you measuring?
  - Is it a general trait, or relative to a specific context?
  - How is this construct similar to, and different from, other constructs?
  - What is the dimensionality of the construct?
- What type of item format will be used?
- What's the appropriate test length?
- How difficult should the test be?
- Are there any practical considerations that must be addressed?

Test Development: MC Item Writing
1. Identify the knowledge element to be measured
2. Write an appropriate stem
3. Write the item key (i.e., the correct response)
4. Write the distractors
5. Evaluate the item for clarity, clues, and other item quality characteristics
6. Edit the item as necessary

Levels of Cognitive Objectives (Bloom, 1956)
- Knowledge: test takers recall concrete facts, terms, and basic concepts. Key item terms: identify, define, recall, state...
- Comprehension: assesses understanding and interpretation of facts and ideas. Key item terms: explain, summarize, interpret...
- Application: assesses the ability to use information to solve novel problems. Key item terms: compute, use, implement...
- Analysis: requires test takers to identify motives or causes, see patterns, and organize the components of a whole. Key item terms: compare, analyze, break down...
- Synthesis: requires test takers to draw appropriate conclusions from information or propose alternative solutions. Key item terms: develop, formulate, plan...
- Evaluation: requires comparing and discriminating between ideas, and applying criteria to make judgments about information, ideas, or the quality of work. Key item terms: review, justify, defend...

Example items:
- Knowledge: Who was the 41st President of the United States?
- Application: What's the most appropriate measure of central tendency for determining the typical sales price of a house in your neighborhood?

Test-wiseness
- The ability to answer items correctly based on clues presented by the items themselves, not on knowledge of the subject matter
- There are individual differences in test-wise skills
- Proper item writing can decrease the influence of test-wise skills

Try these items, taken from "Exercise in Franzipanics." Even though the words are nonsense, test-wise examinees can use cues such as grammar (the "an" in item 1), absolute qualifiers, and overlap among the options to pick the keyed answers.

1. The fribbled breg will minter best with an
   a. mors
   b. ignu
   c. derst
   d. sortar

2. Why does the sigla frequently overfesk the trelsum?
   a. All siglas are mellious
   b. Siglas are always votial
   c. The trelsum is usually tarious
   d. No trelsa are directly feskable

3. Trassing normally occurs under which one of the following conditions?
   a. When dissels frull
   b. When lusp trasses the vom
   c. When the belgo lisks easily
   d. When the viskal flans, if the viskal is zortil

Item Writing Tips
General tips:
- KISS (keep it simple)
- Assess only important information
- Use gender and ethnicity in an inclusive fashion
Item stem tips:
- Avoid non-content-related suggestions as to the correct answer
- If you must use a negative in the stem, set it in bold or underline it
Response option tips:
- Prefer shorter response options to longer ones
- Words that appear in every response option should be moved to the stem
- In distractors, include true statements that do not correctly answer the question posed in the stem

What item writing tips do you recommend?
Test Development: CTT Item Analysis

Item difficulty
- In classical test theory, item difficulty is the proportion of examinees who answer an item correctly, denoted by p
- p ranges from 0 to 1: very easy items have p values near 1.0, and very difficult items have p values near 0.0
- Item p values should generally fall between .2 and .8
- Is there an ideal average p value for the items on a test? An average p value near .5 maximizes test score variance and thus increases reliability
- That target is useful for employment tests and other norm-referenced tests, but it is less of a concern for the criterion-referenced tests used in academics, where we often want a substantially higher average p value

Item discrimination
- Discrimination is an essential purpose of testing
- Item discrimination indices tell us how well an item differentiates between test takers: how well does the item separate people with high knowledge of the test content from people with low knowledge of it?

Methods for determining item discrimination

Contrasting groups
- Compares the dichotomized item responses (correct or incorrect) of the top 27% of test takers (the Upper group) with those of the bottom 27% (the Lower group)
- For any item, we examine the proportion of test takers in each of these two extreme groups who answered correctly; we would expect a higher proportion of the Upper group than of the Lower group to get any particular item right
- Item discrimination: D = p(Upper) - p(Lower)
- Example, Item 3: p(Upper) = .59 and p(Lower) = .24, so D = .35

Point-biserial correlation
- Correlates the item response (correct/incorrect) with total test score; higher positive correlations indicate better discrimination
- Though we would love a correlation near 1.0, in practice correlations of .10 and above are considered evidence of acceptable discrimination; the sign of the correlation matters more than its magnitude here

Biserial correlation
- A point-biserial correlation underestimates the correlation between an item response and total test score (because of the dichotomized item scoring); the biserial correlation corrects for this problem
- Examining the p values along with either the point-biserial or biserial correlation for each response option helps us further refine items
- Free web-based programs for item analysis are also available
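To make these statistics concrete, here is a minimal computational sketch in Python. It assumes NumPy and SciPy are available; the `responses` matrix is randomly generated stand-in data, and the variable names are mine rather than the lecture's. The biserial correction uses the standard formula r_bis = r_pb * sqrt(p*q) / phi(z_p), where phi is the normal density at the p-th quantile.

import numpy as np
from scipy.stats import norm, pointbiserialr

# Hypothetical scored responses: 200 test takers x 10 items, 1 = correct.
rng = np.random.default_rng(0)
responses = (rng.random((200, 10)) < 0.6).astype(int)

# Item difficulty: proportion of examinees answering each item correctly.
p = responses.mean(axis=0)

# Contrasting groups: D = p(Upper 27%) - p(Lower 27%), split on total score.
total = responses.sum(axis=1)
n27 = int(round(0.27 * len(total)))
order = np.argsort(total)
lower, upper = responses[order[:n27]], responses[order[-n27:]]
D = upper.mean(axis=0) - lower.mean(axis=0)

# Point-biserial: correlate each 0/1 item response with total test score.
r_pb = np.array([pointbiserialr(responses[:, j], total)[0]
                 for j in range(responses.shape[1])])

# Biserial correction for dichotomization: divide by the normal ordinate
# at the p-th quantile (undefined when p is exactly 0 or 1).
r_bis = r_pb * np.sqrt(p * (1 - p)) / norm.pdf(norm.ppf(p))

for j in range(responses.shape[1]):
    print(f"Item {j + 1:2d}: p = {p[j]:.2f}  D = {D[j]:+.2f}  "
          f"r_pb = {r_pb[j]:+.2f}  r_bis = {r_bis[j]:+.2f}")

With real data, items with p outside .2 to .8, D near zero, or negative point-biserial correlations would be the first candidates for revision.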
Choosing Cut Scores

Should I use a cut score?
- Yes, if you need to ensure a minimum level of competency
- No, if you plan to rank-order candidates based on test performance AND you do not need to ensure a minimal level of competency for the KSAs measured by the test

Methods for determining cut scores
- Unacceptable method: holistic judgment
- Acceptable methods: Nedelsky, Angoff, and Ebel (short computational sketches of all three appear at the end of this document)

What NOT to do: "Anything above a score of 64 is passing."

Holistic judgment
- An individual or a panel of SMEs examines the test and the content area
- A decision is made about what percentage of items should be answered correctly by a minimally competent person; this percentage is the cut-off score

Minimally competent person (MCP)
- The examinee or employee who has achieved just the minimal level of performance needed to be considered successful: borderline proficiency

Angoff (1971) method
- For each item on the test, an SME judges the proportion of minimally competent examinees who would be expected to get the item correct
- The proportions the SME assigns to the items are summed over the entire test
- The obtained values are averaged across SMEs
- Example: the item-by-item proportions one SME assigned (the proportion of MCPs expected to get each item correct) summed to a recommended cut score of 3.0

Nedelsky (1954) method
- For each item on the test, an SME crosses out the response options that an MCP should be able to eliminate
- The reciprocal of the number of remaining response options is recorded
- The SME sums the reciprocal values assigned to the items on the test
- The obtained values are averaged across SMEs

Example, assuming 4 response options per item:

Item  Options eliminated  Reciprocal of remaining options
1     2                   1/2
2     1                   1/3
3     3                   1/1
4     0                   1/4
5     2                   1/2

This SME's recommended cut score: 1/2 + 1/3 + 1/1 + 1/4 + 1/2 = 2.58

Ebel (1979) method
- Each SME rates both the difficulty of each item (as in the Angoff method) and the relevance of the item to the job
- Uses a two-dimensional grid for categorizing each item: one dimension is item difficulty, the other is item relevance
- Items are first sorted into the cells; the SME then assigns a percentage to each cell, indicating the percentage of items in that cell that an MCP should answer correctly
- The cut score is X_c = the sum of p(M) over all cells, where p is the proportion of items in a cell that an MCP should answer correctly and M is the number of items in that cell
- The obtained values are averaged across SMEs

Example Ebel method table, taken from Crocker & Algina (1986):

RELEVANCE     EASY           MODERATE       DIFFICULT
Essential     90%, 20 items  50%, 25 items  10%, 5 items
Important     60%, 35 items  30%, 22 items  20%, 10 items
Acceptable    40%, 19 items  20%, 12 items  10%, 15 items
Questionable  25%, 7 items   0%, 20 items   0%, 10 items

X_c = .90(20) + .50(25) + .10(5) + .60(35) + .30(22) + .20(10) + .40(19) + .20(12) + .10(15) + .25(7) + 0(20) + 0(10) = 73.85

Looking for more detail? May I suggest...
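As promised above, here is a minimal Python sketch of the three acceptable cut-score methods. The Nedelsky and Ebel numbers reproduce the worked examples; the Angoff ratings are hypothetical proportions invented for illustration (each row is one SME's judgments for a 5-item test).

from fractions import Fraction

# Angoff: sum each SME's item-level proportions, then average across SMEs.
angoff_ratings = [
    [0.6, 0.8, 0.4, 0.7, 0.5],  # SME 1 (hypothetical; sums to 3.0)
    [0.5, 0.9, 0.3, 0.8, 0.5],  # SME 2 (hypothetical)
]
angoff_cut = sum(sum(sme) for sme in angoff_ratings) / len(angoff_ratings)

# Nedelsky: sum the reciprocals of the response options the MCP cannot
# eliminate (values from the 5-item example above; 4 options per item).
options_remaining = [2, 3, 1, 4, 2]
nedelsky_cut = float(sum(Fraction(1, k) for k in options_remaining))

# Ebel: sum p * M over the difficulty-by-relevance cells
# (percentages and item counts from the Crocker & Algina table).
ebel_cells = [
    (0.90, 20), (0.50, 25), (0.10, 5),
    (0.60, 35), (0.30, 22), (0.20, 10),
    (0.40, 19), (0.20, 12), (0.10, 15),
    (0.25, 7),  (0.00, 20), (0.00, 10),
]
ebel_cut = sum(p * m for p, m in ebel_cells)

print(f"Angoff cut score:   {angoff_cut:.2f}")    # 3.00
print(f"Nedelsky cut score: {nedelsky_cut:.2f}")  # 2.58
print(f"Ebel cut score:     {ebel_cut:.2f}")      # 73.85

Averaging across SMEs is shown only for the Angoff method here; in practice the Nedelsky and Ebel totals would be averaged over multiple SMEs in the same way.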