Educational and Measurement Testing


    -1-

    INTRODUCTION

    National trends for the 1990s are well established. They indicate continued emphasis on testing, at least as far as standardized tests are concerned, with expanded uses of test results.

    The local school continues to be the implementing agency for assessment programs,

    typically those mandated at the state level but also influenced by national trends.

    Technically, measurement is the assignment of numerals to objects or events according

    to rules that give numerals quantitative meaning.

    Measurements may differ in the amount of information the numbers contain. These

    differences are distinguished by the terms nominal, ordinal, interval, and ratio scales of

    measurements.

    The four levels of measurement can be summarized as follows:

    1. Nominal scales categorize but do not order.

    2. Ordinal scales categorize and order.

    3. Interval scales categorize, order, and establish an equal unit in the scale.

    4. Ratio scales categorize, order, establish an equal unit, and contain a true zero

    point.
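
    As a minimal illustration of the four scales (the example variables below are hypothetical, not from the text):

```python
# Hypothetical example variables for the four levels of measurement.
scales = {
    "nominal":  "jersey numbers: categories only, no order",
    "ordinal":  "class rank: ordered, but gaps are not equal units",
    "interval": "Celsius temperature: equal units, but zero is arbitrary",
    "ratio":    "height in centimeters: equal units and a true zero",
}
for scale, example in scales.items():
    print(f"{scale:>8} -> {example}")
```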

    Norm-referenced interpretation is a relative interpretation based on an individual's position with respect to some group, often called the normative group. Norms consist of the scores, usually in some form of descriptive statistics, of the normative group.

    Criterion-referenced interpretation is an absolute rather than relative interpretation, referenced to a defined body of learner behaviors or, as is commonly done, to some specified level of performance.

    Diagnostic tests are intended to identify student deficiencies, weaknesses, or problems and to locate the source of the difficulty. If related learning activities are prescribed, the term prescriptive tests may be used.


    Formative testing occurs over a period of time and monitors students' progress.

    Summative testing is done at the conclusion of instruction and measures the extent to

    which students have attained the desired outcomes.

    Key Terms and Concepts

    Tailored test
    Measurement
    Levels of measurement
    Nominal scale
    Ordinal scale
    Interval scale
    Ratio scale
    Norm-referenced
    Assessment
    Test
    Criterion-referenced
    Diagnostic test
    Prescriptive test
    Formative testing
    Summative testing

    Review Items

    1. Traditionally, the use of tests in the schools has been:

    a. Predominantly norm-referenced.

    b. Predominantly criterion-referenced.

    c. About evenly split, norm-referenced and criterion-referenced.

    d. Neither norm-referenced nor criterion-referenced.

    2. When state legislators have mandated testing, such legislation has been focused

    primarily on:

    a. Increased classroom testing to improve instruction.

    b. Inherent ability testing, for example, using IQ tests.

    c. Minimum competency testing.

    d. Increased testing of student motivation.


    3. Norm-referenced interpretations of test results are directed primarily to the purpose of:

    a. Discriminating among individuals.

    b. Discriminating among groups.

    c. Discriminating among programs.

    d. Discriminating between a program and a standard.

    4. Criterion-referenced interpretations of test results are directed to:

    a. Relative interpretations of individual scores.

    b. Absolute interpretations of individual scores.

    c. Relative interpretations of group scores.

    d. Absolute interpretations of group scores.

    5. Of the terms below, the one having the narrowest meaning is:

    a. Evaluation.

    b. Assessment.

    c. Measurement.

    d. Test.

    6. What characteristic distinguishes evaluation from measurement?

    a. Evaluation requires quantification.

    b. Measurement includes testing; evaluation does not.

    c. Evaluation involves a value judgment.

    d. Evaluation includes assessment; measurement does not.


    7. Of the four levels of measurement, the one that contains the most information in the numbers is:

    a. Interval.

    b. Nominal.

    c. Ordinal.

    d. Ratio.

    8. If, when recording information about a class, we assign a 1 to girls and a 0 to boys,

    this level of measurement is:

    a. Interval.

    b. Nominal.

    c. Ordinal.

    d. Ratio.

    9. The scores from a typical classroom test are probably better than ____________ measurement, but not quite ___________.

    a. Ordinal, nominal.

    b. Interval, ratio.

    c. Nominal, ordinal.

    d. Ordinal, interval.

    10. When using a performance test, the difference between scores of 65 and 70 equals

    the difference between scores of 85 and 90, but a score of 80 is not twice a score of

    40. The level of measurement is:

    a. Interval

    b. Nominal.


    c. Ordinal.

    d. Ratio.

    11. Grading on the curve is a criterion-referenced interpretation of test scores.

    T F

    12. Whether a test is norm-referenced or criterion-referenced depends on the format of

    the items included in the test.

    T F

    13. Most standardized achievement tests are designed for norm-referenced interpretations.

    T F

    14. Criterion-referenced tests tend to be more general and comprehensive than norm-referenced tests.

    T F


    -2-

    PLANNING THE TEST

    Tests are given for many different reasons. In order to achieve such diverse purposes, they need to be carefully planned. In classroom settings, this planning usually entails instructional objectives and/or a table of specifications.

    When teachers use specific instructional objectives, it becomes very clear what should

    be on the test. The desired student behaviors from the objectives translate directly into

    the items for the test.

    Objectives can be classified in terms of the kind of understanding that is required.

    Bloom et al.'s taxonomy is useful in sorting objectives into six hierarchical levels of understanding. When tests are based on such a taxonomy, they are more likely to assess higher levels of reasoning.

    A table of specifications is another tool that is used in test design. A two-dimensional

    grid, content by cognitive process, is used to plan the number and kinds of items that

    will be on the test.
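
    A minimal sketch of such a grid in Python follows; the content areas, cognitive levels, and item counts are hypothetical, not from the text:

```python
# Table of specifications: rows are content areas, columns are cognitive
# processes, and each cell is the planned number of items.
table_of_specifications = {
    "fractions":   {"knowledge": 4, "comprehension": 3, "application": 3},
    "decimals":    {"knowledge": 3, "comprehension": 3, "application": 2},
    "percentages": {"knowledge": 2, "comprehension": 2, "application": 3},
}

# The cell counts determine the planned test length.
total_items = sum(sum(row.values()) for row in table_of_specifications.values())
print(f"Planned test length: {total_items} items")  # 25
```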

    The use of objectives or a table of specifications may imply strict guidelines for

    constructing tests, but much of testing is determined by practical considerations. Test

    constructors must consider such things as how much testing time is available, what item

    formats the students can handle, and the developmental level of the examinee. A well-

    planned test is no accident.

    Key Terms and Concepts

    Test purpose
    Objectives
    Knowledge
    Taxonomy of educational objectives
    Table of specifications

    Review Items

    1. The Bloom et al. taxonomy for educational objectives in the cognitive domain is a

    hierarchy based on the:

    a. Extent of recall required to attain an objective.

    b. Level of understanding required to attain an objective.

    c. Reading level required to attain an objective.

    d. Aptitude required to attain an objective.

    2. Tests are constructed for many purposes. Which of these is not a common

    purpose of tests?

    a. Affirmation.

    b. Prediction.

    c. Evaluation.

    d. Description.


    3. A test that is useful for one purpose is also likely to be effective for many other purposes.

    T F

    4. Test items should be evaluated in terms of how well they match the test's:

    a. Purpose.

    b. Reading level.

    c. Reliability.

    d. Assessment.

    5. Explaining in one's own words what a compromise entails is a task at what level of Bloom's taxonomy?

    a. Application.

    b. Knowledge.

    c. Comprehension.

    d. Analysis.

    6. An assignment to design a politically acceptable solution to nuclear waste disposal would be at which taxonomic level?

    a. Synthesis.

    b. Knowledge.

    c. Comprehension.

    d. Analysis.

    7. Matching authors' names and titles of their books is a task at what taxonomic

    level?

    a. Application.

    b. Knowledge.

    c. Comprehension.

    d. Analysis.


    8. A knowledge-level understanding of reproduction is necessary but not sufficient

    for a comprehension level of understanding of reproduction.

    T F

    9. Which of the following is not a part of a well-written instructional objective

    according to Mager?

    a. A description of the learner.

    b. The behavior that is to be observed.

    c. The conditions under which the behavior will occur.

    d. Criterion of acceptable performance.

    10. Which of the following educational goals is not stated in behavioral terms?

    a. Read.

    b. Understand.

    c. List.

    d. Count.

    11. A table of specifications categorizes test items by:

    a. Content and reading level.

    b. Content and cognitive process.

    c. Cognitive process and reading level.

    d. Item type and cognitive process.

    12. A table of specifications is used for:

    a. Standards for evaluating a test.

    b. Converting test scores to evaluation.

    c. Listing instructional objectives.

    d. Planning a test.

    13. A table of specifications is not useful when instructional objectives are available.

    T F


    14. A table of specifications would be appropriate for an achievement test but not for

    a test that is used to predict future academic performance.

    T F

    15. Almost all important educational outcomes can be expressed in terms of

    behavioral objectives.

    T F

    -3-

    SELECTED-RESPONSE ITEMS

    Test items can be distinguished by the response required: the response is either selected from two or more options or constructed by the test taker.

    The three commonly used selected-response item formats are true-false, multiple-choice, and matching.

    Selected-response items have three general qualities:

    1. They can be reliably scored.

    2. They can sample the domain of content extensively.


    3. They tend to measure memorization of unimportant facts unless care is taken in

    constructing the items.

    True-false items can be effective when a few guidelines are followed in their construction:

    1. Statements must be clearly true or false.

    2. Statements should not be lifted directly from the text.

    3. Specific determiners should be avoided.

    4. Trick questions should not be used.

    5. Some statements should be written at higher cognitive levels.

    6. True and false statements should appear with about equal frequency and be of similar length.

    Multiple-choice items can be improved by following these guidelines:

    1. Avoid grammatical clues.

    2. Keep option length uniform.

    3. Use plausible distractors.

    4. Do not repeat key words from the stem in the options.

    Correct answers should be randomly assigned to response-option positions. The position of the correct response should not provide a clue about its correctness.

    Complex options (e.g., all of the above; none of the above; or a and b, but not c) should

    be used sparingly, if at all.

    Multiple-choice items are versatile:

    1. They can measure higher cognitive outcomes.

    2. They can provide diagnostic information.

    Matching items are usually presented in a two-column format: one column consists of

    premises and the other consists of responses.

    Matching items should contain homogeneous content so that all responses must be

    considered plausible answers.

    The following guidelines apply to selected-response item formats:

    1. Teachers should be aware of the number of items on the test that can be answered correctly by guessing alone (see the sketch after this list).


    2. Test items should be independent: The content of one item should not provide the

    answers to others, nor should correctly answering one question be a prerequisite to

    correctly answering another.

    3. Reading level of the test should be lower than the grade level, unless reading is

    being tested.
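
    The expected chance score in guideline 1 is simple arithmetic; a minimal sketch with hypothetical test configurations:

```python
# Expected number of items answered correctly by blind guessing alone.
def expected_chance_score(n_items: int, n_options: int) -> float:
    return n_items / n_options

print(expected_chance_score(50, 2))  # 25.0 on a 50-item true-false test
print(expected_chance_score(40, 4))  # 10.0 on a 40-item, 4-option test
```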

    Key Terms and Concepts

    Objective item
    Selected-response item
    True-false item
    Multiple-choice item
    Matching item
    Options
    Content sampling
    Barrier
    Stem
    Distractors
    Clue
    Premise

    Review Items

    1. Objective items are objective only in their:

    a. Item content.

    b. Scoring.

    c. Distractors.

    d. Wording.

    2. Many selected-response items can be asked in each testing session; thus, they can

    provide good:

    a. Levels of difficulty.

    b. Objectivity.

    c. Content sampling.

    d. Time sampling.

    3. The objective scoring of selected-response items enhances the test's:

    a. Reliability.

    b. Validity.


    c. Reliability and validity.

    d. Validity but not reliability.

    4. When constructing true-false items, it is best to:

    a. Use specific determiners.

    b. Reproduce statements directly from the text.

    c. Include about equal numbers of true statements and false statements.

    d. Vary the length of false statements and true statements.

    5. Which of the following is not a strength of multiple-choice items?

    a. Effective testing of higher cognitive levels.

    b. Content sampling.

    c. Scorer reliability.

    d. Allowing for educated guesses

    6. When constructing multiple-choice items, it is best to:

    a. Make all options the same length.

    b. Put the main idea in the options.

    c. Use options such as "a and b, but not c."

    d. Repeat key words from the stem in the options.

    7. A matching item includes six events to be matched with nine responses consisting of dates, cities, and states. The error of item construction is:

    a. Too many premises.

    b. Too few premises.

    c. Responses contain heterogeneous content.

    d. Responses contain homogeneous content.

    8. When constructing a matching item, the numbers of premises and responses

    should be:

    a. Equal, with all responses used only once.


    b. Equal, but having the option of using responses more than once.

    c. Unequal, with a greater number of responses and any response being used only

    once.

    d. Unequal, with a greater number of responses and having the option of using responses more than once.

    9. The tendency for true-false items to measure trivia is a weakness of the item writer more than of the item format.

    T F

    10. A good true-false item is clearly true or false.

    T F

    11. Increasing the length of a matching item tends to enhance the homogeneity of

    content.

    T F

    12. Multiple-choice items generally require about the same response time per item as

    matching items.

    T F

    13. The item format most appropriate for measuring knowledge of paired associates, such as symbols and their meanings, is multiple-choice.

    T F

    14. The column of a matching item that contains the item stems is called the _______.

    15. For usual classroom testing, the most desirable length for a matching item is

    between ___________and___________premises.


    -4-

    CONSTRUCTED-RESPONSE ITEMS

    For a short-answer item, the student supplies the answer to an item presented in question, association, or completion form.


    In constructing short-answer items, each item should have a unique correct answer and be structured so the student can clearly recognize its intent.

    An essay item is one for which the student structures the response. He or she selects

    ideas and then presents them according to his or her own organization and wording.

    Essay items are used quite effectively to measure higher-level learning outcomes, such

    as analysis, synthesis, and evaluation. Essay testing is not, however, an effective

    means of measuring lower-level learning outcomes.

    Essay items can be used to measure writing and self-expression skills. Although this

    may not be the primary purpose of a given test, it is certainly worthwhile.

    Scoring inconsistencies are the primary disadvantage of essay items. In addition,

    irrelevant factors, such as neatness and penmanship, may also influence the score.

    The extent of response falls on a continuum from restricted to extended. Writing items with the response geared toward the restricted end tends to provide more focus for the item.

    The student must be directed to the desired response. This can be enhanced by

    identifying the intended student behaviors and including them in the essay item.

    The suggested time for responding to each test item should be provided to the students.

    This designates the weight or value of each item and also helps students budget their

    time.

    Analytic scoring focuses on individual points or components of the response; holistic scoring considers the response in its entirety, as a whole.

    If possible, responses to items should be scored anonymously. In addition, all responses to one item should be scored before moving on to the next item, rather than scoring an entire test at a time. Also, the papers should be reordered before scoring the next item.

    Key Terms and Concepts

    Constructed response
    Short-answer item
    Essay item
    Objective scoring
    Question form
    Association form
    Completion form
    Restricted response
    Extended response
    Modal answer
    Analytic scoring
    Holistic scoring

    Review Items

    1. In an association-form short-answer item, the spaces for the responses should:

    a. Vary according to the length of the correct response.

    b. All be the same size.

    c. Vary in size, but not according to any order.

    d. Vary in size according to some system of ordering.

    2. Completion-form short-answer items should have the blank(s) placed:

    a. At or near the beginning of the item.

    b. Between the beginning and the middle of the item.

    c. As close to the middle of the item as possible.

    d. At or near the end of the item.

    3. A Swiss cheese completion item has:

    a. The blanks evenly spaced throughout the item.

    b. Too many blanks.

    c. Blanks of unequal size.

    d. None of the above.

    4. Essay items are popular in teacher-constructed tests because:

    a. Of the subjectivity in their scoring.

    b. They are perceived to be more effective in measuring higher-level outcomes than

    objective items.

    c. They tend to have greater content sampling than objective items.

    d. They tend to have greater reliability than objective items.


    5. When scoring essay items, all responses to one item should be scored before scoring the next item, rather than scoring one entire test before scoring the next. This procedure:

    a. Increases the test validity.

    b. Enhances the consistency of scoring.

    c. Reduces bias against individual students.

    d. Enhances the objectivity of scoring.

    6. The test scorer reads the response to one essay item after already reading several

    other responses to the same item. The score of this response will tend to be:

    a. Higher, if the earlier responses were of poor quality.

    b. Higher, if the earlier responses were of high quality.

    c. Lower, if the earlier responses were of poor quality.

    d. Unaffected by the quality of earlier responses.

    7. The halo effect in scoring items is a tendency to score more highly those responses:

    a. Read later in the scoring process.

    b. Read earlier in the scoring process.

    c. Of students known to be good students.

    d. That are technically well written.

    8. A student receives a high score on an essay item, due, in part, to the quality of

    responses to the item read earlier. This is:

    a. A context effect.

    b. A halo effect.

    c. A reader-agreement effect.

    d. None of the above.

    9. Anonymous scoring of essay item responses tends to reduce:

    a. Reader agreement.


    b. The halo effect.

    c. Order effect.

    d. Effects due to technical characteristics, such as penmanship.

    10. From a measurement standpoint, using classroom tests consisting entirely of essay

    items is undesirable because:

    a. Content sampling tends to be limited.

    b. Scoring requires too much time.

    c. It is difficult to construct the items.

    d. Structuring model responses is too time consuming.

    11. Short-answer items are generally easier to construct than matching items.

    T F

    12. The use of essay items is an effective means of measuring lower-level learning

    outcomes.

    T F

    13. Analytic scoring of essay items tends to be faster than holistic scoring.

    T F

    14. Analytic scoring of essay items tends to be more objective than holistic scoring.

    T F

    15. There is a tendency to score longer responses to essay items more highly than shorter responses.

    T F

    16. Including optional items in an essay exam is a desirable practice.

    T F


    -5-

    NORM-REFERENCED MEASUREMENT


    Norm groups are the referent groups for norm-referenced interpretations of test scores. Such groups must be appropriate for the individuals tested and the purposes at hand.

    Norms should be representative, relevant, and recent.

    Representativeness of the norm group depends on the size of the sample and the sampling method. The latter has numerous factors associated with it and is the most likely source of biased norms.

    Relevance depends on the degree to which the norm group is comparable to the group

    under consideration.

    National, local, and subgroup norms provide different perspectives for interpreting the results of tests.

    Norms are measures of the actual performance of a group on a test. They are not meant to be standards of what performance levels should be.

    Descriptive statistics are used to summarize characteristics of sets of test scores. The

    level of statistics commonly used in measurement is quite basic, requiring only simple

    arithmetic operations.

    Frequency distributions summarize sets of test scores by listing the number of people

    who received each test score. All of the test scores can be listed separately, or the

    scores can be grouped in a frequency distribution.

    The mean, the median, and the mode all describe central tendency:

    1. The mean is the arithmetic average.

    2. The median divides the distribution in half.

    3. The mode is the most frequent score.

    Descriptive statistics that indicate dispersion are the range, the variance, and the standard deviation. The range is the difference between the highest and lowest scores in the distribution plus one. The standard deviation is a unit of measurement that shows by how much the separate scores tend to differ from the mean. The variance is the square of the standard deviation. Most scores are within two standard deviations of the mean.
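
    A minimal sketch of these statistics in Python, using a small hypothetical set of scores (the inclusive range follows the definition above):

```python
import statistics

scores = [70, 75, 75, 80, 85, 90, 95]  # hypothetical test scores

mean = statistics.mean(scores)               # arithmetic average
median = statistics.median(scores)           # middle score
mode = statistics.mode(scores)               # most frequent score
score_range = max(scores) - min(scores) + 1  # highest minus lowest, plus one
variance = statistics.pvariance(scores)      # average squared deviation
sd = statistics.pstdev(scores)               # square root of the variance

print(mean, median, mode, score_range, variance, sd)
```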

    Key Terms and Concepts

    Norm group
    Norms
    Representativeness
    Recency
    National norms
    Local norms
    Grade equivalent score
    Descriptive statistics
    Frequency distribution
    Measures of central tendency
    Measures of dispersion
    Mean
    Median
    Mode
    Range
    Variance
    Standard deviation

    Review Items

    1. When using a norm-referenced interpretation, the student's score on a test is compared to:

    a. A minimum score for passing the test.

    b. The scores of others taking the test.

    c. An expected score based on the student's ability.

    d. A predetermined percentage of correct responses.

    2. It is not important that the norm group for a nationally used achievement test:

    a. Is large.

    b. Is representative.

    c. Is from at least three grade levels.

    d. Has persons from all states.

    3. The extent to which a norm group is comparable to the group being tested determines the norm group's:

    a. Relevance.

    b. Representativeness.

    c. Recency.

    d. Reliability.


    4. The Lake Wobegon Phenomenon in testing is the situation of:

    a. Norm groups scoring unusually high on standardized tests.

    b. Students scoring below the national average on standardized tests.

    c. Students scoring above average on locally normed tests.

    d. All states reporting above-average performance on nationally normed tests.

    5. Norms for published standardized tests are commonly based on the performance

    of:

    a. Individual students who will perform well.

    b. One or more groups of students.

    c. Students in a typical school system.

    d. A random sample of students from one state.

    6. Local achievement and aptitude norms might be more important than national norms in decisions about:

    a. Future occupation.

    b. The likelihood of success in certain colleges.

    c. Selection into special high-school programs.

    d. Allocations among different school districts.

    7. The central administration of a school district sets a goal of having all elementary-

    school students reading at or above the average on a nationally normed test. The

    mistake being made is:

    a. Making the assumption that the norm group is relevant to the local school.

    b. Using the norm as a standard.

    c. Attempting to have consistent reading performance in all schools.

    d. Establishing too modest a goal.

    8. Which of the following is not a measure of central tendency?

    a. Mean

    b. Variance.


    c. Mode.

    d. Median

    9. When a distribution has a small number of scores, some of which are very extreme,

    the preferred measure of central tendency is the:

    a. Median

    b. Mean

    c. Range

    d. Mode

    10. A measure of dispersion for a distribution, whose computation involves only the

    extreme scores, is:

    a. Standard deviation

    b. Variance

    c. Mode

    d. Range

    11. Which of the following provides a measure of dispersion in the same units as the original scores?

    a. Variance

    b. Median

    c. Standard deviation

    d. Correlation

    12. Measures of central tendency are to location as measures of dispersion are to:

    a. Points

    b. Spread

    c. Average

    d. Frequencies


    13. If the mode and median of a distribution of scores are equal, the mean will also

    have to be equal to the median.

    T F

    14. When establishing national norms, size of the norm group is a major concern.

    T F

    15. Generally, the larger the numerical value of the median, the larger the value of the

    standard deviation.

    T F

    -6-

    COMPARING SCORES TO NORM GROUPS


    When comparing an individual's score to the scores of the norm group, the point is to determine where the individual's score falls in the norm-group distribution.

    Percentiles indicate the percentage of students in the norm group who are at or below a particular score.

    The standard normal distribution has a mean of 0 and a standard deviation of 1.0. The area in the Appendix 4 table is given from the mean to the z-score, expressed as a proportion of the total area.

    Standard scores and transformed standard scores express the relative position of a score in a distribution in terms of standard deviation units from the mean.

    Stanines provide equal units of measurement. There are nine stanine scores, and the name comes from "standard nine." Each stanine contains a band of scores, each band equal to one-half standard deviation in width.

    The NCE score is a normalized standard score with a mean of 50 and a standard deviation of 21.06. Scores range from 1 through 99, and an equal unit is retained in the scale.
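
    A minimal sketch of these conversions in Python, assuming a hypothetical raw score and norm-group mean and standard deviation:

```python
raw, mean, sd = 62.0, 50.0, 8.0   # hypothetical values

z = (raw - mean) / sd     # standard score: SD units from the mean
t = 50 + 10 * z           # T-score: mean 50, SD 10
nce = 50 + 21.06 * z      # normal curve equivalent: mean 50, SD 21.06

# Stanines center on 5, each band one-half SD wide, truncated to 1-9.
stanine = max(1, min(9, round(5 + 2 * z)))

print(f"z={z:.2f}  T={t:.1f}  NCE={nce:.1f}  stanine={stanine}")
```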

    Grade equivalent scores are intended to indicate the average level of performance for students in each month of each grade. Unfortunately, grade equivalents do not form an equal-interval scale.

    Key Terms and Concepts

    Percentile
    Percentile rank
    Standard score
    Standard normal distribution
    Transformed standard score
    Stanines
    Normalized T-score
    Normal curve equivalent score
    Grade equivalent score

    Review Items

    1. A student who scores at the 45th percentile on a test:

    a. Answered 45 percent of the items correctly.

    b. Is above average in performance.


    c. Equaled or surpassed 45 percent of the other examinees.

    d. Had at least 45 percent of the answers right.

    2. Standard scores express an individual's position in the distribution of scores in

    terms of:

    a. Standards of performance.

    b. Standard deviations from the mean.

    c. Standard deviation from the maximum possible score.

    d. Deviation from a standard of performance.

    3. A test score that is at the 42nd percentile could also be said to be at which stanine?

    a. 3rd

    b. 4th

    c. 5th

    d. 6th

    4. A z-score of 1.5 would have what value if it were converted to a T-score?

    a. 120

    b. 50

    c. 35

    d. 65

    5. Which of these cannot be meaningfully averaged because the scores are ordinal

    rather than interval?

    a. Percentiles

    b. Stanines

    c. Standard scores

    d. None of the above

    6. A student receives a z-score of +1.25 on an exam. This means the student's performance is:


    a. Below the mean performance of the group

    b. One-quarter of a standard deviation above the mean performance of the group

    c. At the average for the group

    d. Around the 89th percentile of the group

    7. The standard normal distribution has:

    a. a mean of 0 and a standard deviation of 1

    b. a mean of 50 and a standard deviation of 1

    c. a mean of 0 and a standard deviation of 10

    d. a mean of 50 and a standard deviation of 10

    8. A student's T-score in a distribution of transformed standard scores is 40. This student's performance is:

    a. Above average

    b. At the 40th percentile

    c. At the 7th stanine

    d. Below average

    9. Stanines divide a distribution into nine parts so that each part:

    a. Contains about 11 percent of the scores

    b. Is one-half standard deviation wide

    c. Represent 10 percentile ranks

    d. Contains the mode of distribution

    10. In a normal distribution, which of the following indicates the highest relative position

    in the distribution of scores?

    a. Z = 1.5

    b. Percentile rank = 90

    c. T = 65

    d. Stanine = 8


    11. Joe has a stanine score of 6 on an exam. His performance is:

    a. Below the 6th percentile

    b. Between the 60th and 77th percentile

    c. At the 50th percentile

    d. Above the 80th percentile

    12. Normal curve equivalent scores (NCEs) range from:

    a. -3.00 to +3.00

    b. 1 to 9

    c. 30 to 80

    d. 1 to 99

    13. Percentiles are more of an ordinal scale than an equal interval scale.

    T F

    14. When scores are converted to percentiles, a specified gain in achievement will result

    in a larger increase in percentile rank if the gain is near the high end of the

    distribution than near the middle of the distribution.

    T F

    15. Grade equivalent scores are on an equal interval scale.

    T F

    -7-


    ITEM STATISTICS FOR NORM-REFERENCED TESTS

    This chapter introduced the concepts of analyzing individual items of norm-referenced tests, that is, how well the items are performing in the total test. The correlation coefficient was introduced as a descriptive statistic, one that can be used to indicate

    the direction and strength of the relationship between two variables. In testing

    applications, the variables are often scores on individual items, scores on tests or other

    measuring instruments, and scores on external criteria that we try to predict, such as

    future grade-point average. The correlation coefficient will also be used in future

    chapters to develop the concepts of validity and reliability.

    Two item statistics, difficulty and discrimination, were introduced and shown to be useful in evaluating the performance of individual items on norm-referenced tests. The difficulty index indicates the percentage of persons who answered an item correctly, whereas the discrimination index shows how well the item separates those who had high and low scores on the total test. The discrimination index is based on the correlation between scores on an individual item and those on the total test.
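
    A minimal sketch of the two statistics in Python, using hypothetical item and total-score data; the upper-lower discrimination index shown here is a common variant of the correlation-based index the text describes:

```python
item = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]             # 1 = correct, per examinee
total = [48, 45, 44, 40, 38, 35, 30, 28, 25, 20]  # total test scores

# Difficulty index: proportion of examinees answering the item correctly.
p = sum(item) / len(item)

# Discrimination: proportion correct in the upper half of total scores
# minus the proportion correct in the lower half.
half = len(item) // 2
ranked = sorted(zip(total, item), reverse=True)
upper = [correct for _, correct in ranked[:half]]
lower = [correct for _, correct in ranked[half:]]
d = sum(upper) / half - sum(lower) / half

print(f"difficulty = {p:.2f}, discrimination = {d:.2f}")  # 0.50, 0.60
```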

    Constructing a perfect test is not likely, especially for the initial draft of the test, even when we follow the guidelines for good test construction. Confusion, ambiguity, and

    poorly constructed options may enter into an item. Students may perceive items

    differently than intended by the teacher. Item analysis provides empirical data about

    how individual items are performing in a real test situation. Item statistics do not reveal

    specifically the deficiencies in the content of items, but they indicate when an item is

    deficient. Checking the item difficulty index and the discrimination index may give some

    clues as to what is wrong. A careful inspection of the item content and response

    patterns of students is often quite revealing.

    Key Terms and Concepts

    Correlation coefficient
    Scatterplot
    Pearson product-moment coefficient
    Coefficient of determination
    Difficulty index
    Discrimination index

    Review Items


    1. To compute the correlation between attitude and achievement, one must have:

    a. Achievement scores from one group of people and attitude scores from another

    group.

    b. Achievement and attitude scores on the same group of people.

    c. Achievement scores from two points in time and attitude scores from two points

    in time.

    d. The same tests given twice to the same group of people.

    2. The correlation coefficient is a number that can range from:

    a. 0 to +1.00

    b. -1.00 to +1.00

    c. Minus infinity to plus infinity

    d. 0 to 100

    3. Which of the following indicates the greatest degree of correlation?

    a. -.52

    b. -.61

    c. +.23

    d. +.42

    4. The variance of a distribution is a measure of:

    a. Dispersion

    b. Central tendency

    c. Relationship

    d. Location

    5. Students who score high on an ability measure were found to be able to solve a learning task much faster than students scoring low on the ability measure. If scores on the ability measure and time to complete the learning task are correlated, we

    would expect:

    a. Zero correlation


    b. A zero coefficient of determination

    c. Positive correlation

    d. Negative correlation

    6. An exam is given to 40 students; 35 of the students respond correctly to an item. The

    difficulty index for the item is close to:

    a. 0

    b. 1

    c. .87

    d. .40

    7. Which difficulty index is indicative of the most difficult item?

    a. .90

    b. .50

    c. .25

    d. .12

    8. The preferred difficulty index for items of norm-referenced tests is:

    a. Close to 1

    b. Close to 0

    c. Close to .80

    d. Close to .50

    9. If an item has a high discrimination index, it means that scores on the item have:

    a. No correlation with total test scores

    b. High correlation with total test scores

    c. Low correlation with total test scores

    d. Negative correlation with total test scores

    10. An item has a negative discrimination index. Thus, if a student responds correctly to this item, we would expect for this student a:


    a. Low total test score

    b. High total test score

    c. Total test score around the middle

    d. Total test score of zero

    11. Of the following, which provides information about the distribution of total test scores?

    a. Difficulty index

    b. Correlation coefficient

    c. Discrimination index

    d. Standard deviation

    12. If we want to identify who is getting the test item correct, low-scorers or high-scorers, we would check the difficulty index.

    T F

    13. An item has a discrimination index around .80. This means that high-scorers on the test tend to get the item correct.

    T F

    14. An item has a difficulty index close to zero. This means that high-scorers on the test are getting the item correct.

    T F

    15. The ideal situation for a test is to have high difficulty levels and high discrimination

    indices for the items.

    T F

    -8-

    RELIABILITY OF NORM-REFERENCED TESTS


    Reliability of measurement is consistency: consistency in measuring whatever the instrument is measuring.

    Stability reliability is consistency of measurement across time.

    Test-retest, with the same test administered at different times, provides the estimate of stability reliability. The reliability coefficient is the correlation between the scores of the two test administrations.

    Equivalence reliability is consistency of measurement across two parallel forms of a

    test.

    The split-half procedure divides the test into two parallel halves; the scores of the two halves are then correlated. The reliability of the total test is then estimated using the Spearman-Brown formula.

    The KR-20 formula gives an estimate of internal consistency reliability (r20), which, in essence, is the mean of all possible split-half coefficients.

    The r21 may be substituted for r20 if item difficulty levels are similar; r21 is computationally easier, but it underestimates reliability if the items vary in difficulty.

    The alpha coefficient provides an estimate of internal consistency reliability, based on two or more parts of a test. If each item is considered a part, alpha is equivalent to r20.
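
    A minimal sketch of KR-20 for dichotomous (0/1) items, using a hypothetical score matrix; for 0/1 items the result equals coefficient alpha:

```python
import statistics

# Rows are examinees, columns are items (hypothetical 0/1 scores).
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
]
k = len(scores[0])                                    # number of items
totals = [sum(row) for row in scores]                 # total score per examinee
p = [sum(col) / len(scores) for col in zip(*scores)]  # item difficulties

# KR-20 = (k / (k - 1)) * (1 - sum(p * q) / variance of total scores),
# where p * q is the variance of a 0/1 item.
sum_pq = sum(pi * (1 - pi) for pi in p)
kr20 = (k / (k - 1)) * (1 - sum_pq / statistics.pvariance(totals))
print(f"KR-20 / alpha = {kr20:.2f}")  # low here, as expected for so few items
```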

    Test length affects reliability in such a way that the longer the test, the greater the reliability, assuming other factors remain constant.

    The Spearman-Brown formula is used for estimating the reliability of a test of increased length. It is applied when using the split-half procedure, since the total test is twice as long as the individual halves.
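
    A minimal sketch of the formula, with hypothetical reliabilities; the two-fold case is the split-half correction:

```python
def spearman_brown(r: float, factor: float = 2.0) -> float:
    """Reliability of a test lengthened by `factor`, given reliability r."""
    return factor * r / (1 + (factor - 1) * r)

print(spearman_brown(0.60))  # split-half r = .60 -> total-test r = .75
print(spearman_brown(0.70))  # doubling a test with r = .70 -> about .82
```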

    Difference scores tend to be less reliable than scores on individual tests. As the correlation between the scores on the two tests increases, the reliability of the difference scores decreases.

    An observed test score may be considered as consisting of two parts, a true component and an error component.

    In considering the distributions of the observed, true, and error scores:

    X_o = X_t + X_e  and  s_o^2 = s_t^2 + s_e^2


    Reliability is the proportion of the variance in the observed scores that is true or

    nonerror variance.

    The standard error of measurement is the standard deviation of the distribution of error scores. As reliability increases, the standard error of measurement decreases.
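
    A minimal sketch of the standard error of measurement, SEM = SD * sqrt(1 - reliability), with hypothetical values:

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    return sd * math.sqrt(1 - reliability)

print(standard_error_of_measurement(20, 0.84))  # 20 * sqrt(.16) = 8.0
print(standard_error_of_measurement(20, 1.00))  # 0.0: no measurement error
```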

    We can use the concepts of reliability and standard error of measurement in making inferences about how an individual's score would fluctuate on repeated use of the same test. The distribution of scores would have a mean approaching the individual's true score and a standard deviation equal to the standard error of measurement.

    Increased range of performance of the students being tested tends to enhance

    reliability.

    Item similarity enhances reliability, and item difficulty affects reliability such that items of moderate difficulty, around 50 percent correct responses per item, enhance reliability.

    Key Terms and Concepts

    Reliability
    Reliability coefficient
    Test-retest
    Stability reliability
    Parallel forms
    Equivalence reliability
    Internal consistency reliability
    Spearman-Brown formula
    Kuder-Richardson formula-20
    Kuder-Richardson formula-21
    Cronbach alpha
    Difference score
    Error variance
    True variance
    Standard error of measurement

    Review Items

    1. The reliability coefficient can take on values:

    a. From 0 to +1.00, inclusive

    b. From -1.00 to +1.00, inclusive

    c. Of any positive number

    d. From -1.00 to 0, inclusive


    2. If a group of students was measured in September using a mathematics

    achievement test and then tested again in October using the same test, the

    correlation coefficient between the scores of the two test administrations would be

    a measure of:

    a. Stability reliability

    b. Equivalent reliability

    c. Internal consistency reliability

    d. Both stability and equivalence reliability

    3. Reliability estimates of a test:

    a. May be based on content or logical analysis of the test

    b. Require some correlation coefficient

    c. Increase with repeated test usage

    d. Are the same for all applications of the test

    4. If a reliability estimate is based on a single administration of a test, the reliability of

    interest is not:

    a. Stability reliability

    b. Split-half reliability

    c. Equivalence reliability

    d. Internal consistency reliability

    5. On a given test, the observed standard deviation of the scores is 20, and the reliability of the test is .84. The standard error of measurement is:

    a. 8.00

    b. 18.33

    c. 3.20

    d. 16.80

    6. A test of 40 items has a reliability of .70. If the test is increased to 80 items, the reliability will be:


    a. 0.99

    b. 0.54

    c. 0.82

    d. 0.90

    7. If the reliability of a test is 1.0, the standard error of measurement is:

    a. 1.0 also

    b. Greater than 1.0

    c. Undeterminable

    d. 0

    8. Identify the reliability estimation procedure appropriate for determining stability

    reliability of a test:

    a. Split-half

    b. Kuder-Richardson formula-20

    c. Parallel forms administered at the same time

    d. Parallel forms administered at different times

    9. A reading test is given to two groups of sixth-grade students: Group A consists of high-ability (IQ 120 or greater) students; Group B consists of students of

    heterogeneous ability (IQ range 90 to 150). The most likely reliability situation is:

    a. Test reliability will be the same for both groups

    b. Test reliability will be greater for group A than for group B

    c. Test reliability will be greater for group B than for group A

    d. No inference can be made about test reliability

    10. In applying the split-half procedure for estimating reliability, the reliability coefficient

    for one-half the test is computed. To estimate the reliability of the entire test, we use

    the:

    a. Kuder-Richardson formula-20

    b. Kuder-Richardson formula-21


    c. Spearman-Brown formula

    d. Cronbach alpha procedure

    11. The Kuder-Richardson 20 procedure (KR-20) is a procedure for estimating reliability

    that provides:

    a. An internal consistency coefficient

    b. The mean of all possible split-half coefficients

    c. Both a and b

    d. Neither a nor b

    12. A test of 100 items is divided into five subtests of 20 items each. If we are interested

    in internal consistency reliability, the most appropriate procedure for estimating

    reliability is:

    a. Kuder-Richardson formula-20

    b. Kuder-Richardson formula-21

    c. Cronbach alpha

    d. Parallel forms

    13. A mathematics test is given to a class of gifted students and also to a regular ungrouped class. The reliability of the test would likely be:

    a. Greater for the gifted class

    b. Greater for the ungrouped class

    c. About the same for both classes

    d. Unable to infer anything until the reliability coefficient is computed

    14. The standard error of measurement is a measure of:

    a. Location

    b. Central tendency

    c. Variability

    d. Association


    15. As the standard error of measurement increases, the reliability of a test:

    a. Also increases

    b. Decreases

    c. Remains unchanged

    d. May increase or decrease

    16. Theoretically, with respect to variance, reliability can be considered the ratio of:

    a. Observed variance to true variance

    b. Error variance to observed variance

    c. True variance to error variance

    d. True variance to observed variance

    17. Conceptually, the true component and the error component of a test score are such

    that:

    a. The greater the true component, the greater the error component

    b. The greater the true component, the smaller the error component

    c. The components are equal

    d. The components are independent

    18. In conceptualizing the distributions of observed, true, and error scores, the following

    is true for the means:

    a. The observed mean equals the true mean

    b. The error mean equals zero

    c. The observed mean equals the true mean plus the error mean

    d. All of the above

    19. Conceptually, the variances of the distributions of the observed, true, and error scores are such that:

    a. The variance of the error scores is zero

    b. The error variance plus the true variance equals the observed variance

    c. The observed variance is less than the true variance


    d. The observed variance and the true variance are equal

    20. A difference score is generated by subtracting a pretest score from a posttest score.

    In order to obtain a high reliability for the difference score, we require:

    a. Low correlation between pretest and posttest scores

    b. High reliability for both pretest and posttest scores

    c. Both a and b

    d. Neither a nor b

    -9-

    VALIDITY OF NORM-REFERENCED TESTS


    Validity is the extent to which a test measures what it is intended to measure.

    Content validity is concerned with the extent to which the test is representative of a defined body of content consisting of topics and processes.

    Content validity is based on a logical analysis. It does not generate a validity coefficient, as is obtained with some other types of validity.

    Standardized achievement tests tend to have broad content coverage so they will have wide application. However, when used in a specific situation, the content validity of a prospective test should always be considered.

    Criterion validity is based on the correlation between scores on the test and scores on a

    criterion. The correlation coefficient is the criterion validity coefficient.

    Concurrent validity is involved if the scores on the criterion are obtained at the same time as the test scores. Predictive validity is involved if the scores on the criterion are obtained after an intervening period from those of the test.

    Concurrent validity applies if it is desirable to substitute a shorter test for a longer one.

    In that case, the score on the longer test is the criterion, and validity is that of the

    shorter test.

    The construct validity of a measure or test is the extent to which scores can be interpreted in terms of specified traits or constructs.

    Factor analysis is a procedure for analyzing a set of correlation coefficients between measures; the procedure analytically identifies the number and nature of the constructs underlying the measures. Different types of factors are general, group, and specific factors.

    For tests validated through correlation with a criterion measure, validity can be expressed as the proportion of the observed test variance that is common variance with the criterion. The validity coefficient is the square root of this proportion or ratio.
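
    A minimal worked example of this ratio, with hypothetical variances:

```python
import math

common_variance = 36.0     # test variance shared with the criterion (hypothetical)
observed_variance = 100.0  # total observed test variance (hypothetical)

# The squared validity coefficient is the proportion of common variance.
validity = math.sqrt(common_variance / observed_variance)
print(validity)  # 0.6
```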

    A test cannot be valid (either conceptually or practically) if it is not reliable; however, a reliable test could lack validity. Thus, reliability is a necessary but not sufficient condition for test validity.

    A well-constructed test with items of proper difficulty level will enhance validity. Validity tends to increase with test length. Low item intercorrelations may tend to enhance criterion validity if we have a complex criterion.


    Increased heterogeneity of the group measured tends to enhance validity. Subtle, individual factors may also affect validity. Tests should be properly administered, since any procedures that impede performance also lower validity.

    Key Terms and Concepts

    Validity
    Content validity
    Criterion validity
    Validity coefficient
    Concurrent validity
    Predictive validity
    Construct validity
    Factor analysis
    Factor loading
    General factor
    Group factor
    Specific factor
    Covariation

    Review Items

    1. Which of the following types of validity does not yield a validity coefficient?

    a. Predictive

    b. Concurrent

    c. Content

    d. Criterion

    2. When considering the terms reliability and validity, as applied to a test, we can say:

    a. A valid test ensures some degree of reliability

    b. A reliable test ensures some degree of validity

    c. Both a and b

    d. Neither a nor b

    3. If a test is representative of the skills and topics covered by a specific unit of

    instruction, the test has:

    a. Construct validity

    b. Concurrent validity

    c. Predictive validity

    d. Content validity

    8. Which characteristic is true of criterion validity?

    a. It is based on a logical correspondence between two tests

    b. It includes two types, concurrent and predictive validity

    c. It is based on two administrations of the same test

    d. All of the above

    9. A school system uses a test considered to be valid for measuring student

    achievement, but the test requires three hours of administration time. The principals

    and teachers are considering substituting a shorter test for the longer one. The

    validity of concern here is:

    a. Concurrent

    b. Content

    c. Construct

    d. Predictive

    10. The testing division of a school system is attempting to analyze the traits that are

    inherent in the six subscores of an academic achievement test. The validity of

    concern here is:

    a. Concurrent

    b. Content

    c. Construct

    d. Predictive

    11. Construct validity is established through:

    a. Logical analysis

    b. Statistical analysis

    c. Both logical and statistical analysis

    d. Neither logical nor statistical analysis


    12. Factor analysis is a procedure often used in establishing construct validity of a set

    of tests. In the analysis, the factor loadings that are computed are correlation

    coefficients between:

    a. Scores on two or more tests of the set

    b. Factors and test scores

    c. Two or more factor scores

    d. None of the above

    13. A factor analysis is conducted on the scores from six different IQ tests. One of the factors has a large loading with a single IQ test and very small loadings with the other five tests. This is a:

    a. General factor

    b. Specific factor

    c. Group factor

    d. None of the above

    14. If a validity coefficient is computed for a test, and the test has been used with a very homogeneous group of students, we expect that the validity coefficient will be:

    a. Moderate, around .55

    b. High

    c. Low

    d. Unable to make an inference

    15. Which of the following is least like the others?

    a. Criterion validity

    b. Construct validity

    c. Concurrent validity

    d. Predictive validity

    16. A test is found to have high reliability but low validity. In order for this to occur, the test has:


    a. Little true variance

    b. Large error variance

    c. Large specific variance

    d. Little observed variance

    17. When using criterion measures for establishing validity, a validity coefficient is

    computed. Theoretically, in terms of variance, the validity coefficient is the square

    root of the ratio of:

    a. Variance common with the criterion to observed variance

    b. Observed variance to variance common with the criterion

    c. True variance in the criterion to observed variance

    d. True variance in the criterion to true variance in the test being validated

    18. In order to enhance validity, given a criterion consisting of several abilities, we would want a test with low item intercorrelations.

    T F

    19. Predictive validity of a test is increased as the groups tested become more

    homogeneous.

    T F

    20. Construct validity refers to the adequacy of item construction for a test.

    T F

    -10-


    CRITERION-REFERENCED TESTS

    A criterion-referenced test score indicates the level of performance on a well-specified

    domain of content.

    When the test items are not representative of a well-specified domain we cannot

    generalize our results beyond the specific items on the test.

    Item forms contain enough detail about how the items should be constructed so that

    they represent a well-specified domain.

    Instructional objectives are usually too terse to provide an adequate description of a

    domain. Objectives and test specifications are needed before criterion-referenced tests

    are appropriate.

    Teachers can construct criterion-referenced tests through the use of objectives and item

    specifications.

    A standard of minimal acceptable performance is required whenever a decision about mastery is to be made. There are several methods for setting such standards, none of them perfect, that can be used with criterion-referenced and other kinds of tests.
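
    As a minimal sketch of one such method, the Angoff procedure (named in the review items below) has each judge estimate, for every item, the probability that a minimally competent examinee would answer it correctly; the ratings here are hypothetical:

```python
# Each judge's per-item probability estimates for a minimally competent examinee.
judge_ratings = [
    [0.7, 0.6, 0.8, 0.5, 0.9],  # judge 1, items 1-5
    [0.6, 0.7, 0.7, 0.6, 0.8],  # judge 2, items 1-5
]

# Each judge's cut score is the sum of the ratings; the standard is the average.
cut_score = sum(sum(ratings) for ratings in judge_ratings) / len(judge_ratings)
print(f"cut score: {cut_score:.2f} of {len(judge_ratings[0])} items")  # 3.45
```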

    Key Terms and Concepts

    Norm-referenced
    Criterion-referenced
    Domain
    Item form
    Objective
    Test specification
    Stimulus attributes
    Response attributes
    Mastery decision
    Standard setting

    Review Items

    1. Item forms are seldom used by classroom teachers because the forms:

    a. Lack validity

    b. Are complex and unwieldy

    c. Require extensive pilot testing

    d. Are appropriate only for standardized tests


    2. Item forms refer to:

    a. Item-writing rules

    b. Response types (e.g., true-false, multiple-choice)

    c. Parallel forms of items for reliability

    d. Patterns of responses to sets of items

    3. Which of the following is most critical in a criterion-referenced test?

    a. A prespecified standard of performance

    b. Objectively scored items

    c. Specific behavior objectives

    d. A well-specified domain

    4. A major strength of criterion-referenced testing is the ability to:

    a. Generalize the results to a large set of items

    b. Compare individuals in terms of relative standing

    c. Establish objective performance criteria

    d. Measure difficult-to-define constructs

    5. A poorly defined domain results in items that are:

    a. Too difficult

    b. Dissimilar

    c. Ambiguous to the examinees

    d. Unreliable

    6. Teachers often prefer criterion-referenced tests to norm-referenced tests because:

    a. They are very concerned with which student is best in the class

    b. They need to compare the learning in their class to that of other classes

    c. Of the specific, discrete knowledge or skills that are assessed rather than global

    constructs

    d. They are usually easier to construct


    7. The test construction concept that is more detailed than instructional objectives but

    less cumbersome than item forms is (are):

    a. Item calibrations

    b. Test blueprints

    c. Item objectives

    d. Test specifications

    8. When a test has a preset standard of minimum acceptable performance, it is a

    criterion-referenced test.

    a. Always true

    b. Always false

    c. Sometimes true

    9. The method of setting standards that is most likely to be used in a classroom setting

    is the:

    a. Professional judgment method

    b. Nedelsky method

    c. Angoff method

    d. Contrasting groups method

    10. A panel of qualified experts is not used in which of the following methods of setting

    standards?

    a. Professional judgment

    b. Nedelsky

    c. Angoff

    d. Contrasting groups method

    11. Teachers often prefer criterion-referenced measures to norm-referenced measures

    because criterion-referenced measures:

    a. Are more reliable


    b. Are less intimidating to students

    c. Indicate what the student can do

    d. Indicate who in class has done the best

    12. Critics of criterion-referenced tests are correct when they characterize standard-

    setting procedures as:

    a. Vague

    b. Subjective

    c. Inconsistent

    d. Sophisticated

    13. Most likely, tests that are linked to brief, specific instructional objectives are good
    examples of criterion-referenced tests.

    T F

    14. The item formats (e.g., multiple-choice, essay, etc.) should be different for norm-

    referenced tests than for criterion-referenced tests.

    T F

    15. The panel of experts that is used in some standard-setting methods in school
    settings consists of:

    a. Academically talented students

    b. Classroom teachers

    c. Parent volunteers

    d. Students from higher grades

    -11-


    ITEM STATISTICS FOR CRITERION-

    REFERENCED TESTS

    A test score is determined by the performance of the student on each of the items on

    the test. In order to understand the test score it is essential that we understand how

    each item contributes to that score. The quality of the test depends on the quality of the
    items that comprise it. The procedures described in this chapter are ways to look at

    the quality of the test items.

    Items should be subjected to a content review before the test is given. Experts and

    colleagues can help us by reviewing the test items for their match with the domain

    specifications or the objectives, for any potentially biased wording, and for any

    observable flaws in item construction.

    After the tests have been administered and scored, there should be a review of the

    kinds of errors that were made so that remediation can focus on these errors. Statistical

    analysis should be done so that we have evidence about the difficulty levels of the items

    and about the degree to which the items are discriminating between masters and non-

    masters or between students before and after instruction.

    The difficulty levels of test items often turn out to be quite different from what the
    teacher expected. Difficulty levels are a clear measure of how the students performed on
    a specific task, the test item. As such, they provide very useful information to the teacher.
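
    As a concrete illustration (a sketch added here, not part of the original text; the
    function name and data are hypothetical), the difficulty index can be computed
    directly from scored responses:

        # Sketch: item difficulty index, the proportion of examinees
        # who answered the item correctly.
        def difficulty_index(responses):
            """responses: 1 for a correct answer, 0 for an incorrect one."""
            return sum(responses) / len(responses)

        # Hypothetical class of 30 students, 20 answering correctly,
        # as in review item 7 below: p = 20/30 = .67
        item_responses = [1] * 20 + [0] * 10
        print(round(difficulty_index(item_responses), 2))  # 0.67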

    The discrimination index provides information that is directly related to the purpose of

    the test. A discrimination index can be seen as an analogy to the sport of rowing. If all of

    the items on a test are likened to the crew members, we see that things work best when

    they are all pulling together. This is the case when all of the discrimination indexes are

    positive. If one of the crew lifts his or her oars out of the water and does nothing, it is

    like an item with a zero discrimination index. A negative discrimination index would be

    the situation of a crew member (item) rowing in the opposite direction from the rest of
    the crew. Clearly, this latter case requires some corrective action.
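
    The analogy can be made concrete with a small sketch (added here, with hypothetical
    data): one common discrimination index for criterion-referenced tests is the difference
    between the proportions of masters and non-masters who answer the item correctly.

        # Sketch: discrimination index as p(masters) - p(non-masters).
        def discrimination_index(masters, non_masters):
            p_masters = sum(masters) / len(masters)
            p_non_masters = sum(non_masters) / len(non_masters)
            return p_masters - p_non_masters

        masters = [1, 1, 1, 1, 0]        # hypothetical responses, 1 = correct
        non_masters = [1, 0, 0, 1, 0]
        print(round(discrimination_index(masters, non_masters), 2))  # 0.4

    A positive value is an item pulling with the crew, zero is the idle oar, and a negative
    value is the item rowing against the others.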

    We cannot merely assume that we create high-quality test items. We need to subject
    those items to item analysis in order to convince ourselves and others of their quality.
    The analysis is time well spent: the information that is provided will help us to understand

    the quality of our tests so that we can base decisions on those test scores with

    confidence.


    Key Terms and Concepts

    Item analysis Pre- and post-discrimination index

    Content review Mastery/non-mastery

    Pilot testing Item discrimination index

    Error patterns Item difficulty index

    Review Items

    1. An item with a p value near 1.0 is quite:

    a. Easy

    b. Difficult

    c. Discriminating

    d. Reliable

    2. Analysis of test results at the item level is useful for:

    a. Decisions about individual students

    b. Decisions about instruction

    c. Decisions about the test items

    d. All of the above

    3. Panels with diverse backgrounds can be used to examine test items for:

    a. Item bias

    b. Difficulty

    c. Discrimination

    d. Continuity

    4. It is critically important in criterion-referenced tests that the test items:

    a. Are difficult when used on a pretest

    b. Match the domain or objective

    c. Discriminate between competent and less competent students

    d. Not be difficult


    5. The difficulty index refers to:

    a. Values of student ratings of whether an item was easy or difficult

    b. The percentage of examinees who answered an item correctly

    c. Teacher judgment of how well students are likely to do on an item

    d. The likelihood of guessing the correct answer to a test item

    6. Pre- and post-discrimination refers to whether a test item:

    a. Is easier on the posttest than on the pretest

    b. Adequately discriminates pretests from posttests

    c. Discriminates unfairly against certain ethnic groups

    d. Would be better placed on a pretest than on a posttest

    7. If 30 students are tested and 20 answer item 4 correctly, the difficulty index for
    item 4 would be:

    a. .10

    b. -.10

    c. .33

    d. .67

    8. The item statistic that would indicate the most serious concern would be:

    a. Difficulty equal to .85

    b. Difficulty equal to .05

    c. Discrimination equal to -.50

    d. Discrimination equal to .00

    9. Items that match a well-specified domain should have difficulty levels that:

    a. Are exactly equal

    b. Are very similar

    c. Range from 0 to 1

    d. Match the domain specification


    10. The higher the value of the difficulty index, the:

    a. Easier the item

    b. More discriminating the item

    c. Lower the percentage correct on the item

    d. More biased the item

    11. If an item has a positive discrimination index:

    a. The item should be reviewed

    b. The item is biased

    c. The item appears to be effective

    d. The test will not be valid

    12. Item analysis is:

    a. A content analysis

    b. A statistical analysis

    c. Both a and b

    13. When can statistical item analysis be done?

    a. Before the test is given

    b. While the test is being given

    c. After the test is given

    d. Both b and c

    14. Other teachers would be most needed when determining:

    a. Item difficulty

    b. Item discrimination

    c. Item reliability

    d. Item bias


    15. A test item that is positively discriminating for third-graders would be positively

    discriminating for second-graders.

    a. Definitely true

    b. Possibly true

    c. Definitely false

    -12-


    RELIABILITY OF CRITERION-REFERENCED TESTS

    A test is reliable if it provides consistent information about examinees. This can mean
    that a criterion-referenced test provides consistent estimates of performance on a

    domain or that the test provides consistent placement of an examinee in a mastery or

    non-mastery category. Different kinds of reliability evidence are needed for each of

    these uses of criterion-referenced tests.

    Whether a test is consistent relative to mastery decisions is shown by giving the test on

    two occasions to the same group of examinees and finding the percentage of
    examinees whose mastery/non-mastery classifications were the same on the two
    test occasions. This procedure could also be used when a parallel form of the test is
    given on the second testing. A reliable test would have a high percentage of examinees
    with the same mastery/non-mastery classification on the two tests.
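
    A minimal sketch of this consistency check (an added illustration; the cut-off score
    and the score lists are hypothetical):

        # Sketch: proportion of examinees receiving the same mastery/non-mastery
        # classification on two administrations of the test.
        def classification_consistency(first, second, cutoff):
            same = sum((a >= cutoff) == (b >= cutoff)
                       for a, b in zip(first, second))
            return same / len(first)

        day_one = [82, 74, 55, 90, 61]   # hypothetical percent-correct scores
        day_two = [85, 65, 58, 88, 66]
        print(classification_consistency(day_one, day_two, cutoff=70))  # 0.8

    Review item 6 below applies the same idea: (70 + 15) / 100 = .85.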

    When performance on a domain is to be estimated from the test scores, the standard

    error of measurement can be used to form an interval estimate. An interval estimate

    suggests the degree of imprecision that is in our test scores. The standard error of

    measurement gives us an idea about how much we can expect test scores to fluctuate

    across repeated testing.
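
    One standard way to build such an interval (the formula below is the classical one
    and is an addition to the text; $s_X$ is the standard deviation of the scores and
    $r_{XX'}$ the reliability):

    $\text{SEM} = s_X \sqrt{1 - r_{XX'}}$, with an approximate 95% interval of $X \pm 1.96\,\text{SEM}$.

    For example, with $s_X = 10$ and $r_{XX'} = .91$, $\text{SEM} = 10\sqrt{.09} = 3$,
    so an observed score of 75 suggests a band of roughly 69 to 81.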

    The reliability of the test can be increased by careful attention to the test items, the test

    setting, and the examinees. A reliable test would have items that are homogeneous.
    The more similar the items are, the more consistent will be the students' approach to
    those items. The items should be free of flaws or vagueness of wording so

    that inconsistencies are reduced. And, because there is a direct relationship between

    the length of the test and the reliability of the test, there should be a sufficient number of

    items.
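
    The length-reliability relationship mentioned here is classically quantified by the
    Spearman-Brown formula (an added note; $r_{11}$ is the current reliability and $k$
    the factor by which the test is lengthened):

    $r_{kk} = \dfrac{k\,r_{11}}{1 + (k - 1)\,r_{11}}$

    Doubling a test with reliability .60, for instance, gives $r_{22} = \frac{2(.60)}{1 + .60} = .75$.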

    Inconsistencies in student performance can be reduced by making sure that the testing

    conditions are appropriate. There should be adequate light and quiet so that the student
    can concentrate on the task. Interruptions or distractions should be eliminated, and the

    test items and directions about how to answer them should be clear.

    Reliable scores depend on the students being motivated to apply themselves to the

    task. This is promoted when the teacher encourages the students to do well and

    explains how the test scores will be used. The teacher should be alert for individual

    student problems such as fatigue or anxiety that might be affecting the reliability of the

    test scores.


    5. Which one of the following values, computed for the reliability of a test, would
    indicate that the test is totally unreliable?

    a. .10

    b. .00

    c. .50

    d. 1.00

    6. Exactly 100 students took a criterion-referenced test twice. The test had a mastery

    cut-off score; 70 students were above the cut-off score on both tests and 15

    students were below the cut-off score on both tests. The reliability of the test for

    mastery decisions would be:

    a. .40

    b. .55

    c. .70

    d. .85

    7. When estimating the reliability of a domain score, it is appropriate to use the:

    a. Standard error of measurement

    b. Average score for the class

    c. Range of possible domain scores

    d. Measurement error coefficient

    8. Other things being equal, the longer a test is, the____________will be its reliability.

    a. Higher

    b. Lower

    c. Less ambiguous

    d. More valid

    9. The reliability coefficients that were developed for norm-referenced tests can also be

    used effectively with criterion-referenced tests.

    T F


    10. The same criterion-referenced test was given to 30 children on consecutive days.
    Ten of the children who surpassed the mastery cut-off score on the first day failed
    to do so on the second day. The test could be said to be:

    a. Unfair

    b. Biased

    c. Unreliable

    d. Discriminating

    11. A test is either reliable or it isn't.

    T F

    12. Test reliability is primarily determined by the test itself. The test setting and the

    examinee have a minimal impact on test reliability.

    T F

    13. Which of the following is most related to high criterion-referenced reliability?

    a. Item difficulty near .50

    b. Item discrimination near .50

    c. Short tests

    d. A wide range of item types

    14. When estimating a domain score, the reliability would increase if:

    a. The items were more difficult

    b. The items were somewhat dissimilar

    c. The test had a cut-off score for mastery decisions

    d. The test was longer

    15. The longer the test, the smaller the___________.

    a. Time between pre- and posttesting

    b. Standard error of measurement


    c. Difficulty index

    d. Reliability discrepancy

    -13-

    VALIDITY OF CRITERION-REFERENCED TESTS


    A test that adequately serves the purpose for which it is used is considered to be a valid

    test. Validity is always defined in terms of the purpose for which the test scores will be

    used. Validity is a matter of degree. One test may be more valid than another but tests

    are not usually totally lacking in validity and they are never perfectly valid.

    Because criterion-referenced tests are used for several different purposes, including

    estimating performance on a domain and determining whether students have achieved

    mastery, it is not surprising that different kinds of logical and statistical evidence should

    be presented to support the validity claims. The three kinds of test validity that were

    introduced are content validity, criterion validity, and construct validity.

    Content validity is a determination of the extent that the test items match the domain

    specifications or objectives. Validity is established by having qualified persons, a panel

    of experts, review the test items for appropriateness and congruence with the domain.
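
    One simple way to summarize such a panel review (an illustrative sketch; the
    ratings and the two-thirds flagging rule are hypothetical choices, not a
    prescribed procedure):

        # Sketch: each expert rates an item 1 if it is congruent with the
        # domain specification, 0 if not; low agreement flags the item.
        panel_ratings = {
            "item_1": [1, 1, 1],
            "item_2": [1, 0, 1],
            "item_3": [0, 1, 0],
        }
        for item, votes in panel_ratings.items():
            congruence = sum(votes) / len(votes)
            note = "" if congruence >= 2 / 3 else "  <- revise or replace"
            print(f"{item}: congruence = {congruence:.2f}{note}")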

    Criterion validity is concerned with whether the test would be an adequate predictor of

    performance on some other variable. Validity evidence is established by finding the

    correlation coefficient that links the test with the criterion that is to be predicted. The

    choice between two competing tests would be based on which test has the higher
    correlation with the criterion. When we are concerned about mastery decisions on two
    measures, the degree of validity is shown by the percentage of persons for which the
    mastery/non-mastery decision is consistent.
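
    A sketch of the statistical side (hypothetical scores; the function implements the
    standard Pearson correlation formula):

        import statistics

        def pearson_r(x, y):
            mx, my = statistics.mean(x), statistics.mean(y)
            num = sum((a - mx) * (b - my) for a, b in zip(x, y))
            den = (sum((a - mx) ** 2 for a in x)
                   * sum((b - my) ** 2 for b in y)) ** 0.5
            return num / den

        test_scores = [12, 15, 9, 20, 17]        # test being validated
        criterion_scores = [30, 38, 25, 45, 40]  # criterion to be predicted
        print(round(pearson_r(test_scores, criterion_scores), 2))  # 0.99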

    Construct validity is shown by making predictions about the test scores and then

    conducting analyses to see whether the predictions are confirmed. Some of the
    reasonable predictions are: (1) the test scores should be positively correlated with other

    measures of the same thing, (2) groups that are known to differ on the domain should

    have test scores that are significantly different, and (3) we should not find different

    patterns of responses across distracters for persons of different races, grades, or other

    characteristics.
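
    The second prediction, the known-groups comparison, can be sketched as follows
    (hypothetical scores; a simple Welch t statistic stands in for whatever
    significance test one prefers):

        import statistics

        instructed = [18, 20, 17, 19, 21]    # group known to have learned the domain
        uninstructed = [11, 13, 10, 14, 12]  # group known not to have
        m1, m2 = statistics.mean(instructed), statistics.mean(uninstructed)
        v1, v2 = statistics.variance(instructed), statistics.variance(uninstructed)
        t = (m1 - m2) / ((v1 / len(instructed) + v2 / len(uninstructed)) ** 0.5)
        print(f"mean difference = {m1 - m2}, t = {t:.1f}")  # difference of 7, t = 7.0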

    We cannot merely assume that our tests are valid. We need to conduct careful analyses
    to show that our tests have sufficient content, criterion, or construct validity so that
    we can justify

    the use of the tests.

    Key Terms and Concepts

    Validity Construct validity

    Content validity Logical analysis


    Criterion validity Statistical analysis

    Correlation coefficient Distractor analysis

    Review Items

    1. If a test is valid it is certainly also reliable.

    T F

    2. Which of the following is not a validity that is described in the technical standards
    for test publishers?

    a. Criterion-referenced validity

    b. Content validity

    c. Construct validity

    d. Criterion validity

    3. Whether the items on a test match the domain of the criterion-referenced test is

    primarily a concern about:

    a. Cut-off score validity

    b. Item validity

    c. Content validity

    d. Criterion validity

    4. Essentially the same processes are used to establish the validity of criterion-

    referenced tests as are used with norm-referenced tests.

    T F

    5. A well-specified domain for a criterion-referenced test should enhance the tests:

    a. Reliability


    b. Content validity

    c. Discrimination

    d. Criterion validity

    6. A panel of experts will sometimes be used to rate items in order to promote:

    a. Content validity

    b. Test sales

    c. Construct validity

    d. User validity

    7. If students who surpass the mastery cut-off score for the addition of three-digit
    numbers also tend to be those students who achieve mastery on a related criterion
    measure, this is evidence of:

    a. Content validity

    b. Criterion validity

    c. Convergent validity

    d. Mathematical validity

    8. The correlation coefficient is a statistical way of expressing:

    a. The standard error of measurement

    b. Mathematical validity

    c. Content validity

    d. Criterion validity

    9. Which validity requires both a logical process and a statistical process?

    a. Content validity


    b. Convergent validity

    c. Construct validity

    d. Criterion validity

    10. If items on a criterion-referenced test do not match a well-defined domain, the test

    lacks adequate:

    a. Construct validity

    b. Content validity

    c. Criterion-referenced validity

    d. Criterion validity

    11. If two writers, working from the same test specifications, created test items that
    were quite different from each other, the test would have inadequate:

    a. Criterion validity

    b. Item validity

    c. Specification validity

    d. Content validity

    12. In order to know whether a test is valid, it is most important to know:

    a. The purpose for which the test scores will be used

    b. A description of the persons who will take the test

    c. An estimate of the reliability of the test

    d. Whether the test has ever been used before

    13. When a test does not achieve the purpose for which it was designed, the test lacks:

    a. Validity


    b. Reliability

    c. Purposefulness

    d. Discrimination

    14. Lengthening a test will make it more valid.

    a. True, if it is somewhat valid to begin with

    b. False, test length affects reliability, not validity

    c. True, but only for older students

    d. False, validity is related to purpose rather than length

    15. The primary validity for most criterion-referenced tests is:

    a. Construct validity

    b. Criterion validity

    c. Content validity

    d. Criterion-referenced validity

    -14-

    Key Terms and Concepts

    Test wiseness Separate answer sheets

    Correction for guessing Testing arrangement

    Positional preference Take-home exam

    Bluffing Oral exam

    Test anxiety

    Review Items

    1. Programs for teaching test-taking skills tend to be:

    a. Equally effective throughout grades 1-8

    b. More effective with lower grades than upper elementary grades

    c. More effective with upper grades than lower elementary grades

    d. Of no effect throughout grades 1-8

    2. A student who guesses on every test item will have the highest score on which kind

    of test? (Assume the tests have equal length)

    a. True-false

    b. Multiple-choice (four-option)

    c. Multiple-choice (five-option)

    d. Fill-in-the-blank

    3. If a correction rather than a penalty for guessing is used on a multiple-choice test,

    students should be urged to:

    a. Guess when they are unsure of an answer

    b. Not guess when they are unsure of an answer

    c. Guess only on items for which they have some partial knowledge

    4. What is the expected score of a student who guesses on all items of a 50-item
    multiple-choice test with five options per item? (A worked sketch of the guessing
    correction follows these review items.)

    a. 0

    b. 5


    c. 10

    d. 15

    5. The relationship between test anxiety and test performance is generally:

    a. Strong and positive

    b. Strong and negative

    c. Weak and positive

    d. Weak and negative

    6. A limit is set on the length of response (in words) to an item. This is an attempt to

    limit the effect of:

    a. Guessing

    b. Bluffing

    c. Positional preference

    d. Changing answers

    7. Which of the following is least related to the others? The effect of:

    a. Test anxiety

    b. Bluffing

    c. Penmanship and spelling

    d. Positional preference

    8. A major problem with oral examinations is that they tend to be:

    a. Very time consuming

    b. Very anxiety producing

    c. Unreliable

    d. Formal rather than informal

    9. Students should be shown how to take tests so that the tests provide:

    a. Enriched diagnostic information for future instructional planning

    b. Information about the test setting, as well as the test content


    c. Information about wrong answers, as well as right answers

    d. A more accurate picture of what the student is able to do

    10. Take-home tests are subject to the following problem:

    a. Lack of control over the testing situation

    b. They are not appropriate for evaluation purposes

    c. Time spent on the test varies among students

    d. All of the above

    11. Generally, more test answers are changed from wrong to right than right to wrong.

    T F

    12. Applying the correction for guessing raises the score on a test.

    T F

    13. Good penmanship and spelling tend to be positively correlated with the grades
    assigned to essay responses.

    T F

    14. The physical arrangement of the testing situation is as important for enhancing

    student performance as is establishing control and rapport.

    T F

    15. Separate answer sheets can be used effectively for students beginning with those in

    second grade.

    T F
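
    A worked sketch of the guessing correction behind review items 3 and 4 (the formula
    R - W/(k - 1) is the classical correction, where R is the number right, W the number
    wrong, and k the number of options per item; the function and data are illustrative):

        # Sketch: classical correction-for-guessing score.
        def corrected_score(rights, wrongs, options_per_item):
            return rights - wrongs / (options_per_item - 1)

        # A pure guesser on a 50-item, five-option test expects about
        # 10 right and 40 wrong; the correction removes exactly the
        # score that chance alone is expected to produce.
        print(corrected_score(10, 40, 5))  # 0.0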


    -15-

    THE USE OF STANDARDIZED ACHIEVEMENT

    TESTS

    Standardized achievement tests are widely used in our schools. There are many tests

    on the market, available in a variety of forms, including norm-referenced and criterion-

    referenced tests.

    Test results can be reported for individual students, classrooms, or even school

    buildings. In addition, local and national norms are available for many tests.

    Publishers of standardized achievement tests are guided by the Standards for Educational
    and Psychological Testing (1985). These guidelines can also be used by consumers to
    evaluate the test information that publishers provide.

    Teachers play an important role in standardized achievement testing. They need to take

    this role seriously and make sure that the physical and psychological settings promote a

    positive testing environment.

    Despite their popular use, standardized achievement tests do have certain limitations.

    Some of these deal with the time required for the test administration and the processing

    of test scores. Other limitations concern the usefulness and accuracy of the reported

    scores. Some people worry that the average performance has become a standard of

    performance and that achievement tests may have too much influence over school

    curricula.

    High-quality standardized achievement tests are available; they do the job that they

    were designed to do. However, when tests are used for other purposes, their

    effectiveness will be limited. Therefore, those who select standardized achievement
    tests must do a careful and complete job of comparing alternatives.

    The major determinant of the appropriateness of a standardized achievement test is the

    match between the items and what was actually taught in the schools. Technical

    adequacy and cost are also factors.


    Standardized achievement tests are perhaps our best example of high-quality testing in
    education. Careful item preparation, extensive reliability and validity studies,
    well-designed norm groups, and clear reporting formats combine to make standardized
    achievement tests useful, accurate measures of student performance.

    In the future we can expect to see efficient, computerized achievement testing. However,
    the types of standardized testing programs that we know now are likely to be

    maintained in the 1990s.
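
    The adaptive idea mentioned above can be illustrated with a toy sketch (entirely
    illustrative; operational computerized adaptive tests use item response theory
    rather than this simple nudging rule):

        # Sketch: each item is chosen because its difficulty is closest to the
        # current ability estimate, which moves after each answer.
        pool = {0.2: True, 0.4: True, 0.5: False, 0.7: False, 0.9: False}
        # keys: hypothetical item difficulties; values: this examinee's answers

        estimate, step = 0.5, 0.1
        for _ in range(3):
            difficulty = min(pool, key=lambda d: abs(d - estimate))
            correct = pool.pop(difficulty)
            estimate += step if correct else -step
            print(f"item {difficulty}: correct={correct}, estimate={estimate:.1f}")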

    Key Terms and Concepts

    Standardized achievement test

    Standards for Educational and Psychological Testing

    Relevance

    Technical adequacy

    Usability

    Computerized adaptive testing

    Review Items

    1. What is standardized on a standardized achievement test?

    a. The anticipated level of performance

    b. The conditions for test administration

    c. The test validity

    d. The purpose for which the test is given

    2. Which of the following uses of standardized achievement test scores is closest to

    the main purpose for which such tests are constructed?

    a. Assessing the achievement level of an individual student

    b. Measuring school, class, and district wide achievement levels

    c. Assessing whether students have adequate levels of achievement for promotion

    to the next grade

    d. Providing objective evidence about the teacher's competence


    3. Standardized achievement tests are norm-referenced rather than criterion-

    referenced tests.

    T F

    4. The Standards for Educational and Psychological Testing are:

    a. Guidelines about the use of the test scores

    b. Summaries of legal cases concerning the use of test scores

    c. Examples, including case studies, of the misuse of test scores

    d. The legal minimum requirements for corporations that sell tests

    5. Which of the following factors should be the most important consideration when

    selecting a standardized achievement test?

    a. Reliability

    b. Cost

    c. Publisher's reputation

    d. Relevance

    6. Mr. Jones allows his students an extra 20 minutes on a standardized achievement
    test. What is the major consequence of his actions?

    a. The reliability coefficient will increase

    b. The content validity will decrease

    c. Norm-referenced interpretations of the score will not be meaningful

    d. Criterion-referenced interpretations of the scores will not be meaningful

    7. Most teachers find that the results of standardized achievement tests are helpful to

    them when planning instruction for individual students.

    T F

    8. Most commercially available standardized achievement tests are so excellent that

    they can form the sole basis for many decisions about a student's academic

    progress.


    T F

    9. Achievement-at-grade-level is:

    a. A meaningless term statistically

    b. An expectation that we should have for all students

    c. Average performance for students in that grade

    d. An arbitrary assessment based on standardized achievement test scores

    10. What is a form of testing in which the items that are presented to the student

    depend on the student's answers to previous items?

    a. Non-standardized testing

    b. Response-dependent testing trials

    c. Step-by-step testing

    d. Computerized adaptive testing

    11. If the items on a standardized achievement test did not match what was taught in a

    particular school, the test would lack:

    a. Technical adequacy

    b. Relevance

    c. Utility

    d. Reliability

    12. The verb that is frequently used in Standards for Educational and Psychological

    Testing and that shows the orientation of the Standards is:

    a. Should

    b. Must

    c. Might

    d. Shall

    13. Some major standardized achievement tests interpret the same test performance in

    both norm-referenced and criterion-referenced formats.


    T F

    14. Standardized achiev