Educational and Measurement Testing
8/8/2019 Educational and Measurement Testing
-1-
INTRODUCTION
National trends have pretty well been set for the 1990s. They indicate continued
emphasis on testing, at least as far as standardized tests are concerned, with expanded
uses of test results.
The local school continues to be the implementing agency for assessment programs,
typically those mandated at the state level but also influenced by national trends.
Technically, measurement is the assignment of numerals to objects or events according
to rules that give numerals quantitative meaning.
Measurements may differ in the amount of information the numbers contain. These
differences are distinguished by the terms nominal, ordinal, interval, and ratio scales of
measurements.
The four levels of measurement can be summarized as follows:
1. Nominal scales categorize but do not order.
2. Ordinal scales categorize and order.
3. Interval scales categorize, order, and establish an equal unit in the scale.
4. Ratio scales categorize, order, establish an equal unit, and contain a true zero
point.
Norm-referenced interpretation is a relative interpretation based on an individual's
position with respect to some group, often called the normative group. Norms consist of
the scores, usually in some form of descriptive statistics, of the normative group.
Criterion interpretation is an absolute rather than relative interpretation, referenced to a
defined body of learner behaviors, or, as is commonly done, to some specified level of
performance.
Diagnostic tests are intended to identify student deficiencies, weaknesses, or problems
and to locate the source of the difficulty. If related learning activities are prescribed, the
term prescriptive tests may be used.
Formative testing occurs over a period of time and monitors students' progress.
Summative testing is done at the conclusion of instruction and measures the extent to
which students have attained the desired outcomes.
Key terms and concepts
Tailored Test
Measurement
Levels of Measurement
Nominal Scale
Ordinal Scale
Interval Scale
Ratio Scale
Norm-referenced
Assessments
Test
Criterion-referenced
Diagnostic test
Prescriptive test
Formative testing
Summative testing
Review Items
1. Traditionally, the use of tests in the schools has been:
a. Predominately norm-referenced.
b. Predominately criterion-referenced.
c. About evenly split, norm-referenced and criterion-referenced.
d. Neither norm-referenced nor criterion referenced.
2. When state legislators have mandated testing, such legislation has been focused
primarily on:
a. Increased classroom testing to improve instruction.
b. Inherent ability testing, for example, using IQ tests.
c. Minimum competency testing.
d. Increased testing of student motivation.
3. Norm-referenced interpretations of test results are directed primarily to the purposes
of:
a. Discriminating among individuals.
b. Discriminating among groups.
c. Discriminating among programs.
d. Discriminating between a program and a standard.
4. Criterion-referenced interpretations of test results are directed to:
a. Relative interpretations of individual scores.
b. Absolute interpretations of individual scores.
c. Relative interpretations of group scores.
d. Absolute interpretations of group scores.
5. Of the terms below, the one having the narrowest meaning is:
a. Evaluation.
b. Assessment.
c. Measurement.
d. Test.
6. What characteristic distinguishes evaluation from measurements?
a. Evaluation requires quantification.
b. Measurements include testing; evaluation does not.
c. Evaluation involves a value judgment.
d. Evaluation includes assessment, measurement does not.
7. Of the four levels of measurement, the one that contains the most information in the
number is:
a. Interval.
b. Nominal.
c. Ordinal.
d. Ratio.
8. If, when reading information about a class, we assign a 1 to girls and a 0 to boys,
this level of measurement is:
a. Interval.
b. Nominal.
c. Ordinal.
d. Ratio.
9. The scores from a typical classroom test are probably better than
____________ measurement, but not quite ___________.
a. Ordinal, nominal.
b. Interval, ratio.
c. Nominal, ordinal.
d. Ordinal, interval.
10. When using a performance test, the difference between scores of 65 and 70 equals
the difference between scores of 85 and 90, but a score of 80 is not twice a score of
40. The level of measurement is:
a. Interval
b. Nominal.
c. Ordinal.
d. Ratio.
11. Grading on the curve is a criterion-referenced interpretation of test scores.
T F
12. Whether a test is norm-referenced or criterion-referenced depends on the format of
the items included in the test.
T F
13. Most standardized achievement tests are designed for norm-referenced
interpretations.
T F
14. Criterion-referenced tests tend to be more general and comprehensive than
norm-referenced tests.
T F
-2-
PLANNING THE TEST
Tests are given for many different reasons. In order to achieve such diverse purposes,
they need to be carefully planned. In classroom settings, this planning usually entails
instructional objectives and/or a table of specifications.
When teachers use specific instructional objectives, it becomes very clear what should
be on the test. The desired student behaviors from the objectives translate directly into
the items for the test.
Objectives can be classified in terms of the kind of understanding that is required.
Bloom et al.'s taxonomy is useful in sorting objectives into six hierarchical levels of
understanding. When tests are based on such a taxonomy, they are more likely to assess
higher levels of reasoning.
A table of specifications is another tool that is used in test design. A two-dimensional
grid, content by cognitive process, is used to plan the number and kinds of items that
will be on the test.
The use of objectives or a table of specifications may imply strict guidelines for
constructing tests, but much of testing is determined by practical considerations. Test
constructors must consider such things as how much testing time is available, what item
formats the students can handle, and the developmental level of the examinee. A well-
planned test is no accident.
Key terms and concepts
Test purpose
Objectives
Knowledge
Taxonomy of educational objectives
Table of specifications
Review Items
1. The Bloom et al. taxonomy of educational objectives in the cognitive domain is a
hierarchy based on the:
a. Extent of recall required to attain an objective.
b. Level of understanding required to attain an objective.
c. Reading level required to attain an objective.
d. Aptitude required to attain an objective.
2. Tests are constructed for many purposes. Which of these is not a common
purpose of tests?
a. Affirmation.
b. Prediction.
c. Evaluation.
d. Description.
3. A test that is useful for one purpose is also likely to be effective for many other
purposes.
T F
4. Test items should be evaluated in terms of how well they match the test's:
a. Purpose.
b. Reading level.
c. Reliability.
d. Assessment.
5. Explaining in one's own words what a compromise entails is a task at what level of
Bloom's taxonomy?
a. Application.
b. Knowledge.
c. Comprehension.
d. Analysis.
6. An assignment to design a politically acceptable solution to nuclear waste
disposal would be at which taxonomic level?
a. Synthesis.
b. Knowledge.
c. Comprehension.
d. Analysis.
7. Matching authors' names and titles of their books is a task at what taxonomic
level?
a. Application.
b. Knowledge.
c. Comprehension.
d. Analysis.
8. A knowledge-level understanding of reproduction is necessary but not sufficient
for a comprehension level of understanding of reproduction.
T F
9. Which of the following is not a part of a well-written instructional objective
according to Mager?
a. A description of the learner.
b. The behavior that is to be observed.
c. The conditions under which the behavior will occur.
d. Criterion of acceptable performance.
10. Which of the following educational goals is not stated in behavioral terms?
a. Read.
b. Understand.
c. List.
d. Count.
11. A table of specifications categorizes test items by:
a. Content and reading level.
b. Content and cognitive process.
c. Cognitive process and reading level.
d. Item type and cognitive process.
12. A table of specifications is used for:
a. Standards for evaluating a test.
b. Converting test scores to evaluation.
c. Listing instructional objectives.
d. Planning a test.
13. A table of specifications is not useful when instructional objectives are available.
T F
14. A table of specifications would be appropriate for an achievement test but not for
a test that is used to predict future academic performance.
T F
15. Almost all important educational outcomes can be expressed in terms of
behavioral objectives.
T F
-3-
SELECTED-RESPONSE ITEMS
Test items can be distinguished by the response required: the response is either selected
from two or more options or constructed by the test taker.
The three commonly used selected-response item formats are true-false, multiple-choice,
and matching.
Selected-response items have three general qualities:
1. They can be reliably scored.
2. They can sample the domain of content extensively.
3. They tend to measure memorization of unimportant facts unless care is taken in
constructing the items.
True-false can be effective when a few guidelines are followed in the construction:
1. Statements must be clearly true or false.
2. Statements should not be lifted directly from the text.
3. Specific determiners should be avoided.
4. Trick questions should not be used.
5. Some statements should be written at higher cognitive levels.
6. True statements and false statements should appear with about the same frequency
and be of similar length.
Multiple-choice items can be improved by following these guidelines:
1. Avoid grammatical clues.
2. Keep option length uniform.
3. Use plausible distracters.
4. Do not repeat key words from the stem in the options.
Correct answers should be randomly ordered across the response-option positions. The
position of the correct response should not provide a clue about its correctness.
Complex options (e.g., all of the above; none of the above; or a and b, but not c) should
be used sparingly, if at all.
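The random-ordering guideline above can be sketched in a few lines; the item and its options here are hypothetical, and Python is used purely for illustration:

```python
import random

# Hypothetical multiple-choice item: shuffle the options so the keyed
# answer's position carries no clue about its correctness.
options = ["mean", "median", "mode", "range"]  # "median" is the keyed answer
correct = "median"

random.shuffle(options)          # correct answer lands in a random position
key = options.index(correct)     # record its new position for the answer key
```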
Multiple-choice items are versatile:
1. They can measure higher cognitive outcomes.
2. They can provide diagnostic information.
Matching items are usually presented in a two-column format: one column consists of
premises and the other consists of responses.
Matching items should contain homogeneous content so that all responses must be
considered plausible answers.
The following guidelines apply to all selected-response item formats:
1. Teachers should be aware of the appropriate number of items on the test that can be
guessed correctly.
2. Test items should be independent: The content of one item should not provide the
answers to others, nor should correctly answering one question be a prerequisite to
correctly answering another.
3. Reading level of the test should be lower than the grade level, unless reading is
being tested.
Key Terms and Concepts
Objective item
Selected-response item
True-false item
Multiple-choice
Matching item
Options
Content sampling
Barrier
Stem
Distractors
Clue
Premise
Review Items
1. Objective items are objective only in their:
a. Item content.
b. Scoring.
c. Distracters.
d. Wording.
2. Many selected-response items can be asked in each testing session; thus, they can
provide good:
a. Levels of difficulty.
b. Objectivity.
c. Content sampling.
d. Time sampling.
3. The objective scoring of selected-response items enhances the tests:
a. Reliability.
b. Validity.
c. Reliability and validity.
d. Validity but not reliability.
4. When constructing true-false items, it is best to:
a. Use specific determiners.
b. Reproduce statements directly from the text.
c. Include about equal numbers of true statements and false statements.
d. Vary the length of false statements and true statements.
5. Which of the following is not a strength of multiple-choice items?
a. Effective testing of higher cognitive levels.
b. Content sampling.
c. Scorer reliability.
d. Allowing for educated guesses
6. When constructing multiple-choice items, it is best to:
a. Make all options the same length.
b. Put the main idea in the options.
c. Use options such as a and b, but not c.
d. Repeat key words from the stem in the options.
7. A matching item includes six events to be matched with nine responses consisting
of dates, cities, and states. The error of item construction is:
a. Too many premises.
b. Too few premises.
c. Responses contain heterogeneous content.
d. Responses contain homogeneous content.
8. When constructing a matching item, the numbers of premises and responses
should be:
a. Equal, with all responses used only once.
b. Equal, but having the option of using responses more than once.
c. Unequal, with a greater number of responses and any response being used only
once.
d. Unequal, with a greater number of responses and having the option of using
responses more than once.
9. The tendency for true-false items to measure trivia is a weakness of the item writer
more than of the item format.
T F
10. A good true-false item is clearly true or false.
T F
11. Increasing the length of a matching item tends to enhance the homogeneity of
content.
T F
12. Multiple-choice items generally require about the same response time per item as
matching items.
T F
13. The item format most appropriate for measuring knowledge of paired associates,
such as symbols and their meaning, is multiple-choice.
T F
14. The column of a matching item that contains the item stems is called the _______.
15. For usual classroom testing, the most desirable length for a matching item is
between ___________and___________premises.
-4-
CONSTRUCTED-RESPONSE ITEMS
For a short-answer item, the student supplies the answer to a question, association, or
completion form.
In constructing short-answer items, each item should have a unique, correct answer and
be structured so the student can clearly recognize its intent.
An essay item is one for which the student structures the response. He or she selects
ideas and then presents them according to his or her own organization and wording.
Essay items are used quite effectively to measure higher-level learning outcomes, such
as analysis, synthesis, and evaluation. Essay testing is not, however, an effective
means of measuring lower-level learning outcomes.
Essay items can be used to measure writing and self-expression skills. Although this
may not be the primary purpose of a given test, it is certainly worthwhile.
Scoring inconsistencies are the primary disadvantage of essay items. In addition,
irrelevant factors, such as neatness and penmanship, may also influence the score.
The extent of response is basically on a continuum, from restricted to extended. Writing
items with the response geared toward the restricted end tends to provide more focus
for the item.
The student must be directed to the desired response. This can be enhanced by
identifying the intended student behaviors and including them in the essay item.
The suggested time for responding to each test item should be provided to the students.
This designates the weight or value of each item and also helps students budget their
time.
Analytic scoring focuses on individual points or components of the response; holistic
scoring considers the response in its entirety, as a whole.
If possible, responses to items should be scored anonymously. In addition, all
responses to one item should be scored before moving on to the next item, rather than
scoring an entire test at a time. Also, the papers should be reordered before scoring the
next item.
Key Terms and Concepts
Constructed response
Short-answer item
Essay item
Completion form
Restricted response
Extended response
Objective scoring
Question form
Association form
Modal answer
Analytic scoring
Holistic scoring
Review Items
1. In an association-form short-answer item, the spaces for the responses should:
a. Vary according to the length of the correct response.
b. All be the same size.
c. Vary in size, but not according to any order.
d. Vary in size according to some system of ordering.
2. Completion, short-answer items should have the blank(s) placed:
a. At or near the beginning of the item.
b. Between the beginning and the middle of the item.
c. As close to the middle of the item as possible.
d. At or near the end of the item.
3. A Swiss cheese completion item has:
a. The blanks evenly spaced throughout the item.
b. Too many blanks.
c. Blanks of unequal size.
d. None of the above.
4. Essay items are popular in teacher-constructed tests because:
a. Of the subjectivity in their scoring.
b. They are perceived to be more effective in measuring higher-level outcomes than
objective items.
c. They tend to have greater content sampling than objective items.
d. They tend to have greater reliability than objective items.
5. When scoring essay items, all responses to one item should be scored before
scoring the next item, rather than scoring one entire test before scoring the next.
This procedure:
a. Increases the test validity.
b. Enhances the consistency of scoring.
c. Reduces bias against individual students.
d. Enhances the objectivity of scoring.
6. The test scorer reads the response to one essay item after already reading several
other responses to the same item. The score of this response will tend to be:
a. Higher, if the earlier responses were of poor quality.
b. Higher, if the earlier responses were of high quality.
c. Lower, if the earlier responses were of poor quality.
d. Unaffected by the quality of earlier responses.
7. The halo effect in scoring items is a tendency to score more highly those
responses:
a. Read later in the scoring process.
b. Read earlier in the scoring process.
c. Of students known to be good students.
d. That are technically well written.
8. A student receives a high score on an essay item, due, in part, to the quality of
responses to the item read earlier. This is:
a. A context effect.
b. A halo effect.
c. A reader-agreement effect.
d. None of the above.
9. Anonymous scoring of essay item responses tends to reduce:
a. Reader agreement.
b. The halo effect.
c. Order effect.
d. Effect due to technical characteristic, such as penmanship.
10. From a measurement standpoint, using classroom tests consisting entirely of essay
items is undesirable because:
a. Content sampling tends to be limited.
b. Scoring requires too much time.
c. It is difficult to construct the items.
d. Structuring model responses is too time consuming.
11. Short-answer items are generally easier to construct than matching items.
T F
12. The use of essay items is an effective means of measuring lower-level learning
outcomes.
T F
13. Analytic scoring of essay items tends to be faster than holistic scoring.
T F
14. Analytic scoring of essay items tends to be more objective than holistic scoring.
T F
15. There is a tendency to score longer responses to essay items more highly than
shorter responses.
T F
16. Including optional items in an essay exam is a desirable practice.
T F
-5-
NORM-REFERENCED MEASUREMENT
Norm groups are the referent groups for norm-referenced interpretations of test
scores. Such groups must be appropriate for the individuals tested and the purposes at
hand.
Norms should be representative, relevant, and recent.
Representativeness of the norm group depends on the size of the sample and the
sampling method. The latter has numerous factors associated with it and is the most
likely source of producing biased norms.
Relevance depends on the degree to which the norm group is comparable to the group
under consideration.
National, local, and subgroup norms provide different perspectives for interpreting the
results of tests.
Norms are measures of the actual performance of a group on a test. They are not meant
to be standards of what performance levels should be.
Descriptive statistics are used to summarize characteristics of sets of test scores. The
level of statistics commonly used in measurement is quite basic, requiring only simple
arithmetic operations.
Frequency distributions summarize sets of test scores by listing the number of people
who received each test score. All of the test scores can be listed separately, or the
scores can be grouped in a frequency distribution.
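An ungrouped frequency distribution of this kind can be tallied directly; the scores below are hypothetical, and the sketch simply counts how many students received each score:

```python
from collections import Counter

scores = [72, 85, 85, 90, 72, 68, 85, 90, 77, 72]  # hypothetical class scores

# Ungrouped frequency distribution: each score paired with its frequency
freq = Counter(scores)
for score in sorted(freq, reverse=True):
    print(score, freq[score])
```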
The mean, the median, and the mode all describe central tendency:
1. The mean is the arithmetic average.
2. The median divides the distribution in half.
3. The mode is the most frequent score.
Descriptive statistics that indicate dispersion are the range, the variance, and the
standard deviation. The range is the difference between the highest and lowest scores
in the distribution plus one. The standard deviation is a unit of measurement that shows
by how much the separate scores tend to differ from the mean. The variance is the
square of the standard deviation. Most scores are within two standard deviations of
the mean.
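These descriptive statistics can be checked by hand or, as a sketch with hypothetical scores, with Python's statistics module (note the range below follows the chapter's inclusive definition, highest minus lowest plus one):

```python
import statistics

scores = [60, 70, 70, 80, 90]  # hypothetical test scores

mean = statistics.mean(scores)      # arithmetic average -> 74
median = statistics.median(scores)  # middle score -> 70
mode = statistics.mode(scores)      # most frequent score -> 70

score_range = max(scores) - min(scores) + 1  # inclusive range -> 31
variance = statistics.pvariance(scores)      # population variance
sd = statistics.pstdev(scores)               # standard deviation, sqrt of variance
```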
Key Terms and Concepts
Norm group
Norms
Representativeness
Recency
National norms
Local norms
Grade equivalent score
Descriptive statistics
Frequency distribution
Measures of central tendency
Measures of dispersion
Mean
Median
Mode
Range
Variance
Standard deviation
Review Items
1. When using a norm-referenced interpretation, the student's score on a test is
compared to:
a. A minimum score for passing the test.
b. The score of others taking the test.
c. An expected score based on the student's ability.
d. A predetermined percentage of correct responses.
2. It is not important that the norm group for a nationally used achievement test:
a. Is large.
b. Is representative.
c. Is from at least three grade levels.
d. Has persons from all states.
3. The extent to which a norm group is comparable to the group being tested
determines the norm group's:
a. Relevance.
b. Representativeness.
c. Recency.
d. Reliability.
4. The Lake Wobegon Phenomenon in testing is the situation of:
a. Norm groups scoring unusually high on standardized tests.
b. Students scoring below the national average on standardized tests.
c. Students scoring above average on locally normed tests.
d. All states reporting above-average performance on nationally normed tests.
5. Norms for published standardized tests are commonly based on the performance
of:
a. Individual students who will perform well.
b. One or more groups of students.
c. Students in a typical school system.
d. A random sample of students from one state.
6. Local achievement and aptitude norms might be more important than national
norms in decisions about:
a. Future occupation.
b. The likelihood of success in certain colleges.
c. Selection into special high-school programs.
d. Allocations among different school districts.
7. The central administration of a school district sets a goal of having all elementary-
school students reading at or above the average on a nationally normed test. The
mistake being made is:
a. Making the assumption that the norm group is relevant to the local school.
b. Using the norm as a standard.
c. Attempting to have consistent reading performance in all schools.
d. Establishing too modest a goal.
8. Which of the following is not a measure of central tendency?
a. Mean.
b. Variance.
c. Mode.
d. Median.
9. When a distribution has a small number of scores, some of which are very extreme,
the preferred measure of central tendency is the:
a. Median
b. Mean
c. Range
d. Mode
10. A measure of dispersion for a distribution, whose computation involves only the
extreme scores, is:
a. Standard deviation
b. Variance
c. Mode
d. Range
11. Which of the following provides a measure of dispersion in the same units as the
original scores?
a. Variance
b. Median
c. Standard deviation
d. Correlation
12. Measures of central tendency are to location as measures of dispersion are to:
a. Points
b. Spread
c. Average
d. Frequencies
13. If the mode and median of a distribution of scores are equal, the mean will also
have to be equal to the median.
T F
14. When establishing national norms, size of the norm group is a major concern.
T F
15. Generally, the larger the numerical value of the median, the larger the value of the
standard deviation.
T F
-6-
COMPARING SCORES TO NORM GROUPS
When comparing an individual's score to the scores of the norm group, the point is to
determine where the individual's score falls in the norm group distribution.
Percentiles indicate the percentage of students in the norm group who are at or below a
particular score.
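As a sketch with a hypothetical norm group, a percentile rank under this at-or-below definition can be computed as:

```python
norm_group = [55, 60, 62, 68, 70, 74, 78, 81, 85, 92]  # hypothetical scores

def percentile_rank(score, scores):
    # Percentage of the norm group scoring at or below the given score
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

percentile_rank(74, norm_group)  # 6 of 10 scores are at or below 74 -> 60.0
```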
The standard normal distribution has a mean of 0 and a standard deviation of 1.0. The
area in the Appendix 4 table is given from the mean to the z-score and is the
proportion of the total area.
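The tabled area from the mean to a z-score can be reproduced, as a sketch in place of the Appendix 4 table, with Python's built-in normal distribution:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # mean 0, standard deviation 1

z = 1.0
# Proportion of the total area between the mean (0) and z, as in the table
area = std_normal.cdf(z) - 0.5
round(area, 4)  # about 0.3413
```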
Standard score and transformed standard scores express the relative position of a
score in a distribution in terms of standard deviation units from the mean.
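For example, a z-score and the common T-score transformation (mean 50, standard deviation 10) can be sketched as follows; the raw score, mean, and standard deviation are hypothetical:

```python
def z_score(raw, mean, sd):
    # Standard score: distance from the mean in standard-deviation units
    return (raw - mean) / sd

def t_score(z):
    # Transformed standard score with mean 50 and standard deviation 10
    return 50 + 10 * z

z = z_score(raw=65, mean=50, sd=10)  # 1.5 standard deviations above the mean
t_score(z)                           # 65.0
```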
Stanines provide equal units of measurement. There are nine stanine scores, and the
name comes from standard nine. Each stanine contains a band of scores, each band
equal to one-half standard deviation in width.
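One common conversion from z-scores to stanines, consistent with the half-standard-deviation bands described above, can be sketched as:

```python
def stanine(z):
    # Stanine 5 straddles the mean; each step is half a standard deviation.
    # Results are clipped to the 1-9 scale at the extremes.
    return max(1, min(9, round(2 * z + 5)))

stanine(0.0)   # 5, at the mean
stanine(1.1)   # 7
stanine(-3.0)  # 1, clipped at the bottom of the scale
```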
The NCE score is a normalized standard score with a mean of 50 and a standard
deviation of 21.06. Scores range from 1 through 99, and an equal unit is retained in the
scale.
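Given a z-score, the NCE transformation follows directly from the stated mean and standard deviation:

```python
def nce(z):
    # Normal curve equivalent: mean 50, standard deviation 21.06
    return 50 + 21.06 * z

nce(0.0)  # 50.0, the mean of the NCE scale
```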
Grade equivalent scores are intended to indicate the average level of performance for
students in each month of each grade. Unfortunately, grade equivalents do not form an
equal interval scale.
Key Term and Concepts
Percentile
Percentile rank
Standard score
Standard normal distribution
Transformed standard score
Stanines
Normalized T-score
Normal curve equivalent score
Grade equivalent score
Review Items
1. A student who scores at the 45th percentile on a test:
a. Answered 45 percent of the items correctly.
b. Is above average in performance.
c. Equaled or surpassed 45 percent of the other examinees.
d. Had at least 45 percent of the right answers.
2. Standard scores express an individual's position in the distribution of scores in
terms of:
a. Standards of performance.
b. Standard deviations from the mean.
c. Standard deviation from the maximum possible score.
d. Deviation from a standard of performance.
3. A test score that is at the 42nd percentile could also be said to be at which stanine?
a. 3rd
b. 4th
c. 5th
d. 6th
4. A z-score of 1.5 would have what value if it were converted to a t-score?
a. 120
b. 50
c. 35
d. 65
5. Which of these cannot be meaningfully averaged because the scores are ordinal
rather than interval?
a. Percentiles
b. Stanines
c. Standard scores
d. None of the above
6. A student receives a z-score of +1.25 on an exam. This means the student's
performance is:
a. Below the mean performance of the group
b. One-quarter of a standard deviation above the mean performance of the group
c. At the average for the group
d. Around the 89th percentile of the group
7. The standard normal distribution has:
a. a mean of 0 and a standard deviation of 1
b. a mean of 50 and a standard deviation of 1
c. a mean of 0 and a standard deviation of 10
d. a mean of 50 and a standard deviation of 10
8. A student's t-score in a distribution of transformed standard scores is 40. This
student's performance is:
a. Above average
b. At the 40th percentile
c. At the 7th stanine
d. Below average
9. Stanines divide a distribution into nine parts so that each part:
a. Contains about 11 percent of the scores
b. Is one-half standard deviation wide
c. Represent 10 percentile ranks
d. Contains the mode of the distribution
10. In a normal distribution, which of the following indicates the highest relative position
in the distribution of scores?
a. Z = 1.5
b. Percentile rank = 90
c. T = 65
d. Stanine = 8
11. Joe has a stanine score of 6 on an exam. His performance is:
a. Below the 6th percentile
b. Between the 60th and 77th percentile
c. At the 50th percentile
d. Above the 80th percentile
12. Normal curve equivalent scores (NCEs) range from:
a. -3.00 to +3.00
b. 1 to 9
c. 30 to 80
d. 1 to 99
13. Percentiles are more of an ordinal scale than an equal interval scale.
T F
14. When scores are converted to percentiles, a specified gain in achievement will result
in a larger increase in percentile rank if the gain is near the high end of the
distribution than near the middle of the distribution.
T F
15. Grade equivalent scores are on an equal interval scale.
T F
-7-
ITEM STATISTICS FOR NORM-REFERENCED
TESTS
This chapter introduced the concepts of analyzing individual items of norm-referenced
tests, that is, how well the items are performing in the total test. The correlation
coefficient was introduced as a descriptive statistic, one that can be used to indicate
the direction and strength of the relationship between two variables. In testing
applications, the variables are often scores on individual items, scores on tests or other
measuring instruments, and scores on external criteria that we try to predict, such as
future grade-point average. The correlation coefficient will also be used in future
chapters to develop the concepts of validity and reliability.
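The Pearson product-moment coefficient can be computed from deviation scores; the paired scores below are hypothetical:

```python
def pearson_r(x, y):
    # Correlation: covariance of x and y divided by the product of their
    # standard deviations (computed here from sums of squared deviations)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

item_scores = [0, 0, 1, 1, 1]        # hypothetical 0/1 scores on one item
total_scores = [12, 15, 18, 20, 25]  # hypothetical total test scores
r = pearson_r(item_scores, total_scores)  # positive: item tracks the total score
```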
Two item statistics, difficulty and discrimination, were introduced and shown to be
useful in evaluating the performance of individual items on norm-referenced tests. The
difficulty index indicates the percentage of persons who answered an item correctly,
whereas the discrimination index shows how well the item separates those who had
high and low scores on the total test. The discrimination index is based on the
correlation between scores on an individual item and those on the total test.
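As a sketch with hypothetical item data, the difficulty index and an upper-lower form of the discrimination index (a common shortcut for the item-total correlation described above) work out as follows:

```python
# 1 = correct, 0 = incorrect on one item (hypothetical data)
upper_group = [1, 1, 1, 0, 1]  # the five students with the highest total scores
lower_group = [0, 1, 0, 0, 1]  # the five students with the lowest total scores

p_upper = sum(upper_group) / len(upper_group)  # proportion correct, top group
p_lower = sum(lower_group) / len(lower_group)  # proportion correct, bottom group

# Difficulty index: proportion of all ten students answering correctly
difficulty = (sum(upper_group) + sum(lower_group)) / 10

# Discrimination index: how much better the high scorers did on this item
discrimination = p_upper - p_lower
```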
Constructing a perfect test is not likely, especially for the initial draft of the test, even
when we follow the guidelines for good test construction. Confusion, ambiguity, and
poorly constructed options may enter into an item. Students may perceive items
differently than intended by the teacher. Item analysis provides empirical data about
how individual items are performing in a real test situation. Item statistics do not reveal
specifically the deficiencies in the content of items, but they indicate when an item is
deficient. Checking the item difficulty index and the discrimination index may give some
clues as to what is wrong. A careful inspection of the item content and response
patterns of students is often quite revealing.
Key Terms and Concepts
Correlation coefficient
Scatterplot
Pearson product-moment coefficient
Coefficient of determination
Difficulty index
Discrimination index
Review Items
8/8/2019 Educational and Measurement Testing
1. To compute the correlation between attitude and achievement, one must have:
a. Achievement score from one group of people and attitude scores from another
group.
b. Achievement and attitude scores on the same group of people.
c. Achievement scores from two points in time and attitude scores from two points
in time.
d. The same tests given twice to the same group of people.
2. The correlation coefficient is a number that can range:
a. 0 to +1.00
b. -1.00 to +1.00
c. Minus infinity to plus infinity
d. 0 to 100
3. Which of the following indicates the greatest degree of correlation?
a. -.52
b. -.61
c. +.23
d. +.42
4. The variance of a distribution is a measure of:
a. Dispersion
b. Central tendency
c. Relationship
d. Location
5. Students who score high on an ability measure were found to be able to solve a
learning task much faster than students scoring low on the ability measure. If scores
on the ability measure and time to complete the learning task are correlated, we
would expect:
a. Zero correlation
b. A zero coefficient of determination
c. Positive correlation
d. Negative correlation
6. An exam is given to 40 students; 35 of the students respond correctly to an item. The
difficulty index for the item is close to:
a. 0
b. 1
c. .87
d. .40
7. Which difficulty index is indicative of the most difficult item?
a. .90
b. .50
c. .25
d. .12
8. The preferred difficulty index for items of norm-referenced tests is:
a. Close to 1
b. Close to 0
c. Close to .80
d. Close to .50
9. If an item has a high discrimination index, it means that scores on the item have:
a. No correlation with total test scores
b. High correlation with total test scores
c. Low correlation with total test scores
d. Negative correlation with total test scores
10. An item has a negative discrimination index. Thus, if the student responds correctly
to this item, for this student we would expect a:
a. Low total test score
b. High total test score
c. Total test score around the middle
d. Total test score of zero
11. Of the following, which provides information about the distribution of total test
score?
a. Difficulty index
b. Correlation coefficient
c. Discrimination index
d. Standard deviation
12. If we want to identify who is getting the test item correct, low-scorers or high-scorers,
we would check the difficulty index.
T F
13. An item has a discrimination index around .80. This means that high scorers on the test are
getting the item correct.
T F
14. An item has a difficulty index close to zero. This means that high scorers on the test
are getting the item correct.
T F
15. The ideal situation for a test is to have high difficulty levels and high discrimination
indices for the items.
T F
-8-
RELIABILITY OF NORM-REFERENCED TESTS
Reliability of measurement is consistency: consistency in measuring whatever the
instrument is measuring.
Stability reliability is consistency of measurement across time.
Test-retest, with the same test administered at different times, provides the estimate of
stability reliability. The reliability coefficient is the correlation between the
scores of the two test administrations.
Equivalence reliability is consistency of measurement across two parallel forms of a
test.
The split-half procedure divides the test into two parallel halves; the scores of the two
halves are then correlated. The reliability of the total test is then estimated using the
Spearman-Brown formula.
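A sketch of the split-half computation with hypothetical half-test scores; the step-up used here is the two-half special case of the Spearman-Brown formula, r_total = 2r / (1 + r):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x)) * sqrt(sum((b - my) ** 2 for b in y))
    return num / den

# Hypothetical scores on the two parallel halves (e.g., odd vs. even items).
half1 = [10, 9, 8, 6, 4]
half2 = [9, 9, 7, 5, 5]

r_half = pearson(half1, half2)        # reliability estimate for one half
r_total = 2 * r_half / (1 + r_half)   # Spearman-Brown step-up to full length
print(round(r_half, 2), round(r_total, 2))  # 0.93 0.97
```

Note that the stepped-up coefficient is always at least as large as the half-test correlation, reflecting the greater reliability of the longer test.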
The KR-20 formula gives an estimate of internal consistency reliability (r20), which, in
essence, is the mean of all possible split-half coefficients.
The r21 may be substituted for r20 if item difficulty levels are similar; r21 is computationally
easier, but it underestimates reliability if the items vary in difficulty.
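Both formulas can be sketched from a 0/1 item-response matrix. The responses below are hypothetical, and the population (n-divisor) variance of total scores is assumed:

```python
def kr20(matrix):
    """KR-20: (k/(k-1)) * (1 - sum of p*q over items / total-score variance)."""
    k, n = len(matrix[0]), len(matrix)
    totals = [sum(row) for row in matrix]
    mean = sum(totals) / n
    var = sum((t - mean) ** 2 for t in totals) / n  # population variance
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in matrix) / n  # difficulty of item j
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var)

def kr21(matrix):
    """KR-21: uses only the mean and variance of the total scores."""
    k, n = len(matrix[0]), len(matrix)
    totals = [sum(row) for row in matrix]
    mean = sum(totals) / n
    var = sum((t - mean) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * var))

# Hypothetical responses: 4 examinees by 4 items, scored 0/1.
responses = [
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 2))  # 0.76
print(round(kr21(responses), 2))  # 0.73
```

Because the items here vary in difficulty (.75 versus .50), KR-21 comes out lower than KR-20, as the summary states.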
The alpha coefficient provides an estimate of internal consistency reliability, based on
two or more parts of a test. If each item is considered a part, the alpha coefficient is equivalent to r20.
Test length affects reliability in such a way that, the longer the test, the greater the
reliability, assuming other factors remain constant.
The Spearman-Brown formula is used for estimating the reliability of a test of increased length. It
is applied when using the split-half procedure, since the total test is twice as long as the
individual halves.
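In its general form, the formula projects the reliability of a test whose length is multiplied by a factor n: r_new = n·r / (1 + (n - 1)·r). A quick sketch:

```python
def spearman_brown(r, n):
    """Projected reliability when test length is changed by factor n."""
    return n * r / (1 + (n - 1) * r)

# Doubling a test with reliability .70 (n = 2):
print(round(spearman_brown(0.70, 2), 2))    # 0.82
# Halving it instead (n = 0.5) lowers the estimate:
print(round(spearman_brown(0.70, 0.5), 2))  # 0.54
```

Setting n = 2 with the half-test correlation reproduces the split-half step-up described above.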
Difference scores tend to be less reliable than scores on individual tests. As the
correlation between the scores on the two tests increases, the reliability of the difference
scores decreases.
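One common expression of this relationship, assuming the two tests have equal score variances, is r_D = ((r_xx + r_yy)/2 - r_xy) / (1 - r_xy), where r_xx and r_yy are the two reliabilities and r_xy is the correlation between the tests. The coefficients below are hypothetical:

```python
def difference_score_reliability(r_xx, r_yy, r_xy):
    """Reliability of X - Y difference scores (equal-variance assumption)."""
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

# Two tests, each with reliability .80:
print(round(difference_score_reliability(0.80, 0.80, 0.50), 2))  # 0.6
print(round(difference_score_reliability(0.80, 0.80, 0.70), 2))  # 0.33
```

Raising the between-test correlation from .50 to .70 cuts the difference-score reliability nearly in half, illustrating the summary's point.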
An observed test score may be considered as consisting of two parts, the true component and the error component.
In considering the distribution of the observed, true, and error scores:

Xo = Xt + Xe and so² = st² + se²
Reliability is the proportion of the variance in the observed scores that is true or
nonerror variance.
The standard error of measurement is the standard deviation of the distribution of error
scores. As reliability increases, the standard error of measurement decreases.
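As a sketch, the standard error of measurement can be computed as SEM = s·sqrt(1 - r), with hypothetical values for the observed standard deviation and reliability:

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM: observed standard deviation times sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

s = standard_error_of_measurement(20, 0.84)
print(round(s, 2))  # 8.0
# A rough 68% band for repeated testing around an observed score of 75:
print(75 - s, "to", 75 + s)
```

With a higher reliability the band tightens, consistent with the inverse relationship stated above.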
We can use the concepts of reliability and standard error of measurement in making
inferences about how an individual's score would fluctuate on repeated use of the same
test. The distribution of scores would have a mean approaching the individual's true
score and a standard deviation equal to the standard error of measurement.
Increased range of performance of the students being tested tends to enhance
reliability.
Item similarity enhances reliability, and item difficulty affects reliability such that items of
moderate difficulty, around 50 percent correct responses per item, enhance reliability.
Key Terms and Concepts
Reliability
Reliability coefficient
Test-retest
Stability reliability
Parallel forms
Equivalence reliability
Internal consistency reliability
Spearman-Brown formula
Kuder-Richardson formula-20
Kuder-Richardson formula-21
Cronbach alpha
Difference score
Error variance
True variance
Standard error of measurement
Review Items
1. The reliability coefficient can take on values:
a. From 0 to +1.00, inclusive
b. From -1.00 to +1.00, inclusive
c. Of any positive number
d. From -1.00 to 0, inclusive
2. If a group of students was measured in September using a mathematics
achievement test and then tested again in October using the same test, the
correlation coefficient between the scores of the two test administrations would be
a measure of:
a. Stability reliability
b. Equivalence reliability
c. Internal consistency reliability
d. Both stability and equivalence reliability
3. Reliability estimates of a test:
a. May be based on content or logical analysis of the test
b. Require some correlation coefficient
c. Increase with repeated test usage
d. Are the same for all applications of the test
4. If a reliability estimate is based on a single administration of a test, the reliability of
interest is not:
a. Stability reliability
b. Split-half reliability
c. Equivalence reliability
d. Internal consistency reliability
5. On a given test, the observed standard deviation of the scores is 20, and the
reliability of the test is .84. The standard error of measurement is:
a. 8.00
b. 18.33
c. 3.20
d. 16.80
6. A test of 40 items has a reliability of .70. If the test is increased to 80 items, the
reliability will be:
a. 0.99
b. 0.54
c. 0.82
d. 0.90
7. If the reliability of a test is 1.0, the standard error of measurement is:
a. 1.0 also
b. Greater than 1.0
c. Undeterminable
d. 0
8. Identify the reliability estimation procedure appropriate for determining stability
reliability of a test:
a. Split-half
b. Kuder-Richardson formula-20
c. Parallel forms administered at the same time
d. Parallel forms administered at different times
9. A reading test is given to two groups of sixth-grade students: Group A consists of
high-ability (IQ 120 or greater) students; Group B consists of students of
heterogeneous ability (IQ range 90 to 150). The most likely reliability situation is:
a. Test reliability will be the same for both groups
b. Test reliability will be greater for Group A than Group B
c. Test reliability will be greater for Group B than Group A
d. No inference can be made about test reliability
10. In applying the split-half procedure for estimating reliability, the reliability coefficient
for one-half the test is computed. To estimate the reliability of the entire test, we use
the:
a. Kuder-Richardson formula-20
b. Kuder-Richardson formula-21
c. Spearman-Brown formula
d. Cronbach alpha procedure
11. The Kuder-Richardson 20 procedure (KR-20) is a procedure for estimating reliability
that provides:
a. An internal consistency coefficient
b. The mean of all possible split-half coefficients
c. Both a and b
d. Neither a nor b
12. A test of 100 items is divided into five subtests of 20 items each. If we are interested
in internal consistency reliability, the most appropriate procedure for estimating
reliability is:
a. Kuder-Richardson formula-20
b. Kuder-Richardson formula-21
c. Cronbach alpha
d. Parallel forms
13. A mathematics test is given to a class of gifted students and also to a regular
ungrouped class. The reliability of the test would likely be:
a. Greater for the gifted class
b. Greater for the ungrouped class
c. About the same for both classes
d. Unable to infer anything until the reliability coefficient is computed
14. The standard error of measurement is a measure of:
a. Location
b. Central tendency
c. Variability
d. Association
15. As the standard error of measurement increases, the reliability of a test:
a. Also increases
b. Decreases
c. Remains unchanged
d. May increase or decrease
16. Theoretically, with respect to variance, reliability can be considered the ratio of:
a. Observed variance to true variance
b. Error variance to observed variance
c. True variance to error variance
d. True variance to observed variance
17. Conceptually, the true component and the error component of a test score are such
that:
a. The greater the true component, the greater the error component
b. The greater the true component, the smaller the error component
c. The components are equal
d. The components are independent
18. In conceptualizing the distributions of observed, true, and error scores, the following
is true for the means:
a. The observed mean equals the true mean
b. The error mean equals zero
c. The observed mean equals the true mean plus the error mean
d. All of the above
19. Conceptually, the variances of the distributions of the observed, true, and error
scores are such that:
a. The variance of the error scores is zero
b. The error variance plus the true variance equals the observed variance
c. The observed variance is less than the true variance
d. The observed variance and the true variance are equal
20. A difference score is generated by subtracting a pretest score from a posttest score.
In order to obtain a high reliability for the difference score, we require:
a. Low correlation between pretest and posttest scores
b. High reliability for both pretest and posttest scores
c. Both a and b
d. Neither a nor b
-9-
VALIDITY OF NORM-REFERENCED TESTS
Validity is the degree to which a test measures what it is intended to measure.
Content validity is concerned with the extent to which the test is representative of a
defined body of content consisting of topics and processes.
Content validity is based on a logical analysis. It does not generate a validity coefficient,
as is obtained with some other types of validity.
Standardized achievement tests tend to have broad content coverage so they will have
wide application. However, when used in a specific situation, the content validity of a
prospective test should always be considered.
Criterion validity is based on the correlation between scores on the test and scores on a
criterion. The correlation coefficient is the criterion validity coefficient.
Concurrent validity is involved if the scores on the criterion are obtained at the same
time as the test scores. Predictive validity is involved if the scores on the criterion are
obtained after an intervening period from those of the test.
Concurrent validity applies if it is desirable to substitute a shorter test for a longer one.
In that case, the score on the longer test is the criterion, and validity is that of the
shorter test.
The construct validity of a measure or test is the extent to which scores can be
interpreted in terms of specified traits or constructs.
Factor analysis is a procedure for analyzing a set of correlation coefficients between
measures; the procedure analytically identifies the number and nature of the constructs
underlying the measures. Different types of factors are general, group, and specific
factors.
For tests validated through correlation with a criterion measure, validity can be expressed
as the proportion of the observed test variance that is common variance with the
criterion. The validity coefficient is the square root of this proportion or ratio.
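A small worked sketch of that ratio, using hypothetical variances:

```python
from math import sqrt

observed_var = 100.0  # hypothetical observed test variance
common_var = 36.0     # hypothetical variance shared with the criterion

# Validity coefficient = square root of (common variance / observed variance).
validity = sqrt(common_var / observed_var)
print(round(validity, 2))  # 0.6
```

Equivalently, squaring a validity coefficient of .6 shows that 36 percent of the observed test variance is shared with the criterion.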
A test cannot be valid (either conceptually or practically) if it is not reliable; however, a
reliable test could lack validity. Thus, reliability is a necessary but not sufficient condition
for test validity.
A well-constructed test with items of proper difficulty level will enhance validity. Validity
tends to increase with test length. Low item intercorrelations may tend to enhance
criterion validity if we have a complex criterion.
Increased heterogeneity of the group measured tends to enhance validity. Subtle,
individual factors may also affect validity. Tests should be properly administered, since
any procedures that impede performance also lower validity.
Key Terms and Concepts
Validity
Content validity
Criterion validity
Validity coefficient
Concurrent validity
Predictive validity
Construct validity
Factor analysis
Factor loading
General factor
Group factor
Specific factor
Covariation
Review Items
1. Which of the following types of validity does not yield a validity coefficient?
a. Predictive
b. Concurrent
c. Content
d. Criterion
2. When considering the terms reliability and validity, as applied to a test, we can say:
a. A valid test ensures some degree of reliability
b. A reliable test ensures some degree of validity
c. Both a and b
d. Neither a nor b
3. If a test is representative of the skills and topics covered by a specific unit of
instruction, the test has:
a. Construct validity
b. Concurrent validity
c. Predictive validity
d. Content validity
8. Which characteristic is true of criterion validity?
a. It is based on a logical correspondence between two tests
b. It includes two types, concurrent and predictive validity
c. It is based on two administrations of the same test
d. All of the above
9. A school system uses a test considered to be valid for measuring student
achievement, but the test requires three hours of administration time. The principals
and teachers are considering substituting a shorter test for the longer one. The
validity of concern here is:
a. Concurrent
b. Content
c. Construct
d. Predictive
10. The testing division of a school system is attempting to analyze the traits that are
inherent in the six subscores of an academic achievement test. The validity of
concern here is:
a. Concurrent
b. Content
c. Construct
d. Predictive
11. Construct validity is established through:
a. Logical analysis
b. Statistical analysis
c. Both logical and statistical analysis
d. Neither logical nor statistical analysis
12. Factor analysis is a procedure often used in establishing construct validity of a set
of tests. In the analysis, the factor loadings that are computed are correlation
coefficients between:
a. Scores on two or more tests of the set
b. Factors and test scores
c. Two or more factor scores
d. None of the above
13. A factor analysis is conducted on the scores from six different IQ tests. One of the
factors has a large loading with a single IQ test and very small loadings with the
other five tests. This is a:
a. General factor
b. Specific factor
c. Group factor
d. None of the above
14. If a validity coefficient is computed for a test, and the test has been used with a very
homogeneous group of students, we expect that the validity coefficient will be:
a. Moderate, around .55
b. High
c. Low
d. Unable to make an inference
15. Which of the following is least like the others?
a. Criterion validity
b. Construct validity
c. Concurrent validity
d. Predictive validity
16. A test is found to have high reliability but low validity. In order for this to occur, the test
has:
a. Little true variance
b. Large error variance
c. Large specific variance
d. Little observed variance
17. When using criterion measures for establishing validity, a validity coefficient is
computed. Theoretically, in terms of variance, the validity coefficient is the square
root of the ratio of:
a. Variance common with the criterion to observed variance
b. Observed variance to variance common with the criterion
c. True variance in the criterion to observed variance
d. True variance in the criterion to true variance in the test being validated
18. In order to enhance validity, given a criterion consisting of several abilities, we
would want a test with low item intercorrelations.
T F
19. Predictive validity of a test is increased as the groups tested become more
homogeneous.
T F
20. Construct validity refers to the adequacy of item construction for a test.
T F
-10-
CRITERION-REFERENCED TESTS
A criterion-referenced test score indicates the level of performance on a well-specified
domain of content.
When the test items are not representative of a well-specified domain we cannot
generalize our results beyond the specific items on the test.
Item forms contain enough detail about how the items should be constructed so that
they represent a well-specified domain.
Instructional objectives are usually too terse to provide an adequate description of a
domain. Objectives and test specifications are needed before criterion-referenced tests
are appropriate.
Teachers can construct criterion-referenced tests through the use of objectives and item
specifications.
A standard of minimal acceptable performance is required whenever a decision about
mastery is to be made. There are several methods for setting such standards, none of
them perfect, that can be used with criterion-referenced and other kinds of tests.
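As one illustration, the Angoff method can be sketched as follows: each judge estimates the probability that a minimally competent examinee would answer each item correctly, and the cut score is the sum of the per-item mean estimates. All ratings below are hypothetical:

```python
# Rows = judges, columns = items. Each entry is a judge's estimate of the
# probability that a minimally competent examinee answers the item correctly.
ratings = [
    [0.9, 0.7, 0.6, 0.8],
    [0.8, 0.6, 0.5, 0.9],
    [0.9, 0.8, 0.6, 0.7],
]

n_judges = len(ratings)
n_items = len(ratings[0])

# Cut score = expected raw score of the minimally competent examinee.
cut_score = sum(
    sum(judge[j] for judge in ratings) / n_judges for j in range(n_items)
)
print(round(cut_score, 2))  # 2.93 out of 4 items (about 73 percent)
```

The Nedelsky and contrasting groups methods differ in how the judgments are collected, but each likewise converts expert judgment into a numeric cut score.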
Key Terms and Concepts
Norm-referenced
Criterion-referenced
Domain
Item form
Objective
Test specification
Stimulus attributes
Response attributes
Mastery decision
Standard setting
Review Items
1. Item forms are seldom used by classroom teachers because the forms:
a. Lack validity
b. Are complex and unwieldy
c. Require extensive pilot testing
d. Are appropriate only for standardized tests
2. Item forms refer to:
a. Item-writing rules
b. Response types (e.g., true-false, multiple-choice)
c. Parallel forms of items for reliability
d. Patterns of responses to sets of items
3. Which of the following is most critical in a criterion-referenced test?
a. A prespecified standard of performance
b. Objectively scored items
c. Specific behavioral objectives
d. A well-specified domain
4. A major strength of criterion-referenced testing is the ability to:
a. Generalize the results to a large set of items
b. Compare individuals in terms of relative standing
c. Establish objective performance criteria
d. Measure difficult to define constructs
5. A poorly defined domain results in items that are:
a. Too difficult
b. Dissimilar
c. Ambiguous to the examinees
d. Unreliable
6. Teachers often prefer criterion-referenced tests to norm-referenced tests because:
a. They are very concerned with which student is best in the class
b. They need to compare the learning in their class to that of other classes
c. Of the specific, discrete knowledge or skills that are assessed rather than global
constructs
d. They are usually easier to construct
7. The test construction concept that is more detailed than instructional objectives but
less cumbersome than item forms is (are):
a. Item calibrations
b. Test blueprints
c. Item objectives
d. Test specifications
8. When a test has a preset standard of minimum acceptable performance, it is a
criterion-referenced test.
a. Always true
b. Always false
c. Sometimes true
9. The method of setting standards that is most likely to be used in a classroom setting
is the:
a. Professional judgment method
b. Nedelsky method
c. Angoff method
d. Contrasting groups method
10. A panel of qualified experts is not used in which of the following methods of setting
standards?
a. Professional judgment
b. Nedelsky
c. Angoff
d. Contrasting groups method
11. Teachers often prefer criterion-referenced measures to norm-referenced measures
because criterion-referenced measures:
a. Are more reliable
b. Are less intimidating to students
c. Indicate what the student can do
d. Indicate who in class has done the best
12. Critics of criterion-referenced tests are correct when they characterize standard-
setting procedures as:
a. Vague
b. Subjective
c. Inconsistent
d. Sophisticated
13. Most likely, tests that are linked to brief, specific instructional objectives are good
examples of criterion-referenced tests.
T F
14. The item formats (e.g., multiple-choice, essay, etc.) should be different for norm-
referenced tests than for criterion-referenced tests.
T F
15. The panel of experts that is used in some standard-setting methods in school
settings consists of:
a. Academically talented students
b. Classroom teachers
c. Parent volunteers
d. Students from higher grades
-11-
ITEM STATISTICS FOR CRITERION-REFERENCED TESTS
A test score is determined by the performance of the student on each of the items on
the test. In order to understand the test score, it is essential that we understand how
each item contributes to that score. The quality of the test depends on the quality of the
items that comprise it. The procedures that are described in this chapter are ways to look at
the quality of the test items.
Items should be subjected to a content review before the test is given. Experts and
colleagues can help us by reviewing the test items for their match with the domain
specifications or the objectives, for any potentially biased wording, and for any
observable flaws in the items' construction.
After the tests have been administered and scored, there should be a review of the
kinds of errors that were made so that remediation can focus on these errors. Statistical
analysis should be done so that we have evidence about the difficulty levels of the items
and about the degree to which the items are discriminating between masters and non-
masters or between students before and after instruction.
The difficulty levels of test items often turn out to be quite different from what the
teacher expected. Difficulty levels are a clear measure of how the students performed on
a specific task, the test item. As such, they provide very useful information to the
teacher.
The discrimination index provides information that is directly related to the purpose of
the test. A discrimination index can be seen as an analogy to the sport of rowing. If all of
the items on a test are likened to the crew members, we see that things work best when
they are all pulling together. This is the case when all of the discrimination indexes are
positive. If one of the crew lifts his or her oars out of the water and does nothing, it is
like an item with a zero discrimination index. A negative discrimination index would be
the situation of a crew member (item) rowing in the opposite direction from the rest of
the crew. Clearly this latter case requires some corrective action.
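One concrete way to compute such an index for a criterion-referenced item is the pre-to-post difference in p values; the 0/1 responses below are hypothetical, and a masters-versus-non-masters difference could be computed the same way:

```python
def pre_post_discrimination(pre_item, post_item):
    """Item p value after instruction minus p value before instruction."""
    p_pre = sum(pre_item) / len(pre_item)
    p_post = sum(post_item) / len(post_item)
    return p_post - p_pre

pre = [0, 0, 1, 0, 0]   # 1 of 5 students correct before instruction
post = [1, 1, 1, 0, 1]  # 4 of 5 students correct after instruction
d = pre_post_discrimination(pre, post)
print(round(d, 2))  # 0.6, an item pulling with the crew
```

A value near zero would correspond to the idle oars, and a negative value to rowing against the crew.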
We cannot merely assume that we create high-quality test items. We need to subject
those items to item analysis in order to convince ourselves and others that the item
analysis is time well spent. The information that is provided will help us to understand
the quality of our tests so that we can base decisions on those test scores with
confidence.
Key Terms and Concepts
Item analysis
Content review
Pilot testing
Error patterns
Pre- and post-discrimination index
Mastery/non-mastery
Item discrimination index
Item difficulty index
Review Items
1. An item with a p value near 1.0 is quite:
a. Easy
b. Difficult
c. Discriminating
d. Reliable
2. Analysis of test results at the item level is useful for:
a. Decisions about individual students
b. Decisions about instruction
c. Decisions about the test items
d. All of the above
3. Panels with diverse backgrounds can be used to examine test items for:
a. Item bias
b. Difficulty
c. Discrimination
d. Continuity
4. It is critically important in criterion-referenced tests that the test items:
a. Are difficult when used on a pretest
b. Match the domain or objective
c. Discriminate between competent and less competent students
d. Not be difficult
5. The difficulty index refers to:
a. values of student ratings of whether an item was easy or difficult
b. the percentage of examinees who answered an item correctly
c. teacher judgment of how well students are likely to do on an item
d. the likelihood of guessing the correct answer to a test item
6. Pre- and post-discrimination refers to whether a test item:
a. Is easier on the posttest than on the pretest
b. Adequately discriminates pretest performance from posttest performance
c. Discriminates unfairly against certain ethnic groups
d. Would be better placed on a pretest than on a posttest
7. If 30 students are tested and 20 answer item 4 correctly, the difficulty index for
item 4 would be:
a. .10
b. -.10
c. .33
d. .67
8. The item statistic that would indicate the most serious concern would be:
a. Difficulty equal to .85
b. Difficulty equal to .05
c. Discrimination equal to -.50
d. Discrimination equal to .00
9. Items that match a well-specified domain should have difficulty levels that:
a. Are exactly equal
b. Are very similar
c. Range from 0 to 1
d. Match the domain specification
10. The higher the value of the difficulty index, the:
a. Easier the item
b. More discriminating the item
c. Lower the percentage correct on the item
d. More biased the item
11. If an item has a positive discrimination index:
a. The item should be revised
b. The item is biased
c. The item appears to be effective
d. The test will not be valid
12. Item analysis is:
a. A content analysis
b. A statistical analysis
c. Both a and b
13. When can statistical item analysis be done?
a. Before the test is given
b. While the test is being given
c. After the test is given
d. Both b and c
14. Other teachers would be most needed when determining:
a. Item difficulty
b. Item discrimination
c. Item reliability
d. Item bias
15. A test item that is positively discriminating for third-graders would be positively
discriminating for second-graders.
a. Definitely true
b. Possibly true
c. Definitely false
-12-
RELIABILITY OF CRITERION-REFERENCED TESTS

A test is reliable if it provides consistent information about examinees. This can mean
that a criterion-referenced test provides consistent estimates of performance on a
domain or that the test provides consistent placement of an examinee in a mastery or
non-mastery category. Different kinds of reliability evidence are needed for each of
these uses of criterion-referenced tests.
Whether a test is consistent relative to mastery decisions is shown by giving the test on
two occasions to the same group of examinees and finding the percentage of
examinees whose mastery/non-mastery classifications were the same on the two
test occasions. This procedure could also be used when a parallel form of the test is
given on the second testing. A reliable test would have a high percentage of examinees
with the same mastery/non-mastery classification on the two tests.
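A sketch of that percentage-agreement computation, using hypothetical scores and a hypothetical cut score of 70:

```python
def decision_consistency(scores1, scores2, cut):
    """Proportion of examinees classified the same way on both occasions."""
    same = sum((a >= cut) == (b >= cut) for a, b in zip(scores1, scores2))
    return same / len(scores1)

# Hypothetical scores for ten examinees on two administrations.
test1 = [85, 78, 90, 55, 60, 82, 40, 75, 88, 52]
test2 = [80, 74, 92, 58, 72, 85, 45, 68, 90, 49]
print(decision_consistency(test1, test2, cut=70))  # 0.8
```

Here eight of the ten examinees fall on the same side of the cut score both times, so the decision consistency is .80.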
When performance on a domain is to be estimated from the test scores, the standard
error of measurement can be used to form an interval estimate. An interval estimate
suggests the degree of imprecision that is in our test scores. The standard error of
measurement gives us an idea about how much we can expect test scores to fluctuate
across repeated testing.
The reliability of the test can be increased by careful attention to the test items, the test
setting, and the examinees. A reliable test would have items that are homogeneous.
The more similar the items are, the more consistent will be students' approach to those
items. The items should be free of flaws or vagueness of wording so that
inconsistencies are reduced. And, because there is a direct relationship between
the length of the test and the reliability of the test, there should be a sufficient number of
items.
Inconsistencies in student performance can be reduced by making sure that the testing
conditions are appropriate. There should be adequate light and quiet so that the student
can concentrate on the task. Interruptions or distractions should be eliminated, and the
test items and directions about how to answer them should be clear.
Reliable scores depend on the students being motivated to apply themselves to the
task. This is promoted when the teacher encourages the students to do well and
explains how the test scores will be used. The teacher should be alert for individual
student problems such as fatigue or anxiety that might be affecting the reliability of the
test scores.
5. Which one of the following values computed for the reliability of a test would indicate that
the test is totally unreliable?
a. .10
b. .00
c. .50
d. 1.00
6. Exactly 100 students took a criterion-referenced test twice. The test had a mastery
cut-off score; 70 students were above the cut-off score on both tests and 15
students were below the cut-off score on both tests. The reliability of the test for
mastery decisions would be:
a. .40
b. .55
c. .70
d. .85
7. When estimating the reliability of a domain score, it is appropriate to use the:
a. Standard error of measurement
b. Average score for the class
c. Range of possible domain scores
d. Measurement error coefficient
8. Other things being equal, the longer a test is, the ____________ will be its reliability.
a. Higher
b. Lower
c. Less ambiguous
d. More valid
9. The reliability coefficients that were developed for norm-referenced tests can also be
used effectively with criterion-referenced tests.
T F
10. The same criterion-referenced test was given to 30 children on consecutive days.
Ten of the children who surpassed the mastery cut-off score the first day failed to
do so on the second day. The test could be said to be:
a. Unfair
b. Biased
c. Unreliable
d. Discriminating
11. A test is either reliable or it isn't.
T F
12. Test reliability is primarily determined by the test itself. The test setting and the
examinee have a minimal impact on test reliability.
T F
13. Which of the following is most related to high criterion-referenced reliability?
a. Item difficulty near .50
b. Item discrimination near .50
c. Short tests
d. A wide range of item types
14. When estimating a domain score, the reliability would increase if:
a. The items were more difficult
b. The items were somewhat dissimilar
c. The test had a cut-off score for mastery decisions
d. The test was longer
15. The longer the test, the smaller the ___________.
a. Time between pre- and posttesting
b. Standard error of measurement
c. Difficulty index
d. Reliability discrepancy
-13-
VALIDITY OF CRITERION-REFERENCED TESTS
A test that adequately serves the purpose for which it is used is considered to be a valid
test. Validity is always defined in terms of the purpose for which the test scores will be
used. Validity is a matter of degree. One test may be more valid than another but tests
are not usually totally lacking in validity and they are never perfectly valid.
Because criterion-referenced tests are used for several different purposes, including
estimating performance on a domain and determining whether students have achieved
mastery, it is not surprising that different kinds of logical and statistical evidence should
be presented to support the validity claims. The three kinds of test validity that were
introduced are content validity, criterion validity, and construct validity.
Content validity is a determination of the extent that the test items match the domain
specifications or objectives. Validity is established by having qualified persons, a panel
of experts, review the test items for appropriateness and congruence with the domain.
Criterion validity is concerned with whether the test would be an adequate predictor of
performance on some other variable. Validity evidence is established by finding the
correlation coefficient that links the test with the criterion that is to be predicted. The
choice between two competing tests would be based on which test has the higher
correlation with the criterion. When we are concerned about mastery decisions on two
measures, the degree of validity is shown by the percentage of persons for whom the
mastery/non-mastery decision is consistent.
Construct validity is shown by making predictions about the test scores and then
conducting analyses to see whether the predictions are confirmed. Some of the reasonable predictions are: (1) the test scores should be positively correlated with other
measures of the same thing, (2) groups that are known to differ on the domain should
have test scores that are significantly different, and (3) we should not find different
patterns of responses across distractors for persons of different races, grades, or other
characteristics.
We cannot merely assume that our tests are valid. We need to conduct careful analyses to show
that our tests have sufficient content, criterion, or construct validity so that we can justify
the use of the tests.
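The two statistical indices described above for criterion validity can be sketched in a few lines of code. The score lists and the cut-off of 70 below are invented for illustration, and `pearson_r` and `decision_consistency` are hypothetical helper names, not functions from this text:

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def decision_consistency(x, y, cutoff):
    """Proportion of persons classified the same way (mastery/non-mastery)
    on both measures."""
    same = sum((a >= cutoff) == (b >= cutoff) for a, b in zip(x, y))
    return same / len(x)

# Hypothetical scores on the test and on the criterion measure
test = [82, 75, 68, 90, 55, 71, 88, 60]
criterion = [85, 70, 65, 92, 50, 74, 86, 58]

print(pearson_r(test, criterion))
print(decision_consistency(test, criterion, cutoff=70))
```

Note that the same decision-consistency index appears earlier in the chapter on reliability: agreement of mastery classifications across two administrations of the same test is reliability evidence, while agreement between a test and a separate criterion measure, as here, is validity evidence.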
Key Terms and Concepts
Validity Construct validity
Content validity Logical analysis
Criterion validity Statistical analysis
Correlation validity Distractor analysis
Review Items
1. If a test is valid it is certainly also reliable.
T F
2. Which of the following is not a validity that is described in the technical standards for
test publishers?
a. Criterion-referenced validity
b. Content validity
c. Construct validity
d. Criterion validity
3. Whether the items on a test match the domain of the criterion-referenced test is
primarily a concern about:
a. Cut-off score validity
b. Item validity
c. Content validity
d. Criterion validity
4. Essentially the same processes are used to establish the validity of criterion-
referenced tests as are used with norm-referenced tests.
T F
5. A well-specified domain for a criterion-referenced test should enhance the test's:
a. Reliability
b. Content validity
c. Discrimination
d. Criterion validity
6. A panel of experts will sometimes be used to rate items in order to promote:
a. Content validity
b. Test sales
c. Construct validity
d. User validity
7. If students who surpass the mastery cut-off score for the addition of three-digit
numbers also tend to be those students who achieve mastery on a related criterion
measure, this is evidence of:
a. Content validity
b. Criterion validity
c. Convergent validity
d. Mathematical validity
8. The correlation coefficient is a statistical way of expressing:
a. The standard error of measurement
b. Mathematical validity
c. Content validity
d. Criterion validity
9. Which validity requires both a logical process and a statistical process?
a. Content validity
b. Convergent validity
c. Construct validity
d. Criterion validity
10. If items on a criterion-referenced test do not match a well-defined domain, the test
lacks adequate:
a. Construct validity
b. Content validity
c. Criterion-referenced validity
d. Criterion validity
11. If two writers, working from the same test specifications, created test items that
were quite different from each other, the test would have inadequate:
a. Criterion validity
b. Item validity
c. Specification validity
d. Content validity
12. In order to know whether a test is valid, it is most important to know:
a. The purpose for which the test scores will be used
b. A description of the persons who will take the test
c. An estimate of the reliability of the test
d. Whether the test has ever been used before
13. When a test does not achieve the purpose for which it was designed, the test lacks:
a. Validity
b. Reliability
c. Purposefulness
d. Discrimination
14. Lengthening a test will make it more valid.
a. True, if it is somewhat valid to begin with
b. False, test length affects reliability, not validity
c. True, but only for older students
d. False, validity is related to purpose rather than length
15. The primary validity for most criterion-referenced tests is:
a. Construct validity
b. Criterion validity
c. Content validity
d. Criterion-referenced validity
Key Terms and Concepts
Test wiseness Separate answer sheets
Correction for guessing Testing arrangement
Positional preference Take-home exam
Bluffing Oral exam
Test anxiety
Review Items
1. Programs for teaching test-taking skills tend to be:
a. Equally effective throughout grades 1-8
b. More effective with lower grades than upper elementary grades
c. More effective with upper grades than lower elementary grades
d. Of no effect throughout grades 1-8
2. A student who guesses on every test item will have the highest score on which kind
of test? (Assume the tests have equal length)
a. True-false
b. Multiple-choice (four-option)
c. Multiple-choice (five-option)
d. Fill-in-the-blank
3. If a correction rather than a penalty for guessing is used on a multiple-choice test,
students should be urged to:
a. Guess when they are unsure of an answer
b. Not guess when they are unsure of an answer
c. Guess only on items for which they have some partial knowledge
4. What is the expected score of a student who guesses on all items of a 50-item
multiple-choice test with five options per item?
a. 0
b. 5
c. 10
d. 15
5. The relationship between test anxiety and test performance is generally:
a. Strong and positive
b. Strong and negative
c. Weak and positive
d. Weak and negative
6. A limit is set on the length of response (in words) to an item. This is an attempt to
limit the effect of:
a. Guessing
b. Bluffing
c. Positional preference
d. Changing answers
7. Which of the following is least related to the others? The effect of:
a. Test anxiety
b. Bluffing
c. Penmanship and spelling
d. Positional preference
8. A major problem with oral examinations is that they tend to be:
a. Very time consuming
b. Very anxiety producing
c. Unreliable
d. Formal rather than informal
9. Students should be shown how to take tests so that the tests provide:
a. Enriched diagnostic information for future instructional planning
b. Information about the test setting, as well as the test content
c. Information about wrong answers, as well as right answers
d. A more accurate picture of what the student is able to do
10. Take-home tests are subject to the following problem:
a. Lack of control over the testing situation
b. They are not appropriate for evaluation purposes
c. Time spent on the test varies among students
d. All of the above
11. Generally, more test answers are changed from wrong to right than right to wrong.
T F
12. Applying the correction for guessing raises the score on a test.
T F
13. Good penmanship and spelling tend to be positively correlated with the grades
assigned to essay responses.
T F
14. The physical arrangement of the testing situation is as important for enhancing
student performance as is establishing control and rapport.
T F
15. Separate answer sheets can be used effectively for students beginning with those in
second grade.
T F
-15-
THE USE OF STANDARDIZED ACHIEVEMENT
TESTS
Standardized achievement tests are widely used in our schools. There are many tests
on the market, available in a variety of forms, including norm-referenced and criterion-
referenced tests.
Test results can be reported for individual students, classrooms, or even school
buildings. In addition, local and national norms are available for many tests.
Publishers of standardized achievement tests are guided by the Standards for Educational
and Psychological Testing (1985). These guidelines can also be used by consumers to
evaluate the test information that publishers provide.
Teachers play an important role in standardized achievement testing. They need to take
this role seriously and make sure that the physical and psychological settings promote a
positive testing environment.
Despite their popular use, standardized achievement tests do have certain limitations.
Some of these deal with the time required for the test administration and the processing
of test scores. Other limitations concern the usefulness and accuracy of the reported
scores. Some people worry that the average performance has become a standard of
performance and that achievement tests may have too much influence over school
curricula.
High-quality standardized achievement tests are available; they do the job that they
were designed to do. However, when tests are used for other purposes, their
effectiveness will be limited. Therefore, those who select standardized achievement
tests must do a careful and complete job of comparing alternatives.
The major determinant of the appropriateness of a standardized achievement test is the
match between the items and what was actually taught in the schools. Technical
adequacy and cost are also factors.
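The item-to-curriculum match described above can be expressed as a simple proportion of test items whose objective was actually taught. A minimal sketch, in which the objective codes and the `taught` set are invented for illustration:

```python
# Hypothetical objective codes: the objective each test item measures,
# and the objectives the school actually taught.
item_objectives = ["add-2digit", "add-3digit", "subtract", "multiply",
                   "divide", "fractions", "decimals", "percent"]
taught = {"add-2digit", "add-3digit", "subtract", "multiply", "divide"}

# Relevance index: proportion of items that match the taught curriculum
matched = [obj for obj in item_objectives if obj in taught]
relevance = len(matched) / len(item_objectives)
print(f"{len(matched)} of {len(item_objectives)} items match the curriculum")
```

A test with a low proportion here would lack relevance for that school, no matter how strong its technical adequacy.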
Standardized achievement tests are perhaps our best example of high-quality testing in
education. Careful item preparation, extensive reliability and validity studies, well-designed
norm groups, and clear reporting formats combine to make standardized achievement
tests useful, accurate measures of student performance.
In the future we can expect to see efficient, computerized achievement testing. However, the types of standardized testing programs that we know now are likely to be
maintained in the 1990s.
Key Terms and Concepts
Standardized achievement test
Standards for Educational and Psychological Testing
Relevance
Technical adequacy
Usability
Computerized adaptive testing
Review Items
1. What is standardized on a standardized achievement test?
a. The anticipated level of performance
b. The conditions for test administration
c. The test validity
d. The purpose for which the test is given
2. Which of the following uses of standardized achievement test scores is closest to
the main purpose for which such tests are constructed?
a. Assessing the achievement level of an individual student
b. Measuring school, class, and district wide achievement levels
c. Assessing whether students have adequate levels of achievement for promotion
to the next grade
d. Providing objective evidence about the teacher's competence
3. Standardized achievement tests are norm-referenced rather than criterion-
referenced tests.
T F
4. The Standards for Educational and Psychological Testing are:
a. Guidelines about the use of the test scores
b. Summaries of legal cases concerning the use of test scores
c. Examples, including case studies, of the misuse of test scores
d. The legal minimum requirements for corporations that sell tests
5. Which of the following factors should be the most important consideration when
selecting a standardized achievement test?
a. Reliability
b. Cost
c. Publisher's reputation
d. Relevance
6. Mr. Jones allows his students an extra 20 minutes on a standardized achievement
test. What is the major consequence of his actions?
a. The reliability coefficient will increase
b. The content validity will decrease
c. Norm-referenced interpretations of the score will not be meaningful
d. Criterion-referenced interpretations of the scores will not be meaningful
7. Most teachers find that the results of standardized achievement tests are helpful to
them when planning instruction for individual students.
T F
8. Most commercially available standardized achievement tests are so excellent that
they can form the sole basis for many decisions about a student's academic
progress.
T F
9. Achievement-at-grade-level is:
a. A meaningless term statistically
b. An expectation that we should have for all students
c. Average performance for students in that grade
d. An arbitrary assessment based on standardized achievement test scores
10. What is a form of testing in which the items that are presented to the student
depend on the student's answers to previous items?
a. Non-standardized testing
b. Response-dependent testing trials
c. Step-by-step testing
d. Computerized adaptive testing
11. If the items on a standardized achievement test did not match what was taught in a
particular school, the test would lack:
a. Technical adequacy
b. Relevance
c. Utility
d. Reliability
12. The verb that is frequently used in the Standards for Educational and Psychological
Testing and that shows the orientation of the Standards is:
a. Should
b. Must
c. Might
d. Shall
13. Some major standardized achievement tests interpret the same test performance in
both norm-referenced and criterion-referenced formats.
T F
14. Standardized achiev