Educational and Measurement Testing
8/8/2019 Educational and Measurement Testing
-1-
INTRODUCTION
National trends have pretty well been set for the 1990s. They indicate continued
emphasis on testing, at least as far as standardized tests are concerned, with expanded
uses of test results.
The local school continues to be the implementing agency for assessment programs,
typically those mandated at the state level but also influenced by national trends.
Technically, measurement is the assignment of numerals to objects or events according
to rules that give numerals quantitative meaning.
Measurements may differ in the amount of information the numbers contain. These
differences are distinguished by the terms nominal, ordinal, interval, and ratio scales of
measurements.
The four levels of measurement can be summarized as follows:
1. Nominal scales categorize but do not order.
2. Ordinal scales categorize and order.
3. Interval scales categorize, order, and establish an equal unit in the scale.
4. Ratio scales categorize, order, establish an equal unit, and contain a true zero
point.
Norm-referenced interpretation is a relative interpretation based on an individual's
position with respect to some group, often called the normative group. Norms consist of
the scores, usually in some form of descriptive statistics, of the normative group.
Criterion interpretation is an absolute rather than relative interpretation, referenced to a
defined body of learner behaviors, or, as is commonly done, to some specified level of
performance.
Diagnostic tests are intended to identify student deficiencies, weaknesses, or problems
and to locate the source of the difficulty. If related learning activities are prescribed, the
term prescriptive tests may be used.
Formative testing occurs over a period of time and monitors students' progress.
Summative testing is done at the conclusion of instruction and measures the extent to
which students have attained the desired outcomes.
Key terms and concepts
Tailored Test
Measurement
Levels of Measurement
Nominal Scale
Ordinal Scale
Interval Scale
Ratio Scale
Norm-referenced
Assessments
Test
Criterion-referenced
Diagnostic test
Prescriptive test
Formative testing
Summative testing
Review Items
1. Traditionally, the use of tests in the schools has been:
a. Predominately norm-referenced.
b. Predominately criterion-referenced.
c. About evenly split, norm-referenced and criterion-referenced.
d. Neither norm-referenced nor criterion referenced.
2. When state legislators have mandated testing, such legislation has been focused
primarily on:
a. Increased classroom testing to improve instruction.
b. Inherent ability testing, for example, using IQ tests.
c. Minimum competency testing.
d. Increased testing of student motivation.
3. Norm-referenced interpretations of test results are directed primarily to the purposes
of:
a. Discriminating among individuals.
b. Discriminating among groups.
c. Discriminating among programs.
d. Discriminating between a program and a standard.
4. Criterion-referenced interpretations of test results are directed to:
a. Relative interpretations of individual scores.
b. Absolute interpretations of individual scores.
c. Relative interpretations of group scores.
d. Absolute interpretations of group scores.
5. Of the terms below, the one having the narrowest meaning is:
a. Evaluation.
b. Assessment.
c. Measurement.
d. Test.
6. What characteristic distinguishes evaluation from measurements?
a. Evaluation requires quantification.
b. Measurements include testing; evaluation does not.
c. Evaluation involves a value judgment.
d. Evaluation includes assessment, measurement does not.
7. Of the four levels of measurement, the one that contains the most information in the
number is:
a. Interval.
b. Nominal.
c. Ordinal.
d. Ratio.
8. If, when reading information about a class, we assign a 1 to girls and a 0 to boys,
this level of measurement is:
a. Interval.
b. Nominal.
c. Ordinal.
d. Ratio.
9. The scores from a typical classroom test are probably better than
____________ measurement, but not quite ___________.
a. Ordinal, nominal.
b. Interval, ratio.
c. Nominal, ordinal.
d. Ordinal, interval.
10. When using a performance test, the difference between scores of 65 and 70 equals
the difference between scores of 85 and 90, but a score of 80 is not twice a score of
40. The level of measurement is:
a. Interval
b. Nominal.
c. Ordinal.
d. Ratio.
11. Grading on the curve is a criterion-referenced interpretation of test scores.
T F
12. Whether a test is norm-referenced or criterion-referenced depends on the format of
the items included in the test.
T F
13. Most standardized achievement tests are designed for norm-referenced
interpretations.
T F
14. Criterion-referenced tests tend to be more general and comprehensive than
norm-referenced tests.
T F
-2-
PLANNING THE TEST
Tests are given for many different reasons. In order to achieve such diverse purposes,
they need to be carefully planned. In classroom settings, this planning usually entails
instructional objectives and/or a table of specifications.
When teachers use specific instructional objectives, it becomes very clear what should
be on the test. The desired student behaviors from the objectives translate directly into
the items for the test.
Objectives can be classified in terms of the kind of understanding that is required.
Bloom et al.'s taxonomy is useful in sorting objectives into six hierarchical levels of
understanding. When tests are based on such a taxonomy, they are more likely to assess
higher levels of reasoning.
A table of specifications is another tool that is used in test design. A two-dimensional
grid, content by cognitive process, is used to plan the number and kinds of items that
will be on the test.
The use of objectives or a table of specifications may imply strict guidelines for
constructing tests, but much of testing is determined by practical considerations. Test
constructors must consider such things as how much testing time is available, what item
formats the students can handle, and the developmental level of the examinee. A well-
planned test is no accident.
Key terms and concepts
Test purpose
Objectives
Knowledge
Taxonomy of educational objectives
Table of specifications
Review Items
1. The Bloom et al. taxonomy of educational objectives in the cognitive domain is a
hierarchy based on the:
a. Extent of recall required to attain an objective.
b. Level of understanding required to attain an objective.
c. Reading level required to attain an objective.
d. Aptitude required to attain an objective.
2. Tests are constructed for many purposes. Which of these is not a common
purpose of tests?
a. Affirmation.
b. Prediction.
c. Evaluation.
d. Description.
3. A test that is useful for one purpose is also likely to be effective for many other
purposes.
T F
4. Test items should be evaluated in terms of how well they match the test's:
a. Purpose.
b. Reading level.
c. Reliability.
d. Assessment.
5. Explaining in one's own words what a compromise entails is a task at what level of
Bloom's taxonomy?
a. Application.
b. Knowledge.
c. Comprehension.
d. Analysis.
6. An assignment to design a politically acceptable solution to nuclear waste
disposal would be at which taxonomic level?
a. Synthesis.
b. Knowledge.
c. Comprehension.
d. Analysis.
7. Matching authors' names and titles of their books is a task at what taxonomic
level?
a. Application.
b. Knowledge.
c. Comprehension.
d. Analysis.
8. A knowledge-level understanding of reproduction is necessary but not sufficient
for a comprehension level of understanding of reproduction.
T F
9. Which of the following is not a part of a well-written instructional objective
according to Mager?
a. A description of the learner.
b. The behavior that is to be observed.
c. The conditions under which the behavior will occur.
d. Criterion of acceptable performance.
10. Which of the following educational goals is not stated in behavioral terms?
a. Read.
b. Understand.
c. List.
d. Count.
11. A table of specifications categorizes test items by:
a. Content and reading level.
b. Content and cognitive process.
c. Cognitive process and reading level.
d. Item type and cognitive process.
12. A table of specifications is used for:
a. Standards for evaluating a test.
b. Converting test scores to evaluation.
c. Listing instructional objectives.
d. Planning a test.
13. A table of specifications is not useful when instructional objectives are available.
T F
14. A table of specifications would be appropriate for an achievement test but not for
a test that is used to predict future academic performance.
T F
15. Almost all important educational outcomes can be expressed in terms of
behavioral objectives.
T F
-3-
SELECTED-RESPONSE ITEMS
Test items can be distinguished by the response required: the response is either selected
from two or more options or constructed by the test taker.
The three commonly used selected-response item formats are true-false, multiple-choice,
and matching.
Selected-response items have three general qualities:
1. They can be reliably scored.
2. They can sample the domain of content extensively.
3. They tend to measure memorization of unimportant facts unless care is taken in
constructing the items.
True-false can be effective when a few guidelines are followed in the construction:
1. Statements must be clearly true or false.
2. Statements should not be lifted directly from the text.
3. Specific determiners should be avoided.
4. Trick questions should not be used.
5. Some statements should be written at higher cognitive levels.
6. True statements and false statements should appear with about the same frequency
and be of similar length.
Multiple-choice items can be improved by following these guidelines:
1. Avoid grammatical clues.
2. Keep option length uniform.
3. Use plausible distracters.
4. Do not repeat key words from the stem in the options.
Correct answers should be randomly ordered across the response-option positions. The
position of the correct response should not provide a clue about its correctness.
Complex options (e.g., all of the above; none of the above; or a and b, but not c) should
be used sparingly, if at all.
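The random-ordering guideline above can be sketched in a few lines; the item and its options here are hypothetical, and Python is used purely for illustration:

```python
import random

# Hypothetical multiple-choice item: shuffle the options so the keyed
# answer's position carries no clue about its correctness.
options = ["mean", "median", "mode", "range"]  # "median" is the keyed answer
correct = "median"

random.shuffle(options)          # correct answer lands in a random position
key = options.index(correct)     # record its new position for the answer key
```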
Multiple-choice items are versatile:
1. They can measure higher cognitive outcomes.
2. They can provide diagnostic information.
Matching items are usually presented in a two-column format: one column consists of
premises and the other consists of responses.
Matching items should contain homogeneous content so that all responses must be
considered plausible answers.
The following guidelines apply to all selected-response item formats:
1. Teachers should be aware of the appropriate number of items on the test that can be
guessed correctly.
2. Test items should be independent: The content of one item should not provide the
answers to others, nor should correctly answering one question be a prerequisite to
correctly answering another.
3. Reading level of the test should be lower than the grade level, unless reading is
being tested.
Key Terms and Concepts
Objective item
Selected-response item
True-false item
Multiple-choice
Matching item
Options
Content sampling
Barrier
Stem
Distractors
Clue
Premise
Review Items
1. Objective items are objective only in their:
a. Item content.
b. Scoring.
c. Distracters.
d. Wording.
2. Many selected-response items can be asked in each testing session; thus, they can
provide good:
a. Levels of difficulty.
b. Objectivity.
c. Content sampling.
d. Time sampling.
3. The objective scoring of selected-response items enhances the tests:
a. Reliability.
b. Validity.
c. Reliability and validity.
d. Validity but not reliability.
4. When constructing true-false items, it is best to:
a. Use specific determiners.
b. Reproduce statements directly from the text.
c. Include about equal numbers of true statements and false statements.
d. Vary the length of false statements and true statements.
5. Which of the following is not a strength of multiple-choice items?
a. Effective testing of higher cognitive levels.
b. Content sampling.
c. Scorer reliability.
d. Allowing for educated guesses
6. When constructing multiple-choice items, it is best to:
a. Make all options the same length.
b. Put the main idea in the options.
c. Use options such as a and b, but not c.
d. Repeat key words from the stem in the options.
7. A matching item includes six events to be matched with nine responses consisting
of dates, cities, and states. The error of item construction is:
a. Too many premises.
b. Too few premises.
c. Responses contain heterogeneous content.
d. Responses contain homogeneous content.
8. When constructing a matching item, the numbers of premises and responses
should be:
a. Equal, with all responses used only once.
b. Equal, but having the option of using responses more than once.
c. Unequal, with a greater number of responses and any response being used only
once.
d. Unequal, with a greater number of responses and having the option of using
responses more than once.
9. The tendency for true-false items to measure trivia is a weakness of the item writer
more than of the item format.
T F
10. A good true-false item is clearly true or false.
T F
11. Increasing the length of a matching item tends to enhance the homogeneity of
content.
T F
12. Multiple-choice items generally require about the same response time per item as
matching items.
T F
13. The item format most appropriate for measuring knowledge of paired associates,
such as symbols and their meaning, is multiple-choice.
T F
14. The column of a matching item that contains the item stems is called the _______.
15. For usual classroom testing, the most desirable length for a matching item is
between ___________and___________premises.
-4-
CONSTRUCTED-RESPONSE ITEMS
For a short-answer item, the student supplies the answer to a question, association, or
completion form.
In constructing short-answer items, each item should have a unique, correct answer and
be structured so the student can clearly recognize its intent.
An essay item is one for which the student structures the response. He or she selects
ideas and then presents them according to his or her own organization and wording.
Essay items are used quite effectively to measure higher-level learning outcomes, such
as analysis, synthesis, and evaluation. Essay testing is not, however, an effective
means of measuring lower-level learning outcomes.
Essay items can be used to measure writing and self-expression skills. Although this
may not be the primary purpose of a given test, it is certainly worthwhile.
Scoring inconsistencies are the primary disadvantage of essay items. In addition,
irrelevant factors, such as neatness and penmanship, may also influence the score.
The extent of response is basically on a continuum, from restricted to extended. Writing
items with the response geared toward the restricted end tends to provide more focus
for the item.
The student must be directed to the desired response. This can be enhanced by
identifying the intended student behaviors and including them in the essay item.
The suggested time for responding to each test item should be provided to the students.
This designates the weight or value of each item and also helps students budget their
time.
Analytic scoring focuses on individual points or components of the response; holistic
scoring considers the response in its entirety, as a whole.
If possible, responses to items should be scored anonymously. In addition, all
responses to one item should be scored before moving on to the next item, rather than
scoring an entire test at a time. Also, the papers should be reordered before scoring the
next item.
Key Terms and Concepts
Constructed response
Short-answer item
Essay item
Completion form
Restricted response
Extended response
Objective scoring
Question form
Association form
Modal answer
Analytic scoring
Holistic scoring
Review Items
1. In an association-form short-answer item, the spaces for the responses should:
a. Vary according to the length of the correct response.
b. All be the same size.
c. Vary in size, but not according to any order.
d. Vary in size according to some system of ordering.
2. Completion, short-answer items should have the blank(s) placed:
a. At or near the beginning of the item.
b. Between the beginning and the middle of the item.
c. As close to the middle of the item as possible.
d. At or near the end of the item.
3. A Swiss cheese completion item has:
a. The blanks evenly spaced throughout the item.
b. Too many blanks.
c. Blanks of unequal size.
d. None of the above.
4. Essay items are popular in teacher-constructed tests because:
a. Of the subjectivity in their scoring.
b. They are perceived to be more effective in measuring higher-level outcomes than
objective items.
c. They tend to have greater content sampling than objective items.
d. They tend to have greater reliability than objective items.
5. When scoring essay items, all responses to one item should be scored before
scoring the next item, rather than scoring one entire test before scoring the next.
This procedure:
a. Increases the test validity.
b. Enhances the consistency of scoring.
c. Reduces bias against individual students.
d. Enhances the objectivity of scoring.
6. The test scorer reads the response to one essay item after already reading several
other responses to the same item. The score of this response will tend to be:
a. Higher, if the earlier responses were of poor quality.
b. Higher, if the earlier responses were of high quality.
c. Lower, if the earlier responses were of poor quality.
d. Unaffected by the quality of earlier responses.
7. The halo effect in scoring items is a tendency to score more highly those
responses:
a. Read later in the scoring process.
b. Read earlier in the scoring process.
c. Of students known to be good students.
d. That are technically well written.
8. A student receives a high score on an essay item, due, in part, to the quality of
responses to the item read earlier. This is:
a. A context effect.
b. A halo effect.
c. A reader-agreement effect.
d. None of the above.
9. Anonymous scoring of essay item responses tends to reduce:
a. Reader agreement.
b. The halo effect.
c. Order effect.
d. Effect due to technical characteristic, such as penmanship.
10. From a measurement standpoint, using classroom tests consisting entirely of essay
items is undesirable because:
a. Content sampling tends to be limited.
b. Scoring requires too much time.
c. It is difficult to construct the items.
d. Structuring model responses is too time consuming.
11. Short-answer items are generally easier to construct than matching items.
T F
12. The use of essay items is an effective means of measuring lower-level learning
outcomes.
T F
13. Analytic scoring of essay items tends to be faster than holistic scoring.
T F
14. Analytic scoring of essay items tends to be more objective than holistic scoring.
T F
15. There is a tendency to score longer responses to essay items more highly than
shorter responses.
T F
16. Including optional items in an essay exam is a desirable practice.
T F
-5-
NORM-REFERENCED MEASUREMENT
Norm groups are the referent groups for norm-referenced interpretations of test
scores. Such groups must be appropriate for the individuals tested and the purposes at
hand.
Norms should be representative, relevant, and recent.
Representativeness of the norm group depends on the size of the sample and the
sampling method. The latter has numerous factors associated with it and is the most
likely source of producing biased norms.
Relevance depends on the degree to which the norm group is comparable to the group
under consideration.
National, local, and subgroup norms provide different perspectives for interpreting the
results of tests.
Norms are measures of the actual performance of a group on a test. They are not meant
to be standards of what performance levels should be.
Descriptive statistics are used to summarize characteristics of sets of test scores. The
level of statistics commonly used in measurement is quite basic, requiring only simple
arithmetic operations.
Frequency distributions summarize sets of test scores by listing the number of people
who received each test score. All of the test scores can be listed separately, or the
scores can be grouped in a frequency distribution.
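An ungrouped frequency distribution of this kind can be tallied directly; the scores below are hypothetical, and the sketch simply counts how many students received each score:

```python
from collections import Counter

scores = [72, 85, 85, 90, 72, 68, 85, 90, 77, 72]  # hypothetical class scores

# Ungrouped frequency distribution: each score paired with its frequency
freq = Counter(scores)
for score in sorted(freq, reverse=True):
    print(score, freq[score])
```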
The mean, the median, and the mode all describe central tendency:
1. The mean is the arithmetic average.
2. The median divides the distribution in half.
3. The mode is the most frequent score.
Descriptive statistics that indicate dispersion are the range, the variance, and the
standard deviation. The range is the difference between the highest and lowest scores
in the distribution plus one. The standard deviation is a unit of measurement that shows
by how much the separate scores tend to differ from the mean. The variance is the
square of the standard deviation. Most scores are within two standard deviations of
the mean.
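These descriptive statistics can be checked by hand or, as a sketch with hypothetical scores, with Python's statistics module (note the range below follows the chapter's inclusive definition, highest minus lowest plus one):

```python
import statistics

scores = [60, 70, 70, 80, 90]  # hypothetical test scores

mean = statistics.mean(scores)      # arithmetic average -> 74
median = statistics.median(scores)  # middle score -> 70
mode = statistics.mode(scores)      # most frequent score -> 70

score_range = max(scores) - min(scores) + 1  # inclusive range -> 31
variance = statistics.pvariance(scores)      # population variance
sd = statistics.pstdev(scores)               # standard deviation, sqrt of variance
```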
Key Terms and Concepts
Norm group
Norms
Representativeness
Recency
National norms
Local norms
Grade equivalent score
Descriptive statistics
Frequency distribution
Measures of central tendency
Measures of dispersion
Mean
Median
Mode
Range
Variance
Standard deviation
Review Items
1. When using a norm-referenced interpretation, the student's score on a test is
compared to:
a. A minimum score for passing the test.
b. The score of others taking the test.
c. An expected score based on the student's ability.
d. A predetermined percentage of correct responses.
2. It is not important that the norm group for a nationally used achievement test:
a. Is large.
b. Is representative.
c. Is from at least three grade levels.
d. Has persons from all states.
3. The extent to which a norm group is comparable to the group being tested
determines the norm group's:
a. Relevance.
b. Representativeness.
c. Recency.
d. Reliability.
4. The Lake Wobegon Phenomenon in testing is the situation of:
a. Norm groups scoring unusually high on standardized tests.
b. Students scoring below the national average on standardized tests.
c. Students scoring above average on locally normed tests.
d. All states reporting above-average performance on nationally normed tests.
5. Norms for published standardized tests are commonly based on the performance
of:
a. Individual students who will perform well.
b. One or more groups of students.
c. Students in a typical school system.
d. A random sample of students from one state.
6. Local achievement and aptitude norms might be more important than national
norms in decisions about:
a. Future occupation.
b. The likelihood of success in certain colleges.
c. Selection into special high-school programs.
d. Allocations among different school districts.
7. The central administration of a school district sets a goal of having all elementary-
school students reading at or above the average on a nationally normed test. The
mistake being made is:
a. Making the assumption that the norm group is relevant to the local school.
b. Using the norm as a standard.
c. Attempting to have consistent reading performance in all schools.
d. Establishing too modest a goal.
8. Which of the following is not a measure of central tendency?
a. Mean.
b. Variance.
c. Mode.
d. Median.
9. When a distribution has a small number of scores, some of which are very extreme,
the preferred measure of central tendency is the:
a. Median
b. Mean
c. Range
d. Mode
10. A measure of dispersion for a distribution, whose computation involves only the
extreme scores, is:
a. Standard deviation
b. Variance
c. Mode
d. Range
11. Which of the following provides a measure of dispersion in the same units as the
original scores?
a. Variance
b. Median
c. Standard deviation
d. Correlation
12. Measures of central tendency are to location as measures of dispersion are to:
a. Points
b. Spread
c. Average
d. Frequencies
13. If the mode and median of a distribution of scores are equal, the mean will also
have to be equal to the median.
T F
14. When establishing national norms, size of the norm group is a major concern.
T F
15. Generally, the larger the numerical value of the median, the larger the value of the
standard deviation.
T F
-6-
COMPARING SCORES TO NORM GROUPS
When comparing an individual's score to the scores of the norm group, the point is to
determine where the individual's score falls in the norm group distribution.
Percentiles indicate the percentage of students in the norm group who are at or below a
particular score.
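As a sketch with a hypothetical norm group, a percentile rank under this at-or-below definition can be computed as:

```python
norm_group = [55, 60, 62, 68, 70, 74, 78, 81, 85, 92]  # hypothetical scores

def percentile_rank(score, scores):
    # Percentage of the norm group scoring at or below the given score
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

percentile_rank(74, norm_group)  # 6 of 10 scores are at or below 74 -> 60.0
```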
The standard normal distribution has a mean of 0 and a standard deviation of 1.0. The
area in the Appendix 4 table is given from the mean to the z-score and is the
proportion of the total area.
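The tabled area from the mean to a z-score can be reproduced, as a sketch in place of the Appendix 4 table, with Python's built-in normal distribution:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # mean 0, standard deviation 1

z = 1.0
# Proportion of the total area between the mean (0) and z, as in the table
area = std_normal.cdf(z) - 0.5
round(area, 4)  # about 0.3413
```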
Standard score and transformed standard scores express the relative position of a
score in a distribution in terms of standard deviation units from the mean.
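For example, a z-score and the common T-score transformation (mean 50, standard deviation 10) can be sketched as follows; the raw score, mean, and standard deviation are hypothetical:

```python
def z_score(raw, mean, sd):
    # Standard score: distance from the mean in standard-deviation units
    return (raw - mean) / sd

def t_score(z):
    # Transformed standard score with mean 50 and standard deviation 10
    return 50 + 10 * z

z = z_score(raw=65, mean=50, sd=10)  # 1.5 standard deviations above the mean
t_score(z)                           # 65.0
```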
Stanines provide equal units of measurement. There are nine stanine scores, and the
name comes from standard nine. Each stanine contains a band of scores, each band
equal to one-half standard deviation in width.
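One common conversion from z-scores to stanines, consistent with the half-standard-deviation bands described above, can be sketched as:

```python
def stanine(z):
    # Stanine 5 straddles the mean; each step is half a standard deviation.
    # Results are clipped to the 1-9 scale at the extremes.
    return max(1, min(9, round(2 * z + 5)))

stanine(0.0)   # 5, at the mean
stanine(1.1)   # 7
stanine(-3.0)  # 1, clipped at the bottom of the scale
```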
The NCE score is a normalized standard score with a mean of 50 and a standard
deviation of 21.06. Scores range from 1 through 99, and an equal unit is retained in the
scale.
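Given a z-score, the NCE transformation follows directly from the stated mean and standard deviation:

```python
def nce(z):
    # Normal curve equivalent: mean 50, standard deviation 21.06
    return 50 + 21.06 * z

nce(0.0)  # 50.0, the mean of the NCE scale
```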
Grade equivalent scores are intended to indicate the average level of performance for
students in each month of each grade. Unfortunately, grade equivalents do not form an
equal interval scale.
Key Term and Concepts
Percentile
Percentile rank
Standard score
Standard normal distribution
Transformed standard score
Stanines
Normalized T-score
Normal curve equivalent score
Grade equivalent score
Review Items
1. A student who scores at the 45th percentile on a test:
a. Answered 45 percent of the items correctly.
b. Is above average in performance.
c. Equaled or surpassed 45 percent of the other examinees.
d. Had at least 45 percent of the right answers.
2. Standard scores express an individual's position in the distribution of scores in
terms of:
a. Standards of performance.
b. Standard deviations from the mean.
c. Standard deviation from the maximum possible score.
d. Deviation from a standard of performance.
3. A test score that is at the 42nd percentile could also be said to be at which stanine?
a. 3rd
b. 4th
c. 5th
d. 6th
4. A z-score of 1.5 would have what value if it were converted to a t-score?
a. 120
b. 50
c. 35
d. 65
5. Which of these cannot be meaningfully averaged because the scores are ordinal
rather than interval?
a. Percentiles
b. Stanines
c. Standard scores
d. None of the above
6. A student receives a z-score of +1.25 on an exam. This means the student's
performance is:
a. Below the mean performance of the group
b. One-quarter of a standard deviation above the mean performance of the group
c. At the average for the group
d. Around the 89th percentile of the group
7. The standard normal distribution has:
a. a mean of 0 and a standard deviation of 1
b. a mean of 50 and a standard deviation of 1
c. a mean of 0 and a standard deviation of 10
d. a mean of 50 and a standard deviation of 10
8. A student's t-score in a distribution of transformed standard scores is 40. This
student's performance is:
a. Above average
b. At the 40th percentile
c. At the 7th stanine
d. Below average
9. Stanines divide a distribution into nine parts so that each part:
a. Contains about 11 percent of the scores
b. Is one-half standard deviation wide
c. Represent 10 percentile ranks
d. Contains the mode of the distribution
10. In a normal distribution, which of the following indicates the highest relative position
in the distribution of scores?
a. Z = 1.5
b. Percentile rank = 90
c. T = 65
d. Stanine = 8
11. Joe has a stanine score of 6 on an exam. His performance is:
a. Below the 6th percentile
b. Between the 60th and 77th percentile
c. At the 50th percentile
d. Above the 80th percentile
12. Normal curve equivalent scores (NCEs) range from:
a. -3.00 to +3.00
b. 1 to 9
c. 30 to 80
d. 1 to 99
13. Percentiles are more of an ordinal scale than an equal interval scale.
T F
14. When scores are converted to percentiles, a specified gain in achievement will result
in a larger increase in percentile rank if the gain is near the high end of the
distribution than near the middle of the distribution.
T F
15. Grade equivalent scores are on an equal interval scale.
T F
-7-
ITEM STATISTICS FOR NORM-REFERENCED
TESTS
This chapter introduced the concepts of analyzing individual items of norm-referenced
tests, that is, how well the items are performing in the total test. The correlation
coefficient was introduced as a descriptive statistic, one that can be used to indicate
the direction and strength of the relationship between two variables. In testing
applications, the variables are often scores on individual items, scores on tests or other
measuring instruments, and scores on external criteria that we try to predict, such as
future grade-point average. The correlation coefficient will also be used in future
chapters to develop the concepts of validity and reliability.
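The Pearson product-moment coefficient can be computed from deviation scores; the paired scores below are hypothetical:

```python
def pearson_r(x, y):
    # Correlation: covariance of x and y divided by the product of their
    # standard deviations (computed here from sums of squared deviations)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

item_scores = [0, 0, 1, 1, 1]        # hypothetical 0/1 scores on one item
total_scores = [12, 15, 18, 20, 25]  # hypothetical total test scores
r = pearson_r(item_scores, total_scores)  # positive: item tracks the total score
```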
Two item statistics, difficulty and discrimination, were introduced and shown to be
useful in evaluating the performance of individual items on norm-referenced tests. The
difficulty index indicates the percentage of persons who answered an item correctly,
whereas the discrimination index shows how well the item separates those who had
high and low scores on the total test. The discrimination index is based on the
correlation between scores on an individual item and those on the total test.
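As a sketch with hypothetical item data, the difficulty index and an upper-lower form of the discrimination index (a common shortcut for the item-total correlation described above) work out as follows:

```python
# 1 = correct, 0 = incorrect on one item (hypothetical data)
upper_group = [1, 1, 1, 0, 1]  # the five students with the highest total scores
lower_group = [0, 1, 0, 0, 1]  # the five students with the lowest total scores

p_upper = sum(upper_group) / len(upper_group)  # proportion correct, top group
p_lower = sum(lower_group) / len(lower_group)  # proportion correct, bottom group

# Difficulty index: proportion of all ten students answering correctly
difficulty = (sum(upper_group) + sum(lower_group)) / 10

# Discrimination index: how much better the high scorers did on this item
discrimination = p_upper - p_lower
```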
Constructing a perfect test is not likely, especially for the initial draft of the test, even
when we follow the guidelines for good test construction. Confusion, ambiguity, and
poorly constructed options may enter into an item. Students may perceive items
differently than intended by the teacher. Item analysis provides empirical data about
how individual items are performing in a real test situation. Item statistics do not reveal
specifically the deficiencies in the content of items, but they indicate when an item is
deficient. Checking the item difficulty index and the discrimination index may give some
clues as to what is wrong. A careful inspection of the item content and response
patterns of students is often quite revealing.
Key Terms and Concepts
Correlation coefficient
Scatterplot
Pearson product-moment coefficient
Coefficient of determination
Difficulty index
Discrimination index
Review Items
8/8/2019 Educational and Measurement Testing
1. To compute the correlation between attitude and achievement, one must have:
a. Achievement score from one group of people and attitude scores from another
group.
b. Achievement and attitude scores on the same group of people.
c. Achievement scores from two points in time and attitude scores from two points
in time.
d. The same tests given twice to the same group of people.
2. The correlation coefficient is a number that can range:
a. 0 to +1.00
b. -1.00 to +1.00
c. Minus infinity to plus infinity
d. 0 to 100
3. Which of the following indicates the greatest degree of correlation?
a. -.52
b. -.61
c. +.23
d. +.42
4. The variance of a distribution is a measure of:
a. Dispersion
b. Central tendency
c. Relationship
d. Location
5. Students who score high on an ability measure were found to be able to solve a
learning task much faster than students scoring low on the ability measure. If scores
on the ability measure and time to complete the learning task are correlated, we
would expect:
a. Zero correlation
b. A zero coefficient of determination
c. Positive correlation
d. Negative correlation
6. An exam is given to 40 students; 35 of the students respond correctly to an item. The
difficulty index for the item is close to:
a. 0
b. 1
c. .87
d. .40
7. Which difficulty index is indicative of the most difficult item?
a. .90
b. .50
c. .25
d. .12
8. The preferred difficulty index for items of norm-referenced tests is:
a. Close to 1
b. Close to 0
c. Close to .80
d. Close to .50
9. If an item has a high discrimination index, it means that scores on the item have:
a. No correlation with total test scores
b. High correlation with total test scores
c. Low correlation with total test scores
d. Negative correlation with total test scores
10. An item has a negative discrimination index. Thus, if the student responds correctly
to this item, for this student we would expect a:
a. Low total test score
b. High total test score
c. Total test score around the middle
d. Total test score of zero
11. Of the following, which provides information about the distribution of total test
score?
a. Difficulty index
b. Correlation coefficient
c. Discrimination index
d. Standard deviation
12. If we want to identify who is getting the test item correct, low-scorers or high-scorers,
we would check the difficulty index.
T F
13. An item has a discrimination index around .80. This means that high scorers on the test are
getting the item correct.
T F
14. An item has a difficulty index close to zero. This means that high scorers on the test
are getting the item correct.
T F
15. The ideal situation for a test is to have high difficulty levels and high discrimination
indices for the items.
T F
-8-
RELIABILITY OF NORM-REFERENCED TESTS
Reliability of measurement is consistency: consistency in measuring whatever the
instrument is measuring.
Stability reliability is consistency of measurement across time.
Test-retest, with the same test administered at different times, provides the estimate of
stability reliability. The reliability coefficient is the correlation between the
scores of the two test administrations.
Equivalence reliability is consistency of measurement across two parallel forms of a
test.
The split-half procedure divides the test into two parallel halves; the scores of the two
halves are then correlated. The reliability of the total test is then estimated using the
Spearman-Brown formula.
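A sketch of the split-half computation with hypothetical half-test scores; the step-up used here is the two-half special case of the Spearman-Brown formula, r_total = 2r / (1 + r):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x)) * sqrt(sum((b - my) ** 2 for b in y))
    return num / den

# Hypothetical scores on the two parallel halves (e.g., odd vs. even items).
half1 = [10, 9, 8, 6, 4]
half2 = [9, 9, 7, 5, 5]

r_half = pearson(half1, half2)        # reliability estimate for one half
r_total = 2 * r_half / (1 + r_half)   # Spearman-Brown step-up to full length
print(round(r_half, 2), round(r_total, 2))  # 0.93 0.97
```

Note that the stepped-up coefficient is always at least as large as the half-test correlation, reflecting the greater reliability of the longer test.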
The KR-20 formula gives an estimate of internal consistency reliability (r20), which, in
essence, is the mean of all possible split-half coefficients.
The r21 may be substituted for r20 if item difficulty levels are similar; r21 is computationally
easier, but it underestimates reliability if the items vary in difficulty.
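Both formulas can be sketched from a 0/1 item-response matrix. The responses below are hypothetical, and the population (n-divisor) variance of total scores is assumed:

```python
def kr20(matrix):
    """KR-20: (k/(k-1)) * (1 - sum of p*q over items / total-score variance)."""
    k, n = len(matrix[0]), len(matrix)
    totals = [sum(row) for row in matrix]
    mean = sum(totals) / n
    var = sum((t - mean) ** 2 for t in totals) / n  # population variance
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in matrix) / n  # difficulty of item j
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var)

def kr21(matrix):
    """KR-21: uses only the mean and variance of the total scores."""
    k, n = len(matrix[0]), len(matrix)
    totals = [sum(row) for row in matrix]
    mean = sum(totals) / n
    var = sum((t - mean) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * var))

# Hypothetical responses: 4 examinees by 4 items, scored 0/1.
responses = [
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 2))  # 0.76
print(round(kr21(responses), 2))  # 0.73
```

Because the items here vary in difficulty (.75 versus .50), KR-21 comes out lower than KR-20, as the summary states.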
The alpha coefficient provides an estimate of internal consistency reliability, based on
two or more parts of a test. If each item is considered a part, the alpha coefficient is equivalent to r20.
Test length affects reliability in such a way that, the longer the test, the greater the
reliability, assuming other factors remain constant.
The Spearman-Brown formula is used for estimating the reliability of a test of increased length. It
is applied when using the split-half procedure, since the total test is twice as long as the
individual halves.
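In its general form, the formula projects the reliability of a test whose length is multiplied by a factor n: r_new = n·r / (1 + (n - 1)·r). A quick sketch:

```python
def spearman_brown(r, n):
    """Projected reliability when test length is changed by factor n."""
    return n * r / (1 + (n - 1) * r)

# Doubling a test with reliability .70 (n = 2):
print(round(spearman_brown(0.70, 2), 2))    # 0.82
# Halving it instead (n = 0.5) lowers the estimate:
print(round(spearman_brown(0.70, 0.5), 2))  # 0.54
```

Setting n = 2 with the half-test correlation reproduces the split-half step-up described above.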
Difference scores tend to be less reliable than scores on individual tests. As the
correlation between the scores on the two tests increases, the reliability of the difference
scores decreases.
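One common expression of this relationship, assuming the two tests have equal score variances, is r_D = ((r_xx + r_yy)/2 - r_xy) / (1 - r_xy), where r_xx and r_yy are the two reliabilities and r_xy is the correlation between the tests. The coefficients below are hypothetical:

```python
def difference_score_reliability(r_xx, r_yy, r_xy):
    """Reliability of X - Y difference scores (equal-variance assumption)."""
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

# Two tests, each with reliability .80:
print(round(difference_score_reliability(0.80, 0.80, 0.50), 2))  # 0.6
print(round(difference_score_reliability(0.80, 0.80, 0.70), 2))  # 0.33
```

Raising the between-test correlation from .50 to .70 cuts the difference-score reliability nearly in half, illustrating the summary's point.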
An observed test score may be considered as consisting of two parts, the true component and the error component.
In considering the distribution of the observed, true, and error scores:

Xo = Xt + Xe and so² = st² + se²
Reliability is the proportion of the variance in the observed scores that is true or
nonerror variance.
The standard error of measurement is the standard deviation of the distribution of error
scores. As reliability increases, the standard error of measurement decreases.
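As a sketch, the standard error of measurement can be computed as SEM = s·sqrt(1 - r), with hypothetical values for the observed standard deviation and reliability:

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM: observed standard deviation times sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

s = standard_error_of_measurement(20, 0.84)
print(round(s, 2))  # 8.0
# A rough 68% band for repeated testing around an observed score of 75:
print(75 - s, "to", 75 + s)
```

With a higher reliability the band tightens, consistent with the inverse relationship stated above.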
We can use the concepts of reliability and standard error of measurement in making
inferences about how an individual's score would fluctuate on repeated use of the same
test. The distribution of scores would have a mean approaching the individual's true
score and a standard deviation equal to the standard error of measurement.
Increased range of performance of the students being tested tends to enhance
reliability.
Item similarity enhances reliability, and item difficulty affects reliability such that items of
moderate difficulty, around 50 percent correct responses per item, enhance reliability.
Key Terms and Concepts
Reliability
Reliability coefficient
Test-retest
Stability reliability
Parallel forms
Equivalence reliability
Internal consistency reliability
Spearman-Brown formula
Kuder-Richardson formula-20
Kuder-Richardson formula-21
Cronbach alpha
Difference score
Error variance
True variance
Standard error of measurement
Review Items
1. The reliability coefficient can take on values:
a. From 0 to +1.00, inclusive
b. From -1.00 to +1.00, inclusive
c. Of any positive number
d. From -1.00 to 0, inclusive
2. If a group of students was measured in September using a mathematics
achievement test and then tested again in October using the same test, the
correlation coefficient between the scores of the two test administrations would be
a measure of:
a. Stability reliability
b. Equivalence reliability
c. Internal consistency reliability
d. Both stability and equivalence reliability
3. Reliability estimates of a test:
a. May be based on content or logical analysis of the test
b. Require some correlation coefficient
c. Increase with repeated test usage
d. Are the same for all applications of the test
4. If a reliability estimate is based on a single administration of a test, the reliability of
interest is not:
a. Stability reliability
b. Split-half reliability
c. Equivalence reliability
d. Internal consistency reliability
5. On a given test, the observed standard deviation of the scores is 20, and the
reliability of the test is .84. The standard error of measurement is:
a. 8.00
b. 18.33
c. 3.20
d. 16.80
6. A test of 40 items has a reliability of .70. If the test is increased to 80 items, the
reliability will be:
a. 0.99
b. 0.54
c. 0.82
d. 0.90
7. If the reliability of a test is 1.0, the standard error of measurement is:
a. 1.0 also
b. Greater than 1.0
c. Undeterminable
d. 0
8. Identify the reliability estimation procedure appropriate for determining stability
reliability of a test:
a. Split-half
b. Kuder-Richardson formula-20
c. Parallel forms administered at the same time
d. Parallel forms administered at different times
9. A reading test is given to two groups of sixth-grade students: Group A consists of
high-ability (IQ 120 or greater) students; Group B consists of students of
heterogeneous ability (IQ range 90 to 150). The most likely reliability situation is:
a. Test reliability will be the same for both groups
b. Test reliability will be greater for Group A than Group B
c. Test reliability will be greater for Group B than Group A
d. No inference can be made about test reliability
10. In applying the split-half procedure for estimating reliability, the reliability coefficient
for one-half the test is computed. To estimate the reliability of the entire test, we use
the:
a. Kuder-Richardson formula-20
b. Kuder-Richardson formula-21
c. Spearman-Brown formula
d. Cronbach alpha procedure
11. The Kuder-Richardson 20 procedure (KR-20) is a procedure for estimating reliability
that provides:
a. An internal consistency coefficient
b. The mean of all possible split-half coefficients
c. Both a and b
d. Neither a nor b
12. A test of 100 items is divided into five subtests of 20 items each. If we are interested
in internal consistency reliability, the most appropriate procedure for estimating
reliability is:
a. Kuder-Richardson formula-20
b. Kuder-Richardson formula-21
c. Cronbach alpha
d. Parallel forms
13. A mathematics test is given to a class of gifted students and also to a regular
ungrouped class. The reliability of the test would likely be:
a. Greater for the gifted class
b. Greater for the ungrouped class
c. About the same for both classes
d. Unable to infer anything until the reliability coefficient is computed
14. The standard error of measurement is a measure of:
a. Location
b. Central tendency
c. Variability
d. Association
15. As the standard error of measurement increases, the reliability of a test:
a. Also increases
b. Decreases
c. Remains unchanged
d. May increase or decrease
16. Theoretically, with respect to variance, reliability can be considered the ratio of:
a. Observed variance to true variance
b. Error variance to observed variance
c. True variance to error variance
d. True variance to observed variance
17. Conceptually, the true component and the error component of a test score are such
that:
a. The greater the true component, the greater the error component
b. The greater the true component, the smaller the error component
c. The components are equal
d. The components are independent
18. In conceptualizing the distributions of observed, true, and error scores, the following
is true for the means:
a. The observed mean equals the true mean
b. The error mean equals zero
c. The observed mean equals the true mean plus the error mean
d. All of the above
19. Conceptually, the variances of the distributions of the observed, true, and error
scores are such that:
a. The variance of the error scores is zero
b. The error variance plus the true variance equals the observed variance
c. The observed variance is less than the true variance
d. The observed variance and the true variance are equal
20. A difference score is generated by subtracting a pretest score from a posttest score.
In order to obtain a high reliability for the difference score, we require:
a. Low correlation between pretest and posttest scores
b. High reliability for both pretest and posttest scores
c. Both a and b
d. Neither a nor b
-9-
VALIDITY OF NORM-REFERENCED TESTS
Validity is the degree to which a test measures what it is intended to measure.
Content validity is concerned with the extent to which the test is representative of a
defined body of content consisting of topics and processes.
Content validity is based on a logical analysis. It does not generate a validity coefficient,
as is obtained with some other types of validity.
Standardized achievement tests tend to have broad content coverage so they will have
wide application. However, when used in a specific situation, the content validity of a
prospective test should always be considered.
Criterion validity is based on the correlation between scores on the test and scores on a
criterion. The correlation coefficient is the criterion validity coefficient.
Concurrent validity is involved if the scores on the criterion are obtained at the same
time as the test scores. Predictive validity is involved if the scores on the criterion are
obtained after an intervening period from those of the test.
Concurrent validity applies if it is desirable to substitute a shorter test for a longer one.
In that case, the score on the longer test is the criterion, and validity is that of the
shorter test.
The construct validity of a measure or test is the extent to which scores can be
interpreted in terms of specified traits or constructs.
Factor analysis is a procedure for analyzing a set of correlation coefficients between
measures; the procedure analytically identifies the number and nature of the constructs
underlying the measures. Different types of factors are general, group, and specific
factors.
For tests validated through correlation with a criterion measure, validity can be expressed
as the proportion of the observed test variance that is common variance with the
criterion. The validity coefficient is the square root of this proportion or ratio.
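A small worked sketch of that ratio, using hypothetical variances:

```python
from math import sqrt

observed_var = 100.0  # hypothetical observed test variance
common_var = 36.0     # hypothetical variance shared with the criterion

# Validity coefficient = square root of (common variance / observed variance).
validity = sqrt(common_var / observed_var)
print(round(validity, 2))  # 0.6
```

Equivalently, squaring a validity coefficient of .6 shows that 36 percent of the observed test variance is shared with the criterion.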
A test cannot be valid (either conceptually or practically) if it is not reliable; however, a
reliable test could lack validity. Thus, reliability is a necessary but not sufficient condition
for test validity.
A well-constructed test with items of proper difficulty level will enhance validity. Validity
tends to increase with test length. Low item intercorrelations may tend to enhance
criterion validity if we have a complex criterion.
Increased heterogeneity of the group measured tends to enhance validity. Subtle,
individual factors may also affect validity. Tests should be properly administered, since
any procedures that impede performance also lower validity.
Key Terms and Concepts
Validity
Content validity
Criterion validity
Validity coefficient
Concurrent validity
Predictive validity
Construct validity
Factor analysis
Factor loading
General factor
Group factor
Specific factor
Covariation
Review Items
1. Which of the following types of validity does not yield a validity coefficient?
a. Predictive
b. Concurrent
c. Content
d. Criterion
2. When considering the terms reliability and validity, as applied to a test, we can say:
a. A valid test ensures some degree of reliability
b. A reliable test ensures some degree of validity
c. Both a and b
d. Neither a nor b
3. If a test is representative of the skills and topics covered by a specific unit of
instruction, the test has:
a. Construct validity
b. Concurrent validity
c. Predictive validity
d. Content validity
8. Which characteristic is true of criterion validity?
a. It is based on a logical correspondence between two tests
b. It includes two types, concurrent and predictive validity
c. It is based on two administrations of the same test
d. All of the above
9. A school system uses a test considered to be valid for measuring student
achievement, but the test requires three hours of administration time. The principals
and teachers are considering substituting a shorter test for the longer one. The
validity of concern here is:
a. Concurrent
b. Content
c. Construct
d. Predictive
10. The testing division of a school system is attempting to analyze the traits that are
inherent in the six subscores of an academic achievement test. The validity of
concern here is:
a. Concurrent
b. Content
c. Construct
d. Predictive
11. Construct validity is established through:
a. Logical analysis
b. Statistical analysis
c. Both logical and statistical analysis
d. Neither logical nor statistical analysis
12. Factor analysis is a procedure often used in establishing construct validity of a set
of tests. In the analysis, the factor loadings that are computed are correlation
coefficients between:
a. Scores on two or more tests of the set
b. Factors and test scores
c. Two or more factor scores
d. None of the above
13. A factor analysis is conducted on the scores from six different IQ tests. One of the
factors has a large loading with a single IQ test and very small loadings with the
other five tests. This is a:
a. General factor
b. Specific factor
c. Group factor
d. None of the above
14. If a validity coefficient is computed for a test, and the test has been used with a very
homogeneous group of students, we expect that the validity coefficient will be:
a. Moderate, around .55
b. High
c. Low
d. Unable to make an inference
15. Which of the following is least like the others?
a. Criterion validity
b. Construct validity
c. Concurrent validity
d. Predictive validity
16. A test is found to have high reliability but low validity. In order for this to occur, the test
has:
a. Little true variance
b. Large error variance
c. Large specific variance
d. Little observed variance
17. When using criterion measures for establishing validity, a validity coefficient is
computed. Theoretically, in terms of variance, the validity coefficient is the square
root of the ratio of:
a. Variance common with the criterion to observed variance
b. Observed variance to variance common with the criterion
c. True variance in the criterion to observed variance
d. True variance in the criterion to true variance in the test being validated
18. In order to enhance validity, given a criterion consisting of several abilities, we
would want a test with low item intercorrelations.
T F
19. Predictive validity of a test is increased as the groups tested become more
homogeneous.
T F
20. Construct validity refers to the adequacy of item construction for a test.
T F
-10-
CRITERION-REFERENCED TESTS
A criterion-referenced test score indicates the level of performance on a well-specified
domain of content.
When the test items are not representative of a well-specified domain we cannot
generalize our results beyond the specific items on the test.
Item forms contain enough detail about how the items should be constructed so that
they represent a well-specified domain.
Instructional objectives are usually too terse to provide an adequate description of a
domain. Objectives and test specifications are needed before criterion-referenced tests
are appropriate.
Teachers can construct criterion-referenced tests through the use of objectives and item
specifications.
A standard of minimal acceptable performance is required whenever a decision about
mastery is to be made. There are several methods for setting such standards, none of
them perfect, that can be used with criterion-referenced and other kinds of tests.
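As one illustration, the Angoff method can be sketched as follows: each judge estimates the probability that a minimally competent examinee would answer each item correctly, and the cut score is the sum of the per-item mean estimates. All ratings below are hypothetical:

```python
# Rows = judges, columns = items. Each entry is a judge's estimate of the
# probability that a minimally competent examinee answers the item correctly.
ratings = [
    [0.9, 0.7, 0.6, 0.8],
    [0.8, 0.6, 0.5, 0.9],
    [0.9, 0.8, 0.6, 0.7],
]

n_judges = len(ratings)
n_items = len(ratings[0])

# Cut score = expected raw score of the minimally competent examinee.
cut_score = sum(
    sum(judge[j] for judge in ratings) / n_judges for j in range(n_items)
)
print(round(cut_score, 2))  # 2.93 out of 4 items (about 73 percent)
```

The Nedelsky and contrasting groups methods differ in how the judgments are collected, but each likewise converts expert judgment into a numeric cut score.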
Key Terms and Concepts
Norm-referenced
Criterion-referenced
Domain
Item form
Objective
Test specification
Stimulus attributes
Response attributes
Mastery decision
Standard setting
Review Items
1. Item forms are seldom used by classroom teachers because the forms:
a. Lack validity
b. Are complex and unwieldy
c. Require extensive pilot testing
d. Are appropriate only for standardized tests
2. Item forms refer to:
a. Item-writing rules
b. Response types (e.g., true-false, multiple-choice)
c. Parallel forms of items for reliability
d. Patterns of responses to sets of items
3. Which of the following is most critical in a criterion-referenced test?
a. A prespecified standard of performance
b. Objectively scored items
c. Specific behavioral objectives
d. A well-specified domain
4. A major strength of criterion-referenced testing is the ability to:
a. Generalize the results to a large set of items
b. Compare individuals in terms of relative standing
c. Establish objective performance criteria
d. Measure difficult to define constructs
5. A poorly defined domain results in items that are:
a. Too difficult
b. Dissimilar
c. Ambiguous to the examinees
d. Unreliable
6. Teachers often prefer criterion-referenced tests to norm-referenced tests because:
a. They are very concerned with which student is best in the class
b. They need to compare the learning in their class to that of other classes
c. Of the specific, discrete knowledge or skills that are assessed rather than global
constructs
d. They are usually easier to construct
7. The test construction concept that is more detailed than instructional objectives but
less cumbersome than item forms is (are):
a. Item calibrations
b. Test blueprints
c. Item objectives
d. Test specifications
8. When a test has a preset standard of minimum acceptable performance, it is a
criterion-referenced test.
a. Always true
b. Always false
c. Sometimes true
9. The method of setting standards that is most likely to be used in a classroom setting
is the:
a. Professional judgment method
b. Nedelsky method
c. Angoff method
d. Contrasting groups method
10. A panel of qualified experts is not used in which of the following methods of setting
standards?
a. Professional judgment
b. Nedelsky
c. Angoff
d. Contrasting groups method
11. Teachers often prefer criterion-referenced measures to norm-referenced measures
because criterion-referenced measures:
a. Are more reliable
b. Are less intimidating to students
c. Indicate what the student can do
d. Indicate who in class has done the best
12. Critics of criterion-referenced tests are correct when they characterize standard-
setting procedures as:
a. Vague
b. Subjective
c. Inconsistent
d. Sophisticated
13. Most likely, tests that are linked to brief, specific instructional objectives are good
examples of criterion-referenced tests.
T F
14. The item formats (e.g., multiple-choice, essay, etc.) should be different for norm-
referenced tests than for criterion-referenced tests.
T F
15. The panel of experts that is used in some standard-setting methods in school
settings consists of:
a. Academically talented students
b. Classroom teachers
c. Parent volunteers
d. Students from higher grades
-11-
ITEM STATISTICS FOR CRITERION-REFERENCED TESTS
A test score is determined by the performance of the student on each of the items on
the test. In order to understand the test score, it is essential that we understand how
each item contributes to that score. The quality of the test depends on the quality of the
items that comprise it. The procedures that are described in this chapter are ways to look at
the quality of the test items.
Items should be subjected to a content review before the test is given. Experts and
colleagues can help us by reviewing the test items for their match with the domain
specifications or the objectives, for any potentially biased wording, and for any
observable flaws in the items' construction.
After the tests have been administered and scored, there should be a review of the
kinds of errors that were made so that remediation can focus on these errors. Statistical
analysis should be done so that we have evidence about the difficulty levels of the items
and about the degree to which the items are discriminating between masters and non-
masters or between students before and after instruction.
The difficulty levels of test items often turn out to be quite different from what the
teacher expected. Difficulty levels are a clear measure of how the students performed on
a specific task, the test item. As such, they provide very useful information to the
teacher.
The discrimination index provides information that is directly related to the purpose of
the test. A discrimination index can be seen as an analogy to the sport of rowing. If all of
the items on a test are likened to the crew members, we see that things work best when
they are all pulling together. This is the case when all of the discrimination indexes are
positive. If one of the crew lifts his or her oars out of the water and does nothing, it is
like an item with a zero discrimination index. A negative discrimination index would be
the situation of a crew member (item) rowing in the opposite direction from the rest of
the crew. Clearly this latter case requires some corrective action.
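One concrete way to compute such an index for a criterion-referenced item is the pre-to-post difference in p values; the 0/1 responses below are hypothetical, and a masters-versus-non-masters difference could be computed the same way:

```python
def pre_post_discrimination(pre_item, post_item):
    """Item p value after instruction minus p value before instruction."""
    p_pre = sum(pre_item) / len(pre_item)
    p_post = sum(post_item) / len(post_item)
    return p_post - p_pre

pre = [0, 0, 1, 0, 0]   # 1 of 5 students correct before instruction
post = [1, 1, 1, 0, 1]  # 4 of 5 students correct after instruction
d = pre_post_discrimination(pre, post)
print(round(d, 2))  # 0.6, an item pulling with the crew
```

A value near zero would correspond to the idle oars, and a negative value to rowing against the crew.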
We cannot merely assume that we create high-quality test items. We need to subject
those items to item analysis in order to convince ourselves and others that the item
analysis is time well spent. The information that is provided will help us to understand
the quality of our tests so that we can base decisions on those test scores with
confidence.
Key Terms and Concepts
Item analysis
Content review
Pilot testing
Error patterns
Pre- and post-discrimination index
Mastery/non-mastery
Item discrimination index
Item difficulty index
Review Items
1. An item with a p value near 1.0 is quite:
a. Easy
b. Difficult
c. Discriminating
d. Reliable
2. Analysis of test results at the item level is useful for:
a. Decisions about individual students
b. Decisions about instruction
c. Decisions about the test items
d. All of the above
3. Panels with diverse backgrounds can be used to examine test items for:
a. Item bias
b. Difficulty
c. Discrimination
d. Continuity
4. It is critically important in criterion-referenced tests that the test items:
a. Are difficult when used on a pretest
b. Match the domain or objective
c. Discriminate between competent and less competent students
d. Not be difficult
5. The difficulty index refers to:
a. values of student ratings of whether an item was easy or difficult
b. the percentage of examinees who answered an item correctly
c. teacher judgment of how well students are likely to do on an item
d. the likelihood of guessing the correct answer to a test item
6. Pre- and post-discrimination refers to whether a test item:
a. Is easier on the posttest than on the pretest
b. Adequately discriminates pretest performance from posttest performance
c. Discriminates unfairly against certain ethnic groups
d. Would be better placed on a pretest than on a posttest
7. If 30 students are tested and 20 answer item 4 correctly, the difficulty index for
item 4 would be:
a. .10
b. -.10
c. .33
d. .67
8. The item statistic that would indicate the most serious concern would be:
a. Difficulty equal to .85
b. Difficulty equal to .05
c. Discrimination equal to -.50
d. Discrimination equal to .00
9. Items that match a well-specified domain should have difficulty levels that:
a. Are exactly equal
b. Are very similar
c. Range from 0 to 1
d. Match the domain specification
10. The higher the value of the difficulty index, the:
a. Easier the item
b. More discriminating the item
c. Lower the percentage correct on the item
d. More biased the item
11. If an item has a positive discrimination index:
a. The item should be revised
b. The item is biased
c. The item appears to be effective
d. The test will not be valid
12. Item analysis is:
a. A content analysis
b. A statistical analysis
c. Both a and b
13. When can statistical item analysis be done?
a. Before the test is given
b. While the test is being given
c. After the test is given
d. Both b and c
14. Other teachers would be most needed when determining:
a. Item difficulty
b. Item discrimination
c. Item reliability
d. Item bias
15. A test item that is positively discriminating for third-graders would be positively
discriminating for second-graders.
a. Definitely true
b. Possibly true
c. Definitely false
-12-
RELIABILITY OF CRITERION-REFERENCED TESTS

A test is reliable if it provides consistent information about examinees. This can mean
that a criterion-referenced test provides consistent estimates of performance on a
domain or that the test provides consistent placement of an examinee in a mastery or
non-mastery category. Different kinds of reliability evidence are needed for each of
these uses of criterion-referenced tests.
Whether a test is consistent relative to mastery decisions is shown by giving the test on
two occasions to the same group of examinees and finding the percentage of
examinees whose mastery/non-mastery classifications were the same on the two
test occasions. This procedure could also be used when a parallel form of the test is
given on the second testing. A reliable test would have a high percentage of examinees
with the same mastery/non-mastery classification on the two tests.
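A sketch of that percentage-agreement computation, using hypothetical scores and a hypothetical cut score of 70:

```python
def decision_consistency(scores1, scores2, cut):
    """Proportion of examinees classified the same way on both occasions."""
    same = sum((a >= cut) == (b >= cut) for a, b in zip(scores1, scores2))
    return same / len(scores1)

# Hypothetical scores for ten examinees on two administrations.
test1 = [85, 78, 90, 55, 60, 82, 40, 75, 88, 52]
test2 = [80, 74, 92, 58, 72, 85, 45, 68, 90, 49]
print(decision_consistency(test1, test2, cut=70))  # 0.8
```

Here eight of the ten examinees fall on the same side of the cut score both times, so the decision consistency is .80.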
When performance on a domain is to be estimated from the test scores, the standard
error of measurement can be used to form an interval estimate. An interval estimate
suggests the degree of imprecision that is in our test scores. The standard error of
measurement gives us an idea about how much we can expect test scores to fluctuate
across repeated testing.
The reliability of the test can be increased by careful attention to the test items, the test
setting, and the examinees. A reliable test would have items that are homogeneous.
The more similar the items are, the more consistent will be students' approach to those
items. The items should be free of flaws or vagueness of wording so that
inconsistencies are reduced. And, because there is a direct relationship between
the length of the test and the reliability of the test, there should be a sufficient number of
items.
Inconsistencies in student performance can be reduced by making sure that the testing
conditions are appropriate. There should be adequate light and quiet so that the student
can concentrate on the task. Interruptions or distractions should be eliminated, and the
test items and directions about how to answer them should be clear.
Reliable scores depend on the students being motivated to apply themselves to the
task. This is promoted when the teacher encourages the students to do well and
explains how the test scores will be used. The teacher should be alert for individual
student problems such as fatigue or anxiety that might be affecting the reliability of the
test scores.
5. Which one of the following values computed for the reliability of a test would indicate that
the test is totally unreliable?
a. .10
b. .00
c. .50
d. 1.00
6. Exactly 100 students took a criterion-referenced test twice. The test had a mastery
cut-off score; 70 students were above the cut-off score on both tests and 15
students were below the cut-off score on both tests. The reliability of the test for
mastery decisions would be:
a. .40
b. .55
c. .70
d. .85
7. When estimating the reliability of a domain score, it is appropriate to use the:
a. Standard error of measurement
b. Average score for the class
c. Range of possible domain scores
d. Measurement error coefficient
8. Other things being equal, the longer a test is, the ____________ will be its reliability.
a. Higher
b. Lower
c. Less ambiguous
d. More valid
9. The reliability coefficients that were developed for norm-referenced tests can also be
used effectively with criterion-referenced tests.
T F
10. The same criterion-referenced test was given to 30 children on consecutive days.
Ten of the children who surpassed the mastery cut-off score the first day failed to
do so on the second day. The test could be said to be:
a. Unfair
b. Biased
c. Unreliable
d. Discriminating
11. A test is either reliable or it isn't.
T F
12. Test reliability is primarily determined by the test itself. The test setting and the
examinee have a minimal impact on test reliability.
T F
13. Which of the following is most related to high criterion-referenced reliability?
a. Item difficulty near .50
b. Item discrimination near .50
c. Short tests
d. A wide range of item types
14. When estimating a domain score, the reliability would increase if:
a. The items were more difficult
b. The items were somewhat dissimilar
c. The test had a cut-off score for mastery decisions
d. The test was longer
15. The longer the test, the smaller the ___________.
a. Time between pre- and posttesting
b. Standard error of measurement
c. Difficulty index
d. Reliability discrepancy
-13-
VALIDITY OF CRITERION-REFERENCED TESTS
A test that adequately serves the purpose for which it is used is considered to be a valid
test. Validity is always defined in terms of the purpose for which the test scores will be
used. Validity is a matter of degree. One test may be more valid than another but tests
are not usually totally lacking in validity and they are never perfectly valid.
Because criterion-referenced tests are used for several different purposes, including
estimating performance on a domain and determining whether students have achieved
mastery, it is not surprising that different kinds of logical and statistical evidence should
be presented to support the validity claims. The three kinds of test validity that were
introduced are content validity, criterion validity, and construct validity.
Content validity is a determination of the extent that the test items match the domain
specifications or objectives. Validity is established by having qualified persons, a panel
of experts, review the test items for appropriateness and congruence with the domain.
Criterion validity is concerned with whether the test would be an adequate predictor of
performance on some other variable. Validity evidence is established by finding the
correlation coefficient that links the test with the criterion that is to be predicted. The
choice between two competing tests would be based on which test has the higher
correlation with the criterion. When we are concerned about mastery decisions on two
measures, the degree of validity is shown by the percentage of persons for whom the
mastery/non-mastery decision is consistent.
Construct validity is shown by making predictions about the test scores and then
conducting analyses to see whether the predictions are confirmed. Some of the reasonable predictions are: (1) the test scores should be positively correlated with other
measures of the same thing, (2) groups that are known to differ on the domain should
have test scores that are significantly different, and (3) we should not find different
patterns of responses across distractors for persons of different races, grades, or other
characteristics.
We cannot merely assume that our tests are valid. We need to conduct careful analyses to show
that our tests have sufficient content, criterion, or construct validity so that we can justify
the use of the tests.
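The two statistical indices described above for criterion validity can be sketched in a few lines of code. The score lists and the cut-off of 70 below are invented for illustration, and `pearson_r` and `decision_consistency` are hypothetical helper names, not functions from this text:

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def decision_consistency(x, y, cutoff):
    """Proportion of persons classified the same way (mastery/non-mastery)
    on both measures."""
    same = sum((a >= cutoff) == (b >= cutoff) for a, b in zip(x, y))
    return same / len(x)

# Hypothetical scores on the test and on the criterion measure
test = [82, 75, 68, 90, 55, 71, 88, 60]
criterion = [85, 70, 65, 92, 50, 74, 86, 58]

print(pearson_r(test, criterion))
print(decision_consistency(test, criterion, cutoff=70))
```

Note that the same decision-consistency index appears earlier in the chapter on reliability: agreement of mastery classifications across two administrations of the same test is reliability evidence, while agreement between a test and a separate criterion measure, as here, is validity evidence.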
Key Terms and Concepts
Validity Construct validity
Content validity Logical analysis
Criterion validity Statistical analysis
Correlation validity Distractor analysis
Review Items
1. If a test is valid it is certainly also reliable.
T F
2. Which of the following is not a validity that is described in the technical standards for
test publishers?
a. Criterion-referenced validity
b. Content validity
c. Construct validity
d. Criterion validity
3. Whether the items on a test match the domain of the criterion-referenced test is
primarily a concern about:
a. Cut-off score validity
b. Item validity
c. Content validity
d. Criterion validity
4. Essentially the same processes are used to establish the validity of criterion-
referenced tests as are used with norm-referenced tests.
T F
5. A well-specified domain for a criterion-referenced test should enhance the test's:
a. Reliability
b. Content validity
c. Discrimination
d. Criterion validity
6. A panel of experts will sometimes be used to rate items in order to promote:
a. Content validity
b. Test sales
c. Construct validity
d. User validity
7. If students who surpass the mastery cut-off score for the addition of three-digit
numbers also tend to be those students who achieve mastery on a related criterion
measure, this is evidence of:
a. Content validity
b. Criterion validity
c. Convergent validity
d. Mathematical validity
8. The correlation coefficient is a statistical way of expressing:
a. The standard error of measurement
b. Mathematical validity
c. Content validity
d. Criterion validity
9. Which validity requires both a logical process and a statistical process?
a. Content validity
b. Convergent validity
c. Construct validity
d. Criterion validity
10. If items on a criterion-referenced test do not match a well-defined domain, the test
lacks adequate:
a. Construct validity
b. Content validity
c. Criterion-referenced validity
d. Criterion validity
11. If two writers, working from the same test specifications, created test items that
were quite different from each other, the test would have inadequate:
a. Criterion validity
b. Item validity
c. Specification validity
d. Content validity
12. In order to know whether a test is valid, it is most important to know:
a. The purpose for which the test scores will be used
b. A description of the persons who will take the test
c. An estimate of the reliability of the test
d. Whether the test has ever been used before
13. When a test does not achieve the purpose for which it was designed, the test lacks:
a. Validity
b. Reliability
c. Purposefulness
d. Discrimination
14. Lengthening a test will make it more valid.
a. True, if it is somewhat valid to begin with
b. False, test length affects reliability, not validity
c. True, but only for older students
d. False, validity is related to purpose rather than length
15. The primary validity for most criterion-referenced tests is:
a. Construct validity
b. Criterion validity
c. Content validity
d. Criterion-referenced validity
Key Terms and Concepts
Test wiseness Separate answer sheets
Correction for guessing Testing arrangement
Positional preference Take-home exam
Bluffing Oral exam
Test anxiety
Review Items
1. Programs for teaching test-taking skills tend to be:
a. Equally effective throughout grades 1-8
b. More effective with lower grades than upper elementary grades
c. More effective with upper grades than lower elementary grades
d. Of no effect throughout grades 1-8
2. A student who guesses on every test item will have the highest score on which kind
of test? (Assume the tests have equal length)
a. True-false
b. Multiple-choice (four-option)
c. Multiple-choice (five-option)
d. Fill-in-the-blank
3. If a correction rather than a penalty for guessing is used on a multiple-choice test,
students should be urged to:
a. Guess when they are unsure of an answer
b. Not guess when they are unsure of an answer
c. Guess only on items for which they have some partial knowledge
4. What is the expected score of a student who guesses on all items of a 50-item
multiple-choice test with five options per item?
a. 0
b. 5
c. 10
d. 15
5. The relationship between test anxiety and test performance is generally:
a. Strong and positive
b. Strong and negative
c. Weak and positive
d. Weak and negative
6. A limit is set on the length of response (in words) to an item. This is an attempt to
limit the effect of:
a. Guessing
b. Bluffing
c. Positional preference
d. Changing answers
7. Which of the following is least related to the others? The effect of:
a. Test anxiety
b. Bluffing
c. Penmanship and spelling
d. Positional preference
8. A major problem with oral examinations is that they tend to be:
a. Very time consuming
b. Very anxiety producing
c. Unreliable
d. Formal rather than informal
9. Students should be shown how to take tests so that the tests provide:
a. Enriched diagnostic information for future instructional planning
b. Information about the test setting, as well as the test content
c. Information about wrong answers, as well as right answers
d. A more accurate picture of what the student is able to do
10. Take-home tests are subject to the following problem:
a. Lack of control over the testing situation
b. They are not appropriate for evaluation purposes
c. Time spent on the test varies among students
d. All of the above
11. Generally, more test answers are changed from wrong to right than right to wrong.
T F
12. Applying the correction for guessing raises the score on a test.
T F
13. Good penmanship and spelling tend to be positively correlated with the grades
assigned to essay responses.
T F
14. The physical arrangement of the testing situation is as important for enhancing
student performance as is establishing control and rapport.
T F
15. Separate answer sheets can be used effectively for students beginning with those in
second grade.
T F
-15-
THE USE OF STANDARDIZED ACHIEVEMENT
TESTS
Standardized achievement tests are widely used in our schools. There are many tests
on the market, available in a variety of forms, including norm-referenced and criterion-
referenced tests.
Test results can be reported for individual students, classrooms, or even school
buildings. In addition, local and national norms are available for many tests.
Publishers of standardized achievement tests are guided by the Standards for Educational
and Psychological Testing (1985). These guidelines can also be used by consumers to
evaluate the test information that publishers provide.
Teachers play an important role in standardized achievement testing. They need to take
this role seriously and make sure that the physical and psychological settings promote a
positive testing environment.
Despite their popular use, standardized achievement tests do have certain limitations.
Some of these deal with the time required for the test administration and the processing
of test scores. Other limitations concern the usefulness and accuracy of the reported
scores. Some people worry that the average performance has become a standard of
performance and that achievement tests may have too much influence over school
curricula.
High-quality standardized achievement tests are available; they do the job that they
were designed to do. However, when tests are used for other purposes, their
effectiveness will be limited. Therefore, those who select standardized achievement
tests must do a careful and complete job of comparing alternatives.
The major determinant of the appropriateness of a standardized achievement test is the
match between the items and what was actually taught in the schools. Technical
adequacy and cost are also factors.
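The item-to-curriculum match described above can be expressed as a simple proportion of test items whose objective was actually taught. A minimal sketch, in which the objective codes and the `taught` set are invented for illustration:

```python
# Hypothetical objective codes: the objective each test item measures,
# and the objectives the school actually taught.
item_objectives = ["add-2digit", "add-3digit", "subtract", "multiply",
                   "divide", "fractions", "decimals", "percent"]
taught = {"add-2digit", "add-3digit", "subtract", "multiply", "divide"}

# Relevance index: proportion of items that match the taught curriculum
matched = [obj for obj in item_objectives if obj in taught]
relevance = len(matched) / len(item_objectives)
print(f"{len(matched)} of {len(item_objectives)} items match the curriculum")
```

A test with a low proportion here would lack relevance for that school, no matter how strong its technical adequacy.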
Standardized achievement tests are perhaps our best example of high-quality testing in
education. Careful item preparation, extensive reliability and validity studies, well-designed
norm groups, and clear reporting formats combine to make standardized achievement
tests useful, accurate measures of student performance.
In the future we can expect to see efficient, computerized achievement testing. However, the types of standardized testing programs that we know now are likely to be
maintained in the 1990s.
Key Terms and Concepts
Standardized achievement test
Standards for Educational and Psychological Testing
Relevance
Technical adequacy
Usability
Computerized adaptive testing
Review Items
1. What is standardized on a standardized achievement test?
a. The anticipated level of performance
b. The conditions for test administration
c. The test validity
d. The purpose for which the test is given
2. Which of the following uses of standardized achievement test scores is closest to
the main purpose for which such tests are constructed?
a. Assessing the achievement level of an individual student
b. Measuring school, class, and district wide achievement levels
c. Assessing whether students have adequate levels of achievement for promotion
to the next grade
d. Providing objective evidence about the teacher's competence
3. Standardized achievement tests are norm-referenced rather than criterion-
referenced tests.
T F
4. The Standards for Educational and Psychological Testing are:
a. Guidelines about the use of the test scores
b. Summaries of legal cases concerning the use of test scores
c. Examples, including case studies, of the misuse of test scores
d. The legal minimum requirements for corporations that sell tests
5. Which of the following factors should be the most important consideration when
selecting a standardized achievement test?
a. Reliability
b. Cost
c. Publisher's reputation
d. Relevance
6. Mr. Jones allows his students an extra 20 minutes on a standardized achievement
test. What is the major consequence of his actions?
a. The reliability coefficient will increase
b. The content validity will decrease
c. Norm-referenced interpretations of the score will not be meaningful
d. Criterion-referenced interpretations of the scores will not be meaningful
7. Most teachers find that the results of standardized achievement tests are helpful to
them when planning instruction for individual students.
T F
8. Most commercially available standardized achievement tests are so excellent that
they can form the sole basis for many decisions about a student's academic
progress.
T F
9. Achievement-at-grade-level is:
a. A meaningless term statistically
b. An expectation that we should have for all students
c. Average performance for students in that grade
d. An arbitrary assessment based on standardized achievement test scores
10. What is a form of testing in which the items that are presented to the student
depend on the student's answers to previous items?
a. Non-standardized testing
b. Response-dependent testing trials
c. Step-by-step testing
d. Computerized adaptive testing
11. If the items on a standardized achievement test did not match what was taught in a
particular school, the test would lack:
a. Technical adequacy
b. Relevance
c. Utility
d. Reliability
12. The verb that is frequently used in the Standards for Educational and Psychological
Testing and that shows the orientation of the Standards is:
a. Should
b. Must
c. Might
d. Shall
13. Some major standardized achievement tests interpret the same test performance in
both norm-referenced and criterion-referenced formats.
T F
14. Standardized achiev