
Evidence About the Effects of Assessment Task Format on Student Achievement

Robyn Caygill and Liz Eley

Paper presented at the Annual Conference of the British Educational Research Association, University of Leeds, England, September 13-15, 2001

Abstract

It has been suggested that some students may benefit from particular formats of assessment, notably those with limited proficiency in English, those with poor reading skills, those from low-income families, and girls. This study examined the effects of four task formats on measured student achievement in mathematics and science. These formats were: one-to-one interview tasks, station tasks which required physical performance as well as written responses, paper-and-pencil tasks requiring written responses, and multiple-choice tasks. Parallel versions of twenty-seven tasks were created in two or more of the four task formats. Very few tasks gave the same picture of student achievement in all of the formats in which they were administered. This paper outlines some general trends and specific exceptions found in the results of this study. Results from sub-group analyses are used to explore the impacts of students’ reading ability and gender on their performance.


Introduction

Reform in education in recent years, led by changing theories of learning, has included changes in curricula, changes in technology, and changes in assessment practices (Niss, 1993). Assessment plays an important part in the learning process, having both formative and summative aspects. Formative assessment involves the use of assessment as a diagnostic tool, so that teachers can cater appropriately for the individual needs of their students and so that students can identify their areas of strength and weakness, celebrating their strengths and giving greater attention to improving their weaknesses. Summative assessment involves the use of assessment to report progress, for certification, for accountability, or for monitoring (New Zealand Ministry of Education, 1994; Resnick and Resnick, 1996). With these various purposes in mind, any assessment activity used should be appropriate to the purposes of the assessment, to the curriculum objectives, and to the students being assessed.

Changing theories of learning, and the subsequent changes in curricula, have led to an examination of the formats used to conduct assessments. In the recent past, assessments in mathematics and science consisted of pencil-and-paper activities, particularly for large-scale assessments. The three main formats included in these pencil-and-paper tests were multiple-choice questions, written open-ended questions requiring short answers, and written open-ended tasks requiring more extended responses or essays (Forbes, 1996). However, there is now a move to include other task formats such as interviews, portfolios, the use of equipment, and tasks completed in teams (Linn, Baker, and Dunbar, 1991; New Zealand Ministry of Education, 1994; Resnick and Resnick, 1996).

In 1995, New Zealand students first participated in the National Education Monitoring Project (NEMP). This project was designed to obtain a detailed picture of the educational achievements and attitudes of primary and middle school students, with the intended outcome of improving the education which students receive (Flockton, 1999; Crooks, 2001). Preliminary investigations had suggested the desirability of including a wide range of formats to assess the students, and several different formats were included in the project. In 2000, NEMP researchers undertook a probe study to compare the effectiveness and appropriateness of four of the formats, namely interview, station, short answer and multiple-choice. Two curriculum areas, mathematics and science, were selected as the basis for this investigation. The researchers aimed to find out whether some formats gave markers a better understanding of students’ abilities and knowledge, and whether some students performed better when they were able to give their explanation verbally rather than in a written form.

Arguments about formats for assessment activities take two forms: either criticism of, or positive arguments for, particular formats. Multiple-choice tasks are often criticised for the limited information they give about the thought processes used by the student, for covering only parts of the curriculum, and for focussing on easily tested simple skills rather than higher-order thinking (Eley, Caygill & Crooks, 2001; Haney & Madaus, 1992; Resnick & Resnick, 1996; Wiley & Haertel, 1996). However, they are also praised for being cheap, easy, and quick to administer, allowing coverage of a large area of the curriculum in a short time span, and for being psychometrically sound, allowing for consistency and ease of marking and reliability of administration (Adams & Wilson, 1996).

Short answer questions are sometimes grouped with multiple-choice questions under the title of “traditional pencil-and-paper tests” and are also criticised for testing only a narrow range of the curriculum, both in content and process. Another criticised feature of many traditional pencil-and-paper tests is the focus on the product of the student’s work (Rowntree, 1991): often, in the case of mathematics and science, an answer to a specified problem. A correct answer may be misleading because the student has recalled an answer previously worked out, copied without understanding, guessed wildly or intelligently, meant something different, or believed something different (Clements & Ellerton, 1995; Gay & Thomas, 1993; Rowntree, 1991). As a result of this sort of misclassification of student responses, students themselves may be misclassified when they are grouped or promoted (Carr & Ritchie, 1991), particularly where a small number of items have been used on a test.

Alternative assessment formats, such as station and interview, can allow us to examine not only the product, but also the conceptions students bring to questions and the strategies used by the student to arrive at the answer (Bateson, Nicol, & Schroeder, 1991; Gronlund, 1993). However, Rowntree (1991) cautions that this is easier said than done, as a change in assessment method may still not give a complete picture of the student’s processes.

Some authors (Griffin and Nix, 1991; Gronlund, 1993; Hambleton and Murphy, 1992; New Zealand Ministry of Education, 1992) have criticised traditional methods of assessment as being biased against ethnic minority groups, girls, students with poor reading skills, and low socio-economic status students. The Third International Mathematics and Science Study (Harmon, 1999) was a large-scale international study which included multiple-choice, short answer and station formats (the station format was called performance assessment in documents relating to that study). Harmon found that whilst there were significant gender differences in the written tests (including both multiple-choice and short answer formats), there were practically no gender differences in the results of the station tasks. Linn, Baker, and Dunbar (1991), however, caution that a shift to performance-based assessments may not necessarily lead to equity in assessment.

The short answer format is sometimes grouped with the station and interview formats; these formats are praised for allowing students to create their own responses, recognising them as active learners (Khattri & Sweet, 1996; Romberg, 1992; Shepard, 1992). The short answer format is also praised for the ease and reliability of administration. Interviews and station tasks are seen as giving the opportunity to assess higher-order thinking skills (Gipps, 1994; Stiggins, 1991), that is, the ability to solve problems, to reason, to think critically, to interpret and refine ideas, and to apply them creatively (Bateson et al., 1991).

Teaching within most mathematics and science classrooms is not limited to pencil-and-paper tasks. Students use equipment, undertake investigations, measure, experiment, explore, and work in groups in classrooms. Hambleton and Murphy (1992) state that the motivation of advocates of alternative assessments, including station and interview formats, is to bring testing more in line with classroom instruction. Rowntree (1991) questions the educational relevance of some assessment formats, calling on developers to ask themselves: “does a particular assessment method seem to ‘go with’ the content and style of the teaching and learning experienced by our students?” (p. 162). Both the station and interview formats align with classroom instruction, allowing the use of equipment and exploration.

Station and interview tasks are often criticised for being expensive to administer, both in terms of time and money. They are also criticised for requiring subjective judgements and for being difficult to mark, because they are prone to unexpected responses and because a number of different interpretations of responses may be possible (Brown, 1992; de Lange, 1992; Peressini & Bassett, 1996).

With all these praises and criticisms in mind, the researchers in this study aimed to examine the four formats, with a view to further detailing the strengths and weaknesses of each format and finding the effect each format had on student performance. This paper presents the results of this study. Readers are urged to read Eley, Caygill & Crooks (2001) for details of the method, particularly the selection and development of the tasks, the experimental design, the administration processes, and the marking and analysis of the student responses.

Results

General Findings

Each student undertook only one version of each task, so comparisons relate to the groups of students who completed differing versions of each task. Student groups were equated for ability, using a stratified random assignment process, so that any differences noted could be attributed to the task format rather than to student characteristics. The results demonstrated that different formats gave students differing opportunities to demonstrate their knowledge and skills.

There are two ways that the results may be examined in order to compare the formats on a question-by-question basis. Firstly, all four formats may be compared together. This means that, for example, it could be said: “for this question, students did best in the interview format when compared with the other formats”. Secondly, each format may be compared with each other format individually, so that, for example, it could be said: “for this question, students did better in the interview format than in the multiple-choice format”. No matter which method is used to examine each question, in general the interview format had a higher proportion of students achieving success. The two tables below summarise the results of each analysis. Table 1 shows the results when all available formats were compared for each question, with the statistical significance level set at 5% (so that differences reported as statistically significant had less than a 5% probability of arising by chance alone).


Table 1: Results when all four formats were compared for each question

Result observed | Number of questions fitting stated pattern | Number of questions allowing this comparison | Proportion of questions fitting stated pattern
Interview higher than others (p<.05) | 46 | 113 | 41%
Station higher than others (p<.05) | 4 | 112 | 4%
Short answer higher than others (p<.05) | 2 | 62 | 3%
Multiple-choice higher than others (p<.05) | 13 | 62 | 21%
Interview and multiple-choice similar and both higher than others (p<.05) | 3 | 47 | 6%
Interview and station similar and both higher than others (p<.05) | 3 | 87 | 3%
Station and multiple-choice similar and both higher than others (p<.05) | 2 | 50 | 4%
No statistically significant difference between formats (p>.05) | 54 | 127 | 43%

Total number of questions: 127

Table 2 shows the results when each format is compared with only one other format at a time. Again the statistical significance level is set at 5%.


Table 2: Results when each format is compared with each other format for each question

Formats compared | Interview higher (p<.05) | Multiple-choice higher (p<.05) | Station higher (p<.05) | Short answer higher (p<.05) | No difference (p>.05)
Interview versus multiple-choice | 38% | 15% | N/a | N/a | 47%
Interview versus station | 40% | N/a | 2% | N/a | 58%
Interview versus short answer | 67% | N/a | N/a | 5% | 28%
Multiple-choice versus station | N/a | 33% | 16% | N/a | 51%
Multiple-choice versus short answer | N/a | 37% | N/a | 3% | 60%
Station versus short answer | N/a | N/a | 15% | 5% | 80%
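
The comparisons summarised in Tables 1 and 2 are comparisons of the proportions of students answering a question correctly in each format group. The paper does not specify which statistical procedure was used; purely as an illustrative sketch, the following Python fragment shows how such a comparison could be made for a single question using a chi-square test of independence at the 5% level. The counts of correct and incorrect responses per format are hypothetical.

    # Illustrative sketch only: the counts below are hypothetical and the paper
    # does not state which significance test was actually used.
    # Compares the proportion of correct answers across the four format groups
    # for one question, using a chi-square test of independence.
    from scipy.stats import chi2_contingency

    responses = {
        "interview":       {"correct": 41, "incorrect": 19},
        "station":         {"correct": 33, "incorrect": 27},
        "short answer":    {"correct": 25, "incorrect": 35},
        "multiple-choice": {"correct": 36, "incorrect": 24},
    }

    observed = [[r["correct"], r["incorrect"]] for r in responses.values()]
    chi2, p_value, dof, _expected = chi2_contingency(observed)

    print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.3f}")
    if p_value < 0.05:  # the 5% significance level used in the study
        print("The formats differ significantly for this question.")
    else:
        print("No statistically significant difference between formats.")

A pairwise comparison of two formats, as in Table 2, could be carried out in the same way using a two-by-two table.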

The Interview format

The results presented in both tables show that the interview format led to success for the greatest proportion of students. Several features of the interview format allowed students to demonstrate their knowledge and understandings and achieve greater success than in the other formats. Markers observed that students listed more features and gave fuller explanations in the interview format. There are two reasons for this. Firstly, responding orally is less time consuming and requires less commitment than responding in writing where substantial responses are required. Secondly, the presence of a teacher encouraged students to keep answering beyond one or two ideas. Teachers encouraged students to give extended answers either by verbal prompts such as “Is there anything else you could add?” or by non-verbal cues such as nodding, smiling or pausing.

One caution is that some children developed an answering style that the authors have labelled a “shotgun” approach. In these cases the students were asked a question and gave multiple answers, only some of which were correct. It appears that the students gave as many responses as possible in the hope that one was right, without taking the time to synthesise or judge their responses. This “shotgun” approach was most likely to occur in the interview format because less effort was required to produce a verbal answer than to give a written response.

A task called Weigh-up demonstrates another situation where students gave better verbal than written responses. There were two versions of this task: one in the interview format and one in the station format. Students were required to order four boxes by weight using a set of balance scales. As part of the task students were asked to explain their method. In the interview version, 49% of the students gave a clear and appropriate explanation of the method they used to order the four boxes, compared to 23% in the station version.

Another advantage of having a teacher present to administer the tasks was that there were fewer “wasted” answers in the interview format than in other formats. In a task where students were asked to define “floating”, 13% of students in the station format and 21% in the short answer format restated the question in their answer, for example “floating means it floats” or “it doesn’t sink”. In the interview setting, the teacher was able to prompt for a definition (“what do you mean by ‘floats’?”), which enabled the gathering of more useful information.

In the interview format the teacher was also available to probe responses where students gave answers that were unexpected. In an algebra task, students were asked to complete a pattern to show how many jugs holding four glasses of juice would be needed to give 46 drinks. In two cases students correctly completed the pattern up to 12 jugs, then recorded an unexpected answer for the total number of jugs required. When asked to explain their answer, they indicated that they had allowed for spilling some juice in pouring the drinks.

The interview format was not, however, immune to problems. In this format there were two people interpreting the task, not one. This meant that a teacher’s misunderstanding of the task (either of the requirements for administering it or of how to do it themselves) could in turn mislead student responses through the teacher’s prompting, requests for information or intonation.

The Multiple-choice format

Multiple-choice questions produced higher success rates than the other formats when the answer was very complex. It appears that a complex explanation was easier to recognise than to construct. For example, students were asked the question “why do earthquakes happen?” An answer such as “it is something to do with tectonic plates” would not be considered a complete explanation in any constructed-response version. However, the knowledge that it was something to do with tectonic plates may have enabled a student to recognise the correct multiple-choice option. In this particular case, the scientific concept was intellectually complex, requiring specialist knowledge to give a complete explanation.

Students also obtained greater success on multiple-choice questions than on other formats for questions whose multiple-choice options included one that was more detailed, rather than intellectually more complex, than the verbal answers given by students. For example, when students were asked how they might measure a wiggly line representing a river on a map, a common answer was that they would use a ruler. Superior answers indicated the need to use something flexible to follow the course of the river, and then place the flexible item against a ruler to measure the length. The multiple-choice options included one with just a ruler, and one with a ruler and a piece of string. Students were more likely to select the option with the ruler and the piece of string in the multiple-choice version than to create this response in any other format. There are two reasons for this tendency. Firstly, when presented with a more detailed answer than they would have thought of, students are inclined to select it because it seems better and so must be correct. Secondly, some students, upon reading the correct option, realise that a more accurate answer could be obtained than they had initially thought of.

Researchers have noted that some items are more difficult in the constructed-response format than in the multiple-choice format (Katz et al., 2000). In a task that required students to find the mean (average) temperature from a list of six temperatures, one third of students selected the right answer in the multiple-choice version. However, only 17% of students in the interview format and 8% in the short answer format gave the correct answer. A further 6% of students in the interview format and 8% in the short answer format used the correct method but obtained the wrong answer. Even allowing for students’ computational errors, having the answer available as a multiple-choice option was an aid to students working out the problem.

When we interviewed children about their solution strategies for some of the multiple-choice questions, as many as one quarter of the students reported that they had made a guess. Guessing was a strategy consistently reported by students; indeed, there was only one question where none of the students identified this strategy. This result is similar to that of Clements and Ellerton (1995), who found that one quarter of the responses to the multiple-choice and short answer questions they analysed inadequately assessed student understanding. It is clear that a correct answer on a multiple-choice question does not equate with a correct and complete understanding of the underlying concept.

It is also important to note that markers often looked for an answer in the interview, station and short answer formats that included more detail than was given in the multiple-choice option considered correct. Despite the finding that students were more likely to get the multiple-choice question correct when the answer required was very complex, multiple-choice questions can be seen as having the effect of “dumbing down” the content of the curriculum to bite-sized pieces.

The Station format and the Short Answer format

In both the station and short answer formats, students had to construct their own response in written form. This appears to be a disadvantage of these formats, as students were generally awarded lower scores for the quality of their explanations in these two formats than in the interview format. Explanations often require more extended answers than other types of questions, and the more writing that is required for an answer, the more factors other than a student’s scientific or mathematical knowledge and skills can affect their answer. The length of a response does not necessarily indicate the quality of the answer, but in many cases a brief written response did not cover the number of aspects given in a verbal response. Students who have difficulty recording their ideas accurately, or who are unsure of the spelling or grammar to be used, can be reluctant to submit a full answer. However, the objectives of mathematics and science instruction include the objective that students should be able to communicate their ideas. Thus these formats, station and short answer, allow us to examine mathematics and science content, as well as the student’s ability to communicate that content in a written form.

An advantage of the station format over the short answer format was the availability of equipment. Having equipment in the station format helped students to think through the problem, and it also helped make the task more concrete. The use of equipment is discussed further in the Equipment section.

The short answer format was the format in which students were least likely to answer the question correctly. This is not surprising, as the short answer format offered the least support for the students: it does not have supporting equipment like the station and interview formats, the psychological and intellectual support of the teacher in the interview format, or the support of the responses provided to select from in the multiple-choice format.

Mathematical and scientific processes

From the sections above we can see that there were quite a number of questions where a similar proportion of students correctly answered the question in each format. In order to determine what types of questions suited specific task formats, a method of categorising the questions was developed. Initially, questions were examined using a version of Bloom’s Taxonomy (Bloom, 1956). However, we found this taxonomy was not specific enough to categorise mathematical and scientific processes and skills. We then developed our own categories, which were more specific to the processes used in science and mathematics activities. These were:

1. Recall knowledge: For knowledge-based questions where students recall specific facts or information. It also includes some occasions where students might have to apply their general knowledge to a specific situation.

2. Calculate/Follow formulae: For questions that require the students to apply an algorithm or formula to perform calculations or solve problems. There could be some recall required, particularly the recall of formulae and how to apply them.

3. Experiment/Investigate: For activities that require students to work through a series of scientific or mathematical procedures in order to come to a solution. It includes observing, using tools, communicating the experimental technique, following set procedures, and applying information in a multi-step procedure.

4. Compare or contrast: For situations where there is more than one method, object, result or idea to compare. Students may have to experiment or investigate, or to calculate or follow formulae, but the key feature is that a comparison is required.

5. Conclude/Explain/Justify: For responses that require ideas to be synthesised in order to provide an appropriate explanation, conclusion or justification.

Knowledge Recall

About half of the fifteen questions in this category showed no statistically significant difference among the proportions of students who correctly answered the questions in each format. In most of the cases where there was a statistically significant difference, the interview format had a higher proportion of students correctly completing the question. There were only three exceptions where students did better in other formats: one involving a multiple-choice question and two involving short answer questions. These three questions were very simple factual questions (you either knew the answer or you did not).


There was little difference in student scores between the formats in which students had to create their own answer, when the item of knowledge to be recalled did not require an extended written answer. Where more students succeeded with the interview version, this appeared to be due to the teacher being able to clarify misunderstandings of the questions or probe incomplete answers. When the question required an extended written answer, however, students sometimes performed better in the interview version. Students were also more likely to use technical terms such as “tectonic plates” when giving answers verbally, possibly because spelling was not an issue. Thus for simple recall knowledge questions it does not matter which format is used, so the short answer or multiple-choice formats would be suitable. However, for more complex knowledge questions the interview format is more appropriate.

Calculate or Follow Formulae

Exactly half of the twenty-six questions in this category showed no statistically significant difference among the proportions of students who correctly answered the questions in each format. Where differences were found, any version in the interview format had a higher proportion of students correctly completing the question. The teachers’ presence may have helped students to persevere and feel confident with the calculations they were doing. Overall, however, there were not many questions where a higher proportion of students correctly answered the interview version. Therefore it seems that for this category it is not necessary to ask questions in the interview format to ensure that students get the opportunity to show all they know and can do. If there was no interview version, the version offering the greatest support for students (i.e. the greatest number of clues to the answer), either with equipment or with multiple-choice options, had the highest proportion of students getting the answer correct.

Experiment or Investigate

There were thirty-two questions in this category. About half of these had parallels written in the short answer or multiple-choice format, without the equipment to aid the students in forming their answer. Nearly all of these questions (with only two exceptions) were answered correctly by a higher proportion of the students who were given the interview version of the task. The two exceptions did not require equipment to complete (one was an estimate of a length and the other could be solved either by a mathematical calculation or by counting). An inspection of the two natural formats for this category, interview and station, shows that quite a few questions (63%) showed no statistically significant difference between the two formats. However, where there was a difference, the interview format always had a higher proportion of students getting the question correct.

Compare or Contrast

About half of the fifteen questions that fell in this category showed no statistically significant differences among the formats. Of the remaining questions, all but one were answered correctly by a higher proportion of the students who were given the interview version of the task. The exception was a task whose multiple-choice version was not exactly parallel, containing information in a written form that students given the other versions would have to determine for themselves from the equipment provided. Note that students could not be given a written version of this task without providing this information. This task demonstrates a case where a written version should not be used as a surrogate for a version with equipment, as it would not give examiners a correct idea of a student’s abilities.

Conclude/Explain/Justify

About one third of the thirty-nine questions that fell in this category showed no statistically significant differences among the formats. Most of the remaining questions were answered correctly by a higher proportion of the students who were given the interview version of the task. There were seven exceptions where students did better in other formats: one short answer, two station, and four multiple-choice questions. In particular, the questions answered correctly by a higher proportion of students given the multiple-choice version were ones requiring complex explanations that were easier to recognise than to explain. Also, markers perhaps looked for a higher standard in the answers given by students in the interview version than could comfortably be expressed in the single line of a multiple-choice option. The interview format generally seems to be better for these types of questions, as students seemed to struggle with writing down complex ideas, finding it easier to explain something orally, especially with the use of gestures, than to record a grammatically correct written answer.

The five different categories of task demands are associated with different types of questions and required different types of marking rubric. When students were asked to recall knowledge, to calculate or follow formulae, or to follow instructions for an investigation or experiment, a relatively straightforward marking schedule that allowed for correct, partially correct or incorrect answers was used. When students were asked to design an experiment, to compare or contrast, or to conclude, justify or explain, a different approach to marking was required that allowed for widely varied responses and required sophisticated judgements about the quality of the answers given. This made these types of questions more difficult, slower, and therefore more expensive to mark.

Equipment

One of the major differences between the written task formats (multiple-choice and short answer) and the other two formats (interview and station) was the availability and use of equipment in the latter formats. According to the mathematics curriculum document in New Zealand, “Teachers know that students are capable of solving quite difficult problems when they are free to use concrete apparatus to help them think the problems through” (New Zealand Ministry of Education, 1992, p. 13). So what impact did having equipment available have on the performance of students in the different formats?

The Circuits task is a clear example of students being better able to do a task when they had the equipment. Two thirds of students were able to correctly draw a diagram showing how a bulb and battery should be connected to make the bulb glow when they had the equipment, that is, in the station and interview versions. Only one third of students completing the multiple-choice version selected the correct diagram, and one sixth of students completing the short answer version drew an appropriate diagram. In this task, equipment was being used to test a valued skill, rather than the surrogate measure used in the multiple-choice and short answer versions. Using equipment in this task brought it more into line with the way these skills would be used in the classroom and in the real world.

While there was equipment available in most of the tasks, the reasons for including it and the use made of it varied greatly. Some tasks required students to use the equipment to aid an explanation. Sometimes this showed up their confused thoughts, or exposed the lack of underlying knowledge to back up a glib response. For example, in Maths Adviser students were asked if 5x2 was the same as 2x5. In the interview version, students were subsequently asked to demonstrate their reasoning using some blocks. This demonstration often exposed the fact that they did not fully understand the concepts involved.

For some tasks, the provision of equipment led to greater variety in the answers as students moved from mathematical ways of thinking into practical or more common-sense ways of thinking (Cooper, 1994). For example, in Jugs of juice, when asked how many cups of water could be poured from the jug, some students found 4 and some found 5, and their subsequent answers varied depending on this first answer. They also included other common-sense ideas in their answers because of the context. For example, when students were asked how many jugs of juice would be needed to obtain 46 glasses of juice, some children said eleven and a half, some said twelve, and some said twelve with two glasses remaining. As mentioned earlier, some students also allowed for spillage in their calculations in this task.

In some tasks, students were given equipment that modelled the real-world situation to help them think through the problem. It also helped make the task more concrete. Some of the questions could not be asked in a format without equipment, but where parallels were created without equipment, fewer students did well in those versions.

It seems clear that having equipment available in a task usually aided the students and generally gave examiners a clearer understanding of the student’s knowledge, abilities and ideas. However, it is important to think about the influence the equipment could have upon a student’s ability to give the “correct” or accepted answer if the equipment is faulty, or if it causes students to move onto ideas tangential to the main idea being examined.

Student Opinions

Students were asked in two ways about their enjoyment of the tasks. Firstly, after each session they were asked to indicate how they felt about each task. Once they had completed all of the sessions, they were then asked which type of task format they preferred.

After each session, students were given a list of the tasks they had completed. Next to the title of each task were three faces: a smiley face, a neutral face, and a face with a frown. Students were asked to circle the face that best showed how they liked the task. Students can approach this question in two ways. They can give a smile for tasks they liked and a frown for tasks they hated. Alternatively, they can compare the tasks to each other, so that a frown means the task was the least liked; in this second approach they may still have thought the task was acceptable, just not as good as the others, and hence given it a frown. As a result of this possible ambiguity, it is safest to examine the proportion of smiley faces given to a task. In order to compare the versions of tasks, we used a measure of popularity: a task is called popular if more than 70% of the students circled the smiley face, and unpopular if fewer than 30% circled the smiley face.

In general, students enjoyed being involved in the study. More interview and station tasks were classified as popular than written versions. The most popular tasks required experimentation, investigation or the use of interesting equipment.

Once the students had completed all their tasks they were asked which type of task format they liked the most, and which format they thought they were better at. A higher proportion of students liked the station format the most (54%), and similarly a higher proportion of students thought they were best at the station format (49%). In comparison, 30% of students preferred the interview format and 26% thought they were better at interview tasks. Few students preferred or thought they were better at the written versions (11% preferred short answer and 5% multiple-choice; 18% thought they were better at multiple-choice and 7% at short answer questions).

The fact that students thought they were better at the station format is surprising, as we found that more students did well in the interview format. It could be that they were more conscious of the gaps in their knowledge in the interview. They also probably felt less “under the spotlight” when they were quietly doing the station tasks on their own, had the opportunity for extended experimentation with the equipment, and were less restricted by the close supervision occurring in the interview situation.

Gender Differences

Because the number of boys and girls in each group was typically about 30, it was essential that statistical tests be employed, so that the interpretation of comparisons of the performance of girls and boys was not unduly influenced by one or two students (two students is about 7% of each group). Very few questions in each format, only 5%, showed significant gender differences. This level could arise randomly, given the level set for statistical significance (5%). Half of the few statistically significant differences favoured boys, and half favoured girls. No task showed a consistent pattern of gender difference across all of its questions, although the Natural Disasters task, looking at earthquakes and volcanoes, accounted for three of the gender differences in favour of boys. In the Night and Day task, one question had a gender difference in favour of boys in the interview version, but in the multiple-choice version another question had a gender difference favouring girls. No question format appeared to enhance or diminish the performance of girls or boys. It seems fair to conclude that the gender of the students did not unduly influence their performance, regardless of the format in which the questions were asked.
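
As a hypothetical illustration of why formal tests were needed with groups of this size (the counts below are invented, and the paper does not name the test it used), Fisher’s exact test shows that a difference of two students between groups of 30 boys and 30 girls falls well short of the 5% significance level.

    # Hypothetical illustration: with roughly 30 boys and 30 girls per group,
    # a gap of two students is far from statistically significant, so such
    # differences should not be read as evidence of a gender effect.
    from scipy.stats import fisher_exact

    boys_correct, boys_n = 20, 30     # invented counts
    girls_correct, girls_n = 18, 30   # invented counts

    table = [
        [boys_correct, boys_n - boys_correct],      # boys: correct, incorrect
        [girls_correct, girls_n - girls_correct],   # girls: correct, incorrect
    ]
    _odds_ratio, p_value = fisher_exact(table)
    print(f"p = {p_value:.2f}")  # well above 0.05, so not significant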


Reading Ability

As mentioned earlier, traditional methods of assessment have been criticised as being difficult for students with poor reading skills. If the intention of a task is to test mathematics or science skills, but a student cannot read or decode the questions, then it is not a fair test (Clements & Ellerton, 1996). The students in this study were assessed for their reading ability and divided into three groups based on this assessment: those who read below their expected chronological age-level; those who read at their expected chronological age-level; and those who read above their expected chronological age-level.

Table 3: Proportion of questions in each format with statistically significant differences between groups of students based on reading ability

Format | Proportion of questions with statistically significant differences between reading groups | Proportion of questions with poorest readers doing least well
Interview | 18% | 8%
Station | 15% | 11%
Short answer | 26% | 18%
Multiple-choice | 18% | 6%

If any format could be expected to be easier for students with poor reading ability, it would be the interview format. When each question was assessed based on reading classification, it was found that while 18% of the interview questions showed statistically significant differences between the three reading groups, only 8% had the poorest readers doing least well as a group. A similar pattern was found for the other formats, as shown in Table 3. A higher proportion of short answer questions had the poorest readers doing least well as a group, and this format was the most difficult overall, as shown earlier. These results suggest that reading ability may have some influence on student results, depending on the format. Interestingly, the multiple-choice format had the lowest proportion of questions showing a statistically significant difference against the poorest readers.

However, reading ability may be less of an issue at higher levels of schooling and more of an issue in earlier years, where reading ability within one classroom may be quite varied. There were few really poor readers in this study, with at most twenty percent of students reading below the expected chronological age-level; the rest of the students read at or above the expected level. The analysis in this study indicates that there is scope for further research here, and that fairness in assessment needs to be investigated further, particularly with younger children.

Summary

The four assessment formats discussed in this study are not the only formats that should be included in an assessor’s repertoire; rather, they represent four formats from the spectrum of possible formats. This study shows that it is unwise to rely on one format only, particularly if that format is multiple-choice, and that the interview format will, for many tasks, elicit the most comprehensive response from students. If examiners use only one format, particularly one of the written methods, they may find themselves testing related but different objectives from those they really care about, because the related objectives are easier to assess.

The interview format elicited fuller explanations and more features, and the teacher’s presence helped clarify unexpected, difficult to understand, or single-word responses. The interview format was found to be more suitable for complex knowledge recall questions, tasks requiring experimentation or investigation, and tasks requiring the use of higher-order thinking skills such as comparing, contrasting, explaining, or justifying. For many students, complex ideas were easier to explain orally than in written form. One negative consequence of using the interview format, however, was the tendency of some students to give many responses to a single question in the hope that at least one would be correct. A further negative of the interview format is the cost of administration, which requires a teacher and equipment for each student.

Any format seems suitable for simple recall knowledge questions and for questions or tasks that required students to calculate or follow formulae. Given the cost of administering interviews, questions requiring simple recall or calculations, which do not require the presence of a teacher, could be asked in another format.

The multiple-choice format had benefits for students in that it was easier for them to recognise an answer than to create it for themselves, and if this failed the multiple-choice format gave them a chance to guess. The multiple-choice format therefore did not give as accurate a picture of the students’ abilities and knowledge. Multiple-choice questions also did not work accurately as surrogates for tasks requiring equipment or experimentation, because extra information sometimes had to be given that was not available to, or desirable to give to, students in other formats.

Having equipment available in a task, in either a station or an interview version, had several benefits. It often meant that desired skills rather than surrogates were tested, and the task became more educationally relevant as well as more like what would be encountered in the real world. Having equipment in a task aided student explanations and exposed lack of understanding. However, if equipment is used, examiners should be aware that it may cause students to move into more common-sense ways of thinking or lead them onto a tangent. Examiners should also be aware that faults in equipment will lead to faults in results.

Students enjoyed the station format most and thought they were best at this format. This study revealed almost no gender differences and no discernible gender bias. There was some evidence that students with poor reading skills may be disadvantaged by the written formats; this area needs further investigation.

Recommendations

Several recommendations for assessment procedures arose from this study. These recommendations are summarised below, starting with the multiple-choice format.

Multiple-choice Format

The multiple-choice format should not be used in formative assessments, because the responses do not necessarily reflect the knowledge and abilities of the student. The multiple-choice format should only be used for summative assessments when (i) it is acceptable for students to have the opportunity to guess and (ii) it is not important that students articulate the answer for themselves. The multiple-choice format is suitable for questions where the process used to find the answer is not being tested or where no process is involved, for example where questions have unambiguous, factual answers. The multiple-choice format should ideally be used in conjunction with other formats, to minimise narrowing of curricula. The multiple-choice format should only be used when no equipment is necessary or would be used in real-life contexts to determine the answer.


Short Answer Format

The short answer format should only be used for formative and summative assessments when questions have unambiguous, factual answers or when ideas have simple explanations. The short answer format can also be used when comparisons can reasonably be made between pictorial representations of objects. The short answer format is suitable for questions where the process used to obtain an answer can simply and easily be recorded. The short answer format should preferably be used in conjunction with other formats in which students use equipment, investigate, experiment or work in groups. The short answer format should not be used when equipment is necessary or would normally be used to determine the answer.

Station Format

The station format is most suitable for formative and summative assessments that require equipment or physical investigation, where the assessment objectives call for the creation of a physical product, or where the process used to obtain an answer is to be assessed and can be recorded by the student or inferred from the product. However, where explanation is required, the station format should only be used when relatively brief explanations are needed, especially in the earlier years of schooling.

Interview Format

The interview format is most suitable for formative and summative assessments when complex explanations are required, when multiple responses are required for a single question, or when process as well as product is to be assessed. An important feature of the interview format is that one examiner is available for each student; this has the benefit of allowing the pace of the task or the availability of equipment to be controlled, as well as allowing for flexible questioning depending on previous answers. The interview format should also be used when the manipulation of equipment along with an explanation may reveal a lack of understanding or enhance an explanation in a way that a written answer may not. The interview format should also be used when an excessive reading or writing load may hamper a student’s ability to show what they know and can do. Finally, the interview format should be used when an important objective cannot be assessed in any other format.

References

Adams, R. J., and Wilson, M. (1996). Evaluation progress with alternative assessments: a model for Title 1. In Kane, M. B., and Mitchell, R. (eds), Implementing performance assessment: promises, problems, and challenges, 39-59. New Jersey: Lawrence Erlbaum Associates, Inc.

Bateson, D. J., Nicol, C. C., and Schroeder, T. L. (1991). Alternative assessment and tables of specifications for the Third International Mathematics and Science Study. Third International Mathematics and Science Study (TIMSS) Working Paper.

Bloom, B. S. (ed) (1956). Taxonomy of educational objectives: the classification of educational goals. Handbook 1, Cognitive domain. New York: McKay.

Brown, R. (1992). Testing and thoughtfulness. In Burke, K. (ed), Authentic assessment: a collection, 53-58. Australia: Hawker Brownlow Education.

Carr, K., and Ritchie, G. (1991). Evaluating learning in mathematics. SET: research information for teachers, 1, 15.

Clements, M. A., and Ellerton, N. F. (1995). Assessing the effectiveness of pencil-and-paper tests for school mathematics. Eighteenth annual conference of the Mathematics Education Research Group of Australasia (MERGA).

Clements, M. A., and Ellerton, N. F. (1996). Mathematics education research: past, present and future. Bangkok: UNESCO Principal Regional Office for Asia and the Pacific.

Cooper, B. (1994). Authentic testing in mathematics? The boundary between everyday and mathematical knowledge in national curriculum testing in English schools. Assessment in Education, 1(2), 143-166.

Crooks, T. (2001). New Zealand’s National Education Monitoring Project: Rich information from multiple task formats. Paper presented at the British Educational Research Association conference, Leeds, UK, September 13-15.

de Lange, J. (1992). Assessment: no change without problems. In Stephens, W. M., and Izard, J. F. (eds), Reshaping assessment practices: assessment in mathematical sciences under challenge, 46-76. Australia: Australian Council for Educational Research Ltd.

Eley, L., Caygill, R., and Crooks, T. (2001). Designing and implementing a study to examine the effects of task format on student achievement. Paper presented at the British Educational Research Association conference, Leeds, UK, September 13-15.

Flockton, L. C. (1999). School-wide assessment: National Education Monitoring Project. Wellington, New Zealand: New Zealand Council for Educational Research.

Forbes, S. D. (1996). Curriculum and assessment: hitting girls twice? In Hanna, G. (ed), Towards gender equity in mathematics education, 71-92. Dordrecht: Kluwer Academic Publishers.

Gay, S., and Thomas, M. (1993). Just because they got it right, does it mean they know it? In Webb, N. L., and Coxford, A. F. (eds), Assessment in the mathematics classroom. Reston, Va.: National Council of Teachers of Mathematics.

Gipps, C. (1994). Beyond testing: towards a theory of educational assessment. London: Falmer Press.

Griffin, P., and Nix, P. (1991). Educational assessment and reporting: a new approach. Sydney: Harcourt Brace Jovanovich.

Gronlund, N. E. (1993). How to make achievement tests and assessments. Massachusetts: Allyn and Bacon.

Hambleton, R. K., and Murphy, E. (1992). A psychometric perspective on authentic measurement. Applied Measurement in Education, 5(1), 1-16.

Haney, W., and Madaus, G. (1992). Searching for alternatives to standardized tests: whys, whats, and whithers. In Burke, K. (ed), Authentic assessment: a collection, 87-99. Australia: Hawker Brownlow Education.

Harmon, M. (1999). Performance assessment in the Third International Mathematics and Science Study: an international perspective. Studies in Educational Evaluation, 25, 243-262.

Katz, I. R., Bennett, R. E., and Berger, A. E. (2000). Effects of response format on difficulty of SAT-mathematics items: it’s not the strategy. Journal of Educational Measurement, 37(1), 39-57.

Khattri, N., and Sweet, D. (1996). Assessment reform: promises and challenges. In Kane, M. B., and Mitchell, R. (eds), Implementing performance assessment: promises, problems, and challenges, 1-21. New Jersey: Lawrence Erlbaum Associates, Inc.

Linn, R. L., Baker, E. L., and Dunbar, S. B. (1991). Complex, performance-based assessment: expectations and validation criteria. Educational Researcher, 20(8), 15-21.

New Zealand Ministry of Education (1992). Mathematics in the New Zealand Curriculum. Wellington, New Zealand: Learning Media.

New Zealand Ministry of Education (1994). Assessment policy to practice. Wellington, New Zealand: Learning Media.

Niss, M. (1993). Investigations into assessment in mathematics education. Dordrecht: Kluwer Academic Publishers.

Peressini, D., and Bassett, J. (1996). Mathematical communication in students’ responses to a performance-assessment task. In Elliot, P. C., and Kenney, M. J. (eds), Communication in mathematics, K-12 and beyond, 146-158. Reston, Va.: National Council of Teachers of Mathematics.

Resnick, D. P., and Resnick, L. B. (1996). Performance assessment and the multiple functions of educational measurement. In Kane, M. B., and Mitchell, R. (eds), Implementing performance assessment: promises, problems, and challenges, 23-38. New Jersey: Lawrence Erlbaum Associates.

Romberg, T. (1992). Concerns about mathematics assessment in the United States. In Stephens, W. M., and Izard, J. F. (eds), Reshaping assessment practices: assessment in mathematical sciences under challenge, 35-45. Australia: Australian Council for Educational Research Ltd.

Rowntree, D. (1991). Assessing students: how shall we know them? New York: Nichols Publishing Company.

Shepard, L. A. (1992). Why we need better assessments. In Burke, K. (ed), Authentic assessment: a collection, 37-47. Australia: Hawker Brownlow Education.

Stiggins, R. J. (1991). Facing the challenges of a new era of educational assessment. Applied Measurement in Education, 4(4), 263-273.

Wiley, D. E., and Haertel, E. H. (1996). Extended assessment tasks: purposes, definitions, scoring, and accuracy. In Kane, M. B., and Mitchell, R. (eds), Implementing performance assessment: promises, problems, and challenges, 61-89. New Jersey: Lawrence Erlbaum Associates, Inc.