A COMPILATION MATERIAL OF
ENGLISH LANGUAGE
TESTING
Compiled By:
BERTARIA SOHNATA HUTAURUK
Prodi Pendidikan Bahasa Inggris, FAKULTAS KEGURUAN DAN ILMU PENDIDIKAN
UNIVERSITAS HKBP NOMMENSEN
PEMATANGSIANTAR
2015
English Language Testing 1
INTRODUCTION
This book is a compilation of material for English Language
Testing. It presents a general outline of the subject, compiled as an
introduction for students at the S1 (undergraduate) level. The
collection covers Testing, Assessment, Measurement, Evaluation, Kinds
of Testing, Validity and Reliability of tests, and
Interpreting Test Scores. Hopefully this compilation
will be useful for students; it is by no means perfect, so any
criticism is welcome.
Compiled by:
Bertaria Sohnata Hutauruk
CONTENTS
1. What is the difference between assessment and evaluation?…………. 1
2. Testing, Assessment, Measurement and Evaluation…………………... 4
3. Informal vs. Formal Assessments: Tests are not the only end-all-be-all
of how we assess.………………………………………………………….. 6
4. Norm-referenced test and Criterion-referenced test…………………... 11
5. Discrete Point Testing and Integrative Testing………………………… 19
6. Communicative Language Testing……………………………………… 22
7. Testing Communicative Competence…………………………………… 24
8. Testing Reading and Writing……………………………………………. 30
9. Performance-Based Assessment……………………………………….... 32
10. Validity and Reliability…………………………………………………... 42
11. Constructing Tests……………………………………………………….. 61
12. Types of Listening Testing………………………………………………. 75
13. Testing Grammar………………………………………………………… 97
14. Interpreting Test Score…………………………………………………... 103
1
What is the difference between
assessment and evaluation?
There is a lot of confusion over these two terms as well as other terms associated with
assessment, testing, and evaluation. The big difference can be summarized as this:
assessment is information gathered by the teacher and student to drive instruction, while
evaluation is when a teacher uses some instrument (such as the CMT or an end-of-unit
test) to rate a student so that this information can be used to compare or sort students.
Assessment is for the student and the teacher in the act of learning while evaluation is
usually for others.
“If mathematics teachers were to focus their efforts on classroom assessment that is
primarily formative in nature, students’ learning gains would be impressive. These
efforts would include gathering data through classroom questioning and discourse,
using a variety of assessment tasks, and attending primarily to what students know and
understand” (Wilson & Kenney, page 55).
Assessment is the more important of the two because it is integral to instruction.
Unfortunately, it is being hampered by the demands of evaluation. The biggest demand
for evaluation is grading, or report cards. There would be nothing wrong with that,
except that historically evaluation (grades) was determined exclusively by computing a
student's numeric average on paper-and-pencil assessments called quizzes or tests.
“Most experienced teachers will say that they know a great deal about their students in
terms of what the students know, how they perform in different situations, their attitudes
and beliefs, and their various levels of skill attainment. Unfortunately, when it comes to
grades, they often ignore this rich storehouse of information and rely on test scores and
rigid averages that tell only a small fraction of the story.
The myth of grading by statistical number crunching is so firmly ingrained in schooling
at all levels that you may find it hard to abandon. But it is unfair to students, to parents,
and to you as the teacher to ignore all of the information you get almost daily from a
problem-based approach in favor of a handful of numbers based on tests that usually
focus on low-level skills” (Van de Walle and Lovin, page 35).
The reason this is a problem is that students learn what is valued and they strive to do
well on those things. If the end-of-unit tests are what are used to determine your grade,
guess what kids want to do well on, the end-of-unit test! You can do all the great
activities you want, but if the bottom line is the test, then that is what is going to be
valued most by everyone: teachers, students, and parents, alike.
What we need to get better at is valuing the day-to-day
activities we do and learn how to use them for both
assessment and evaluation.
This will not be an easy task.
It is very different from what we are used to doing. We are used to teaching and then
assessing. In reality, the line between teaching and assessment should be blurred
(NCTM, 2000). “Interestingly, in some languages, learning and teaching are the same
word”(Fosnot and Dolk, page 1). We need to assess on a daily basis to give us the
information to make choices about what to teach the next day. If we just teach the
whole unit and wait until the end-of-unit test to find out what the kids know, we may be
very unhappily surprised. On the other hand, if we are assessing on a daily basis
throughout the unit, we do not need to average all those assessments to come up with a
final evaluation. Instead, we could just use the most recent assessments to make that
evaluation. In this way, we do not penalize the student who knew little at the
beginning of the unit and worked hard to learn what we felt were the big ideas.
Instead we rate students on where they are when the unit is finished. This gives a more
accurate report, or evaluation, of where they are performing when the evaluation is made.
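The arithmetic behind this choice can be made concrete with a small sketch. This is a hypothetical illustration only, not a prescribed grading formula: the scores, and the decision to keep the last three assessments, are assumptions for the sake of the example.

```python
# A hypothetical record of one student's assessment scores across a unit,
# in chronological order: weak at the start, strong by the end.
scores = [40, 55, 60, 80, 90, 95]

# Traditional evaluation: average every score from the whole unit.
overall_average = sum(scores) / len(scores)

# Alternative evaluation: use only the most recent assessments,
# which reflect where the student is when the unit finishes.
recent = scores[-3:]
recent_average = sum(recent) / len(recent)

print(f"Average of all assessments:    {overall_average:.1f}")   # 70.0
print(f"Average of recent assessments: {recent_average:.1f}")    # 88.3
```

The student who grew over the unit is reported at 88.3 rather than 70, which is the chapter's point: the most recent evidence describes where the learner actually is when the evaluation is made.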
2
Testing, Assessment, Measurement and
Evaluation
The definitions are:
Test: a method of determining a student's ability to complete certain tasks or demonstrate
mastery of a skill or knowledge of content. Some types would be multiple-choice tests
or a weekly spelling test. While the term is commonly used interchangeably with assessment,
or even evaluation, it can be distinguished by the fact that a test is one form of an
assessment.
Assessment: the process of gathering information to monitor progress and make
educational decisions if necessary. As noted, an assessment may include a test, but it also
includes methods such as observations, interviews, behavior monitoring, etc.
Evaluation: the procedures used to determine whether the subject (i.e. the student) meets
preset criteria, such as qualifying for special education services. This uses assessment
(remember that an assessment may be a test) to make a determination of qualification
in accordance with predetermined criteria.
Measurement: beyond its general definition, measurement refers to the set of procedures,
and the principles for using those procedures, by which performance is quantified. Examples
in educational evaluation would be raw scores, percentile ranks, derived scores, standard scores, etc.
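Two of the measurement concepts just listed, standard scores and percentile ranks, can be sketched in a few lines. The class scores below are invented purely for illustration.

```python
from statistics import mean, pstdev

# Hypothetical raw scores for a class of ten students.
raw_scores = [52, 60, 61, 65, 70, 74, 75, 80, 88, 95]

def z_score(x, scores):
    """Standard score: how many standard deviations x lies from the group mean."""
    return (x - mean(scores)) / pstdev(scores)

def percentile_rank(x, scores):
    """Percentage of scores in the group that fall below x."""
    below = sum(1 for s in scores if s < x)
    return 100 * below / len(scores)

print(z_score(88, raw_scores))          # above the mean, so positive
print(percentile_rank(88, raw_scores))  # 80.0: 8 of the 10 scores are lower
```

A raw score of 88 means little on its own; as a derived score it becomes "about 1.3 standard deviations above the class mean" or "the 80th percentile of this group", which is the kind of interpretation measurement procedures exist to support.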
3
Informal vs. Formal Assessments: Tests
are not the only end-all-be-all of how
we assess.
Formal assessment
Formal assessment uses formal tests or structured continuous assessment to evaluate a
learner's level of language. It can be compared to informal assessment, which involves
observing the learners' performance as they learn and evaluating them from the data
gathered.
Example
At the end of the course, the learners have a final exam to see if they pass to the next
course or not. Alternatively, the results of a structured continuous assessment process
are used to make the same decision.
In the classroom
Informal and formal assessments are both useful for making valid and
useful assessments of learners' knowledge and performance. Many teachers combine the
two, for example by evaluating one skill using informal assessment such as observing
group work, and another using formal tools, for example a discrete item grammar test.
Formative assessment
Formative assessment is the use of assessment to give the learner and the teacher
information about how well something has been learnt so that they can decide what to
do next. It normally occurs during a course. Formative assessment can be compared
with summative assessment, which evaluates how well something has been learnt in
order to give a learner a grade.
Example
The learners have just finished a project on animals, which had as a language aim better
understanding of the use of the present simple to describe habits. The learners now
prepare gap-fill exercises for each other based on some of their texts. They analyse the
results and give each other feedback.
In the classroom
One of the advantages of formative feedback is that peers can provide it.
Learners can test each other on language they have been learning, with the additional
aim of revising the language themselves. It has once been said that "Everybody is a
genius. But if you judge a fish by its ability to climb a tree, it will live its whole life
believing that it is stupid." Our students must be assessed relative to what their skills
are. This can be done through formal assessments, informal assessments, or a
combination of both.
I realized that beyond giving formal assessments (i.e. Summative assessments:
Quizzes, long tests, periodical exams, etc.), our main role as teachers is determined by
how we recognize our students' progress or stagnation through informal assessments (i.e.
formative assessments: portfolios, role play, record tracking, etc.). These methods allow
the teacher to easily maneuver where and how his or her instruction is going.
The result of a formal test (e.g. a long test) alone does not necessarily dictate the entire
academic ability of our students. When a student fails a formal test
(e.g. a periodical test), it does not mean we can already conclude that his entire
learning capability for that subject has failed as well.
Assessing students is not monopolized by just doing it formally (e.g. giving out tests,
quizzes, summative exams, etc.), but rather depends on the other informal assessments
(e.g. coaching sessions, reflective logs, fly-by-question and answers, etc.) that reinforce
formal ones.
There are many factors why a student could fail from a test (e.g. lack of sleep,
emotional and family distress, etc.), but there would only be few factors why he/she
would not be able to provide a reflective insight on the lesson. But how do we separate
formal assessments from informal ones?
When are informal assessments useful (versus formal assessments)?
The most applicable time to use informal assessments is when:
1. We want to gauge the students' cognitive, affective and manipulative skills in the
simplest way possible. We ask students to recite or write down essays to easily
determine if they understood a specific lesson well or poorly, if they are enthusiastic or
bored with the lesson, if they are already familiar or completely unfamiliar with the
topic, etc.
2. We deem that the results of the formal examinations are not enough to give a
concluding mark for the students' performance. If a specific student performs
excellently in class activities but suddenly fails a summative test, it could tell us that
there may be a deviation between our formal and our informal assessments, or that
other factors were involved in the event (e.g. student factors: did not
review, physically/emotionally troubled, etc.).
How valuable are informal assessments? Can informal assessments be
good replacements for formal assessments?
Although informal assessments provide teachers with a solid basis for judging how the
students are performing, this does not imply that they can replace formal assessments.
The two should work hand in hand and interdependently; one should complement the
other. For instance, if we opt to use role plays and recitals to assess students'
communication skills informally, we should also align our formal exams with the
activities our students previously engaged in. In this way, we can ensure the validity and
fairness of our assessments. Moreover, we may find that these methods ease the burden
of analyzing, comparing, and understanding our students' "true" abilities.
We cannot just give (formal) tests or quizzes, in the same way that we cannot just
consume course time giving out (informal) class activities. Arriving at valid
and reliable grades for our students is a matter of making the most of both formal and
informal assessments.
To summarize: informal assessment as systematic observation = knowing what,
when and where we are going to assess + establishing criteria for how we will
assess students.
4
Norm-referenced test and Criterion-
referenced test
A norm-referenced test (NRT) is a type of test, assessment, or evaluation which yields
an estimate of the position of the tested individual in a predefined population, with
respect to the trait being measured. The estimate is derived from the analysis of test
scores and possibly other relevant data from a sample drawn from the population. That
is, this type of test identifies whether the test taker performed better or worse than other
test takers, not whether the test taker knows either more or less material than is
necessary for a given purpose. The term normative assessment refers to the process of
comparing one test-taker to his or her peers. Norm-referenced assessment can be
contrasted with criterion-referenced assessment and ipsative assessment. In a criterion-
referenced assessment, the score shows whether or not test takers performed well or
poorly on a given task, not how that performance compares to other test takers; in an
ipsative assessment, test takers are compared to their own previous performance over
time.
By contrast, a test is criterion-referenced when provision is made for translating the test
score into a statement about the behavior to be expected of a person with that score. The
same test can be used in both ways. Robert Glaser originally coined the terms norm-
referenced test and criterion-referenced test.
Standards-based education reform is based on the belief that public education should
establish what every student should know and be able to do. Students should be tested
against a fixed yardstick, rather than against each other or sorted into a mathematical
bell curve.
By requiring that every student pass these new, higher standards, education
officials believe that all students will achieve a diploma that prepares them for success
in the 21st century. Most state achievement tests are criterion-referenced. In other words,
a predetermined level of acceptable performance is developed and students pass or fail
in achieving or not achieving this level. Tests that set goals for students based on the
average student's performance are norm-referenced tests. Tests that set goals for
students based on a set standard (e.g., 80 words spelled correctly) are criterion-
referenced tests.
Many college entrance exams and nationally used school tests use norm-referenced
tests. The SAT, Graduate Record Examination (GRE), and Wechsler Intelligence Scale
for Children (WISC) compare individual student performance to the performance of a
normative sample. Test takers cannot "fail" a norm-referenced test, as each testtaker
receives a score that compares the individual to others that have taken the test, usually
given by a percentile. This is useful when there is a wide range of acceptable scores that
is different for each college.
By contrast, nearly two-thirds of US high school students will be required to pass a
criterion-referenced high school graduation examination. One high fixed score is set at a
level adequate for university admission whether the high school graduate is college
bound or not. Each state gives its own test and sets its own passing level, with states like
Massachusetts showing very high pass rates, while in Washington State, even average
students are failing, as well as 80 percent of some minority groups. This practice is
opposed by many in the education community such as Alfie Kohn as unfair to groups
and individuals who score lower than others.
Advantages and limitations
An obvious disadvantage of norm-referenced tests is that they cannot measure progress of
the population as a whole, only where individuals fall within it. Thus,
only measuring against a fixed goal can be used to gauge the success of an
educational reform program that seeks to raise the achievement of all students against
new standards, which seek to assess skills beyond choosing among multiple choices.
However, while this is attractive in theory, in practice, the bar has often been moved in
the face of excessive failure rates, and improvement sometimes occurs simply because
of familiarity with and teaching to the same test.
With a norm-referenced test, grade level was traditionally set at the level attained by the
middle 50 percent of scores. By contrast, the National Children's Reading Foundation
believes it is essential to ensure that virtually all children read at or above grade level by
third grade, a goal which cannot be achieved with a norm-referenced definition of grade
level.
Advantages to this type of assessment include that students and teachers know what to
expect from the test and just how the test will be conducted and graded. Likewise, all
schools will conduct the exam in the same manner, reducing such inaccuracies as time
differences or environmental differences that may cause distractions to the students.
This also makes these assessments fairly accurate as far as results are concerned, a
major advantage for a test.
Critics of criterion-referenced tests point out that judges set bookmarks around items of
varying difficulty without considering whether the items actually are compliant with
grade level content standards or are developmentally appropriate. Thus, the original
1997 sample problems published for the WASL 4th grade mathematics contained items
that were difficult for college educated adults, or easily solved with 10th grade level
methods such as similar triangles.The difficulty level of items themselves and the cut-
scores to determine passing levels are also changed from year to year. Pass rates also
vary greatly from the 4th to the 7th and 10th grade graduation tests in some states.
One of the limitations of No Child Left Behind is that each state can choose or construct
its own test, which cannot be compared to any other state. A Rand study of Kentucky
results found indications of artificial inflation of pass rates which were not reflected in
increasing scores in other tests such as the NAEP or SAT given to the same student
populations over the same time. Graduation test standards are typically set at a level
consistent with native-born four-year university applicants. An unusual side effect is that while
colleges often admit immigrants with very strong math skills who may be deficient in
English, there is no such leeway in high school graduation tests, which usually require
passing all sections, including language. Thus, it is not unusual for institutions like the
University of Washington to admit strong Asian American or Latino students who did
not pass the writing portion of the state WASL test, but such students would not even
receive a diploma once the testing requirement is in place.
Although the tests such as the WASL are intended as a minimal bar for high school, 27
percent of 10th graders applying for Running Start in Washington State failed the math
portion of the WASL. These students applied to take college level courses in high
school, and achieve at a much higher level than average students. The same study
concluded the level of difficulty was comparable to, or greater than that of tests
intended to place students already admitted to the college.
A norm-referenced test has none of these problems because it does not seek to enforce
any expectation of what all students should know or be able to do other than what actual
students demonstrate. Present levels of performance and inequity are taken as fact, not
as defects to be removed by a redesigned system. Goals of student performance are not
raised every year until all are proficient. Scores are not required to show continuous
improvement through Total Quality Management systems. Disadvantages include that
norm-referenced assessments measure where students currently are by measuring
against where their peers currently are, instead of against the level that students
should be at.
A rank-based system produces only data that tell which students perform at an
average level, which students do better, and which students do worse, contradicting
fundamental beliefs, whether optimistic or simply unfounded, that all will perform at
one uniformly high level in a standards based system if enough incentives and
punishments are put into place. This difference in beliefs underlies the most significant
differences between a traditional and a standards based education system.
Examples
1. IQ tests are norm-referenced tests, because their goal is to see which test taker is
more intelligent than the other test takers.
2. Theater auditions and job interviews are norm-referenced tests, because their
goal is to identify the best candidate compared to the other candidates, not to
determine how many of the candidates meet a fixed list of standards.
A criterion-referenced test is one that provides for translating test scores into a statement
about the behavior to be expected of a person with that score or their relationship to a
specified subject matter. Most tests and quizzes that are written by school teachers can
be considered criterion-referenced tests. The objective is simply to see whether the
student has learned the material. Criterion-referenced assessment can be contrasted with
norm-referenced assessment and ipsative assessment.
A common misunderstanding regarding the term is the meaning of criterion. Many, if
not most, criterion-referenced tests involve a cutscore, where the examinee passes if
their score exceeds the cutscore and fails if it does not (often called a mastery test). The
criterion is not the cutscore; the criterion is the domain of subject matter that the test is
designed to assess. For example, the criterion may be "Students should be able to
correctly add two single-digit numbers," and the cutscore may be that students should
correctly answer a minimum of 80% of the questions to pass.
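The separation between the criterion (the domain being assessed) and the cutscore (the passing threshold) can be sketched as follows. The 80% threshold comes from the example above; the item counts in the sample calls are assumptions.

```python
# Criterion: "Students should be able to correctly add two single-digit numbers."
# Cutscore: at least 80% of the items must be answered correctly (a mastery test).
CUTSCORE = 0.80

def mastery_decision(correct, total, cutscore=CUTSCORE):
    """Criterion-referenced interpretation: compare the proportion of items
    answered correctly to the cutscore, never to other examinees' scores."""
    proportion = correct / total
    return "pass" if proportion >= cutscore else "fail"

print(mastery_decision(17, 20))  # 0.85 >= 0.80 -> pass
print(mastery_decision(15, 20))  # 0.75 <  0.80 -> fail
```

Note that the function never looks at anyone else's score: whether every examinee passes or every examinee fails is a perfectly possible outcome, which is exactly what distinguishes this interpretation from a norm-referenced one.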
The criterion-referenced interpretation of a test score identifies the relationship to the
subject matter. In the case of a mastery test, this does mean identifying whether the
examinee has "mastered" a specified level of the subject matter by comparing their
score to the cutscore. However, not all criterion-referenced tests have a cutscore, and the
score can simply refer to a person's standing on the subject domain. The ACT is an
example of this; there is no cutscore, it is simply an assessment of the student's
knowledge of high-school-level subject matter. Because of this common
misunderstanding, criterion-referenced tests have also been called standards-based
assessments by some education agencies, as students are assessed with regard to
standards that define what they "should" know, as defined by the state.
Comparison of criterion-referenced and norm-referenced tests
Both terms, criterion-referenced and norm-referenced, were originally coined by Robert
Glaser. Unlike a criterion-referenced test, a norm-referenced test indicates whether the
test-taker did better or worse than other people who took the test. For example, if the
criterion is "Students should be able to correctly add two single-digit numbers," then
reasonable test questions might look like "2 + 3 = ?" or "6 + 9 = ?". A criterion-
referenced test would report the student's performance strictly according to whether the
individual student correctly answered these questions. A norm-referenced test would
report primarily whether this student correctly answered more questions compared to
other students in the group. Even when testing similar topics, a test which is designed to
accurately assess mastery may use different questions than one which is intended to
show relative ranking. This is because some questions are better at reflecting actual
achievement of students, and some test questions are better at differentiating between
the best students and the worst students. (Many questions will do both.) A criterion-
referenced test will use questions which were correctly answered by students who know
the specific material. A norm-referenced test will use questions which were correctly
answered by the "best" students and not correctly answered by the "worst" students (e.g.
Cambridge University's pre-entry 'S' paper). Some tests can provide useful information
about both actual achievement and relative ranking. The ACT provides both a ranking
and an indication of what level is considered necessary for likely success in college. Some
argue that the term "criterion-referenced test" is a misnomer, since it can refer to the
interpretation of the score as well as to the test itself. In the previous example, the same
score on the ACT can be interpreted in a norm-referenced or criterion-referenced
manner.
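The claim that norm-referenced item selection favours questions that separate the strongest examinees from the weakest can be made concrete with a simple upper-minus-lower discrimination index (proportion correct in a top group minus proportion correct in a bottom group). The response matrix below is invented purely for illustration.

```python
# Each row is one examinee's responses to three items (1 = correct),
# with rows ordered so the first three examinees scored highest overall
# and the last three scored lowest. The data are hypothetical.
responses = [
    [1, 1, 1],  # strong examinees
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],  # weak examinees
    [1, 0, 0],
    [1, 0, 0],
]

def discrimination(item, data, group=3):
    """Upper-minus-lower discrimination index for one item (column)."""
    top = sum(row[item] for row in data[:group]) / group
    bottom = sum(row[item] for row in data[-group:]) / group
    return top - bottom

for item in range(3):
    print(item, round(discrimination(item, responses), 2))  # 0 0.0, 1 1.0, 2 0.33
```

Item 0, answered correctly by everyone, has a discrimination of 0.0: it is useless for ranking but may be exactly what a mastery test wants if it reflects the criterion. Item 1, with a discrimination of 1.0, perfectly separates the top group from the bottom group, which is the kind of item a norm-referenced test prefers.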
Sample scoring for the history question: What caused World War II?

Student #1: "WWII was caused by Hitler and Germany invading Poland."
- Criterion-referenced assessment: This answer is correct.
- Norm-referenced assessment: This answer is worse than Student #2's answer, but better than Student #3's answer.

Student #2: "WWII was caused by multiple factors, including the Great Depression and the general economic situation, the rise of nationalism, fascism, and imperialist expansionism, and unresolved resentments related to WWI. The war in Europe began with the German invasion of Poland."
- Criterion-referenced assessment: This answer is correct.
- Norm-referenced assessment: This answer is better than Student #1's and Student #3's answers.

Student #3: "WWII was caused by the assassination of Archduke Ferdinand."
- Criterion-referenced assessment: This answer is wrong.
- Norm-referenced assessment: This answer is worse than Student #1's and Student #2's answers.
Relationship to high-stakes testing
Many high-profile criterion-referenced tests are also high-stakes tests, where the results
of the test have important implications for the individual examinee. Examples of this
include high school graduation examinations and licensure testing where the test must
be passed to work in a profession, such as to become a physician or attorney. However,
being a high-stakes test is not specifically a feature of a criterion-referenced test. It is
instead a feature of how an educational or government agency chooses to use the results
of the test.
Examples
1. Driving tests are criterion-referenced tests, because their goal is to see whether
the test taker is skilled enough to be granted a driver's license, not to see whether
one test taker is more skilled than another test taker.
2. Citizenship tests are usually criterion-referenced tests, because their goal is to
see whether the test taker is sufficiently familiar with the new country's history
and government, not to see whether one test taker is more knowledgeable than
another test taker.
5
Discrete Point Testing and Integrative
Testing
Electronic quiz tools usually involve a discrete point approach to testing as opposed to
an integrated or authentic approach, such as papers and projects. Discrete point tests are
made up of test questions each of which is meant to measure one content point. Discrete
point testing is associated with multiple choice and true/false formats, which have been
criticized for testing only recognition knowledge and facilitating guessing and cheating.
However, if they are used for an appropriate PURPOSE and if the test questions are
well constructed, discrete point tests can be used for effective teaching and learning.
Should language be tested by discrete points or by integrative testing? Traditionally,
language tests have been constructed on the assumption that language can be broken
down into its component parts and that those component parts can be duly tested. What
is a discrete point test?
Language is segmented into many small linguistic points and the four language skills of
listening, speaking, reading and writing. Test questions are designed to test these skills
and linguistic points. A discrete point test consists of many questions on a large number
of linguistic points, but each question tests only one linguistic point. Examples of
discrete point tests are:
1. Phoneme recognition.
2. Yes/No, True/False answers.
3. Spelling.
4. Word completion.
5. Grammar items.
6. Multiple choice tests.
Such tests have a downside in that they take language out of context and usually bear no
relationship to the concept or use of whole language. Discrete point testing met with
some criticism, particularly in view of more recent trends toward considering the
communicative nature and purpose of language, rather than viewing language as the
arithmetic sum of all its parts. That is why John Oller (1976) introduced
"INTEGRATIVE TESTING".
According to him, "language competence is a unified set of interacting abilities which
cannot be separated apart and tested adequately" (Oller 1979:37). "Whereas discrete
items attempt to test knowledge of language one bit at a time, integrative tests attempt to
assess a learner's capacity to use many bits all at the same time, and possibly while
exercising several presumed components of a grammatical system, and perhaps more
than one of the traditional skills or aspects of skills." Therefore, communicative
competence is so global and requires such "integration" for its "pragmatic" use in the
real world that it cannot be captured in additive tests of grammar or reading or
vocabulary and other discrete points of language. This emphasizes the simultaneous
testing of the testee's multiple linguistic competences from various perspectives.
Examples of integrative tests are:
1. Cloze tests
2. Dictation
3. Translation
4. Essays and other coherent writing tasks
5. Oral interviews and conversation
6. Reading, or other extended samples of real text
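The cloze test, first in the list above, is typically built by deleting every nth word from a passage and asking the learner to restore the gaps. A minimal sketch follows; the sample passage and the deletion interval of 5 are arbitrary choices for illustration.

```python
def make_cloze(text, n=5):
    """Replace every nth word with a numbered blank; return the gapped
    text together with the answer key in order."""
    words = text.split()
    answers = []
    for i in range(n - 1, len(words), n):
        answers.append(words[i])
        words[i] = f"({len(answers)})____"
    return " ".join(words), answers

passage = ("Language is segmented into many small linguistic points "
           "and the four language skills of listening speaking reading and writing")
gapped, key = make_cloze(passage)
print(gapped)
print(key)  # ['many', 'the', 'listening']
```

Because restoring each blank requires processing the surrounding context (grammar, vocabulary, and meaning at once), the cloze format is integrative rather than discrete-point, even though each gap is scored individually.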
Oller (1979:38) has refined the integrative concept further by proposing what he calls the
pragmatic test. A pragmatic test is "...any procedure or task that causes the learner to
process sequences of elements in a language that conform to the normal contextual
constraints of that language, and which requires the learner to relate sequences of
linguistic elements via pragmatic mappings to extralinguistic contexts."
A step in a positive direction would be to concentrate on tests of communicative
competence. The recent direction of linguistic study has been toward viewing language
as an integrated and pragmatic skill. Since we cannot be certain that a test like the cloze
test meets the criterion of predicting or assessing a unified and integrated underlying
linguistic competence, we must be cautious in selecting and constructing tests of
language. There is nothing wrong with using the traditional tests of discrete points of
language, especially in achievement and other classroom-oriented testing in which
certain discrete points are very important.
6
Communicative Language Testing
The notion of communicative competence is broad and needs to be fully understood
before being considered as a basis for a research testing regime. As previously indicated,
assessment can be viewed in terms of two distinct paradigms: 1) The
Psychometric-Structuralist era: testing is based on discrete linguistic points related to the
four language skill areas of reading, writing, speaking and listening. 2) The
Psycholinguistic-Sociolinguistic era: integrative tests were conceived in response to
the language proficiency limitations associated with discrete point testing. According to
Oller (in Weir, 1988), integrative testing could measure the ability to integrate disparate
language skills in ways that more closely resembled the actual process of language use.
The communicative paradigm is founded on the notion of competence. According to
Morrow (in Weir, 1988; pp8) communicative language testing should be concerned
with :1) what the learner knows about the form of the language and how to use it
appropriately in context (Competence). 2) the extent to which the learner is able to
demonstrate this knowledge in a meaningful situation (Performance) i.e what can he do
with the language. Performance testing should therefore be representative of a real-life
situation where an integration of communicative skills is required. The performance test
criteria should relate closely to the effective communication of ideas in that context.
Weir emphasises the importance of context and related tasks as an important dimension
in communicative (performance) language assessment (ibid, pp11). In conclusion a
variety of tests different tests are required for a range of different purposeds and the
associated instruments are no longer uniform in content or method.
In recognising the broad definitions of communication, Carroll (Testing Communicative Performance, 1980) adopts a rationalist approach to defining test requirements. The basis of the methodology is therefore a detailed analysis, including the identification of the events and activities (communication functions) that drive the communicative need. Once the test requirements have been identified, they are divided among the principal communicative domains of speaking, listening, writing and reading.
This approach is reminiscent of the requirements definition associated with English for Specific Purposes (ESP), i.e., functional language appropriate for tourists, students, lawyers, etc. However, this strategy (and its associated methodology) would seem inappropriate in the given research context for the following salient reasons:
1. It is not practical to undertake a meaningful needs analysis for all participants.
2. The entire process is far too complex and labour intensive.
3. ESP is not aimed at marginalised communities or children.
Sabria and Samer (other students) have pointed me in the direction of the Cambridge English exams (conformant with the Common European Framework of Reference for Languages) as a potential basis for communicative testing. The tests are divided into the four principal language dimensions (speaking, listening, writing and reading) and provide tests and marking criteria at all levels of competency, including that for the research context (Young Learners English – YLE Starters).
7
Testing Communicative Competence
Testing language has traditionally taken the form of testing knowledge about language, usually knowledge of vocabulary and grammar. However, there is much more to being able to use a language than knowledge about it. Dell Hymes proposed the concept of communicative competence. He argued that a speaker may be able to produce grammatical sentences that are nevertheless completely inappropriate. In communicative competence he included not only the ability to form correct sentences but also the ability to use them at appropriate times. Since Hymes proposed the idea in the early 1970s, it has been expanded considerably, and various types of competencies have been proposed. However, the basic idea of communicative competence remains the ability to use language appropriately, both receptively and productively, in real situations.
The Communicative Approach to Testing
What Communicative Language Tests Measure
Communicative language tests are intended to measure how well the testees are able to use language in real-life situations. In testing productive skills, emphasis is placed on
appropriateness rather than on ability to form grammatically correct sentences. In
testing receptive skills, emphasis is placed on understanding the communicative intent
of the speaker or writer rather than on picking out specific details. And, in fact, the two
are often combined in communicative testing, so that the testee must both comprehend
and respond in real time. In real life, the different skills are not often used entirely in
isolation. Students in a class may listen to a lecture, but they later need to use
information from the lecture in a paper. In taking part in a group discussion, they need
to use both listening and speaking skills. Even reading a book for pleasure may be
followed by recommending it to a friend and telling the friend why you liked it.
The "communicativeness" of a test might be seen as being on a continuum. Few tests
are completely communicative; many tests have some element of communicativeness.
For example, a test in which testees listen to an utterance on a tape and then choose
from among three choices the most appropriate response is more communicative than
one in which the testees answer a question about the meaning of the utterance.
However, it is less communicative than one in which the testees are face-to-face with the interlocutor (rather than listening to a tape) and are required to produce an appropriate response.
Tasks
Communicative tests are often very context-specific. A test for testees who are going to
British universities as students would be very different from one for testees who are
going to their company's branch office in the United States. If at all possible, a
communicative language test should be based on a description of the language that the
testees need to use. Though communicative testing is not limited to English for Specific
Purposes situations, the test should reflect the communicative situation in which the
testees are likely to find themselves. In cases where the testees do not have a specific
purpose, the language that they are tested on can be directed toward general social
situations where they might be in a position to use English.
This basic assumption influences the tasks chosen to test language in communicative situations. A communicative test of listening, then, would not test whether the testee could understand what the utterance "Would you mind putting the groceries away before you leave?" means, but would place it in a context and see whether the testee can respond appropriately to it.
If students are going to be tested on communicative tasks in an achievement test situation, it is necessary that they be prepared for that kind of test, that is, that the course material cover the sorts of tasks they are being asked to perform. For example, you cannot expect testees to perform functions such as requests and apologies appropriately, and evaluate them on this, if they have been studying from a structural syllabus. Similarly, if they have not been studying how to write business letters, you cannot expect them to write a business letter for a test.
Tests intended to test communicative language are judged, then, on the extent to which
they simulate real life communicative situations rather than on how reliable the results
are. In fact, there is an almost inevitable loss of reliability as a result of the loss of
control in a communicative testing situation. If, for example, a test is intended to test the
ability to participate in a group discussion for students who are going to a British
university, it is impossible to control what the other participants in the discussion will
say, so not every testee will be observed in the same situation, which would be ideal for
test reliability. However, according to the basic assumptions of communicative
language testing, this is compensated for by the realism of the situation.
Evaluation
There is necessarily a subjective element to the evaluation of communicative tests. Real
life situations don't always have objectively right or wrong answers, and so band scales
need to be developed to evaluate the results. Each band has a description of the quality
(and sometimes quantity) of the receptive or productive performance of the testee.
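In code, a band scale is essentially a mapping from band numbers to descriptors. The five descriptors below are invented for illustration; a real scale would be written and trialled by the test developers:

```python
# Hypothetical five-band scale for a speaking task; descriptors are
# illustrative, not taken from any published rating scale.
BAND_SCALE = {
    5: "Elicits and gives information fully; opinions expressed and justified.",
    4: "Completes the task; occasional breakdowns are repaired without help.",
    3: "Conveys basic information; frequent hesitation limits interaction.",
    2: "Fragmentary exchanges; the task is completed only with support.",
    1: "Little comprehensible communication; the task is not completed.",
}

def describe(band):
    # Return the descriptor for an awarded band, rejecting values
    # that fall outside the scale.
    if band not in BAND_SCALE:
        raise ValueError(f"band must be one of {sorted(BAND_SCALE)}")
    return BAND_SCALE[band]
```

Raters award a band by matching the observed performance against the descriptors, so the quality of the scale rests on how clearly adjacent bands are distinguished.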
Examples of Communicative Test Tasks
Speaking/Listening
Information gap. An information gap activity is one in which two or more testees work
together, though it is possible for a confederate of the examiner rather than a testee to
take one of the parts. Each testee is given certain information but also lacks some
necessary information. The task requires the testees to ask for and give information. The
task should provide a context in which it is logical for the testees to be sharing
information.
The following is an example of an information gap activity.
Student A
You are planning to buy a tape recorder. You don't want to spend more than about 80
pounds, but you think that a tape recorder that costs less than 50 pounds is probably not
of good quality. You definitely want a tape recorder with auto reverse, and one with a
radio built in would be nice. You have investigated three models of tape recorder and
your friend has investigated three models. Get the information from him/her and share
your information. You should start the conversation and make the final decision, but
you must get his/her opinion, too.
(information about three kinds of tape recorders)
Student B
Your friend is planning to buy a tape recorder, and each of you investigated three types
of tape recorder. You think it is best to get a small, light tape recorder. Share your
information with your friend, and find out about the three tape recorders that your friend
investigated. Let him/her begin the conversation and make the final decision, but don't
hesitate to express your opinion.
(information about three kinds of tape recorders)
This kind of task would be evaluated using a system of band scales. The band scales
would emphasize the testee's ability to give and receive information, express and elicit
opinions, etc. If its intention were communicative, it would probably not emphasize
pronunciation, grammatical correctness, etc., except to the extent that these might
interfere with communication. The examiner should be an observer and not take part in
the activity, since it is difficult to both take part in the activity and evaluate it. Also, the
activity should be tape recorded, if possible, so that it can be evaluated later rather than in real time.
Role Play. In a role play, the testee is given a situation to play out with another person.
The testee is given in advance information about what his/her role is, what specific
functions he/she needs to carry out, etc. A role play task would be similar to the above
information gap activity, except that it would not involve an information gap. Usually
the examiner or a confederate takes one part of the role play.
The following is an example of a role play activity.
Student
You missed class yesterday. Go to the teacher's office and apologize for having missed
the class. Ask for the handout from the class. Find out what the homework was.
Examiner
You are a teacher. A student who missed your class yesterday comes to your office.
Accept her/his apology, but emphasize the importance of attending classes. You do not
have any extra handouts from the class, so suggest that she/he copy one from a friend.
Tell her/him what the homework was.
Again, if the intention of this test were to test communicative language, the testee would
be assessed on his/her ability to carry out the functions (apologizing, requesting, asking
for information, responding to a suggestion, etc.) required by the role.
8
Testing Reading and Writing
Some tests combine reading and writing in communicative situations. Testees can be given a task in which they are presented with instructions to write a letter, memo, summary, etc., answering certain questions based on information that they are given.
Letter writing. In many situations, testees might have to write business letters, letters asking for information, etc. The following is an example of such a task.
Your boss has received a letter from a customer complaining about problems with a
coffee maker that he bought six months ago. Your boss has instructed you to check the
company policy on returns and repairs and reply to the letter. Read the letter from the
customer and the statement of the company policy about returns and repairs below and
write a formal business letter to the customer.
(the customer's complaint letter; the company policy)
The letter would be evaluated using a band scale, based on compliance with formal
letter writing layout, the content of the letter, inclusion of correct and relevant
information, etc.
Summarizing. Testees might be given a long passage--for example, 400 words--and be asked to summarize the main points in less than 100 words. To make this task communicative, the testees should be given realistic reasons for doing such a task. For example, the longer text might be an article that their boss would like to have summarized so that he/she can incorporate the main points into a talk. The summary would be evaluated based on the inclusion of the main points of the longer text.
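The length constraint in such a task (under 100 words for a roughly 400-word source) can be checked mechanically before the content is rated. A small sketch, assuming whitespace-separated tokens as the counting unit:

```python
def within_word_limit(summary, max_words=100):
    # Count whitespace-separated tokens; a rater still judges whether
    # the main points of the source text are actually covered.
    return len(summary.split()) <= max_words

ok = within_word_limit("The article argues for daily exercise.")  # 6 words
```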
Testing Listening and Writing/Note Taking
Listening and writing may also be tested in combination. In this case, testees are given a
listening text and they are instructed to write down certain information from the text.
Again, although this is not interactive, it should somehow simulate a situation where
information would be written down from a spoken text.
9
Performance-Based Assessment
Performance-based assessment is an alternative form of assessment that moves away
from traditional paper and pencil tests. Performance-based assessment involves having
the students produce a project, whether it is oral, written or a group performance. The
students are engaged in creating a final project that exhibits their understanding of a
concept they have learned.
A unique quality of performance-based assessment is that it allows the students to be assessed based on a process. The teacher is able to see first-hand how the students produce language in real-world situations. In addition, performance-based assessments tend to have higher content validity because a process is being measured. In performance-based assessment, the focus remains on the process rather than the product.
There are two parts to a performance-based assessment. The first part is a clearly defined task for the students to complete, called the product descriptor. The assessments are either product related, specific to certain content, or specific to a given task. The second part is a list of explicit criteria that are used to assess the students. Generally this comes in the form of a rubric. Rubrics can be either analytic, meaning that the final product is assessed in parts, or holistic, meaning that the final product is assessed as a whole.
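The analytic/holistic distinction can be sketched as two scoring functions. The component names, weights, and levels below are illustrative assumptions, not a prescribed rubric:

```python
def analytic_score(part_scores, weights):
    # Analytic rubric: each component is scored separately, then the
    # parts are combined using the weights the rubric assigns.
    assert set(part_scores) == set(weights), "parts and weights must match"
    return sum(part_scores[p] * weights[p] for p in part_scores)

def holistic_score(descriptors, level):
    # Holistic rubric: one overall judgement mapped to a single descriptor.
    return descriptors[level]

parts = {"content": 4, "organization": 3, "language": 5}      # 0-5 each
weights = {"content": 0.5, "organization": 0.2, "language": 0.3}
total = analytic_score(parts, weights)  # 4*0.5 + 3*0.2 + 5*0.3 = 4.1
```

The analytic form makes the basis of the final mark transparent to students; the holistic form is faster to apply but hides which component pulled the score down.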
Performance-based assessment tasks are generally not as formally structured as traditional tests. There is room for creativity and student design in performance-based tasks. Generally, these
tasks measure the students when they are actually performing the given task. Due to the
nature of these tasks, performance-based assessment is highly interactive. Students are
interacting with each other in order to complete real-world examples of language tasks.
Also, performance-based assessment tends to integrate many different skills. For
example, reading and writing can be involved in one task or speaking and listening can
be involved in the same task.
As previously mentioned, there are many types of performance-based assessments.
Each type of assessment brings with it different strengths and deficiencies relative to
credible and dependable information. Because it is virtually impossible for a single
assessment tool to adequately assess all aspects of student performance, the real
challenge comes in selecting or developing performance-based assessments that
complement both each other and more traditional assessments to equitably assess
students in physical education and human performance.
The goal for assessment is to accurately determine whether students have learned the
materials or information taught and reveal whether they have complete mastery of the
content with no misunderstandings. Just as researchers use multiple data sources to
determine the truthfulness of the results, teachers can use multiple types of assessment
to evaluate the level of student learning. Because assessments involve the gathering of
data or information, some type of product, performance, or recording sheet must be
generated. The following are some examples of various types of performance-based
assessments used in physical education.
Performance-based assessment is an opportunity to allow students to produce
language in real-world contexts while being assessed. This type of assessment is unique
because it is not a traditional test format. Some examples of performance-based
assessment tasks are as follows:
Types of Performance-Based Assessment:
1. Journals
Students will write regularly in a journal about anything relevant to their life, school or
thoughts. Their writing will be in the target language. The teacher will collect the
journals periodically and provide feedback to the students. This can serve as a
communication log between the teacher and students. Journals can be used to record
student feelings, thoughts, perceptions, or reflections about actual events or results. The
entries in journals often report social or psychological perspectives, both positive and
negative, and may be used to document the personal meaning associated with one’s
participation (NASPE Standard 6). Journal entries would not be an appropriate
summative assessment by themselves, but might be included as an artifact in a portfolio.
Journal entries are excellent ways for teachers to “take the pulse” of a class and
determine whether students are valuing the content of the class. Teachers must be
careful not to assess affective domain journal entries for the actual content, because
doing so may cause students to write what teachers want to hear (or give credit for)
instead of true and genuine feelings. Teachers could hold students accountable for
completing journal entries. Some teachers use journals as a way to log participation
over time.
2. Letters
The students will create original language compositions through producing a letter.
They will be asked to write about something relevant to their own life using the target
language. The letter assignment will be accompanied by a rubric for assessment
purposes.
3. Oral Reports
The students will need to do research in groups about a given topic. After they have
completed their research, the students will prepare an oral presentation to present to the
class explaining their research. The main component of this project will be the oral
production of the target language.
4. Original Stories
The students will write an original fictional story. The students will be asked to
include several specified grammatical structures and vocabulary words. This assignment will be assessed analytically; each component will have a point value.
5. Oral Interview
An oral interview will take place between two students. One student will ask the
questions and listen to the responses of the other student. From the given responses,
more questions can be asked. Each student will be responsible for listening and
speaking.
6. Skit
The students will work in groups in order to create a skit about a real-world situation.
They will use the target language. The vocabulary used should be specific to the
situation. The students will be assessed holistically, based on the overall presentation of the skit.
7. Poetry Recitations
After studying poetry, the students will select a poem in the target language of their choice to recite to the class. The students will be assessed based on their pronunciation,
rhythm and speed. The students will also have an opportunity to share with the class
what they think the poem means.
8. Portfolios
Portfolios allow students to compile their work over a period of time. The students
will have a checklist and rubric along with the assignment description. The students will
assemble their best work, including their drafts so that the teacher can assess the
process.
9. Puppet Show
The students can work in groups or individually to create a short puppet show. The
puppet show can have several characters that are involved in a conversation of real-
world context. These would most likely be assessed holistically.
10. Art Work/Designs/Drawings
This is a creative way to assess students. They can choose a short story or piece of writing, read it and interpret it. Their interpretation can be represented through artistic
expression. The students will present their art work to the class, explaining what they
did and why.
Using Observation in the Assessment Process
Human performance provides many opportunities for students to exhibit behaviors that
may be directly observed by others, a unique advantage of working in the psychomotor
domain. Wiggins (1998) uses physical activity when providing examples to illustrate
complex assessment concepts, as they are easier to visualize than would be the case
with a cognitive example. The nature of performing a motor skill makes assessment
through observational analysis a logical choice for many physical education teachers. In
fact, investigations of measurement practices of physical educators have consistently
shown a reliance on observation and related assessment methods (Hensley and East
1989; Matanin and Tannehill 1994; Mintah 2003).
Observation is a skill used with several performance-based assessments. It is often used
to provide students with feedback to improve performance. However, without some way
to record results, observation alone is not an assessment. Going back to the definition of assessment provided earlier in the chapter, assessment is the gathering of information, the analysis of the data, and the use of that information to make an evaluation. Therefore, some type of written product must be produced if the task is to be considered an assessment.
Teachers and peers can assess others using observation. They might use a checklist or
some type of event recording scheme to tally the number of times a behavior occurred.
Keeping game play statistics is an example of recording data using event recording
techniques. Students can self-analyze their own performance and record their
performances using criteria provided on a checklist or a game play rubric. Table 14.1 is
an example of a recording form that could be used for peer assessment. When using
peer assessment, it is best to have the assessor do only the assessment. When the person
recording assessment results is also expected to take part in the assessment (e.g., tossing
the ball to the person being assessed), he or she cannot both toss and do an accurate
observation. In the case of large classes, teachers might even use groups of four, in
which one person is being evaluated, a second person is feeding the ball, the third
person is doing the observation, and a fourth person is recording the results.
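Event recording of this kind reduces to tallying behaviour codes. A minimal sketch, with invented codes ('S' for a successful trial, 'M' for a miss) standing in for whatever scheme the checklist defines:

```python
from collections import Counter

def tally_events(observations):
    # Event recording: count how often each behaviour code occurs in
    # the sequence the observer writes down, one code per trial.
    return Counter(observations)

trial_log = ["S", "M", "S", "S", "M", "S"]   # hypothetical peer-recorded data
counts = tally_events(trial_log)
success_rate = counts["S"] / len(trial_log)  # 4 successes out of 6 trials
```

Game play statistics work the same way: the recorder's tally sheet is the raw sequence, and the counts become the data used for evaluation.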
Individual or Group Projects
Projects have long been used in education to assess a student’s understanding of a
subject or a particular topic. Projects typically require students to apply their knowledge
and skills while completing the prescribed task, which often calls for creativity, critical
thinking, analysis, and synthesis. Examples of student projects used in physical
education and human performance include the following: demonstrating knowledge of
invasion game strategies by designing a new game; demonstrating knowledge of how to
become an active participant in the community by doing research on obesity and then
developing a brochure for people in the community that presents ideas for developing a
physically active lifestyle; demonstrating knowledge of fitness components and how to
stay fit by designing one’s own fitness program using personal fitness test results;
demonstrating knowledge of how to create a dance by video recording a dance that
members of the group choreographed; and doing research on childhood games and
teaching children from a local elementary school how to play them. Criteria for
evaluating the projects are developed and the results of the project are recorded.
Group projects involve a number of students working together on a complex problem
that requires planning, research, internal discussion, and presentation. Group projects
should include a component that each student completes individually to avoid having a
student receive credit for work that he or she did not do. Another way to avoid this issue
is to have members of the group award paychecks to the various members of the group
(e.g., split a $10,000 check) and provide justifications about the amount given to each
person. To encourage reflections on the contributions of others, students are not allowed
to give an equal amount to everyone. These “checks” are confidential and submitted
directly to the teacher in an envelope that others in the group are not allowed to see.
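The two rules of the "paycheck" technique (the full amount must be spent, and the split may not be equal) can be checked automatically when allocations are collected. A sketch with invented names and amounts:

```python
def valid_paycheck_split(awards, total=10_000):
    # A confidential allocation is valid only if it spends the whole
    # amount AND differentiates contributions (no equal split allowed).
    amounts = list(awards.values())
    return sum(amounts) == total and len(set(amounts)) > 1

ok = valid_paycheck_split({"Ana": 4_000, "Ben": 3_500, "Caro": 2_500})
flat = valid_paycheck_split({"Ana": 2_500, "Ben": 2_500,
                             "Caro": 2_500, "Dee": 2_500})  # equal split
```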
The following example of a project designed for middle school or high school students
involves a research component, analysis and synthesis of information, problem solving,
and effective communication.
Portfolios
Portfolios are systematic, purposeful, and meaningful collections of an individual’s
work designed to document learning over time. Since a portfolio provides
documentation of student learning, the knowledge and skills that the teacher desires to
have students document guides the structure of the portfolio. The type of portfolio, its
format, and the general contents are usually prescribed by the teacher. Portfolio
collections may also include input provided by teachers, parents, peers, administrators,
or others. The guidelines used to format a portfolio will be based on the type of learning
that the portfolio is used to document. The following are two basic types of portfolios:
Working portfolio—A repository of portfolio documents that the student accumulates
over a certain period of time. Other types of process information may also be included,
such as drafts of student work or records of student achievement or progress over time.
Showcase or model portfolio—A portfolio consisting of work samples selected by the
student that document the student’s best work. The student has consciously evaluated
his or her work and selected only those products that best represent the type of learning
identified for this assessment. Each artifact selected is accompanied by a reflection, in
which the student explains the significance of the item and the type of learning it
represents.
It’s a good idea to limit the portfolio to a certain number of pieces of work to prevent
the portfolio from becoming a scrapbook that has little meaning to the student and to
avoid giving teachers a monumental evaluation task. This also requires students to
exercise some judgment about which artifacts best fulfill the requirements of the
portfolio task and document their level of achievement. The portfolio itself is usually a
file or folder that contains the student’s collected work. The contents could include
items such as a training log, student journal or diary, written reports, photographs or
sketches, letters, charts or graphs, maps, copies of certificates, computer disks or
computer-generated products, completed rating scales, fitness test results, game
statistics, training plans, report of dietary analyses, and even video- or audio recordings.
Collectively, the artifacts selected will document student growth and learning over time
as well as current levels of achievement. The potential items that could become
portfolio artifacts are almost limitless. Kirk (1997) suggests the following list of
possible portfolio artifacts that may be useful for physical activity settings. A teacher
would never require that a portfolio contain all of these items. The list is offered as a
way to generate ideas for possible artifacts.
A rubric (scoring tool) should be used to evaluate portfolios in much the same manner
as any other product or performance. Providing a rubric to students in advance allows
them to self-assess their work and thus be more likely to produce a portfolio of high
quality. Portfolios, since they are designed to show growth and improvement in student
learning, are evaluated holistically. The reflections that describe the artifact and why the
artifact was selected for inclusion in the portfolio provide insights into levels of student
learning and achievement. Teachers should remember that format is less important than
content and that the rubric should be weighted to reflect this. Table 14.2 illustrates a
qualitative analytic rubric for judging a portfolio along three dimensions.
For additional information about portfolio assessments, Lund and Kirk (2010) have a
chapter on developing portfolio assessments. An article published as part of a JOPERD
feature presents a suggested scoring scale for a portfolio (Kirk 1997). Melograno’s
Assessment Series publication (2000) on portfolios also contains helpful information.
Performances
Student performances can be used as culminating assessments at the completion of an
instructional unit. Teachers might organize a gymnastics or track and field meet at the
conclusion of one of those units to allow students to demonstrate the skills and
knowledge that they gained during instruction. Game play during a tournament is also
considered a student performance. Rubrics for game play can be written so that students
are evaluated on all three learning domains (psychomotor, cognitive, and affective).
Students might demonstrate their skills and learning in one of the following ways:
Performing an aerobics routine for a school assembly
Organizing and performing a jump rope show at the half-time of a basketball game
Performing in a folk dance festival at the county fair
Demonstrating wushu (a Chinese martial art) at the local shopping mall
Training for and participating in a local road race or cycling competition
Although performances do not produce a written product, there are several ways to
gather data to use for assessment purposes. A score sheet can be used to record student
performance using the criteria from a game play rubric. Game play statistics are another
example of a way to document performance. Performances can also be video recorded
to provide evidence of learning. In some cases teachers might want to shorten the time
used to gather evidence of learning from a performance. Event tasks are performances
that are completed in a single class period. Students might demonstrate their knowledge
of net or wall game strategies by playing a scripted game that is video recorded during a
single class. The ability to create movement sequences or a dance that uses different
levels, effort, or relationships could be demonstrated during a single class period with
an event task. Many adventure education activities that demonstrate affective domain
attributes can be assessed using event tasks.
Student Logs
Documenting student participation in physical activity (NASPE Standard 3) is often
difficult. Teachers can assess participation in an activity or skill practice trials
completed outside of class using logs. Practice trials during class that demonstrate
student effort can also be documented with logs. A log records behaviors over a period
of time (see figure 14.1). Often the information recorded shows changes in behavior,
trends in performance, results of participation, progress, or the regularity of physical
activity. A student log is an excellent artifact for use in a portfolio. Because logs are
usually self-recorded documents, they are not used for summative assessments except
as an artifact in a portfolio or for a project. If teachers want to increase the weight
placed on a log, a method of verification by an adult or someone else in authority
should be added.
10
VALIDITY AND RELIABILITY
For the statistical consultant working with social science researchers the estimation of
reliability and validity is a task frequently encountered. Measurement issues differ in
the social sciences in that they are related to the quantification of abstract, intangible
and unobservable constructs. In many instances, then, the meaning of quantities is only
inferred.
Let us begin with a general description of the paradigm that we are dealing with. Most
concepts in the behavioral sciences have meaning within the context of the theory that
they are a part of. Each concept, thus, has an operational definition which is governed
by the overarching theory. If a concept is involved in the testing of hypotheses to
support the theory, it has to be measured. So the first decision that the researcher is
faced with is "how shall the concept be measured?", that is, the type of measure. At a
very broad level the type of measure can be observational, self-report, interview, etc.
These types ultimately take the shape of a more specific form: observation of ongoing
activity or of video-taped events; self-report measures such as questionnaires, which
can be open-ended or closed-ended; Likert-type scales; and interviews that are
structured, semi-structured or unstructured, and open-ended or closed-ended.
Needless to say, each type of measure
has specific types of issues that need to be addressed to make the measurement
meaningful, accurate, and efficient.
Another important feature is the population for which the measure is intended. This
decision does not depend entirely on the theoretical paradigm but rather on the
immediate research question at hand.
A third point that needs mentioning is the purpose of the scale or measure. What is it
that the researcher wants to do with the measure? Is it developed for a specific study or
is it developed with the anticipation of extensive use with similar populations?
Once some of these decisions are made and a measure is developed, which is a careful
and tedious process, the relevant questions to raise are “how do we know that we are
indeed measuring what we want to measure?” since the construct that we are measuring
is abstract, and “can we be sure that if we repeated the measurement we will get the
same result?". The first question is related to validity and the second to reliability.
Validity and reliability are two important characteristics of behavioral measures and
are referred to as psychometric properties.
It is important to bear in mind that validity and reliability are not all-or-none issues but
matters of degree.
Measurement Error
All measurements may contain some element of error; validity and reliability
concern the amount and type of error that typically occurs, and they also show how we
can estimate the amount of error in a measurement.
There are three chief sources of error:
1. in the thing being measured (my weight may fluctuate so it's difficult to get an
accurate picture of it);
2. the observer (on Mondays I may knock a pound off my weight if I binged on my
mother's cooking at the weekend. Obviously the binging doesn't reflect my true
weight!);
3. or in the recording device (our clinic weighing scale has been acting up; we really
should get it recalibrated).

And there are two types of error:
Random errors are not attributable to a specific cause. If sufficiently large numbers of
observations are made, random errors average to zero, because some readings
over-estimate and some under-estimate.

Systematic errors tend to fall in a particular direction and are likely due to a specific
cause. Because systematic errors fall in one direction (e.g., I always exaggerate my
athletic abilities), they bias a measurement.

Random errors are considered part of the reliability of a measurement. Systematic
errors are considered part of the validity of a measurement.
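To make the distinction concrete, here is a small illustrative simulation (not from the original text; the true weight, error sizes, and seed are invented for the example) showing random errors averaging out while a systematic error biases the result:

```python
import random

random.seed(42)  # fixed seed so the illustration is repeatable
true_weight = 70.0  # kg (hypothetical true value)

# Random error: zero-mean noise added to each reading
random_readings = [true_weight + random.gauss(0, 0.5) for _ in range(10_000)]

# Systematic error: every reading is biased 1 kg low, plus the same noise
biased_readings = [true_weight - 1.0 + random.gauss(0, 0.5) for _ in range(10_000)]

mean_random = sum(random_readings) / len(random_readings)
mean_biased = sum(biased_readings) / len(biased_readings)

print(round(mean_random, 1))  # close to 70.0: random errors average out
print(round(mean_biased, 1))  # close to 69.0: the systematic bias remains
```

Averaging many readings removes the random component but cannot remove the systematic one, which is why random error is a reliability issue and systematic error a validity issue.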
Reliability and validity
The reliability of an assessment tool is the extent to which it measures learning
consistently.
The validity of an assessment tool is the extent to which it measures what it was
designed to measure.
Reliability
The reliability of an assessment tool is the extent to which it consistently and accurately
measures learning. When the results of an assessment are reliable, we can be confident
that repeated or equivalent assessments will provide consistent results. This puts us in a
better position to make generalised statements about a student’s level of achievement,
which is especially important when we are using the results of an assessment to make
decisions about teaching and learning, or when we are reporting back to students and
their parents or caregivers. No results, however, can be completely reliable. There is
always some random variation that may affect the assessment, so educators should
always be prepared to question results.
Factors which can affect reliability:
The length of the assessment – a longer assessment generally produces more reliable
results.
The suitability of the questions or tasks for the students being assessed.
The phrasing and terminology of the questions.
The consistency in test administration – for example, the length of time given for the
assessment, instructions given to students before the test.
The design of the marking schedule and moderation of marking procedures.
The readiness of students for the assessment – for example, a hot afternoon or straight
after physical activity might not be the best time for students to be assessed.
How to be sure that a formal assessment tool is reliable
Check the user manual for evidence of the reliability coefficient. Reliability coefficients
are measured between zero and 1; a coefficient of 0.9 or more indicates a high degree
of reliability.
Assessment tool manuals contain comprehensive administration guidelines. It is
essential to read the manual thoroughly before conducting the assessment.
Validity
Educational assessment should always have a clear purpose. Nothing will be gained
from assessment unless the assessment has some validity for the purpose. For that
reason, validity is the most important single attribute of a good test.
The validity of an assessment tool is the extent to which it measures what it was
designed to measure, without contamination from other characteristics. For example, a
test of reading comprehension should not require mathematical ability.
There are several different types of validity:
Face validity: do the assessment items appear to be appropriate?
Content validity: does the assessment content cover what you want to assess?
Criterion-related validity: how well does the test measure what you want it to?
Construct validity: are you measuring what you think you're measuring?
It is fairly obvious that a valid assessment should have a good coverage of the criteria
(concepts, skills and knowledge) relevant to the purpose of the examination. The
important notion here is the purpose. For example:
The PROBE test is a form of reading running record which measures reading
behaviours and includes some comprehension questions. It allows teachers to see the
reading strategies that students are using, and potential problems with decoding. The
test would not, however, provide in-depth information about a student’s comprehension
strategies across a range of texts.
STAR (Supplementary Test of Achievement in Reading) is not designed as a
comprehensive test of reading ability. It focuses on assessing students’ vocabulary
understanding, basic sentence comprehension and paragraph comprehension. It is most
appropriately used for students who don’t score well on more general testing (such as
PAT or e-asTTle) as it provides a more fine-grained analysis of basic comprehension
strategies.
There is an important relationship between reliability and validity. An assessment that
has very low reliability will also have low validity; clearly a measurement with very
poor accuracy or consistency is unlikely to be fit for its purpose. But, by the same token,
the things required to achieve a very high degree of reliability can impact negatively on
validity. For example, consistency in assessment conditions leads to greater reliability
because it reduces 'noise' (variability) in the results. On the other hand, one of the things
that can improve validity is flexibility in assessment tasks and conditions. Such
flexibility allows assessment to be set appropriate to the learning context and to be
made relevant to particular groups of students. Insisting on highly consistent assessment
conditions to attain high reliability will result in little flexibility, and might therefore
limit validity.
Validity:
Very simply, validity is the extent to which a test measures what it is supposed
to measure. The question of validity is raised in the context of the three points made
above, the form of the test, the purpose of the test and the population for whom it is
intended. Therefore, we cannot ask the general question “Is this a valid test?”. The
question to ask is “how valid is this test for the decision that I need to make?” or “how
valid is the interpretation I propose for the test?” We can divide the types of validity
into logical and empirical.
VALIDITY refers to what conclusions we can draw from the results of a measurement.
Introductory-level definitions are "Does the test measure what we are intending to
measure?", or "How closely do the results of a measurement correspond to the true state
of the phenomenon being measured?"
Nerd's Corner: These ideas of validity fit under a more general conception in terms of
"How can we interpret the test results?" or "What does this measurement actually
mean?" This approach is useful because sometimes information collected for one
purpose can also tell us about something quite different. So, the World Bank records the
gross national product of each country for economic monitoring, but this also gives us a
pretty good idea of how countries will rank in terms of child health.
Nerd's Corner: Putting these ideas together, we get a table showing how validity and
reliability may be assessed, by error type and by source of error:

Random error:
- Thing being measured: test-retest reliability
- Observer: correlation between observers
- Recording device (e.g., a screening test): calibration trial (variation when measuring
a standard object)

Systematic error:
- Thing being measured: record diurnal (etc.) variation (e.g., BP higher on Mondays)
- Observer: agreement between observers (e.g., nurses or patients)
- Recording device: construct & criterion validity; sensitivity & specificity
Validity of a screening test. This can be used to illustrate the way validity is assessed.
Here, it is commonly reported in terms of sensitivity and specificity.
Sensitivity refers to what fraction of all the actual cases of disease a test detects.
If the test is not very good, it may miss cases it should detect. Its sensitivity is low and it
generates "false negatives" (i.e., people score negatively on the test when they should
have scored positive). This can be extremely serious if early treatment would have
saved the person's life.
Mnemonics to help you: The word 'sensitivity' is intuitive: a sensitive test is one that
can identify the disease.
SeNsitivity is inversely associated with the false Negative rate of a test (high sensitivity
= few false negatives).
Specificity refers to whether the test identifies only those with the disease, or
whether it mistakenly classifies some healthy people as being sick. Errors of this type
are called "false positives." They can lead to worry and expensive further investigations.
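Both quantities can be computed directly from the counts of correct and incorrect test results. The sketch below uses invented screening figures purely for illustration:

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual cases the test detects (high = few false negatives)."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of healthy people the test correctly clears (high = few false positives)."""
    return true_neg / (true_neg + false_pos)

# Hypothetical screening results: 90 true positives, 10 false negatives,
# 950 true negatives, 50 false positives
print(sensitivity(90, 10))   # 0.9
print(specificity(950, 50))  # 0.95
```

A test can trade one quantity against the other: making a test more lenient catches more true cases (higher sensitivity) but flags more healthy people (lower specificity).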
Types of Validity
1. Content Validity:
When we want to find out if the entire content of the behavior/construct/area is
represented in the test we compare the test task with the content of the behavior. This is
a logical method, not an empirical one. For example, if we want to test knowledge of
American Geography, it is not fair to have most questions limited to the geography of
New England.
2. Face Validity:
Basically face validity refers to the degree to which a test appears to measure
what it purports to measure. Face Validity ascertains that the measure appears to be
assessing the intended construct under study. The stakeholders can easily assess face
validity. Although this is not a very “scientific” type of validity, it may be an essential
component in enlisting motivation of stakeholders. If the stakeholders do not believe the
measure is an accurate assessment of the ability, they may become disengaged with the
task. Example: If a measure of art appreciation is created, all of the items should be
related to the different components and types of art. If the questions are regarding
historical time periods, with no reference to any artistic movement, stakeholders may
not be motivated to give their best effort or invest in this measure because they do not
believe it is a true assessment of art appreciation.
3. Criterion-Oriented or Predictive Validity:
Criterion-Related Validity is used to predict future or current performance - it correlates
test results with another criterion of interest.
Example: Suppose a physics program designed a measure to assess cumulative student
learning throughout the major. The new measure could be correlated with a
standardized measure of ability in this discipline, such as an ETS field test or the GRE
subject test. The higher the correlation between the established measure and new
measure, the more faith stakeholders can have in the new assessment tool.
When you are expecting a future performance based on the scores obtained
currently by the measure, correlate the scores obtained with the performance. The later
performance is called the criterion and the current score is the predictor. This is an
empirical check on the value of the test – a criterion-oriented or predictive validation.
4. Concurrent Validity:
Concurrent validity is the degree to which the scores on a test are related to the
scores on another, already established, test administered at the same time, or to some
other valid criterion available at the same time. For example, when a new, simpler test
is to be used in place of an old, cumbersome one that is considered useful,
measurements are obtained on both at the same time. Logically, predictive and
concurrent validation are the same; the term concurrent validation is used to indicate
that no time elapsed between the measures.
5. Construct Validity:
Construct validity is used to ensure that the measure actually measures what it is
intended to measure (i.e. the construct), and not other variables. Using a panel of
“experts” familiar with the construct is a way in which this type of validity can be
assessed. The experts can examine the items and decide what that specific item is
intended to measure. Students can be involved in this process to obtain their feedback.
Example: A women’s studies program may design a cumulative assessment of learning
throughout the major. The questions are written with complicated wording and
phrasing. This can cause the test to inadvertently become a test of reading
comprehension, rather than a test of women’s studies. It is important that the measure is
actually assessing the intended construct, rather than an extraneous factor.
Construct validity is the degree to which a test measures an intended
hypothetical construct. Many times psychologists assess/measure abstract attributes or
constructs. The process of validating the interpretations about that construct as
indicated by the test score is construct validation. This can be done experimentally.
For example, suppose we want to validate a measure of anxiety and we hypothesize
that anxiety increases when subjects are under the threat of an electric shock; then the
threat of an electric shock should increase anxiety scores (note: not all construct
validation is this dramatic!).
A correlation coefficient is a statistical summary of the relation between two
variables. It is the most common way of reporting the answer to such questions as the
following: Does this test predict performance on the job? Do these two tests measure
the same thing? Do the ranks of these people today agree with their ranks a year ago?
(rank correlation and product-moment correlation)
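As an illustration, a product-moment (Pearson) correlation can be computed from first principles. The test scores and later job-performance ratings below are hypothetical:

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two paired lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical: selection-test scores vs. later job-performance ratings
test_scores = [55, 60, 65, 70, 75, 80, 85]
job_ratings = [2.1, 2.4, 3.0, 2.8, 3.5, 3.9, 4.2]
print(round(pearson_r(test_scores, job_ratings), 2))
```

A coefficient near +1 or -1 indicates a strong linear relation; a coefficient near 0 indicates none.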
According to Cronbach, to the question “what is a good validity coefficient?”
the only sensible answer is “the best you can get”, and it is unusual for a validity
coefficient to rise above 0.60, though that is far from perfect prediction.
All in all we need to always keep in mind the contextual questions: what is the
test going to be used for? how expensive is it in terms of time, energy and money? what
implications are we intending to draw from test scores?
Formative validity, when applied to outcomes assessment, is used to assess how well a
measure is able to provide information to help improve the program under study.
Example: When designing a rubric for history, one could assess students' knowledge
across the discipline. If the measure can provide information that students are lacking
knowledge in a certain area, for instance the Civil Rights Movement, then that
assessment tool is providing meaningful information that can be used to improve the
course or program requirements.
Sampling Validity (similar to content validity) ensures that the measure covers
the broad range of areas within the concept under study. Not everything can be
covered, so items need to be sampled from all of the domains. This may need to be
completed using a panel of “experts” to ensure that the content area is adequately
sampled. Additionally, a panel can help limit “expert” bias (i.e. a test reflecting what an
individual personally feels are the most important or relevant areas).
Example: When designing an assessment of learning in the theatre department, it would
not be sufficient to cover only issues related to acting. Other areas of theatre, such as
lighting, sound, and the functions of stage managers, should all be included. The assessment
should reflect the content area in its entirety.
What are some ways to improve validity?
1. Make sure your goals and objectives are clearly defined and operationalized.
Expectations of students should be written down.
2. Match your assessment measure to your goals and objectives. Additionally, have
the test reviewed by faculty at other schools to obtain feedback from an outside
party who is less invested in the instrument.
3. Get students involved; have the students look over the assessment for
troublesome wording, or other difficulties.
4. If possible, compare your measure with other measures, or data that may be
available.
Reliability:
Research requires dependable measurement (Nunnally). Measurements are
reliable to the extent that they are repeatable and that any random influence which tends
to make measurements different from occasion to occasion or circumstance to
circumstance is a source of measurement error. According to Gay, reliability is the
degree to which a test consistently measures whatever it measures. Errors of
measurement that affect
reliability are random errors and errors of measurement that affect validity are
systematic or constant errors.
Test-retest, equivalent forms and split-half reliability are all determined through
correlation.
RELIABILITY refers to consistency or dependability. Your patient Jim is
unpredictable; sometimes he comes to his appointment on time, sometimes he's late and
once or twice he was early.
One way to estimate reliability of a measurement is to record its stability: do you
get the same blood pressure reading if you repeat the measurement? This is sometimes
called "test-retest stability" or "intra-rater reliability" and focuses on the observer and
the instrument as potential sources of error. (Note that we must assume that no actual
change in BP occurred between the measurements: there is no error in the thing being
measured).
You can also estimate reliability by comparing the agreement between different people
making a rating (e.g., if several nurses measure a patient's blood pressure, do they get
the same reading?). This can be called "inter-rater reliability" or "inter-rater agreement."
Nerd's Corner: This is a simplification. Sometimes it's difficult to figure out if an
error is random or systematic: the disagreement between the nurses could really be
random, or it could arise because one of them tends to under-record the BP. Further
testing would be needed to trace the origin of the inaccuracy.
Types of Reliability
1. Test-retest Reliability:
Test-retest reliability is the degree to which scores are consistent over time. It
indicates score variation that occurs from testing session to testing session as a result of
errors of measurement. Problems: Memory, Maturation, Learning.
Test-retest reliability is a measure of reliability obtained by administering the same
test twice over a period of time to a group of individuals. The scores from Time 1 and
Time 2 can then be correlated in order to evaluate the test for stability over time.
Example: A test designed to assess student learning in psychology could be given to a
group of students twice, with the second administration perhaps coming a week after the
first. The obtained correlation coefficient would indicate the stability of the scores.
2. Equivalent-Forms or Alternate-Forms Reliability
Parallel forms reliability is a measure of reliability obtained by administering
different versions of an assessment tool (both versions must contain items that probe the
same construct, skill, knowledge base, etc.) to the same group of individuals. The
scores from the two versions can then be correlated in order to evaluate the consistency
of results across alternate versions.
Example: If you wanted to evaluate the reliability of a critical thinking assessment, you
might create a large set of items that all pertain to critical thinking and then randomly
split the questions up into two sets, which would represent the parallel forms.
Equivalent-Forms or Alternate-Forms Reliability:
Two tests that are identical in every way except for the actual items included.
Used when it is likely that test takers will recall responses made during the first session
and when alternate forms are available. Correlate the two scores. The obtained
coefficient is called the coefficient of stability or coefficient of equivalence. Problem:
Difficulty of constructing two forms that are essentially equivalent.
Both of the above require two administrations.
3. Inter-rater reliability
Inter-rater reliability is a measure of reliability used to assess the degree to which
different judges or raters agree in their assessment decisions. Inter-rater reliability is
useful because human observers will not necessarily interpret answers the same way;
raters may disagree as to how well certain responses or material demonstrate knowledge
of the construct or skill being assessed.
Example: Inter-rater reliability might be employed when different judges are evaluating
the degree to which art portfolios meet certain standards. Inter-rater reliability is
especially useful when judgments can be considered relatively subjective. Thus, the use
of this type of reliability would probably be more likely when evaluating artwork as
opposed to math problems.
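One common way to quantify inter-rater agreement while correcting for chance is Cohen's kappa (a standard index, though not named in the text above; the portfolio ratings below are invented):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Agreement expected by chance, from each rater's category frequencies
    expected = sum(c1[cat] * c2[cat] for cat in set(rater1) | set(rater2)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of eight art portfolios by two judges
judge_a = ["high", "high", "mid", "low", "mid", "high", "low", "mid"]
judge_b = ["high", "mid", "mid", "low", "mid", "high", "low", "low"]
print(round(cohens_kappa(judge_a, judge_b), 2))
```

Kappa of 1 means perfect agreement and 0 means agreement no better than chance, which is why it is preferred over raw percent agreement for subjective judgments.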
4. Internal consistency reliability
Internal consistency reliability is a measure of reliability used to evaluate the
degree to which different test items that probe the same construct produce similar
results.
A. Average inter-item correlation is a subtype of internal consistency reliability.
It is obtained by taking all of the items on a test that probe the same construct
(e.g., reading comprehension), determining the correlation coefficient for each
pair of items, and finally taking the average of all of these correlation
coefficients. This final step yields the average inter-item correlation.
B. Split-half reliability is another subtype of internal consistency reliability. The
process of obtaining split-half reliability is begun by “splitting in half” all items
of a test that are intended to probe the same area of knowledge (e.g., World War
II) in order to form two “sets” of items. The entire test is administered to a
group of individuals, the total score for each “set” is computed, and finally the
split-half reliability is obtained by determining the correlation between the two
total “set” scores.
Split-Half Reliability:
Requires only one administration and is especially appropriate when the test is very
long. The most commonly used method of splitting the test in two is the odd-even
strategy. Since longer tests tend to be more reliable, and since split-half reliability
represents the reliability of a test only half as long as the actual test, a correction
formula, the Spearman-Brown prophecy formula, must be applied to the coefficient.
Split-half reliability is a form of internal consistency reliability.
Internal Consistency Reliability:
Determining how all items on the test relate to all other items. The Kuder-Richardson
formula is an estimate of reliability that is essentially equivalent to the average of the
split-half reliabilities computed for all possible halves.
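For tests of dichotomous (right/wrong) items, this estimate is usually computed with Kuder-Richardson formula 20 (KR-20). A sketch, using an invented score matrix:

```python
def kr20(item_scores):
    """Kuder-Richardson formula 20 for tests of dichotomous (0/1) items."""
    n = len(item_scores)     # number of examinees
    k = len(item_scores[0])  # number of items
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    # Sum of item variances p * (1 - p)
    pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in item_scores) / n  # proportion passing item i
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

# Hypothetical 0/1 item scores: six examinees by six items
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0],
]
print(round(kr20(scores), 2))
```

Like split-half reliability, KR-20 needs only a single administration of the test.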
Rationale Equivalence Reliability:
Rationale equivalence reliability is not established through correlation but rather
estimates internal consistency by determining how all items on a test relate to all other
items and to the total test.
Standard Error of Measurement:
Reliability can also be expressed in terms of the standard error of measurement.
It is an estimate of how often you can expect errors of a given size.
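The text does not give the formula, but a commonly used one (assumed here) is SEM = SD x sqrt(1 - reliability):

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability coefficient)."""
    return sd * sqrt(1 - reliability)

# Hypothetical test: standard deviation 10, reliability coefficient 0.91
sem = standard_error_of_measurement(10, 0.91)
print(round(sem, 1))  # 3.0
# Roughly 68% of observed scores fall within 1 SEM of the true score
```

Note how the SEM shrinks as reliability rises: a perfectly reliable test (coefficient 1.0) would have an SEM of zero.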
Principles of Language Testing
1. What are the principles of language testing?
2. How can we define them?
3. What factors can influence them?
4. How can we measure them?
5. How do they interrelate?
Three Important Characteristics of Tests:
1. Reliability: consistency and freedom from extraneous sources of error
2. Validity: how well a test measures what it is supposed to measure, that is, whether
we measure what we intend to measure. For example, if math and vocabulary truly
represent intelligence, then a math and vocabulary test might be said to have high
validity when used as a measure of intelligence.
3. Practicality: the ease with which the test can be administered, scored, and
interpreted (discussed under Practicality below).

Estimating the Validity of a Measure:
1. A good measure must not only be reliable, but also valid
2. A valid measure measures what it is intended to measure
3. Validity is not a property of a measure, but an indication of the extent to which
an assessment measures a particular construct in a particular context—thus a
measure may be valid for one purpose but not another
4. A measure cannot be valid unless it is reliable, but a reliable measure may not be
valid
Content Validity:
1. Does the test contain items from the desired “content domain”?
2. Based on assessment by experts in that content domain.
3. Is especially important when a test is designed to have low face validity.
4. Is generally simpler for “other tests” than for “psychological constructs”
For example, it is easier for math experts to agree on an item for an algebra test than it
is for psych experts to agree on whether an item belongs in an EI (emotional
intelligence) measure or a personality measure.
5. Content Validity is not “tested for”. Rather it is assured by experts in the
domain.
Basic Procedure for Assessing Content Validity:
1. Describe the content domain
2. Determine the areas of the content domain that are measured by each test item
3. Compare the structure of the test with the structure of the content domain
For Example:
In developing a nursing licensure exam, experts on the field of nursing would identify
the information and issues required to be an effective nurse and then choose (or rate)
items that represent those areas of information and skills.
If a test is to measure foreign students' mastery of English sentence structure, an
analysis must first be made of the language itself, and decisions made about which
points need to be tested and in what proportions.
Face Validity
1. Face validity refers to the extent to which a measure ‘appears’ to measure what
it is supposed to measure
2. Not statistical—involves the judgment of the researcher (and the participants)
3. A measure has face validity—’if people think it does’
4. Just because a measure has face validity does not ensure that it is a valid
measure (and measures lacking face validity can be valid)
Relationship Between Reliability & Validity
Reliability and validity both contribute to the usefulness of a test. Though
different, they work together. It would not be beneficial to design a test with
good reliability that did not measure what it was intended to measure. The
inverse, accurately measuring what you desire to measure with a test that is so
flawed that results are not reproducible, is impossible. Reliability is a
necessary requirement for validity. This means that you have to have good
reliability in order to have validity. Reliability actually puts a cap, or limit,
on validity, and if a test is not reliable, it cannot be valid.
Establishing good reliability is only the first part of establishing validity.
Validity has to be established separately. Having good reliability does not mean
you have good validity, it just means you are measuring something consistently.
Now you must establish what it is that you are measuring consistently. The main
point here is reliability is necessary but not sufficient for validity. Tests that are
reliable are not necessarily valid or predictive. If the reliability of a
psychological measure increases, the validity of the measure is also expected to
increase.
FACTORS THAT INFLUENCE VALIDITY:
1. Inadequate sample
2. Items that do not function as intended
3. Improper arrangement/unclear directions
4. Too few items for interpretation
5. Improper test administration
6. Scoring that is subjective
Reliability is influenced by:
1. the longer the test, the more reliable it is likely to be [though there is a point of
no extra return]
2. items which discriminate will add to reliability, therefore, if the items are too
easy / too difficult, reliability is likely to be lower
3. if there is a wide range of abilities amongst the test takers, the test is likely to have higher reliability
4. the more homogeneous the items are, the higher the reliability is likely to be
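Point 1 in this list (test length) can be quantified with the Spearman-Brown prophecy formula, which predicts the reliability of a lengthened test from its current reliability. A minimal sketch in Python (the figures are illustrative):

```python
def spearman_brown(r, k):
    """Predicted reliability when a test's length is multiplied by k,
    given its current reliability r (Spearman-Brown prophecy formula)."""
    return (k * r) / (1 + (k - 1) * r)

# Doubling a test whose reliability is 0.60:
print(round(spearman_brown(0.60, 2), 3))  # 0.75
```

Note the diminishing returns: quadrupling the same test raises the estimate only to about 0.86, which matches the "point of no extra return" mentioned in point 1.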
Practicality:
The ease with which the test:
1. items can be replicated in terms of resources needed e.g. time, materials, people
2. can be administered
3. can be graded
4. results can be interpreted
Factors which can influence reliability, validity and practicality:
From the TEST:
1. quality of items
2. number of items
3. difficulty level of items
4. level of item discrimination
5. type of test methods
6. number of test methods
7. time allowed
8. clarity of instructions
9. use of the test
10. selection of content
11. sampling of content
12. invalid constructs
From the TEST TAKERS:
1. familiarity with test method
2. attitude towards the test i.e. interest, motivation, emotional/mental state
3. degree of guessing employed
4. level of ability
From the Test Administration
1. consistency of administration procedure
2. degree of interaction between invigilators and test takers
3. time of day the test is administered
4. clarity of instructions
5. test environment – light / heat / noise / space / layout of room
6. quality of equipment used e.g. for listening tests
From the Scoring
1. accuracy of the key e.g. does it include all possible alternatives?
2. inter-rater reliability e.g. in writing, speaking
3. intra-rater reliability e.g. in writing, speaking
4. machine vs. human
How can we measure reliability?
Test-retest: the same test administered to the same test takers following an interval of no more than two weeks
Inter-rater reliability: two or more independent estimates on a test e.g. written scripts
marked by two raters independently and results compared
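Both of these estimates reduce, in the simplest case, to a correlation between two sets of scores. A minimal Python sketch using hypothetical scores for five test takers:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two lists of scores, e.g. the same
    test taken twice (test-retest) or two raters' marks (inter-rater)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores on two administrations of the same test:
first = [68, 75, 82, 90, 58]
second = [70, 73, 85, 88, 60]
print(round(pearson(first, second), 2))  # 0.98
```

The same function applies to inter-rater reliability by passing two raters' independent marks for the same scripts.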
3. Practicality
11
CONSTRUCTING TESTS
Writing items requires decisions about the nature of the item or question to which we ask students to respond (that is, whether discrete or integrative), how we will score the item (for example, objectively or subjectively), the skill we purport to test, and so on. We
also consider the characteristics of the test takers and the test taking strategies
respondents will need to use. What follows is a short description of these considerations
for constructing items.
Test Items
A test item is a specific task test takers are asked to perform. Test items can assess one or more points or objectives, and the actual item itself may take on a different constellation depending on the context. For example, an item may test one point (understanding of a given vocabulary word) or several points (the ability to obtain facts
from a passage and then make inferences based on the facts). Likewise, a given
objective may be tested by a series of items. For example, there could be five items all
testing one grammatical point (e.g., tag questions). Items of a similar kind may also be
grouped together to form subtests within a given test.
Classifying Items
Discrete – A completely discrete-point item would test simply one point or objective
such as testing for the meaning of a word in isolation. For example:
Choose the correct meaning of the word paralysis.
(A) inability to move
(B) state of unconsciousness
(C) state of shock
(D) being in pain
Integrative – An integrative item would test more than one point or objective at a time.
(e.g., comprehension of words, and ability to use them correctly in context). For
example:
Demonstrate your comprehension of the following words by using them together in a
written paragraph: “paralysis,” “accident,” and “skiing.”
Sometimes an integrative item is really more a procedure than an item, as in the case of
a free composition, which could test a number of objectives; for example, use of
appropriate vocabulary, use of sentence level discourse, organization, statement of
thesis and supporting evidence. For example:
Write a one-page essay describing three sports and the relative likelihood of being
injured while playing them competitively.
Objective – A multiple-choice item, for example, is objective in that there is only one
right answer.
Subjective – A free composition may be more subjective in nature if the scorer is not
looking for any one right answer, but rather for a series of factors (creativity, style,
cohesion and coherence, grammar, and mechanics).
The Skill Tested
The language skills that we test include the more receptive skills on a continuum –
listening and reading, and the more productive skills – speaking and writing. There are,
of course, other language skills that cross-cut these four skills, such as vocabulary.
Assessing vocabulary will most likely vary to a certain extent across the four skills, with
assessment of vocabulary in listening and reading – perhaps covering a broader range
than assessment of vocabulary in speaking and writing. We can also assess nonverbal
skills, such as gesturing, and this can be both receptive (interpreting someone else’s
gestures) and productive (making one’s own gestures).
The Intellectual Operation Required
Items may require test takers to employ different levels of intellectual operation in order
to produce a response (Valette, 1969, after Bloom et al., 1956). The following levels of
intellectual operation have been identified:
knowledge (bringing to mind the appropriate material);
comprehension (understanding the basic meaning of the material);
application (applying the knowledge of the elements of language and comprehension to
how they interrelate in the production of a correct oral or written message);
analysis (breaking down a message into its constituent parts in order to make explicit
the relationships between ideas, including tasks like recognizing the connotative
meanings of words and correctly processing a dictation, and making inferences);
synthesis (arranging parts so as to produce a pattern not clearly there before, such as in
effectively organizing ideas in a written composition); and
evaluation (making quantitative and qualitative judgments about material).
It has been popularly held that these levels demand increasingly greater cognitive control as one moves from knowledge to evaluation – that, for example, effective operation at more advanced levels, such as synthesis and evaluation, would call for more advanced control of the second language. Yet this has not necessarily been borne out by research (see Alderson & Lukmani, 1989). The truth is that what makes items difficult sometimes defies the intuitions of the test constructors.
The Tested Response Behavior
Items can also assess different types of response behavior. Respondents may be tested
for accuracy in pronunciation or grammar. Likewise, they could be assessed for fluency,
for example, without concern for grammatical correctness. Aside from accuracy and
fluency, respondents could also be assessed for speed – namely, how quickly they can
produce a response, to determine how effectively the respondent replies under time pressure. In recent years, there has also been an increased concern for developing
measures of performance – that is, measures of the ability to perform real-world tasks,
with criteria for successful performance based on a needs analysis for the given task
(Brown, 1998; Norris, Brown, Hudson, & Yoshioka, 1998).
Performance tasks might include “comparing credit card offers and arguing for the best
choice” or “maximizing the benefits from a given dating service.” At the same time that
there is a call for tasks that are more reflective of the real world, there is a
commensurate concern for more authentic language assessment. At least one study,
however, notes that the differences between authentic and pedagogic written and spoken
texts may not be readily apparent, even to an audience specifically listening for
differences (Lewkowicz, 1997). In addition, test takers may not necessarily concern
themselves with task authenticity in a test situation. Test familiarity may be the
overriding factor affecting performance.
Characteristics of Respondents
Items can be designed to be appropriate for groups of test-takers with differing
characteristics. Bachman and Palmer (1996: 64-78) classify these characteristics into
four categories: the personal characteristics of the respondents – for example, their age,
gender, and native language; the knowledge of the topic that they bring to the language
testing situation; their affective schemata (that is, their prior likes and dislikes with
regard to assessment); and their language ability.
Research into the impact of these characteristics continues. For example, with regard to
the age variable, researchers have suggested that educators revisit this issue and perhaps
conceive of new ways to consider the impact of the age variable in assessing language
ability (Marinova-Todd, Marshall, & Snow, 2000). With regard to performance on
language measures, it would appear that age interacts with other variables such as
attitudes, motivation, the length of exposure to the target language, as well as the nature
and quality of language instruction (see García Mayo & García Lecumberri, 2003).
With regard to language ability, both Bachman and Palmer (1996) and Alderson (2000)
detail the many types of knowledge that respondents may need to draw on to perform
well on a given item or task:world knowledge and culturally-specific knowledge,
knowledge of how the specific grammar works, knowledge of different oral and written
text types, knowledge of the subject matter or topic, and knowledge of how to perform
well on the given task.
Item-Elicitation Format
The format for item elicitation has to be determined for any given item. An item can
have a spoken, written, or visual stimulus, as well as any combination of the three.
Thus, while an item or task may ostensibly assess one modality, it may also be testing another as well. So, for example, a subtest referred to as “listening” which has
respondents answer oral questions by means of written multiple-choice responses is
testing reading as well as listening. It would be possible to avoid introducing this
reading element by having the multiple-choice alternatives presented orally as well. But
then the tester would be introducing yet another factor, namely, short-term memory
ability, since the respondents would have to remember all the alternatives long enough
to make an informed choice.
Item-Response Format
The item-response format can be fixed, structured, or open-ended. Item responses with a fixed format include true/false, multiple-choice, and matching items. Item responses that call for a structured format include ordering (where respondents are asked to arrange words to make a sentence, and several orders are possible), duplication, both written (e.g., dictation) and oral (e.g., recitation, repetition, mimicry), identification (explaining the part of speech of a form), and completion. Item responses calling for an open-ended format include composition, both written (e.g., creative fiction, expository essays) and oral (e.g., a speech), as well as other activities, such as free oral response in role-playing situations.
Grammatical competence
According to Canale and Swain (1980, p. 29), grammatical competence includes
phonology, morphology, syntax, knowledge of lexical items, and semantics, as well as
matters of mechanics (spelling, punctuation, capitalization, and handwriting). It would
seem that this definition is perhaps too broad for practical purposes. A truly perplexing
issue is determining what constitutes a grammatical error, as well as determining the
severity of this error. In other words, will the use of the error stigmatize the speaker?
Let us say that we are using a grammatical scale which deals with how acceptably
words, phrases, and sentences are formed and pronounced in the respondents'
utterances. Let us assume that the focus is on both of the following: clear cases of
errors in form, such as the use of the present perfect for an action completed in the past
(e.g., "We have had a great time at your house last night."), and matters of style, such
as the use of a passive verb form in a context where a native would use the active form
(e.g., Question - “What happened to the CD I lent you, Jorge?” Reply - "The CD was
lost." vs. "I lost your CD.").
Major grammatical errors might be considered those that either interfere with
intelligibility or stigmatize the speaker. Minor errors would be those that do not get in
the way of the listener's comprehension nor would they annoy the listener to any
extent.Thus, getting the tense wrong in the above example, "We have had a great time at
your house last night" could be viewed as a minor error, whereas in another case,
producing "I don't have what to say" ("I really have no excuse" by translating directly
from the appropriate Hebrew language) could be considered a major error since it is not
only ungrammatical but also could stigmatize the speaker as rude and unconcerned,
rather than apologetic.
Rationale for Tests:
Measures of student performance (testing) may have as many as five purposes:
Student Placement,
Diagnosis of Difficulties,
Checking Student Progress,
Reports to Student and Superiors,
Evaluation of Instruction.
Unfortunately, the most common perception is that tests are designed to statistically rank all students according to a sampling of their knowledge of a subject and to report that ranking to superiors or anyone else interested in using that information to adversely influence the student's feeling of self-worth. It is even more unfortunate that this perception matches reality in the majority of testing situations. Consequently, tests are highly stressful, anxiety-producing events for most people.
All too often tests are constructed to determine how much a student knows rather than what he/she must learn. Frequently tests are designed to "trap" the student, and in still other situations tests are designed to ensure a "bell curve" distribution of results. Most of the other numerous testing designs and strategies fail to help the student in the learning process and in many cases are quite detrimental to that process.
In a Mastery Based system of instruction the two main reasons for testing are to
determine mastery and to diagnose difficulties. When tests are constructed for these
purposes, the other four purposes will also be satisfied. For example, consider a test
which requires the student to demonstrate mastery and at the same time rigorously
diagnoses learning difficulties. If no difficulties are indicated, it may be safely assumed
that the learner has mastered the concept. That information may then be used to record
student progress and to make reports to the student and superiors. Examining student
performance collectively for a group of students provides information about the quality
of instruction. Examining a single student's performance collectively for a group of
learning objectives may be used to determine proper placement within that group of
learning objectives.
It is therefore important that the instructional developer construct each question so that a
correct response indicates mastery of the learning objective and any incorrect response
provides information about the nature of the student's lack of mastery. Furthermore,
each student should have ample opportunity to "inform" the instructor of any form of
lack of mastery. Unfortunately the mere presence of a test question influences the
student's response to the question. The developer should minimize that influence by
constructing questions which permit the student to make any error he would make in the
absence of such influence. For example, a multiple choice question should have all the
wrong answers the student might want to select and should also have as many correct
answers as the student might want to provide.
True/False Questions:
True/false questions should be written without ambiguity. That is, the statement of the
question should be clear and the decision whether the statement is true or false should
not depend on an obscure interpretation of the statement. A true/false question may
easily be used, and most commonly is used, to determine if the student recalls facts.
However, a true/false question may also be used to determine if the learner has mastered
the learning objective well enough to correctly analyze a statement.
It is important to be aware that only two choices are available to the student and
therefore the nature of the question gives the student a 50% chance of being correct. A
single True/False question therefore is helpful only if the student answers the question
incorrectly and the incorrect response indicates a specific misunderstanding of the
learning objective. A collection of true/false questions, about a single learning
objective, all answered correctly by a student is a much stronger indication of mastery.
It is therefore important that the instructional developer construct a "test bank"
containing a large number of true/false questions. It is also important to include
numerous true/false questions on any test which utilizes true/false questions. Ideally a
true/false question should be constructed so that an incorrect response indicates
something about the student's misunderstanding of the learning objective. This may be a
difficult task, especially when constructing a true statement. The instructional developer
should try to accomplish the ideal, but should recognize that in some instances he/she
will not reach that goal.
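The guessing problem described above can be made precise: the chance of passing n independent true/false items by pure guessing follows a binomial distribution with p = 0.5. A short illustrative sketch in Python:

```python
from math import comb

def p_all_correct_by_guessing(n):
    """Probability that pure guessing answers all n true/false items
    correctly (each item is an independent 50/50 chance)."""
    return 0.5 ** n

def p_at_least_k_by_guessing(n, k):
    """Probability of k or more correct out of n by pure guessing
    (binomial tail with p = 0.5)."""
    return sum(comb(n, i) for i in range(k, n + 1)) * 0.5 ** n

print(p_all_correct_by_guessing(1))  # 0.5 -- a single item tells us little
print(p_all_correct_by_guessing(5))  # 0.03125
```

With ten items, even scoring 8 or more by guessing alone has only about a 5.5% probability, which is why a collection of items answered correctly is a much stronger indication of mastery than any single item.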
Multiple Choice Questions:
Multiple choice questions should be written without ambiguity. That is, the statement of
the question stem should be clear and should leave no doubt about how to select
choices. Additionally the choices should be written without ambiguity and should
contain all information required to make a decision whether or not to choose it. The
decision whether to select or not select a choice should not depend on an obscure
interpretation of either the stem or the choice. A multiple choice question may easily be
used to determine if the student recalls facts. However, a multiple choice question may
also be used to determine if the student has mastered the learning objective well enough
to correctly analyze a statement.
The instructional developer should not construct multiple choice questions with a
uniform number of choices, a uniform number of valid choices, or any other
recognizable pattern for construction of choices. Instead the instructional developer
should include as many valid and invalid choices as is required to determine the
student's deficiencies with respect to the learning objective. Moreover, each choice
should appear to be a valid choice to some student.
Multiple choice questions should therefore contain any number of choices with one or
more valid choices. The student is of course required to select all valid choices and
failure to select any one of the valid choices will provide information about the student's
misunderstanding of the learning objective in the same way that selection of an invalid
choice reveals the nature of his/her misunderstanding. The nature of the choices
provided in a multiple choice question may be of two types: those which require merely
recall of facts and those which additionally require activity such as synthesis, analysis,
computation, comparison, or diagramming. The instructional developer who is seriously
concerned with the student's success will use both types extensively.
Fill-in-the-Blank Questions:
The temptation, when constructing fill-in-the-blank questions, is to construct traps for the student. The instructional developer should avoid this problem. Ensure that there is only one acceptable word for the student to provide and that the word (or words) is significant. Avoid asking the student to supply "minor" words. Avoid fill-in-the-blank questions with so many blanks that the student is unable to determine what is to be completed.
Sometime/Always/Never Questions:
Sometime/Always/Never (SAN) questions use statements which are true sometimes, always true, or never true. The statements used in these questions must be worded carefully and should contain enough information to permit the student to decide whether the statement is true sometimes, always, or never.
SAN questions (especially the sometimes statements) are the most difficult to construct
but can be the most significant part of a test. SAN questions should be constructed to
force the student to engage in some critical thinking about the learning objective. When
used properly, SAN questions force the student to consider important details about the
learning objective. Careful use of this type of question and careful analysis of student's
response will provide detailed information about some of the student's deficiencies.
SAN questions are especially appropriate, and easy to construct, for learning objectives
addressing concepts which are "black" or "white" except in a few cases. The true
statements in a collection of true/false questions are, of course, always-true statements, while the set of false statements may be further subdivided into those which are true sometimes and those which are never true.
Test Construction
Closed-Answer or “Objective” Tests
Although by definition no test can be truly “objective” (existing as an object of fact,
independent of the mind), this handbook refers to tests made up of multiple choice,
matching, fill-in, true/false, or fill-in-the-blank items as objective tests. Objective tests
have the advantages of allowing an instructor to assess a large and potentially
representative sample of course material and allow for reliable and efficient scoring.
The disadvantages of objective tests include a tendency to emphasize only “recognition”
skills, the ease with which correct answers can be guessed on many item types, and the
inability to measure students’ organization and synthesis of material (Adapted with
permission from Yonge, 1977).
Since the practical arguments for giving objective exams are compelling, we offer a few
suggestions for writing multiple-choice items. The first is to find and adapt existing test
items. Teachers’ manuals containing collections of items accompany many textbooks.
(AIs: Your course supervisor or former teachers of the same course may be willing to
share items with you.) However, the general rule is adapt rather than adopt. Existing
items will rarely fit your specific needs; you should tailor them to more adequately
reflect your objectives.
Second, design multiple choice items so that students who know the subject or material
adequately are more likely to choose the correct alternative and students with less
adequate knowledge are more likely to choose a wrong alternative. That sounds simple
enough, but you want to avoid writing items that lead students to choose the right
answer for the wrong reasons. For instance, avoid making the correct alternative the
longest or most qualified one, or the only one that is grammatically appropriate to the
stem. Even a careless shift in tense or verb-subject agreement can often suggest the
correct answer.
Finally, it is very easy to disregard the above advice and slip into writing items which
require only rote recall but are nonetheless difficult because they are taken from obscure
passages (footnotes, for instance). Some items requiring only recall might be
appropriate, but try to design most of the items to tap the students’ understanding of the
subject (Adapted with permission from Farris, 1985). One way to write multiple choice
questions that require more than recall is to develop questions that resemble miniature
“cases” or situations. Provide a small collection of data, such as a description of a
situation, a series of graphs, quotes, a paragraph, or any cluster of the kinds of raw
information that might be appropriate material for the activities of your discipline. Then
develop a series of questions based on that material. These questions might require
students to apply learned concepts to the case, to combine data, to make a prediction on
the outcome of a process, to analyze a relationship between pieces of the information, or
to synthesize pieces of information into a new concept.
Here are a few additional guidelines to keep in mind when writing multiple-choice tests
(Adapted with permission from Yonge, 1977):
The item-stem (the lead-in to the choices) should clearly formulate a problem.
As much of the question as possible should be included in the stem.
Randomize occurrence of the correct response (e.g., you don’t always want “C” to be
the right answer).
Make sure there is only one clearly correct answer (unless you are instructing
students to select more than one).
Make the wording in the response choices consistent with the item stem.
Don’t load down the stem with irrelevant material.
Beware of using answers such as “none of these” or “all of the above.”
Use negatives sparingly in the question or stem; do not use double negatives.
Beware of using sets of opposite answers unless more than one pair is presented (e.g.,
go to work, not go to work).
Beware of providing irrelevant grammatical cues.
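The guideline about randomizing the position of the correct response can be automated when assembling test forms. A hypothetical sketch (the function and item here are illustrative, not part of any particular testing package):

```python
import random

def shuffle_item(stem, choices, answer_index):
    """Shuffle an item's choices and return the letter of the correct
    answer's new position. All names here are illustrative."""
    order = list(range(len(choices)))
    random.shuffle(order)
    shuffled = [choices[i] for i in order]
    key = "ABCDEFGH"[order.index(answer_index)]
    return stem, shuffled, key

random.seed(7)  # fixed seed so a form can be regenerated exactly
stem, opts, key = shuffle_item(
    "Choose the correct meaning of the word paralysis.",
    ["inability to move", "state of unconsciousness",
     "state of shock", "being in pain"],
    0,  # index of the correct choice before shuffling
)
print(key, opts["ABCDEFGH".index(key)])
```

Keeping the answer key as a returned value, rather than a fixed letter, is what prevents "C" from always being the right answer across a set of items.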
Grading of multiple choice exams can be done by hand or through the use of computer
scannable answer sheets available from your departmental office. Take completed
answer sheets to IUB Evaluation Services and Testing (BEST) located in Franklin Hall
M014. If you have your test scored by BEST, they will provide statistics on difficulty
and reliability, which will help you to improve your tests.
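The difficulty and reliability statistics such a service reports can also be computed by hand. A sketch of two classical item statistics, facility (difficulty) and upper-lower discrimination, using hypothetical data:

```python
def item_difficulty(item_scores):
    """Facility value: the proportion of test takers who answered the
    item correctly (scores are 1 = correct, 0 = incorrect)."""
    return sum(item_scores) / len(item_scores)

def item_discrimination(item_scores, total_scores, fraction=0.27):
    """Upper-lower discrimination index: the item's facility in the
    top-scoring group minus its facility in the bottom-scoring group,
    with groups formed from total test scores."""
    n = max(1, round(len(total_scores) * fraction))
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    low, high = order[:n], order[-n:]
    facility = lambda group: sum(item_scores[i] for i in group) / n
    return facility(high) - facility(low)

# Hypothetical data for one item across ten test takers:
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
totals = [55, 60, 72, 40, 80, 35, 42, 90, 30, 65]
print(item_difficulty(item))            # 0.6
print(item_discrimination(item, totals))
```

A positive discrimination index means stronger students tend to get the item right; items near zero or negative do not discriminate and, as noted earlier, lower the test's reliability.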
If you choose the computer-grading route, you must be sure students have number 2
pencils to mark answers on their sheets. These are often available from your
department’s main office. At the time of the exam it is helpful to write on the
chalkboard all pertinent information required on the answer sheet (course name, course
number, section number, instructor’s name, etc.). Also, remind students to fill in their
university identification numbers carefully so that you can have a roster showing the ID
number and grade for each student. If you would like to consult with someone about
developing test items, call the Center for Innovative Teaching and Learning at 855-9023.
If you would like to consult with someone about how to interpret your test results, call
BEST at 855-1595.
Essay Tests
Conventional wisdom accurately portrays short-answer and essay examinations as the
easiest to write and the most difficult to grade, particularly if they are graded well. You
should give students an exam question for each crucial concept that they must
understand.
If you want students to study in both depth and breadth, don't give them a choice among
topics. This allows them to choose not to answer questions about those things they
didn’t study. Instructors generally expect a great deal from students, but remember that
their mastery of a subject depends as much on prior preparation and experience as it
does on diligence and intelligence; even at the end of the semester some students will be
struggling to understand the material. Design your questions so that all students can
answer at their own levels.
The following are some suggestions that may enhance the quality of the essay tests that
you produce (Adapted with permission from Ronkowski, 1986):
1. Have in mind the processes that you want measured (e.g., analysis, synthesis).
2. Start questions with words such as “compare,” “contrast,” “explain why.” Don’t
use “what,” “when,” or “list.” (These latter types are better measured with
objective-type items). Writing Tutorial Services, Ballantine Hall 207, 855-6738,
has a handout for students which defines these terms and explains how to study
for and respond to essay questions.
3. Write items that define the parameters of expected answers as clearly as
possible.
4. Make sure that the essay question is specific enough to invite the level of detail
you expect in the answer. A question such as “Discuss the causes of the
American Civil War,” might get a wide range of answers, and therefore be
impossible to grade reliably. A more controlled question would be, “Explain
how the differing economic systems of the North and South contributed to the
conflicts that led to the Civil War.”
5. Don’t have too many questions for the time available.
12
TYPES OF LISTENING TESTING
1. DISCRIMINATIVE LISTENING
Discriminative listening is an awareness of changes in the pitch and loudness of sounds; it means determining whether sounds are different or the same. These activities are designed to enhance this listening skill:
1) Same or different? - Call out two words and have the children determine if they are the same or different. For example, say bat/bat, bat/bet.
2) Rhyming words - Practice rhyming discriminative listening skills by calling out a few rhyming words, such as “hat, bat, rat, cat,” and so on. Have the children take turns calling out a word that rhymes with “at,” as well as other rhyming words you want to use.
3) What’s the problem? - After reading a storybook to children (one that’s very familiar to them), have them tell you what the problem is. As you read the story, change things around so the story is different somehow, to see if they catch the changes and can tell you what the problem is.
4) Musical moods - Play music, but change it up by varying the pace: make it fast, slow, loud, soft, high, and low. Have the children tell you when a sound change is made and what the change is.
5) Clap it out - After talking about the syllables of words, clap out the syllables of some words you call out, starting with a two-syllable word, then three, and so on. Repeat a word at least twice (or more if needed) so the concept is fully grasped.
Discriminative listening has to do with the identification of different variations in sounds and words in order to understand different messages. This is the most important kind of listening, and it spans all the other forms of listening. It involves being sensitive to pitch, volume, emphasis, and rate of speech in order to detect messages that may be hidden. This form of listening usually requires one to be
efficient in two factors: have a good hearing ability and the knowledge of sound
structure (Kline, 2010).
Hearing ability
The ability to hear helps in sound differentiation; therefore, if one can hear well, there is a high likelihood that one can get the message well (Lengel, 1998).
Knowledge of sound structure
The knowledge of sound structure enables an individual to differentiate different sounds and to tell what is being said. For example, hearing the difference between "I would rank it first" and "I drank it first" requires this kind of ability in order to get the message clearly. In conclusion, there are various forms of listening, including listening for the sake of making critical evaluations, building relationships, making discriminations, and obtaining information or gaining appreciation, and each of these needs calls for a different form of listening. These forms of listening depend on basic factors such as concentration, attention, memory, perception, experience, presentation style, and the determination of ethos, pathos, and logos. The lack of these factors may mean that no communication takes place.
Examples of discriminative listening
Exercise
Identify whether the sounds are different:
1) “I would rank it first” and “I drank it first”
2) bat/bat, bat/bet.
3) Safe/save
4) Made/mate
5) Age/h
COMPREHENSION LISTENING
The next step beyond discriminating between different sounds and sights is to make
sense of them. Comprehending meaning requires, first, having a lexicon of words at
our fingertips and, second, the rules of grammar and syntax by which we can understand
what others are saying.
The same is true, of course, for the visual components of communication, and an
understanding of body language helps us understand what the other person really
means.
In communication, some words are more important and some less so, and
comprehension often benefits from the extraction of key facts and items from a long
spiel. Comprehension listening is also known as content listening, informative listening
and full listening.
Listening Comprehension Sample Questions Transcript
Sample Item A
On the recording, you will hear:
(Narrator): Listen to a high school principal talking to the school's students.
(Man): I have a very special announcement to make. This year, not just
one, but three of our students will be receiving national awards for
their academic achievements. Krista Conner, Martin Chan, and Shriya
Patel have all been chosen for their hard work and consistently high
marks. It is very unusual for one school to have so many students
receive this award in a single year.
(Narrator): What is the subject of the announcement?
In your test book, you will read:
1. What is the subject of the announcement?
A. The school will be adding new classes.
B. Three new teachers will be working at the school.
C. Some students have received an award.
D. The school is getting its own newspaper.
Sample Item B
On the recording, you will hear:
(Narrator): Listen to a teacher making an announcement at the end of the day.
(Man): Remember that a team of painters is coming in tomorrow to paint the
walls. In this box on my desk are sheets of plastic that I want you to
slip over your desks. Make sure you cover your desks completely so
that no paint gets on them. Everything will be finished and the plastic
will be removed by the time we return on Monday.
(Narrator): What does the teacher want the students to do?
In your test book, you will read:
2. What does the teacher want the students to do?
A. Take everything out of their desks
B. Put the painting supplies in plastic bags
C. Bring paints with them to school on Monday
D. Put covers on their desks to keep the paint off
Sample Set A
On the recording, you will hear:
(Narrator): Listen to a conversation between two friends at school.
(Boy): Hi, Lisa.
(Girl): Hi, Jeff. Hey, have you been to the art room today?
(Boy): No, why?
(Girl): Well, Mr. Jennings hung up a notice about a big project that's going
on downtown. You know how the city's been doing a lot of work to
fix up Main Street—you know, to make it look nicer? Well, they're
going to create a mural.
(Boy): You mean, like, make a painting on the entire wall of a building?
(Girl): It's that big wall on the side of the public library. And students from
this school are going to do the whole thing ... create a design, and
paint it, and everything. I wish I could be a part of it, but I'm too
busy.
(Boy): [excitedly] Cool! I'd love to help design a mural. Imagine everyone in
town walking past that wall and seeing my artwork, every day.
(Girl): I thought you'd be interested. They want the mural to be about nature,
so I guess all the design ideas students come up with should have a
nature theme.
(Boy): That makes sense—they've been planting so many trees and plants
along the streets and in the park.
(Girl): If you're interested you should talk with Mr. Jennings.
(Boy): [half listening, daydreaming] This could be so much fun. Maybe I'll
try to visit the zoo this weekend ... you know, to see the wild animals
and get some ideas, something to inspire me!
(Girl): [with humor] Well maybe you should go to the art room first to get
more information from Mr. Jennings.
(Boy): [slightly sheepishly] Oh yeah. Good idea. Thanks for letting me
know, Lisa! I'll go there right away.
(Narrator): Now answer the questions.
In your test book, you will read:
3. What are the speakers mainly discussing?
A. A new art project in the city
B. An assignment for their art class
C. An art display inside the public library
D. A painting that the girl saw downtown
4. Why is the boy excited?
A. A famous artist is going to visit his class.
B. His artwork might be seen by many people.
C. His class might visit an art museum.
D. He is getting a good grade in his art class.
5. Where does the boy say he may go this weekend?
A. To the zoo
B. To an art store
C. To Main Street
D. To the public library
6. Why does the girl suggest that the boy go to the art room?
A. So that he can hand in his homework
B. So that he can sign up for a class trip
C. So that he can see a new painting
D. So that he can talk to the teacher
Sample Set B
On the recording, you will hear:
Script Text:
(Narrator): Listen to a teacher talking in a biology class.
(Woman): We've talked before about how ants live and work together in huge
communities. Well, one particular kind of ant community also grows its
own food. So you could say these ants are like farmers. And
what do these ants grow? They grow fungi [FUN-guy]. Fungi are kind
of like plants—mushrooms are a kind of fungi. These ants have gardens,
you could say, in their underground nests. This is where the fungi are
grown.
Now, this particular kind of ant is called a leafcutter ant. Because of
their name, people often think that leafcutter ants eat leaves. If they cut
up leaves they must eat them, right? Well, they don't! They actually use
the leaves as a kind of fertilizer. Leafcutter ants go out of their nests
looking for leaves from plants or trees. They cut the leaves off and carry
them underground . . . and then feed the leaves to the fungi—the fungi
are able to absorb nutrients from the leaves. What the ants eat are the
fungi that they grow. In that way, they are like farmers!
The amazing thing about these ants is that the leaves they get are often
larger and heavier than the ants themselves. If a leaf is too large,
leafcutter ants will often cut it up into smaller pieces—but not all the
time. Some ants carry whole leaves back into the nest. In fact, some
experiments have been done to measure the heaviest leaf a leafcutter ant
can lift without cutting it. It turns out, it depends on the individual ant.
Some are stronger than others. The experiments showed that some
"super ants" can lift leaves about 100 times the weight of their body!
(Narrator): Now answer the questions.
In your test book, you will read:
7. What is the main topic of the talk?
A. A newly discovered type of ant
B. A type of ant with unusual skills
C. An increase in the population of one type of ant
D. A type of ant that could be dangerous to humans
8. According to the teacher, what is one activity that both leafcutter ants and
people do?
A. Clean their food
B. Grow their own food
C. Eat several times a day
D. Feed their young special food
9. What does the teacher say many people think must be true about leafcutter
ants?
A. They eat leaves.
B. They live in plants.
C. They have sharp teeth.
D. They are especially large.
10. What did the experiments show about leafcutter ants?
A. How fast they grow
B. Which plants they eat
C. Where they look for leaves
D. How much weight they can carry
Answer Key for Listening Comprehension
1. C
2. D
3. A
4. B
5. A
6. D
7. B
8. B
9. A
10. D
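Items like the ten above lend themselves to machine scoring: each response is simply compared against the key. The following is a minimal sketch; the candidate's responses are invented for illustration.

```python
# Score a set of multiple-choice responses against the answer key above.
ANSWER_KEY = {1: "C", 2: "D", 3: "A", 4: "B", 5: "A",
              6: "D", 7: "B", 8: "B", 9: "A", 10: "D"}

def score(responses, key=ANSWER_KEY):
    """Return (raw score, percentage) for one candidate's responses."""
    correct = sum(1 for item, ans in responses.items() if key.get(item) == ans)
    return correct, 100.0 * correct / len(key)

# A hypothetical candidate who misses items 5 and 9:
candidate = {1: "C", 2: "D", 3: "A", 4: "B", 5: "B",
             6: "D", 7: "B", 8: "B", 9: "C", 10: "D"}
raw, pct = score(candidate)
print(raw, pct)  # 8 80.0
```

The same dictionary-based key scales to any number of objectively scored items, which is one reason multiple-choice formats are cheap to mark.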
1. CRITICAL LISTENING
Critical listening is listening in order to evaluate and judge, forming an opinion about what
is being said. Judgment includes assessing strengths and weaknesses, agreement and
approval.
This form of listening requires significant real-time cognitive effort as the listener
analyzes what is being said, relating it to existing knowledge and rules, whilst
simultaneously listening to the ongoing words from the speaker.
2. BIASED LISTENING
Biased listening happens when the person hears only what they want to hear, typically
misinterpreting what the other person says based on the stereotypes and other biases that
they have. Such biased listening is often very evaluative in nature.
3. EVALUATIVE LISTENING
In evaluative listening, or critical listening, we make judgments about what the other
person is saying. We seek to assess the truth of what is being said. We also judge what
they say against our values, assessing them as good or bad, worthy or unworthy.
Evaluative listening is particularly pertinent when the other person is trying to persuade
us, perhaps to change our behavior and maybe even to change our beliefs. Within this,
we also discriminate between subtleties of language and comprehend the inner meaning
of what is said. Typically also we weigh up the pros and cons of an argument,
determining whether it makes sense logically as well as whether it is helpful to
us. Evaluative listening is also called critical, judgmental or interpretive listening.
4. APPRECIATIVE LISTENING
In appreciative listening, we seek out information that we will appreciate, for
example information that helps meet our needs and goals. We use appreciative listening when
we are listening to good music, poetry or maybe even the stirring words of a great
leader. Students use appreciative listening when they listen to material such as the
following and seek out information that they will appreciate.
Adventure Quotient (AQ) Test 77 questions, 30 min
How adventurous are you? Thrill-seeking can come in different forms, whether it's
doing a swan dive bungee jump off the Auckland Harbour Bridge in New Zealand, or
trying that new exotic restaurant around the corner from work. The type of adventure
you enjoy (or avoid) depends a great deal on your personality. Are you more of a
planner or spontaneous? Courageous or careful? Do you have the energy level of a bee
or a sloth? Find out more about your adventure personality with this test!
Examine the following statements and choose the answer option that best applies to you.
There may be some questions describing situations that may not be relevant to you. In
such cases, select the answer you would most likely choose if you ever found yourself
in that type of situation. In order to receive the most accurate results, please answer as
truthfully as possible. After finishing the test, you will receive a Snapshot Report with an
introduction, a graph and a personalized interpretation for one of your test scores. You
will then have the option to purchase the full results.
Adventure Quotient (AQ) Test 50 questions, 30 min
1. I _____ repetitive tasks.
enjoy
don't mind
can't stand
2. I take pride in my appearance and upkeep.
Agree
Somewhat agree
Disagree
3. I have already been or would consider any of the following: skydiving, bungee
jumping, hang gliding, or free climbing.
Definitely
Maybe
No way
4. I see getting away from it all as a chance to:
Connect with people and places
Connect with myself
5. I would travel to a developing country and leave the airport/train station:
With pleasure.
Only with a friend.
Only with a hired guide.
6. I seek new experiences more...
To learn about new places, people, and things.
For the way they make me feel
7. I am more likely to ask myself:
"When is break time?"
"What's next?"
8. I am more likely to get my thrills from:
Doing something physically or emotionally gutsy
Watching someone else do something physically or emotionally gutsy
9. The lowest comfort I would consider for sleeping is:
Outside on the ground.
A tent.
An RV or camper.
A motel.
A bed and breakfast.
A furnished apartment or house.
A 3 or 4 star hotel.
10. Adrenaline is a chemical that:
I avoid
I enjoy from time to time
I seek
I am addicted to
11. Having a daily routine is:
Oppressive and stifling
Annoying and limiting
Sometimes a good thing, sometimes not
Helpful and comforting
Totally necessary
12. Not knowing what the future might hold is:
Terrifying
A little disconcerting, but that's just the way life is
Exhilarating
13. At a theme park, I'll try:
The highest, scariest ride
Something fast, but no upside-down stuff
The kiddie train or merry-go-round
The park bench
14. Having nice things and looking good is important to me
Extremely
Somewhat
Not very
15. A life without luxury is:
Not worth living
Difficult to imagine.
Perfectly acceptable
Expected.
16. Knowing what others think of me is:
Essential
Important
Helpful
Not important
17. When visiting new places, I am more interested in:
Soaking up the environment
Interacting with people
18. Others are more likely to wonder...
Where my energy goes.
Where my energy comes from
19. Life's experiences are most rich and interesting when I contemplate them...
With others
In my own mind
An old friend is in town. Where are you most likely to eat?
A. We'd eat at:
A fast food joint
An ethnic café
B. We'd eat at:
A themed restaurant or dinner theater
At the kitchen table in my house, warming up something in the microwave
C. We'd eat at a:
Chain restaurant
Upscale restaurant
You inherit $100,000 from a distant uncle. What are you more likely to do with it?
D. I'd take my wallet out and:
Go on an epic shopping spree
Donate some, or all, to charity
E. I'd take my wallet out and:
Go on a casino fling
Put it in the bank
F. I'd take my wallet out and:
Go on a dream vacation
Throw a gigantic party
It's time to learn something new. Which class would you be most interested in taking
up?
G. I would rather take:
Acting classes
Creative writing classes
H. I would rather take:
Survival skills classes
Speed reading classes
I. I would rather take:
Kickboxing classes
Tai Chi classes
Which of the following would you rather visit or spend some time in?
J. I would rather go to:
An Inuit igloo
A Buddhist monastery
K. I would rather go to:
An African hut
A European hostel
L. I would rather go to:
A Japanese pagoda
A California spa
Pick your preferred pet
M. I'd rather have a:
Parrot
Hamster
N. I'd rather have a:
Goldfish
Snake
O. I'd rather have a:
Tarantula
Horse
Which is your preferred adrenaline rush?
P. There's nothing like the thrill of:
A looming deadline
A charging rhino
Q. There's nothing like the thrill of:
Running cross-country
Running with the bulls
R. There's nothing like the thrill of:
Swimming with dolphins
Swimming with sharks
Which is your preferred adrenaline rush?
S. There's nothing like the thrill of:
Finding something I really like on sale.
Finding an ancient Egyptian artifact in Valley of the Kings
T. There's nothing like the thrill of:
Cycling or hiking
Taking a scenic drive
U. There's nothing like the thrill of:
Getting a tattoo or piercing
Skydiving or hang gliding
Pick the adjective that best describes you.
V. I am more:
Bold
Timid
W. I am more:
Impulsive
Deliberate
X. I am more:
Of an improviser
Of a planner
Y. What's your most favorite way to get from point A to point B?
First class or Business class
The scenic railroad route
Automobile - the classic "road trip"
Budget airline - who needs legroom?
Tour bus - sit back and relax
An all-terrain vehicle. No road? No problem
My bike - and the wind in my hair
Z. What's your comfort zone when it comes to heights?
Top shelf of the bookcase
The 3-meter diving board
A bungee jump
A skydive
A spacewalk
AA. What is the one form of footwear you could never live without?
Skis
Cycling shoes
Stiletto heels
Cross-trainers
Walking shoes
Flip-flops
Hiking boots
Dress shoes
BB. Which voice mail message are you most likely to leave on a friend's phone?
"How about a movie and some take out?"
"Got an extra ticket to a show, let's go!"
"Party of the century! Pick you up at 9."
"Meet me at the airport with a suitcase and your passport."
CC. Which phrase do you agree with more?
"Better safe than sorry."
"Nothing ventured, nothing gained."
DD. How much of Mother Nature's wrath will you endure for adventure?
Monsoon, tornado, ice storm - bring it on!
Thundershowers, extreme hot and cold
Some wind, clouds, and drizzle
If it's not blue skies, forget it
EE. How often do you pick up new fashions?
Daily to weekly.
Monthly to yearly.
Every decade or so
5. SYMPATHETIC LISTENING
In sympathetic listening we care about the other person and show this concern in
the way we pay close attention and express our sorrow for their ills and happiness at
their joys.
EMPATHETIC LISTENING
When we listen empathetically, we go beyond sympathy to seek a truer understanding of how
others are feeling. This requires excellent discrimination and close attention to the
nuances of emotional signals. When we are being truly empathetic, we actually feel
what they are feeling. In order to get others to expose these deep parts of themselves to
us, we also need to demonstrate our empathy in our demeanor towards them, asking
sensitively and in a way that encourages self-disclosure.
6. THERAPEUTIC LISTENING
In therapeutic listening, the listener's purpose is not only to empathize with the
speaker but also to use this deep connection to help the speaker understand,
change or develop in some way. This happens not only when you go to see a therapist
but also in many social situations, where friends and family seek both to diagnose
problems by listening and to help the speaker cure themselves, perhaps through some
cathartic process. It also happens in work situations, where managers, HR people,
trainers and coaches seek to help employees learn and develop.
7. DIALOGIC LISTENING
The word 'dialogue' stems from the Greek words 'dia', meaning 'through', and 'logos',
meaning 'words'. Thus dialogic listening means learning through conversation: an
engaged interchange of ideas and information in which we actively seek to learn more
about the person and how they think. Dialogic listening is sometimes known as
'relational listening'.
The example of dialogic listening
A : I was working as a training director for a national homelessness foundation. I was
traveling around the country doing a lot of teaching and consulting. I was mostly
the only white male wherever I went. So I was doing big urban shelters and city
governments in Detroit and places like that. I was always coming up against race,
class, and gender issues between myself and the participants.
Q : Because they weren't white males?
A : Right, they were mostly females of color, and I could always deal with it, but it
was by the seat of my pants. So I came to PCP for consultation initially and then I
was accepted into their first workshop back in 1994. I found it to be such a
revolutionary approach to difference, one that I had never experienced before in
all my training in diversity and all that other stuff. I found out after my first class
that I had to do my training in Louisville, Kentucky for the homelessness network
there. The issue there was that the staff of the homeless shelters were mostly
women of color, and the volunteers were mostly affluent white women from the
suburbs and they differed in many ways and had different ideas about each other
as well. So I started doing this training. One of the goals of this group was that
they wanted the people to work more effectively together.
About half way through the first day, an African American woman stood up and
she was very angry. She said, "You don't know shit about my life, you're a white
man with privilege." I had some choices to make there. But because I had been to
this one PCP class, I decided that I was going to deal with this differently than I
would have dealt with this prior. I said, "You're absolutely right. I am white. I'm a
guy. I have certain level of power. I wear a tie. I live in suburbs. I drive a nice car.
And I imagine that your story has a lot to do with why you're here. I imagine that
a lot of other people's stories have a lot to do with why they're here. I'm
wondering if we can make a choice together as a group to hear your story, and
what it is that you want people to understand about you. Would you be willing to
hear the stories of others?" She said, "Yeah." So I had everyone go around and tell
the group how their personal story connected to why they were there. Everybody
went around the room. Women told these incredible stories.
I remember there was one white woman who told how she had been homeless for
the last two years. That she had been beaten by her husband, but because they
were wealthy and lived in the suburbs, he was basically able to buy off the police,
and she was basically in prison because of her wealth. Finally, when he started
beating the children, she took them. She was cut off completely from his wealth
and lived on the streets for two years. She had just gotten out of shelter. This
tremendous bonding happened among these women. We were all brought to tears
by it. That affected me deeply. I came home and a couple days later my kids were
fighting. I was always the type, and I still give into this temptation, of getting
involved in the middle and trying to referee, thinking I know what's going on. In
this instance, I tried taking what is called a not-knowing attitude. I suggested that
each kid take five minutes to explain what's going on. I was using the "what's-at-
the-heart-of-the-matter-for-you approach," but in a way that was easier for them to
understand because they were younger. So each kid had five minutes.
Once they spoke, I realized that I certainly didn't have a clue about what their
concerns were. I had a completely different idea about what they were concerned
about, and they had completely different ideas about each other. They were then
able to say, "Oh, so that's all you want," and then move along. Now it doesn't
always happen like that, but it made a really deep impression on me. The biggest
thing for me is being a father, it's the most important thing in my life, and the fact
that I can do it well is my biggest accomplishment. To think that I was doing it so
well, yet I was doing it so ineffectively that I could not know my own kids. I
could be with them ten hours a day but still not know them because I wasn't
listening to them deeply. It blew my mind. I just thought that this is the best thing
since sliced bread. So those two things really catapulted me into the whole PCP
mindset.
Q : So you were really struck by the real power of letting the parties speak for
themselves, without being the convoy, without being the person who summarizes
and says, "This is what's going on."
A : Right. Exactly. Yeah, because I could have said, "This is what I hear." I try to
relate it to my own experience in some way, but basically I don't know. Being
asked, "Can we use your wisdom and tap the rest of the wisdom in the room and
make it work for us here?" Then leaving it in their hands afterward was big. It just
was not my style to do that before.
Listen carefully to the dialog between Nick and Jimmy, then complete the
conversation.
Nick : I heard (1) .......... as a computer programmer.
Jimmy : Yes, and I had already (2) ..............
Nick : Really? I'm happy (3) ...
Jimmy : Thank you.
Nick : Your parents must be (4) ........
Jimmy : They want me to run their business. They're (5) ......
Nick : That's a pity! Did you explain your reasons?
Jimmy : I did and I hope they'll accept my decision.
Dialog II
Margaret : Look at you! You look so great now. What have you been doing?
Joe : Really? (1) ................. I've been in Canada for two weeks. By the way, how
about your job?
Margaret : (2) ............ It's in a big new hospital. My working conditions are much
better than in the last place.
Tony : Attention, please. Today, we have a surprise. We've been offered a trip from
our boss.
Joe : Really? (3) ........................?
Tony : Bandung.
Joe : (4) .................. But where is it located?
Tony : Aren't you pleased?
Joe : Yes, of course. (5) ........................ But tell me where it is.
Margaret : It's in Indonesia.
Joe : Oh, I see. That's not so good.
Tony : Don't worry, Joe. My friend Lisa, who lives there, wrote to me about the
conditions in Indonesia. Indonesia is safe now, especially in that town. There
is no riot; it's just a rumour.
Key Answer
1) I think it’s usual
2) That’s great
3) Where to
4) Marvellous
5) I’m delighted to hear that
8. RELATIONSHIP LISTENING
Sometimes the most important factor in listening is the need to develop or sustain a
relationship. This is why lovers talk for hours and attend closely to what each other has
to say, when the same words from someone else would seem rather boring.
Relationship listening is also important in areas such as negotiation and sales,
where it helps if the other person likes you and trusts you.
13
Testing Grammar
English is a very important language in the world. It plays a very big role in
communication and education. Much of what technology serves is tied to English and,
by and by, English will be the global language in every part of the world. Since
English is an international language, people all over the world try to learn
as much as possible about English. To develop our skill in English we always meet
grammar, and we practice our English through grammar testing, so that we know how far
we understand the language.
A. Definition of grammar
Grammar is the structural foundation of our ability to express ourselves. The more we
are aware of how it works, the more we can monitor the meaning and effectiveness of
the way we and others use language. It can help foster precision, detect ambiguity, and
exploit the richness of expression available in English. And it can help everyone--not
only teachers of English, but teachers of anything, for all teaching is ultimately a
matter of getting to grips with meaning.
1. Descriptive grammar refers to the structure of a language as it is actually
used by speakers and writers.
2. Prescriptive grammar refers to the structure of a language as certain
people think it should be used.
Both kinds of grammar are concerned with rules--but in different ways. Specialists in
descriptive grammar (called linguists) study the rules or patterns that underlie our use of
words, phrases, clauses, and sentences. On the other hand, prescriptive grammarians
(such as most editors and teachers) lay out rules about what they believe to be the
“correct” or “incorrect” use of language.
B. Types of test
Before writing a test it is vital to think about what it is you want to test and what its
purpose is. We must make a distinction here between proficiency tests, achievement
tests, diagnostic tests and prognostic tests.
1. A proficiency test is one that measures a candidate's overall ability in a
language; it isn't related to a specific course.
2. An achievement test, on the other hand, tests the students' knowledge of
the material that has been taught on a course.
3. A diagnostic test highlights the strong and weak points that a learner may
have in a particular area.
4. A prognostic test attempts to predict how a student will perform on a
course.
There are of course many other types of tests. It is important to choose elicitation
techniques carefully when you prepare one of the aforementioned tests. There are many
elicitation techniques that can be used when writing a test. Below are some widely used
types with some guidance on their strengths and weaknesses. Using the right kind of
question at the right time can be enormously important in giving us a clear
understanding of our students' abilities, but we must also be aware of the limitations of
each of these task or question types so that we use each one appropriately.
1. Multiple choice
Choose the correct word to complete the sentence.
Cook is ________________today for being one of Britain's most famous explorers.
a) Recommended b) reminded c) recognized d) remembered
In this question type there is a stem and various options to choose from. The advantages
of this question type are that it is easy to mark and that multiple distractors reduce the
effect of guesswork. The disadvantage is that it can be very time-consuming to create;
effective multiple-choice items are surprisingly difficult to write. It also takes time for
the candidate to process the information, which leads to problems with the validity of
the exam: if a low-level candidate has to read through lots of complicated information
before they can answer the question, you may find you are testing their reading skills
more than their lexical knowledge.
Multiple choice can be used to test most things, such as grammar, vocabulary, reading
and listening, but you must remember that it is still possible for students to simply guess
without knowing the correct answer.
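Once a multiple-choice item has been administered, its quality can be checked empirically. One standard statistic from the testing literature (not discussed in the text above) is the facility value: the proportion of candidates who answer the item correctly. A sketch, using invented response data for the 'Cook' item above:

```python
def facility(responses, correct_option):
    """Facility value: proportion of candidates choosing the correct option.
    Values near 0 suggest the item is too hard; values near 1, too easy."""
    return sum(1 for r in responses if r == correct_option) / len(responses)

# Invented responses from ten candidates (key: d, 'remembered'):
answers = ["d", "d", "c", "d", "a", "d", "d", "b", "d", "d"]
print(facility(answers, "d"))  # 0.7
```

An item with a facility around 0.5 to 0.7 usually discriminates best between stronger and weaker candidates.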
2. Transformation
Complete the second sentence so that it has the same meaning as the first.
'Do you know what the time is, John?' asked Dave.
Dave asked John __________ (what) _______________ it was.
This time a candidate has to rewrite a sentence based on an instruction or a key word
given. This type of task is fairly easy to mark, but the problem is that it doesn't test
understanding. A candidate may simply be able to rewrite sentences to a formula. The
fact that a candidate has to paraphrase the whole meaning of the sentence in the
example above, however, minimizes this drawback.
Transformations are particularly effective for testing grammar and understanding of
form. This wouldn't be an appropriate question type if you wanted to test skills such as
reading or listening.
3. Gap-filling
Complete the sentence.
Check the exchange ______________ to see how much your money is worth.
The candidate fills the gap to complete the sentence. A hint may sometimes be included,
such as a root verb that needs to be changed, or the first letter of the word. This
usually tests grammar or vocabulary. Again, this type of task is easy to mark and
relatively easy to write. The teacher must bear in mind, though, that in some cases there
may be many possible correct answers.
Gap-fills can be used to test a variety of areas, such as vocabulary and grammar, and
are very effective at testing listening for specific words.
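The warning about multiple possible correct answers matters if gap-fills are marked automatically: the marker should compare each response against a set of acceptable keys rather than a single string. A minimal sketch; the accepted answers for the 'exchange ______' item above are assumptions for illustration:

```python
def mark_gap(response, accepted):
    """Mark a gap-fill response against a set of acceptable answers,
    ignoring case and surrounding whitespace."""
    return response.strip().lower() in {a.lower() for a in accepted}

# Assumed acceptable answers for the sample item above:
accepted = {"rate", "rates"}
print(mark_gap("Rate ", accepted))  # True
print(mark_gap("price", accepted))  # False
```

Listing the acceptable answers in advance also forces the test writer to notice when a gap is too open-ended to mark reliably.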
4. True / False
Decide if the statement is true or false.
England won the world cup in 1966. T/F
Here the candidate must decide if a statement is true or false. Again, this type is easy to
mark, but guessing can result in many correct answers. The best way to counteract this
effect is to have a lot of items.
This question type is mostly used to test listening and reading comprehension.
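Besides using many items, another standard counter to guessing (from the general testing literature, not from the text above) is the correction-for-guessing formula: corrected score = R - W/(k - 1), where R is the number of right answers, W the number wrong, and k the number of options per item (k = 2 for true/false). A sketch:

```python
def corrected_score(right, wrong, options=2):
    """Correction for guessing: right - wrong / (options - 1).

    With true/false items (options=2), every wrong answer cancels one
    right answer, so a candidate who guesses blindly expects a score of 0.
    """
    return right - wrong / (options - 1)

# A candidate with 30 right and 10 wrong on 40 true/false items:
print(corrected_score(30, 10))  # 20.0
# The same raw counts on 4-option multiple choice are penalized less:
print(round(corrected_score(30, 10, options=4), 2))  # 26.67
```

Unanswered items are left out of both counts, so the formula rewards omitting over wild guessing.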
5. Open questions
Answer the questions.
Why did John steal the money?
Here the candidate must answer simple questions after a reading or listening, or as part
of an oral interview. It can be used to test anything. If the answer is open-ended, it will
be more difficult and time-consuming to mark, and there may also be an element of
subjectivity involved in judging how 'complete' the answer is, but it may also be a more
accurate test.
These question types are very useful for testing any of the four skills, but less
useful for testing grammar or vocabulary.
6. Error Correction
Find the mistakes in the sentence and correct them.
Ipswich Town was the more better team on the night.
Errors must be found and corrected in a sentence or passage. The error could be an extra
word, a mistake with a verb form, a missing word, and so on. One problem with this
question type is that some errors can be corrected in more than one way.
Error correction is useful for testing grammar and vocabulary as well as reading
and listening.
7. Other Techniques
There are of course many other elicitation techniques such as translation, essays,
dictations, ordering words/phrases into a sequence and sentence construction
(He/go/school/yesterday).
It is important to ask yourself what exactly you are trying to test, which techniques suit
this purpose best and to bear in mind the drawbacks of each technique. Awareness of
this will help you to minimize the problems and produce a more effective test.
C. The Value of Studying Grammar
The study of grammar all by itself will not necessarily make you a better writer. But by
gaining a clearer understanding of how our language works, you should also gain
greater control over the way you shape words into sentences and sentences into
paragraphs. In short, studying grammar may help you become a more effective writer.
Descriptive grammarians generally advise us not to be overly concerned with
matters of correctness: language, they say, isn't good or bad; it simply is. As the history
of the glamorous word grammar demonstrates, the English language is a living system
of communication, a continually evolving affair. Within a generation or two, words and
phrases come into fashion and fall out again. Over centuries, word endings and entire
sentence structures can change or disappear.
Prescriptive grammarians prefer giving practical advice about using language:
straightforward rules to help us avoid making errors. The rules may be over-simplified
at times, but they are meant to keep us out of trouble: the kind of trouble that may
distract or even confuse our readers.
14
INTERPRETING TEST SCORES
I. Introduction
What does interpret mean? To interpret is to decide what the intended meaning of
something is (Cambridge Advanced Learner’s Dictionary). To interpret is to conceive
the significance of; construe (thefreedictionary.com). Thus, to interpret is to understand
the meaning and the significance of something. Interpreting test scores, then, means
understanding the meaning and the significance of test scores, which can be used to
plan the next action: to fix or to retain. There are many ways to do this, but the three
most common are frequency distribution, measures of central tendency, and measures of
dispersion. Frequency distribution concerns the distribution of scores and the frequency
of each category. Measures of central tendency refer to measures of the “middle” value:
the mode, the median, and the mean. Last but not least are the measures of dispersion,
which describe the range or spread of scores. All
three can help teachers interpret the meaning behind test scores.
II. Content
A. Frequency Distribution
A frequency distribution deals with the distribution of scores and how often each score
occurs. Each entry in the table contains the frequency, or count, of the occurrences of
scores within a particular group, and in this way the table summarizes the distribution
of scores.
The example case here is: a teacher administers a test of 40 questions to 26 students.
Marks are awarded by counting the number of correct answers on the test scripts. These
are known as raw marks.
Here are the steps to create a table of frequency distribution:
1. Create Table 1 and put the raw mark of every student in it.
TABLE 1
Testee Mark
A 20
B 25
C 33
D 35
E 29
F 25
G 30
H 26
I 19
J 27
K 26
L 32
M 34
N 27
O 27
P 29
Q 25
R 23
S 30
T 26
U 22
V 23
W 33
X 26
Y 24
Z 26
2. Create Table 2. Sort the marks from the highest to the lowest score. This is called
descending sorting. It is easier and faster to use a tool like Microsoft Excel to do the
sorting.
TABLE 2
Testee Mark
D 35
M 34
C 33
W 33
L 32
G 30
S 30
E 29
P 29
J 27
N 27
O 27
H 26
K 26
T 26
X 26
Z 26
B 25
F 25
Q 25
Y 24
R 23
V 23
U 22
A 20
I 19
Now we determine the ranks. We start from rank 1 and go up to rank 26, for there are 26
students.
A problem arises when two or more students have the same mark. To handle this, we first
write imaginary ranks from 1 to 26 next to the Mark column. The imaginary ranks of
students sharing a mark are then added together and divided by the number of students
with that mark. For example, students C and W have the same mark, 33, and their
imaginary ranks are 3 and 4. To get the actual rank, we add 3 and 4 (3 + 4 = 7) and
divide the result by the number of students with that score, which is 2 here. The final
result is 3.5, so both students receive rank 3.5.
TABLE 2
Testee Mark Imaginary rank
D 35 1
M 34 2
C 33 3
W 33 4
L 32 5
G 30 6
S 30 7
E 29 8
P 29 9
J 27 10
N 27 11
O 27 12
H 26 13
K 26 14
T 26 15
X 26 16
Z 26 17
B 25 18
F 25 19
Q 25 20
Y 24 21
R 23 22
V 23 23
U 22 24
A 20 25
I 19 26
Actual ranks for tied marks:
33: (3 + 4) / 2 = 3.5
30: (6 + 7) / 2 = 6.5
29: (8 + 9) / 2 = 8.5
27: (10 + 11 + 12) / 3 = 11
26: (13 + 14 + 15 + 16 + 17) / 5 = 15
25: (18 + 19 + 20) / 3 = 19
23: (22 + 23) / 2 = 22.5
The result will be like this. Table 2 shows the students’ scores in order of merit and
their ranks as well.
TABLE 2
Testee Mark Rank
D 35 1
M 34 2
C 33 3.5
W 33 3.5
L 32 5
G 30 6.5
S 30 6.5
E 29 8.5
P 29 8.5
J 27 11
N 27 11
O 27 11
H 26 15
K 26 15
T 26 15
X 26 15
Z 26 15
B 25 19
F 25 19
Q 25 19
Y 24 21
R 23 22.5
V 23 22.5
U 22 24
A 20 25
I 19 26
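The tie-averaging rule above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original worked example; it assumes the marks are already sorted in descending order, as in Table 2.

```python
from itertools import groupby

# Marks from Table 2, already sorted in descending order.
marks = [35, 34, 33, 33, 32, 30, 30, 29, 29, 27, 27, 27,
         26, 26, 26, 26, 26, 25, 25, 25, 24, 23, 23, 22, 20, 19]

ranks = []
position = 1  # the next "imaginary rank" to hand out
for mark, group in groupby(marks):
    n = len(list(group))  # how many students share this mark
    # Students sharing a mark all receive the average of the
    # n consecutive imaginary ranks they occupy.
    avg_rank = (position + (position + n - 1)) / 2
    ranks.extend([avg_rank] * n)
    position += n

print(ranks)
```

Running this reproduces the Rank column of Table 2, including the shared rank of 3.5 for students C and W.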
3. Create Table 3, which consists of Mark column, Tally column, and Frequency
column.
In the Mark column, we can expand the range from 40 down to 15, even though the highest
score is 35 and the lowest score is 19. We usually do this to give more space and
enhance readability.
The Tally column records a stroke for each student who obtains a given score. It is
simply a method of counting the frequency of scores.
The Frequency column lists the number of students obtaining each score. The tallies
make this easier to count.
Table 3 is the table of frequency distribution.
TABLE 3
Mark Tally Frequency
40
39
38
37
36
35 / 1
34 / 1
33 // 2
32 / 1
31
30 // 2
29 // 2
28
27 /// 3
26 //// 5
25 /// 3
24 / 1
23 // 2
22 / 1
21
20 / 1
19 / 1
18
17
16
15
TOTAL 26
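The same frequency table can be produced mechanically with Python's collections.Counter. This is a sketch for checking your tallies against the raw marks of Table 1; it is not part of the original procedure.

```python
from collections import Counter

# Raw marks of testees A-Z from Table 1.
marks = [20, 25, 33, 35, 29, 25, 30, 26, 19, 27, 26, 32, 34,
         27, 27, 29, 25, 23, 30, 26, 22, 23, 33, 26, 24, 26]

freq = Counter(marks)

# Print every mark in the display range 40 down to 15, including
# marks nobody obtained, mirroring the layout of Table 3.
for mark in range(40, 14, -1):
    count = freq.get(mark, 0)
    print(f"{mark:>3} {'/' * count:<6} {count if count else ''}")
print("TOTAL", sum(freq.values()))
```

The printed counts match Table 3, with 26 occurring five times and a total of 26 marks.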
B. Measures of Central Tendency
A measure of central tendency is a measure that tells us where the “middle” of a bunch
of data lies. The three most common measures of central tendency are the mode, the
median, and the mean.
B.1. Mode
Mode refers to the score which most candidates obtained. We can easily spot it from
Table 3. The most frequent score in Table 3 is 26, as five testees have scored this mark.
Thus, the mode is 26.
B.2. Median
Median refers to the score gained by the middle candidate after the data is put in order.
We use Table 2, which has been ordered in descending order, to find the median. In the
case of 26 students here, there can obviously be no middle student and thus the score
halfway between the lowest score in the top half and the highest score in the bottom half
is taken as the median. The median score in this case is 26.
TABLE 2
Testee Mark Rank
D 35 1
M 34 2
C 33 3.5
W 33 3.5
L 32 5
G 30 6.5
S 30 6.5
E 29 8.5
P 29 8.5
J 27 11
N 27 11
O 27 11
H 26 15 (lowest score of the top half: 26)
K 26 15 (highest score of the bottom half: 26)
T 26 15
X 26 15
Z 26 15
B 25 19
Median = (26 + 26) / 2 = 26
B.3. Mean
Mean or average score is the sum of the scores divided by the total number of testees.
The mean is the most efficient measure of central tendency, but it is not always
appropriate.
Now, we are going to create Table 4 to calculate the mean. Note that the symbol x is
used to denote the score, N the number of testees, and m the mean. The symbol f denotes
the frequency with which a score occurs, and the symbol ∑ means 'the sum of'.
First, we gather the data from Table 3, taking the scores and their frequencies from it.
Each score (x) is then multiplied by its frequency (f), and the result is put in the fx
column. After that, the fx values are totalled as ∑fx.
TABLE 4
x f fx
35 x 1 35
34 x 1 34
33 x 2 66
32 x 1 32
30 x 2 60
29 x 2 58
27 x 3 81
26 x 5 130
25 x 3 75
24 x 1 24
23 x 2 46
22 x 1 22
20 x 1 20
19 x 1 19
TOTAL ∑fx = 702
To get the mean, we use the formula m = ∑fx / N.
m = ∑fx / N = 702 / 26 = 27
Thus, the mean is 27.
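Python's statistics module can be used to double-check all three measures of central tendency on the same data. This is an illustrative cross-check, not part of the original calculation.

```python
from statistics import mean, median, mode

# The 26 marks from Table 2.
marks = [35, 34, 33, 33, 32, 30, 30, 29, 29, 27, 27, 27,
         26, 26, 26, 26, 26, 25, 25, 25, 24, 23, 23, 22, 20, 19]

print(mode(marks))    # the most frequent score
print(median(marks))  # average of the 13th and 14th scores, since N is even
print(mean(marks))    # sum of scores divided by N, i.e. 702 / 26
```

All three agree with the hand calculations above: mode 26, median 26, mean 27.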
C. Measures of Dispersion
Measures of dispersion are important for describing the spread of the scores, or their
variation around a central value. There are various methods that can be used to measure
the dispersion of a dataset, but the most common ones are the range and the standard
deviation.
C.1. Range
A simple way of measuring the spread of marks is based on the difference between the
highest and the lowest scores. It is called the range. From Table 2 above, we can see
that the highest score is 35 and the lowest score is 19.
Range = Xmax – Xmin = 35 – 19 = 16
C.2. Standard Deviation
The standard deviation (s.d.) is another way of showing the spread of scores. It shows
how all the scores are spread out and gives a fuller description of test scores than the
range.
One simple method of calculating the s.d. is shown below:
s.d. = √(Σd² / N)
where N is the number of scores and d is the deviation of each score from the mean.
From the previous calculation, the mean is 27. The steps to calculate the s.d. are as
follows:
1. Step 1: Find out the amount by which each score deviates from the mean (d).
Score Mean deviation (d) = Score - 27
35 8
34 7
33 6
33 6
32 5
30 3
30 3
29 2
29 2
27 0
27 0
27 0
26 -1
26 -1
26 -1
26 -1
26 -1
25 -2
25 -2
25 -2
24 -3
23 -4
23 -4
22 -5
20 -7
19 -8
2. Step 2: Square each result (d²)
Score Mean deviation (d) d²
35 8 64
34 7 49
33 6 36
33 6 36
32 5 25
30 3 9
30 3 9
29 2 4
29 2 4
27 0 0
27 0 0
27 0 0
26 -1 1
26 -1 1
26 -1 1
26 -1 1
26 -1 1
25 -2 4
25 -2 4
25 -2 4
24 -3 9
23 -4 16
23 -4 16
22 -5 25
20 -7 49
19 -8 64
3. Step 3: Total all the results (Σd²)
Σd² = 432
4. Step 4: Divide the total by the number of testees (Σd²/N)
Σd²/N = 432 / 26 = 16.62
5. Step 5: Find the square root of the result, √(Σd²/N)
√(Σd²/N) = √16.62 = 4.077 ≈ 4.08
Thus, the standard deviation (s.d.) is 4.08. That means that, on average, the scores are
about 4 points away from the mean.
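Both measures of dispersion can be checked in Python. Note that statistics.pstdev divides by N rather than N - 1, so it matches the √(Σd²/N) formula used above. This is an illustrative sketch, not part of the original steps.

```python
from statistics import pstdev

# The 26 marks from Table 2.
marks = [35, 34, 33, 33, 32, 30, 30, 29, 29, 27, 27, 27,
         26, 26, 26, 26, 26, 25, 25, 25, 24, 23, 23, 22, 20, 19]

# Range: difference between the highest and lowest score.
score_range = max(marks) - min(marks)
print(score_range)

# Population standard deviation: square root of the mean squared deviation.
sd = pstdev(marks)
print(round(sd, 2))
```

This reproduces the hand calculations: a range of 16 and a standard deviation of about 4.08.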
References:
Van de Walle, J., & Lovin, L. (2006). Teaching Student-Centered Mathematics (K-3).
Pearson Publishing.
Wilson, & Kenney. (2003). Classroom and Large-Scale Assessment. In A Research Companion
to Principles and Standards for School Mathematics (pp. 53-67). NCTM.
National Council of Teachers of Mathematics (NCTM). (2000). Principles and Standards
for School Mathematics.
Fosnot, C. T., & Dolk, M. Young Mathematicians at Work: Constructing Number Sense,
Addition, and Subtraction. Heinemann.
Overton, T. Assessing Learners with Special Needs (6th ed.).
Weaver, B. Formal versus Informal Assessments.
http://www.scholastic.com/teachers/article/formal-versus-informal-assessments
Morrison, G. Informal Methods of Assessment.
http://www.education.com/reference/article/informal-methods-assessment/
Forlizzi, L. Informal Assessment: The Basics.
http://aded.tiu11.org/disted/FamLitAdminSite/fn04assessinformal.pdf
Navarete, C., et al. Informal Assessment in Educational Evaluation.
Mind map retrieved February 20, 2013, from
http://www.mindmeister.com/122645400/formal-vs-informal-assessments
Bynom, A. (2001, December). Testing: Basic Concepts: Basic Terminology.
Perkins, K. (1996, May). A Statistical Analysis of Different Instruments to Measure
Short-Term in an L2 Immersion Program. TESL Journal, Vol. II, No. 5. http://iteslj.org/
Berk, R. (1979). Generalizability of Behavioral Observations: A Clarification of
Interobserver Agreement and Interobserver Reliability. American Journal of Mental
Deficiency, 83(5), 460-472.
Cronbach, L. (1990). Essentials of Psychological Testing. New York: Harper & Row.
Carmines, E., & Zeller, R. (1979). Reliability and Validity Assessment. Beverly Hills,
California: Sage Publications.
Gay, L. (1987). Educational Research: Competencies for Analysis and Application.
Columbus: Merrill Pub. Co.
Guilford, J. (1954). Psychometric Methods. New York: McGraw-Hill.
Nunnally, J. (1978). Psychometric Theory. New York: McGraw-Hill.
Winer, B., Brown, D., & Michels, K. (1991). Statistical Principles in Experimental
Design (3rd ed.). New York: McGraw-Hill.
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (1985). Standards for Educational and
Psychological Testing. Washington, DC: Authors.
Cozby, P. C. (2001). Measurement Concepts. In Methods in Behavioral Research (7th ed.).
California: Mayfield Publishing Company.
Cronbach, L. J. (1971). Test Validation. In R. L. Thorndike (Ed.), Educational
Measurement (2nd ed.). Washington, D.C.: American Council on Education.
Moskal, B. M., & Leydens, J. A. (2000). Scoring Rubric Development: Validity and
Reliability. Practical Assessment, Research & Evaluation, 7(10).
http://pareonline.net/getvn.asp?v=7&n=10
The Center for the Enhancement of Teaching. How to Improve Test Reliability and
Validity: Implications for Grading.
http://oct.sfsu.edu/assessment/evaluating/htmls/improve_rel_val.html
http://spiritize.blogspot.com/2007/10/active-listening.html
http://wiki.answers.com/Q/Examples_of_poetry#ixzz1xYYnZXmU
Alderson, J. C. (2002). Conceptions of Validity and Validation. Paper presented at a
conference in Bucharest, June 2002.
Angoff, W. (1988). Validity: An Evolving Concept. In H. Wainer & H. Braun (Eds.), Test
Validity (pp. 19-32). Hillsdale, NJ: Erlbaum.
Bachman, L. F. (1990). Fundamental Considerations in Language Testing. Oxford: O.U.P.
Cumming, A., & Berwick, R. (Eds.). (1996). Validation in Language Testing. Multilingual
Matters.
Hatch, E., & Lazaraton, A. (1991). The Research Manual: Design and Statistics for
Applied Linguistics. Newbury House.
Henning, G. (1987). A Guide to Language Testing: Development, Evaluation and Research.
Cambridge, Mass.: Newbury House.
Hubley, A. M., & Zumbo, B. D. (1996). A Dialectic on Validity: Where We Have Been and
Where We Are Going. The Journal of General Psychology, 123(3), 207-215.
Messick, S. (1988). The Once and Future Issues of Validity: Assessing the Meaning and
Consequences of Measurement. In H. Wainer & H. Braun (Eds.), Test Validity (pp. 33-45).
Hillsdale, NJ: Erlbaum.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed.,
pp. 13-103). New York: Macmillan.