|
Test (student assessment)
A test or an examination (or "exam") is an assessment, often administered on
paper or on a computer, intended to measure a test-taker's or respondent's
(often a student's) knowledge, skills, aptitudes, or other attributes (e.g.,
beliefs). Tests are often used in education, professional certification,
counseling, psychology (e.g., the MMPI), the military, and many other
fields.
A test typically has more questions, of greater difficulty, and requires more
time to complete than a quiz. A standardized test is one that compares the
performance of each individual subject with a norm or criterion. The norm
may be established independently or by statistical analysis of a large
number of subjects.
Types of questions
Multiple-choice questions
For a multiple-choice question, the author of the test provides several
possible answers (usually four or five) from which the test subjects must
choose. There is one right answer, usually represented by only one answer
option, though sometimes divided into two or more, all of which subjects
must identify correctly. Such a question may look like this:
The number of right angles in a square is:
a) 2 b) 3 c) 4 d) 5
Test authors generally create incorrect response options, often referred to
as distracters, which correspond to likely errors. For example,
distracters may represent common misconceptions that arise while students
are still developing their understanding. Constructing effective distracters
is a key challenge in writing multiple-choice items that possess strong
psychometric properties. Well-designed distracters, considered in
combination, can attract a larger share of the weakest students than blind
guessing would send to them (more than 75% on a four-option item), thereby
reducing the effect of guessing on total scores. Constructing such items
requires skill and experience on the part of the item developer.
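The effect of guessing described above can be made concrete with a little arithmetic. The sketch below is only an illustration: the 40-item test, the four options per item, and the 90% distracter share are assumed numbers, not figures from any real examination. On an item with k options, blind guessing picks the correct answer with probability 1/k, so distracters that jointly attract more than (k - 1)/k of the weakest students push their expected score below the chance level.

```python
from fractions import Fraction


def chance_score(num_items: int, num_options: int) -> Fraction:
    """Expected number correct when every answer is a blind guess."""
    return Fraction(num_items, num_options)


def expected_correct(num_items: int, distracter_share: Fraction) -> Fraction:
    """Expected number correct when the distracters jointly attract
    `distracter_share` of the responses on every item."""
    return num_items * (1 - distracter_share)


# Hypothetical 40-item test with four options per item.
print(chance_score(40, 4))                    # 10
# Assumed: distracters attract 90% of the weakest students on each item.
print(expected_correct(40, Fraction(9, 10)))  # 4
```

Exact fractions are used here so that the comparison with the chance score is not blurred by floating-point rounding.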
A graph showing the functioning of a multiple-choice question is shown in
Figure 1. The x-axis represents an ability continuum and the y-axis the
probability of any given choice. The grey line maps ability to the
probability of a correct response according to the Rasch model, which is a
psychometric model used to analyse test data. The correct response in the
example shown in Figure 1 is E. The proportion of students along the ability
continuum who chose the correct response is highlighted in pink. The graph
shows the proportion of students opting for other choices along the range of
the ability continuum, as shown in the legend. The proportion of students at
about −1.5 on the scale who responded correctly to this item is
approximately 0.1, which is below the proportion expected if students were
purely guessing.
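The curve that the Rasch model predicts can be sketched in a few lines. In the dichotomous Rasch model, the probability of a correct response depends only on the difference between a person's ability and the item's difficulty, through a logistic function. The item difficulty below is an assumed value chosen for illustration; it is not read from Figure 1.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model:
    P = exp(ability - difficulty) / (1 + exp(ability - difficulty))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


item_difficulty = 0.5  # assumed value, for illustration only

for ability in (-3.0, -1.5, 0.0, 1.5, 3.0):
    p = rasch_probability(ability, item_difficulty)
    print(f"ability {ability:+.1f}: P(correct) = {p:.3f}")
```

The model has no guessing parameter, so the predicted probability keeps falling toward zero as ability decreases. A person whose ability equals the item difficulty has a probability of exactly 0.5.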
An attractive feature of multiple-choice questions is that they are
particularly easy to score. Scoring can be performed automatically and
almost instantly, either by machines such as the Scantron or by software in
computer-based tests. This is particularly valuable when there are not
enough graders available to grade a large class or a large-scale
standardized test.
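A rough sketch of why this scoring is so easy to automate: once the answer key is fixed, scoring reduces to comparing each response with the key. The function name, the key, and the student sheets below are all hypothetical, and real scanning systems also have to handle mark detection, which is not modelled here.

```python
def score_sheet(answer_key: str, responses: str) -> int:
    """Count the items on which the response matches the key.
    A blank ('-') or any other unrecognised mark simply fails to match."""
    if len(answer_key) != len(responses):
        raise ValueError("response sheet and key must have the same length")
    return sum(1 for key, mark in zip(answer_key, responses) if key == mark)


answer_key = "CABDA"  # hypothetical five-item key
sheets = {
    "student_1": "CABDA",
    "student_2": "CBBD-",  # one wrong answer and one blank
    "student_3": "DABCA",
}

scores = {name: score_sheet(answer_key, marks) for name, marks in sheets.items()}
print(scores)  # {'student_1': 5, 'student_2': 3, 'student_3': 3}
```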
This format is not, however, appropriate for assessing all types of skills
and abilities. Poorly written multiple-choice questions often create an
overemphasis on simple memorization and deemphasize processes and
comprehension, and they leave no room for disagreement or alternate
interpretation, making them particularly unsuitable for humanities such as
literature and philosophy.
Free-response questions
Free-response questions (also known as extended constructed responses)
generally require subjects to produce written responses. The length of the
written response may be as short as a single word or mathematical
expression, in which case the question acquires some of the characteristics
of the multiple-choice type. However, at higher levels of education, this
type of question usually requires deeper, more analytical thinking. The most
difficult free-response questions may involve an essay or original
composition of a page or more in length, or a scientific proof or solution
requiring over an hour.
Free-response questions do not pose as much of a challenge to the test
author, but evaluating the responses is a different matter. Effective
scoring involves reading the answer carefully and looking for specific
features, such as clarity and logic, which the item is designed to assess.
Often, the best results are achieved by awarding scores according to
explicit ordered categories which reflect an increasing quality of response.
Doing so may involve the construction of marking criteria and support
materials, such as training materials for markers and samples of work which
exemplify categories of responses. Typically, these questions are scored
according to a uniform grading rubric for greater consistency and
reliability.
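One way to picture such a rubric is as a small data structure: each criterion has an ordered list of categories, and a marker's judgement for each criterion is converted to the position of the chosen category in that list. The criteria and category labels below are invented for illustration and do not come from any published rubric.

```python
# Hypothetical rubric: each criterion lists its categories from weakest to strongest.
RUBRIC = {
    "clarity": ["unclear", "partly clear", "clear"],
    "logic": ["no argument", "argument with gaps", "sound argument"],
}


def score_response(levels: dict[str, str]) -> int:
    """Sum the category positions (0 = weakest) chosen by a marker.
    Raises KeyError for a missing criterion and ValueError for an unknown category."""
    total = 0
    for criterion, categories in RUBRIC.items():
        total += categories.index(levels[criterion])
    return total


marker_judgement = {"clarity": "clear", "logic": "argument with gaps"}
print(score_response(marker_judgement))  # 3
```

Because every marker chooses from the same fixed categories, two markers reading the same response are more likely to arrive at the same score than if each assigned a number freely.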
At the other end of the spectrum, scores may be awarded according to
superficial qualities of the response, such as the presence of certain
important terms. In this case, it is easy for test subjects to fool scorers
by writing a stream of generalizations and non sequiturs that incorporate
the terms that the scorers are looking for.
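The weakness of scoring by key terms alone can be shown with a deliberately naive scorer. This is a sketch of the failure mode, not of any real marking scheme; the key terms and both sample responses are made up.

```python
import re


def keyword_score(response: str, key_terms: set[str]) -> int:
    """Count how many of the key terms appear anywhere in the response."""
    words = set(re.findall(r"[a-z]+", response.lower()))
    return len(key_terms & words)


key_terms = {"photosynthesis", "chlorophyll", "sunlight"}

sound = "Chlorophyll absorbs sunlight, which drives photosynthesis in the leaf."
padded = "Photosynthesis is important. Sunlight is everywhere. Chlorophyll exists."

print(keyword_score(sound, key_terms))   # 3
print(keyword_score(padded, key_terms))  # 3
```

Both responses receive the same score, although only the first one connects the terms into an explanation.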
Practical examination
Knowledge of how to do something does not lend itself well to either
free-response or multiple-choice questions; it can be demonstrated only by
actually performing the task. Art, music, and language fall into this
category, as do
non-academic disciplines such as sports and driving. Students of engineering
are often required to present an original design or computer program
developed over the course of days or even months.
A practical examination may be administered by an examiner in person (in
which case it may be called an audition or a tryout) or by means of an audio
or video recording. It may be administered on its own or in combination with
other types of questions; for instance, many driving tests in the United
States include a practical examination as well as a multiple-choice section
regarding traffic laws.
Tests of the sciences may include laboratory experiments (practicals/laboratory
sessions) to make sure that the student has learned not only the body of
knowledge comprising the science but also the experimental methods through
which it has been developed. Again, the use of explicit criteria is
generally beneficial in the marking of practical examinations or
performances.
Limitations of testing and associated issues
General aptitude tests are used in certain countries as a basis for entrance
into colleges and universities. One issue with these tests is that they are
known to be subject to practice effects and do not assess the learning that
students have accumulated during their schooling years. As a consequence,
the SAT has been renamed from the Scholastic Aptitude Test to the Scholastic
Assessment Test. Some evidence indicates that the SAT scores of 11th and
12th graders do not correlate highly with freshman-year grades and correlate
poorly with overall undergraduate ranking. This has put pressure on ETS to
re-evaluate its exams before universities start requiring applicants to
provide ACT scores instead; the ACT also does not correlate very well with
freshman GPA, but it correlates better than the SAT does. Reasons for poor
correlation include the following:
* Questions on the exam may improperly weight the types of problems
encountered within the environment the exam is intended to predict. An
example of improper weighting would be an exam whose ratio of questions in
geometry, calculus, and number theory differs from the ratio in which these
topics appear in the environment for which the exam is meant to serve as a
predictor of future performance. More egregiously, a mathematics exam may
ask solely about the names, birthdates, and countries of origin of various
mathematicians when such knowledge is of little importance in a mathematics
curriculum.
* People are variously susceptible to stress. Some are virtually unaffected
and excel on tests, while in extreme cases individuals can become very
nervous and forget large portions of the exam material. To counterbalance
this, teachers and professors often do not grade their students on tests
alone, instead placing considerable weight on homework, attendance, in-class
discussion activity, and laboratory investigations (where applicable).
* Through specialized training on material and techniques specifically
created to suit the test, students can be "coached" to "game" the test,
significantly raising their scores without significantly increasing
their general intelligence or knowledge.
* Although test organizers attempt to prevent it and impose strict penalties
for it, academic dishonesty (cheating) can be used to obtain an advantage
over other test-takers. On a multiple-choice test, lists of answers may be
obtained beforehand. On a free-response test, the questions may be obtained
beforehand, or the subject may write an answer that creates the illusion of
knowledge.
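The first reason in the list above, improper weighting, can be quantified in a simple way by comparing the share of exam questions on each topic with the share of the target curriculum devoted to it. The sketch below uses total variation distance as the measure, which is one reasonable choice among several and not a standard prescribed by any testing body; the question counts and curriculum hours are invented.

```python
def proportions(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw counts per topic into shares that sum to 1."""
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}


def weighting_mismatch(exam: dict[str, int], target: dict[str, int]) -> float:
    """Total variation distance between two topic mixes:
    0.0 means identical weighting, 1.0 means no overlap at all."""
    p, q = proportions(exam), proportions(target)
    topics = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in topics)


# Hypothetical numbers: questions per topic on the exam, and hours per topic
# in the curriculum the exam is supposed to predict performance in.
exam_items = {"geometry": 30, "calculus": 5, "number theory": 5}
curriculum_hours = {"geometry": 10, "calculus": 25, "number theory": 5}

print(weighting_mismatch(exam_items, curriculum_hours))  # 0.5
```

A value of 0.5 means that half of the exam's weight would have to be moved between topics to match the curriculum. The function assumes each set of counts has a non-zero total.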
Despite such issues, tests are less susceptible to cheating than other tools
of learning evaluation. Laboratory results can be fabricated, and homework
can be done by one student and copied by rote by others. The presence of a
responsible test administrator, in a controlled environment, helps to guard
against cheating.
Additionally, in some cases, high-stakes testing induces examinees to rise
to meet the exam's high expectations. Generally, the term high-stakes is
reserved for tests that are used as a basis for competitive entry into
future courses, including tests which are highly weighted within selection
criteria that are used for entrance into university courses.
The SAT and other high-stakes exams
In the United States and other countries, tests based primarily on
multiple-choice questions have come to be used for assessments of great
importance, with consequences including the funding levels of public schools
and the admission of students to institutions of higher education. The most
important such test in the U.S. is the SAT, which is written and
administered by the College Board and consists almost entirely of
multiple-choice questions (though some of these are specifically designed
to compensate for the inherent inaccuracies of that question type).
Originally developed as a test of a student's intrinsic intelligence, its
methodology has proven vulnerable to specialized test-preparation programs
that improve the subject's score.
For this reason, certain commentators have suggested that high-stakes
testing should be based more on content learned during the schooling years.
Difficulties arise with respect to comparability across different schools,
sectors, states and so on. A key challenge is to balance the need for
comparability with the need to assess the skills, knowledge and abilities
students have developed during the schooling years.
The SAT has also been criticized for an alleged racial bias; ethnic
minorities supposedly fare worse on the exam than they should. As a result,
it began to fall out of favor in the late 1990s, with increasing emphasis on
standardized tests that measure actual knowledge. Some of these replacements
have likewise come from the College Board, but many states have taken the
initiative to design tests of their own. The ACT examination, introduced in
1959 as a competitor to the SAT, also features more knowledge-based
questions, and is accepted as an alternative to the SAT for admission to
many United States colleges. Many colleges are also placing more emphasis on
measures of long-term performance such as the high-school grade point
average, the difficulty of classes taken in high school, and teacher letters
of recommendation.
There are also other high-stakes exams at higher educational levels, such as
the Fundamentals of Engineering exam administered by the National Council of
Examiners for Engineering and Surveying (NCEES).
International exams
* GCSE and A-level - used in the UK
* Standard Grade, Higher Grade, and Advanced Higher - used in Scotland
* Abitur - used in Germany
* Matura/Maturita - used in Austria, Bosnia and Herzegovina, Bulgaria,
Croatia, Italy, Liechtenstein, Hungary, Macedonia, Montenegro, Poland,
Serbia, Slovenia, Switzerland and Ukraine; formerly used in Albania
* International Baccalaureate Diploma Programme - international exam
* Internationella prov - used in Sweden
* Matura Shtetërore - used in Albania
|