Original publication 'Toetsvragen schrijven' 1983 Utrecht: Het Spectrum, Aula 809, Onderwijskundige Reeks voor het Hoger Onderwijs ISBN 90-274-6674-0. The 2006 text is a revised text. The integral 1983 text is available at www.benwilbrink.nl/publicaties/83ToetsvragenAula.pdf.

Item writing

Techniques for the design of items for teacher-made tests

8. Test item quality

Examples

Ben Wilbrink

This database of examples has yet to be constructed. Suggestions? Mail me.

Figure 1. Quality map.

The one term that covers both the attempts at item design in chapters three through seven and the chapter eight quality checks on the items thus designed is construct validity. I will use Snow (1993) and Messick (1993) as sources on the concept of construct validity, because these contributions focus on the construction versus choice theme that is so germane to item design. Construct validity covers the item and test design process, as well as the empirical evaluation of the items and tests so designed against the major and minor purposes of the course concerned. Be warned, however, that the work of these authors will be criticized for being rather parochial to the educational measurement situation in the United States, which is so different from that in, for example, continental European countries like the Netherlands. The criticism does not touch the idea of construct validity as such, but the way it is fleshed out by Snow, Messick and other American authors. Points of criticism will be the norm-referenced rather than criterion-referenced character of the suggested research methodologies, the neglect of national examinations as one possible benchmark for construct validating high-stakes standardized tests, and in general the lip service paid to the competitive sentiment in American education and its testing traditions.

The emphasis on construct validity is an addition to the 1983 text. I have yet to work this out. In the process I will also use Hofstee (1971) and the possibly more attractive conception of validity expounded by Borsboom, Mellenbergh and van Heerden (2004).

8.1 Fair grading

Questions should be clear and unambiguous.

truth-in-testing

See Haney (1984), p. 627-8.

You are shown two pyramids, one of them with four faces, the other with five faces. All faces are equilateral triangles, except for the second pyramid's square base.

"If the two pyramids were placed together face-to-face with the vertices of the equal-sized equilateral triangles coinciding, how many exposed faces would the resulting solid have?"

Haney, 1984, p. 627, citing the College Board's Preliminary Scholastic Aptitude Test (PSAT).

You might take some time to ponder the question in the box, and why I might have cited it as an example of what. The question has become rather famous because it was the first one, after the truth-in-testing law of New York State came into effect, that was shown to be wrongly keyed. The official key was that the number of faces is seven. A Florida schoolboy had marked 'five,' and protested against only the official answer being scored right. ETS immediately admitted they had made a mistake. Can you figure out what the mistake was? According to the New York Times, 240,000 scores had to be raised! The same item had been used in the years before, and on those tests the scores were also corrected. Note that standard item analyses had not turned up this item as a faulty one! In the earlier years, students were unable to protest because the tests were then kept secret. The right answer to this question is indeed five: gluing hides one face of each pyramid, leaving seven of the original nine faces exposed, but two pairs of adjacent triangles each fall in the same plane and merge into a single face, diminishing the number of faces by two. Reviewers of the original item had completely missed this solution.
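The count of five can be checked numerically. The following is a minimal sketch, not part of the original item or of Haney's discussion: the coordinates, the names `EXPOSED` and `plane`, and the unit edge length are all assumptions chosen for illustration. It lists the seven polygons left exposed after gluing and counts how many distinct planes they lie in. Because the glued solid is convex, each distinct plane carries exactly one face.

```python
from math import sqrt

H = sqrt(0.5)  # apex height of a square pyramid whose edges all have length 1

# Square pyramid: base corners A1..A4 and apex P (assumed coordinates).
A1, A2, A3, A4 = (0.5, -0.5, 0.0), (0.5, 0.5, 0.0), (-0.5, 0.5, 0.0), (-0.5, -0.5, 0.0)
P = (0.0, 0.0, H)
# Regular tetrahedron glued onto the triangular face A1-A2-P; Q is its fourth vertex.
Q = (1.0, 0.0, H)

# The seven polygons still exposed after gluing (three points fix each plane).
EXPOSED = [
    (A1, A2, A3),  # square base of the pyramid
    (A2, A3, P),   # remaining triangles of the pyramid
    (A3, A4, P),
    (A4, A1, P),
    (A1, A2, Q),   # remaining triangles of the tetrahedron
    (A2, P, Q),
    (P, A1, Q),
]

def plane(p, q, r):
    """Canonical (unit normal, offset) description of the plane through p, q, r."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    length = sqrt(sum(c * c for c in n))
    n = [c / length for c in n]
    d = sum(n[i] * p[i] for i in range(3))
    # Flip the sign so that the same plane always gets the same description.
    if next(c for c in n if abs(c) > 1e-9) < 0:
        n, d = [-c for c in n], -d
    return tuple(round(c, 6) + 0.0 for c in n + [d])

planes = {plane(*face) for face in EXPOSED}
face_count = len(planes)
print(len(EXPOSED), "exposed polygons lie in", face_count, "distinct planes")
```

Counting polygons gives the official key; counting planes gives the schoolboy's answer.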

Walt Haney (1984). Testing reasoning and reasoning about testing. Review of Educational Research, 54, 597-654.

Kathleen Rhoades and George Madaus (2003). Errors in Standardized Tests: A Systemic Problem. Boston College: Lynch School of Education. National Board on Educational Testing and Public Policy.

This monograph is concerned with the role of human errors in testing. From the introduction: "Human errors do not occur randomly; their presence is not known. These errors are of greater concern than random errors because they are capricious and bring with them unseen consequences." The study examines human errors in systems in general and in the education system in particular. The authors document active errors and latent errors in educational testing.

John R. Hills (1991). Apathy concerning grading and testing. Phi Delta Kappan, 540-545.

"Teachers and administrators alike are woefully ignorant of sound assessment practices, Mr. Hills contends. The root of the problem is a system that neither recognizes nor rewards evaluation skills."
The article does not seem to be available on the web. Use its title to find related publications.
Hills gives lots of examples of misuse and abuse of tests and grading by teachers and administrators. The bad news is that misuse or abuse is not the exception, but regular bad practice in our schools.

8.2 Checklist quality control

Checklist of format concerns

Checklist of content concerns

Haladyna (1999, p. 77), Haladyna et al. (2002): "Base each item on specific content and a type of mental behavior."

The Haladyna guideline endorses the construction of extensive tables crossing specific content against levels of cognitive functioning, see Bloom, Hastings and Madaus (1971).

This book, however, does not endorse the psychological approach of trying to force items into the Bloom et al. cognitive taxonomy of 'mental behaviors' (terrible term). Also on the point of enumerating specific content, this book follows the alternative approach of schematizing course content, a technique that is so much more flexible than the linear enumeration of topics treated.
For the literature mentioned, see the list in chapter 2.

1. Write up the specific content of the question.

Guideline two replaces the cognitive taxonomy or mental behavior specifications still widely used in the field, see Thomas Haladyna, Steven M. Downing, and Michael C. Rodriguez (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15, 309-334. http://depts.washington.edu/currmang/Toolsforteaching/MCItemWritingGuidelinesJAME.pdf [dead link? 1-2009]

2. Write up the kind of mastery that is asked for.
Reproduction, producing new examples, a new application, inference, translation, etcetera.

I am not aware of any place in the literature on educational measurement that treats the question of what the appropriate level of abstraction in test items is. Excluding, of course, the educational writing of John Dewey.

A somewhat related guideline is Haladyna's number three "Avoid overly specific and overly general knowledge."

TOO ABSTRACT or GENERAL

What is life?

The exchange of oxygen for carbon dioxide
The existence of a soul
Movement

Haladyna 1999 p. 79

3. Is the level of abstraction appropriate?

Guideline six in the Haladyna list: Avoid trick items.

7. No trick items allowed.
(Loosely speaking, a trick item is one the student cannot prepare for, or has not been prepared for in the course.)

catch

"In this country a date such as July 4, 1971, is often written 7/4/71, but in other countries the month is given second and the same date is written 4/7/71. If you do not know which system is being used, how many dates in a year are ambiguous in two-slash notation?"

Gardner (2006), problem 1.1

Catch questions can be trick questions, too. In the above box, a date such as 7/7/77 is not ambiguous. There are 12 x 12 = 144 dates in which both numbers are at most 12, but the 12 dates with equal day and month read the same either way; therefore the answer is 132, not 144. By definition, the catch is not something already familiar or learned. Therefore, catch questions do not belong in achievement tests. They might be used in instructional situations, of course. For example, Martin Gardner (2006) offers a fine collection of problems, some of which are catch questions.
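The answer of 132 can be verified by brute force. This is a minimal sketch; the function name `ambiguous_dates` and the choice of 1971 as the year are assumptions made for illustration. It walks through every day of one year and keeps the dates whose day and month could be swapped to give a different valid date.

```python
from datetime import date, timedelta

def ambiguous_dates(year=1971):
    """Dates whose two-slash notation could also be read as a different date."""
    found = []
    day = date(year, 1, 1)
    while day.year == year:
        # Swapping day and month yields a valid date only if the day is at most 12,
        # and a different date only if day and month are unequal.
        if day.day <= 12 and day.day != day.month:
            found.append(day)
        day += timedelta(days=1)
    return found

ambiguous_count = len(ambiguous_dates())
naive_count = 12 * 12  # the tempting answer that forgets dates like 7/7
print(ambiguous_count, "ambiguous dates; the naive count is", naive_count)
```

The gap between the two counts is exactly the catch: the twelve dates on which day and month coincide.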

Checklist of general concerns

THE CORRECT ALTERNATIVE — It saves energy — IS MISSING

5: Why do penguins walk so silly?

Their fat layers are in the way.
They do not have knees.
Their legs are rather short.

The item was used in the National Science Quiz 2005 in the Netherlands, and consequently was left out of the contest.

5a. Is the printed question correct?

In Dutch secondary education the exit exams are national tests. It is not humanly possible to prevent all errors in the formulation of the questions. An example of one such error is given here, a rather stupid one, demonstrating a serious shortcoming in the quality control process. The particular error results in all students getting full credit for the question involved, even if they did not answer it at all. The trouble is that errors of this kind cannot be repaired adequately and justly after the fact. The better students might in fact incur a small penalty here because of the grading-on-the-curve method of norming used in these exams, but they will not be told so. Almost everything else about this high-stakes testing may be found on the Het Examenblad site.

The first two score points of question 12, belonging to the part that is indicated in the marking scheme as:
"Two events in Southeast Asia must be named that prompted Eisenhower to make this statement (the events may not take place after 1954), for example:

China becomes a communist People's Republic in 1949
the communist aggression during the Korean War/the invasion of South Korea by communist North Korea in 1950"

are to be awarded to all candidates, regardless of the answer given.

[The question asks for events in South East Asia before 1954, and expects answers like 'China becoming communist in 1949' and 'The invasion of South Korea in 1950.' ALL examinees will now get two points for this question, regardless of whether they answered it and of what exactly they answered. Because Korea and China do not belong to South East Asia, it would not have been very clever to give these answers!]

Exit exam VWO history, 2006 http://www2.cito.nl/vo/ex2006/600043-A-12c-VW.pdf [dead link? 1-2009]. The CEVO is the authority handling scoring problems in the National Examination in the Netherlands.

Dewey G. Cornell, Jon A. Krosnick and LinChiat Chang (under review 2004). Student Reactions to Being Wrongly Informed of Failing a High-Stakes Test: The Case of the Minnesota Basic Standards Test.

"This study is based on an unfortunate event in 2000 when 7,989 students were wrongly informed that they had failed the Minnesota Basic Standards Test in Mathematics."

Kathleen Rhoades and George Madaus (2003). Errors in Standardized Tests: A Systemic Problem. Boston College: National Board on Educational Testing and Public Policy.

"This paper has shown that human error can be, and often is, present in all phases of the testing process. Error can creep into the development of items. It can be made in the setting of a passing score. It can occur in the establishment of norming groups, and it is sometimes found in the scoring of questions."
"As this monograph goes to press additional errors have come to our attention that could not be included. These and the errors documented in this report bear strong witness to the unassailable fact that testing, while providing users with useful information, is a fallible technology, one subject to internal and external errors. This fact must always be remembered when using test scores to describe or make decisions about individuals, or groups of students."
Lots of further documentation and literature in this shocking report.

Robert Schaeffer testimony before the New York Senate Higher Education Committee (May 2, 2006). http://www.fairtest.org/univ/SAT_Error_testimony.html [dead link? 1-2009]

Your worst nightmare, because of an irresponsible testing industry. My words, b.w.

5b. Is the scoring key correct?

universal quantifier
schema entanglement (verknooptheid)

Biology VWO 2006 http://www2.cito.nl/vo/ex2006/600025-1-26o1.pdf [dead link? 1-2009] question 15. The problem here is that alternative 2 is meant to be the 'correct' one. However, there is no indication whatsoever of the order of magnitude of the pressure changes depicted on the vertical scale. Therefore the issue arises whether it is absolutely true that no other process influences the pressure one way or the other. According to Karel Knip, NRC Handelsblad June 4, p. 51 (Alledaagse Wetenschap 'Erwtverwarming') there is: combustion of starch, a burning process, produces heat, and therefore increases the pressure in a closed vessel. Therefore, alternative 3 is the correct one.

Layout: Thompson, Johnstone and Thurlow (2002). For example, avoid right justified text. "Unjustified text may be easier for poorer readers to understand because the uneven eye movements created in justified text can interrupt reading." [I have not yet checked the empirical basis for the claim. As soon as I am convinced, I will start using ragged text.] Short lines - items printed in two columns - are more vulnerable in this respect than longer lines.

miscellaneous

Security

KEEP THE TEST SECURE

The National Science Quiz 2003, a TV quiz in the Netherlands, was published just before the planned recording session, making it possible for the participants to have read the questions beforehand. The decision not to go on with the 2003 show led to the departure of quizmaster Wim T. Schippers, who pioneered the quiz in 1994. Otherwise no serious harm was done. Security management in the case of examinations, especially nationwide ones, is a very serious matter. Premature publication on the internet of items from the exams would be a disaster for all examinees.

PISA FALLS

Which statement explains why there is daylight and darkness on Earth?

The Earth rotates on its axis.
The Sun rotates on its axis.
The Earth's axis is tilted.
The Earth revolves around the Sun.

Item from the science test, presented as exemplary by the German PISA 2003 coordinator Manfred Prenzel in "Die Zeit" (09.12.04) (P2, 394; 591 points).
Peter Bender (2005) Neue Anmerkungen zu alten und neuen PISA-Ergebnissen und -Interpretationen http://www-math.uni-paderborn.de/~bender/neueAnmerkungenPISAausf%FChrlich.pdf [dead link? 1-2009], p. 10: "All answers are wrong, in particular A as well. The explanation is rather: 'The Earth rotates about its own axis at a different angular velocity than it revolves around the Sun.'"

The PISA question asks why there is day and night on Earth. The correct answer, according to PISA, is that the Earth rotates. Peter Bender: the Moon rotates too, yet it always turns the same side to the Earth.

Bender's repair also fails, because it is known that Uranus has its axis of rotation pointing in the direction of the sun, in that way always turning the same side to the sun; how does this fact square with statement c? Designing achievement test items is a tricky business.

Of course, the stem of the question is ambiguous also, see Bender p. 11.

Well, isn't answer a. the best of the four alternatives? It would be foolish for a knowledgeable student NOT to answer a. What do you think: is this bad design, or not?

Do not misuse the instruction to students always to choose the best alternative, if more than one are deemed to be correct. The PISA question cited does not have a single correct answer, and is therefore a faulty item.

8.3 Independent quality check

Two senior GCE examiners re-marked photocopies of the same 200 GCE examination scripts, half still containing the marks and comments of the original examiners and half with these markings removed. Removing previous markings made a considerable difference to the extent of agreement between these sets of marks.

The above is the (ERIC) abstract of R. J. L. Murphy (1979). Removing the Marks from Examination Scripts before Re-Marking Them: Does It Make Any Difference? British Journal of Educational Psychology, 49, 73-78.

This kind of exercise, of course, has been carried out, in one variant or another, thousands of times, and many of these studies have been published. The results tend to be the same as reported in this particular case. Its relevance to the theme of this section is that it does not matter much whether it is examination scripts being rated, or the quality of proposed test items. The point is: everything possible should be done to obtain independent judgments of item quality; if anything might be amiss with a particular item, the probability of getting it noticed must be maximised (within reasonable bounds of available resources, of course).

MULTIPLE TRUE-FALSE

In children, ventricular septal defects are associated with

systolic murmur
pulmonary hypertension
tetralogy of Fallot
cyanosis

The difficulty is that the examinee has to make assumptions about the severity of the disease, the age of the patient, and whether or not the disease has been treated. Different assumptions lead to different answers, even among experts.

Case and Swanson (2001) p. 15 http://www.nbme.org/PDF/2001iwg.pdf [dead link? 1-2009]

Even if you were a doctor, could you intuitively feel something might be wrong with an item such as the boxed one? In many cases you will pass the item, I can assure you. In those cases, before using the item in your next test, test it by having it independently answered by knowledgeable colleagues; if the item is flawed they will come up with different answers. Case and Swanson clearly indicate what is the trouble with the boxed item. In fact, the stem can be regarded as a too abstract patient vignette. In this case the vignette should have information about the severity of the disease, etcetera.
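The check by independent colleagues can be organised as a simple tally. The sketch below is an illustration only: the function name `flag_disputed_items`, the data layout and the pilot answers are hypothetical, and the rule used (flag any item on which the colleagues are not unanimous) is one possible choice, not a procedure prescribed by Case and Swanson.

```python
from collections import Counter

def flag_disputed_items(answers_by_item, min_agreement=1.0):
    """Return the items on which independently answering colleagues disagree.

    answers_by_item maps an item label to the list of answers given by the
    colleagues, each working alone. An item is flagged when the share of the
    most common answer falls below min_agreement.
    """
    disputed = {}
    for item, answers in answers_by_item.items():
        counts = Counter(answers)
        top_count = counts.most_common(1)[0][1]
        if top_count / len(answers) < min_agreement:
            disputed[item] = dict(counts)
    return disputed

# Hypothetical answers from four colleagues; for a multiple true-false item
# the "answer" is the whole pattern of true/false judgments.
pilot = {
    "item 1": ["b", "b", "b", "b"],
    "item 2": ["TFTF", "TTTF", "TFTF", "TTFF"],
}
disputed = flag_disputed_items(pilot)
print(disputed)
```

A flagged item is not necessarily wrong, but it should not go into the test before the source of the disagreement has been traced.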

Bar-Hillel and Falk (1982) analyse a seemingly simple statistics question, showing how superficial analysis can result in different answers being given, among them the correct one (but for the wrong reason), because there are no short-cuts around a careful and complete analysis of the simple problem as presented in the question. The case is treated in the Dutch chapter. A problem of this kind has become famous in the US; it is given in the box.

The Monty Hall problem

Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the other doors, opens another door, say No. 3, which has a goat. He then says to you, 'Do you want to pick door No. 2?' Is it to your advantage to take the switch?
A reader's question put to Marilyn vos Savant, columnist for the Sunday Parade, September 1991.

Did you answer the question? Why do you think your answer is the correct one? For more information, see Zomorodian (1998). A book on the problem is Rosenhouse (2009). By the way, Marilyn's answer was: switch. Thousands of readers disagreed. The critical sentence in the problem is that the quizmaster knows what is behind the doors, making Marilyn's answer true.
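Readers who distrust the argument can play the game many times over. The following is a minimal simulation sketch; the names `play` and `win_rate`, the number of games and the seed are assumptions for illustration. It encodes the critical condition that the host knows where the car is and always opens a goat door other than the contestant's pick.

```python
import random

def play(switch, rng):
    """Play one game; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host knows where the car is and opens a goat door that was not picked.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one door that is neither the first pick nor the opened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, games=100_000, seed=1):
    """Share of games won under a fixed strategy (always switch or always stay)."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(games)) / games

print("switch:", win_rate(True))
print("stay:  ", win_rate(False))
```

Under these rules switching wins in roughly two out of three games and staying in roughly one out of three. If the host opened a door at random, the simulation would have to be changed and the result would differ.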

The reason to mention this particular problem in this particular place is that it signals a tremendous problem in assessment. Here is a very simple question, or so it seems, allowing a choice between two answers only. And yet many experts disagree with each other on the correct answer. What is worrying about this case is that it shows that there do not exist procedures that are guaranteed to result in a correct verdict on what is a correct or the best answer on a given test question. In this particular case it is clear, after the fact, what is the correct answer. But what to think of all those cases where experts disagree (the items concerned should be discarded) or where the assessor's judgment is contested on reasonable grounds by the student?
There is, however, quite another way to look at the Monty Hall type of problem. See De Neys and Verschueren (2006). Working memory capacity proves to distinguish subjects solving the problem correctly from those failing to do so. This finding does not rule out other explanations in the extensive literature on the Monty Hall problem (see the list in De Neys and Verschueren), but it should be a warning that this type of problem is a breach of validity in achievement testing. Differences in working memory capacity, or any other differences in intelligence, should not be a factor in educational achievement testing. They might be in participating in particular courses, of course.

The question was answered by Marilyn vos Savant in her Sunday Parade column, September 1991: Switch!

Asking only fellow teachers to independently check on the quality of items runs the risk of overlooking an important category of faulty item design: items being way too difficult for the pupils or students, while at the same time being quite easy for the expert colleague. See Redish, Scherr and Tuminaro (2005) for classroom research on exactly this kind of design problem (the problem is given in the figure). Finding out about this kind of problem after the fact is not clever: the damage cannot be repaired. It may be necessary to use students to check for this kind of problem; if so, do it.

Figure 8.3.1. The three-charge-problem (Redish, Scherr and Tuminaro, 2005)

8.4 Check lists

8.5 An historical perspective

THE NECK VERSE (READING TEST TO SAVE ONE'S LIFE)

Verse 1 of Psalm 51: "Have mercy upon me, O God, according to thy loving kindness; according unto the multitude of thy tender mercies blot out my transgressions."

This biblical verse was used in medieval England to test whether a convict's claim to belong to the clergy was true: if he could read, he was a clerk, and could escape capital punishment by a lay court. The favorite passage to be read was the verse cited; it could save one's neck. Yes, cases of fraud have been known: the verse could be learned by heart. See Frank Daniels (1973). Teaching reading in early England (p. 160-4). Pitman. Or Google "Neck Verse".

Starch and Elliott, in 1913, used the photographed copy of one final examination in United States history in a large school in Wisconsin: in print it is one half page of questions, and two pages of answers. Seventy principal teachers of history from schools whose passing grade was 75, graded this same examination work. The picture shows the result as summarized in Starch (1916, p. 7).

[Figure: dot plot of the marks, on a horizontal scale running from 40 to 90.]
"Marks assigned to a history paper by seventy teachers."

Daniel Starch and Edward C. Elliott (1913). Reliability of grading work in history. School Review, 21, 676-681. Reprinted in John A. Laska and Tina Juarez (Eds) (1992). Grading and marking in American schools. Two centuries of debate. Springfield, Illinois: Thomas. The article gives the full text of the examination questions and the examination work used in the investigation.

Research by Daniel Starch and Edward Elliott just before World War I has been highly influential in the American way of grading, and the explosion of standardized testing in schools in the years immediately following the war.

The common method of grading in the early twentieth century was to mark in percentages of the material the student had learned - in the eyes of the assessor, of course. To begin with, this research showed percentage marking to be ridiculously imprecise, and so marked the change to letter grading, common nowadays.

Much more important, though, results like these gave proponents of standardized testing decisively strong arguments to introduce these new testing formats in the schools.

Probably you have always thought history papers were difficult to grade reliably. The next figure shows the results of the same kind of research, now using a paper in mathematics.

[Figure: dot plot of the marks assigned to the mathematics paper; the horizontal scale is labelled 28, 53, 60, 70, 80, 90.]
"Passing grade 75. Marks assigned by schools whose passing grade was 70 were weighted by 3 points. Median 70. Probable error 7.5"

Daniel Starch and Edward C. Elliott (1913). The Reliability of grading work in mathematics. School Review, 21, 254-9. The data from this research, as presented in figure 3 in Starch (1916, p. 6).

What I want you to take away from the Starch and Elliott research is, first of all, its method: get independent assessments of the same work, and let the results educate you. The second important observation is that pupils should never be put in a position where their fate depends on the marking of one or two examination papers of this kind. The usual solution is what is called in Dutch 'vijven en zessen' (a great deal of shilly-shallying): the teachers and the principal of the unlucky pupil will consider other indicators of his or her school achievement to correct unexpectedly low marks. There is a lot more to say about this kind of problem; I will not do so here, except for this one remark.

It is at our own peril that we forget about this kind of research. It has been of tremendous significance in the promotion of so-called objective methods of assessment. And yet, to the critical observer, it should be immediately obvious that the research methodology of Starch - and also that of Hartog and Rhodes (see below) and many others - is seriously at fault. The independent assessors do not know anything about the pupil who wrote the essay, nor about the instruction he or she has received: no wonder the assessments of the same work are that wildly different. This is just another case of the old psychometric dictum (see, among others, Frederic Lord on the way scores in a test battery should be combined to arrive at a final verdict: they should compensate each other): always use information from other assessments of work of the same pupil to arrive at a more balanced view of achievements and talents. Even more serious, though, is the total neglect of the possibility of assessment for learning: the grading of the same work is analysed, not the feedback these assessors would have given the pupil. The last point has been elaborated beautifully by Dennie Palmer Wolf (1993).

C. W. Valentine (1932). The Reliability of Examinations. An Enquiry. London: University of London Press.

"Many of us," writes the Headmaster of Holt School, Liverpool, "have had the heart-breaking experience of a candidate, the best in his year, with as many as four distinctions, failing to get matriculation or even a School Certificate."

Valentine, among other things, illustrates how unjust the requirement of a pass for compulsory subjects can be. His book, which really is about examinations, not about individual examination papers, nevertheless contains gems like the cited one about the mishandling of the graded individual examination paper.

School certificate history: predictable?

"Fifteen scripts were selected which had been awarded exactly the same "middling" mark by the School Certificate authority concerned, and these scripts were marked in turn and independently by 15 examiners, who were asked to assign to them both marks and awards of Failure, Pass and Credit. After an interval which varied with the different examiners, but was not less than 12 nor more than 19 months in any instance, the same scripts, after being renumbered, were marked again by 14 out of the 15 original examiners (...). The 14 examiners assured us that they had kept no record of their previous work and this was indeed obvious from the results.
(...)
Perhaps the most striking feature in the investigation is this: (...) On each occasion the 14 examiners awarded a total of 210 verdicts [Fail, Pass, Credit] to the 15 candidates. It was found that in 92 cases out of the 210 the individual examiners gave a different verdict on the second occasion from the verdict awarded on the first. In nine cases candidates were moved two classes up or down.
One examiner changed his verdict in regard to eight candidates out of the fifteen. Yet he only varied his average by a unit [scale of 100] and he awarded the same number of Failure marks, one less Pass, and one more Credit. Such irregularity of judgment is not only formidable, but it is one which would not be detected by any ordinary analysis."

Ph. Hartog and E. C. Rhodes (1936). An examination of examinations (p. 14-15). Second edition. International Institute Examinations Enquiry. London: MacMillan.

The main work, however, is Ph. Hartog and E. C. Rhodes (1936). The marks of examiners. London.

The big omission in the Hartog and Rhodes prose is that they have not taken into account the number of different verdicts that are to be expected under ideal circumstances. Let me explain this in a few words. Under pass-fail scoring it will almost always be the case that the cutting score falls on a point where there are many pupils scoring exactly the cutting score, or one point less. Even when using a test that is scored very reliably, an independent assessment will result in an appreciable number of pupils crossing the line one way or the other. The problem here is much more complicated than scoring reliability alone. Regrettably, the educational measurement profession has chosen to focus on the 'subjectivity' of these assessments, because it is the natural thing to do for the measurement specialist. Seven decades later, the basic problems still have to be solved; for example, see Bennett and Ward (1993). They are all the more difficult to solve now, because in the meantime the world - especially the US - has been sold on standardized tests.
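The point about verdicts changing under ideal circumstances can be made concrete with a small simulation. This is a sketch under loudly stated assumptions, none of which come from Hartog and Rhodes: pupils' true scores are normally distributed around the cutting score, each marking adds a small independent normal error, and the name `changed_verdicts` and all numbers (cut 55, spread 10, marking error 2) are hypothetical. With these numbers the marking is very reliable, yet two independent markings still disagree on pass or fail for some pupils.

```python
import random

def changed_verdicts(n_pupils=10_000, cut=55.0, sd_true=10.0, sd_marking=2.0, seed=1):
    """Share of pupils whose pass/fail verdict differs between two markings."""
    rng = random.Random(seed)
    changed = 0
    for _ in range(n_pupils):
        true_score = rng.gauss(cut, sd_true)            # pupils cluster around the cut
        first = true_score + rng.gauss(0, sd_marking)   # first independent marking
        second = true_score + rng.gauss(0, sd_marking)  # second independent marking
        if (first >= cut) != (second >= cut):
            changed += 1
    return changed / n_pupils

# Reliability of a single marking under these assumptions: true variance
# divided by total variance.
reliability = 10.0**2 / (10.0**2 + 2.0**2)
share_changed = changed_verdicts()
print(f"reliability {reliability:.2f}, share of changed verdicts {share_changed:.3f}")
```

The share of changed verdicts obtained this way is the baseline against which a count such as 92 out of 210 would have to be judged; it shrinks when the marking error shrinks, but it does not vanish as long as pupils score close to the cutting score.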

J. R. Gerberich (1956). Specimen objective test items. A guide to achievement test construction. Longmans.

'Recognizing good sentences' a valid educational objective?

.... decide whether or not it is a good sentence.

right wrong 1. During the program

right wrong 2. If you can't go

right wrong 3. Here are some flowers

right wrong 4. While running down the street

right wrong 5. That night it rained hard

Gerberich (1956, p. 39), from Iowa Tests of Basic Skills, Test C, Basic Language Skills, Elementary Battery, Form S. Part V, Sentence Sense.
Gerberich presents this as an example of good practice. Brrrrrr. Poor kids.
This is a serious problem. Many of the 28 examples in the skills chapter are of the same type: recognize erroneous usage, etcetera.
1. If the exercise is not a good sentence, place an X in the E (Wrong) box.
4. Read all the shorthand carefully. An occasional outline makes no reading sense in the place used. When you find such an outline, cross it out.
6. Read each sentence and decide whether there is an error in usage in any of the underlined parts of the sentence. (...) If there is no usage error in the sentence, put a zero (0) in the parentheses.
7. In each of the following sentences some one form of punctuation is missing.
9. The following sentences all show faults of construction, such as mixed, incomplete, dangling, or illogical constructions or lack of desirable parallelism.
12. Each line in the paragraph may contain an error in grammar (agreement, tense, mood, comparison, reference, etc.) or in idiomatic usage (right preposition, proper word or phrase, etc.).
25. If you think a word is misspelled, draw a circle around it. Write the misspelled words correctly on the lines below the passage.
27. This is a test in proofreading.
28. Reprinted below is a poorly written passage. You are to treat it as though it were the final draft of a composition of your own, and revise it so that it conforms with standard formal English.

With the exception of item 27 which tests an authentic educational goal - it is from the Landis Achievement Test in Printing - these items test skills that evidently do not belong to the primary goals of the courses involved. I will grant that scores on these items will correlate appreciably with whatever are the core skills of these courses, but that is not the point, even if the test constructors should have made it their point - they surely will have done so. Show me how you test your pupils, and I will tell you what you are teaching them. What these tests teach pupils is that - whatever their efforts to learn - tests will compare their ability to that of others. A clear signal to give up on education as soon as there is an opportunity to do so.

Banesh Hoffmann (1962/1978). The tyranny of testing. Crowell-Collier. Reprint 1978. Westport, Connecticut: Greenwood Press.
A critical book on testing, especially on the lack of quality in the questions used. Highly influential. I will prepare a number of typical examples mined by Hoffmann, and how the testing business handled his criticisms.
His book probably is dated in the sense that times have changed since the early 1960s: a lot of secrecy has disappeared, and it is generally acknowledged that items might be wrongly designed. The later book by Owen, however, shows there still might be a lot amiss; there is still room for the Hoffmanns of the 21st century to do their muckraking work. In fact, Hoffmann's book was reprinted again in 2003.
ANYTHING GOES

Which is the odd one out among cricket, football, billiards, and hockey? _____________

Hoffmann, 1962, p. 17 and following. It all started with a letter of T. C. Batty printed in the Times of London March 18, 1959.
How many different answers and reasons can you think of? Hoffmann's point is, of course, that this kind of low-quality test question by no means is an exception in high-stakes testing.

It seems to me that those who have been responsible for inventing this kind of brain teaser have been ignorant of the elementary philosophical fact that every thing is at once unique and a member of a wider class.

From a philosopher's letter, March 20th in the Times

The discussion in the Times petered out, possibly because no one responsible for question design joined it. That's a pity. In a recent book on item writing Haladyna unwittingly presented a new version of this kind of disastrous item (1999, p. 101).
ANYTHING STILL GOES IN 1999

The best way to improve the reliability of test scores is to
increase the length of the test.
improve the quality of items on the test.*
increase the difficulty of the test.
decrease the difficulty of the test.
increase the construct validity of the test.

The problem, of course, is that some non-existent 'most common' test is referred to. What kind of test might that be? Is it already a long test, does it have 'easy' items? Is it a psychometrically 'optimal' test? If so, why pose the question? Etcetera. Haladyna knows this item is not a very good one, and changes it into the multiple true-false format. Doing so will solve some ambiguities, but not all of them.
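
How much alternative 1 can contribute depends on where the test starts. The sketch below uses the Spearman-Brown formula, which is not part of Haladyna's item or of the text above; it assumes the added items are comparable to the existing ones, and the function name and the starting reliabilities are illustrative only.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by `length_factor`
    with items comparable to the existing ones (Spearman-Brown formula)."""
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)

# Doubling a short, unreliable test versus doubling an already reliable one.
for start in (0.50, 0.90):
    doubled = spearman_brown(start, 2.0)
    print(f"reliability {start:.2f} -> {doubled:.3f} after doubling the length")
```

Doubling the length lifts a reliability of 0.50 to about 0.67, but a reliability of 0.90 only to about 0.95. Whether lengthening is 'the best way' therefore cannot be judged without knowing which test the item has in mind.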

IGNORANT DESIGN

Emperor is the name of
a string quartet
a piano concerto
a violin sonata

Hoffmann, 1962, p. 22, no source mentioned.

The 'superior student,' as Hoffmann calls him [it is a 'him'], knows of Haydn's Emperor Quartet. Hoffmann then gives a sharp insight into the loss of trust in the test as a fair instrument that this student must experience. It is easy for the student to entertain the hypothesis that the test designer is ignorant and that answer 2 is the correct one. But ...

... he has been led to call into question both the good will and the competence of the examiner; and this subjects him to a handicap, the severity of which will depend on how faulty or impeccable is the rest of the test. No longer is it possible for him to skim innocently ahead. Instead, he must proceed warily and dubiously, ever alert for intentional and unintentional pitfalls. And whenever he comes to a question for which he, with his superior ability, sees more than one reasonable answer, he must stop to evaluate afresh the degrees of malice and incompetence of the examiner. Such a test becomes for the superior student a highly subjective exercise in applied psychology - and, if he is sensitive, an agonizing one.

Hoffmann, 1962, p. 22

Ideally, testees should never be put in this kind of position. In actual practice, however, almost every test has its defective items, to the despair of the testees. Every defective item met, whether it is a trick question or one where the correct alternative happens not to have been printed, is highly disturbing for the serious testee.
There is some research corroborating Hoffmann's observation; see, for example, Crombag, Gaff and Chang (1975), mentioned in the (Dutch) chapter 2, who report that 'superior students', here students reading more than the course material itself, attained lower grades than other students.

HOFFMANN'S TRAP

The American colonies were separate and _____________ entities, each having its own government and being entirely _____________ .

A. incomplete - revolutionary

B. independent - interrelated

C. unified - competitive

D. growing - organized

E. distinct - independent

Hoffmann, 1962, p. 23 ff, question 17 in the 1956 descriptive booklet on the SAT.

Hoffmann gives this question a lot of space in his book, because it is rather typical of tests like the SAT, and its defect can enrage 'those who work in relevant fields, such as history, sociology, and English.' The intended answer is E; the specialist's choice tends to be D. Its defect, of course, is that it is meant to be an easy question, but it is a difficult one for the mistrustful student who weighs the meaning of words like 'entirely' and ponders why they are there. His chapter 14, 'Return to the colonies', discusses the official defence by Educational Testing Service, the producer of the SAT.

Not all people who study this 'colonies' question regard it as seriously defective. But this fact does not make it acceptable. If a sizable number of qualified, intelligent people believe a question to be so worded that the wanted answer is unacceptable, that is sufficient reason for branding the question defective, for there will be intelligent examinees who will realize the same thing, and they will be penalized for their perspicacity.

Hoffmann, 1962, p. 26

I am beginning to like this fellow Banesh Hoffmann. He is very perceptive of crucial quality issues in item design, and at no time uses extreme language to judge those responsible. Thus far at least.

IGNORANCE AGAIN

George Washington was born on February 22, 1732.
true
false

According to the Julian calendar used at the time of Washington's birth, and thus according to contemporary records of that event, he was born on February 11, not February 22. Which answer, then, should one pick: True or False? Remember: no explanations are allowed.

Hoffmann, 1962, p. 27, no source mentioned (supposedly, it is an item from an intelligence test).

A beauty. Remember also: examiners are in the habit of reproaching students for not having seen the exact meaning of this or that word in the question stem. What, in the example given, is the exact meaning of the absence of information on the kind of calendar? Dear George could never have been born on a date according to a calendar not in use at the time of his birth. Saying he was is just a convenient fiction, an untruth nevertheless. Am I explaining too much here?
Yes-no questions are especially vulnerable on points like this. In the particular case the ambiguity should have been removed, for example, by deleting the day '22.'
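
The eleven-day gap in the Hoffmann quote is easy to check. The sketch below converts a Julian calendar date to the Gregorian calendar through the Julian Day Number. The function name is illustrative, and the function assumes years counted from 1 January; contemporary English records, which began the year on 25 March, would moreover write the year as 1731, which adds one more ambiguity to the item.

```python
from datetime import date

def julian_to_gregorian(year, month, day):
    """Convert a Julian-calendar date (year counted from 1 January) to the
    proleptic Gregorian date, by way of the Julian Day Number."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    # Python ordinals count days from 0001-01-01 (Gregorian), which is JDN 1721426.
    return date.fromordinal(jdn - 1721425)

print(julian_to_gregorian(1732, 2, 11))  # 1732-02-22
```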

Ebel, Robert L. (1965). Measuring educational achievement. Englewood Cliffs, New Jersey: Prentice-Hall.
Ch. 2. What should achievement tests measure? - Ch 9. How to judge the quality of a classroom test - Ch 11. How to improve test quality through item analysis - Ch. 12. The validity of classroom tests - Ch. 13. Marks and marking systems

8.6 Literature
R. C. Anderson (1972). How to construct achievement tests to assess comprehension. Review of Educational Research, 42, 145-170.
M. Bar-Hillel and R. Falk (1982). Some teasers concerning conditional probabilities. Cognition, 11, 109-122.
Denny Borsboom, Gideon J. Mellenbergh and Jaap van Heerden (2004). The concept of validity. Psychological Review, 111, 1061-1071. pdf
abstract This article advances a simple conception of test validity: A test is valid for measuring an attribute if (a) the attribute exists and (b) variations in the attribute causally produce variation in the measurement outcomes. This conception is shown to diverge from current validity theory in several respects. In particular, the emphasis in the proposed conception is on ontology, reference, and causality, whereas current validity theory focuses on epistemology, meaning, and correlation. It is argued that the proposed conception is not only simpler but also theoretically superior to the position taken in the existing literature. Further, it has clear theoretical and practical implications for validation research. Most important, validation research must not be directed at the relation between the measured attribute and other attributes but at the processes that convey the effect of the measured attribute on the test scores.
Case and Swanson (2001). Constructing Written Test Questions For the Basic and Clinical Sciences. National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104 http://www.nbme.org/PDF/2001iwg.pdf [dead link? 1-2009].
Randy Elliott Bennett and William C. Ward (Eds) (1993). Construction versus choice in cognitive measurement. Issues in constructed response, performance testing, and portfolio assessment. Hillsdale: Erlbaum.
Wim De Neys and Niki Verschuren (2006). Working Memory Capacity and a Notorious Brain Teaser. The Case of the Monty Hall Dilemma. Experimental Psychology, 53, 123-131. pdf
from the abstract Findings indicate that working memory capacity plays a key role in overcoming salient intuitions and selecting the correct switching response during MHD reasoning.

S. M. Downing (2002). Threats to the validity of locally developed multiple-choice tests in medical education: construct-irrelevant variance and construct underrepresentation. Adv Health Sci Educ Theory Pract, 7(3), 235-41.
abstract Construct-irrelevant variance (CIV) - the erroneous inflation or deflation of test scores due to certain types of uncontrolled or systematic measurement error - and construct underrepresentation (CUR) - the under-sampling of the achievement domain - are discussed as threats to the meaningful interpretation of scores from objective tests developed for local medical education use. Several sources of CIV and CUR are discussed and remedies are suggested. Test score inflation or deflation, due to the systematic measurement error introduced by CIV, may result from poorly crafted test questions, insecure test questions and other types of test irregularities, testwiseness, guessing, and test item bias. Using indefensible passing standards can interact with test scores to produce CIV. Sources of content underrepresentation are associated with tests that are too short to support legitimate inferences to the domain and which are composed of trivial questions written at low-levels of the cognitive domain. "Teaching to the test" is another frequent contributor to CUR in examinations used in medical education. Most sources of CIV and CUR can be controlled or eliminated from the tests used at all levels of medical education, given proper training and support of the faculty who create these important examinations.
Martin Gardner (2006). The colossal book of short puzzles and problems. Edited by Dana Richards. Norton.
W. K. B. Hofstee (1971). Begripsvalidatie van studietoetsen: een aanbeveling. Nederlands Tijdschrift voor de Psychologie, 26, 491-500.
Construct validation of achievement tests: A recommendation. Referred to by Richard E. Snow (1993). Construct validity and constructed-response tests. In Bennett & Ward, Construction versus choice in cognitive measurement. Erlbaum.
summary (p. 499) In educational testing, content is usually proposed as a solution to the validity problem. The present paper argues in favor of construct validation as a preferable alternative. Content validation, as described in many textbooks, falls short of scientific standards like explicitness and objectivity, a sound empirical basis, and clear conceptualization. It should be considered as a first step in the (construct) validation of an achievement test, not as a sufficient indication of its validity.
Procedures for assessing the construct validity of achievement tests are summarized, like: internal analysis, convergent and discriminant validation, comparisons between groups of ss, and experimental manipulation. Emphasis on construct validation implies the desirability of standardized testing, since it can hardly be expected that teacher-made tests are routinely validated in an elaborate fashion.
I am not sure whether Willem Hofstee still is of the opinion as expressed in the last sentence. A counterargument would be that there scarcely is any merit in investing in standardized tests to be more 'valid' than the instruction of individual teachers itself is. Think this over.
The point of mentioning (construct) validity here is that, ultimately, test use should be evaluated against suitable criteria. Test use, of course, includes test construction, as well as decisions taken on the 'evidence' from the tests as constructed. In the seven foregoing chapters the issue did not arise, because it was assumed tests should reflect educational purpose. In chapter 8 the issue is, does the assumption hold true?
Eventually I will summarize the points in the Hofstee article, part of which are new, part are cited from the literature.
Christina Huber, Martina Späni, Claudia Schmellentin und Lucien Criblez (2006). Bildungsstandards in Deutschland, Österreich, England, Australien, Neuseeland und Südostasien Literaturbericht zu Entwicklung, Implementation und Gebrauch von Standards in nationalen Schulsystemen. Fachhochschule Nordwestschweiz Pädagogische Hochschule Institut Forschung und Entwicklung Kasernenstr 5001 Aarau . http://www.edk.ch/PDF_Downloads/Harmos/Literaturanalyse_1.pdf [dead link? 1-2009]
Samuel Messick (1993). Trait equivalence as construct validity of score interpretation across multiple methods of measurement. In Randy Elliott Bennett and William C. Ward Construction versus choice in cognitive measurement (p. 61-73). Erlbaum.
Samuel Messick (1989). Validity. In Robert L. Linn, Educational measurement (3rd. ed., pp. 13-103). New York: American Council on Education and MacMillan.
Samuel Messick (1995). Validity of psychological assessment: validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749. From its abstract: "Six distinguishable aspects of construct validity are highlighted as a means of addressing central issues implicit in the notion of validity as a unified concept. These are content, substantive, structural, generalizability, external, and consequential aspects of construct validity. In effect, these six aspects function as general validity criteria or standards for all educational and psychological measurement, including performance assessments, which are discussed in some detail because of their increasing emphasis in educational and employment settings."
NIP (1986). Richtlijnen voor ontwikkeling en gebruik van psychologische tests en studietoetsen. Amsterdam: Nederlands Instituut voor Psychologie. Second edition.
These guidelines place a rather one-sided emphasis on desiderata put forward by psychometrics. As a consequence they add little or nothing to the treatment Cohen (1981) gives of the same subject. The guidelines do, however, have a certain binding force for psychologists.
APA (1966/1974/1985/1999) Standards for educational and psychological tests, Washington, D.C: American Psychological Association.
The Richtlijnen and the Standards are not available online, as they ought to be. There is some information on the internet, among other things a piece by George Madaus and a summary of the Standards by ERIC.
George Madaus, Carolyn A. Lynch and Peter S. Lynch (2001). A Brief History of Attempts to Monitor Testing. National Board on Educational Testing and Public Policy: Statements, Volume 2, Number 2. html
ERIC Development Team (1994). Questions to ask when evaluating tests. ERIC/AE Digest. http://eric.ed.gov/ERICDocs/data/ericdocs2/content_storage_01/0000000b/80/2a/23/c0.pdf [dead link? 1-2009]
"This Digest identifies the key standards applicable to most test evaluation situations. Sample questions are presented to help in your evaluations."
Sandra Thompson, Martha Thurlow and David B. Malouf (2002). Creating better tests for everyone through Universally Designed Assessments. pdf
"Universally designed assessments are designed and developed to allow participation of the widest possible range of students, in a way that results in valid inferences about performance on grade-level standards for all students who participate in the assessment. This paper explores the development of universal design and considers its application to large-scale assessments."
Code of Fair Testing Practices in Education. Prepared by the Joint Committee on Testing Practices. html
Note that this is NOT a summary of the Standards; it is a code on a special topic.
Legal issues in grading. A section in Louis N. Pangaro and William C. McGaghie (Lead Authors) Evaluation and Grading of Students. pdf, being chapter 6 in Ruth-Marie E. Fincher (Ed.) (3rd edition) Guidebook for Clerkship Directors. downloadable [Alliance for Clinical Education, USA]
Edward F. Redish, Rachel E. Scherr, and Jonathan Tuminaro (2005). Reverse engineering the solution of a 'simple' physics problem: Why learning physics is harder than it looks. paper pdf Also published in The Physics Teacher, 44, 293-300.
abstract In this paper, we show an example of students working on a physics problem—an example that demonstrated to us that we had failed to understand the work they needed to do in order to solve a "simple" problem in electrostatics. Our critical misunderstanding was failing to realize the level of complexity that was built into our own knowledge about physics.
Jason Rosenhouse (2009). The Monty Hall problem. The remarkable story of math’s most contentious brain teaser. Oxford University Press.
Richard E. Snow (1993). Construct validity and constructed-response tests. In Randy Elliott Bennett and William C. Ward Construction versus choice in cognitive measurement (p. 45-60). Erlbaum.
R. E. Snow and D. F. Lohman (1989). Implications of cognitive psychology for educational measurement. In Robert L. Linn Educational Measurement (3rd ed., pp. 262-331). New York: American Council on Education and MacMillan.
Haggai Kupermintz, Vi-Nhuan Le and Richard E. Snow (1999). Construct validation of mathematics achievement: Evidence from interview procedures. CSE Technical Report 493. http://www.cse.ucla.edu/Reports/TECH493.PDF [dead link? 1-2009]
U.S. Supreme Court. Regents of University of Michigan v. Ewing, 474 U.S. 214 (1985). html
"The record unmistakably demonstrates that the decision to dismiss respondent was made conscientiously and with careful deliberation, based on an evaluation of his entire academic career at the University, including his singularly low score on the NBME Part I examination. The narrow avenue for judicial review of the substance of academic decisions precludes any conclusion that such decision was such a substantial departure from accepted academic norms as to demonstrate that the faculty did not exercise professional judgment"
Sandra J. Thompson, Christopher J. Johnstone and Martha L. Thurlow (2002). Universal Design Applied to Large Scale Assessments. The National Center on Educational Outcomes html
From the summary: ... seven elements of universally designed assessments are identified and described in this paper. The seven elements are:
Inclusive assessment population
Precisely defined constructs
Accessible, non-biased items
Amenable to accommodations
Simple, clear, and intuitive instructions and procedures
Maximum readability and comprehensibility
Maximum legibility
Each of the elements is explored in this paper. Numerous resources relevant to each of the elements are identified, with specific suggestions for ways in which assessments can be designed from the beginning to meet the needs of the widest range of students possible.
Andrew Watts (2006). Fostering communities of practice in examining. A rationale for developing the use of new technologies in support of examiners. Cambridge Assessment Network site pdf
abstract Examiners and assessors who work in teams to judge the quality of students' work in examinations, or of trainees' performance in assessments of competence, are frequently described as working in ‘communities of practice'. Following Wenger (Communities of Practice: Learning, Meaning and Identity. 1998), this concept is used to describe the way examiners acquire their craft and maintain their competence in it. This paper discusses some of the literature about the place of communities of practice in examining, and seeks to clarify the rationale for them. It does this in the light of the significant changes taking place because of the introduction of new technologies to examining. The concept of communities of practice has often been put forward as a description of the strategies and procedures which lead to reliable marking. The use of e-technology could support such an aim. The paper argues that, at the same time, the necessity of fostering communities of practice to provide a context for valid assessments, can also be supported by new technologies.
Dennie Palmer Wolf (1993). Assessment as an episode of learning. In Bennett and Ward (1993, 213-240).
[There is no online version of this paper/chapter. Dennie, please repair this.]
Afra Zomorodian (1998). The Monty Hall problem. (unpublished?) pdf
introduction This is a short report about the infamous “Monty Hall Problem.” The report contains two solutions to the problem: an analytic and a numerical one. The analytic solution will use probability theory and corresponds to a mathematician’s point of view in solving problems. The numerical solution simulates the problem on a large scale to arrive at the solution and therefore corresponds to a computer scientist’s point of view.

more literature

Principles for Fair Student Assessment Practices for Education in Canada pdf
Denise Jarrett and Robert McIntosh (2000). Teaching mathematical problem solving: Implementing the vision. A literature review. pdf
"This document reviews recent research and literature on the essential traits and processes of teaching and learning mathematics through open-ended problem solving. The literature and research on effective problem solving informed the design of the NWREL Mathematics Problem-Solving Model™."
Douglas Fuchs and Lynn S. Fuchs (1986). Test procedure bias: A meta-analysis of examiner familiarity effects. Review of Educational Research, 56, 243-262.
"In the typical study, the effect of examiner familiarity raised test performance by .28 standard deviations."

Links

Bas Braams website http://www.math.nyu.edu/mfdd/braams/links/ Review of PISA Sample Science Unit 1: Stop That Germ, November 29, 2004. html; Review of PISA Sample Science Unit 2: Peter Cairney, November 29, 2004. html; Comments on the June, 2003, New York Regents Math A Exam html; Mathematics in the OECD PISA Assessment html; OECD PISA: Programme for International Student Assessment. html.
FairTest The National Center for Fair & Open Testing site of the FairTest Examiner journal.
Het Examenblad http://www.eindexamen.nl/
is the official website about the exit exams in secondary education in the Netherlands.
There has been a lot of fuss in our little country about teaching in English instead of in Dutch, yet this important website is entirely in Dutch.
On the site one may find all important documents:
the law and all regulations following from it,
the educational goals tested for,
the tests themselves,
the correction prescriptions,
the corrections on the prescriptions because of errors in the examination papers,
and the grading on the curve that is practised in the Netherlands ('omzettingstabel normering') using a method that is not known publicly.

Yes, what is missing is the documentation of the design of the examination questions themselves. Well, let us say it is thought not to be in the best interests of the institutions involved to be candid about the design of the questions. A state of affairs that leaves something to be wished for.

8.7 Terminology

true-false format - TF - AC (Osterho, 1999) - two-choice - binary choice; answer categories: yes or no, right or wrong, true or false, correct or incorrect
response changes - answer changing: changing the answer given in the first reading of the MC test. [Reile and Briggs, JEP 1952: Should Students Change Their Initial Answers on Objective-type Tests?]
instructions, and their effects [Prieto and Delgado, 1999]
shift error, shift error detection: a student making a mistake on the scoring sheet might cause all following questions to be wrongly marked; how is this statistically detected? [Shift Error Detection in Standardized Exams - Skiena, Sumazin (2000), pdf files]
item order, item sequence effects: easy items first? random order?
cheating: detecting cheating on MC tests

January 10, 2011 \ contact ben at at at benwilbrink.nl http://www.benwilbrink.nl/projecten/06examples8.htm
