Original publication 'Toetsvragen schrijven' 1983 Utrecht: Het Spectrum, Aula 809, Onderwijskundige Reeks voor het Hoger Onderwijs ISBN 90-274-6674-0. The 2006 text of chapter 2 is a text currently under revision.

Item writing

Techniques for the design of items for teacher-made tests

2. Item types, transparency, item forms and abstraction level


Ben Wilbrink

this database of examples has yet to be fully constructed. Suggestions? Mail me.

content and form

Figure 1. Natural content of the question contrasted with the question format. Forcing the natural constructed-answer question into the choice format is a critical operation.

This chapter uses mainstream ideas on item writing. Some minor points of departure from American usage are to be noted, however, and a few major ones. A minor admonition is to keep the language pure, so 'wrong answers' in the MC question are just that, wrong answers, not 'distractors.' The choice of options in MC questions should reflect choices or discriminations belonging to the instructional goals. A major addition is the 1970 proposal by De Groot to add transparency as a criterion of quality in achievement tests. A major change is to replace psychological categories of understanding etc. with epistemological ones about how knowledge in a particular disciplinary domain is structured. And it will be shown that forced guessing under pass-fail scoring is unfair to testees (publication in preparation).

What item type to use for a particular question is a rather critical decision. The particular choice made can either detract from or contribute to the validity of your questioning of mastery of course content. Ultimately, item type and content tested for are intricately related. Somehow, however, a beginning must be made in the exposition of item design methods. The somewhat familiar theme of what item types there are to choose from, and what the general reasons might be to prefer one to the other, is a natural candidate for a start. Even then, it is critically important always to keep in mind that the kind of questioning used in formative and summative testing will influence the way students handle their course work and test preparation, in much the same way as the content asked for will signal to students what is important to study, and what may be neglected.

"Finally, the notion is often expressed that tests, or more broadly conceived assessment systems, signal what is held to be important to teachers, parents, and students. Thus, for the teaching of thinking to be recognized as important and given enough emphasis, it is necessary to develop assessment procedures that do justice to the goals." (...)
"Regardless of one's position in the long-standing debate about whether tests should follow or lead instuction, it is clear that tests can be an important factor in shaping instruction and learning."

Robert L. Linn (1988). Dimensions of thinking: Implications for testing. CSE Technical Report 282. http://www.cresst.org/Reports/r282.pdf [dead link? 1-2009]. Published in Beau Fly Jones and Lorna Idol (Eds) (1991). Dimensions of thinking and cognitive instruction. Erlbaum.

In America it is high-stakes testing that influences the tactics and strategies of all stakeholders in the educational industry. In the Netherlands the exit examinations in secondary education are state-controlled; the state here explicitly directs the contents and levels of instruction in secondary education. The kinds of test questions used, and their quality in testing mastery of content, in all kinds of assessment, but especially state-controlled ones, are critical factors in the quality of the educational processes at large. See also Ludger Woessmann (2005).

summary of content

A general principle of test item design is to start with natural questioning formats, such as open ended questions. If it is necessary to use so called objective type test items, use a well-developed set of open-ended questions as your starting point for the design. For essay type tests much the same can be said: first carefully choose the problem or theme to be set, only then add the necessary directions, and subquestions, and develop the scoring scheme - if one is really needed.

It will probably be the case that many design decisions need not be made anew for every item separately. A lot of efficiency in the item design process may be gained making use of reusable item forms, and of techniques to produce multiple variants of the same item, to be stored for later use.

Regardless of the particular content, there are some design principles that should almost always be applied. The first is that questioning should be such that students are in a position to prepare themselves for it effectively - after all, the business here is education (De Groot, 1970; somewhat related is 'universal design', Thompson, Johnstone and Thurlow, 2002). The US has wandered away from this principle by relying so much on general achievement tests. European systems generally use national exit exams, and these really are tails wagging dogs - sometimes too much so, of course. The second principle is that question design is not the same as reformulating instructional texts: questions should test mastery, not simple reproduction. A particular abuse that is fairly widespread is to ask for high-level definitions and relations, instead of asking for the adequate use of them.

Many different sets of guidelines for writing items are available. I will, however, use Haladyna, Downing and Rodriguez (2002) as a reference article, also because it tries to be evidence-based in presenting a canonical list of 31 guidelines, most of them specific to MC questions. Their list is almost the same as the one in Haladyna (1999, p. 77), but does not condemn the true-false format, and does not endorse the 1999 guideline #4 to test only for a single idea or concept (rightly so: after all, relations between two or more concepts comprise a sizeable part of any course content, don't they?). Guidelines tend to come in two kinds: what to do, and what not to do. The second kind belongs in chapter 8 on quality management in the item design process.

For a book-length treatment of item writing - in the medical field - see Case and Swanson (2001).

2.1 Open-ended or short answer questions


How did that make you feel?

America was in 1492 discovered by ____________.


Who discovered America in 1492? ____________

Open-ended questions are the ones expecting one word or number, or at most a few of them, as an answer. Longer answers gradually turn them into the essay-question type. There are no sharp boundaries here.

The open-ended question may be of the simple fact-finding kind, or it may ask for complex calculations having a number as a definite answer; anything goes, as long as the answer expected is itself a short one.

In many situations and for much course content the open-ended question is a natural type of question. As such its formulation is the ideal first step in item design. Using the open-ended question in tests, however, one should observe some simple rules of design. The general principle here and elsewhere is always to design questions so that testees can efficiently handle the information given and needed. Literally observe the 'open end' character and abstain from using questions where the opening is in the middle or even at the beginning of the question, because in such a design the testee is forced to read the question at least twice.

The first and foremost design principle, of course, is to be crisp and clear in the formulation of the question. If the question can be posed in straightforward language, do so. Do not use lengthy phrases, figures of speech, or rare terms if you can avoid doing so. Always remember to test for mastery of course content, not for intelligence.

The intended kind of mastery of course content rules the design of the (preliminary) questions. Chapters three to seven treat the many specific points, issues and possibilities in translating course content in this way into (preliminary) items. The negative formulation of this rule is to abstain from taking sentences from course materials, and transforming these into questions. If it is deemed necessary to ask for literal reproduction of, for example, definitions and lists, please look out for more sensible alternatives.

In the UK, for example, more and more attention is given to assessment for learning, instead of assessment of learning, in this way forcing the item design in the direction of questions primarily informing the testees about the learning they have yet to do, instead of informing the institution of the learning that has already taken place. (http://www.qca.org.uk/7659.html [dead link? 1-2009], or search Google for "assessment for learning")

The open-ended question, alternatively called the short answer item format, or the fill-in-the-blank format, itself belongs to the very broad category of constructed response item formats, most of these being essay-type, and some of them of the demonstrated skill type - performance, portfolio, research paper, dissertation, demonstration.

In the open-ended question format there is not much room for variants, other than the dimension from simple to more involved question formulations. Open-ended is open-ended. In contrast, the MC format allows many different types.

The examples in this paragraph will be examples of how to use the design principles mentioned above. The reverse case, disregarding one or more of these principles, is treated more fully in chapter eight. Rules derived from the more general principles, then, are the following.


  1. Open-ended questions belong to the so-called 'objective' questions. Why would that be?
    • high level of intersubjective agreement possible
  2. Could open-ended questions be multiple-choice without the multiple choices being given explicitly?
    • yes
  3. Explain what is meant by a question asking for a 'new use' of a bit of knowledge.
    • apply a concept or rule in a new situation etc
  4. Would you recognize an open-ended question as such, if you met one?
    • yes
  5. Is it possible to recognize open-ended questions to fail the 'new use' principle?
    • yes
  6. The same, if they were about unknown course material?
    • yes
  7. Does asking to tell in one's own words what the text says about 'A' classify as asking for a 'new use'?
    • no, this is only paraphrasing, not bad as such, but not a 'new use.'
  8. What is an open-ended question?
    • One that leaves a blank for the last word[s] in the sentence.
  9. Is this question open-ended?
    • no
  10. How would you call the above question?
    • short-answer
  11. Is there a limit to the number of possible questions about a non-trivial piece of text?
    • no
  12. Is it possible to design in a non-trivial way two or more different questions for the reproduction of the exact wording of a particular definition?
    • no. Test me on this one; send in an example showing it to be possible.

Most of the above questions are of a general type, and therefore do not allow for generating a large number of variants. They could be useful to generate discussion in class. Others (could) use particular examples, and these examples may be exchanged for new ones to result in 'new' questions asking for other 'new uses' of the particular notion.

be crisp and clear

An important cluster of guidelines can be characterised as be crisp and clear.

"Keep vocabulary simple for the group of students being tested."
"Use correct grammar, punctuation, capitalization, and spelling."
"Minimize the amount of reading in each item."
"Ensure that the directions in the stem [in the question] are very clear"
"Word the stem [the question] positively, avoid negatives such as NOT or EXCEPT. If negative words are used, use the word cautiously and always ensure that the word appears capitalized and boldface."

Haladyna et al. p. 4, guidelines 8, 12-14, 17

Note that the 'be crisp and clear' guideline forbids questions to be about all kinds of things other than relevant course content. To the 'all kinds' belong: testing for intelligent behavior or command of the language, using trick questions or questions about trivial content.
Different guidelines in the example refer to different phenomena in the real world, some of them of a psychological character, others regarding a fair treatment of all testees. Much research on these points is available; most of it is of a general character, however, meaning that it has not been carried out using different kinds of poorly designed items. This state of affairs does not diminish the evidence-based strength of guidelines such as to avoid difficult wording or items that overload the capacity of short-term memory. Scientifically established theories are themselves evidence-based. Haladyna gives the impression of somewhat disregarding these rich sources of knowledge on human behavior.

Inherently, many of these guidelines also have an impact on the reliability and validity of tests. For example, using far too many words limits the number of items that can be asked in the limited time available for the test, and therefore unnecessarily limits the reliability and validity of the test.
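The link between the number of items and reliability can be made concrete with the classical Spearman-Brown formula. The sketch below is an illustration under stated assumptions, not part of the original argument: it assumes the items dropped or added are comparable in quality to the rest of the test, and the numbers (a reliability of 0.80, a halving of the test) are hypothetical. It says nothing about validity.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor.

    Assumes the classical Spearman-Brown model: added or removed items are
    parallel to the items already in the test.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical numbers: a test with reliability 0.80 loses half of its items
# because wordy questions take twice the reading time.
print(round(spearman_brown(0.80, 0.5), 2))  # 0.67
```

Under these assumptions, halving the number of items lowers the predicted reliability from 0.80 to about 0.67.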

ask for a new use, example, inference, explanation

In fact, this is a major theme of the book. No harm done to softly introduce some points here already, especially so since at this level there still is agreement with the approach of, for example, Haladyna (1999).

23 + 56 = _____

In arithmetic the idea of asking for a new calculation, not for the reproduction of an example from the course text, is self-evident. The thesis of this book is that course content in most, if not all, disciplines shares basic characteristics of what it is to be knowledgeable in that discipline. The arithmetic example, asking for a new application of an algorithm, is a kind of open-ended question that in many courses is also a good way to ask for mastery of algorithms (schemas, procedures).
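Asking for a new calculation every time lends itself to a reusable item form. The following is a minimal sketch, assuming an addition item of the kind shown above with two two-digit numbers; the function names and the dictionary layout are hypothetical, chosen for illustration only.

```python
import random


def addition_item(rng):
    """Fill the hypothetical item form 'a + b = _____' with fresh two-digit numbers."""
    a = rng.randint(10, 99)
    b = rng.randint(10, 99)
    return {"stem": f"{a} + {b} = _____", "key": a + b}


def make_variants(n, seed=0):
    """Return n variants with distinct stems, reproducible for a given seed."""
    if n > 90 * 90:
        raise ValueError("only 8100 distinct stems exist for this item form")
    rng = random.Random(seed)
    variants = {}
    while len(variants) < n:
        item = addition_item(rng)
        variants[item["stem"]] = item  # identical stems overwrite each other
    return list(variants.values())


for item in make_variants(3):
    print(item["stem"], "| key:", item["key"])
```

Each variant carries its own key, so a bank of such items can be stored for later use and scored without further work.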

On national television Ayaan Hirsi Ali said she had given a false name on naturalisation; her grandfather is Magan, not Ali.
  1. What is the relevant 'fact' here? ________
  2. Does Hirsi Ali's confession establish her name to be 'false' in the legal sense? ________

Background: The Dutch minister of immigration said (May 2006) she took the confession to establish falseness in the legal sense, and announced in Parliament that Hirsi Ali (Magan?) never was naturalized, only to declare later in the 10-hour session, also on national tv, that Hirsi Ali still was a citizen of the Netherlands, and a week later that she surely will stay one.

The above example illustrates how easy it is to find new examples of legal principles etcetera in your daily newspaper. Granted, the example given is a rather special one.

  1. Explain why the thing I am sitting on is called a chair.
  2. Point to any other chairs in this room.
  3. Is there a chair to be found in any of the pictures on the wall?

Socrates, a philosopher already (too) famous in his own days, was condemned to drink a poisoned cup. Did he die because of his drinking it?

Socrates is human (a new example of the human race), humans die drinking a poisoned cup (known characteristic of humans). Therefore, Socrates did die.

assessing answers

Solve 364 - 79.

The figure shows the answers of 10 students.
[Figure: 10 answers]
"Children who have a good understanding of our base ten system can quite easily find ways of operating on numbers that are nonstandard (see N3 above). Teachers should be able to recognize when a student's reasoning is correct. In the following example, a subtraction calculation is solved in a variety of ways. When prospective teachers encounter this example, they often recognize that their own limited understanding of mathematics is not sufficient to comprehend the reasoning processes used by these children, and they become more aware of why they need to know mathematics at a much deeper level than they currently do."

Judith T. Sowder (2006). Reconceptualizing Mathematics: Courses for Prospective and Practicing Teachers.

The prospective teacher in the Sowder course should "identify: a. which students clearly understand what they are doing; b. which students might understand what they are doing; c. which students do not understand what they are doing." That, indeed, is what assessment should be.

2.2 multiple-choice (MC) questions

Natural questions might already be multiple-choice, but in most cases the MC format will have to be designed on the basis of a short-answer natural question. The implication is that MC questions will be somewhat artificial, and in such a case should be used only if for economic reasons it is not feasible to use the short-answer format. Be aware of developments in character-recognition technology that may yet make it possible in such cases to use the short-answer question format instead of MC.

What sense is there in giving tests in which the candidate just picks answers without any opportunity to give reasons for his choices?

Hoffmann, 1962, p. 29

This is a good question. For well-designed items, however, this should not be an issue. For badly designed ones, it definitely is. Opportunity to appeal the test score does not quite solve this problem. The opportunity to ask for explanation during the testing doesn't either; issues of bad design need not be recognized as such by the students themselves, and for standardized tests the testing supervisors cannot possibly solve issues raised by clever testees. Why not follow Hoffmann's suggestion and allow students to annotate their answers?

To prevent disagreement over what grade to give, the multiple-choice tester asks the student merely to pick answers. He refuses to let the student explain his choices for a very simple reason: grading the various explanations would cause disagreement among the graders.

Hoffmann, 1962, p. 35

This definitely is the wrong reason to opt for using MC questions. It may not be immediately obvious, but changing from essay questions to MC questions in order to make tests more 'objective' without having each (essay) test graded by multiple graders is not a good reason either. There is in the educational measurement field a big misunderstanding of the role of testing in instruction. The British 'assessment for learning' (Paul Black) perfectly clarifies the issue. Surely, some testing of necessity is 'summative,' and must be fair - which is not exactly the same as 'objective.' What education is about, however, is growth in knowledge, and testing should be instrumental for this growth. There are multiple ways in which testing can be instrumental in this way, for example providing immediate information to direct the learning process, or motivating students to invest in learning - an age-old principle of 'design' of ways of grading. Improper reasons to use the MC format for testing surely will destroy these kinds of instrumentality of assessment. And it is showing in American educational culture, where too many children - and parents - believe achievement to depend more on (inborn) ability than on effort.

Michael C. Rodriguez (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, summer, 3-13.

The pdf is not free - make your final draft available on your website, Michael!

The golden tip for everyone having to design MC questions is: make them two-choice, at most three-choice. Everybody will be happy: designing the MC question will be fun again, to students the items will be much more transparent, the test will have many more items and will therefore be more valid. Michael Rodriguez (2005) has collected the (American) empirical evidence in a forceful statement. Only his very first sentence, that "item writing has been, is, and always will be an art," unnecessarily detracts from his otherwise clear message. Item writing should not be quackery.

In this particular case, and if you have constructed MC tests before, you may use your own data to prove Rodriguez right. Look up the statistical analyses of these items, and mark all wrong options that 'attracted' definitely fewer students than the other one or two wrong options. You see? A waste of resources, for everyone involved in the testing process.
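The check on your own item statistics can be sketched in a few lines. This is only an illustration: the response counts are invented, the function name is hypothetical, and the 5% threshold for 'definitely fewer students' is an assumption of the sketch, not a figure taken from Rodriguez.

```python
def weak_wrong_options(counts, key, min_share=0.05):
    """Return the wrong options chosen by less than min_share of all testees.

    counts maps option labels to the number of testees choosing that option;
    key is the label of the right answer. The 5% default is an assumed,
    arbitrary threshold.
    """
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(
        option
        for option, n in counts.items()
        if option != key and n / total < min_share
    )


# Invented response counts for one four-choice item taken by 200 testees.
counts = {"A": 4, "B": 61, "C": 2, "D": 133}
print(weak_wrong_options(counts, key="D"))  # ['A', 'C']
```

In this invented case only option B does any work next to the right answer; the item is effectively a two-choice item already.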

My own thinking on the robustness of four- and five-choice item use in the face of empirical evidence of its inefficiency, is that misunderstanding the possible impact of guessing is the decisive factor here. It is somehow reassuring to have four and even five alternatives to fool badly prepared pupils, while it feels quite 'naked' to have only two alternatives. A bit of flipping coins or rolling dice should cure the misperception; use the computer: applet.
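Instead of flipping coins, the chance of reaching a given score by blind guessing can be computed exactly from the binomial distribution. The sketch below assumes that every item is answered by pure guessing with equal probability for each option; the test lengths and cutting scores in the usage lines are hypothetical.

```python
from math import comb


def p_at_least(n_items, n_options, cut):
    """Probability of at least `cut` right answers on n_items items,
    when every answer is a blind guess among n_options options."""
    p = 1 / n_options
    return sum(
        comb(n_items, x) * p**x * (1 - p) ** (n_items - x)
        for x in range(cut, n_items + 1)
    )


# Hypothetical comparison: 70% right on 60 two-choice items
# versus 70% right on 40 four-choice items.
print(p_at_least(60, 2, 42))
print(p_at_least(40, 4, 28))
```

Trying a few test lengths and cutting scores of your own shows how quickly the chance of passing by guessing alone shrinks as the number of items grows, also for two-choice items.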

The extent to which a target stands out among a group of other targets is called the target's

  A. identity.
  B. novelty.
  C. salinity.
  D. salience.

The only correct answer choice for item 1 is answer D. This option is correct because salience is the term used to describe what makes a target more likely to be selected from the available data by perceptual processes and thus makes it more likely to be perceived among other potential targets. A is incorrect because the identity of a target can only be known by a perceiver after perceptual processes have been completed; B is incorrect because it is one of the factors that can contribute to the salience of a target and does not in itself describe the degree to which the target stands out among other potential perceptual targets in a situation; and C is incorrect because salinity describes the degree of salt in a solution and has nothing to do with perception theory. This item tests Bloom's basic cognitive level (Level 1: KNOWLEDGE) because it only requires recall of information (i.e. of common terms/basic concepts - here: salience) for the item to be answered correctly.

Item designs elaborately explained like the above one, are rare in the literature. This one is cited from Fellenz (2004), see also below on this publication.

In the above example it should be clear that option C is to be deleted immediately, because it has nothing to do with course content.

Drop alternative A also, if 'identity' is not a special term, like 'novelty' is, but only a convenient idea of the item writer.
I take it from the explanation that a two-choice item B) D) is a good choice item because it asks for knowing what salience is, versus what it is that contributes to salience.

The item is rather abstract, but nevertheless it allows for a few variants using characteristics other than 'novelty' that make for salience. Forget the Bloom crap, unless the taxonomy is used as a heuristic device only.

A 66-year-old woman had the abrupt onset of shortness of breath and left-sided pleuritic chest pain 1 hour ago. She had been recovering well since a colectomy for colon cancer 2 days ago. Blood pressure is 160/90 mm Hg, pulse is 120/min, and respirations are 32/min. She is diaphoretic. Breath sounds are audible bilaterally, but inspiratory effort is decreased, S1 and S2 are normal, and jugular veins are not distended. Which of the following is the most likely cause of her acute condition?
  1. Acute myocardial infarction
  2. Dissecting aortic aneurysm
  3. Pneumonia
  4. Pneumothorax
  5. Pulmonary embolus*

    Fincher (2006?) p. 202

Five options! If these are the options the clerk should think of anyhow, five is OK.

The stem contains a lot of window dressing, I suspect. I am hampered here by my lack of medical expertise. What I can say, though, is that separate bits of information in the stem evidently serve to exclude one or more of the options. If such is the case, it is possible to break down this huge MC item into a series of lean two-choice items. The series can be made even larger by including 'why' questions. Do so. You will get much more information about the clerk's mastery in the same amount of testing time! Do not use these leaner items together in the same test, though. Because of the elephantiasis of the stem the MC item is highly inefficient. If it is deemed important to use authentic questions like the one above, do not use the MC format, or at least use it in another way by posing a number of two-choice questions on the information given in the stem. But remember: this violates the desirability of independence between questions.

For the record: the example item asks for a new use.

Patients usually present with signs and symptoms, not a diagnosis. Therefore, write examination questions that replicate the process of clinical problem solving. Questions such as "Which of the following is true about polymyalgia rheumatica?" or worse yet, "Which of the following is not true about polymyalgia rheumatica?" do not elicit clinical thinking. They are a series of true-false statements.

Fincher (2006?) p. 203

The above citation is a beautiful formulation of what it is to ask for a new diagnosis, i.e. a new use of a particular diagnosis.


"The key to developing wrong answers is plausibility. Plausibility refers to the idea that the item should be correctly answered by those who possess a high degree of knowledge and incorrectly answered by those who possess a low degree of knowledge. A plausible distractor will look like a right answer to those who lack this knowledge. (...) Writing plausible distractors comes from hard work and is the most difficult part of MC writing."

Haladyna (1999) p. 97 guideline 28 'Make all distractors plausible.'

To paraphrase Haladyna: this guideline on distractors is not evidence-based and is the most disappointing sentence in his writing. It conflicts with the guideline forbidding trick questions: this distractor philosophy is a philosophy about tricking testees into the wrong answer. It is perfectly possible, as Haladyna demonstrates, to call wrong answers just that, or 'wrong options.' On a positive note: the choice between options should be an instructionally valid one, corresponding to the kind of choice or discrimination a student mastering the course content should be able to make. Not knowing what the right answer is should not put the student in the trick-question situation; it should be perfectly clear to her that she does not know the right answer, and had better leave the question unanswered. At least, if the scoring rule does not force her to guess on such a question, a 'questionable' strategy for institutions to follow, see below.

This 'distractor philosophy' baffles me. Why is it that we tend to think these dark things about the design of MC questions, never having done so in the case of short-answer questions? Is it a case of the opportunity making the thief?


Figure 2.2.1. "Origins of ballistic theory: Attempt to adapt the construction of projectile trajectories to the knowledge of the artillerists. Taken from a book on artillery by Diego Ufano (1628)." [source: Max-Planck-Institut für Wissenschaftsgeschichte, Research Report 2000-2001, Department I] None of the trajectories is Newtonian, of course.

The Force Concept Inventory of David Hestenes is an example of a test where the wrong answers have been designed to be attractive to pupils having common naive views on the laws of motion - and the pupils are attracted in large numbers. These pupils are not 'distracted' at all: they are attracted to the options corresponding to their conceptual model of force and motion, which might not be the Newtonian view of classical physics. Other chapters will return to the work of Hestenes and other research on cognitive models. For now, the point is that his test is a rather spectacular example of well-designed items, well-designed now in the construct validity sense. The small and unpublished study by Adams and Slater (n.d.) illustrates the main issues here. For examples of the exact items of this test see Rebello et al. 2005; they are highly similar to the well-known items used in mental models research on, for example, ballistic trajectories - or water from your water hose.

specific guidelines
Two important guidelines specific to MC items have already been mentioned. Be content with two or three options for your item. Wrong options should be related to the goals of the course; specifically, do not try to think of 'distractors' that trick students not knowing the right answer into marking the wrong one. The special thing about MC items is the list of options. Guidelines here look like guidelines of style; make no mistake about their real character, though: it is to keep the option part of the item crisp and clear so that students have a level playing field. Again, choosing from Haladyna et al. only the guidelines on things to do, not the things to avoid, the following list obtains.

  1. Develop as many effective choices as you can, but research suggests three is adequate.
  2. Make sure that only one of these choices is the right answer.
  3. Vary the location of the right answer according to the number of choices.
  4. Place choices in logical or numerical order.
  5. Keep choices independent; choices should not be overlapping.
  6. Keep choices homogeneous in content and grammatical structure.
  7. Keep the length of choices about equal.
  8. None-of-the-above should be used carefully.
  9. Phrase choices positively; avoid negatives such as NOT.

    Haladyna et al. 2002 p. 312. Guidelines on 'distractors' left out here. Also 'Avoid giving clues to the right answer (...)' because this is an issue in the quality check, chapter eight.

For a recent article about guidelines - in the medical field - see Collins (2006) http://www.arrs.org/StaticContent/pdf/ajr/pdf.cfm?theFile=ajrWritingMultipleChoiceHandout.pdf [dead link? 1-2009].

A 7-year-old girl is brought to a physician's office by her mother complaining of chronic abdominal pain, irritability and crankiness. Her mother also hints there may be family problems. Which of the following would be most helpful to aid understanding of this patient's problem?
  1. Elicit further information about the family problems and other potential stressors.
  2. Perform a physical examination.
  3. Reassure the mother that it is a normal phase her daughter is going through.
  4. Refer the girl to a gastroenterologist.
  5. Refer parents for marital counseling.

The flaws in this item include:

Really this is an example belonging to the chapter 8 theme of item quality. I give it here to illustrate how the superficial looks of an MC question do not tell much of its quality as a test item. The flaws are signalled by Fincher. She could have added that 5 options probably is bad design here.

possible formats for the MC item

A summary of textbook presentations of types of MC formats is to be found in Haladyna et al. (2002). Basically, there are choice items, true-false items, matching items, and a category called 'complex MC.' True-false is an extreme type of the two-choice or alternate-choice item, and because of this special care is needed in the design of true-false items. Not many sentences in daily life, science, or school can be classified as 'absolutely' true or false. The matching question asks the testee to match every item of one list to those of another, preferably shorter, list. The matching question could be split up into multiple two-choice or three-choice items, and is therefore a trick to save on printing space, violating the guideline that items in a test should preferably be independent of each other. Complex MC is the case where the lazy item writer confronts testees with some nerve-racking choices; never use this item 'format.'


Which of the following are fruits?

  1. Tomatoes
  2. Tomatillos
  3. Habanero peppers

  A. 1 & 2
  B. 2 & 3
  C. 1 & 3
  D. 1, 2, & 3

    Haladyna et al. 2002, p. 321 (p. 312: "AVOID the complex MC (Type K) format.")


[Judge the truth of the two sentences.]

  1. Every rhombus has four equal sides
  2. Every rhombus has four equal angles

  1. 1 & 2 are both true
  2. 1 only is true
  3. 2 only is true
  4. 1 & 2 are both false

This kind of item is still, and regrettably, enormously popular in the Netherlands. The example is a translation from Timmer, in De Groot & Van Naerssen 1969, p. 149. Sandbergen, p. 116 in the same volume, presents its item form, commenting: "A much used and often useful 'trick.'" Sandbergen unwittingly pointed to the main problem in the complex item: the item writer's 'trick' makes the item tricky for the testee to answer. There was among the authors of this 1969 handbook no unanimity on complex items. Lans and Mellenbergh in their chapter on the formal aspects of construction and assessment of items do not mention the complex MC.

true-false items


" .... unlike a multiple-choice item, the true-false item does not provide any explicit alternative in relation to which the relative truth or falsity of the item can be judged. Each statement must be judged in isolation to be true enough to be called true (i.e., true enough so that the test constructor probably intended it to be true), or false enough to be called false. This lack of a comparative basis for decision contributes to the ambiguity of some true-false items."

Ebel, Robert L. (1965). Measuring educational achievement. Prentice-Hall.

Outside of logic, true-false questions tend to be pseudo-logical in character: the sentence P is either true or false. The logician will prove the falseness of P by assuming P to be true, and then deducing a contradiction. See for example Beth (1955) on his semantic tableau technique. The true-false format, therefore, is at home in logic. This fact suggests how the true-false format might be used in other disciplines as well: if the student may be expected to be able to show a contradiction between (part of) the statement and a particular theory or particular givens in the problem. In most other cases the true-false format is suspect, if only for the reason mentioned by Ebel (the box above). The item writer should make it perfectly clear why the true-false format is adequate in each and every instance of its use. If it is possible to change the true-false item into a truly two-choice one, do so.

The Earth’s orbit around the Sun is a circle.                 true / false
Is the Earth’s orbit around the Sun a circle?                 yes / no
Justify your answer.
Is the Earth’s orbit around the Sun a circle?                 yes / no
Explain why the Earth’s orbit around the Sun is not a circle.
What is the Earth’s orbit around the Sun?

  1. a circle
  2. an ellipse.

True/false items require an examinee to select all the options that are 'true.' For these items, the examinee must decide where to make the cut-off - to what extent must a response be 'true' in order to be keyed as 'true.' While this task requires additional judgement (beyond what is required in selecting the one best answer), this additional judgment may be unrelated to clinical expertise or knowledge. Too often, examinees have to guess what the item writer had in mind because the options are not either completely true or completely false.

Case and Swanson (2001) p. 14 pdf

Case and Swanson do not cite empirical evidence to support the position taken here. I do not need empirical evidence to endorse their statement, however. An acceptable multiple true-false item should have options that are either 'totally wrong' or 'totally correct.'


Which of the following is/are X-linked recessive conditions?
  1. Hemophilia A (classic hemophilia)
  2. Cystic fibrosis
  3. Duchenne's muscular dystrophy
  4. Tay-Sachs disease

Case and Swanson (2001) p. 14 (1 and 3 'totally correct') pdf

Case and Swanson refer to a tradition that, "for true/false items, the options are numbered; for one-best-answer items, the options are lettered." The reason: the convention spares them from stating explicitly that an item is multiple true-false rather than multiple-choice. Unless students are thoroughly acquainted with this 'tradition,' do not use it.

In terms of durability, oil-based paint is better than latex-based paint.

    true / false

Haladyna, 1999. p. 100

Haladyna's advice, backed by Frisbie and Becker (1991), is to "make use of an internal comparison rather than an explicit comparison." The explicit comparison meant is "Oil-based paint is better than latex-based paint." In the example, leaving out "in terms of durability" would make the item ambiguous.

I do not understand the approach taken here by Haladyna. The book gives examples that should be formatted as two-choice items instead of the true-false ones Haladyna has in mind. The reason is simply that formulating the two-choice question as in the box above, and on top of that asking for a true-false answer, makes the item more difficult to read than necessary. The Haladyna format is a kind of complex true-false.


There is no such thing as a formula to correct for the guessing of individual students. At the individual level, guessing introduces noise that cannot be filtered out. Aggregate scores of groups of students can statistically be corrected for guessing (formula scoring, versus number right scoring); this correction, however, is of no help at all if the high stakes primarily regard the individual student, not the institution. Better face the facts, and get rid of guessing, especially the forced guessing the Western world has become addicted to since World War I. Do not think that this is an old discussion nobody is interested in any more. I will prove that forced guessing is unfair to the well-prepared student having to sit pass-fail tests. And I will show that in this particular case students are in a position to lawfully force the institution to abandon scoring-under-forced-guessing. If you represent an institution, better take action today. Well, I am not the only one having to prove a point here.

Patrick Sturgis, Nick Allum, Patten Smith and Anna Woods (2004?). The Measurement of Factual Knowledge in Surveys. pdf

Jeffery J. Mondak and Damarys Canache (2004). Knowledge Variables in Cross-National Social Inquiry. Social Science Quarterly, 85, 539-558.
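For reference, the formula scoring mentioned above can be written out. The sketch below assumes the usual textbook form of the correction, number right minus number wrong divided by the number of options minus one, with omitted items counting zero. The function name and the example numbers are illustrative only and do not come from the studies cited. The example shows why the correction cannot repair an individual score: students with identical knowledge but different luck still end up with different corrected scores.

```python
def formula_score(right: int, wrong: int, options: int) -> float:
    """Classical correction for guessing: R - W / (k - 1)."""
    if options < 2:
        raise ValueError("an item needs at least two options")
    return right - wrong / (options - 1)


# Hypothetical case: 40 three-choice items; each student knows 28 items
# and guesses blindly on the remaining 12.
lucky = formula_score(right=28 + 7, wrong=5, options=3)      # 7 lucky guesses
unlucky = formula_score(right=28 + 1, wrong=11, options=3)   # 1 lucky guess
average = formula_score(right=28 + 4, wrong=8, options=3)    # 4 of 12, the expected number

print(lucky, unlucky, average)
```

Only in the average case does the corrected score return the number of items actually known; the lucky and the unlucky student remain over- and under-rated.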


Figure. A hundred monkeys take a test of 10 three-choice questions.

The pictured simulation was done using the spa_module 1 applet. Try it for yourself.
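For readers without access to the applet, a comparable simulation can be sketched in a few lines. This is not the applet's code: the function name, the fixed seed and the text histogram are assumptions made for the illustration. Each 'monkey' answers every item by picking one of the three options at random.

```python
import random
from collections import Counter


def monkey_scores(n_monkeys=100, n_items=10, n_options=3, seed=1):
    """Scores of testees who answer every item by blind guessing."""
    rng = random.Random(seed)
    return [
        # Option 0 is taken to be the keyed answer of every item.
        sum(rng.randrange(n_options) == 0 for _ in range(n_items))
        for _ in range(n_monkeys)
    ]


scores = monkey_scores()
histogram = Counter(scores)
for score in range(11):
    print(f"{score:2d} {'*' * histogram[score]}")
```

Changing the seed gives another group of a hundred monkeys; the spread of their scores is what forced guessing adds to every real score.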

guessing under pass-fail scoring


Figure. Forced guessing, i.e. there is no option to leave unknown questions unanswered, makes the pass decision less reliable.

While it is known (Lord & Novick, 1968, p. 304) that guessing lowers the validity of tests, other things being equal, it is generally not known that guessing heightens the risk of failing under pass-fail scoring for students having satisfactory mastery. The figure shows a typical situation. The test has 40 three-choice items. Its cut-off score in the no-guessing condition is 25; in the forced-guessing condition the cut-off score is 30, because on average one third of the remaining 15 items will be guessed correctly. The remarkable thing is that the probability of failing the cut-off of 25 for a student having mastery .7 is 0.115, while the probability of failing the cut-off of 30 under forced guessing (pseudo-mastery now .7 + .3/3 = .8) is .165. [mastery is defined on the domain the items are sampled from]
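The two failure probabilities can be checked with a plain binomial calculation. The sketch below is an assumption about the model, not the program behind the figure: every item is treated as an independent trial with success probability equal to mastery (no guessing) or pseudo-mastery (forced guessing). The function and variable names are made up for the illustration.

```python
from math import comb


def prob_fail(n_items: int, cutoff: int, p_correct: float) -> float:
    """P(score < cutoff) when each item is correct with probability p_correct."""
    return sum(
        comb(n_items, k) * p_correct**k * (1 - p_correct) ** (n_items - k)
        for k in range(cutoff)
    )


mastery = 0.7
n_items, n_options = 40, 3

# No guessing: unknown items are left open; the cut-off score is 25.
fail_no_guessing = prob_fail(n_items, 25, mastery)

# Forced guessing: unknown items are guessed with chance 1/3 of success,
# so the chance of a correct answer per item becomes 0.7 + 0.3/3 = 0.8,
# and the cut-off score is raised to 30.
pseudo_mastery = mastery + (1 - mastery) / n_options
fail_forced_guessing = prob_fail(n_items, 30, pseudo_mastery)

print(round(fail_no_guessing, 3), round(fail_forced_guessing, 3))
```

Under these assumptions the first probability comes out close to the 0.115 quoted above, and the second is clearly higher, in the neighbourhood of the .165 in the text; a plain binomial calculation need not reproduce the figure's value to the third decimal.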

The model is not strictly necessary to argue the case, of course, but it helps to be able to quantify the argument. Suppose the student is allowed to omit questions she does not know, meaning she will not be punished for this behavior but instead will obtain a bonus of one third of a point for every question left unanswered. Students having satisfactory mastery will have a reasonable chance to pass the test. Those passing will do so while omitting a certain number of questions. It is perfectly clear that some of these students would fail the test if they had to guess on those questions after all. In the same way, some mastery students who initially failed the test might pass it by guessing luckily. This second group is, however, much smaller than the first one, and they still have the option to guess. The propensity to guess is higher, the lower the expected score on the test; see Bereby-Meyer, Meyer, and Flascher (2002).

The amazing thing about this argument is that I do not know of a place in the literature where it is mentioned. There has of course been a lot of research on guessing, omissiveness, and on methods to 'correct' for guessing, but none whatsoever on this particular problem. That is remarkable, because students failing a test might claim they have been put at a disadvantage by the scoring rule that answers left open will be scored as wrong. This is a kind of problem that should have been mentioned in every edition of the Educational Measurement handbook (its latest edition, 1989, edited by Robert L. Linn). Lord & Novick (1968, p. 304) mention the problem of examinees differing widely in their willingness to omit items; the interesting thing here is their warning that requiring every examinee to answer every item in the test introduces "a considerable amount of error in the test scores." The analysis above shows that in the particular situation of pass-fail scoring this added error puts mastery students at a disadvantage, a conclusion Lord and Novick failed to note.

Martin R. Fellenz (2004). Using assessment to support higher level learning: the multiple choice item development assignment. Assessment & Evaluation in Higher Education, 29, 703-719. Available for download at html

matching items

'Matching item' is a bit of a misnomer. The item might be characterized as asking for the correct pair-wise combination of items from two lists: books and authors, albums and groups, medical vignettes and diagnostic options, etcetera. A technical point in the design is to avoid lists that pair off exactly, because the last pair then will automatically be correct if the other ones are. The instruction should make it clear which list is the question-list; in the medical example, see Fincher p. 206-207 pdf, it is the list of vignettes because clinical thinking departs from the vignette. In the books-and-authors case it could be either.

D. V. Budescu (1988). On the feasibility of multiple matching tests - variations on a theme by Gulliksen. Applied Psychological Measurement, 12, 5-14. [I have to look it up yet.]

S. H. Shaha (1984). Matching-tests: reduced anxiety and increased test effectiveness. Educational and Psychological Measurement, 44, 869-.. .

Ruth-Marie E. Fincher (2006?). Writing multiple-choice questions. A section in Louis N. Pangaro and William C. McGaghie (Lead Authors) Evaluation and Grading of Students. pdf, being chapter 6 in Ruth-Marie E. Fincher (Ed.) (3rd edition) Guidebook for Clerkship Directors. downloadable [Alliance for Clinical Education, USA]

Robert Wood (1977). Multiple choice: A state of the art report. Evaluation in Education: International Progress, 1, 191-280. for a fee pdf available [impossible Elsevier URL, I am sorry]


2.3 Essay type questions

A sobering empirical result to begin with: it really does matter how the student is asked to write up her answer. The reminder is: if a difference as small as that between writing and typing your answer matters so much, there must be many more important differences between possible techniques of assessment that make important differences in test results.

The first written examinations in Oxbridge in a sense followed the catechetical method, because no questions were put that allowed different interpretations: 'the way to achieve more accurate and certain means of evaluating a student's work was to narrow the range of likely disagreement and carefully define the area of knowledge students were expected to know' (Rothblatt, 1974, p. 292).

Taken from Ben Wilbrink (1997). Assessment in historical perspective. Studies in Educational Evaluation, 23, 31-48.
Sheldon Rothblatt (1974). The student sub-culture and the examination system in early 19th century Oxbridge. In Stone, L. (Ed.). The university in society. Vol I Oxford and Cambridge from the 14th to the early 19th century, p. 247-303. Princeton: Princeton University Press.

Because open questions are so terribly open, the understandable but regrettable tendency is to pin these questions down in such a way as to minimize any opportunity for discussion about the correctness of the answers given. Teachers are losing opportunities here to gain insight into the hearts and souls of their students, and especially to give insightful feedback on the answers given.

essay questions interchangeable with completion or choice questions?

grading essay questions fairly

It is not unusual for administrative bodies to make it mandatory that model answers for grading be drawn up in advance. That is harmful bureaucracy. Of course it is good to work out in advance which variants of answers are possible; the professional teacher can do that perfectly well herself, preferably with some peer consultation. The mandatory rule, however, invites the belief that such a model answer guarantees fairness across the board, which is a travesty. Unfortunately it also invites the rewarding of partial knowledge, otherwise students will start demanding just that on the basis of the model answer.

Table 1. Assessment of dental workpieces by three instructors

workpiece        1   2   3   4   5   6   7   8   9  10
instructor a     8  11  14   7  10  11   7  14   9  10
instructor b     8  14   9   9  11  14  12   9   9  12
instructor c     6   9   6  13  10  14  13   8  11   9
highest rating   8  14  14  13  11  14  13  14  11  12
lowest rating    6   9   6   7  10  11   7   8   9   9

Source: Dick Tromp (1979). Het oordeel van studenten in een individueel studie systeem. Onderwijs Research Dagen, 1979. Tromp's data are more extensive than the table can show.
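The summary rows of the table, and the size of the disagreement between the instructors, can be recomputed from the three rows of ratings. The snippet is only an illustration; the variable names are mine, and the 'spread' (highest minus lowest rating per workpiece) is an extra column that is not in the source table.

```python
# Ratings copied from Table 1 (ten workpieces, three instructors).
ratings = {
    "a": [8, 11, 14, 7, 10, 11, 7, 14, 9, 10],
    "b": [8, 14, 9, 9, 11, 14, 12, 9, 9, 12],
    "c": [6, 9, 6, 13, 10, 14, 13, 8, 11, 9],
}

# Regroup the ratings per workpiece, then summarize each workpiece.
per_piece = list(zip(*ratings.values()))
highest = [max(r) for r in per_piece]
lowest = [min(r) for r in per_piece]
spread = [hi - lo for hi, lo in zip(highest, lowest)]

for piece, (hi, lo, sp) in enumerate(zip(highest, lowest, spread), start=1):
    print(f"workpiece {piece:2d}: highest {hi:2d}, lowest {lo:2d}, spread {sp}")
```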

Giving credit for good partial answers to an insight question undermines the distinctive character of insight questions as compared with knowledge questions. The testing then degrades into knowledge testing, and the special incentive to keep studying up to a high level of mastery disappears. See Biggs (1996) for examples of teachers who, by rewarding good partial answers, act contrary to their intention to test insight. After all, in doing so they encourage students to process the material superficially.
Wilbrink, 1958, section 'Sturende werking'

essay grading

The famous study here is Hartog and Rhodes (1936). A sample of more recent studies: DeCarlo (2005 pdf), Congdon and McQueen (2000), Engelhard (1994), Huot (1990).

From Smith's prize competition at Cambridge

Another of Milner's strategies was to ask candidates for a particular proof and then long before the best candidate could possibly have finished writing ask all the candidates to stop. His rationale was simple. He believed he could judge from the half-finished answers what the completed ones would have been and in this way gain extra time for asking further questions.

Barrow-Green, 1999, p. 284. Milner was an examiner from 1798 to 1820. Among mathematicians the Smith's prize competition was more highly esteemed than the mathematical tripos.

Applaud this man for a clear insight into what it is for an exam to be time efficient.

more literature on essay type questions

Lawrence M. Rudner, Veronica Garcia and Catherine Welch (2005). An Evaluation of IntelliMetric™ Essay Scoring System Using Responses to GMAT® AWA Prompts. Graduate Management Admission Council GMAC Research Reports • RR-05-08 • October 26, 2005. pdf

Robert Linn, Eva L. Baker and Stephen B. Dunbar (1991). Complex, performance-based assessment: Expectations and validation criteria. CSE Technical Report 331 pdf, Educational Researcher, 20(8), 15-21.

Richard H. Haswell (2005). Machine Scoring of Essays: A Bibliography. html.

Assessment Plus (2006) site

J. Elander, K. Harrington, L. Norton, H. Robinson and P. Reddy (2006). Complex skills and academic writing: a review of evidence about the types of learning required to meet core assessment criteria. Assessment and Evaluation in Higher Education 31(1), 71-90. http://www.aston.ac.uk/downloads/lhs/peelea/Elander2006.pdf [dead link? 1-2009]

Michael Russell and Tom Plati (2001). Effects of Computer Versus Paper Administration of a State-Mandated Writing Assessment. Teachers College Record On-line. For sale at http://www.tcrecord.org/Content.asp?ContentID=10709.

David M. Williamson, Isaac I. Bejar and Anne Sax (2004). Automated Tools for Subject Matter Expert Evaluation of Automated Scoring. ETS Research Report 04-14. pdf

2.4 Transparency

Abstract Because test information is important in attempting to hold schools accountable, the influence of tests on what is taught is potentially great. There is evidence that tests do influence teacher and student performance and that multiple-choice tests tend not to measure the more complex cognitive abilities. The more economical multiple-choice tests have nearly driven out other testing procedures that might be used in school evaluation. It is suggested that the greater costs of tests in other formats might be justified by their value for instruction - to encourage the teaching of higher level cognitive skills and to provide practice with feedback.

N. Frederiksen (1984). The real test bias: Influences of testing on teaching and learning. American Psychologist, 39, 193-202.

more literature on questions of transparency

2.5 Item templates

Isaac I. Bejar, René R. Lawless, Mary E. Morley, Michael E. Wagner and Randy E. Bennett (2002). A feasibility study of on-the-fly item generation in adaptive testing. ETS Research Report 02-23 / GRE Board Report No. 98-12P. pdf

2.6 Questions of validity——Valid questions

subdomains of validity

Figure 1. The scheme subdomains of validity summarizes the ten domains of concern for the evaluation and/or construction of validity in achievement test items. The subdomains, of course, are highly interconnected, therefore the connectivity between them has not been indicated in the scheme itself. The scheme summarizes the treatment of the validity concept in the (Dutch) paragraph 2.6 as recently (May 2008) developed.

Short characteristics of the subdomains:

Absolutely essential in test item design is that every item individually is a valid item, validity taken in the realistic sense as presented in Borsboom, Mellenbergh and Van Heerden (2004) as contrasted with the construct validity approach that at the level of the test—not the individual test item—seeks to relate scores on this test to scores on other tests irrespective of what is the state of affairs in the real world.

To counter the danger that a strict format of every item being valid might give rise to unintended side effects—such as are known from Verschaffel, Greer and De Corte's (2000) research on word problems—there should be room for meta-validity: items that as formulated are not valid, but might be made so in a reformulation or reconception produced by the testee. After all, problems in the real world will not always come in valid formats. The one famous example here is the 'what is the age of the captain' kind of word problem. As used in research on word problems, it is a meta-valid question. Arithmetics didactics is not yet ready for the use of this kind of meta-valid question in primary education—so much the worse for the didactics.

Strictly taken, meta-validity is validity also. However, it will pay the test item designer to be aware of the difference between the two kinds of validity.

The amount of writing on validity of psychological tests is terrifying. The authority on the subject is the publication on standards of psychological tests by the American Psychological Association (in cooperation with other organizations), the latest edition being APA (1996). Ultimately, as for example in court, these standards are decisive. Nevertheless, these Standards get updated from time to time, so there is room for improvement on the received view as documented in them. The chapter on validity in the 1996 edition is strongly influenced by the work of Melvin Novick on construct validity, emphasizing nomological networks, etcetera. In this approach the validity of a test is ascertained by researching how it correlates with other tests. This characterization of construct validity may not be fair, but it makes the point that construct validity does not touch on whether the events to be measured really exist or not.

In test item design, the question of validity touches on whether what the test item probes or measures—a particular piece of knowledge, insight—really does exist, let us say, in the brains—connectionist neural networks if you want—of the individual students. The good news—for test item design—is that issues of item validity do not directly touch on whatever it is that is called construct validity of tests.

The work of Denny Borsboom, Don Mellenbergh and Jaap van Heerden (2004) is of tremendous help in clarifying issues of validity at the level of the individual test items. The important point in the work of Borsboom et al. is that today's conceptions of construct validity do not touch on the one important question of validity: does this instrument measure what it is intended to measure, does this ruler measure the attribute of length, does this item measure the understanding it purports to measure? That 'understanding' is not a correlate of intelligence, mental fitness, or whatever—it is the real thing, the neural process if you want.

In the Dutch text of paragraph 2.6 an example item is taken from the Dutch Science Quiz. In its formulation the concept of mass is confused with that of weight, while at the same time the question is exactly about the difference between these concepts. It is therefore not possible to answer the item correctly without radically reinterpreting the question as asked, and so the question is invalid. Making it valid is simple: remove the ambiguity. Why, then, did the item designer not do it himself in the first place? This would be a trivial question to ask, if not for the fact that this kind of invalidity in achievement test items is ubiquitous. Something is lacking in the validity of the design process, it seems.

"Procedures currently in use for constructing and describing achievement tests are a mess. Conclusions about methods, variables, or procedures can hardly be taken seriously when you don't know what the test measures. Drastic action is indicated. Journal editors are admonished forthwith to reject papers unless they contain 1) a documented rationale for selecting the questions to be asked and 2) a fully-explicated analysis of the relationship between the questions and the preceeding instruction."

Richard Anderson (1972, p. 16)

Richard Anderson's 1972 warning on the problematic validity of testing in education is as valid now as it was then. Circumstances have changed somewhat, of course, but the overall picture still is rather gloomy. Read for example Popham's (2005) warning in connection with the No Child Left Behind Act, and the pressure it will exert to use educational tests in ways they surely are not intended for, and should not be intended for.

Validity questions in educational assessment are not fundamentally different from those in other disciplines, as for example physics. I do not deny that issues of validity loom larger in psychology than in, for example, physics. Psychology is the younger science. In physics issues of validity are inherent in doing empirical research, see for example Olesko (1991) on one of the first experimental physics labs at German universities in the nineteenth century. Yes, indeed, there are at least family resemblances between validity of educational assessment, and what in the disciplines taught in educational institutions counts as valid experiments, etcetera.

perfectly valid: basic arithmetic facts

2   +   4   =   ?
3   ×   2   =   ?
7   ×   3   =   ?
5   +   3   =   ?

The box shows adding and multiplying of numbers 1 to 9. There is strong theory available, e.g., Lebiere and Anderson (1998). Note that these basic arithmetic facts also get exercised and tested in exercises of adding and multiplying numbers bigger than 9. Note also that the question format should be diversified to prevent this particular form inadvertently being 'connected' to mastery of these basic number facts, somewhat like the 'age of the captain' phenomenon in word problems (Verschaffel et al., 2000).
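Because the domain of basic number facts is small and fully defined, items like those in the box can be generated rather than written. The following is a minimal sketch under my own assumptions: the function name and the three item forms (asking for the result, the first number, or the second number) are hypothetical illustrations of 'diversifying the format', not a procedure taken from the literature cited.

```python
import random


def arithmetic_item(rng: random.Random) -> tuple[str, int]:
    """Return (question text, correct answer) for one basic number fact."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    symbol, result = rng.choice([("+", a + b), ("×", a * b)])
    # Vary which position is asked for, so that mastery is not tied
    # to the single form 'a op b = ?'.
    form = rng.choice(["result", "first", "second"])
    if form == "result":
        return f"{a} {symbol} {b} = ?", result
    if form == "first":
        return f"? {symbol} {b} = {result}", a
    return f"{a} {symbol} ? = {result}", b


rng = random.Random(2)
for _ in range(4):
    question, answer = arithmetic_item(rng)
    print(question, "   answer:", answer)
```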

invalid: creatively written items

Writing items on the basis of creative ideas, as if it were an art, as it is characterized in the literature more often than not, results in invalid items by definition. The creative idea is hosted by the wrong brains, those of the item writer instead of those of the testee. Whatever knowledge the creative test item asks for is only accidentally related, if at all, to whatever knowledge the testee might have obtained from the course as intended.

non-construct validity

The idea of construct-validity is that validity somehow should be a relation between whatever it is that is assessed, and some theory or other. However, there are cases of assessment or experiment missing such a connection to any theory whatsoever. For example, look into some of the experiments done by Galileo Galilei, as analysed and reported by Stillman Drake (1990).

valid measurement, no theory

While other philosophers were speculating about the cause of objects falling, Galileo turned to what he could experimentally ascertain about the phenomenon of objects falling. He constructed a polished plane inclined by 1.7 degrees, and let a bronze ball roll down a groove. He marked the positions of the ball after equal intervals of time, and assessed the length of the trajectories between these points. Galileo kept the paper with the exact measurements, and this paper is still available (Florence, Galilean manuscripts vol 72 folio 107v), but it was not printed in the collected works edited by Antonio Favaro (Drake, 1990, p. xvii).

  square  time  distance
       1     1        32
       4     2       130
       9     3       298
      16     4       526
      25     5       824
      36     6      1192
      49     7      1620
      64     8      2101

The numbers 1 to 8 are the eight equal periods of time; the numbers in the third column are the measured lengths of the distances traveled at those particular points of time (Galileo used a personal ruler equally divided into 60 units of what he called puntos, approx. 0.9 cm). The squares in the first column were added days later by Galileo. The regularity in the empirical data is that they are (almost) equal to the corresponding square times the first length measured (Galileo marked the deviations above and below the calculated values).
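The regularity can be checked directly from the numbers as given in the table: multiply the first distance by the square of the time and compare the result with the measured distance. The snippet is an illustration only; the variable names are mine.

```python
# Elapsed time units and measured distances (puntos), as given in the table.
times = [1, 2, 3, 4, 5, 6, 7, 8]
distances = [32, 130, 298, 526, 824, 1192, 1620, 2101]

first = distances[0]
for t, d in zip(times, distances):
    predicted = first * t**2  # square of the time, times the first distance
    deviation = (d - predicted) / predicted
    print(f"t={t}  measured={d:5d}  predicted={predicted:5d}  deviation={deviation:+.1%}")
```

With these figures every measured distance lies within a few percent of the value calculated from the squares.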

There is no theory involved in the experiment, other than Galileo's choice not to involve any theory here. [NB. This might not be true. The squaring law was a known theory already, see Dijksterhuis 1950, IV 88, 89: "dat de in val uit rust afgelegde weg evenredig is met het vierkant van de sedert het begin van de beweging verstreken tijd" (that the distance covered in a fall from rest is proportional to the square of the time elapsed since the beginning of the motion). According to Dijksterhuis 1924 this experiment served to test the already known theory. Stillman Drake does not discuss this point, thereby giving the reader some room to suppose the squaring law to have been discovered by Galileo. Such cannot possibly be the case; his experiments showed that theory to be true to the data. The closing sentences in Dijksterhuis (1950, IV 90) are remarkable: he calls it a persistent myth that Galileo found the validity of the squaring law by doing the inclined plane experiment. Dijksterhuis must not have known of the existence of note paper f. 107v, because it had not been published in the collected works.] For theories about falling objects abounded in Antiquity as well as in the Middle Ages, and well into the seventeenth century, emphasizing in one way or another what causes the falling movement. Galileo does not speculate on causes, at least not here, and lets his data do the talking. These data belong to a small set of most famous experimental data in the history of science, yet they have been published only recently by Stillman Drake (Dijksterhuis, for one, was not aware of the precise circumstances of Galileo's experiments using an inclined plane). Most students of physics in the past centuries have learned their theoretical mechanics without ever having heard of this experiment, or of the man Galileo.
So much for these valid measurements that resulted in a law of physics. No, they were not made to test that law, as Dijksterhuis (1924) had written.

Stillman Drake (1990) ch. 1, esp. p. 10 Figure 1 facsimile of Galileo's notes on f. 107v.

Assessing mastery of multiplication of numbers below ten is almost as straightforward as Galileo's assessment of speeds of falling. 'Almost': there is a stochastic aspect involved here, because the same multiplication might be wrong now, and correct the next time, or the other way around. This stochastic element is, however, not a matter of 'error' in assessment, because it is a consequence of the way the brain functions. No error, therefore no unreliability. And of course not a threat to validity. There is more to say on this topic, especially so because the literature keeps rather silent on it. It is quite amazing that valid stochastic processes nevertheless get the 'unreliability' treatment in psychometric theory. These are just binomial processes, sampling processes if you prefer to call them so, well known since at least the end of the seventeenth century (Christiaan Huygens, for example).

invalid observation, strong theory

Take Aristotle's theory that bodies twice as heavy as others fall twice as fast (or something like that). The natural philosophers must have flattered themselves that they had really seen such things happen, while in reality it is rather easy to observe that falling speeds are the same or nearly the same.

Dijksterhuis (1950).

Understanding free fall is not possible other than by description of the phenomena. Calling 'gravitation' the cause of falling is a game with words (see also Galileo Dialogo 2 [Dijksterhuis IV 85]).

questions of abstractness

In the 1983 edition questions of validity were narrowed down to those of abstractness.

Another way to emphasize the point of this paragraph is to posit that transfer is the name of the game (Barnett and Ceci, 2002; Klahr and Li, 2005 pdf). Sure, there is this traditional notion that it is exactly the abstract character of Latin, chess, and all the rest of it, that exercises the brain. Forget the crap. There is in the beginning of the 21st century a lot of research going on that might be labeled as research on transfer of learning (Mestre, 2005). There is nothing mysterious about the idea of transfer. Everybody knows, or should know, that kids having learned to swim does not guarantee that they will swim to save their lives. The same goes for kids having learned Newton's laws: this does not guarantee they will act on that knowledge if need be. What is it, in instruction as well as in assessment, that will enable them to successfully apply Newton's laws in everyday - or in professional - circumstances? That is the transfer question. It is what education - and therefore assessment - is about. It will not do to let them learn the laws by heart, or train them to use the laws in countless abstract - mathematical - ways. The ultimate question is - and the achievement test should touch on that - will they later be able to apply this knowledge in countless concrete situations? Construct validity, again.

"Physical and natural scientists could rely on experiments whose equivalent in human affairs would violate ethical limits. They could chop up bacteria, induce mutations in fruit flies, and blast molecules into smithereens, then observe the effects of their interventions. Anyone who tried the equivalent with human beings would soon be dead or behind bars." Tilly (2006), p. 129

The social sciences, unlike the physical sciences, are handicapped by ethical principles - broadly conceived - in doing their research, in obtaining unambiguous results from their research, and in communicating these results. The social sciences therefore seem to be somewhat less concrete than the physical sciences. Historically, social scientists have tried to escape from the dilemma by emphasizing quantification and operationalisation, and otherwise by using techniques and instruments from the physical sciences as well. Somehow or other, this difference between the social and the physical sciences has implications for the design of achievement test items, or at least for the possibilities open to the designer.

"To this day every student of elementary physics has to struggle with the same errors and misconceptions which then had to be overcome, and on a reduced scale, in the teaching of this branch of knowledge in schools, history repeats itself every year. The reason is obvious: Aristotle merely formulated the most commonplace experiences in the matter of motion as universal scientific propositions, whereas classical mechanics, with its principle of inertia and its proportionality of force and acceleration, makes assertions which not only are never confirmed by everyday experience, but whose direct experimental verification is impossible ...." (p. 30).

Champagne, Gunstone and Klopfer (1985, p. 62), citing E. J. Dijksterhuis (1950/1961). The mechanization of the world picture. London: Oxford University Press.

This citation will be used in other chapters also. Its message is highly disturbing. For now, contrast this with the citation from Tilly, above. The implication of the observation of Dijksterhuis - and very, very many others involved in teaching physics - is that many so-called demonstrations of physical laws are nothing of the kind. They may show strange effects unexpected to the naive viewer, but in no way count as proof of the natural law whose working they supposedly demonstrate. There is a - social scientific - experimental literature on the effects of physical demonstrations in education: probably they are nil. Summing up: the physical sciences might be handicapped also by their seemingly concrete results not revealing the physical laws supposedly causing them. Misplaced concreteness?

A lot less abstract is the research paper by

Brereton, Shepard and Leifer (1995). How students connect engineering fundamentals to hardware design: Observations and implications for the design of curriculum and assessment methods.

Wendy M. Williams, Paul B. Papierno, Matthew C. Makel, Stephen J. Ceci (2004). Thinking like a scientist about real-world problems: The Cornell Institute for Research on Children Science Education Program. Applied Developmental Psychology 25, 107–126. http://pubcms.cit.cornell.edu/che/HD/CIRC/Publications/upload/williamsetal.pdf [dead link? 1-2009]

2.7 An historical perspective

Most of this paragraph is a perspective on the history of US education. The material presented does not leave much to be guessed. Even so, it might be useful to read Lagemann (2000) or Chapman (1988) for the broader perspective on the times, places and persons. It is really amazing to see how a handful of individuals were able to direct the course of history here. The same thing would have been unthinkable in the physical sciences - atoms don't talk back - but education is a human activity that inevitably is influenced by the spirit of the time. That spirit, round about 1900, was one of quantification and measurement, with Edward Thorndike's statistical imperialism crowding out the infinitely more sensitive philosophical approach of John Dewey.

Unless noted otherwise, the publications mentioned below are in my possession. If you have any specific question on them, mail me. I have ordered them chronologically.

C. Stray (2001). The Shift from Oral to Written Examination: Cambridge and Oxford 1700-1900. Assessment in Education: Principles, Policy and Practice, 8, 33-50.

F. Y. Edgeworth (1890). The element of chance in competitive examinations. Journal of the Royal Statistical Society, 53, 460-475, 644-663.

Edward L. Thorndike (1904). Theory of mental and social measurements. New York: The Science Press.

OBJECTIONABLE (1899, 1912)

How many thumbs have 6 boys?

How many days in the year, leaving out December?

Cited in Cronbach and Suppes, 1969, p. 107, from Charles E. White (1899). Number lessons, a book for second and third year pupils. Boston: Heath, p. 22; and James C. Byrnes, Julia Richman and John Roberts (1912). The pupil's arithmetic, primary book, part one. New York: Macmillan, p. 39.

IRRELEVANT (1912)

  1. Hudson discovered the Hudson River in ________; Fulton sailed the first steamboat on that river in _______. How many years between those events?
  2. Make up a similar problem about Washington and Columbus.
  3. Do the same, using Washington and Lincoln.
  4. Do the same, using Columbus and Hudson.

Cited in Cronbach and Suppes, 1969, p. 107, from James C. Byrnes, Julia Richman and John Roberts (1912). The pupil's arithmetic, primary book, part one. New York: Macmillan, p. 36.


If 12 is ¼ of a number, what is the number?

I am thinking of a certain number. Half of the number of which I am thinking is 5. Tell me the number.

Cited in Cronbach and Suppes, 1969, p. 107, from Joseph C. Brown and Albert C. Eldredge (1925). The Brown-Eldredge arithmetics, grade 5. Chicago: Row, Peterson and Co, p. 94.

The nonsense of 'How many thumbs have 6 boys?' illustrates the kind of questioning at the turn of the 19th into the 20th century. What lies between the second and third boxes, the former an example from 1912, the latter from 1925, is the publication of Edward L. Thorndike (1922), The psychology of arithmetic (New York: Macmillan). This publication caused a landslide in the teaching and testing of arithmetic: it ended the counting of thumbs, as well as the fancy questioning still figuring in 1912. The principle to be crisp and clear, and to avoid window dressing, might not have done the trick on its own at that time; Thorndike used some powerful psychology to move the minds of textbook writers.


In 1912 it was not unusual to use words like committee, insurance, charity, premises, installment, treasury in test items. In 1925, after Thorndike's book appeared, "there are fewer hard words and there is a noticeable attempt to use the same words repeatedly in word problems. Thus, a majority of verbal problems for the first year concern familiar objects: fish, rabbits, puppies, boys, girls, and various kinds of fruit."

Cronbach and Suppes, 1969, p. 108

E. L. Thorndike (1922). The psychology of arithmetic. New York: Macmillan.

Lee J. Cronbach and Patrick Suppes (Eds) (1969). Research for tomorrow's schools: Disciplined inquiry for education. London: Collier-Macmillan Limited.

Daniel Starch (1916). Educational measurements. Macmillan.

Cyril Burt (1921/1933). Mental and scholastic tests. London: P. S. King. fourth impression of the original text.

J. R. Gerberich (1956). Specimen objective test items. A guide to achievement test construction. Longmans.

more historical literature

G. M. Wilson and Kremer J. Hoke (1920). How to measure. New York: The Macmillan Company. two fold-out grading scales (spelling, drawing).

G. M. Ruch (1924). The improvement of the written examination. New York: Scott, Foresman and Company.

G. M. Ruch and George D. Stoddard (1927). Tests and measurements in high school instruction. New York: World Book Company.

C. W. Odell (1928). Traditional examinations and new-type tests. New York: The Century Co. (Selected and annotated bibliography p. 439-469)

G. M. Ruch (1929). The objective or new-type examination: an introduction to educational measurement. Chicago: Scott, Foresman.

G. M. Ruch and G. A. Rice (1930). Specimen objective examinations. A collection of examinations awarded prizes in a national contest in the construction of objective or new-type examinations, 1927-1928. Chicago: Scott, Foresman.

Robert L. Ebel (1965). Measuring educational achievement. Englewood Cliffs, New Jersey: Prentice-Hall.

Paul Black (2001). Dreams, Strategies and Systems: portraits of assessment past, present and future. Assessment in Education: Principles, Policy & Practice, 8, 65-85.

Lorrie Shepard (2000). The role of assessment in a learning culture. Educational Researcher, 29, no. 7, 1-14. http://edtech.connect.msu.edu/aera/pubs/er/arts/29-07/shep01.htm http://edtech.connect.msu.edu/aera/pubs/er/pdf/vol29_07/AERA290702.pdf [dead links? 1-2009]

2.8 literature

Richard C. Anderson (1972). How to construct achievement tests to assess comprehension. Review of Educational Research, 42, 145-170.

APA (1966/1974/1985/1999). Standards for educational and psychological tests, Washington, D.C: American Psychological Association.

S. M. Barnett and S. J. Ceci (2002). When and where do we apply what we learn? A taxonomy for far transfer. Psychological Bulletin, 128, 612-637.

June Barrow-Green (1999). 'A corrective to the spirit of too exclusively pure mathematics': Robert Smith (1689-1768) and his prizes at the Cambridge University. Annals of Science, 56, 271-316.

Evert W. Beth (1955). Semantic entailment and formal derivability. Mededelingen van de Koninklijke Nederlandse Akademie van Wetenschappen, Afdeling Letterkunde, N. R. Vol. 18, no. 13 (Amsterdam), pp. 309-342, reprinted 1961. Reprinted in Jaakko Hintikka (1969). The philosophy of mathematics (pp. 9-41). Oxford University Press.

John H. Bishop (2004). Drinking from the Fountain of Knowledge: Student Incentive to Study and Learn - Externalities, Information Problems and Peer Pressure. Cornell, Center for Advanced Human Resource Studies Working paper 04-15.

Denny Borsboom (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge: Cambridge University Press.

Denny Borsboom, Gideon J. Mellenbergh and Jaap van Heerden (2004). The concept of validity. Psychological Review, 111, 1061-1071.

Case and Swanson (2001). Constructing Written Test Questions For the Basic and Clinical Sciences. National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104 http://www.nbme.org/PDF/2001iwg.pdf [dead link? 1-2009].

Audrey B. Champagne, Richard F. Gunstone and Leopold E. Klopfer (1985). Instructional consequences of students' knowledge about physical phenomena. In Leo H. T. West and A. Leon Pines (Eds): Cognitive structure and conceptual change (pp. 61-90). Academic Press.

Paul Davis Chapman (1988). Schools as sorters. Lewis M. Terman, Applied Psychology, and the Intelligence Testing Movement, 1890-1930. New York: New York University Press.

Jannette Collins (2006). Education techniques for lifelong learning: writing multiple-choice questions for continuing medical education activities and self-assessment modules. Radiographics, 26(2), 543-551.

Peter J. Congdon and Joy McQueen (2000). The Stability of Rater Severity in Large-Scale Assessment Programs. Journal of Educational Measurement, 37, p. 163

Lawrence T. DeCarlo (2005). A Model of Rater Behavior in Essay Grading Based on Signal Detection Theory. Journal of Educational Measurement, 42, 53- .

E. J. Dijksterhuis (1924). Val en worp. Een bijdrage tot de geschiedenis der mechanica van Aristoteles tot Newton. Groningen: Noordhoff.

E. J. Dijksterhuis (1950/1961). The mechanization of the world picture. London: Oxford University Press.

Stillman Drake (1990). Galileo: Pioneer scientist. University of Toronto Press.

George Engelhard, Jr (1994). Examining Rater Errors in the Assessment of Written Composition With a Many-Faceted Rasch Model. Journal of Educational Measurement, 31, p. 93 - . [and 33, p. 115]

Martin R. Fellenz (2004). Using assessment to support higher level learning: the multiple choice item development assignment. Assessment & Evaluation in Higher Education, 29, 703-719.

N. Frederiksen (1984). The real test bias. Influences of testing on teaching and learning. American Psychologist, 39, 193-202.

D. A. Frisbie and D. F. Becker (1991). An analysis of textbook advice about true-false tests. Applied Measurement in Education, 4, 67-83. The publisher wants profit from his pdf-files.

A. D. de Groot (1970). Some badly needed non-statistical concepts in applied psychometrics. Nederlands Tijdschrift voor de Psychologie, 25, 360-376.

A. D. de Groot and R. F. van Naerssen (Eds) (1969). Studietoetsen, construeren, afnemen, analyseren. Den Haag: Mouton.

Thomas M. Haladyna (1999 2nd). Developing and validating multiple-choice test items. Erlbaum. (2004 3rd)

Thomas Haladyna, Steven M. Downing, and Michael C. Rodriguez (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15, 309-334.

Ph. Hartog and E. C. Rhodes (1936 2nd). An examination of examinations. International Institute Examinations Enquiry. London: MacMillan.

Jaakko Hintikka (2007). Socratic epistemology. Explorations of knowledge-seeking by questioning. Cambridge University Press.

Banesh Hoffmann (1962/1978). The tyranny of testing. Crowell-Collier. Reprint 1978. Westport, Connecticut: Greenwood Press.

B. Huot (1990). The literature of direct writing assessment: major concerns and prevailing trends. Review of Educational Research, 60, 237-263.

David Klahr and Junlei Li (2005). Cognitive Research and Elementary Science Instruction: From the Laboratory, to the Classroom, and Back. Journal of Science Education and Technology, 14,

Ellen Condliffe Lagemann (2000). An elusive science: The troubling history of education research. University of Chicago Press.

Christian Lebiere and John R. Anderson (1998). Cognitive arithmetic. In John R. Anderson, Christian Lebiere, and others: The atomic components of thought (pp. 297-342). London: Lawrence Erlbaum.

Frederick M. Lord and Melvin R. Novick (1968). Statistical theories of mental test scores. Addison-Wesley.

Jose P. Mestre (Ed.) (2005). Transfer of learning: from a modern multidisciplinary perspective. San Francisco: Sage.

Kathryn M. Olesko (1991). Physics as a calling. Discipline and practice in the Königsberg Seminar for Physics. Ithaca: Cornell University Press.

W. James Popham (2005). America's 'failing' schools. How parents and teachers can cope with No Child Left Behind. Routledge.

N. Sanjay Rebello, Dean A. Zollman, Alicia R. Allbaugh, Paula V. Engelhardt, Kara E. Gray, Zdeslav Hrepic and Salomon F. Itza-Ortiz (2005). Dynamic Transfer: A Perspective from Physics Education Research. To appear in Jose P. Mestre: Transfer of learning: from a modern multidisciplinary perspective (p. 217-250). San Francisco: Sage.

Michael C. Rodriguez (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, summer, 3-13.

Gale H. Roid and Thomas M. Haladyna (1982). A technology for test-item writing. London: Academic Press.

Sandra J. Thompson, Christopher J. Johnstone and Martha L. Thurlow (2002). Universal Design Applied to Large Scale Assessments. The National Center on Educational Outcomes.

Charles Tilly (2006). Why? What happens when people give reasons ... and why. Princeton University Press.

Lieven Verschaffel, Brian Greer and Erik de Corte (2000). Making sense of word problems. Lisse: Swets & Zeitlinger.


more literature chapter 2

These items are as yet on my 'to do' list: they are mentioned here, but not used in the above text yet.

Mark Raymond, Bob Neiers, Jerry B. Reid (2003). Test-item development for radiologic technology. The American Registry for Radiologic Technology.

M. Birenbaum and K. K. Tatsuoka (1987). Open-ended versus multiple-choice formats - It does make a difference for diagnostic purposes. Applied Psychological Measurement, 11, 329-341.

Charles L. Briggs (1986). Learning how to ask. A sociolinguistic appraisal of the role of the interview in social science research. Cambridge: Cambridge University Press.

James G. Holland, Carol Solomon, Judith Doran and Daniel A. Frezza (1976). The analysis of behavior in planning instruction. Reading, Massachusetts: Addison-Wesley.

Lynn Arthur Steen (Ed.) (2006). Supporting Assessment in Undergraduate Mathematics. The Mathematical Association of America.

Sidney H. Irvine and Patrick C. Kyllonen (Eds) (2002). Item generation for test development. Erlbaum.

Andrew Boyle (nd). Sophisticated Tasks in E-Assessment: What are they? And what are their benefits? London: Research and Statistics team, Qualifications and Curriculum Authority (QCA).

Albert Burgos (2004). Guessing and gambling. Economics Bulletin, 4, No. 4 pp. 1-10. http://www.economicsbulletin.com/2004/volume4/EB-04D80001A.pdf

Steven M. Downing and Thomas M. Haladyna (Eds) (2006). Handbook of test development. Erlbaum. https://www.erlbaum.com/shop/tek9.asp?pg=products&specific=0-8058-5264-6 [dead link? 1-2009]

Lucy Cheser Jacobs and Clinton I. Chase (1992). Developing and using tests effectively. A guide for faculty. San Francisco: Jossey-Bass.

William L. Kuechler and Mark G. Simkin (2003). How Well Do Multiple Choice Tests Evaluate Student Understanding in Computer Programming Classes? Journal of Information Systems Education.

Frederick M. Lord (1964). The effect of random guessing on test validity. Educational and Psychological Measurement, 24, 745-747. [This volume is not available in Leiden. I am still looking for a copy.]

Robert Lukhele, David Thissen and Howard Wainer (1994). On the Relative Value of Multiple-Choice, Constructed Response, and Examinee-Selected Items on Two Achievement Tests. Journal of Educational Measurement, 31, 234.

Geoff Norman (2002). The long case versus objective structured clinical examinations. BMJ, 324, 748-749 Editorial

Lambert W. T. Schuwirth and Cees P. M. van der Vleuten (2003). Written assessment. BMJ, 326, 643-645 (22 March).

David M. Williamson, Isaac I. Bejar, Anne Sax (2004).

2.9 links

Cathleen A. Kennedy (2005). The BEAR Assessment System: A Brief Summary for the Classroom Context. Berkeley Evaluation & Assessment Research Center pdf

The National Assessment of Educational Progress (NAEP) - "the Nation's Report Card" - Search NAEP Questions site

College Board Advanced Placement: Free-response questions site

The College Entrance Examination Board: SAT Preparation Center site

Test Prep Review ACT practice site. Provides links to a host of other American test practice pages as well.

SketchUp, a free 3D drawing program from Google

answers.com question

TIMSS 2007 Trends in International Mathematics and Science Study pdf 3Mb, example mathematics items pdf, example science items pdf

PIRLS 2006 Progress in International Reading Literacy Study Assessment Framework and Specifications, 2nd Edition pdf 1.8Mb, sample passages, questions, and scoring guides pdf

De Wetenschapsquiz 2005. A discussion of the design of the questions in this quiz is found here.

De Grote Geschiedenis Quiz 2006. A discussion of the design of the questions in this quiz is found here.

CAA Centre Computer-assisted assessment in higher education site, guide to designing multiple-choice tests pdf

Tandi Clausen-May (). An approach to test development. nfer

Jon Mueller Authentic Assessment Toolbox site

MERLOT Multimedia Educational Resource for Learning and Online Teaching site

January 10, 2011 \ contact ben at at at benwilbrink.nl    

http://www.benwilbrink.nl/projecten/06examples2.htm