Original publication: 'Toetsvragen schrijven', 1983, Utrecht: Het Spectrum, Aula 809, Onderwijskundige Reeks voor het Hoger Onderwijs, ISBN 90-274-6674-0. The 2006 text of chapter 2 is currently under revision.

Item writing

Techniques for the design of items for teacher-made tests

2. Item types, transparency, item forms and abstraction level

Examples

Ben Wilbrink

This database of examples has yet to be fully constructed. Suggestions? Mail me.

content and form

Figure 1. Natural content of the question contrasted with the question format. Forcing the natural constructed answer question into the choice format is a critical operation.

This chapter uses mainstream ideas on item writing. Some minor departures from American usage are to be noted, however, and a few major ones. A minor admonition is to keep the language pure: 'wrong answers' in the MC question are just that, wrong answers, not 'distractors.' The choice of options in MC questions should reflect choices or discriminations belonging to the instructional goals. A major addition is the 1970 proposal by De Groot to add transparency as a criterion of quality in achievement tests. A major change is to replace psychological categories, such as understanding, with epistemological ones about how knowledge in a particular disciplinary domain is structured. And it will be shown that forced guessing under pass-fail scoring is unfair to testees (publication in preparation).

What item type to use for a particular question is a rather critical decision. The particular choice made can either detract from or contribute to the validity with which you question mastery of course content. Ultimately, item type and content tested for are intricately related. Somehow, however, a beginning must be made in the exposition of item design methods. The somewhat familiar theme of what item types there are to choose from, and what the general reasons might be to prefer one to the other, is a natural candidate to use for a start. Even then, it is critically important always to keep in mind that the kind of questioning used in formative and summative testing will influence the way students handle their course work and test preparation, in much the same way as the content asked for will signal to students what is important to study, and what may be neglected.

"Finally, the notion is often expressed that tests, or more broadly conceived assessment systems, signal what is held to be important to teachers, parents, and students. Thus, for the teaching of thinking to be recognized as important and given enough emphasis, it is necessary to develop assessment procedures that do justice to the goals." (...)
"Regardless of one's position in the long-standing debate about whether tests should follow or lead instruction, it is clear that tests can be an important factor in shaping instruction and learning."

Robert L. Linn (1988). Dimensions of thinking: Implications for testing. CSE Technical Report 282. http://www.cresst.org/Reports/r282.pdf [dead link? 1-2009]. Published in Beau Fly Jones and Lorna Idol (Eds) (1991). Dimensions of thinking and cognitive instruction. Erlbaum.

In America it is high-stakes testing that is influencing the tactics and strategies of all stakeholders in the educational industry. In the Netherlands the exit examinations in secondary education are state-controlled; the state here is explicitly directing the contents and levels of instruction in secondary education. The kinds of test questions used, and their quality in testing mastery of content, in all kinds of assessment, but especially state-controlled ones, are critical factors in the quality of the educational processes at large. See also Ludger Woessmann (2005).

summary of content

A general principle of test item design is to start with natural questioning formats, such as open ended questions. If it is necessary to use so called objective type test items, use a well-developed set of open-ended questions as your starting point for the design. For essay type tests much the same can be said: first carefully choose the problem or theme to be set, only then add the necessary directions, and subquestions, and develop the scoring scheme - if one is really needed.

It will probably be the case that many design decisions need not be made anew for every item separately. A lot of efficiency in the item design process may be gained making use of reusable item forms, and of techniques to produce multiple variants of the same item, to be stored for later use.

Regardless of the particular content, there are some design principles that almost always should be applied. The first is that questioning should be such that students are in a position to effectively prepare themselves for it - after all the business here is education (De Groot, 1970; somewhat related is 'universal design', Thompson, Johnstone and Thurlow, 2002). The US has wandered away from this principle by relying so much on general achievement tests. European systems generally use national exit exams, and these really are tails wagging dogs - sometimes too much so, of course. The second principle is that question design is not the same as reformulating instructional texts: questions should test mastery, not simple reproduction. A particular abuse that is fairly widespread is to ask for high-level definitions and relations, instead of asking for the adequate use of them.

Many different sets of guidelines for writing items are available. I will, however, use Haladyna, Downing and Rodriguez (2002) as a reference article, also because it tries to be evidence-based in presenting a canonical list of 31 guidelines, most of them specific to MC questions. Their list is almost the same as the one in Haladyna (1999, p. 77), but does not condemn the true-false format, and does not endorse the 1999 guideline #4 to test only for a single idea or concept (rightly so: after all, relations between two or more concepts comprise a sizeable part of any course content, don't they?). Guidelines tend to come in two kinds: what to do, and what not to do. The second kind belongs in chapter 8 on quality management in the item design process.

For a book-length treatment of item writing - in the medical field - see Case and Swanson (2001).

2.1 Open-ended or short answer questions

ANOTHER KIND OF OPEN-ENDED QUESTION

How did that make you feel?

INTENDED KIND OF OPEN-ENDED QUESTION

America was in 1492 discovered by

____________.

SHORT ANSWER FORMAT

Who discovered America in 1492?

____________

Open-ended questions are the ones expecting one word or number, or at most a few of them, as an answer. Longer answers turn them gradually into the essay-question type. No sharp boundaries here.

The open-ended question may be of the simple fact-finding kind, or it may ask for complex calculations having a number as a definite answer; anything goes, as long as the answer expected is itself a short one.

In many situations and for much course content the open-ended question is a natural type of question. As such its formulation is the ideal first step in item design. Using the open-ended question in tests, however, one should observe some simple rules of design. The general principle here and elsewhere is to always design questions so testees can efficiently handle the information given and needed. Literally observe the 'open end' character and abstain from using questions where the opening is in the middle or even the beginning of the question, because in such a design the testee is forced to read the question at least twice.

The first and foremost design principle, of course, is to be crisp and clear in the formulation of the question. If the question can be posed in straightforward language, do so. Do not use lengthy phrases, figures of speech, or rare terms if you can avoid them. Always remember to test for mastery of course content, not for intelligence.

The intended kind of mastery of course content rules the design of the (preliminary) questions. Chapters three to seven treat the many specific points, issues and possibilities in translating course content in this way into (preliminary) items. The negative formulation of this rule is to abstain from taking sentences from course materials, and transforming these into questions. If it is deemed necessary to ask for literal reproduction of, for example, definitions and lists, please look out for more sensible alternatives.

In the UK, for example, more and more attention is given to assessment for learning, instead of assessment of learning, in this way pushing item design in the direction of questions that primarily inform the testees about the learning they have yet to do, instead of informing the institution of the learning that has already taken place. (http://www.qca.org.uk/7659.html [dead link? 1-2009], or search Google for "assessment for learning")

The open-ended question, alternatively called the short answer item format, or the fill-in-the-blank format, itself belongs to the very broad category of constructed response item formats, most of these being essay-type, and some of them of the demonstrated skill type - performance, portfolio, research paper, dissertation, demonstration.

In the open-ended question format there is not much room for variants, other than the dimension from simple to more involved question formulations. Open-ended is open-ended. In contrast, the MC format allows many different types.

The examples in this section will be examples of how to use the design principles mentioned above. The reverse case, disregarding one or more of these principles, is treated more fully in chapter eight. Rules derived from the more general principles, then, are the following.

SOME UNEDITED QUESTIONS ABOUT THE ABOVE TEXT

Open-ended questions belong to the so-called 'objective' questions. Why would that be?
- high level of intersubjective agreement possible
Could open-ended questions be multiple-choice without the multiple choices being given explicitly?
- yes
Explain what is meant by a question asking for a 'new use' of a bit of knowledge.
- apply a concept or rule in a new situation etc
Would you recognize an open-ended question as such, if you met one?
- yes
Is it possible to recognize open-ended questions to fail the 'new use' principle?
- yes
The same, if they were about unknown course material?
- yes
Does asking to tell in one's own words what the text says about 'A' classify as asking for a 'new use'?
- no, this is only paraphrasing, not bad as such, but not a 'new use.'
What is an open-ended question?
- One that leaves a blank for the last word[s] in the sentence.
Is this question open-ended?
- no
How would you call the above question?
- short-answer
Is there a limit to the number of possible questions about a non-trivial piece of text?
- no
Is it possible to design in a non-trivial way two or more different questions for the reproduction of the exact wording of a particular definition?
- no. Test me on this one; send in an example showing it to be possible.

Most of the above questions are of a general type, and therefore do not allow for generating a large number of variants. They could be useful to generate discussion in class. Others (could) use particular examples, and these examples may be exchanged for new ones to result in 'new' questions asking for other 'new uses' of the particular notion.

be crisp and clear

An important cluster of guidelines can be characterised as 'be crisp and clear.'

"Keep vocabulary simple for the group of students being tested."
"Use correct grammar, punctuation, capitalization, and spelling."
"Minimize the amount of reading in each item."
"Ensure that the directions in the stem [in the question] are very clear"
"Word the stem [the question] positively, avoid negatives such as NOT or EXCEPT. If negative words are used, use the word cautiously and always ensure that the word appears capitalized and boldface."

Haladyna et al. p. 4, guidelines 8, 12-14, 17

Note that the 'be crisp and clear' guideline forbids questions to be about all kinds of things other than relevant course content. To the 'all kinds' belong: testing for intelligent behavior or command of the language, using trick questions or questions about trivial content.
Different guidelines in the example refer to different phenomena in the real world, some of them of a psychological character, others regarding a fair treatment of all testees. Much research on these points is available; most of it is of a general character, however, meaning it was not carried out using different kinds of poorly designed items. This state of affairs does not diminish the evidence-based strength of guidelines such as those to avoid difficult wording or items that overload the capacity of short-term memory. Scientifically established theories themselves are evidence-based. Haladyna gives the impression of somewhat disregarding these rich sources of knowledge on human behavior.

Inherently, many of these guidelines also affect the reliability and validity of tests. For example, using far too many words limits the number of items that can be asked in the limited time available for the test, and therefore unnecessarily limits the reliability and validity of the test.
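The link between the number of items and reliability can be made concrete with the classical Spearman-Brown formula. The sketch below is an illustration only: it assumes the added items are parallel to the existing ones, it says nothing about validity, and the function name and the numbers (a 20-item test with reliability 0.60, a leaner 30-item version) are hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by length_factor.

    Assumes the added items are parallel to the existing ones (the classical
    Spearman-Brown assumption), which real items only approximate.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical numbers: a wordy 20-item test with reliability 0.60, versus
# a leaner 30-item version that fits in the same testing time.
wordy = 0.60
lean = spearman_brown(wordy, 30 / 20)
print(f"20 items: {wordy:.2f}, 30 items: {lean:.2f}")  # 20 items: 0.60, 30 items: 0.69
```

Under these assumptions, trimming the wording enough to fit half as many items again raises the predicted reliability from 0.60 to about 0.69.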

ask for a new use, example, inference, explanation

In fact, this is a major theme of the book. No harm done to softly introduce some points here already, especially so since at this level there still is agreement with the approach of, for example, Haladyna (1999).

23 + 56 = _____

In arithmetic the idea to ask for a new calculation, not to reproduce an example from the course text, is self-evident. The thesis of this book is that course content in most, if not all, disciplines shares basic characteristics of what it is to be knowledgeable in that discipline. The arithmetic example, asking for a new application of an algorithm, is a kind of open-ended question that in many courses is also a good way to ask for mastery of algorithms (schemas, procedures).
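For the arithmetic case, producing 'new' items is easy to mechanise. The following is a minimal sketch, not a recommended tool: the function name, the two-digit range and the `exclude` parameter (the number pairs already used as examples in the course text) are all assumptions made for illustration.

```python
import random
from collections.abc import Collection


def addition_items(
    n_items: int,
    exclude: Collection[tuple[int, int]] = frozenset(),
    seed: int = 0,
) -> list[tuple[str, int]]:
    """Return (question, answer) pairs of the form 'a + b = _____'.

    Pairs listed in `exclude` (for instance worked examples from the course
    text) are never used, so every item asks for a new calculation.
    """
    rng = random.Random(seed)  # fixed seed: the same test can be regenerated
    pairs = [
        (a, b)
        for a in range(10, 100)
        for b in range(10, 100)
        if (a, b) not in exclude
    ]
    chosen = rng.sample(pairs, n_items)  # sampling without replacement
    return [(f"{a} + {b} = _____", a + b) for a, b in chosen]


# Hypothetical use: five new items, avoiding the example shown in the text.
for question, answer in addition_items(5, exclude={(23, 56)}):
    print(question, "   key:", answer)
```

The same idea of an item form with exchangeable parts carries over to other content, although there the exchangeable parts have to be collected by hand.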

On national television Ayaan Hirsi Ali said she had given a false name on naturalisation; her grandfather is Magan, not Ali.

Question 1. What is the relevant 'fact' here? ________
Question 2. Does Hirsi Ali's confession establish her name to be 'false' in the legal sense? ________

Background: The Dutch minister of immigration said (May 2006) she took the confession to establish falseness in the legal sense, and announced in Parliament that Hirsi Ali (Magan?) had never been naturalized, only to declare later in the 10-hour session, also on national tv, that Hirsi Ali still was a citizen of the Netherlands, and a week later that she surely would remain one.

The above example illustrates how easy it is to find new examples of legal principles etcetera in your daily newspaper. Granted, the example given is a rather special one.

Explain why the thing I am sitting on is called a chair.
Point to any other chairs in this room.
Is there a chair to be found in any of the pictures on the wall?

Socrates, a philosopher already (too) famous in his own days, was condemned to drink a poisoned cup. Did he die because of his drinking it?

Socrates is human (a new example of the human race), humans die drinking a poisoned cup (known characteristic of humans). Therefore, Socrates did die.

assessing answers

Solve 364 - 79.

The figure shows the answers of 10 students.
[Figure: 10 answers]

"Children who have a good understanding of our base ten system can quite easily find ways of operating on numbers that are nonstandard (see N3 above). Teachers should be able to recognize when a student's reasoning is correct. In the following example, a subtraction calculation is solved in a variety of ways. When prospective teachers encounter this example, they often recognize that their own limited understanding of mathematics is not sufficient to comprehend the reasoning processes used by these children, and they become more aware of why they need to know mathematics at a much deeper level than they currently do."

Judith T. Sowder (2006). Reconceptualizing Mathematics: Courses for Prospective and Practicing Teachers.

The prospective teacher in the Sowder course should "identify: a. which students clearly understand what they are doing; b. which students might understand what they are doing; c. which students do not understand what they are doing." That, indeed, is what assessment should be.

2.2 multiple-choice (MC) questions

Natural questions might already be multiple-choice, but in most cases the MC format will have to be designed on the basis of a short-answer natural question. The implication is that MC questions will be somewhat artificial, and in such a case should be used only if for economic reasons it is not feasible to use the short-answer format. Be aware of developments in character-reading technology that may make it possible in such cases to use the short-answer format after all, instead of MC.

What sense is there in giving tests in which the candidate just picks answers without any opportunity to give reasons for his choices?

Hoffmann, 1962, p. 29

This is a good question. For well-designed items, however, this should not be an issue. For badly designed ones, it definitely is. Opportunity to appeal the test score does not quite solve this problem. The opportunity to ask for explanation during the testing doesn't either; issues of bad design need not be recognized as such by the students themselves, and for standardized tests the testing supervisors cannot possibly solve issues raised by clever testees. Why not follow Hoffmann's suggestion and allow students to annotate their answers?

To prevent disagreement over what grade to give, the multiple-choice tester asks the student merely to pick answers. He refuses to let the student explain his choices for a very simple reason: grading the various explanations would cause disagreement among the graders.

Hoffmann, 1962, p. 35

This definitely is the wrong reason to opt for using MC questions. It may not be immediately obvious, but changing from essay questions to MC questions in order to make tests more 'objective' without having each (essay) test multiply graded, is not a good reason either. There is in the educational measurement field a big misunderstanding of the role of testing in instruction. The British 'assessment for learning' approach (Paul Black) perfectly clarifies the issue. Surely, some testing of necessity is 'summative,' and must be fair - which is not exactly the same as 'objective.' What education is about, however, is growth in knowledge, and testing should be instrumental for this growth. There are multiple ways in which testing can be instrumental in this way, for example providing immediate information to direct the learning process, or motivating students to invest in learning - an age-old principle of 'design' of ways of grading. Improper reasons to use the MC format for testing surely will destroy these kinds of instrumentality of assessment. And it is showing in American educational culture, where too many children - and parents - believe achievement to depend more on (inborn) ability than on effort.

Michael C. Rodriguez (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, summer, 3-13.

The pdf is not free - make your final draft available on your website, Michael!

Do not think this to apply to teacher-made tests only: "Haladyna and Downing (1988) similarly examined a high-quality national standardized achievement test for physicians and found that 11 of the 200 5-option items had four functional distractors (49 items had one functional distractor and 13 had none)." (p. 5)

The golden tip for everyone having to design MC questions is: make them two-choice, at most three-choice. Everybody will be happy: designing the MC question will be fun again, to students the items will be much more transparent, the test will have many more items and will therefore be more valid. Michael Rodriguez (2005) has collected the (American) empirical evidence in a forceful statement. Only his very first sentence, that "item writing has been, is, and always will be an art," unnecessarily detracts from his otherwise clear message. Item writing should not be quackery.

In this particular case, and if you have constructed MC tests before, you may use your own data to prove Rodriguez right. Look up the statistical analyses of these items, and mark all wrong options that 'attracted' definitely fewer students than the other one or two wrong options. You see? A waste of resources, for everyone involved in the testing process.
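If the raw answers are available, the marking can be done in a few lines. This is a sketch under stated assumptions: the function name is hypothetical, the answer counts are invented, and the 5% cut-off is an arbitrary threshold chosen for illustration, not a rule taken from the text, which asks only for options attracting 'definitely fewer' students than the others.

```python
from collections import Counter
from typing import Iterable, Sequence


def weak_wrong_options(
    responses: Iterable[str],
    options: Sequence[str],
    key: str,
    min_share: float = 0.05,  # assumed cut-off, adjust to taste
) -> list[str]:
    """Return the wrong options chosen by less than min_share of the testees."""
    counts = Counter(responses)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no responses to analyse")
    return [
        option
        for option in options
        if option != key and counts[option] / total < min_share
    ]


# Invented answer counts for one four-choice item with key D.
responses = ["D"] * 62 + ["B"] * 25 + ["A"] * 10 + ["C"] * 3
print(weak_wrong_options(responses, options="ABCD", key="D"))  # ['C']
```

Options that nobody chose are flagged as well, because `options` lists every alternative printed in the item, not only those that occur in the answers.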

My own thinking on the persistence of four- and five-choice items in the face of empirical evidence of their inefficiency is that misunderstanding the possible impact of guessing is the decisive factor here. It is somehow reassuring to have four or even five alternatives to fool badly prepared pupils, while it feels quite 'naked' to have only two alternatives. A bit of flipping coins or rolling dice should cure the misperception; or use the computer: applet.
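Instead of flipping coins, the chance that a purely guessing testee reaches the pass mark can be computed exactly from the binomial distribution. The sketch below assumes blind guessing on every item with equal probability for each option; the function name, test lengths and pass marks are hypothetical placeholders, and what the numbers show depends entirely on the values you put in.

```python
from math import comb


def p_pass_by_guessing(n_items: int, n_options: int, pass_mark: int) -> float:
    """Probability of at least pass_mark correct answers by blind guessing.

    Assumes every item is guessed independently, with each of the
    n_options alternatives equally likely to be picked.
    """
    p = 1 / n_options
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(pass_mark, n_items + 1)
    )


# Hypothetical tests to compare; vary lengths and pass marks yourself.
for n_items, n_options, pass_mark in [(40, 4, 24), (80, 2, 56)]:
    chance = p_pass_by_guessing(n_items, n_options, pass_mark)
    print(f"{n_items} items, {n_options} options, pass at {pass_mark}: {chance:.6f}")
```

Note that a fair comparison should let the two-choice test be longer, since leaner items take less testing time, and should set the pass mark with the higher chance level of two-choice items in mind.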

The extent to which a target stands out among a group of other targets is called the target's

A. identity.
B. novelty.
C. salinity.
D. salience.

The only correct answer choice for item 1 is answer D. This option is correct because salience is the term used to describe what makes a target more likely to be selected from the available data by perceptual processes and thus makes it more likely to be perceived among other potential targets. A is incorrect because the identity of a target can only be known by a perceiver after perceptual processes have been completed; B is incorrect because it is one of the factors that can contribute to the salience of a target and does not in itself describe the degree to which the target stands out among other potential perceptual targets in a situation; and C is incorrect because salinity describes the degree of salt in a solution and has nothing to do with perception theory. This item tests Bloom's basic cognitive level (Level 1: KNOWLEDGE) because it only requires recall of information (i.e. of common terms/basic concepts - here: salience) for the item to be answered correctly.

Item designs elaborately explained like the above one, are rare in the literature. This one is cited from Fellenz (2004), see also below on this publication.

In the above example it should be clear that option C is to be deleted immediately, because it has nothing to do with course content.

Drop alternative A also, if 'identity' is not a special term, like 'novelty' is, but only a convenient idea of the item writer.
I take it from the explanation that a two-choice item with options B and D is a good choice item, because it asks for knowing what salience is, versus what it is that contributes to salience.

The item is rather abstract, but nevertheless it allows a few variants using characteristics other than 'novelty' that make for salience. Forget the Bloom crap, unless the taxonomy is used as a heuristic device only.

A 66-year-old woman had the abrupt onset of shortness of breath and left-sided pleuritic chest pain 1 hour ago. She had been recovering well since a colectomy for colon cancer 2 days ago. Blood pressure is 160/90 mm Hg, pulse is 120/min, and respirations are 32/min. She is diaphoretic. Breath sounds are audible bilaterally, but inspiratory effort is decreased, S1 and S2 are normal, and jugular veins are not distended. Which of the following is the most likely cause of her acute condition?

Acute myocardial infarction
Dissecting aortic aneurysm
Pneumonia
Pneumothorax
Pulmonary embolus*

Fincher (2006?) p. 202

Five options! If these are the options the clerk should think of anyhow, five is OK.

The stem contains a lot of window dressing, I suspect. I am hampered here by my lack of medical expertise. What I can say, though, is that separate bits of information in the stem evidently serve to exclude one or more of the options. If such is the case, it is possible to break down this huge MC item into a series of lean two-choice items. The series can be made even larger by including 'why' questions. Do so. You will get much more information about the clerk's mastery in the same amount of testing time! Do not use these leaner items together in the same test, though. Because of the elephantiasis of its stem, the MC item is highly inefficient. If it is deemed important to use authentic questions like the one above, do not use the MC format, or at least use it in another way by posing a number of two-choice questions on the information given in the stem. But remember: this violates the desirability of independence between questions.

For the record: the example item asks for a new use.

Patients usually present with signs and symptoms, not a diagnosis. Therefore, write examination questions that replicate the process of clinical problem solving. Questions such as "Which of the following is true about polymyalgia rheumatica?" or worse yet, "Which of the following is not true about polymyalgia rheumatica?" do not elicit clinical thinking. They are a series of true-false statements.

Fincher (2006?) p. 203

The above citation is a beautiful formulation of what it is to ask for a new diagnosis, i.e. a new use of a particular diagnosis.

DISTRACTOR PHILOSOPHY?

"The key to developing wrong answers is plausibility. Plausibility refers to the idea that the item should be correctly answered by those who possess a high degree of knowledge and incorrectly answered by those who possess a low degree of knowledge. A plausible distractor will look like a right answer to those who lack this knowledge. (...) Writing plausible distractors comes from hard work and is the most difficult part of MC writing."

Haladyna (1999, 2nd ed.) p. 97 guideline 28 'Make all distractors plausible.'

To paraphrase Haladyna: this guideline on distractors is not evidence-based and is the most disappointing sentence in his writing. It conflicts with the guideline forbidding trick questions: this distractor philosophy is a philosophy about tricking testees into the wrong answer. It is perfectly possible, as Haladyna demonstrates, to call wrong answers just that, or 'wrong options.' On a positive note: the choice between options should be an instructionally valid one, corresponding to the kind of choice or discrimination a student mastering the course content should be able to make. Not knowing what the right answer is should not put the student in the trick-question situation; it should be perfectly clear to her that she does not know the right answer, and had better leave the question unanswered. At least, if the scoring rule does not force her to guess on such a question, a 'questionable' strategy for institutions to follow, see below.

This 'distractor philosophy' baffles me. Why is it that we tend to think these dark things about the design of MC questions, never having done so in the case of short-answer questions? Is it a case of the opportunity making the thief?

Figure 2.2.1. "Origins of ballistic theory: Attempt to adapt the construction of projectile trajectories to the knowledge of the artillerists. Taken from a book on artillery by Diego Ufano (1628)." [source: Max-Planck-Institut für Wissenschaftsgeschichte, Research Report 2000-2001, Department I] None of the trajectories is Newtonian, of course.

The Force Concept Inventory of David Hestenes is an example of a test where the wrong answers have been designed to be attractive to pupils having common naive views on the laws of motion - and the pupils are attracted in large numbers. These pupils are not 'distracted' at all: they are attracted to the options corresponding to their conceptual model of force and motion, which might not be the Newtonian view of classical physics. Other chapters will return to the work of Hestenes and other research on cognitive models. For now, the point is that his test is a rather spectacular example of well-designed items, well-designed now in the construct validity sense. The small and unpublished study by Adams and Slater (n.d.) illustrates the main issues here. For examples of the exact items of this test see Rebello et al. (2005); they are highly similar to the well-known items used in mental models research on, for example, ballistic trajectories - or water from your water hose.

Jeffrey P. Adams and Timothy F. Slater (not dated). Student-supplied rationale for multiple-choice responses on the force concept inventory (unpublished manuscript). http://www.physics.montana.edu/physed/documents/FCI-rationale.pdf [dead link? 1-2009]
- Abstract: One hundred and twenty six first year college students were administered a modified FCI, which required both multiple-choice response and a written explanation explaining the underlying reasoning. The vast majority of students provided rationales that were completely consistent with their multiple-choice selections. Analysis of the data suggests that the FCI distracters are consistent with students' misconceptions and therefore the test does provide valid discrimination. However, further study has revealed that individual items may not reveal incremental progress in students' progression towards a complete Newtonian view.
- From the introduction: "Since its introduction in 1992, the force concept inventory (FCI) has had a major influence on the field of physics education research and has provided the impetus for a great number of curricular reforms. The great attention afforded this 29 item multiple choice test derives from four factors: (i) there is widespread agreement within the community as to the importance of the content being assessed; (ii) the test items are deceptively simple leading most instructors to greatly overestimate the likely success rate of their students; (iii) the results are highly consistent across a large constituency ranging from high school classes to courses for physics majors at large research institutions; and (iv) students' responses are highly resistant to traditional instruction. The FCI has done much to awaken the community of physics educators to the fact that though traditional didactic instruction may train students to produce correct solutions to seemingly complex problems, such instruction often does little to alter students' fundamental conceptual thinking."

specific guidelines
Two important guidelines specific to MC items have already been mentioned. Be content with two or three options for your item. Wrong options should be related to the goals of the course; specifically, do not try to think of 'distractors' that trick students not knowing the right answer into marking the wrong one. The special thing about MC items is the list of options. Guidelines here look like guidelines of style, but make no mistake about their real character: they serve to keep the option part of the item crisp and clear so students have a level playing field. Again choosing from Haladyna et al. only the guidelines on things to do, not the things to avoid, the following list results.

WRITING GUIDELINES

Develop as many effective choices as you can, but research suggests three is adequate.
Make sure that only one of these choices is the right answer.
Vary the location of the right answer according to the number of choices.
Place choices in logical or numerical order.
Keep choices independent; choices should not be overlapping.
Keep choices homogeneous in content and grammatical structure.
Keep the length of choices about equal.
None-of-the-above should be used carefully.
Phrase choices positively; avoid negatives such as NOT.

Haladyna et al. 2002 p. 312. Guidelines on 'distractors' are left out here, as is 'Avoid giving clues to the right answer (...)', because this is an issue in the quality check, chapter eight.

For a recent article about guidelines - in the medical field - see Collins (2006) http://www.arrs.org/StaticContent/pdf/ajr/pdf.cfm?theFile=ajrWritingMultipleChoiceHandout.pdf [dead link? 1-2009].

A 7-year-old girl is brought to a physician's office by her mother complaining of chronic abdominal pain, irritability and crankiness. Her mother also hints there may be family problems. Which of the following would be most helpful to aid understanding of this patient's problem?

Elicit further information about the family problems and other potential stressors.
Perform a physical examination.
Reassure the mother that it is a normal phase her daughter is going through.
Refer the girl to a gastroenterologist.
Refer parents for marital counseling.

The flaws in this item include:

The clinical findings in the stem are inadequate
The stem does not pose a clear question
One cannot arrive at the correct answer without looking at the options
The options are heterogeneous
The distracters (wrong answers) are not of similar length or complexity
The wording in the stem is unclear. Who is complaining of pain, patient or mother?

The above citation is from Fincher (2006?, p. 204).

Really, this example belongs to the chapter 8 theme of item quality. I give it here to illustrate that the superficial looks of an MC question do not tell much about its quality as a test item. The flaws are signalled by Fincher. She could have added that having five options is probably bad design here.

possible formats for the MC item

A summary of textbook presentations of types of MC formats is to be found in Haladyna et al. (2002). Basically, there are choice items, true-false items, matching items, and a category called 'complex MC.' True-false is an extreme type of the two-choice or alternate-choice item, and because of this special care is needed in the design of true-false items. Not many sentences in daily life, science, or school can be classified as 'absolutely' true or false. The matching question asks the testee to match every item of one list to those of another, preferably shorter, list. The matching question could be split up into multiple two-choice or three-choice items; it is therefore a trick to save on printing space, and it violates the guideline that items in a test should preferably be independent of each other. Complex MC is the case where the lazy item writer confronts testees with some nerve-racking choices. Never use this item 'format.'

NEVER USE THIS COMPLEX MC FORMAT

Which of the following are fruits?

Tomatoes
Tomatillos
Habanero peppers

1 & 2
2 & 3
1 & 3
1, 2, & 3

Haladyna et al. 2002, p. 321 (p. 312: "AVOID the complex MC (Type K) format.")

NEVER USE THIS COMPLEX MC FORMAT

[Judge the truth of the two sentences.]

Every diamond figure has four equal sides
Every diamond figure has four equal corners

1 & 2 are both true
1 only is true
2 only is true
1 & 2 are both false

This kind of item is still, and regrettably, enormously popular in the Netherlands. The example is a translation from Timmer, in De Groot & Van Naerssen 1969, p. 149. Sandbergen, p. 116 in the same volume, presents its item form, commenting: "A much used and often useful 'trick.'" Sandbergen unwittingly pointed to the main problem in the complex item: the item writer's 'trick' makes the item tricky for the testee to answer. There was among the authors of this 1969 handbook no unanimity on complex items. Lans and Mellenbergh in their chapter on the formal aspects of construction and assessment of items do not mention the complex MC.

true-false items

ambiguity

" .... unlike a multiple-choice item, the true-false item does not provide any explicit alternative in relation to which the relative truth or falsity of the item can be judged. Each statement must be judged in isolation to be true enough to be called true (i.e., true enough so that the test constructor probably intended it to be true), or false enough to be called false. This lack of a comparative basis for decision contributes to the ambiguity of some true-false items."

Ebel, Robert L. (1965). Measuring educational achievement. Prentice-Hall.

Outside of logic, true-false questions tend to be pseudo-logical in character: the sentence P is either true or false. The logician will prove the falseness of P by assuming P to be true, and then deducing a contradiction. See for example Beth (1955) on his semantic tableau technique. Therefore, the true-false format is at home in logic. This fact suggests how the true-false format might be used in other disciplines as well: when the student may be expected to be able to show a contradiction between (part of) the statement and a particular theory or particular givens in the problem. In most other cases the true-false format is suspect, if only for the reason mentioned by Ebel (the box above). The item writer should make it perfectly clear why the true-false format is adequate in each and every instance of its use. If it is possible to change the true-false item into a truly two-choice one, do so.

The Earth’s orbit around the Sun is a circle. true / false

Is the Earth’s orbit around the Sun a circle? yes / no

Justify your answer.
Is the Earth’s orbit around the Sun a circle? yes / no

Explain why the Earth’s orbit around the Sun is not a circle.

What is the Earth’s orbit around the Sun?

a circle
an ellipse.

True/false items require an examinee to select all the options that are 'true.' For these items, the examinee must decide where to make the cut-off - to what extent must a response be 'true' in order to be keyed as 'true.' While this task requires additional judgement (beyond what is required in selecting the one best answer), this additional judgment may be unrelated to clinical expertise or knowledge. Too often, examinees have to guess what the item writer had in mind because the options are not either completely true or completely false.

Case and Swanson (2001) p. 14 pdf

Case and Swanson do not cite empirical evidence to support the position taken here. I do not need empirical evidence to endorse their statement, however. An acceptable multiple true-false item should have options that are either 'totally wrong' or 'totally correct.'

ACCEPTABLE MULTIPLE TRUE-FALSE

Which of the following is/are X-linked recessive conditions?

Hemophilia A (classic hemophilia)
Cystic fibrosis
Duchenne's muscular dystrophy
Tay-Sachs disease

Case and Swanson (2001) p. 14 (1 and 3 'totally correct') pdf

Case and Swanson refer to a tradition that, "for true/false items, the options are numbered; for one-best-answer items, the options are lettered." Why this tradition? They use the convention to obviate the need to state explicitly that the item is multiple true-false, not multiple-choice. Unless students are thoroughly acquainted with this 'tradition,' do not use it.

In terms of durability, oil-based paint is better than latex-based paint.

true / false

Haladyna, 1999. p. 100

Haladyna's advice, backed by Frisbie and Becker (1991), is to "make use of an internal comparison rather than an explicit comparison." The explicit comparison meant is "Oil-based painting is better than latex-based paint." In the example, leaving out "in terms of durability" would make the item ambiguous.

I do not understand the approach taken here by Haladyna. The book gives examples that should be formatted as two-choice items instead of the true-false ones Haladyna has in mind. The reason is simply that formulating the two-choice question as in the box above, and on top of that asking for a true-false answer, makes the item more difficult to read than is necessary. The Haladyna format is a kind of complex true-false.

guessing

There is no such thing as a formula to correct for the guessing of individual students. At the individual level, guessing introduces noise that cannot be filtered out. Aggregate scores of groups of students can be corrected statistically for guessing (formula scoring, versus number-right scoring); this correction, however, is of no help at all if the high stakes primarily regard the individual student, not the institution. Better face the facts and get rid of guessing, especially the forced guessing the Western world has become addicted to since World War I. Do not think that this is an old discussion nobody is interested in any more. I will prove that forced guessing is unfair to the well-prepared student having to sit pass-fail tests. And I will show that in this particular case students are in a position to lawfully force the institution to abandon scoring under forced guessing. If you represent an institution, better take action today. Well, I am not the only one having to prove a point here.
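
For reference, the conventional correction for guessing (formula scoring) mentioned above can be written as follows. The symbols are introduced here for illustration and are not the author's: R is the number of right answers, W the number of wrong answers (omitted items are not counted), and k the number of options per item.

```latex
S = R - \frac{W}{k - 1}
\qquad\text{since, for } g \text{ blind guesses,}\qquad
\mathrm{E}\!\left[\frac{g}{k}\text{ right} - \frac{1}{k-1}\cdot\frac{g\,(k-1)}{k}\text{ wrong}\right] = \frac{g}{k} - \frac{g}{k} = 0 .
```

The correction is exact only in expectation. For an individual student the number of lucky guesses still varies around g/k, which is the noise that, as argued above, cannot be filtered out.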

Patrick Sturgis, Nick Allum, Patten Smith and Anna Woods (2004?). The Measurement of Factual Knowledge in Surveys. pdf

Experimental research comparing 'don't know,' guessing, and 'best guess' after first having chosen 'don't know.'
Note that the 'don't know' option is a variant of the bonus point option, or penalty point scoring, as alternatives for forced guessing.
The relevant issue in this research is the strength of the 'propensity to guess' concept: do subgroups really differ in propensity to guess, in which case forced guessing might be used to counteract this tendency, or does forced guessing add so much noise that it should not be used anyway?
This research is in reaction to the work of Mondak and colleagues, see below.

Jeffery J. Mondak and Damarys Canache (2004). Knowledge Variables in Cross-National Social Inquiry. Social Science Quarterly, 85, 539-558.

This is an interesting exercise, because Mondak and Canache think that validity of questionnaires will be higher when forced guessing forces respondents to equally capitalize on their partial knowledge. This goes against the grain of the psychometric view that guessing adds noise.
abstract This article examines the impact of 'don't know' responses on cross-national measures of knowledge regarding science and the environment. Specifically, we explore cross-national variance in aggregate knowledge levels and the gender gap in knowledge in each of 20 nations to determine whether response-set effects contribute to observed variance.
Analyses focus on a 12-item true-false knowledge battery asked as part of a 1993 International Social Survey Program environmental survey. Whereas most research on knowledge codes incorrect and 'don't know' responses identically, we differentiate these response forms and develop procedures to identify and account for systematic differences in the tendency to guess.
Substantial cross-national variance in guessing rates is identified, variance that contributes markedly to variance in observed 'knowledge' levels. Also, men are found to guess at higher rates than women, a tendency that exaggerates the magnitude of the observed gender gap in knowledge.
Recent research has suggested that 'don't know' responses pose threats to the validity of inferences derived from measures of political knowledge in the United States. Our results indicate that a similar risk exists with cross-national measures of knowledge of science and the environment. It follows that considerable caution must be exercised when comparing data drawn from different nations and cultures.
You will have to pay for a pdf file. Do you have one? Share it with me.

Figure. A hundred monkeys take a test of 10 three-choice questions.

The pictured simulation was done using the spa_module 1 applet. Try it for yourself.
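
For readers without access to the applet, the experiment in the figure can be approximated with a few lines of Python. This is a minimal sketch, not the applet's own code: the function name `simulate_guessers`, the fixed seed, and the text histogram are assumptions made here for illustration.

```python
import random
from collections import Counter


def simulate_guessers(n_testees=100, n_items=10, n_options=3, seed=1):
    """Return one number-right score per testee; every answer is a blind guess."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    scores = []
    for _ in range(n_testees):
        # A guess hits the keyed option with probability 1 / n_options.
        right = sum(1 for _ in range(n_items) if rng.randrange(n_options) == 0)
        scores.append(right)
    return scores


scores = simulate_guessers()
distribution = Counter(scores)

# Crude text histogram: one star per 'monkey' obtaining that score.
for score in range(11):
    print(f"{score:2d} {'*' * distribution[score]}")
```

The expected score is 10/3, but individual 'monkeys' scatter widely around it, and changing the seed shows how much the picture differs from run to run.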

guessing under pass-fail scoring

Figure. Forced guessing, i.e. there being no option to leave unknown questions unanswered, makes the pass decision less reliable.

While it is known (Lord & Novick, 1968, p. 304) that guessing lowers the validity of tests, other things being equal, it is generally not known that guessing heightens the risk of failing under pass-fail scoring for students having satisfactory mastery. The figure shows a typical situation. The test has 40 three-choice items; its cut-off score in the no-guessing condition is 25, and in the forced-guessing condition it is 30. The remarkable thing is that the probability of failing the 25 score limit for a student having mastery .7 is .115, while the probability of failing the 30 score limit under forced guessing (pseudo-mastery now .8, that is, .7 plus one third of the remaining .3) is .165. [mastery is defined on the domain the items are sampled from]
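
The two failure probabilities can be checked with a short calculation. The sketch below assumes a plain binomial model, in which every item is answered correctly with a probability equal to the (pseudo-)mastery; the model behind the figure may differ in detail, and the helper name `prob_fail` is hypothetical. Under this assumption the results come out close to the .115 and .165 quoted above.

```python
from math import comb


def prob_fail(n_items, cutoff, p_correct):
    """P(score < cutoff) when each item is answered correctly with probability p_correct."""
    return sum(
        comb(n_items, k) * p_correct**k * (1 - p_correct) ** (n_items - k)
        for k in range(cutoff)
    )


n_items, n_options = 40, 3
mastery = 0.7

# Forced guessing: the unknown 30% is guessed with a 1-in-3 chance of success.
pseudo_mastery = mastery + (1 - mastery) / n_options

cutoff_no_guessing = 25
# The cut-off is raised by the lucky guesses expected at the cut-off: 25 + 15/3 = 30.
cutoff_forced = cutoff_no_guessing + (n_items - cutoff_no_guessing) // n_options

p_fail_no_guessing = prob_fail(n_items, cutoff_no_guessing, mastery)
p_fail_forced = prob_fail(n_items, cutoff_forced, pseudo_mastery)
print(round(p_fail_no_guessing, 3), round(p_fail_forced, 3))
```

Raising the cut-off compensates for guessing on average, but the same student still runs a higher risk of failing, because the guessed items add variance to her score.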

The model is not strictly necessary to argue the case, of course, but it helps to be able to quantify the argument. Suppose the student is allowed to omit questions she does not know, meaning she will not be punished for this behavior but instead will obtain a bonus of 1/3rd point for every question left unanswered. Students having satisfactory mastery will have a reasonable chance to pass the test. Those passing will do so while omitting a certain number of questions. It is perfectly clear that some of these students would fail the test if they had to guess on those questions after all. In the same way, some mastery students who initially failed the test might pass it by guessing luckily. This second group is, however, much smaller than the first one, and they still have the option to guess. The propensity to guess is higher, the lower the expected score on tests; see Bereby-Meyer, Meyer, and Flascher (2002).
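
The argument can be made concrete with the 40-item example. As a sketch, assume (this is not stated in the text) that the cut-off of 30 also applies to the bonus-point score. The symbols are introduced here for illustration: K is the number of items the student knows, G the number of lucky guesses, B the bonus-point score when all unknown items are omitted, and F the score under forced guessing.

```latex
\begin{align*}
B &= K + \tfrac{1}{3}(40 - K), & B \ge 30 &\iff K \ge 25,\\
F &= K + G, \quad G \sim \mathrm{Bin}\!\left(40 - K, \tfrac{1}{3}\right), & \mathrm{E}[F \mid K] &= K + \tfrac{1}{3}(40 - K) = B,\\
& & \mathrm{Var}[F \mid K] &= \tfrac{2}{9}(40 - K) > 0 \quad (K < 40).
\end{align*}
```

Both rules give the same expected score for a given K, but only forced guessing adds variance. A student with K = 25 passes with certainty under the bonus rule, whereas under forced guessing she must win at least 5 of her 15 guesses.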

The amazing thing about this argument is that I do not know of a place in the literature where it is mentioned. There has of course been a lot of research on guessing, omissiveness, and on methods to 'correct' for guessing, but none whatsoever on this particular problem. That is remarkable, because students failing a test might claim they have been put at a disadvantage by the scoring rule that answers left open will be scored as wrong. This is a kind of problem that should have been mentioned in every edition of the Educational Measurement handbook (its latest edition, 1989, edited by Robert L. Linn). Lord & Novick (1968, p. 304) mention the problem of examinees differing widely in their willingness to omit items; the interesting thing here is their warning that requiring every examinee to answer every item in the test introduces "a considerable amount of error in the test scores." The analysis above shows that in the particular situation of pass-fail scoring this added error puts mastery students at a disadvantage, a conclusion Lord and Novick failed to note.

Martin R. Fellenz (2004). Using assessment to support higher level learning: the multiple choice item development assignment. Assessment & Evaluation in Higher Education, 29, 703-719. Available for download at html

abstract This paper describes the multiple choice item development assignment (MCIDA) that was developed to support both content and higher level learning. The MCIDA involves students in higher level learning by requiring them to develop multiple choice items, write justifications for both correct and incorrect answer options and determine the highest cognitive level that the item is testing. The article discusses the benefits and limitations of the scheme and presents data on the largely positive student reactions to the scheme. The development of the MCIDA also serves as an example for how traditional summatively oriented assessment procedures can be developed into tools that directly support student learning.
A highly creative approach to the MC item design problem. It is probably also highly dependent on the person of the teacher, in this case Martin Fellenz. I like to recommend that teachers invite their students to submit test questions that might be used, for example, in future end-of-course tests, but I have never met teachers successfully doing so. There are many points in the Fellenz approach I'd rather not endorse, however, unless he himself is the teacher. Technical points I take issue with are the forced four-choice format (if the number should be fixed, take two) and the reliance on Bloom's cognitive taxonomy on the flimsy evidence of one random publication; reviews had already shown around 1980 that the Bloom levels can NOT be discriminated at an acceptable level by independent raters. [Haladyna, 1999, p. ix: "Although the Bloom taxonomy has continued to be favored by many practitioners, scientific evidence supporting its use is lacking (Seddon, 1978)." G. M. Seddon (1978). The properties of Bloom's taxonomy of educational objectives for the cognitive domain. Review of Educational Research, 48, 303-323.] This brings me to the final point, that Fellenz does not offer his students much help in this difficult task. I can understand that, because the literature does not offer many clues on how to design relevant MC items, and Fellenz may excuse himself here because the method nevertheless works very well in an instructional sense; at least, that is what he claims.

matching items
The matching item is a bit of a misnomer. It might be characterized as asking for the correct pair-wise combination of items from two lists: books and authors, albums and groups, medical vignettes and diagnostic options, etcetera. A technical point in the design is to avoid two lists that pair off exactly, because the last pair will then automatically be correct if the other ones are. The instruction should make it clear which list is the question list; in the medical example, see Fincher p. 206-207 pdf, it is the list of vignettes, because clinical thinking departs from the vignette. In the books-and-authors case it could be either.

D. V. Budescu (1988). On the feasibility of multiple matching tests - variations on a theme by Gulliksen. Applied Psychological Measurement, 12, 5-14. [I have to look it up yet.]

abstract [ERIC SLD] A multiple matching test--a 24-item Hebrew vocabulary test--was examined, in which distractors from several items are pooled into one list at the test's end. Construction of such tests was feasible. Reliability, validity, and reduction of random guessing were satisfactory when applied to data from 717 applicants to Israeli universities.

S. H. Shaha (1984). Matching-tests: reduced anxiety and increased test effectiveness. Educational and Psychological Measurement, 4, 869-.. .

(Students overwhelmingly preferred the matching-test formats, scored equally high or significantly better on them, and experienced significantly less debilitating test anxiety. Traditional dependence on multiple-choice tests to the avoidance of matching items is questionable.)

Ruth-Marie E. Fincher (2006?). Writing multiple-choice questions. A section in Louis N. Pangaro and William C. McGaghie (Lead Authors) Evaluation and Grading of Students. pdf, being chapter 6 in Ruth-Marie E. Fincher (Ed.) (3rd edition) Guidebook for Clerkship Directors. downloadable [Alliance for Clinical Education, USA]

"Learning to write multiple-choice items using the formats of the National Board of Medical Examiners (NBME) produces higher-quality items. Therefore, all faculty who write multiplechoice items for examinations should master the principles of item writing.
The NBME uses items that are "one best answer" (type A, or matching); therefore, I will discuss only these types. I recommend avoiding other types of multiple-choice items, such as K-type (1, 2, and 3 only, 1 and 3 only, etc), multiple true/false, or A-B-Both-Neither. You should also provide the reference range of laboratory values unless you are testing the students' recall of the values. Generally, the goal is to assess the students' understanding of any divergence from normal, not whether they know the normal values. Normal values are given on USMLE Step examinations and NBME Subject Tests."

Robert Woods (1977). Multiple choice: A state of the art report. Evaluation in Education: International Progress, 1, 191-280. for a fee pdf available [impossible Elsevier URL, I am sorry]

I do not pay fees. I have not seen this article, the Journal is not in any Dutch library. Woods is mentioned in Haladyna, 1999, p. viii.

objectivity

2.3 Essay type questions

A sobering empirical result to begin with: it really does matter how the student is asked to write up her answer. The reminder is this: if so small a difference as that between writing and typing your answer matters so much, there must be many more important differences between possible techniques of assessment that make important differences in test results.

The first written examinations in Oxbridge in a sense followed the catechetical method, because no questions were put that allowed different interpretations: 'the way to achieve more accurate and certain means of evaluating a student's work was to narrow the range of likely disagreement and carefully define the area of knowledge students were expected to know' (Rothblatt, 1974, p. 292).

Taken from Ben Wilbrink (1997). Assessment in historical perspective. Studies in Educational Evaluation, 23, 31-48.
Sheldon Rothblatt (1974). The student sub-culture and the examination system in early 19th century Oxbridge. In Stone, L. (Ed.). The university in society. Vol I Oxford and Cambridge from the 14th to the early 19th century, p. 247-303. Princeton: Princeton University Press.

Because open questions are so terribly open, the understandable but regrettable tendency is to pin these questions down in such a way as to minimize any opportunity for discussion about the correctness of answers given. Teachers are losing opportunities here to gain insight into the hearts and souls of their students, and especially for insightful feedback on the answers given.

essay questions interchangeable with completion or choice questions?

grading essay questions fairly

It is not unusual for administrative bodies to prescribe that model answers for grading must be drawn up in advance. That is harmful bureaucracy. Of course it is good to work out beforehand which variants of answers are possible; the professional teacher can do that perfectly well herself, preferably with some peer consultation. The mandatory prescription, however, invites the belief that such a model answer guarantees complete fairness, which is a mockery. Unfortunately it also invites rewarding partial knowledge, because otherwise students will start demanding it on the basis of the model answer.

Table 1. Assessment of dental work pieces by three instructors

work piece:        1   2   3   4   5   6   7   8   9  10
--------------------------------------------------------
instructor a       8  11  14   7  10  11   7  14   9  10
instructor b       8  14   9   9  11  14  12   9   9  12
instructor c       6   9   6  13  10  14  13   8  11   9
--------------------------------------------------------
highest rating     8  14  14  13  11  14  13  14  11  12
lowest rating      6   9   6   7  10  11   7   8   9   9

Source: Dick Tromp (1979). Het oordeel van studenten in een individueel studie systeem. Onderwijs Research Dagen, 1979. Tromp's data are more extensive than the table can show.
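
The disagreement between the instructors in Table 1 is easy to quantify. The following sketch recomputes the highest and lowest rating per work piece and adds the spread between them. The spread is a summary chosen here for illustration, not a statistic reported by Tromp, and the variable names are hypothetical.

```python
# Ratings from Table 1: one list per instructor, one entry per work piece (1..10).
ratings = {
    "a": [8, 11, 14, 7, 10, 11, 7, 14, 9, 10],
    "b": [8, 14, 9, 9, 11, 14, 12, 9, 9, 12],
    "c": [6, 9, 6, 13, 10, 14, 13, 8, 11, 9],
}

# Regroup so that each tuple holds the three ratings of one work piece.
per_piece = list(zip(*ratings.values()))

highest = [max(r) for r in per_piece]
lowest = [min(r) for r in per_piece]
spread = [h - l for h, l in zip(highest, lowest)]

for piece, (h, l, s) in enumerate(zip(highest, lowest, spread), start=1):
    print(f"work piece {piece:2d}: highest {h:2d}, lowest {l:2d}, spread {s}")
```

Work piece 3 shows the largest spread: the same piece is rated 14 by one instructor and 6 by another.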

Giving credit for good partial answers to an insight question undermines the distinctive character of insight questions as opposed to knowledge questions. The testing then degrades into knowledge testing, and the special incentive to keep studying up to a high level of mastery disappears. See Biggs (1996) for examples of teachers who, by rewarding good partial answers, act contrary to their intention to test insight. After all, they thereby encourage students to process the material superficially.
Wilbrink, 1958, section 'Sturende werking' (steering effect)

essay grading

The famous study here is Hartog and Rhodes (1936). A sample of more recent studies: DeCarlo (2005 pdf), Congdon and McQueen (2000), Engelhard (1994), Huot (1990).

From Smith's prize competition at Cambridge

Another of Milner's strategies was to ask candidates for a particular proof and then long before the best candidate could possibly have finished writing ask all the candidates to stop. His rationale was simple. He believed he could judge from the half-finished answers what the completed ones would have been and in this way gain extra time for asking further questions.

Barrow-Green, 1999, p. 284. Milner was an examiner from 1798 to 1820. Among mathematicians the Smith competition was more highly esteemed than the mathematical tripos.

Applaud this man for a clear insight into what it is for an exam to be time efficient.

2.4 Transparency

Abstract Because test information is important in attempting to hold schools accountable, the influence of tests on what is taught is potentially great. There is evidence that tests do influence teacher and student performance and that multiple-choice tests tend not to measure the more complex cognitive abilities. The more economical multiple-choice tests have nearly driven out other testing procedures that might be used in school evaluation. It is suggested that the greater costs of tests in other formats might be justified by their value for instruction - to encourage the teaching of higher level cognitive skills and to provide practice with feedback.

N. Frederiksen (1984). The real test bias: Influences of testing on teaching and learning. American Psychologist, 39, 193-202.

2.5 Item templates

Isaac I. Bejar, René R. Lawless, Mary E. Morley, Michael E. Wagner and Randy E. Bennett (2002). A feasibility study of on-the-fly item generation in adaptive testing. ETS Research Report 02-23 / GRE Board Report No. 98-12P. pdf

abstract The goal of this study was to assess the feasibility of an approach to adaptive testing based on item models. A simulation study was designed to explore the effects of item modeling on score precision and bias, and two experimental tests were administered - an experimental, on-the-fly, adaptive quantitative-reasoning test as well as a linear test. Results of the simulation study showed that under different levels of isomorphicity, there was no bias, but precision of measurement was eroded, especially in the middle range of the true-score scale. However, the comparison of adaptive test scores with operational Graduate Record Examinations (GRE) test scores matched the test-retest correlation observed under operational conditions. Analyses of item functioning on linear forms suggested a high level of isomorphicity across items within models. The current study provides a promising first step toward significant cost and theoretical improvement in test creation methodology for educational assessment.
This research uses item models (item templates). While some explanation is given on the item model concept, and some examples provided, the bulk of the article is terrible psychometrics. Just skip the terrible parts.

2.6 Questions of validity - Valid questions

subdomains of validity

Figure 1. The scheme subdomains of validity summarizes the ten domains of concern for the evaluation and/or construction of validity in achievement test items. The subdomains, of course, are highly interconnected, therefore the connectivity between them has not been indicated in the scheme itself. The scheme summarizes the treatment of the validity concept in the (Dutch) paragraph 2.6 as recently (May 2008) developed.

Short characteristics of the subdomains:

Corpus: the whole of knowledge (of mathematics, etc), or special sections of it, as known (published, etcetera) in the particular field
Expertise: what it is to be a chess master, to do expert mathematics, etcetera.
Situatedness: most knowledge is in a number of different ways situated in particular contexts etc.
Neurocognition: how knowledge (knowing how, etcetera) is available in or constructed by the brain and the senses
Student model: a model of characteristic ways of knowing, the level and quality of mastery achieved already, the profitable field of next development, etcetera.
Learning model or model of growth: how it is that students learn new concepts, change existing concepts, etcetera
Diagnostic system: a systematics of things that might be amiss or in one way or another may require special attention in the case of the individual student
Questions: This is home: how to design questions under the restrictions of what is specified in the subdomains of validity. Also a systematics of questioning, based on the position that all questions essentially ask to explain A in terms of theory T and specific givens G (for example: Hintikka, 2007), meaning that superficially simpler questions might be expanded into the fully explaining format.
Strategies: students will not passively undergo their examinations. On the contrary, they will prepare themselves strategically. Their teachers, in turn, will anticipate student strategies. Also, teachers designing items typically behave strategically, for example trying to avoid discussion about assessments by using 'objective' item formats.
Technicalities: Even when everything else is perfectly valid, it might happen that the printing or the scoring key or whatever contains errors introduced by faulty handling of whoever is involved in the production process of a test or test item. Gone is your validity ... . Or things like using the green-red color contrast in questions, disregarding the large number of students being red-green colorblind. Etcetera.

Absolutely essential in test item design is that every item individually is a valid item, validity taken in the realistic sense as presented in Borsboom, Mellenbergh and Van Heerden (2004) as contrasted with the construct validity approach that at the level of the test—not the individual test item—seeks to relate scores on this test to scores on other tests irrespective of what is the state of affairs in the real world.

To counter the danger that a strict format of every item being valid might give rise to unintended side effects—such as are known from Verschaffel, Greer and De Corte's (2000) research on word problems—there should be room for meta-validity: items that as formulated are not valid, but might be made so in a reformulation or reconception produced by the testee. After all, problems in the real world will not always come in valid formats. The one famous example here is the 'what is the age of the captain' kind of word problem. As used in research on word problems, it is a meta-valid question. Arithmetics didactics is not yet ready for the use of this kind of meta-valid question in primary education—so much the worse for the didactics.

Strictly taken, meta-validity is validity also. However, it will pay the test item designer to be aware of the difference between the two kinds of validity.

The amount of writing on validity of psychological tests is terrifying. The authority on the subject is the publication on standards for psychological tests by the American Psychological Association (in cooperation with other organizations), the latest edition being APA (1996). Ultimately, as for example in court, these Standards are decisive. Nevertheless, these Standards get updated from time to time, so there is room for improvement on the received view as documented in them. The chapter on validity in the 1996 edition is strongly influenced by the work of Melvin Novick on construct validity, emphasizing nomological networks, etcetera. In this approach the validity of a test is ascertained by researching how it correlates with other tests. This characterization of construct validity may not be fair, but it makes the point that construct validity does not touch on whether the events to be measured really exist or not.

In test item design, the question of validity touches on whether what the test item probes or measures—a particular piece of knowledge, insight—really does exist, let us say, in the brains—connectionist neural networks if you want—of the individual students. The good news—for test item design—is that issues of item validity do not directly touch on whatever it is that is called construct validity of tests.

The work of Denny Borsboom, Don Mellenbergh and Jaap van Heerden (2004) is of tremendous help in clarifying issues of validity at the level of the individual test items. The important point in the work of Borsboom et al. is that today's conceptions of construct validity do not touch on the one important question of validity: does this instrument measure what it is intended to measure, does this ruler measure the attribute of length, does this item measure the understanding it purports to measure? That 'understanding' is not a correlate of intelligence, mental fitness, or whatever—it is the real thing, the neural process if you want.

In the Dutch text of paragraph 2.6 an example item is taken from the Dutch Science Quiz. In its formulation the concept of mass is confused with that of weight, while at the same time the question is exactly about the difference between these concepts. Therefore, it is not possible to answer the item correctly without radically reinterpreting the question as asked. Therefore, the question is invalid. Making it valid is simple: remove the ambiguity. Why, then, did the item designer not do it himself in the first place? This would be a trivial question to ask, if not for the fact that this kind of invalidity in achievement test items is ubiquitous. Something is lacking in the validity of the design process, it seems.

"Procedures currently in use for constructing and describing achievement tests are a mess. Conclusions about methods, variables, or procedures can hardly be taken seriously when you don't know what the test measures. Drastic action is indicated. Journal editors are admonished forthwith to reject papers unless they contain 1) a documented rationale for selecting the questions to be asked and 2) a fully-explicated analysis of the relationship between the questions and the preceeding instruction."

Richard Anderson (1972, p. 16)

Richard Anderson's 1972 warning on the problematic validity of testing in education is as valid now as it was then. Circumstances have changed somewhat, of course, but the overall picture is still rather gloomy. Read for example Popham's (2005) warning in connection with the No Child Left Behind Act, and the pressure it will exert to use educational tests in ways they surely are not intended for, and should not be intended for.

Validity questions in educational assessment are not fundamentally different from those in other disciplines, for example physics. I do not deny that issues of validity loom larger in psychology than in, for example, physics. Psychology is the younger science. In physics issues of validity are inherent in doing empirical research; see for example Olesko (1991) on one of the first experimental physics labs in nineteenth-century German universities. Yes, indeed, there are at least family resemblances between validity of educational assessment, and what in the disciplines taught in educational institutions counts as valid experiments, etcetera.

perfectly valid: basic arithmetic facts

2 + 4 = ?
3 × 2 = ?
7 × 3 = ?
5 + 3 = ?

The box shows addition and multiplication of the numbers 1 to 9. There is strong theory available, e.g., Lebiere and Anderson (1998). Note that these basic arithmetic facts also get exercised and tested in exercises of adding and multiplying numbers bigger than 9. Note also that the question format should be diversified to prevent this particular form inadvertently being 'connected' to mastery of these basic number facts, somewhat like the 'age of the captain' phenomenon in word problems (Verschaffel et al., 2000).

invalid: creatively written items

Writing items on the basis of creative ideas, as if it were an art, as it is characterized in the literature more often than not, results in invalid items by definition. The creative idea is hosted by the wrong brains, those of the item writer instead of those of the testee. Whatever knowledge the creative test item asks for is only accidentally related, if at all, to whatever knowledge the testee might have obtained from the course as intended.

non-construct validity

The idea of construct-validity is that validity somehow should be a relation between whatever it is that is assessed, and some theory or other. However, there are cases of assessment or experiment missing such a connection to any theory whatsoever. For example, look into some of the experiments done by Galileo Galilei, as analysed and reported by Stillman Drake (1990).

valid measurement, no theory

While other philosophers were speculating about the cause of objects falling, Galileo took refuge in what he could experimentally ascertain about the phenomenon of objects falling. He constructed a polished plane inclined by 1.7 degrees, and let a bronze ball roll down a groove. He marked the positions of the ball after equal intervals of time, and assessed the length of the trajectories between these points. Galileo kept the paper with the exact measurements, and this paper is still available (Florence, Galilean manuscripts vol. 72, folio 107v), but it was not printed in the collected works edited by Antonio Favaro (Drake, 1990, p. xvii).

square  time  distance
     1     1        32
     4     2       130
     9     3       298
    16     4       526
    25     5       824
    36     6      1192
    49     7      1620
    64     8      2101

The numbers 1 to 8 are the eight equal periods of time; the numbers in the third column are the measured distances traveled at those points in time (Galileo used a personal ruler equally divided into 60 units of what he called punti, approx. 0.9 mm each). The squares in the first column were added days later by Galileo. The regularity in the empirical data is that they are (almost) equal to the corresponding square times the first length measured (Galileo marked the deviations above and below the calculated values).
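The regularity can be checked with a few lines of arithmetic. What follows is a minimal sketch in Python, using the figures exactly as tabulated above; the variable names are illustrative only and come from neither Drake nor Galileo. It divides each measured distance by the square of its time index. If the squaring law holds, the quotient should stay close to the first distance measured.

```python
# Distances (in Galileo's units) after 1..8 equal time intervals,
# copied from the table above.
distances = [32, 130, 298, 526, 824, 1192, 1620, 2101]

# Under a times-squared law, distance / n**2 should be roughly constant.
ratios = [d / n**2 for n, d in enumerate(distances, start=1)]

for n, (d, r) in enumerate(zip(distances, ratios), start=1):
    print(f"t={n}  n^2={n**2:2d}  distance={d:4d}  distance/n^2={r:.2f}")

print(f"ratios range from {min(ratios):.2f} to {max(ratios):.2f}")
```

The quotients stay within a narrow band just above the first measurement, which is the sense in which the data are '(almost) equal' to the square times the first length.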

There is no theory involved in the experiment, other than Galileo's choice not to involve any theory here. [NB. This might not be true. The squaring law was a known theory already, see Dijksterhuis 1950, IV 88, 89: "that the distance covered in a fall from rest is proportional to the square of the time elapsed since the beginning of the motion." According to Dijksterhuis 1924 this experiment served to test the already known theory. Stillman Drake does not discuss this point, thereby giving the reader some room to suppose that the squaring law was discovered by Galileo. Such cannot possibly be the case; his experiments showed that theory to be true to the data. The closing sentences in Dijksterhuis (1950, IV 90) are remarkable: he calls it a persistent myth that Galileo found the validity of the squaring law by doing the inclined plane experiment. Dijksterhuis must not have known of the existence of note paper f. 107v, because it had not been published in the collected works.] For theories about falling objects abounded in Antiquity as well as in the Middle Ages, and well into the seventeenth century, emphasizing in one way or another what causes the falling movement. Galileo does not speculate on causes, at least not here, and lets his data do the talking. These data belong to a small set of the most famous experimental data in the history of science, yet they have been published only recently by Stillman Drake (Dijksterhuis, for one, was not aware of the precise circumstances of Galileo's experiments using an inclined plane). Most students of physics in the past centuries have learned their theoretical mechanics without ever having heard of this experiment, or of the man Galileo.
So much for these valid measurements that resulted in a law of physics. No, they were not made to test that law, as Dijksterhuis (1924) had written.

Stillman Drake (1990) ch. 1, esp. p. 10 Figure 1 facsimile of Galileo's notes on f. 107v.

Assessing mastery of multiplication of numbers below ten is almost as straightforward as Galileo's assessment of speeds of falling. 'Almost': there is a stochastic aspect involved here, because the same multiplication might be answered wrongly now and correctly the next time, or the other way around. This stochastic element is, however, not a matter of 'error' in assessment, because it is a consequence of the way the brain functions. No error, therefore no unreliability. And of course no threat to validity. There is more to say on this topic, especially because the literature keeps rather silent on it. It is quite amazing that valid stochastic processes nevertheless get the 'unreliability' treatment in psychometric theory. These are just binomial processes, sampling processes if you prefer to call them so, well known since at least the end of the seventeenth century (Christiaan Huygens, for example).
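The binomial view can be made concrete. As a minimal sketch, assume a hypothetical pupil who retrieves any single multiplication fact correctly with probability 0.9, independently from item to item. Both the value 0.9 and the ten-item test are illustrative assumptions, not data from this chapter. The code computes the distribution of the number of correct answers under that model.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k correct answers out of n independent items."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_items = 10      # hypothetical test length
p_correct = 0.9   # hypothetical mastery level of one pupil

pmf = [binomial_pmf(k, n_items, p_correct) for k in range(n_items + 1)]
for k, prob in enumerate(pmf):
    print(f"{k:2d} correct: {prob:.4f}")

# Expected score and its standard deviation follow from the same model.
expected = n_items * p_correct
sd = (n_items * p_correct * (1 - p_correct)) ** 0.5
print(f"expected score {expected:.1f}, standard deviation {sd:.2f}")
```

Under these assumptions scores of 8, 9 and 10 are all quite likely for one and the same pupil, without anything about the pupil or the test having changed between sittings.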

invalid observation, strong theory

Consider Aristotle's theory that bodies twice as heavy as others fall twice as fast (or something like that). The natural philosophers must have flattered themselves that they had really seen such things happen, while in reality it is rather easy to observe that falling speeds are the same or nearly the same.

Dijksterhuis (1950).

Understanding free fall is not possible other than by description of the phenomena. Calling 'gravitation' the cause of falling is a game with words (see also Galileo, Dialogo 2 [Dijksterhuis IV 85]).

questions of abstractness

In the 1983 edition questions of validity were narrowed down to those of abstractness.

Another way to emphasize the point of this paragraph is to posit that transfer is the name of the game (Barnett and Ceci, 2002; Klahr and Li, 2005). Sure, there is this traditional notion that it is exactly the abstract character of Latin, chess, and all the rest of it, that exercises the brain. Forget the crap. At the beginning of the 21st century a lot of research is going on that might be labeled research on transfer of learning (Mestre, 2005). There is nothing mysterious about the idea of transfer. Everybody knows, or should know, that kids having learned to swim does not guarantee that they will swim to save their lives. The same goes for kids having learned Newton's laws: this does not guarantee they will act on that knowledge if need be. What is it, in instruction as well as in assessment, that will enable them to successfully apply Newton's laws in everyday - or in professional - circumstances? That is the transfer question. It is what education - and therefore assessment - is about. It will not do to let them learn the laws by heart, or to train them to use the laws in countless abstract - mathematical - ways. The ultimate question is - and the achievement test should touch on that - will they later be able to apply this knowledge in countless concrete situations? Construct validity, again.

"Physical and natural scientists could rely on experiments whose equivalent in human affairs would violate ethical limits. They could chop up bacteria, induce mutations in fruit flies, and blast molecules into smithereens, then observe the effects of their interventions. Anyone who tried the equivalent with human beings would soon be dead or behind bars." Tilly (2006), p. 129

The social sciences, unlike the physical sciences, are handicapped by ethical principles - broadly conceived - in doing their research, in obtaining unambiguous results from their research, and in communicating these results. The social sciences therefore seem to be somewhat less concrete than the physical sciences. Historically, social scientists have tried to escape from the dilemma by emphasizing quantification and operationalisation, and otherwise using techniques and instruments from the physical sciences as well. Somehow or other, this difference between the social and the physical sciences has implications for the design of achievement test items, or at least for the possibilities open to the designer.

To this day every student of elementary physics has to struggle with the same errors and misconceptions which then had to be overcome, and on a reduced scale, in the teaching of this branch of knowledge in schools, history repeats itself every year. The reason is obvious: Aristotle merely formulated the most commonplace experiences in the matter of motion as universal scientific propositions, whereas classical mechanics, with its principle of inertia and its proportionality of force and acceleration, makes assertions which not only are never confirmed by everyday experience, but whose direct experimental verification is impossible .... (p. 30).

Champagne, Gunstone and Klopfer (1985, p. 62), citing from E. J. Dijksterhuis (1950/1961). The mechanization of the world picture. London: Oxford University Press.

This citation will be used in other chapters also. Its message is highly disturbing. For now, contrast this with the citation from Tilly, above. The implication of the observation of Dijksterhuis - and very, very many others involved in teaching physics - is that many so-called demonstrations of physical laws are nothing of the kind. They may show strange effects unexpected to the naive viewer, but in no way count as proof of the natural law whose working they supposedly demonstrate. There is a - social scientific - experimental literature on the effects of physical demonstrations in education: probably they are nil. Summing up: the physical sciences might be handicapped also by their seemingly concrete results not revealing the physical laws supposedly causing them. Misplaced concreteness?

A lot less abstract is the research paper by

Brereton, Shepard and Leifer (1995). How students connect engineering fundamentals to hardware design: Observations and implications for the design of curriculum and assessment methods.

"We videotaped third and fourth year engineering students in classes and observed them in situ in their laboratories and dormitories as they worked in groups on design and dissection projects. These video studies and in situ ethnography revealed the ways in which students actually use the concepts they have studied, when faced with practical tasks."
"In design projects, students often had difficulty working with relationships between variables like power, torque, force, work and speed. We observed two causes. First, physical world realities like friction, uneven surfaces and irregular objects cast doubt on the nature of relationships between variables. Second, students seemed to have little experience in qualitative reasoning about what should vary and what should not. In typical analysis problem sets they are used to being told what the independent variable is."
"When project performance was emphasized over understanding, students often abandoned reference to concepts and used simple reasoning like try bigger wheels or use a bigger motor. Often they got their projects to work in this way. But although they gain confidence about design through doing, they may not learn to leverage their theoretical knowledge effectively."

Wendy M. Williams, Paul B. Papierno, Matthew C. Makel, Stephen J. Ceci (2004). Thinking like a scientist about real-world problems: The Cornell Institute for Research on Children Science Education Program. Applied Developmental Psychology 25, 107–126. http://pubcms.cit.cornell.edu/che/HD/CIRC/Publications/upload/williamsetal.pdf [dead link? 1-2009]

abstract We describe a new educational program developed by the Cornell Institute for Research on Children (CIRC), a research and outreach center funded by the National Science Foundation. Thinking Like A Scientist targets students from groups historically underrepresented in science (i.e., girls, people of color, and people from disadvantaged backgrounds), and trains these individuals to reason scientifically about everyday problems. In contrast to programs that are content based and which rely on disciplinary concepts and vocabulary within a specific domain of science (e.g., biology), Thinking Like A Scientist is domain-general and attempts to promote the transfer of scientific reasoning. Although the full evaluation of Thinking Like A Scientist must await more data, this article introduces the conceptual basis for the program and provides descriptions of its core themes and implementation.
A dream program. "Contrary to the domain-specific approach, the domain-general approach to studying scientific thinking investigates experimental design and hypothesis testing. Of primary interest to this approach is the development of reasoning skills about causal relationships that are transferable to other domains of knowledge (Zimmerman, 2000)."
Corinne Zimmerman (2000). The development of scientific reasoning skills. What psychologists contribute to an understanding of elementary science learning. Final draft of a report to the National Research Council Committee on Science Learning Kindergarten through Eighth Grade. Developmental Review, 20, 99–149.

2.7 An historical perspective

Most of this paragraph is a perspective on US education history. The material presented does not leave much to be guessed. Even so, it might be useful to read Lagemann (2000) or Chapman (1988) for the broader perspective on the times, places and persons. It is really amazing to see how a handful of individuals were able to direct the course of history here. The same thing would have been unthinkable in the physical sciences - atoms don't talk back - but education is a human activity that inevitably is influenced by the spirit of the time. That spirit, around 1900, was quantification and measurement, with Edward Thorndike's statistical imperialism crowding out the infinitely more sensitive philosophical approach of John Dewey.

Unless noted otherwise, the publications mentioned below are in my possession. If you have any specific question on them, mail me. I have ordered them chronologically.

C. Stray (2001). The Shift from Oral to Written Examination: Cambridge and Oxford 1700-1900. Assessment in Education: Principles, Policy and Practice, 8, 33-50.

abstract The typical modern examination involves the production of written answers to printed questions in a secluded physical location. In 16th century England university examinations were conducted in public, orally and in Latin, with the participation of the academic community. The paper gives an account of the shift from oral to written examinations at Oxford and Cambridge in the 18th and 19th centuries. Cambridge took the lead in this shift, largely because of the domination of its curriculum by Newtonian mathematics. Practice in Oxford began to converge in the 19th century, but oral testing was retained into the 20th century. Four factors are identified as crucial in the oral-written shift: the move from group socio-moral to individual cognitive assessment in the later 18th century; the differential difficulty of oral testing in different subjects; the impact of increased student numbers; the internal politics of Oxford and Cambridge.
The history of the mathematical tripos, about 1800 in Cambridge, England, can be used to highlight the influence of a number of examination pressures on the question format used in the examination: it should leave as little room for subjectivity - meaning as little room for discussion about the scoring of the exam work - as possible. Because this choice of question format in its turn influences student strategic behavior, it forms itself a major pressure on the format and content of the curriculum.

F. V. Edgeworth (1890). The element of chance in competitive examinations. Journal of the Royal Statistical Society, 53, 460-475, 644-663.

[search JSTOR for online version; http://links.jstor.org/sici?sici=0952-8385%28189009%2953%3A3%3C460%3ATEOCIC%3E2.0.CO%3B2-G]
About the statistics of independent assessments of the answers to open questions.
Edgeworth, one of the founding fathers of modern statistics, experimented somewhat with assessments. In this case (p. 469), "a piece of Latin prose which the editor of the Journal of Education kindly allowed me to insert in a recent number of that periodical." He solicited the cooperation of assessors willing to independently grade the work, assuming it to be part of the examination for the India Civil Service. The maximum grade should be 100.
the actual marks

45, 59, 67.5, 70, 70, 72.5, 75, 75, 75, 75, 75, 75, 77, 80, 80, 80, 80, 80, 82, 82, 85, 85, 87.5, 88, 90, 100, 100.
With the mean being 78, and the 'central value' [median, b.w.] of 78.5 taken as 'the presumably correct mark,'
"the probable error on either side of the correct mark is 5." This error is not much greater than that on two other experiments spoken of earlier, on work in geometry and in history. "It is interesting to observe that the two highest marks, 100, are (each) more than double the lowest mark, 45. An occurrence of this kind is not at all uncommon in the marking of advanced work, where there is room for great diversity of taste in the examiners."
There is much more material in this fascinating article, which starts with an analysis of what weight and height people estimated Edgeworth to have. This choice was predicated on the 19th century feeling that the physical measurements of length and weight were showpieces of scientific precision.
Edgeworth was perfectly aware of the arbitrariness of calling the mean or median score in a large set of independently obtained scores the 'presumably correct' score.
p. 461: "If anyone denies that the mean of a great number of judgments constitutes a true standard of taste, he is free to indulge his own philosophical views. He may regard the assumption here made as a mere working hypothesis which is useful in solving questions like the following: How far is it likely that any two examiners will differ from each other in their numerical estimate of the same work? [etcetera]"
There is something paradoxical in trying to quantify error; Edgeworth refers the reader to his more technical paper on the subject, Problems and probabilities, in the 1890 London Philosophical Magazine [not in JSTOR].
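The summary figures can be recomputed from the marks as listed above. This is a minimal sketch using Python's standard statistics module. It uses only the 27 marks as transcribed in this text, which may not be the complete set Edgeworth worked with.

```python
from statistics import mean, median

# Marks as transcribed in the list above.
marks = [45, 59, 67.5, 70, 70, 72.5, 75, 75, 75, 75, 75, 75, 77,
         80, 80, 80, 80, 80, 82, 82, 85, 85, 87.5, 88, 90, 100, 100]

mean_mark = mean(marks)
median_mark = median(marks)

print(f"number of marks: {len(marks)}")
print(f"mean:   {mean_mark:.1f}")
print(f"median: {median_mark:.1f}")
print(f"highest / lowest: {max(marks) / min(marks):.2f}")
```

For the list as given, the mean agrees with the reported 78 and the highest mark is indeed more than double the lowest. The median comes out at 80 rather than 78.5, a discrepancy that may point to a mark missing from the list as transcribed here.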

Edward L. Thorndike (1904). Theory of mental and social measurements. New York: The Science Press.

Edward Thorndike - son Robert is the editor of the second edition (1971) of Educational Measurement - here tries to establish psychological and educational testing as measurement comparable to that used in the physical sciences. The book does not present examples of test items; instead it presents techniques to tabulate, plot, analyze and correlate item scores and test scores. The reliability of measures is important to Thorndike; with him it is the reliability of the mean of observed values, not yet the reliability as defined in psychometrics. Thorndike emphasizes accurate measurement, neglecting issues such as 'what is it that we would like to measure, to what purpose should we measure it, what techniques are suitable to implement measurements true to that purpose?' To him, anything goes as long as it results in well-behaved scores that lend themselves to statistical treatment. Is this a characteristic of psychometrics also?
This is the first American publication on 'educational measurement,' coining the term itself, setting the tone and the direction of developments to come for at least a century. Psychological measurement also has been shaped by Thorndike and the likes of him. Psychometrics, therefore, finds its strongest root in the work of Edward Thorndike.
Psychometrics' faults, therefore, could be blamed on Thorndike, as has been done by many writers, the most incisive studies probably being those by Joel Michell. See for example:
Joel Michell (2000). Normal Science, Pathological Science and Psychometrics. Theory & Psychology, 10, 639-667.

OBJECTIONABLE (1899, 1912)

How many thumbs have 6 boys?

How many days in the year, leaving out December?

Cited in Cronbach and Patrick Suppes, 1969, p. 107, from Charles E. White (1899). Number lessons, a book for second and third year pupils. Boston: Heath, p. 22; and James C. Byrnes, Julia Richman and John Roberts (1912). The pupil's arithmetic, primary book, part one. New York: Macmillan, p. 39.

IRRELEVANT (1912)

Hudson discovered the Hudson River in ________; Fulton sailed the first steamboat on that river in _______. How many years between those events?
Make up a similar problem about Washington and Columbus.
Do the same, using Washington and Lincoln.
Do the same, using Columbus and Hudson.

Cited in Cronbach and Patrick Suppes, 1969, p. 107, from James C. Byrnes, Julia Richman and John Roberts (1912). The pupil's arithmetic, primary book, part one. New York: Macmillan, p. 36.

ADEQUATE (1925)

If 12 is ¼ of a number, what is the number?

I am thinking of a certain number. Half of the number of which I am thinking is 5. Tell me the number.

Cited in Cronbach and Patrick Suppes, 1969, p. 107, from Joseph C. Brown and Albert C. Eldredge (1925). The Brown-Eldredge arithmetics, grade 5. Chicago: Row, Peterson and Co, p. 94

The nonsense of 'How many thumbs have 6 boys' illustrates the kind of questioning at the turn of the 19th into the 20th century. What lies between the second and third boxes, the former an example from 1912, the latter from 1925, is the publication of Edward L. Thorndike (1922). The psychology of arithmetic. New York: Macmillan. This publication caused a landslide in arithmetic teaching and testing; it ended the counting of thumbs, as well as the fancy questioning still figuring in 1912. The principle to be crisp and clear, and to avoid window dressing, might not have done the trick at that time; Thorndike used some powerful psychology to move the minds of textbook writers.

WORDS

In 1912 it was not unusual to use words like committee, insurance, charity, premises, installment, treasury in test items. In 1925, after Thorndike's book appeared, "there are fewer hard words and there is a noticeable attempt to use the same words repeatedly in word problems. Thus, a majority of verbal problems for the first year concern familiar objects: fish, rabbits, puppies, boys, girls, and various kinds of fruit."

Cronbach and Patrick Suppes, 1969, p. 108

E. L. Thorndike (1922). The psychology of arithmetic. New York: Macmillan.

Read Patrick Suppes (1982) on the tremendous influence of Thorndike on the American arithmetic curriculum: On the effectiveness of educational research.

Lee J. Cronbach and Patrick Suppes (Eds) (1969). Research for tomorrow's schools: Disciplined inquiry for education. London: Collier-Macmillan Limited.

In Ch. 3: Some claims of significant inquiry: Thorndike's impact on the teaching of arithmetic, pp. 96-110.
Uses Florence A. Yeldham (1936). The teaching of arithmetic through 400 years, 1535-1935. London: Harrap. (She also wrote The story of reckoning in the Middle Ages, 1926.)

Daniel Starch (1916). Educational measurements. Macmillan.

The early developments in the 20th century have definitively shaped what almost a century later still is called 'educational measurement.'
The objective of people like Starch, and the politicians funding them, was to develop instruments and methods to 'measure' individual differences between students. Not because the educational climate was that competitive, it was not at the time, but because of the fascination of psychologists with measuring individual differences in arithmetic, writing, spelling, length (in Ben Wood, not in Starch), invariably called 'ability in arithmetic', 'ability in writing' etc. The undertaking could not have been less connected to whatever it is that education is for. Instructional objectives and processes seem to have been taken for granted. The measurements come first, they will teach us about education: "... it is hoped that more general application of educational measurements will create a deeper scientific interest in educational matters." (p. 1). The Starch philosophy very strongly is that of 'to measure is to know,' see his introductory statements on p. 2. The psychologist can do much, much better what teachers are doing in marking school work, therefore the first chapter of Starch is about marks as measures of school work; his advice is that their distribution should be bell-shaped. Recognize something here? I will have to do some research on this, of course. For the time being, remember that our preferred ways to question students might be seriously infected by intelligence testing technology. If you think I am crazy, see Robert Sternberg on the SAT I, definitely an intelligence test in his eyes (and he has the sharpest eyes in the field).
The word 'objective' is used in the third sentence of chapter one. Otherwise, 'objective examinations' and 'new type tests' are yet to come. The focus of Starch is on the grading of school work, and how using objective measurements can improve the quality - reliability - of grading. There are only two books on the subject mentioned in the bibliography: C. T. Gray (1913) 'Variations in the grades of high school pupils,' Warwick and York; and F. J. Kelly (1914) 'Teachers' marks.' Columbia University. There is no mention of developments in the field of individual differences in intelligence, but undoubtedly the standardization techniques used by Starch have been taken from the laboratory of Edward L. Thorndike and others in the intelligence testing field.
"These tests will furnish tools for evaluating quantitatively the results of methods and factors in teaching and learning, and for examining various aspects of efficiency of instruction and administration of school systems." (pp. 3-4). Sounds familiar.
Chapters IV to XIV are about abilities in reading, writing, spelling, grammar, arithmetic, composition, drawing, Latin, German, French and physics, containing numerous examples of questions and even whole tests, and a kind of norm tables for them.
The questions themselves do not look revolutionary to us: they are mainly of the short answer type. Nevertheless, many questions ask students to choose whether or not something is the case, a correct word in a sentence, etcetera. An occasional multiple choice item is present, like (p. 59): "Here are some names of things. Put a line around the name of the one which is most nearly round in every way like a ball." "saucer - teacup - orange - pear - arm."

measurement of ability in reading: speed

Once there was a little girl who lived with her mother.
They were very poor.
Sometimes they had no supper.
Then they went to bed hungry.
One day the little girl went into the woods.
She wanted sticks for the fire.
She was so hungry and sad!
"Oh, I wish I had some sweet porridge!" she said.
"I wish I had a pot full for mother and me. We could eat it all up."
Just then she saw an old woman with a little black pot.
She said, "Little girl, why are you so sad?"
"I am hungry," said the little girl.

Measurement: the number of words read per second. Available time: 30 seconds. "Then have the pupils make a mark with a pencil after the last word read to indicate how far they have read."

Reading Test Series A, Test no. 1 (grade 1). Starch, 1916, p. 22.
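Starch's speed measure is simple enough to write down. The following is a minimal sketch, assuming the pupil's pencil mark is recorded as the number of words read up to that point. The function name and the example figure of 45 words are hypothetical and do not come from Starch.

```python
def words_per_second(words_read: int, seconds: float = 30.0) -> float:
    """Reading speed as defined in the box above: words read / time allowed."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return words_read / seconds

# Hypothetical pupil who marked the 45th word after the 30 seconds allowed.
speed = words_per_second(45)
print(f"{speed:.2f} words per second")
```

The whole 'measurement' is a single division, which shows how little of the reading process enters into the score.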

measurement of ability in reading: comprehension

[the same text as in the example above]

"Then [after making the mark of the last word read] have them turn the blanks over immediately and write on the back all that they remember having read." Measurement: "Comprehension is determined by counting the number of words written which correctly reproduce the thought. The written account is carefully read, and all words which either reproduce the ideas of the test passage or repeat ideas previously recorded are crossed out. The remaining words are counted and used as the index of comprehension."

Reading Test Series A, Test no. 1 (grade 1). Starch, 1916, p. 22.

To count is to know. Indeed, counting errors had for centuries been the favored method of scoring - and grading - the pupils' achievements. All of this looks quite simple-minded to the present-day reader. Remember that these standardized tests were used to hold schools accountable! Starch claims 75,000 students have been tested this way. It is so damn easy to manipulate test scores here!
English vocabulary test
1. acta
2. agriculturist
3. ambulacrum
4. abnormal
5. Araneida
6. assagai
7. awaft
  ....... [two lists of 100 words]
The pupil marks the words he or she is sure to know the meaning of, or writes the meaning after the words he or she is not sure of. "A pupil's score is the average number of words designated correctly in the two lists."
"Each list of 100 was selected by taking the first word on every 23d page of Webster's New International Dictionary (1910)."

Starch, 1916, p. 38.
Torturing pupils is easily accomplished this way. It is really fantastic to see the ideology of 'objectivity' in measurement at work here! Of course this test 'measures' vocabulary, albeit in an idiosyncratic way. No wonder Starch finds an enormous overlap in scores of different grade groups (p. 41). So much for 'objectivity.' This is just mindless counting in the name of assessment of ability. There is not the slightest inclination here to turn testing into assessment for learning. What is the link to education here? Starch philosophizes "Should the pupils be reclassified into higher or lower classes according to their capacities?" Makes one shiver.
reading test

Look at each word and write the letter F under every word that means a flower.
Then look at each word again and write the letter N under every word that means a boy's name.
[and six more such instructions]

4. camel, samuel, kind, lily, cruel

5. cowardly, dominoes, kangaroo, pansy, tennis

[and seven more series]

Thorndike Reading Scale A Visual Vocabulary, as reproduced in Starch, 1916, p. 43-4. Starch just gives this test, as well as Scale Alpha for Measuring the Understanding of Sentences, without its scoring etcetera. Thorndike asks for the month of birth - Starch did not do so in his own reading tests. Month of birth is absolutely necessary for proper interpretation of ability test results!

Judged on face validity, one would say this is not a reading test but a stress test.
silent reading test

Count the letters in each of the words written below. You will find that pumpkin has seven letters, and thanks has six letters. One of the words has five letters in it. If you can find the one having five letters, draw a line around it.

breeze thanks yours pumpkin duck

Test devised by J. F. Kelly, as reproduced in Starch, 1916, p. 48-58 (three tests of 16 items each).

This example shows how much confusion there was at the time about how to measure what. Even Starch criticises this kind of question in a test of reading comprehension.
In modern terms, what is happening here is that the test designer is taking literally the warning to keep questions crisp and clear, because otherwise they will turn into items of verbal 'ability.' Kelly deliberately tries to confuse pupils and believes the resulting item score is an indication of the ability to comprehend text. It is the authoritarian test designer's fallacy: to measure, anything goes as long as it results in reliable scores. Is this too harsh a judgment? I do not think so. After all, mankind already had lots - many centuries at least - of experience in assessing pupils. This kind of aberration should never have been allowed by the professionals and the parents in the field.

Cyril Burt (1921/1933). Mental and scholastic tests. London: P. S. King. fourth impression of the original text.

The third part of this book, pages 257-338, is about tests of educational attainments. Burt developed these tests and their norms tables, both reproduced in the book, himself. The main text presents Burt's experiences, analyses, case studies etc. concerning London school children.

J. R. Gerberich (1956). Specimen objective test items. A guide to achievement test construction. Longmans.

Testing skills - testing knowledges - measuring concepts - measuring understandings - measuring applications - determining activities - evaluating appreciations - evaluating attitudes - evaluating interests - evaluating adjustments - complex measurement and evaluation.
The example items are designed for achievement measurement, are intended for pupil use, and can be constructed by the teacher.
This book probably has not made use of the cognitive taxonomy of Bloom et al. (1956), because the published sources used did not have that taxonomy available.
The examples are not designed by the author, but taken from published printed sources.

more historical literature

G. M. Wilson and Kremer J. Hoke (1920). How to measure. New York: The Macmillan Company. two fold-out grading scales (spelling, drawing).

The authors evidently assume this short title makes it absolutely clear what the book is about: standardized tests for use in the classroom. Labels like 'standardized' and 'scientific' are used for all kinds of tests boasting tables of scores for tens of thousands of students in several grades. These tests definitely are achievement tests, achievement being generalized over, for example, spelling. Today this kind of test probably would be called a progress test. Be warned that these so-called standardized tests shared their standardization philosophy with psychological tests: exactly the same test was to be used again and again, year in, year out. There was, therefore, a tremendous risk of teachers teaching to the test in the most literal sense.
The book allows one to hypothesize what mental model of assessment in education people like Wilson and Hoke held. Reconstruction of this mental model is of some importance in understanding how and why assessment in education in the 20th century evolved in the direction it did. It will certainly help to gain perspective on the dangers of a lack of construct validity in what came to be known as the 'new-type' of tests, whether teacher-made or standardized.

G. M. Ruch (1924). The improvement of the written examination. New York: Scott, Foresman and Company.

This book is a program to exchange the essay examination for the new objective type test. The test may be teacher-made.
p. 10: "The next step is obviously that of refining the examination to a point where it will begin to approach the accuracy of measurement in physics, chemistry, and the quantitative sciences generally." This is a momentous misconception, the kernel of Ruch's mental model of assessment.
The book, then, attempts to carry out the program hinted at in the p. 10 citation. The list of quality criteria on p. 11 looks familiar, except for the fourth item which expresses the contrast with labor intensive scoring and grading of essay-type tests:
1. Validity
2. Reliability
3. Objectivity
4. Ease of administration and scoring
5. Standards.

G. M. Ruch and George D. Stoddard (1927). Tests and measurements in high school instruction. New York: World Book Company.

Part One: Status, uses, limitations, and selection of tests in secondary school instruction
Part Two: Descriptions of high school tests by subjects.
- These are commercially available tests.
- e.g. Geometry: Minnick Geometry Tests - Schorling-Sanford Achievement Test in Plane Geometry - Hawkes-Wood Plane Geometry Examination - Geometry Tests of Thurstone Vocational Guidance Tests
- Characteristics of these tests are described, no individual items are reproduced
Part Three: Informal objective examination methods
- The smallest of the four parts. Contains only a handful of examples of objective test questions.
- There is some discussion of the guessing problem, though, summing up the debate as of 1927
Part Four: The construction of educational and mental tests
- A rather technical exposition of validity and the derivation of norms. It is evident the authors mean business: tests should be standardized tests.

C. W. Odell (1928). Traditional examinations and new-type tests. New York: The Century Co. (Selected and annotated bibliography p. 439-469)

'New-type tests' are teacher-made tests consisting of new-type questions - or objective questions - and that includes short-answer items. Odell writes to instruct and assist the teacher in writing her own tests. It is not a treatise on standardized tests, such as the Wilson and Hoke book.

G. M. Ruch (1929). The objective or new-type examination: an introduction to educational measurement. Chicago: Scott, Foresman.

Chapters: Points of view - The criteria of a good test or examination - Objections to the traditional examination - Advantages and limitations of objective examinations - Students' attitudes toward examinations (a rather short chapter) - Relative values of standardized and non-standardized tests - The building of an objective test or examination - Illustrative types of objective tests - Selected complete examinations (p. 213-264) - Rules for drafting objective test items - Experimental studies of new-type examinations - Chance and guessing in recognition tests - The negative and other suggestion effects in the true-false tests - Examinations, marks, and marking systems - Statistical problems related to measurement - General bibliography (p. 447-471, 377 items in all)
No mention of the Starch 1916 monograph, nor - in the first section - of any other source older than 1921.

G. M. Ruch and G. A. Rice (1930). Specimen objective examinations. A collection of examinations awarded prizes in a national contest in the construction of objective or new-type examinations, 1927-1928. Chicago: Scott, Foresman.

Robert L. Ebel (1965). Measuring educational achievement. Englewood Cliffs, New Jersey: Prentice-Hall.

Ch. 2. What should achievement tests measure? - Ch 4. The characteristics and uses of essay tests - Ch 5. How to use true-false tests - Ch 6. How to write multiple-choice test items

Paul Black (2001). Dreams, Strategies and Systems: portraits of assessment past, present and future. Assessment in Education: Principles, Policy & Practice, 8, 65-85.

abstract Systems of testing and assessment are shaped in part by personalities and institutions who pursue research insights and technical innovations. Out of these they fashion 'dreams' which drive their efforts to improve these systems. This paper develops this perspective, whilst acknowledging that it overlaps with and complements analyses of assessment systems from social and cultural perspectives. Four different examples are considered. Two from past and current history are the growth, from an origin in IQ testing, of standardised multiple choice tests and the dream of raising standards by external testing. The other two, nascent with their influence yet to be determined, are the dream of improvement by formative assessment and the dream that recent developments in psychology can provide a basis for new and improved assessment practices.
Do I have a copy available, yet? Someone got one for me?

Lorrie Shepard (2000). The role of assessment in a learning culture. Educational Researcher, 29, no. 7, 1-14. http://edtech.connect.msu.edu/aera/pubs/er/arts/29-07/shep01.htm http://edtech.connect.msu.edu/aera/pubs/er/pdf/vol29_07/AERA290702.pdf [dead links? 1-2009]

"Looking at any collection of tests from early in the century [e.g., Ruch, 1929], one is immediately struck by how much the questions emphasized rote recall. To be fair, at the time, this was not a distortion of subject matter caused by the adoption of objective-item formats. One hundred years ago, various recall, completion, matching, and multiple-choice test types, along with some essay questions, fit closely with what was deemed important to learn. However, once curriculum became encapsulated and represented by these types of items, it is reasonable to say that these formats locked in a particular and outdated conception of subject matter."

2.8 literature

Richard C. Anderson (1972). How to construct achievement tests to assess comprehension. Review of Educational Research, 42, 145-170.

APA (1966/1974/1985/1999). Standards for educational and psychological tests. Washington, D.C.: American Psychological Association.

S. M. Barnett and S. J. Ceci (2002). When and where do we apply what we learn? A taxonomy for far transfer. Psychological Bulletin, 128, 612-637.

abstract Despite a century's worth of research, spanning over 5,000 articles, chapters, and books, the claims and counterclaims surrounding the question of whether far transfer occurs are no nearer resolution today than at the turn of the previous century. We argue the reason for this confusion is a failure to specify various dimensions along which transfer can occur, resulting in comparisons of 'apples and oranges'. We provide a framework that describes nine relevant dimensions (6 for context and 3 for content), and show that the literature can productively be classified along these dimensions, with each study situated at the intersection of various dimensions. Estimation of a single effect size for far transfer is misguided in view of this complexity. Against the backdrop of this taxonomic framework, the past 100 years of research shows that evidence for transfer under some conditions is substantial but critical conditions for many key questions are as yet untested.

June Barrow-Green (1999). 'A corrective to the spirit of too exclusively pure mathematics': Robert Smith (1689-1768) and his prizes at Cambridge University. Annals of Science, 56, 271-316.

Evert W. Beth (1955). Semantic entailment and formal derivability. Mededelingen van de Koninklijke Nederlandse Akademie van Wetenschappen, Afdeling Letterkunde, N. R. Vol. 18, no. 13 (Amsterdam), pp. 309-342, reprinted 1961. Reprinted in Jaakko Hintikka (1969). The philosophy of mathematics (pp. 9-41). Oxford University Press.

John H. Bishop (2004). Drinking from the Fountain of Knowledge: Student Incentive to Study and Learn-Externalities, Information Problems and Peer Pressure. Cornell, Center for Advanced Human Resource Studies Working paper 04-15 pdf

from the abstract This paper reviews an emerging economic literature on the effects of and determinants of student effort and cooperativeness and how putting student motivation and behavior at center of one's theoretical framework changes one's view of how schools operate and how they might be made more effective. (...) Student effort, engagement and discipline vary a lot within schools, across schools and across nations and have significant effects on learning. Higher extrinsic rewards for learning are associated with the taking of more rigorous courses, teachers setting higher standards and more time devoted to homework. Taking more rigorous courses and studying harder increase student achievement. Post World War II trends in study effort and course rigor are positively correlated with achievement trends.
Even though greater rigor improves learning, parents and students prefer easy teachers. They pressure tough teachers to lower standards and sign up for courses taught by easy graders. Curriculum-based external exit examinations improve the signaling of academic achievement to colleges and the labor market and this increases extrinsic rewards for learning. Cross section studies suggest that CBEEEs result in greater focus on academics, more tutoring of lagging students, more homework and higher levels of achievement. Minimum competency examinations do not have significant effects on learning or dropout rates but they do appear to have positive effects on the reputation of high school graduates. As a result, students from MCE states earn significantly more than students from non-MCE states and the effect lasts at least eight years.

Denny Borsboom (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge: Cambridge University Press.

This book is the commercial edition of his 2003 dissertation. That dissertation, in its turn, rests on a series of publications in the Psychological Review, available for download at Borsboom's website.

Denny Borsboom, Gideon J. Mellenbergh and Jaap van Heerden (2004). The concept of validity. Psychological Review, 111, 1061-1071. pdf

Case and Swanson (2001). Constructing Written Test Questions For the Basic and Clinical Sciences. National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104 http://www.nbme.org/PDF/2001iwg.pdf [dead link? 1-2009].

Gives lots of examples of well-designed items.
p. 117: Appendix A. The Graveyard of NBME Item Formats.
p. 129: Appendix B. Sample Item-Writing Templates, Items, Lead-Ins, and Option Lists For the Basic and Clinical Sciences

Audrey B. Champagne, Richard F. Gunstone and Leopold E. Klopfer (1985). Instructional consequences of students' knowledge about physical phenomena. In Leo H. T. West and A. Leon Pines: Cognitive structure and conceptual change (pp. 61-90). Academic Press.

Paul Davis Chapman (1988). Schools as sorters. Lewis M. Terman, Applied Psychology, and the Intelligence Testing Movement, 1890-1930. New York: New York University Press.

Jannette Collins (2006). Education techniques for lifelong learning: writing multiple-choice questions for continuing medical education activities and self-assessment modules. Radiographics, Mar-Apr 26(2), 543-51. pdf

"The article provides an overview of established guidelines for writing effective MCQs, a discussion of writing appropriate educational objectives and MCQs that match those objectives, and a brief review of item analysis."

Peter J. Congdon and Joy McQueen (2000). The Stability of Rater Severity in Large-Scale Assessment Programs. Journal of Educational Measurement, 37, p. 163

abstract The purpose of this study was to investigate the stability of rater severity over an extended rating period. Multifaceted Rasch analysis was applied to ratings of 16 raters on writing performances of 8,285 elementary school students. Each performance was rated by two trained raters over a period of seven rating days. Performances rated on the first day were re-rated at the end of the rating period. Statistically significant differences between raters were found within each day and in all days combined. Daily estimates of the relative severity of individual raters were found to differ significantly from single, on-average estimates for the whole rating period. For 10 raters, severity estimates on the last day were significantly different from estimates on the first day. These findings cast doubt on the practice of using a single calibration of rater severity as the basis for adjustment of person measures.

Lawrence T. DeCarlo (2005). A Model of Rater Behavior in Essay Grading Based on Signal Detection Theory. Journal of Educational Measurement, 42, 53- . pdf

from the abstract SDT offers a basis for understanding rater behavior with respect to the scoring of constructed responses, in that it provides a theory for psychological processes underlying the raters' behavior. (...) Results from a simulation study of a 5-class SDT model with eight raters are also presented.

E. J. Dijksterhuis (1924). Val en worp. Een bijdrage tot de geschiedenis der mechanica van Aristoteles tot Newton. Groningen: Noordhoff.

E. J. Dijksterhuis (1950/1961). The mechanization of the world picture. London: Oxford University Press.

Stillman Drake (1990). Galileo: Pioneer scientist. University of Toronto Press.

George Engelhard, Jr (1994). Examining Rater Errors in the Assessment of Written Composition With a Many-Faceted Rasch Model. Journal of Educational Measurement, 31, p. 93 - . [and 33, p. 115]

abstract This study describes several categories of rater errors (rater severity, halo effect, central tendency, and restriction of range). Criteria are presented for evaluating the quality of ratings based on a many-faceted Rasch measurement (FACETS) model for analyzing judgments. A random sample of 264 compositions rated by 15 raters and a validity committee from the 1990 administration of the Eighth Grade Writing Test in Georgia is used to illustrate the model. The data suggest that there are significant differences in rater severity. Evidence of a halo effect is found for two raters who appear to be rating the compositions holistically rather than analytically. Approximately 80% of the ratings are in the two middle categories of the rating scale, indicating that the error of central tendency is present. Restriction of range is evident when the unadjusted raw score distribution is examined, although this rater error is less evident when adjusted estimates of writing competence are used

N. Frederiksen (1984). The real test bias. Influences of testing on teaching and learning. American Psychologist, 39, 193-202.

D. A. Frisbie and D. F. Becker (1991). An analysis of textbook advice about true-false tests. Applied Measurement in Education, 4, 67-83. The publisher wants profit from his pdf-files.

A. D. de Groot (1970). Some badly needed non-statistical concepts in applied psychometrics. Nederlands Tijdschrift voor de Psychologie, 25, 360-376.

[In time I will provide the main points from the article here, if not the whole article as a pdf-document.]
For the somewhat related concept of 'universal design' as applied to assessment, see Thompson, Johnstone and Thurlow (2002). html

A. D. de Groot and R. F. van Naerssen (Eds.) (1969). Studietoetsen, construeren, afnemen, analyseren. Den Haag: Mouton.

Thomas M. Haladyna (1999, 2nd ed.). Developing and validating multiple-choice test items. Erlbaum. (2004, 3rd ed.)

Thomas Haladyna, Steven M. Downing, and Michael C. Rodriguez (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15, 309-334. pdf

A taxonomy of 31 multiple-choice item-writing guidelines was validated through a logical process that included two sources of evidence: the consensus achieved from reviewing what was found in 27 textbooks on educational testing and the results of 27 research studies and reviews published since 1990. This taxonomy is mainly intended for classroom assessment. Because textbooks have potential to educate teachers and future teachers, textbook writers are encouraged to consider these findings in future editions of their textbooks. This taxonomy may also have usefulness for developing test items for large-scale assessments. Finally, research on multiple-choice item writing is discussed both from substantive and methodological viewpoints.

Ph. Hartog and E. C. Rhodes (1936, 2nd ed.). An examination of examinations. International Institute Examinations Enquiry. London: Macmillan.

Jaakko Hintikka (2007). Socratic epistemology. Explorations of knowledge-seeking by questioning. Cambridge University Press.

Banesh Hoffmann (1962/1978). The tyranny of testing. Crowell-Collier. Reprint 1978. Westport, Connecticut: Greenwood Press.

Mainly about MC tests, and especially quality defects in individual items.

B. Huot (1990). The literature of direct writing assessment: major concerns and prevailing trends. Review of Educational Research, 60, 237-263.

David Klahr and Junlei Li (2005). Cognitive Research and Elementary Science Instruction: From the Laboratory, to the Classroom, and Back. Journal of Science Education and Technology, 14,

abstract Can cognitive research generate usable knowledge for elementary science instruction? Can issues raised by classroom practice drive the agenda of laboratory cognitive research? Answering yes to both questions, we advocate building a reciprocal interface between basic and applied research. We discuss five studies of the teaching, learning, and transfer of the "Control of Variables Strategy" in elementary school science. Beginning with investigations motivated by basic theoretical questions, we situate subsequent inquiries within authentic educational debates—contrasting hands-on manipulation of physical and virtual materials, evaluating direct instruction and discovery learning, replicating training methods in classroom, and narrowing science achievement gaps. We urge research programs to integrate basic research in "pure" laboratories with field work in "messy" classrooms. Finally, we suggest that those engaged in discussions about implications and applications of educational research focus on clearly defined instructional methods and procedures, rather than vague labels and outmoded "-isms."

Ellen Condliffe Lagemann (2000). An elusive science: The troubling history of education research. University of Chicago Press.

Among others, sections 2 and 3: 2. Specialization and Isolation: Education Research Becomes a Profession - John Dewey's Youth and Early Career - Dewey at the Laboratory School - A Creative Community: The Social Sources of Dewey's Thought - Edward L. Thorndike: "Conquering the New World of Pedagogy" - Thorndike and Teachers College: A Reciprocal Relationship - Dewey Displaced: Charles Hubbard Judd at the University of Chicago - 3. Technologies of Influence: Testing and School Surveying - The History and Philosophy of Education: From Center to Periphery - Dignity amidst Disdain: Ellwood Patterson Cubberley and the First Generation of Scholars of School Administration - Leonard P. Ayres, the Russell Sage Foundation, and the School Survey Movement - The Cleveland Survey - Lewis M. Terman and the Testing Movement - Consensus and Community: A Science for School Administration
Les McLean http://leo.oise.utoronto.ca/~lmclean/elureview.html [dead link? 1-2009]
David C. Berliner (1993). The 100-year journey of educational psychology. From interest, to disdain, to respect for practice. In Fagan and VandenBos: Exploring applied psychology: Origins and critical analysis. Washington, DC: American Psychological Association. html

Christian Lebiere and John R. Anderson (1998). Cognitive arithmetic. In John R. Anderson, Christian Lebiere, and others: The atomic components of thought (297-342). London: Lawrence Erlbaum. questia

Christian Lebiere (1998). The Dynamics of Cognition: An ACT-R Model of Cognitive Arithmetic. Dissertation, Carnegie Mellon University. pdf

Frederick M. Lord and Melvin R. Novick (1968). Statistical theories of mental test scores. Addison-Wesley.

Still the handbook, although there have of course been many later developments in the advanced techniques. For these see, for example:
De Gruijter and Van der Kamp (2005). Statistical test theory for education and psychology. http://website.leidenuniv.nl/~gruijterdnmde/statistical%20test%20theory%20for%20education%20and%20psychology.pdf [dead link? 1-2009]

Jose P. Mestre (Ed.) (2005). Transfer of learning: from a modern multidisciplinary perspective. San Francisco: Sage. comment and summary

Kathryn M. Olesko (1991). Physics as a calling. Discipline and practice in the Königsberg Seminar for Physics. Ithaca: Cornell University Press.

W. James Popham (2005). America's 'failing' schools. How parents and teachers can cope with No Child Left Behind. Routledge.

N. Sanjay Rebello, Dean A. Zollman, Alicia R. Allbaugh, Paula V. Engelhardt, Kara E. Gray, Zdeslav Hrepic and Salomon F. Itza-Ortiz (2005). Dynamic Transfer: A Perspective from Physics Education Research. pdf. To appear in Jose P. Mestre: Transfer of learning: from a modern multidisciplinary perspective (p. 217-250). San Francisco: Sage.

Michael C. Rodriguez (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, summer, 3-13.

abstract Multiple-choice items are a mainstay of achievement testing. The need to adequately cover the content domain to certify achievement proficiency by producing meaningful precise scores requires many high-quality items. More 3-option items can be administered than 4- or 5-option items per testing time while improving content coverage, without detrimental effects on psychometric quality of test scores. Researchers have endorsed 3-option items for over 80 years with empirical evidence—the results of which have been synthesized in an effort to unify this endorsement and encourage its adoption.
The pdf is for sale for a ridiculous amount.

Gale H. Roid and Thomas M. Haladyna (1982). A technology for test-item writing. London: Academic Press.

Roid and Haladyna has been reviewed by Jason Millman: Writing test items scientifically. Contemporary Psychology, 1982, 27, 966-7; and Anthony J. Nitko, Journal of Educational Measurement, 1984, 21, 201-204.

Sandra J. Thompson, Christopher J. Johnstone and Martha L. Thurlow (2002). Universal Design Applied to Large Scale Assessments. The National Center on Educational Outcomes html

Charles Tilly (2006). Why? What happens when people give reasons ... and why. Princeton University Press.

Lieven Verschaffel, Brian Greer and Erik de Corte (2000). Making sense of word problems. Lisse: Swets & Zeitlinger.

See the wordproblems.htm page for notes.

Ludger Woessmann (2004). The effect heterogeneity of central exams: Evidence from TIMSS, TIMSS-Repeat and PISA. CESifo Working Paper No. 1330 pdf.

abstract This paper uses extensive student-level micro databases of three international student achievement tests to estimate heterogeneity in the effect of external exit exams on student performance along three dimensions. First, quantile regressions show that the effect tends to increase with student ability. But it does not differ substantially for most measured family-background characteristics. Second, central exams have complementary effects to school autonomy. Third, the effect of central exit exams increases during the course of secondary education, and regular standardised examination exerts additional positive effects. Thus, there is substantial heterogeneity in the central exam effect along student, school and time dimensions.
"More and more evidence is accumulating that the existence of central exit exams is strongly positively related to students' academic performance (cf. Bishop, 2004, for a survey pdf)." (p. 1)

more literature chapter 2

These items are as yet on my 'to do' list: they are mentioned here, but not used in the above text yet.

Mark Raymond, Bob Neiers, Jerry B. Reid (2003). Test-item development for radiologic technology. The American Registry for Radiologic Technology.

A manual. Free download: pdf

M. Birenbaum and K. K. Tatsuoka (1987). Open-ended versus multiple-choice formats - It does make a difference for diagnostic purposes. Applied Psychological Measurement, 11, 329-341.

Tatsuoka, 1993, in Bennett and Ward, p. 107: "Birenbaum and K. Tatsuoka (1987) examined the effect of response format on diagnosis and concluded that multiple-choice items may not provide appropriate information for identifying students' misconceptions."

Charles L. Briggs (1986). Learning how to ask. A sociolinguistic appraisal of the role of the interview in social science research. Cambridge: Cambridge University Press.

James G. Holland, Carol Solomon, Judith Doran and Daniel A. Frezza (1976). The analysis of behavior in planning instruction. Reading, Massachusetts: Addison-Wesley.

In fact, a book on test item writing!

Lynn Arthur Steen (Ed.) (2006). Supporting Assessment in Undergraduate Mathematics. The Mathematical Association of America. pdf

For additional online case studies see this page
Bonnie Gold, Sandra Z. Keith, and William A. Marion (1999). MAANotes #49: Assessment Practices in Undergraduate Mathematics. This book is available in the form of a host of separate pdf-files on this site. "The book is unique among assessment books in representing the point of view of mathematicians exploring and examining methods of learning in their field."
Yes, both have also been published as printed books; they might be available as such in your institutional library.

Sidney H. Irvine and Patrick C. Kyllonen (Eds) (2002). Item generation for test development. Erlbaum.

sales talk Since the mid-80s several laboratories around the world have been developing techniques for the operational use of tests derived from item-generation. According to the experts, the major thrust of test development in the next decade will be the harnessing of item generation technology to the production of computer developed tests. This is expected to revolutionize the way in which tests are constructed and delivered.

In November 1998, the late Sam Messick, Sidney Irvine, and Patrick Kyllonen assembled a symposium, at ETS in Princeton, attended by the world's foremost experts in item-generation theory and practice. This book is a compilation of the papers presented at that meeting.

This book's goal is to present the major applications of cognitive principles in the construction of ability, aptitude, and achievement tests. It is an intellectual contribution to test development that is unique, with great potential for changing the ways tests are generated. It will be a publishing landmark in the history of test development. The intended market includes professional educators and psychologists interested in test generation.

Andrew Boyle (nd). Sophisticated Tasks in E-Assessment: What are they? And what are their benefits? London: Research and Statistics team, Qualifications and Curriculum Authority (QCA) pdf

Albert Burgos (2004). Guessing and gambling. Economics Bulletin, 4, No. 4 pp. 1-10. http://www.economicsbulletin.com/2004/volume4/EB-04D80001A.pdf

The Burgos case is that of multiple-choice testing where the student may either guess or leave unanswered the questions she is uncertain about or does not know the answer to. This is a problematic case where the student has partial knowledge and at the same time is risk averse: the achievement test becomes to some extent a test of personality. In the Netherlands this kind of situation is usually avoided by forcing students always to answer test items, if need be by guessing. In the US the GRE and the SAT follow different rules: the GRE counts the number correct (students should therefore mark all items), the SAT penalizes wrong answers (students may leave questions unmarked). Nevertheless, the article is quite insightful where it comes to problems of guessing on achievement test items, a problem that is of course not unique to the multiple-choice format.

abstract: Scoring methods in multiple-choice tests are usually designed as fair bets, and thus random guesswork yields zero expected return. This causes the undesired result of forcing risk averse test-takers to pay a premium in the sense of letting unmarked answers for which they have partial but not full knowledge. In this note I use a calibrated model of prospect theory [Tversky and Kahneman (1992, 1995)] to compute a fair rule which is also strategically neutral, (i.e. under partial knowledge answering is beneficial for the representative calibrated agent, while under total uncertainty it is not). This rule is remarkably close to an old rule presented in 1969 by Traub et al. in which there is no penalty for wrong answers but omitted answers are rewarded by 1/M if M is the number of possible answers.
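
The scoring rules mentioned here can be compared with a little arithmetic. The following is a minimal sketch, not taken from the Burgos article: it assumes a risk-neutral student who guesses uniformly among the options she has not yet eliminated, and it assumes the eliminated options are indeed wrong. It does not model prospect theory or risk aversion. The function names and the rule labels are hypothetical, chosen for illustration only.

```python
from fractions import Fraction


def expected_guess_score(m, eliminated, reward_correct, penalty_wrong):
    """Expected item score when guessing uniformly among the m - eliminated
    remaining options (assumption: the correct option is among them)."""
    remaining = m - eliminated
    p_correct = Fraction(1, remaining)
    return p_correct * reward_correct + (1 - p_correct) * penalty_wrong


def compare_rules(m, eliminated=0):
    """Return {rule: (expected score when guessing, score when omitting)}."""
    rules = {
        # number right: no penalty, no reward for omitting (GRE-style counting)
        "number right": (Fraction(1), Fraction(0), Fraction(0)),
        # fair bet: wrong answers cost 1/(M-1), so blind guessing yields zero
        "formula scoring": (Fraction(1), Fraction(-1, m - 1), Fraction(0)),
        # rule described in the abstract: no penalty, omitting earns 1/M
        "omit reward 1/M": (Fraction(1), Fraction(0), Fraction(1, m)),
    }
    return {
        name: (expected_guess_score(m, eliminated, right, wrong), omit)
        for name, (right, wrong, omit) in rules.items()
    }


if __name__ == "__main__":
    for eliminated in (0, 1, 2):
        print(f"M = 4, options eliminated = {eliminated}")
        for name, (guess, omit) in compare_rules(4, eliminated).items():
            print(f"  {name:16s} guess: {guess}   omit: {omit}")
```

Under these assumptions, blind guessing and omitting have the same expected score under both the fair-bet rule and the 1/M rule, while any partial knowledge makes answering the better choice in expectation. The two rules differ in what a risk averse student faces: a possible loss under the fair-bet rule, and only a forgone certain 1/M under the other.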

Steven M. Downing and Thomas M. Haladyna (Eds) (2006). Handbook of test development. Erlbaum. https://www.erlbaum.com/shop/tek9.asp?pg=products&specific=0-8058-5264-6 [dead link? 1-2009]

This book will probably be the standard on achievement test items and test items for years to come. Very American, state of the art; comparable material is available online in other forms, as indicated in this literature list, but here everything is conveniently brought together. Do not order blindly: the hardcover is unaffordable, the softcover is expensive.

Lucy Cheser Jacobs and Clinton I. Chase (1992). Developing and using tests effectively. A guide for faculty. San Francisco: Jossey-Bass.

William L. Kuechler and Mark G. Simkin (2003). How Well Do Multiple Choice Tests Evaluate Student Understanding in Computer Programming Classes? Journal of Information Systems Education html

abstract: Despite the wide diversity of formats with which to construct class examinations, there are many reasons why both university students and instructors prefer multiple-choice tests over other types of exam questions. The purpose of the present study was to examine this multiple-choice/constructed-response debate within the context of teaching computer programming classes. This paper reports the analysis of over 150 test scores of students who were given both multiple-choice and short-answer questions on the same midterm examination. We found that, while student performance on these different types of questions was statistically correlated, the scores on the coding questions explained less than half the variability in the scores on the multiple choice questions. Gender, graduate status, and university major were not significant. This paper also provides some caveats in interpreting our results, suggests some extensions to the present work, and perhaps most importantly in light of the uncovered weak statistical relationship, addresses the question of whether multiple-choice tests are "good enough."

Frederick M. Lord (1964). The effect of random guessing on test validity. Educational and Psychological Measurement, 24, 745-747. [This volume is not available in Leiden. I am still looking for a copy]

Robert Lukhele, David Thissen and Howard Wainer (1994). On the Relative Value of Multiple-Choice, Constructed Response, and Examinee-Selected Items on Two Achievement Tests. Journal of Educational Measurement, 31, 234.

abstract: Using analyses based on fitting item response models to data from the College Board's Advanced Placement exams in chemistry and United States history, we found that the constructed response portion of the tests yielded little information over and above that provided by the multiple-choice sections. These tests also allow examinees to select subsets of the constructed response items; we found that scoring on the basis of the selections themselves provided almost as much information as did scoring on the basis of the answers.

Geoff Norman (2002). The long case versus objective structured clinical examinations. BMJ, 324, 748-749 Editorial

V. Wass, R. Jones and C. van der Vleuten (2001). Standardized or real patients to test clinical competence? The long case revisited. Medical Education, 35(4), 321-325. abstract

Lambert W. T. Schuwirth and Cees P. M. van der Vleuten (2003). Written assessment. BMJ, 326, 643-645 (22 March). html pdf

abstract Some misconceptions about written assessment may still exist, despite being disproved repeatedly by many scientific studies. Probably the most important misconception is the belief that the format of the question determines what the question actually tests. Multiple choice questions, for example, are often believed to be unsuitable for testing the ability to solve medical problems. The reasoning behind this assumption is that all a student has to do in a multiple choice question is recognise the correct answer, whereas in an open ended question he or she has to generate the answer spontaneously. Research has repeatedly shown, however, that the question's format is of limited importance and that it is the content of the question that determines almost totally what the question tests.
Paragraphs about 'true or false' questions - 'single best option' multiple choice questions - multiple true or false questions [example] - 'short answer' open ended questions - essays - key feature questions [example] - extended matching questions [example]

David M. Williamson, Isaac I. Bejar and Anne Sax (2004).

2.9 links

Cathleen A. Kennedy (2005). The BEAR Assessment System: A Brief Summary for the Classroom Context. Berkeley Evaluation & Assessment Research Center pdf

The National Assessment of Educational Progress (NAEP) - "the Nation's Report Card" - Search NAEP Questions site

Tests grade levels 4, 8 and 12

College Board Advanced Placement: Free-response questions site

The College Entrance Examination Board: SAT Preparation Center site

Test Prep Review ACT practice site. Provides links to a host of other American test practice pages as well.

SketchUp, a free 3D drawing program from Google

Its professional sibling is not free.
SketchUp is a simple but powerful tool for quickly and easily creating, viewing and modifying your 3D ideas.
This handy program for making drawings to accompany test questions became available as a beta for Windows in April 2006, and a MacOS X version is to follow.

answers.com question

TIMSS 2007 Trends in International Mathematics and Science Study pdf 3Mb, example mathematics items pdf, example science items pdf

These are fine examples of tests of mixed composition, with both open and closed questions. I have not yet taken the time to write an accompanying commentary on the example items.
TIMSS is aimed at the grade eight level.
The TIMSS/PIRLS site offers an extensive series of publications since 1995.

PIRLS 2006 Progress in International Reading Literacy Study Assessment Framework and Specifications, 2nd Edition pdf 1.8Mb, sample passages, questions, and scoring guides pdf

These are fine examples of language tests.
PIRLS is aimed at the grade four level (young children in their fourth year of schooling).
Mieke van Diepen, Universiteit van Nijmegen, is a member of the Questionnaire Development Group.
The TIMSS/PIRLS site offers an extensive series of publications since 1995.

De Wetenschapsquiz 2005. A discussion of the design of the questions in this quiz here.

De Grote Geschiedenis Quiz 2006. A discussion of the design of the questions in this quiz here.

CAA Centre: Computer-assisted assessment in higher education site, guide to designing multiple-choice tests pdf

A guide with many and varied examples (not all of them to be recommended). The piece is anonymous.

Tandi Clausen-May (). An approach to test development. nfer

Chapter 1, 'Looking ahead - ICT-based assessment,' is a downloadable pdf, with nice examples of what is possible with test questions administered by computer.
"One simple but often effective use of ICT involves the exploitation of the ability to drag and drop on the screen."

Jon Mueller Authentic Assessment Toolbox site

MERLOT Multimedia Educational Resource for Learning and Online Teaching site

"MERLOT is a free and open resource designed primarily for faculty and students of higher education. Links to online learning materials are collected here along with annotations such as peer reviews and assignments."
"MERLOT's strategic goal is to improve the effectiveness of teaching and learning by increasing the quantity and quality of peer reviewed online learning materials that can be easily incorporated into faculty designed courses."
"MERLOT's vision is to be a premiere online community where faculty, staff, and students from around the world share their learning materials and pedagogy."

January 10, 2011 \ contact ben at at at benwilbrink.nl

http://www.benwilbrink.nl/projecten/06examples2.htm