Original publication 'Toetsvragen schrijven' 1983 Utrecht: Het Spectrum, Aula 809, Onderwijskundige Reeks voor het Hoger Onderwijs ISBN 90-274-6674-0. The 2006 text is a revised text. The complete 1983 text is available at www.benwilbrink.nl/publicaties/83ToetsvragenAula.pdf.
Techniques for the design of items for teacher-made tests
8. Test item quality
This database of examples has yet to be constructed. Suggestions? Mail me.
Figure 1. Quality map.
The one term that covers both the attempts at item design in chapters three through seven and the chapter eight quality checks on the items thus designed is construct validity. I will use Snow (1993) and Messick (1993) as sources on the concept of construct validity, because these contributions focus on the construction-versus-choice theme that is so germane to item design. Construct validity covers the item and test design process, as well as the empirical evaluation of the items and tests so designed against the major and minor purposes of the course concerned. Be warned, however, that the work of these authors will be criticized for being rather parochial to the educational measurement situation in the United States, which differs markedly from that in, for example, continental European countries like the Netherlands. The criticism does not touch the idea of construct validity as such, but the way it is fleshed out by Snow, Messick and other American authors. Points of criticism will be the norm-referenced rather than criterion-referenced character of the suggested research methodologies, the neglect of national examinations as one possible benchmark for construct-validating high-stakes standardized tests, and in general the lip service paid to the competitive sentiment in American education and its testing traditions.
The emphasis on construct validity is an addition to the 1983 text. I still have to work this out. In the process I will also use Hofstee, 1971, and the possibly more attractive conception of validity as expounded by Borsboom, Mellenbergh and van Heerden (2004) pdf
8.1 Fair grading
Questions should be clear and unambiguous.
See Haney (1984), p. 627-8.
You are shown two pyramids, one of them with four faces, the other with five faces. All faces are equilateral triangles, except for the second pyramid's square base.
If the two pyramids were placed together face-to-face with the vertices of the equal-sized equilateral triangles coinciding, how many exposed faces would the resulting solid have?
Haney, 1984, p. 627, citing the College Boards' Preliminary Scholastic Aptitude test (PSAT).
You might take some time pondering the question in the box, and why I might have cited it as an example of what. The question has become rather famous because it was the first one, after the truth-in-testing law of New York State came into effect, that was shown to have a faulty key. The official key was that the number of faces is seven. A Florida schoolboy had marked 'five,' and protested against only the official answer being scored right. ETS immediately admitted they had made a mistake. Can you figure out what the mistake was? According to the New York Times, 240,000 scores had to be raised! The same item had been used in the years before, and on those tests the scores were corrected as well. Note that standard item analyses had not turned up this item as a faulty one! In the earlier years, students were unable to protest because the tests then were kept secret. The right answer to this question indeed is five: when the tetrahedron is glued onto the square pyramid, two pairs of adjacent triangular faces turn out to lie in the same plane, each pair merging into a single rhombic face, which reduces the naive count of seven by two. Reviewers of the original item had completely missed this solution.
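The coplanarity argument can be checked mechanically. The sketch below uses unit-edge coordinates of my own construction (not from the PSAT item): it glues the tetrahedron onto one triangular face of the square pyramid and counts the distinct supporting planes of the seven 'naive' faces.

```python
import math

# Assumed unit-edge coordinates (my construction, not from the item):
s = math.sqrt(0.5)
B1, B2, B3, B4 = (0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)  # square base
A = (0.5, 0.5, s)   # apex of the square pyramid
T = (1.5, 0.5, s)   # free vertex of the tetrahedron glued onto face (B2, B3, A)

# The seven 'naive' faces of the combined solid (the glued pair removed):
faces = [
    (B1, B2, B3, B4),                         # square base
    (B1, B2, A), (B3, B4, A), (B4, B1, A),    # remaining pyramid triangles
    (B2, B3, T), (B2, A, T), (B3, A, T),      # remaining tetrahedron triangles
]

def plane_key(face, nd=9):
    """Canonical (unit normal, offset) of the face's supporting plane."""
    p, q, r = face[0], face[1], face[2]
    u = tuple(q[i] - p[i] for i in range(3))
    v = tuple(r[i] - p[i] for i in range(3))
    n = (u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0])
    length = math.sqrt(sum(c * c for c in n))
    n = tuple(c / length for c in n)
    for c in n:                     # fix the sign of the normal
        if abs(c) > 1e-9:
            if c < 0:
                n = tuple(-x for x in n)
            break
    d = sum(n[i] * p[i] for i in range(3))
    return tuple(round(c, nd) for c in n) + (round(d, nd),)

exposed = len({plane_key(f) for f in faces})
print(exposed)   # 5: two pairs of adjacent triangles merge into rhombi
```

Two of the seven face normals coincide pairwise on a common plane, so the count drops to five, exactly the schoolboy's answer.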
Walt Haney (1984). Testing reasoning and reasoning about testing. Review of Educational Research, 54, 597-654.
Kathleen Rhoades and George Madaus (2003). Errors in Standardized Tests: A Systemic Problem. Boston College: Lynch School of Education. National Board on Educational Testing and Public Policy. pdf
- This monograph is concerned with the role of human errors in testing. From the introduction: "Human errors do not occur randomly; their presence is not known. These errors are of greater concern than random errors because they are capricious and bring with them unseen consequences." The study examines human errors in systems in general and in the education system in particular. The authors document active errors and latent errors in educational testing.
John R. Hills (1991). Apathy concerning grading and testing. Phi Delta Kappan, 540-545.
- "Teachers and administrators alike are woefully ignorant of sound assessment practices, Mr. Hills contends. The root of the problem is a system that neither recognizes nor rewards evaluation skills."
- The article does not seem to be available on the web. Use its title to find related publications.
- Hills gives lots of examples of misuse and abuse of tests and grading by teachers and administrators. The bad news is that misuse or abuse is not the exception, but regular bad practice in our schools.
8.2 Checklist quality control
Checklist of format concerns
Checklist of content concerns
Haladyna (1999, p. 77), Haladyna et al. (2002): "Base each item on specific content and a type of mental behavior."
The Haladyna guideline endorses the construction of extensive tables crossing specific content against levels of cognitive functioning, see Bloom, Hastings and Madaus (1971).
This book, however, does not endorse the psychological approach of trying to force items into the Bloom et al. cognitive taxonomy of 'mental behaviors' (a terrible term). Also on the point of enumerating specific content, this book follows the alternative approach of schematizing course content, a technique that is much more flexible than the linear enumeration of topics treated.
For the literature mentioned, see the list in chapter 2
1. Write up the specific content of the question.
Guideline two replaces the cognitive taxonomy or mental behavior specifications still widely used in the field, see Thomas Haladyna, Steven M. Downing, and Michael C. Rodriguez (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15, 309-334. http://depts.washington.edu/currmang/Toolsforteaching/MCItemWritingGuidelinesJAME.pdf [dead link? 1-2009]
2. Write up the kind of mastery that is asked for.
Reproduction, producing new examples, a new application, inference, translation, etcetera.
I am not aware of any place in the literature on educational measurement that treats the question of what the appropriate level of abstraction in test items is. Excluding, of course, the educational writing of John Dewey.
A somewhat related guideline is Haladyna's number three "Avoid overly specific and overly general knowledge."
TOO ABSTRACT or GENERAL
What is life?
The exchange of oxygen for carbon dioxide
The existence of a soul
Haladyna 1999 p. 79
3. Is the level of abstraction appropriate?
Guideline six in the Haladyna list: Avoid trick items.
7. No trick items allowed.
(Loosely speaking, a trick item is one the student cannot prepare for, or has not been prepared for in the course.)
"In this country a date such as July 4, 1971, is often written 7/4/71, but in other countries the month is given second and the same date is written 4/7/71. If you do not know which system is being used, how many dates in a year are ambiguous in two-slash notation?"
Gardner (2006), problem 1.1.
Catch questions can be trick questions, too. In the above box, a date such as 7/7/77 is not ambiguous (therefore the answer is 132, not 144). By definition, the catch is not something already familiar or learned. Therefore, catch questions do not belong in achievement tests. They might be used in instructional situations, of course. For example, Martin Gardner (2006) offers a fine collection of problems, some of which are catch questions.
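The count of 132 is easy to verify by brute force; a minimal sketch:

```python
# A date m/d is ambiguous in two-slash notation when swapping day and
# month also yields a valid date that denotes a DIFFERENT day.
ambiguous = [(m, d) for m in range(1, 13) for d in range(1, 13) if m != d]
print(len(ambiguous))   # 132: 12*12 swappable pairs minus the 12 like 7/7/77
```

The 12 excluded pairs (1/1, 2/2, ..., 12/12) are exactly Gardner's catch.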
Checklist of general concerns
THE CORRECT ALTERNATIVE — It saves energy — IS MISSING
5: Why do penguins walk so silly?
Their fat layers are in the way.
They do not have knees.
Their legs are rather short.
The item was used in the National Science Quiz 2005 in the Netherlands (site), and consequently was left out of the contest.
5a. Is the printed question correct?
In Dutch secondary education the exit exams are national tests. It is not humanly possible to prevent all errors in the formulation of the questions. An example of one such error is given here, a rather stupid one, demonstrating a serious shortcoming in the quality control process. The particular error results in all students getting full credit for the question involved, even if they did not answer it at all. The trouble is, errors of this kind cannot be repaired adequately and justly after the fact. The better students might in fact incur a small penalty here because of the grading-on-the-curve method of norming used in these exams, but they will not be told so. Almost everything else on high-stakes testing may be found on the Het Examenblad site.
[Translated from the Dutch] The first two score points of question 12, belonging to the part of the scoring model described as:
"Two events in South East Asia are to be named that prompted Eisenhower to make this statement (the events must not take place after 1954), for example:
China becomes a communist People's Republic in 1949
the communist aggression during the Korean War / the invasion by communist North Korea of South Korea in 1950"
are to be awarded to all candidates, regardless of the answer given.
[So ALL examinees get two points for this question, regardless of whether and what exactly they answered. Since Korea and China do not belong to South East Asia, it would not have been very clever to have given these answers!]
Exit exam VWO history, 2006 http://www2.cito.nl/vo/ex2006/600043-A-12c-VW.pdf [dead link? 1-2009]. The CEVO is the authority handling scoring problems in the National Examination in the Netherlands
Dewey G. Cornell, Jon A. Krosnick and LinChiat Chang (under review 2004). Student Reactions to Being Wrongly Informed of Failing a High-Stakes Test:
The Case of the Minnesota Basic Standards Test. pdf
- "This study is based on an unfortunate event in 2000 when 7,989 students were wrongly informed that they had failed the Minnesota Basic Standards Test in Mathematics."
K. Rhoades and G. Madaus (2003). Errors in standardized tests: A systemic problem. Boston: National Board on Educational Testing and Public Policy. pdf
- "This paper has shown that human error can be, and often is, present in all phases of the testing process. Error can creep into the development of items. It can be made in the setting of a passing score. It can occur in the establishment of norming groups, and it is sometimes found in the scoring of questions."
"As this monograph goes to press additional errors have come to our attention that could not be included. These and the errors documented in this report bear strong witness to the unassailable fact that testing, while providing users with useful information, is a fallible technology, one subject to internal and external errors. This fact must always be remembered when using test scores to describe or make decisions about individuals, or groups of students."
Lots of further documentation and literature in this shocking report.
Robert Schaeffer testimony before the New York Senate Higher Education Committee (May 2, 2006). http://www.fairtest.org/univ/SAT_Error_testimony.html [dead link? 1-2009]
- Your worst nightmare, because of an irresponsible testing industry. My words, b.w.
5b. Is the scoring key correct?
Biology VWO 2006 http://www2.cito.nl/vo/ex2006/600025-1-26o1.pdf [dead link? 1-2009] question 15. The problem here is that alternative 2 is meant to be the 'correct' one. However, there is no indication whatsoever of the order of magnitude of the pressure changes as depicted on the vertical scale. Therefore the issue arises whether it is absolutely true that there is no other process influencing the pressure one way or the other. According to Karel Knip, NRC Handelsblad June 4, p. 51 (Alledaagse Wetenschap 'Erwtverwarming'), there is: combustion of starch, a burning process, produces heat, and therefore does increase the pressure in a closed vessel. Therefore, alternative 3 is the correct one.
Layout: Thompson, Johnstone and Thurlow (2002) html . For example, avoid right-justified text. "Unjustified text may be easier for poorer readers to understand because the uneven eye movements created in justified text can interrupt reading." [I have not yet checked the empirical basis for the claim. As soon as I am convinced, I will start using ragged text.] Short lines - items printed in two columns - are more vulnerable in this respect than longer lines.
KEEP THE TEST SECURE
The National Science Quiz 2003, a TV quiz in the Netherlands, was published just before the planned recording session, making it possible for the participants to have read the questions beforehand. The decision not to go on with the 2003 show led to the departure of quizmaster Wim T. Schippers, who pioneered the quiz in 1994. Otherwise no serious harm was done. Security management in the case of examinations, especially nationwide ones, is a very serious matter. Premature publication on the internet of items from the exams would be a disaster for all examinees.
[Translated from the German] Which statement explains why there is daylight and darkness on Earth?
The Earth rotates on its axis.
The Sun rotates on its axis.
The Earth's axis is tilted.
The Earth revolves around the Sun.
Item from the science test, presented as exemplary by the German PISA-2003 coordinator Manfred Prenzel in "Die Zeit" (09.12.04) (P2, 394; 591 points).
Peter Bender (2005). Neue Anmerkungen zu alten und neuen PISA-Ergebnissen und -Interpretationen http://www-math.uni-paderborn.de/~bender/neueAnmerkungenPISAausf%FChrlich.pdf [dead link? 1-2009], p. 10 [translated from the German]: "All the answers are wrong, A in particular. The explanation rather is: 'The Earth rotates about its own axis with an angular velocity different from that of its revolution around the Sun.'"
The PISA question is why there is day and night on earth. The correct answer, according to PISA, is because the Earth rotates. Peter Bender: the Moon does also, yet it always turns the same side to the Earth.
Bender's repairing act fails as well, because it is known that Uranus has its axis of rotation in the direction of the sun, in that way always turning the same side to the sun; how does this fact relate to statement c? Designing achievement test items is a tricky business.
Of course, the stem of the question is ambiguous also, see Bender p. 11.
Well, isn't answer a. the better one of the four alternatives? It would be foolish for a knowledgeable student NOT to answer a. What do you think: is this bad design, or not?
Do not misuse the instruction to students always to choose the best alternative, if more than one are deemed to be correct. The PISA question cited does not have a single correct answer, and is therefore a faulty item.
8.3 Independent quality check
Two senior GCE examiners re-marked photocopies of the same 200 GCE examination scripts, half still containing the marks and comments of the original examiners and half with these markings removed. Removing previous markings made a considerable difference to the extent of agreement between these sets of marks.
The above is the (ERIC) abstract of R. J. L. Murphy (1979). Removing the Marks from Examination Scripts before Re-Marking Them: Does It Make Any Difference? British Journal of Educational Psychology, 49, 73-78.
This kind of exercise has of course been carried out, in one variant or another, thousands of times, and many of these studies have been published. The results tend to be the same as reported in this particular case. Its relevance to the theme of this paragraph is that it does not matter much whether it is examination scripts being rated, or the quality of proposed test items. The point is: everything possible should be done to obtain independent judgments of item quality; if anything might be amiss with a particular item, the probability of getting it noticed must be maximised (within reasonable bounds of available resources, of course).
In children, ventricular septal defects are associated with
tetralogy of Fallot
The difficulty is that the examinee has to make assumptions about the severity of the disease, the age of the patient, and whether or not the disease has been treated. Different assumptions lead to different answers, even among experts.
Case and Swanson (2001) p. 15 http://www.nbme.org/PDF/2001iwg.pdf [dead link? 1-2009]
Even if you were a doctor, could you intuitively feel something might be wrong with an item such as the boxed one? In many cases you would pass the item, I can assure you. In those cases, before using the item in your next test, test it by having it independently answered by knowledgeable colleagues; if the item is flawed they will come up with different answers. Case and Swanson clearly indicate what the trouble with the boxed item is. In fact, the stem can be regarded as a patient vignette that is too abstract. In this case the vignette should have contained information about the severity of the disease, etcetera.
Bar-Hillel and Falk (1982) analyse a seemingly simple statistics question, showing how superficial analysis can result in different answers being given, among them the correct one (but for the wrong reason), because no short-cuts are possible around a careful and complete analysis of the simple problem as presented in the question. The case is treated in the Dutch chapter. This kind of problem has become famous in the US; the case is given in the box.
The Monty Hall problem
Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the other doors, opens another door, say No. 3, which has a goat. He then says to you, 'Do you want to pick door No. 2?' Is it to your advantage to take the switch?
A reader's question put to Marilyn vos Savant, columnist for the Sunday Parade, September 1991. (e.g. see here.)
Did you answer the question? Why do you think your answer is the correct one? Want to play the game here or there? For more information, see Zomorodian (1998) pdf. A book on the problem is Rosenhouse (2009). By the way, Marilyn's answer was: switch. Thousands of readers disagreed. The critical sentence in the problem is that the quizmaster knows what is behind the doors, making Marilyn's answer true.
The reason to mention this particular problem at this particular place is that it signals a tremendous problem in assessment. Here is a very simple question, or so it seems, allowing a choice between two answers only. And yet many experts disagree with each other on the correct answer. What is worrying about this case is that it shows that there do not exist procedures that are guaranteed to result in a correct verdict on what is a correct or the best answer to a given test question. In this particular case it is clear, after the fact, what the correct answer is. But what to think of all those cases where experts disagree (the items concerned should be discarded) or where the assessor's judgment is contested on reasonable grounds by the student?
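For readers who distrust the verbal argument, the verdict on this particular item can be checked empirically. A minimal simulation sketch (door numbering, seed and trial count are arbitrary choices of mine):

```python
import random

def monty_trial(switch, rng):
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host, knowing where the car is, opens a goat door != the pick.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(0)
n = 100_000
stay = sum(monty_trial(False, rng) for _ in range(n)) / n
swap = sum(monty_trial(True, rng) for _ in range(n)) / n
print(f"stay wins {stay:.2f}, switch wins {swap:.2f}")   # about 0.33 vs 0.67
```

Note that the critical assumption is written into the code: the host's choice excludes the car. Drop that line's condition and the advantage of switching disappears, which is exactly why the wording of the question matters so much.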
There is, however, quite another way to look at the Monty Hall type of problem. See De Neys and Verschueren (2006) pdf. Working memory capacity proves to be a characteristic distinguishing subjects who solve the problem correctly from those who fail to do so. This finding does not rule out other explanations in the extensive literature on the Monty Hall problem (see the list in De Neys and Verschueren), but it should be a warning that this type of problem is a breach of validity in achievement testing. Differences in working memory capacity, or any other differences in intelligence, should not be a factor in educational achievement testing. They might be a factor in participation in particular courses, of course.
Asking only fellow teachers to independently check the quality of items runs the risk of overlooking an important category of faulty item design: items being way too difficult for the pupils or students, while at the same time being quite easy for the expert colleague. See Redish, Scherr and Tuminaro (2005 pdf) for classroom research on exactly this kind of design problem (the problem is given in the figure). Finding out about this kind of problem after the fact is not clever - the damage cannot be repaired. It may be necessary to use students to check for this kind of problem - if so, then do so.
Figure 8.3.1. The three-charge-problem (Redish, Scherr and Tuminaro, 2005)
8.4 Check lists
8.5 An historical perspective
THE NECK VERSE (READING TEST TO SAVE ONE'S LIFE)
Verse 1 of Psalm 51: "Have mercy upon me, O God, according to thy loving kindness; according unto the multitude of thy tender mercies blot out my transgressions."
This biblical verse was used in medieval England to test whether a convict's claim to belong to the clergy was true: if he could read, he was a clerk, and could escape capital punishment by a lay court. The favorite passage to be read was the verse cited; it could save one's neck. Yes, cases of fraud have been known: the verse could be learned by heart. See Frank Daniels (1973). Teaching reading in early England (p. 160-4). Pitman. Or Google "Neck Verse" html.
Starch and Elliott, in 1913, used the photographed copy of one final examination in United States history in a large school in Wisconsin: in print it is one half page of questions, and two pages of answers. Seventy principal teachers of history from schools whose passing grade was 75 graded this same examination work. The picture shows the result as summarized in Starch (1916, p. 7).
o o oo
o o oo
oo o o oo o oo
o o o o o oo ooo oo ooo oo
o o o o ooo oooo ooo oooooo oooooooo oooooo o
40 50 60 70 80 90
"Marks assigned to a history paper by seventy teachers." Note: the placement of the o's follows the Starch figure.
Daniel Starch and Edward C. Elliott (1913). Reliability of grading work in history. School Review, 21, 676-681. Reprinted in John A. Laska and Tina Juarez (Eds) (1992). Grading and marking in American schools. Two centuries of debate. Springfield, Illinois: Thomas. The article gives the full text of the examination questions and the examination work used in the investigation.
Research by Daniel Starch and Edward Elliott just before World War I has been highly influential in the American way of grading, and the explosion of standardized testing in schools in the years immediately following the war.
The common method of grading in the early twentieth century was marking in percentages of the material the student had learned - in the eyes of the assessor, of course. To begin with, this research showed percentage marking to be ridiculously imprecise, and so marked the change to letter grading, common nowadays.
Much more important, though, results like these gave proponents of standardized testing decisively strong arguments to introduce these new testing formats in the schools.
Probably you have always thought that history papers are simply difficult to grade reliably, and that an exact subject would fare better. The next figure shows the results of the same kind of research, now using a paper in mathematics.
o o o
o oo o o o
o oo oo o o oo
o ooo ooooooo oooo oooo o o
o o o oooooooooooo ooooooooo oooo o
oooooo o ooooooooooooooooooooooooooo oooooo o
_ _ _ ________________________________________
28 53 60 70 80 90
"Passing grade 75. Marks assigned by schools whose passing grade was 70 were weighted by 3 points. Median 70. Probable error 7.5"
Daniel Starch and Edward C. Elliott (1913). The Reliability of grading work in mathematics. School Review, 21, 254-9. The data from this research, as presented in figure 3 in Starch (1916, p. 6).
What I want you to take away from the Starch and Elliott research is, first of all, its method: get independent assessments of the same work, and let the results educate you. The second important observation is that pupils should never be put in a position where their fate depends on the marking of one or two examination papers of this kind. The usual solution is what is called in Dutch 'vijven en zessen': after a great deal of shilly-shallying, the teachers and the principal of the unlucky pupil will consider other indicators of his or her school achievement to correct unexpectedly low marks. There is a lot more to say about this kind of problem; I will not do so here, except for this one remark.
It is to our own peril to forget about this kind of research. It has been of tremendous significance in the promotion of so-called objective methods of assessment. And yet, to the critical observer, it should be immediately obvious that the research methodology of Starch - and also that of Hartog and Rhodes (see below) and many others - is seriously at fault. The independent assessors do not know anything about the pupil who wrote the essay, nor about the instruction he or she has received: no wonder the assessments of the same work are that wildly different. This is just another case of the old psychometric dictum (among others Frederic Lord, on the way scores in a test battery should be combined to arrive at a final verdict: they should compensate each other): always use information from other assessments of work of the same pupil to arrive at a more balanced view of achievements and talents. Even more serious, though, is the total neglect of the possibility of assessment for learning: only the grading of the same work is analysed, not the feedback these assessors would have given the pupil. The last point has been elaborated beautifully by Dennie Palmer Wolf (1993).
C. W. Valentine (1932). The Reliability of Examinations. An Enquiry. London: University of London Press.
"Many of us," writes the Headmaster of Holt School, Liverpool, "have had the heart-breaking experience of a candidate, the best in his year, with as many as four distinctions, failing to get matriculation or even a School Certificate."
Valentine, among other things, illustrates how unjust the requirement of a pass for compulsory subjects can be. His book, which really is about examinations, not about individual examination papers, nevertheless contains gems like the cited one about the mishandling of the graded individual examination paper.
School certificate history: predictable?
"Fifteen scripts were selected which had been awarded exactly the same "middling" mark by the School Certificate authority concerned, and these scripts were marked in turn and independently by 15 examiners, who were asked to assign to them both marks and awards of Failure, Pass and Credit. After an interval which varied with the different examiners, but was not less than 12 nor more than 19 months in any instance, the same scripts, after being renumbered, were marked again by 14 out of the 15 original examiners (...). The 14 examiners assured us that they had kept no record of their previous work and this was indeed obvious from the results.
Perhaps the most striking feature in the investigation is this: (...) On each occasion the 14 examiners awarded a total of 210 verdicts [Fail, Pass, Credit] to the 15 candidates. It was found that in 92 cases out of the 210 the individual examiners gave a different verdict on the second occasion from the verdict awarded on the first. In nine cases candidates were moved two classes up or down.
One examiner changed his verdict in regard to eight candidates out of the fifteen. Yet he only varied his average by a unit [scale of 100] and he awarded the same number of Failure marks, one less Pass, and one more Credit. Such irregularity of judgment is not only formidable, but it is one which would not be detected by any ordinary analysis."
Ph. Hartog and E. C. Rhodes (1936). An examination of examinations (p. 14-15). Second edition. International Institute Examinations Enquiry. London: MacMillan.
Ph. Hartog and E. C. Rhodes (1936). An examination of examinations. Second edition. International Institute Examinations Enquiry. London: MacMillan.
- The main work, however, is Ph. Hartog and E. C. Rhodes (1936). The marks of examiners. London.
The big omission in the Hartog and Rhodes prose is that they have not taken into account the number of different verdicts that is to be expected under ideal circumstances. Let me explain this in a few words. Under pass-fail scoring it will almost always be the case that the cutting score falls at a point where many pupils score exactly the cutting score, or one point less. Even when using a test that is scored very reliably, an independent assessment will result in an appreciable number of pupils crossing the line one way or the other. The problem here is much more complicated than scoring reliability alone. Regrettably, the educational measurement profession has chosen to focus on the 'subjectivity' of these assessments, because that is the natural thing to do for the measurement specialist. Seven decades later, the basic problems still have to be solved; for example, see Bennett and Ward (1993). They are the more difficult to solve now, because in the meantime the world - especially the US - has been sold on standardized tests.
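The point can be illustrated with a small simulation. All distribution parameters below are assumptions of mine for the sake of illustration, not data from Hartog and Rhodes: even with two markings that correlate well above .80, a sizeable fraction of pass-fail verdicts changes on independent re-marking, simply because many pupils sit near the cutting score.

```python
import random

rng = random.Random(42)
n = 10_000
cut = 55                                       # assumed cutting score
true = [rng.gauss(60, 10) for _ in range(n)]   # assumed true scores
mark1 = [t + rng.gauss(0, 4) for t in true]    # first marking (small error)
mark2 = [t + rng.gauss(0, 4) for t in true]    # independent re-marking

# Pearson correlation between the two markings ('reliability').
mx, my = sum(mark1) / n, sum(mark2) / n
cov = sum((a - mx) * (b - my) for a, b in zip(mark1, mark2)) / n
sx = (sum((a - mx) ** 2 for a in mark1) / n) ** 0.5
sy = (sum((b - my) ** 2 for b in mark2) / n) ** 0.5
r = cov / (sx * sy)

# Fraction of pupils whose pass-fail verdict flips between markings.
flips = sum((a >= cut) != (b >= cut) for a, b in zip(mark1, mark2)) / n
print(f"reliability {r:.2f}, pass-fail verdicts changed {flips:.0%}")
```

With these assumed figures the correlation comes out around .86, yet well over a tenth of the verdicts change: disagreement near the cutting score is expected even under 'ideal' circumstances, which is exactly the baseline Hartog and Rhodes failed to report.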
J. R. Gerberich (1956). Specimen objective test items. A guide to achievement test construction. Longmans.
Gerberich presents this as an example of good practice. Brrrrrr. Poor kids.
'Recognizing good sentences' a valid educational objective?
.... decide whether or not it is a good sentence.
|right||wrong||1. During the program|
|right||wrong||2. If you can't go|
|right||wrong||3. Here are some flowers|
|right||wrong||4. While running down the street|
|right||wrong||5. That night it rained hard|
Gerberich (1956, p. 39), from Iowa Tests of Basic Skills, Test C, Basic Language Skills, Elementary Battery, Form S. Part V, Sentence Sense.
This is a serious problem. Many of the 28 examples in the skills chapter are of the same type: recognize erroneous usage, etcetera.
1. If the exercise is not a good sentence, place an X in the E (Wrong) box.
4. Read all the shorthand carefully. An occasional outline makes no reading sense in the place used. When you find such an outline, cross it out.
6. Read each sentence and decide whether there is an error in usage in any of the underlined parts of the sentence. (...) If there is no usage error in the sentence, put a zero (0) in the parentheses.
7. In each of the following sentences some one form of punctuation is missing.
9. The following sentences all show faults of construction, such as mixed, incomplete, dangling, or illogical constructions or lack of desirable parallelism.
12. Each line in the paragraph may contain an error in grammar (agreement, tense, mood, comparison, reference, etc.) or in idiomatic usage (right preposition, proper word or phrase, etc.).
25. If you think a word is misspelled, draw a circle around it. Write the misspelled words correctly on the lines below the passage.
27. This is a test in proofreading.
28. Reprinted below is a poorly written passage. You are to treat it as though it were the final draft of a composition of your own, and revise it so that it conforms with standard formal English.
With the exception of item 27 which tests an authentic educational goal - it is from the Landis Achievement Test in Printing - these items test skills that evidently do not belong to the primary goals of the courses involved. I will grant that scores on these items will correlate appreciably with whatever are the core skills of these courses, but that is not the point, even if the test constructors should have made it their point - they surely will have done so. Show me how you test your pupils, and I will tell you what you are teaching them. What these tests teach pupils is that - whatever their efforts to learn - tests will compare their ability to that of others. A clear signal to give up on education as soon as there is an opportunity to do so.
Banesh Hoffmann (1962/1978). The tyranny of testing. Crowell-Collier. Reprint 1978. Westport, Connecticut: Greenwood Press.
- A critical book on testing, especially on the lack of quality in the questions used. Highly influential. I will prepare a number of typical examples mined by Hoffmann, and describe how the testing business handled his criticisms.
His book probably is dated in the sense that times have changed since the early 1960s: a lot of secrecy has disappeared, and it is generally acknowledged that items might be wrongly designed. The later book by Owen, however, proves there still might be a lot amiss; there still is room for the Hoffmanns of the 21st century to do their muckraking work. In fact, Hoffmann's book was reprinted again in 2003.
How many different answers and reasons can you think of? Hoffmann's point is, of course, that this kind of low-quality test question by no means is an exception in high-stakes testing.
Which is the odd one out among cricket, football, billiards, and hockey? _____________
Hoffmann, 1962, p. 17 and following. It all started with a letter by T. C. Batty printed in the Times of London, March 18, 1959.
It seems to me that those who have been responsible for inventing this kind of brain teaser have been ignorant of the elementary philosophical fact that every thing is at once unique and a member of a wider class.
From a philosopher's letter, March 20th in the Times
The discussion in the Times petered out, possibly because no one responsible for question design joined it. That's a pity. In a recent book on item writing, Haladyna unwittingly presented a new version of this kind of disastrous item (1999, p. 101).
ANYTHING STILL GOES IN 1999
The best way to improve the reliability of test scores is to
1. increase the length of the test.
2. improve the quality of items on the test.*
3. increase the difficulty of the test.
4. decrease the difficulty of the test.
5. increase the construct validity of the test.
The problem, of course, is that the item refers to some non-existent 'most common' test. What kind of test might that be? Is it already a long test? Does it have 'easy' items? Is it a psychometrically 'optimal' test? If so, why pose the question? Etcetera. Haladyna knows this item is not a very good one, and changes it into the multiple true-false format. Doing so solves some ambiguities, but not all of them.
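The ambiguity is real: whether lengthening a test or improving its items helps more depends on the test one starts from. The Spearman-Brown prophecy formula makes the length effect concrete. The sketch below is a minimal illustration in plain Python; the function name and the numbers are chosen here for the example, not taken from Haladyna.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability of a test lengthened by `length_factor`,
    by the Spearman-Brown prophecy formula: k*r / (1 + (k-1)*r)."""
    k, r = length_factor, reliability
    return k * r / (1 + (k - 1) * r)

# Doubling a test with reliability .60 already raises it to .75,
# which shows why 'increase the length' is a defensible answer too.
print(round(spearman_brown(0.60, 2), 2))  # 0.75
```

Note that the formula assumes the added items are of the same quality as the existing ones, which is exactly where the 'improve the quality of items' alternative cuts in.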
'Emperor' is the name of
1. a string quartet
2. a piano concerto
3. a violin sonata
Hoffmann, 1962, p. 22, no source mentioned.
The 'superior student,' as Hoffmann calls him [it is a 'him'], knows of the Emperor Quartet of Haydn. Hoffmann then gives a sharp insight into the kind of loss of trust this student must experience in the test as a fair instrument. For this student it is easy to entertain the hypothesis that the test designer is ignorant and that answer 2 is the correct one. But ...
... he has been led to call into question both the good will and the competence of the examiner; and this subjects him to a handicap, the severity of which will depend on how faulty or impeccable is the rest of the test. No longer is it possible for him to skim innocently ahead. Instead, he must proceed warily and dubiously, ever alert for intentional and unintentional pitfalls. And whenever he comes to a question for which he, with his superior ability, sees more than one reasonable answer, he must stop to evaluate afresh the degrees of malice and incompetence of the examiner. Such a test becomes for the superior student a highly subjective exercise in applied psychology - and, if he is sensitive, an agonizing one.
Hoffmann, 1962, p. 22
Ideally, testees should never be put in this kind of position. In actual practice, however, almost every test has its defective items, to the despair of the testees. Every defective item encountered, whether it is a trick question or one where by accident the correct alternative was not printed, is highly disturbing for the serious testee.
There is some research corroborating Hoffmann's observation; see, for example, Crombag, Gaff and Chang (1975), mentioned in (Dutch) chapter 2, reporting that 'superior students' - here, students reading more than the course material itself - attained lower grades than other students.
The American colonies were separate and _____________ entities, each having its own government and being entirely _____________ .
A. incomplete - revolutionary
B. independent - interrelated
C. unified - competitive
D. growing - organized
E. distinct - independent
Hoffmann, 1962, p. 23 ff, question 17 in the 1956 descriptive booklet on the SAT.
Hoffmann gives this question a lot of space in his book, because it is rather typical of tests like the SAT, and its defect can enrage 'those who work in relevant fields, such as history, sociology, and English.' The intended answer is E; the specialist's choice tends to be D. The defect, of course, is that the question is meant to be easy, but it is difficult for the mistrustful student who weighs the meaning of words like 'entirely' and ponders why they were put there. Hoffmann's chapter 14, 'Return to the colonies,' discusses the official defence of Educational Testing Service, the producer of the SAT.
Not all people who study this 'colonies' question regard it as seriously defective. But this fact does not make it acceptable. If a sizable number of qualified, intelligent people believe a question to be so worded that the wanted answer is unacceptable, that is sufficient reason for branding the question defective, for there will be intelligent examinees who will realize the same thing, and they will be penalized for their perspicacity.
Hoffmann, 1962, p. 26
I am beginning to like this fellow Banesh Hoffmann. He is very perceptive of crucial quality issues in item design, and at no time uses extreme language to judge those responsible. Thus far at least.
George Washington was born on February 22, 1732.
According to the Julian calendar used at the time of Washington's birth, and thus according to contemporary records of that event, he was born on February 11, not February 22. Which answer, then, should one pick: True or False? Remember: no explanations are allowed.
Hoffmann, 1962, p. 27, no source mentioned (supposedly, it is an item from an intelligence test).
A beauty. Remember also: examiners are in the habit of reproaching students for not having seen the exact meaning of this or that word in the question stem. What, in the example given, is the exact meaning of the absence of information on the kind of calendar? Dear George could never have been born on a date according to a calendar not in use at the time of his birth. Saying he was is just a convenient fiction - an untruth nevertheless. Am I explaining too much here?
Yes-no questions are especially vulnerable on points like this. In this particular case the ambiguity could have been removed, for example, by deleting the day '22.'
Ebel, Robert L. (1965). Measuring educational achievement. Englewood Cliffs, New Jersey: Prentice-Hall.
- Ch. 2. What should achievement tests measure? - Ch. 9. How to judge the quality of a classroom test - Ch. 11. How to improve test quality through item analysis - Ch. 12. The validity of classroom tests - Ch. 13. Marks and marking systems
R. C. Anderson (1972). How to construct achievement tests to assess comprehension. Review of Educational Research, 42, 145-170.
M. Bar-Hillel and R. Falk (1982). Some teasers concerning conditional probabilities. Cognition, 11, 109-122.
Denny Borsboom, Gideon J. Mellenbergh and Jaap van Heerden (2004). The concept of validity. Psychological Review, 111, 1061-1071. pdf
- abstract This article advances a simple conception of test validity: A test is valid for measuring an attribute if (a) the attribute exists and (b) variations in the attribute causally produce variation in the measurement outcomes. This conception is shown to diverge from current validity theory in several respects. In particular, the emphasis in the proposed conception is on ontology, reference, and causality, whereas current validity theory focuses on epistemology, meaning, and correlation. It is argued that the proposed conception is not only simpler but also theoretically superior to the position taken in the existing literature. Further, it has clear theoretical and practical implications for validation research. Most important, validation research must not be directed at the relation between the measured attribute and other attributes but at the processes that convey the effect of the measured attribute on the test scores.
Case and Swanson (2001). Constructing Written Test Questions For the Basic and Clinical Sciences. National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104 http://www.nbme.org/PDF/2001iwg.pdf [dead link? 1-2009].
Randy Elliott Bennett and William C. Ward (Eds) (1993). Construction versus choice in cognitive measurement. Issues in constructed response, performance testing, and portfolio assessment. Hillsdale: Erlbaum.
Wim De Neys and Niki Verschuren (2006). Working Memory Capacity and a Notorious Brain Teaser. The Case of the Monty Hall Dilemma. Experimental Psychology, 53, 123-131. pdf
- from the abstract Findings indicate that working memory capacity plays a key role in overcoming salient intuitions and selecting the correct switching response during MHD reasoning.
S. M. Downing (2002). Threats to the validity of locally developed multiple-choice tests in medical education: construct-irrelevant variance and construct underrepresentation. Adv Health Sci Educ Theory Pract, 7(3), 235-41.
- abstract Construct-irrelevant variance (CIV) - the erroneous inflation or deflation of test scores due to certain types of uncontrolled or systematic measurement error - and construct underrepresentation (CUR) - the under-sampling of the achievement domain - are discussed as threats to the meaningful interpretation of scores from objective tests developed for local medical education use. Several sources of CIV and CUR are discussed and remedies are suggested. Test score inflation or deflation, due to the systematic measurement error introduced by CIV, may result from poorly crafted test questions, insecure test questions and other types of test irregularities, testwiseness, guessing, and test item bias. Using indefensible passing standards can interact with test scores to produce CIV. Sources of content underrepresentation are associated with tests that are too short to support legitimate inferences to the domain and which are composed of trivial questions written at low-levels of the cognitive domain. "Teaching to the test" is another frequent contributor to CUR in examinations used in medical education. Most sources of CIV and CUR can be controlled or eliminated from the tests used at all levels of medical education, given proper training and support of the faculty who create these important examinations.
Martin Gardner (2006). The colossal book of short puzzles and problems. Edited by Dana Richards. Norton.
W. K. B. Hofstee (1971). Begripsvalidatie van studietoetsen: een aanbeveling. Nederlands Tijdschrift voor de Psychologie, 26, 491-500.
- Construct validation of achievement tests: A recommendation. Referred to by Richard E. Snow (1993). Construct validity and constructed-response tests. In Bennett & Ward, Construction versus choice in cognitive measurement. Erlbaum.
- summary (p. 499) In educational testing, content is usually proposed as a solution to the validity problem. The present paper argues in favor of construct validation as a preferable alternative. Content validation, as described in many textbooks, falls short of scientific standards like explicitness and objectivity, a sound empirical basis, and clear conceptualization. It should be considered as a first step in the (construct) validation of an achievement test, not as a sufficient indication of its validity.
Procedures for assessing the construct validity of achievement tests are summarized, like: internal analysis, convergent and discriminant validation, comparisons between groups of ss, and experimental manipulation. Emphasis on construct validation implies the desirability of standardized testing, since it can hardly be expected that teacher-made tests are routinely validated in an elaborate fashion.
- I am not sure whether Willem Hofstee still holds the opinion expressed in the last sentence. A counterargument would be that there is scarcely any merit in investing in standardized tests to make them more 'valid' than the instruction of individual teachers itself is. Think this over.
- The point of mentioning (construct) validity here is that, ultimately, test use should be evaluated against suitable criteria. Test use, of course, includes test construction, as well as decisions taken on the 'evidence' from the tests as constructed. In the seven foregoing chapters the issue did not arise, because it was assumed tests should reflect educational purpose. In chapter 8 the issue is, does the assumption hold true?
- Eventually I will summarize the points in the Hofstee article, part of which are new, part are cited from the literature.
Christina Huber, Martina Späni, Claudia Schmellentin and Lucien Criblez (2006). Bildungsstandards in Deutschland, Österreich, England, Australien, Neuseeland und Südostasien. Literaturbericht zu Entwicklung, Implementation und Gebrauch von Standards in nationalen Schulsystemen. Fachhochschule Nordwestschweiz, Pädagogische Hochschule, Institut Forschung und Entwicklung, Kasernenstr., 5001 Aarau. http://www.edk.ch/PDF_Downloads/Harmos/Literaturanalyse_1.pdf [dead link? 1-2009]
Samuel Messick (1993). Trait equivalence as construct validity of score interpretation across multiple methods of measurement. In Randy Elliot Bennett and William C. Ward Construction versus choice in cognitive measurement (p. 61-73). Erlbaum.
- Samuel Messick (1989). Validity. In Robert L. Linn, Educational measurement (3rd. ed., pp. 13-103). New York: American Council on Education and MacMillan.
- Samuel Messick (1995). Validity of psychological assessment: validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749. From its abstract: "Six distinguishable aspects of construct validity are highlighted as a means of addressing central issues implicit in the notion of validity as a unified concept. These are content, substantive, structural, generalizability, external, and consequential aspects of construct validity. In effect, these six aspects function as general validity criteria or standards for all educational and psychological measurement, including performance assessments, which are discussed in some detail because of their increasing emphasis in educational and employment settings."
NIP (1986). Richtlijnen voor ontwikkeling en gebruik van psychologische tests en studietoetsen. Amsterdam: Nederlands Instituut voor Psychologie. Second edition.
- These guidelines place a rather one-sided emphasis on desiderata put forward from psychometrics. As a consequence they add little or nothing to the treatment Cohen (1981) gives of the same subject. The guidelines do, however, have a certain binding force for psychologists.
- APA (1966/1974/1985/1999) Standards for educational and psychological tests, Washington, D.C: American Psychological Association.
- Neither the Richtlijnen nor the Standards are available online, though they ought to be. There is some information on the internet, among others a piece by George Madaus and a summary of the Standards by ERIC.
- George Madaus, Carolyn A. Lynch and Peter S. Lynch (2001). A Brief History of Attempts to Monitor Testing. National Board on Educational Testing and Public Policy: Statements, Volume 2, Number 2. html
- ERIC Development Team (1994). Questions to ask when evaluating tests. ERIC/AE Digest. http://eric.ed.gov/ERICDocs/data/ericdocs2/content_storage_01/0000000b/80/2a/23/c0.pdf [dead link? 1-2009]
- "This Digest identifies the key standards applicable to most test evaluation situations. Sample questions are presented to help in your evaluations."
- Sandra Thompson, Martha Thurlow and David B. Malouf (2002). Creating better tests for everyone through Universally Designed Assessments. pdf
- "Universally designed assessments are designed and developed to allow participation of the widest possible range of students, in a way that results in valid inferences about performance on grade-level standards for all students who participate in the assessment. This paper explores the development of universal design and considers its application to large-scale assessments."
Code of Fair Testing Practices in Education. Prepared by the joint Committee on Testing Practices. html
- Note that this is NOT a summary of the Standards; it is a code on a special topic.
Legal issues in grading. A section in Louis N. Pangaro and William C. McGaghie (Lead Authors) Evaluation and Grading of Students. pdf, being chapter 6 in Ruth-Marie E. Fincher (Ed.) (3rd edition) Guidebook for Clerkship Directors. downloadable [Alliance for Clinical Education, USA]
Edward F. Redish, Rachel E. Scherr, and Jonathan Tuminaro (2005). Reverse engineering the solution of a 'simple' physics problem: Why learning physics is harder than it looks. paper pdf Also published in The Physics Teacher, 44, 293-300.
- abstract In this paper, we show an example of students working on a physics problem—an example that demonstrated to us that we had failed to understand the work they needed to do in order to solve a "simple" problem in electrostatics. Our critical misunderstanding was failing to realize the level of complexity that was built into our own knowledge about physics.
Jason Rosenhouse (2009). The Monty Hall problem. The remarkable story of math’s most contentious brain teaser. Oxford University Press.
Richard E. Snow (1993). Construct validity and constructed-response tests. In Randy Elliott Bennett and William C. Ward, Construction versus choice in cognitive measurement (p. 61-73). Erlbaum.
- R. E. Snow and D. F. Lohman (1989). Implications of cognitive psychology for educational measurement. In Robert L. Linn Educational Measurement (3rd ed., pp. 262-331). New York: American Council on Education and MacMillan.
- Haggai Kupermintz, Vi-Nhuan Le and Richard E. Snow (1999). Construct validation of mathematics achievement: Evidence from interview procedures. CSE Technical Report 493. http://www.cse.ucla.edu/Reports/TECH493.PDF [dead link? 1-2009]
U.S. Supreme Court. Regents of the University of Michigan v. Ewing, 474 U.S. 214 (1985). html
- "The record unmistakably demonstrates that the decision to dismiss respondent was made conscientiously and with careful deliberation, based on an evaluation of his entire academic career at the University, including his singularly low score on the NBME Part I examination. The narrow avenue for judicial review of the substance of academic decisions precludes any conclusion that such decision was such a substantial departure from accepted academic norms as to demonstrate that the faculty did not exercise professional judgment"
Sandra J. Thompson, Christopher J. Johnstone and Martha L. Thurlow (2002). Universal Design Applied to Large Scale Assessments. The National Center on Educational Outcomes html
- From the summary: ... seven elements of universally designed assessments are identified and described in this paper. The seven elements are:
- Inclusive assessment population
- Precisely defined constructs
- Accessible, non-biased items
- Amenable to accommodations
- Simple, clear, and intuitive instructions and procedures
- Maximum readability and comprehensibility
- Maximum legibility
Each of the elements is explored in this paper. Numerous resources relevant to each of the elements are identified, with specific suggestions for ways in which assessments can be designed from the beginning to meet the needs of the widest range of students possible.
Andrew Watts (2006). Fostering communities of practice in examining. A rationale for developing the use of new technologies in support of examiners. Cambridge Assessment Network site pdf
- abstract Examiners and assessors who work in teams to judge the quality of students' work in examinations, or of trainees' performance in assessments of competence, are frequently described as working in ‘communities of practice'. Following Wenger (Communities of Practice: Learning, Meaning and Identity. 1998), this concept is used to describe the way examiners acquire their craft and maintain their competence in it. This paper discusses some of the literature about the place of communities of practice in examining, and seeks to clarify the rationale for them. It does this in the light of the significant changes taking place because of the introduction of new technologies to examining. The concept of communities of practice has often been put forward as a description of the strategies and procedures which lead to reliable marking. The use of e-technology could support such an aim. The paper argues that, at the same time, the necessity of fostering communities of practice to provide a context for valid assessments, can also be supported by new technologies.
Dennie Palmer Wolf (1993). Assessment as an episode of learning. In Bennett and Ward (1993, 213-240).
- [There is no online version of this paper/chapter. Dennie, please repair this.]
Afra Zomorodian (1998). The Monty Hall problem. (unpublished?) pdf
- introduction This is a short report about the infamous “Monty Hall Problem.” The report contains two solutions to the problem: an analytic and a numerical one. The analytic solution will use probability theory and corresponds to a mathematician’s point of view in solving problems. The numerical solution simulates the problem on a large scale to arrive at the solution and therefore corresponds to a computer scientist’s point of view.
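The numerical approach Zomorodian describes - simulate the game many times and count wins - can be sketched in a few lines. The following is my own illustrative Python version, not Zomorodian's code; the host's choice of goat door is simplified to be deterministic, which does not affect the win probabilities of the two strategies.

```python
import random

def monty_hall(trials: int = 100_000, switch: bool = True, seed: int = 1) -> float:
    """Estimate the win probability of the stay/switch strategies by simulation."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)   # door hiding the car
        pick = rng.randrange(3)  # contestant's first pick
        # Host opens a goat door that is neither the pick nor the car.
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Switch to the one remaining closed door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials
```

Running this, switching wins about two thirds of the time and staying about one third, in line with the analytic solution.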
Principles for Fair Student Assessment Practices for Education in Canada pdf
Denise Jarrett and Robert McIntosh (2000). Teaching mathematical problem solving: Implementing the vision. A literature review. pdf
- "This document reviews recent research and literature on the essential traits and processes of teaching and learning mathematics through open-ended problem solving. The literature and research on effective problem solving informed the design of the NWREL Mathematics Problem-Solving Model™."
Douglas Fuchs and Lynn S. Fuchs (1986). Test procedure bias: A meta-analysis of examiner familiarity effects. Review of Educational Research, 56, 243-262.
- "In the typical study, the effect of examiner familiarity raised test performance by .28 standard deviations."
Bas Braams website http://www.math.nyu.edu/mfdd/braams/links/
- Review of PISA Sample Science Unit 1: Stop That Germ, November 29, 2004. html
- Review of PISA Sample Science Unit 2: Peter Cairney, November 29, 2004. html
- Comments on the June, 2003, New York Regents Math A Exam. html
- Mathematics in the OECD PISA Assessment. html
- OECD PISA: Programme for International Student Assessment. html
FairTest The National Center for Fair & Open Testing site of the FairTest Examiner journal.
Het Examenblad http://www.eindexamen.nl/
- is the official website about the exit exams in secondary education in the Netherlands.
- There has been a lot of fuss in our little country about teaching in English instead of in Dutch, yet this important website is entirely in Dutch.
- On the site one may find all important documents:
- the law and all regulations following from it,
- the educational goals tested for,
- the tests themselves,
- the correction prescriptions,
- the corrections on the prescriptions because of errors in the examination papers,
- and the grading on the curve that is practised in the Netherlands ('omzettingstabel normering') using a method that is not known publicly.
- Yes, what is missing is the documentation of the design of the examination questions themselves. Well, let us say it is thought not to be in the best interests of the institutions involved to be candid about the design of the questions. A state of affairs that leaves something to be desired.
- true-false format - TF - AC (Osterho, 1999) - two-choice - binary choice; answer categories: yes or no, right or wrong, true or false, correct or incorrect
response changes - answer changing: changing the answer given in the first reading of the MC test. [Reile and Briggs, JEP 1952: Should Students Change Their Initial Answers on Objective-type Tests?]
instructions, and their effects [Prieto and Delgado, 1999]
shift error, shift error detection: a student making a mistake on the scoring sheet might result in all following questions getting wrongly marked; how is this statistically detected? [Shift Error Detection in Standardized Exams - Skiena and Sumazin (2000), pdf files]
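The core idea behind such detection can be sketched simply: if a student skipped one line on the answer sheet, re-aligning the tail of the answer vector against the key should raise the match count sharply. The toy Python function below illustrates that idea only; it is not the statistical test of Skiena and Sumazin, which also controls for chance alignments.

```python
def detect_skip(answers, key):
    """Find the position j at which assuming a one-question skip
    (answers from j on matched against key[j+1:]) most improves the
    match count over straight scoring. Returns (j, gain); gain 0 and
    j None mean no shift hypothesis beats straight scoring.
    A toy sketch, not the Skiena-Sumazin procedure."""
    base = sum(a == k for a, k in zip(answers, key))
    best_j, best_gain = None, 0
    for j in range(len(answers)):
        shifted = sum(a == k for a, k in zip(answers[:j], key[:j]))
        shifted += sum(a == k for a, k in zip(answers[j:], key[j + 1:]))
        gain = shifted - base
        if gain > best_gain:
            best_j, best_gain = j, gain
    return best_j, best_gain
```

A large gain at some position j is evidence that the student skipped question j and shifted every later answer up one box; a real implementation would have to decide how large a gain is statistically convincing.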
item order, item sequence effects: easy items first? random order?
cheating: detecting cheating on MC tests