Publication in preparation
Forced guessing is unfair to students
Ben Wilbrink
Forced guessing demonstrably leads, in a certain number of cases, to unreasonable fail decisions, which can therefore be contested in law. Students who want to guard against such an unreasonable fail decision can put their initials next to every question answered by forced guessing on the answer sheet they hand in, so that the forced guessing can be proven afterwards.
wrong guessing or wrong knowing, a distinction seldom or never made in the psychometric literature
There is yet another problem with forced guessing: that of the correction for guessing, which usually assumes that a wrong answer is the result of wrong guessing, not of wrong knowing. I still have to investigate this, among other things on the basis of the literature listed on this page, but a first sample from that literature shows that none of the authors presenting a formula for the correction for guessing points out that wrong answers may always be the result of wrong knowing. That wrong knowing really is very common in educational practice is immediately clear to anyone who goes looking for it. Just look at the errors made on short open-ended questions. Of course, in answering short open-ended questions a candidate may hazard a guess as well, and wrong guessing then occurs too. I have not yet worked out this theme in Toetsvragen ontwerpen, but I have in the SPA model. That is to say, I have not yet had the opportunity to treat the subject in the text of chapter one, the generator, but a mathematical model has been worked out and applied in applet 1m. There is a direct relation to the fairness of forced guessing, and that too I still have to work out below: failing to recognize that answers may be known wrongly leads to an overestimation of the number of questions that were guessed, and thus possibly to too low an estimate of the number of questions known correctly. An interesting question then is how to weigh the number of questions known correctly against the number known wrongly, for example in a situation in which there is hardly any guessing because candidates make use of a bonus rule.
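A minimal numerical sketch of this point (my own illustration: the numbers are hypothetical, and the classical correction formula known = R - W/(k-1) is assumed):

    # Hypothetical example: 40 four-choice items. The examinee knows 24 items correctly,
    # 'knows' 8 items wrongly (answers them with full conviction, but wrong), and guesses
    # blindly on the remaining 8. The classical correction for guessing attributes every
    # wrong answer to guessing.
    k = 4                               # alternatives per item
    known_right = 24                    # items known correctly
    known_wrong = 8                     # items 'known' wrongly - no guessing involved
    guessed = 8                         # items guessed blindly
    lucky = guessed / k                 # expected number of lucky guesses: 2
    R = known_right + lucky             # expected number right: 26
    W = known_wrong + guessed - lucky   # expected number wrong: 14
    print(R - W / (k - 1))              # 21.33..., well below the 24 items actually known
                                        # correctly: the wrongly known items are treated as
                                        # guesses, so the number of guesses is overestimated
                                        # and correct knowledge is underestimated.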
wrong guessing or wrong knowing, a distinction seldom or never made in the psychometric literature
For instance, the ball was sitting in front of an open goal in the review article by Ross E. Traub & Y. Raymond Lam (1985). Latent structure and item sampling models for testing. Annual Review of Psychology, 36, 19-48. On p. 39 they give the following discussion.
As described, the model includes three parameters: (a) the probability, say α, that the examinee correctly answers an item from the latent category of items not known, (b) the probability, say β, that the examinee incorrectly answers an item from the category of unknown items, and (c) the proportion, say ζ, of items in the domain that the examinee knows. It is not possible to estimate all three parameters from only the information contained in an examinee's response pattern (of zero and one item scores). [etcetera]
The quoted passage does not contain the necessary 'fourth parameter': the proportion of items in the domain that the examinee knows falsely.
The only place I see where wrong knowing has a place in the literature is where alternatives are chosen in such a way that they are the result of wrong knowing or of faulty reasoning. The problem then remains, under the instruction to guess on questions not known, that it stays unknown whether wrong answers were known wrongly or guessed wrongly. And it remains remarkable that this design practice has no counterpart in the formulas for the correction for guessing.
Wrong guessing and wrong knowing have not yet been distinguished in this web page either. What follows is therefore not incorrect in itself, but it is incomplete. In particular it is important to investigate the consequences of NOT making the distinction, for example for pass-fail decisions. It makes quite a difference whether a wrongly answered question is interpreted as guessed wrongly or as known wrongly. A candidate who does NOT guess has therefore KNOWN the correctly answered questions; a candidate who never makes mistakes but guesses on questions not known will have GUESSED some of the correctly answered items. For setting the pass-fail cut-off it therefore matters a great deal what the tacit assumption about these matters is.
Forced guessing happens under the instruction to answer multiple-choice questions even when they are not known. The reason for that instruction is that otherwise such questions are in any case scored as wrong, so that pupils would put themselves at a disadvantage by not hazarding a guess. A reason for this absurd scoring method is seldom or never given, and from now on it had better be. There is always an excellent alternative available: award for every question left unanswered a bonus that is at least equal to the probability of a correct guess.
Deducting a point for every wrong answer comes down to roughly the same thing as a bonus rule. Questions left unanswered then yield no point, but no penalty point either. The deduction for wrong answers is a method used, for example, in the Maastricht progress tests. On the SAT (The College Board) unanswered questions yield no point, and wrong answers cost a fraction of a point as a penalty (e.g. 1/4 point for five-choice questions) (see e.g. 10 Real SATs. New York: College Entrance Examination Board, 2003; or the practice site
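A small sketch of why the penalty rule and the bonus rule come down to roughly the same thing (my own illustration; it assumes a k-choice item, purely blind guessing, a bonus of 1/k for omitting, and a penalty of 1/(k-1) for a wrong answer):

    # Expected score on a single item the examinee does not know, under three scoring rules.
    def expected_scores(k):
        p = 1.0 / k                             # probability of a lucky blind guess
        number_right = p                        # forced guessing, no penalty
        bonus_omit = p                          # omit and receive a bonus of 1/k
        penalty_guess = p - (1 - p) / (k - 1)   # guess under a penalty of 1/(k-1): 0
        penalty_omit = 0.0                      # omit under penalty scoring
        return number_right, bonus_omit, penalty_guess, penalty_omit

    print(expected_scores(5))                   # (0.2, 0.2, 0.0, 0.0) for five-choice items

The bonus rule and the penalty rule differ only by the constant 1/k per item not known, so they rank examinees in the same way; both remove the expected gain from blind guessing that plain number-right scoring creates.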
The CEEB Test-Taking Approaches: "Omit questions that you really have no idea how to answer."
"But if you can rule out any choice, you probably should guess from among the rest of the choices."
Lewis R. Aiken (1987). Testing with multiple-choice items. Journal of Research and Development in Education, 20 #4, 44-58.
- On p. 48-49 Aiken gives a short overview of the literature on the effects of different options and instructions. It does not yield a convincing picture: the empirical literature Aiken cites is scarce, by then already some 10 to 20 years old, and not unambiguous in its results, among others:
- P. H. Taylor (1966). Study of the effects of instructions in a multiple-choice mathematics test. British Journal of Educational Psychology, 36, 1-6.
- Robert Wood (1976). Inhibited blind guessing: The effect of instructions. Journal of Educational Measurement, 13, 297- (cited by Peter Hassmén and Darwin P. Hunt. (1994) Human Self-Assessment in Multiple-Choice Testing. Journal of Educational Measurement 31, 149-160; and Ben Wilbrink html)
The theme touches both on the strategic preparation for tests and on the design of test questions. For test questions that is especially because of guessing on multiple-choice questions, but keep in mind that guessing also occurs on open-ended questions. For strategic test preparation there is a situation that can be modelled, in which the role of guessing probabilities can be investigated exactly, and it turns out not to be 'neutral' (but then, who would have expected otherwise?).
In this exploratory phase I first give here, in temporal order, the development of the 'idea', together with - anonymised - the main points from e-mail exchanges about that idea. Anyone who wants to add comments is cordially invited to do so.
Correspondence 1 - A model for guessing
In a fascinating correspondence about modelling guessing probabilities in binomial models it turned out that, besides the model familiar from psychometrics, a complex formulation is also possible; of course both mathematical models lead to the same results, and this can be proven, but it is not self-evident from the alternative formulas themselves. For working out the theme of forced guessing this does matter, because otherwise doubt might exist about the simple model.
The usual model is of course that, when there is guessing, the guessing probability is absorbed into the parameter of the binomial model:
-
mastery 60%, probability of a correct guess 0.50; the parameter for the binomial-model-with-guessing is then 0.8. If in doubt, use the applet to see this for yourself.
If m is the mastery of the subject matter and r the probability of a correct guess, then for tests with guessing the binomial model holds with parameter p = m + (1 - m)r.
This is easy to see, at least from the perspective of the student who assumes her mastery to be m. The probability of knowing the next test question to be drawn, or of not knowing it and yet guessing it correctly, equals the formula above.
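As a small check in code (a sketch of my own, not the spa applet itself):

    # Binomial-model-with-guessing: the guessing probability is absorbed into the parameter.
    def p_with_guessing(m, r):
        # probability of a correct answer: the item is known, or not known but guessed right
        return m + (1.0 - m) * r

    print(p_with_guessing(0.60, 0.50))   # 0.8, the example given above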
The alternative, complicated formula results from the perspective of the psychometrician who takes stock of how many questions may have been known and how many guessed correctly, given the scores that can be obtained. There is no need to work that out here.
See the first module of the spa model for the details. In the accompanying applet, which can be run in your browser, the complex alternative model is also available (option 209).
For what follows it is important that this binomial-model-with-guessing is used in the calculations. The inclusion of guessing probabilities in the spa_model was, incidentally, a direct consequence of this e-mail exchange. Earlier versions of the examination model already had guessing built in, but in the conversion from Pascal to Java all the bells and whistles have to be re-implemented step by step. See module 2 of the spa_model for all the details. Guessing as an option has not yet been carried through everywhere in the other modules of the spa model.
For completeness: an infinite variety of models for guessing is of course possible. A well-known variant is the one in which the guessing probability depends on mastery, which is not an unreasonable assumption. For a given mastery this attractive model reduces again to the simple model above.
scientific position on guessing [taken from module 2]
"Guessing is a nuisance in educational assessment. Under all circumstances random influences like guessing on items not known or partially known are harmful, and if possible and feasible should be avoided. Because the core business of education is to educate, it definitely is harmful to teach students that it is perfectly OK to guess on questions one does not know or is not sure of. One approach, without abandoning multiplechoice questions altogether, would be to give the student a constant credit on questions left unanswered. The constant should be chosen so as to give ample credit to partial knowledge."
Step 1. Applying the binomial model
Binomial model for guessing
Application: guessing under pass-fail testing
The following application of the binomial model comes out of the spa project.
guessing under pass-fail scoring
While it is known (Lord & Novick, 1968, p. 304) that guessing, other things being equal, lowers the validity of tests, it is not generally known that guessing heightens the risk of failing under pass-fail scoring for students having satisfactory mastery. The figure shows a typical situation. The test has 40 three-choice items; its cut-off score in the no-guessing condition is 25, in the forced-guessing condition the cut-off score is 30. Test scores for subjects known to have mastery 0.7 have been simulated 1000 times, using the spa-module 1 applet for binomial scores. The remarkable thing is that the probability of failing the 25-score limit is 0.115, while the probability of failing the 30-score limit under forced guessing is 0.165.
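A minimal simulation sketch of this situation (my own reconstruction, not the spa-module 1 applet itself; it assumes binomial scores with the guessing probability absorbed into the parameter, as above):

    # 40 three-choice items, mastery 0.7, guessing probability 1/3,
    # cut-off 25 without guessing, cut-off 30 under forced guessing.
    import random

    def fail_rate(n_items, p_correct, cutoff, runs=100_000):
        # proportion of simulated binomial test scores below the cut-off score
        fails = 0
        for _ in range(runs):
            score = sum(random.random() < p_correct for _ in range(n_items))
            fails += score < cutoff
        return fails / runs

    m, r = 0.7, 1.0 / 3.0
    print(fail_rate(40, m, 25))                 # about 0.115: no guessing, cut-off 25
    print(fail_rate(40, m + (1 - m) * r, 30))   # about 0.165: forced guessing, cut-off 30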
The statistical/simulation model is not strictly necessary to argue the case, of course, but it helps to be able to quantify the argument. Suppose the student is allowed to omit questions she does not know, meaning she will not be punished for this behavior but instead will obtain a bonus of 1/3rd of a point for every question left unanswered. Students having satisfactory mastery will have a reasonable chance to pass the test. Those passing will do so while omitting a certain number of questions. It is perfectly clear that some of these students would fail the test if they nevertheless had to guess on those questions. In the same way, some mastery students initially having failed the test might pass it by guessing luckily. This second group is, however, much smaller than the first one, and its members still have the option to guess. The propensity to guess is higher the lower the expected score on the test; see Bereby-Meyer, Meyer, and Flascher (2002).
The amazing thing about this argument is that I do not know of a place in the literature where it is mentioned. There has of course been a lot of research on guessing, on omissiveness, and on methods to 'correct' for guessing, but none whatsoever on this particular problem. That is remarkable, because students failing a test might claim they have been put at a disadvantage by the scoring rule that answers left open are scored as wrong. This is the kind of problem that should have been mentioned in every edition of the Educational Measurement handbook (its latest edition the 1989 one, edited by Robert L. Linn). Lord & Novick (1968, p. 304) mention the problem of examinees differing widely in their willingness to omit items; the interesting thing here is their warning that requiring every examinee to answer every item in the test introduces "a considerable amount of error in the test scores." The analysis above shows that in the particular situation of pass-fail scoring this added error puts mastery students at a disadvantage, a conclusion Lord and Novick failed to note.
Correspondence 2 - Objections 1.
Step 2 - Forced guessing is contestable
In my personal casuistry of being assessed there comes the moment, somewhere around 1982 I believe, when I set eyes on a preschool test in multiple-choice format. The description on my page 'Beoordeeld! En hoe! Casuïstiek' then gets the following elaboration.
multiple-choice preschooler
This inventory is arranged by ascending school years; the logical start is the kindergarten class. The very first case immediately leads to extensive annotations, which I will each time give in boxes.
My own kindergarten years passed without assessment incidents: playing, walking, stories from the teacher. A generation later that is different, and I see a preschooler come home with multiple-choice worksheets.
It scared the living daylights out of me, and it led to sharp formulations in the draft text of my Toetsvragen schrijven. Thanks to a fruitful discussion with Ad Horsten (IOWO) that imbalance was polished out of the text of chapter 2. What remains is the observation that preschoolers are already being brainwashed into a specific test format, that they are being conditioned into the idea that in our society it is normal simply to call out, say or tick something when you do not know the answer to a question.
This unbelievable phenomenon is without doubt one of the most important developments in 20th-century educational testing: a professional group of psychometricians, foreign to education, has been given the final say. These aliens have judged it necessary that tests should above all be multiple-choice tests and - on purely pragmatic, certainly not scientific, grounds - that pupils must be forced to guess on questions they do not know. See also the notes on this point in my Assessment in historical perspective pdf.
Only in 2006 did I realize that this forced guessing on multiple-choice questions is not merely unnecessary but downright harmful, viewed from the usual criteria for reliability and validity of tests and of decisions based on those tests (APA-Standards / NIP-Richtlijnen). The guessing is of course unnecessary because questions not known can simply be left unanswered; there is no difference of opinion about that. To my surprise it turned out to be easy to prove that forced guessing is disadvantageous for well-prepared students who perform above the borderline between failing and passing scores. That is even easy to see: imagine such a student leaving the questions she does not know open until the last moment of the examination, which is a strongly recommended tactic in taking tests. Because of the forced guessing she then has to run a lottery in the last minute, quickly ticking all the open questions after all - at random, or all the last alternatives, or all the first ones, or what not. Through that lottery a passing result can turn into a failing score on the test. For a thorough treatment see chapter 2 of Toetsvragen ontwerpen, or the English-language spa project (spa_generator.htm, section guessing under pass-fail scoring).
Just think about it. Anyone who takes this complaint to a College van Beroep voor de Examens (examination appeals board) can count on my expert support. NB: on legal grounds it may well be advisable to put a mark on the answer sheet to be handed in, already while answering, next to the questions you are guessing! It should of course not become a game played after the fact. When it comes to it, you can thus prove that a passing score was changed into a failing one solely by forced guessing.
Insofar as this still needs explanation: it follows from the above that guessing is downright harmful and should be banned from education. The problem with this proposition is that the phenomenon is simply not known in the literature, because the accompanying simple analysis and proof has, to my knowledge, never been carried out or published (no, not by Frederic Lord either), and it is therefore not to be found in the text of the APA-Standards or the NIP-Richtlijnen.
Correspondence 3 - Objections 2.
Correspondence 4 - Objections 3.
Literature (still to be examined)
A strong suspicion of mine is that the results of empirical research on guessing, partial knowledge, confidence scoring and a few even more esoteric topics are hard to interpret because the poor quality of the test questions is a confounding factor. Think especially of the use of four-choice questions, for which it has meanwhile been amply demonstrated that at least one of the alternatives is not a properly functioning alternative. In other words: this kind of substandard item, which incidentally also occurs in standardized tests, has 'partial knowledge' built in, even though that is not the kind of knowledge the term suggests.
David Budescu and Maya Bar-Hillel. (1993) To Guess or Not to Guess: A Decision-Theoretic View of Formula Scoring. Journal of Educational Measurement, 30, 277-291 [still to look up]
- abstract [my accents] Multiple-choice tests are often scored by formulas under which the respondent's expected score for an item is the same whether he or she omits it or guesses at random. Typically, these formulas are accompanied by instructions that discourage guessing. In this article, we look at test taking from the normative and descriptive perspectives of judgment and decision theory. We show that for a rational test taker, whose goal is the maximization of expected score, answering is either superior or equivalent to omitting - a fact which follows from the scoring formula. For test takers who are not fully rational, or have goals other than the maximization of expected score, it is very hard to give adequate formula scoring instructions, and even the recommendation to answer under partial knowledge is problematic (though generally beneficial). Our analysis derives from a critical look at standard assumptions about the epistemic states, response strategies, and strategic motivations of test takers. In conclusion, we endorse the number-right scoring rule, which discourages omissions and is robust against variability in respondent motivations, limitations in judgments of uncertainty, and item vagaries.
A. Ben-Simon, D. V. Budescu and B. Nevo (1997). A comparative study of measures of partial knowledge in multiple-choice tests. Applied Psychological Measurement, 21, 65-88. [still to look up] pdf for pay
- abstract [my accents] A common belief among many test experts is that measurements obtained from multiple-choice (MC) tests can be improved by using evidence about partial knowledge. A large number of methods designed to extract such information from direct reports provided by examinees have been developed over the last 50 years. Most methods require modifications in test instructions, response modes, and scoring rules. These testing methods are reviewed and the results of a large-scale empirical study of the most promising among them are reported. Seven testing methods were applied to MC tests from four different content areas using a between-persons design. To identify the most efficient methods and the optimal conditions for their application, the results were analyzed with respect to six different criteria. The results showed a surprisingly large tendency on the part of the examinees to take advantage of the special features of the alternative methods and indicated that, on average, high ability examinees were better judges of their level of knowledge and, consequently, could benefit more from these methods. Systematic interactions were found between the testing method and the test content, indicating that no method was uniformly superior.
Gershon Ben-Shakhar and Yakov Sinai. (1991) Gender Differences in Multiple-Choice Tests: The Role of Differential Guessing Tendencies. Journal of Educational Measurement, 28, 23-35 [still to look up]
- abstract [my accents] The present study focused on gender differences in the tendency to omit items and to guess in multiple-choice tests. It was hypothesized that males would show greater guessing tendencies than females and that the use of formula scoring rather than the use of number of correct answers would result in a relative advantage for females. Two samples were examined: ninth graders and applicants to Israeli universities. The teenagers took a battery of five or six aptitude tests used to place them in various high schools, and the adults took a battery of five tests designed to select candidates to the various faculties of the Israeli universities. The results revealed a clear male advantage in most subtests of both batteries. Four measures of item-omission tendencies were computed for each subtest, and a consistent pattern of greater omission rates among females was revealed by all measures in most subtests of the two batteries. This pattern was observed even in the few subtests that did not show male superiority and even when permissive instructions were used. Correcting the raw scores for guessing reduced the male advantage in all cases (and in the few subtests that showed female advantage the difference increased as a result of this correction), but this effect was small. It was concluded that although gender differences in guessing tendencies are robust they account for only a small fraction of the observed gender differences in multiple-choice tests. The results were discussed, focusing on practical implications.
William H. Angoff. (1989) Does Guessing Really Help? Journal of Educational Measurement, 26, 323-336 [still to look up]
- abstract [my accents] This study examines the claim that attempting, or guessing at, more items yields improved formula scores. Two samples of students who had taken a form of the SAT-Verbal consisting of three parallel half-hour sections were used to form the following scores on each of the three sections: the number of attempts, a guessing index, the formula score, and (indirectly) an approximation to an ability score. Correlations were obtained separately for the two samples between the attempts, and the guessing index, on one section, the formula score on a second section, and ability as measured by the third section. The partial correlations obtained hovered near zero, suggesting, contrary to conventional opinion, that, on average, attempting more items and guessing are not helpful in yielding higher formula scores, and that, therefore, formula scoring is not generally disadvantageous to the student who is less willing to guess and attempt an item that he or she is not sure of. On closer examination, however, it became clear that the advantages of guessing depend, at least in part, on the ability of the examinee. Although the relationship is generally quite weak, it is apparently the case that more able examinees do tend to profit somewhat from guessing, and would therefore be disadvantaged by their reluctance to guess. On the other hand, less able examinees may lower their scores if they guess.
Mark A. Albanese. (1988) The Projected Impact of the Correction for Guessing on Individual Scores. Journal of Educational Measurement, 25, 149-157 [still to look up]
- abstract [my accents] This article presents estimates of the effects of the use of formula scoring on an individual examinee's score. The results of this analysis suggest that under plausible assumptions, using test characteristics derived from several studies, some examinees would increase their scores by one half standard deviation or more if they were to answer items omitted under formula directions.
William H. Angoff and William B. Schrader (1984). A study of hypotheses basic to the use of rights and formula scores. Journal of Educational Measurement, 21, 1-17 [still to look up]
- abstract [my accents] The hypothesis that some students, when tested under formula directions, omit items about which they have useful partial knowledge implies that such directions are not as fair as rights directions, especially to those students who are less inclined to guess. This hypothesis may be called the differential effects hypothesis. An alternative hypothesis states that examinees would perform no better than chance expectation on items that they would omit under formula directions but would answer under rights directions. This may be called the invariance hypothesis. Experimental data on this question were obtained by conducting special test administrations of College Board SAT-verbal and Chemistry tests and by including experimental tests in a Graduate Management Admission Test administration. The data provide a basis for evaluating the two hypotheses and for assessing the effects of directions on the reliability and parallelism of scores for sophisticated examinees taking professionally developed tests. Results support the invariance hypothesis rather than the differential effects hypothesis.
Leonard B. Bliss (1980). A test of Lord's assumption regarding examinee guessing behavior on multiple-choice tests using elementary school students. Journal of Educational Measurement, 17, 147-152
- abstract [ERIC CTM] [my accents] A mathematics achievement test with instructions to avoid guessing wildly was given to 168 elementary school pupils who were later asked to complete all the questions using a differently colored pencil. Results showed examinees, particularly the more able students, tend to omit too many items.
Rand R. Wilcox (1982). Some new results on an answer-until-correct scoring procedure. Journal of Educational Measurement, 19, 67-74
- abstract [ERIC Author/GK] [my accents] A new model for measuring misinformation is suggested. A modification of Wilcox's strong true-score model, to be used in certain situations, is indicated, since it solves the problem of correcting for guessing without assuming guessing is random.
Rand R. Wilcox (1979). Achievement tests and latent structure models. British Journal of Mathematical and Statistical Psychology, 32, 61-71. abstract, bundled with:
Ivo W. Molenaar (1981). On Wilcox's latent structure model for guessing. British Journal of Mathematical and Statistical Psychology, 34, 224-228. abstract
Rand R. Wilcox (1981). Methods and recent advances in measuring achievement: A response to Molenaar. British Journal of Mathematical and Statistical Psychology, 34, 229-237. abstract
One observation: Wilcox and Molenaar do not consider any examinee 'knowing a wrong answer'. The exact formulation is somewhat ambiguous, however (Molenaar, p. 224): "The examinee does not know and gives the incorrect response." Is guessing meant here? 'Knowing a wrong answer' might come about in many ways, for example by misreading the question, by an error in calculation, or by forgetting an essential solution step such as translating a calculation result in some way or other.
[articles from the list in 79toetsen.cowo.rtfd : ]
Abu-Sayf, F.K. The scoring of multiple-choice tests: a closer look. Educational Technology 1979, june, 515.
Bejar, I.I. & Weiss, D.J. A comparison of empirical differential option weighting scoring procedures as a function of interitem correlation. EPM 1977, 37, 335-340.
Borgesius, T.G. Een empirisch onderzoek naar het correctie voor raden scoringssysteem. Nijmegen, Instituut voor Onderzoek van het Wetenschappelijk Onderwijs, K.U. Nijmegen. 1978.
Claudy, J.G. Biserial weights: a new approach to test item option weighting. APM 1978, 2, 25-30.
Diamond, J. & Evans, W. The correction for guessing. RER 1973, 43, 181-192.
Duncan, G.T. & Milton, E.O. Multiple-answer multiple-choice test items: responding and scoring through Bayes and minimax strategies. Pm 1978, 43, 43-57.
Echternacht, G. The variances of empirically derived option scoring weights. EPM 1975, 35, 307-311.
Gibbons, J.D., Olkin, I. & Sobel, M. A subset selection technique for scoring items on a multiple choice test. Pm 1979, 44, 259-278.
Lord, F.M. Formula scoring and number right scoring. JEM 1975, 12, 7-12.
Molenaar, W. On Bayesian formula scores for random guessing in multiple choice tests. BrJMSP 1977, 30, 79-89.
Slakter, M.J., Crehan, K.D. & Koehler, R.A. Longitudinal studies on risk taking on objective examinations. EPM 1975, 35, 97-105.
Thorndike, R.L. The problem of guessing. In Thorndike 1971, 59-61.
Wilcox, R.R. Achievement tests and latent structure models. BrJMSP 1979, 32, 61-71.
Rand R. Wilcox and Karen Thompson Wilcox. (1988) Models of decisionmaking processes for multiple-choice test items: An analysis of spatial ability. Journal of Educational Measurement, 25, 125-136
- abstract [my accents] Latent class models of decisionmaking processes related to multiple-choice test items are extremely important and useful in mental test theory. However, building realistic models or studying the robustness of existing models is very difficult. One problem is that there are a limited number of empirical studies that address this issue. The purpose of this paper is to describe and illustrate how latent class models, in conjunction with the answer-until-correct format, can be used to examine the strategies used by examinees for a specific type of task. In particular, suppose an examinee responds to a multiple-choice test item designed to measure spatial ability, and the examinee gets the item wrong. This paper empirically investigates various latent class models of the strategies that might be used to arrive at an incorrect response. The simplest model is a random guessing model, but the results reported here strongly suggest that this model is unsatisfactory. Models for the second attempt of an item, under an answer-until-correct scoring procedure, are proposed and found to give a good fit to data in most situations. Some results on strategies used to arrive at the first choice are also discussed
Rand R. Wilcox. (1987) Confidence Intervals for True Scores Under an Answer-Until-Correct Scoring Procedure. Journal of Educational Measurement 24:3, 263-269
- abstract Under an answer-until-correct scoring procedure, many measurement problems can be solved when certain cognitive models of examinee behavior can be assumed (Wilcox, 1983). Point estimates of true score under these models are available, but the problem of obtaining a confidence interval has never been addressed. Two simple methods for obtaining a confidence interval are suggested that give good results when the sample size is reasonably large, say, greater than or equal to 20, and when true score is not too close to zero or one. A third procedure is suggested that can also be used to get slightly better results where again the sample size is assumed to be reasonably large and true score is not too close to zero or one. For small sample sizes or situations where true score is close to zero or one, a fourth procedure is described that always gives conservative results.
Muijtjens, H van Mameren, Hoogenboom, Evers & C P M van der Vleuten. (1999) The effect of a 'don't know' option on test scores: number-right and formula scoring compared. Medical Education 33:4, 267-275
- abstract In multiple-choice tests using a 'don't-know' option the number of correct minus incorrect answers was used as the test score (formula scoring) in order to reduce the measurement error resulting from random guessing. In the literature diverging results are reported when comparing formula scoring and number-right scoring, the scoring method without the don't-know option.
To investigate which method was most appropriate, both scoring methods were used in true - false tests (block tests) taken at the end of a second- and third-year educational module (block). The students were asked to answer each item initially by choosing from the response options true, false or don't know, and secondly to replace all don't-know answers by a true - false answer.
Setting Maastricht University, The Netherlands. Subjects Medical students.
The correct scores for the don't-know answered items were found to be 4·5% and 5·9%, respectively, higher than expected with pure random guesswork. This represents a source of bias with formula scoring, because students who were less willing to guess obtained lower scores. The average difference in the correct minus incorrect score for the two scoring methods (2·5%, P < 0·001, and 3·4%, P < 0·001, respectively) indicates the size of the bias (compare: the standard deviation of the score equals 11%). Test reliability was higher with formula scoring (0·72 vs. 0·66 and 0·74 vs. 0·66), but the difference decreased when the test was restricted to items which were close to the core content of the block (0·81 vs. 0·77, resp. 0·75 vs. 0·70).
In deciding what scoring method to use, less bias (number-right scoring) has to be weighed against higher reliability (formula scoring). Apart from these psychometric reasons educational factors must be considered.
Richard F Burton. (2002) Misinformation, partial knowledge and guessing in true/false tests. Medical Education 36:9, 805-811
- abstract Context Examiners disagree on whether or not multiple choice and true/false tests should be negatively marked. Much of the debate has been clouded by neglect of the role of misinformation and by vagueness regarding both the specification of test types and 'partial knowledge' in relation to guessing. Moreover, variations in risk-taking in the face of negative marking have too often been treated in absolute terms rather than in relation to the effect of guessing on test unreliability.
Objectives This paper aims to clarify these points and to compare the ill-effects on test reliability of guessing and of variable risk-taking.
Methods Three published studies on medical students are examined. These compare responses in true/false tests obtained with both negative marking and number-right scoring. The studies yield data on misinformation and on the extent to which students may fail to benefit from distrusted partial knowledge when there is negative marking. A simple statistical model is used to compare variations in risk-taking with test unreliability due to blind guessing under number-right scoring conditions.
Conclusions Partial knowledge should be least problematic with independent true/false items. The effect on test reliability of blind guessing under number-right conditions is generally greater than that due to the over-cautiousness of some students when there is negative marking.
Lawrence H. Cross and Robert B. Frary. (1977) An Empirical Test of Lord's Theoretical Results Regarding Formula Scoring of Multiple-Choice Tests. Journal of Educational Measurement 14:4, 313-321
Robert B. Frary. (1989) The Effect of Inappropriate Omissions on Formula Scores: A Simulation Study. Journal of Educational Measurement 26:1, 41-53
- abstract Responses to a 50-item, four-choice test were simulated for 1,000 examinees under conventional formula-scoring instructions. One hundred ninety-two simulation runs reflected variations in the average level of item difficulty, the extent to which examinees tended to omit inappropriately (when the formula-scoring directions recommended guessing), the extent to which they were misinformed (classified correct answers as distractors), the extent to which they guessed contrary to the formula-scoring instructions, the extent to which examinee ability and tendency to omit inappropriately were correlated, the examinee ability level at which misinformation was most prevalent, and the extent to which item difficulty was related to the probability that an examinee would be misinformed. For each examinee, formula scores and expected formula scores were determined allowing and not allowing inappropriate omissions. Under certain conditions, failure to guess as recommended by the formula-scoring instructions produced nontrivial proportions of examinees with expected score losses of one or more points. These conditions were a test of at least moderate difficulty, a low level for the tendency to be misinformed, and at least a moderate level for the tendency to omit inappropriately.
Robert Wood. (1976) Inhibiting Blind Guessing: The Effect of Instructions. Journal of Educational Measurement 13:4, 297-307
Frederic M. Lord. (1975) Formula Scoring and Number-Right Scoring. Journal of Educational Measurement 12:1, 7-11
Glenn L. Rowley and Ross E. Traub. (1977) Formula Scoring, Number-Right Scoring, and Test-Taking Strategy. Journal of Educational Measurement 14:1, 15-22
A. Ralph Hakstian and Wanlop Kansup. (1975) A Comparison of Several Methods of Assessing Partial Knowledge in Multiple-Choice Tests: II. Testing Procedures. Journal of Educational Measurement 12:4, 231-239
Wanlop Kansup and A. Ralph Hakstian. (1975) A Comparison of Several Methods of Assessing Partial Knowledge in Multiple-Choice Tests: I. Scoring Procedures. Journal of Educational Measurement 12:4, 219-230
Roger A. Koehler. (1974) Overconfidence on Probabilistic Tests. Journal of Educational Measurement 11:2, 101-108
- abstract A potentially valuable measure of overconfidence on confidence response multiple-choice tests was evaluated. The measure of overconfidence was based on probabilistic responses to seven nonsense items embedded in a 33 item vocabulary test. The test was administered under both confidence response and conventional choice response directions to 208 undergraduate educational psychology students. Measures of vocabulary knowledge based on confidence and choice responses, overconfidence, and risk-taking propensity were obtained. The results indicated that overconfidence was significantly related in a negative direction to confidence response vocabulary scores and essentially unrelated to choice response vocabulary scores. A moderate correlation was found between overconfidence and risk-taking propensity. However, the scatter plot for these measures showed that this relationship may have been spurious.
Glenn L. Rowley. (1974) Which Examinees Are Most Favoured by the Use of Multiple Choice Tests? Journal of Educational Measurement 11:1, 15-23
- abstract Scores were obtained from 198 ninth grade students on achievement motivation, test anxiety, testwiseness, and risktaking. Tests in mathematics and vocabulary were constructed in free response and multiple choice form, and administered to the subjects in that order, with an interval of 5 weeks between administrations. Partial correlations were computed between scores on the multiple choice tests and achievement motivation, test anxiety, testwiseness, and risktaking, with free response scores partialled out. The partial correlations were corrected for the unreliability in the free response scores, and tested for significance. All partials involving achievement motivation and test anxiety were nonsignificant, as were all partials based on mathematics scores. The partial correlations of vocabulary scores with testwiseness and risktaking were significant without exception. It was concluded that the use of multiple choice tests can favour certain examinees - those who are highly testwise and willing to take risks in the test situation. It was noted that the extent to which these examinees were favoured was dependent on the nature of the test, and that a verbal test seemed more susceptible than a numerical test.
Michael C. Rodriguez. (2003) Construct Equivalence of Multiple-Choice and Constructed-Response Items: A Random Effects Synthesis of Correlations. Journal of Educational Measurement 40:2, 163-184
- abstract A thorough search of the literature was conducted to locate empirical studies investigating the trait or construct equivalence of multiple-choice (MC) and constructed-response (CR) items. Of the 67 studies identified, 29 studies included 56 correlations between items in both formats. These 56 correlations were corrected for attenuation and synthesized to establish evidence for a common estimate of correlation (true-score correlations). The 56 disattenuated correlations were highly heterogeneous. A search for moderators to explain this variation uncovered the role of the design characteristics of test items used in the studies. When items are constructed in both formats using the same stem (stem equivalent), the mean correlation between the two formats approaches unity and is significantly higher than when using non-stem-equivalent items (particularly when using essay-type items). Construct equivalence, in part, appears to be a function of the item design method or the item writer's intent.
Niall Bolger and Thomas Kellaghan. (1990) Method of Measurement and Gender Differences in Scholastic Achievement. Journal of Educational Measurement 27:2, 165-174
- abstract Gender differences in scholastic achievement as a function of method of measurement were examined by comparing the performance of 15-year-old boys (N = 739) and girls (N = 758) in Irish schools on multiple-choice tests and free-response tests (requiring short written answers) of mathematics, Irish, and English achievement. Males performed significantly better than females on multiple-choice tests compared to their performance on free-response examinations. An expectation that the gender difference would be larger for the languages and smaller for mathematics because of the superior verbal skills attributed to females was not fulfilled.
Mark G. Simkin and William L. Kuechler. (2005) Multiple-Choice Tests and Student Understanding: What Is the Connection?. Decision Sciences Journal of Innovative Education 3:1, 73-98
- abstract Instructors can use both "multiple-choice" (MC) and "constructed response" (CR) questions (such as short answer, essay, or problem-solving questions) to evaluate student understanding of course materials and principles. This article begins by discussing the advantages and concerns of using these alternate test formats and reviews the studies conducted to test the hypothesis (or perhaps better described as the hope) that MC tests, by themselves, perform an adequate job of evaluating student understanding of course materials. Despite research from educational psychology demonstrating the potential for MC tests to measure the same levels of student mastery as CR tests, recent studies in specific educational domains find imperfect relationships between these two performance measures. We suggest that a significant confound in prior experiments has been the treatment of MC questions as homogeneous entities when in fact MC questions may test widely varying levels of student understanding. The primary contribution of the article is a modified research model for CR/MC research based on knowledge-level analyses of MC test banks and CR question sets from basic computer language programming. The analyses are based on an operationalization of Bloom's Taxonomy of Learning Goals for the domain, which is used to develop a skills-focused taxonomy of MC questions. However, we propose that their analyses readily generalize to similar teaching domains of interest to decision sciences educators such as modeling and simulation programming.
Michael O'Leary. (2002) Stability of Country Rankings Across Item Formats in the Third International Mathematics and Science Study. Educational Measurement: Issues and Practice 21:4, 27-38
- abstract Multiple-choice, short-answer, and extended-response item formats were used in the Third International Mathematics and Science Study to assess student achievement in mathematics and science at Grades 7 and 8 in more than 40 countries around the world. Data pertaining to science indicate that the standings of some countries relative to others change when performance is measured via the different item formats. The question addressed in the present article is the following: Can the instability of ranks in this case be attributed principally to item format, or are other important factors at work? It is argued that the findings provide further evidence that comparing student achievement across countries is a very complex undertaking indeed.
Brent Bridgeman and Charles Lewis. (1994) The Relationship of Essay and Multiple-Choice Scores With Grades in College Courses. Journal of Educational Measurement 31:1, 37-50
- abstract Essay and multiple-choice scores from Advanced Placement (AP) examinations in American History, European History, English Language and Composition, and Biology were matched with freshman grades in a sample of 32 colleges. Multiple-choice scores from the American History and Biology examinations were more highly correlated with freshman grade point averages than were essay scores from the same examinations, but essay scores were essentially equivalent to multiple-choice scores in correlations with course grades in history, English, and biology. In history courses, men and women received comparable grades and had nearly equal scores on the AP essays, but the multiple-choice scores of men were nearly one half of a standard deviation higher than the scores of women.
Malcolm J. Slakter. (1968) The Effect of Guessing Strategy on Objective Test Scores. Journal of Educational Measurement 5:3, 217-222
Ziller, R.C. Measure of the gambling response set in objective tests, Psychometrika, 1957, 22, 289-292.
Ahlgren, A. Confidence on achievement tests and the prediction of retention. Unpublished doctoral dissertation, Harvard University, 1967. [no idea where I could get hold of this]
- 1969 Remarks delivered in the symposium "Confidence on Achievement Tests -- Theory, Applications" at the 1969 meeting of the AERA and NCME. [Ms assembled by A.R. Gardner-Medwin from text & figures at http://www.p-mmm.com/founders/AhlgrenBody.htm, Aug 2005]: RELIABILITY, PREDICTIVE VALIDITY, AND PERSONALITY BIAS OF CONFIDENCE-WEIGHTED SCORES
-
Apr 28 2006 - NSTA Reports Online Exclusive
NSTA mourns the passing on April 23 of prominent U.S. science educator Andrew Ahlgren, former associate director of the American Association for the Advancement of Science’s (AAAS) Project 2061 scientific literacy initiative. Ahlgren, who was also a high-school physics teacher, a key member of Harvard University’s Project Physics team during the 1960s, and a University of Minnesota physics and education professor, coauthored (with F. James Rutherford, former chief education officer of AAAS and director of Project 2061) the seminal 1990 book Science for All Americans (SFAA), which explored scientific literacy in modern society and the steps the United States could take to begin reforming its education system in science, math, and technology.
A.R. Gardner-Medwin (2005). Enhancing Learning and Assessment Through Confidence-Based Marking.
- Abstract accepted for a paper to the 1st International Conference on "Enhancing Teaching and Learning through Assessment", Hong Kong 12-15 June, 2005
- "Confidence-based marking (CBM) has been known for many years to stimulate reflection and constructive thinking by students, and to improve both the reliability and validity of exam data in measuring partial knowledge."
-
"At UCL [University College London], in our medical course, we have used a simple yet theoretically sound, version of CBM for ten years now, including 4 years experience of use in summative exams."
-
" Our exam data (obtained with optical mark reader cards: Speedwell Computing Ltd.) permits comparison of CBM scores with conventional (number-correct) scores. This has revealed marked improvements of the standard Cronbach alpha measurement of reliability, from 0.873 ± 0.012 to 0.925 ± 0.007 with CBM (mean ± SEM in 6 exams, P<0.001). This is an improvement that would require approximately 80% more questions in an exam to be achieved by reducing random variance with conventional marking. In both qualitative and quantitative ways there seem to be such clear advantages to the use of CBM that wider adoption and evaluation by the teaching and learning community would seem well merited."
-
The last claim calls for an analytical test. Assuming the improvement in Cronbach's alpha to be correct, the two issues here are: what would the reliability be under a straightforward bonus option, and what would it be if the freed time were used to pose a number of extra questions, say 20% more - assuming the confidence marking takes 20% of the student's time. b.w.
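A quick analytical check of the '80% more questions' equivalence claimed in the quotation, and of what 20% more conventional items would buy, using the Spearman-Brown prophecy formula (a sketch of my own; parallel extra items are assumed):

    # Spearman-Brown: test-length factor needed to raise reliability from a_old to a_new,
    # and reliability after lengthening a test by a given factor.
    def lengthening_factor(a_old, a_new):
        return a_new * (1 - a_old) / (a_old * (1 - a_new))

    def lengthened_alpha(a_old, factor):
        return factor * a_old / (1 + (factor - 1) * a_old)

    print(lengthening_factor(0.873, 0.925))   # about 1.8, i.e. roughly 80% more questions
    print(lengthened_alpha(0.873, 1.2))       # about 0.89: 20% more conventionally scored items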
-
"Our students like CBM a lot, finding it more searching in identifying their areas of weakness or misconception." I do not quite know what to think of this, for at least three reasons. I do not trust statements of student satisfaction. I do not quite see how this kind of self-reflection could be part of item design. I suspect bad item design would promote this kind of satisfaction with CBM; reverse this: I suspect good item design would do away with CBM, and therefore with the satisfaction with CBM. This abstract does not contain the information needed to answer my suspicions. In the case of formative testing i do not have much of a quarrel on these points. In summative testing, CBM is a tricky concept. b.w.
-
Gardner-Medwin himself is somewhat doubtful of CBM; let us say he is only level 2 confident. In the very first paragraph he remarks: "However, CBM has been adopted in very few places and is sometimes, rather surprisingly, regarded with scepticism." He does not follow this up by referring to later - more critical - publications on the subject. One of the criticisms is that there is a personality factor involved - propensity to take risks - making CBM less fair to some students. On this point, however, Gardner-Medwin claims the data do not prove the existence of such a bias in the UCL situation.
-
Other publications available here.
-
A.R. Gardner-Medwin (2006). Confidence-Based Marking - towards deeper learning and better exams. [Draft for Chapter 12 in : Bryan C and Clegg K (eds) (2006) Innovative Assessment in Higher Education, Taylor and Francis Group Ltd, London] doc
Good, I.J. (1979) 'Proper Fees' in multiple choice examinations. Journal of Statistical Computation and Simulation, 9, 164-165.
Margo G.H. Jansen (1993). Review of Ability, Partial Information, Guessing: Statistical Modelling Applied to Multiple-Choice Tests by T. P. Hutchinson. Psychometrika, 58, 513-514.
C. Horace Hamilton (1950). Bias and error in multiple-choice tests. Psychometrika, 15, 151-168.
Samuel B. Lyerly (1951). A note on correcting for chance success in objective tests. Psychometrika, 16, 21-30.
Lynnette B. Plumlee (1952). The effect of difficulty and chance success on item-test correlation and on test reliability. Psychometrika, 17, 69-86.
Lynnette B. Plumlee (1954). The predicted and observed effect of chance success on multiple-choice test validity. Psychometrika, 19, 65-70.
Vera T. Brownless (with J. A. Keats) (1958). A retest method of studying partial knowledge and other factors influencing item response. Psychometrika, 23, 67-73.
John A. Keats (with V. T. Brownless) (1958). A retest method of studying partial knowledge and other factors influencing item responses. Psychometrika, 23, 67-73.
Robert C. Ziller (1957). A measure of the gambling response-set in objective tests. Psychometrika, 22, 289-292.
Hassmen P, Hunt DP (1994) Human self-assessment in multiple-choice testing. Journal of Educational Measurement 31, 149-160.
- abstract Research indicates that the multiple-choice format in itself often seems to favor males over females. The present study utilizes a method that enables test takers to assess the correctness of their answers. Applying this self-assessment method may not only make the multiple-choice tests less biased but also provide a more comprehensive measurement of usable knowledge - that is, the kind of knowledge about which a person is sufficiently sure so that he or she will use the knowledge to make decisions and take actions. The performance of male and female undergraduates on a conventional multiple-choice test was compared with their performance on a multiple-choice self-assessment test. Results show that the difference between test scores of males and those of females was reduced when subjects were allowed to make self-assessments. This may be explained in terms of the alleged difference in cognitive style between the genders.
Archer, N.S. A comparison of the conventional and two modified procedures for responding to multiple-choice items with respect to test reliability, validity, and item characteristics. Unpublished doctoral dissertation, Syracuse University, 1962. [no idea where I could get hold of this]
Alpert, R., & Haber, R.N. Anxiety in academic achievement situations. J. abnorm. & soc. Psychol., 1960, 61, 207-215.
Rippey, R. Probabilistic Testing. J. Educ. Measmt., 1968, 5, 211-215
Nedelsky, L. Ability to avoid gross error as a measure of achievement. Educ. Psychol. Measmt., 1954, 14, 459-472.
Hevner, K. Method for correcting for guessing and empirical evidence to support. J. Soc. Psych., 1932, 3, 359-362.
Gritten, F., & Johnson, D.M. Individual-differences in judging multiple-choice questions. J. Educ. Psychol., 1941, 32, 423-430.
Ebel, R.L. Confidence weighting and test reliability. J Educ. Measmt., 1965, 2, 49-57.
Coombs, C.H., Milholland, J.E. & Womer, F.B. The assessment of partial knowledge. Educ. Psychol.Measmt. 1956, 16, 13-37.
Patrick Sturges, Nick Allum, Patten Smith and Anna Woods (2004?). The Measurement of Factual Knowledge in Surveys. pdf
- Experimental research comparing 'don't know,' guessing, and 'best guess' after first having chosen 'don't know.'
-
Note that the 'don't know' option is a variant of the bonus point option, or penalty point scoring, as alternatives for forced guessing.
-
The relevant issue in this research is the strength of the 'propensity to guess' concept: do subgroups really differ in propensity to guess - in which case forced guessing might be used to counteract this tendency - or does forced guessing add so much noise that it should not be used anyway?
-
This research is in reaction to the work of Mondak and colleagues, see below.
Jeffery J. Mondak and Damarys Canache (2004). Knowledge Variables in Cross-National Social Inquiry. Social Science Quarterly, 85, 539-558.
- This is an interesting exercise, because Mondak and Canache think that validity of questionnaires will be higher when forced guessing forces respondents to equally capitalize on their partial knowledge. This goes against the grain of the psychometric view that guessing adds noise.
- abstract This article examines the impact of “don't know” responses on cross-national measures of knowledge regarding science and the environment. Specifically, we explore cross-national variance in aggregate knowledge levels and the gender gap in knowledge in each of 20 nations to determine whether response-set effects contribute to observed variance.
Analyses focus on a 12-item true-false knowledge battery asked as part of a 1993 International Social Survey Program environmental survey. Whereas most research on knowledge codes incorrect and “don't know” responses identically, we differentiate these response forms and develop procedures to identify and account for systematic differences in the tendency to guess.
Substantial cross-national variance in guessing rates is identified, variance that contributes markedly to variance in observed “knowledge” levels. Also, men are found to guess at higher rates than women, a tendency that exaggerates the magnitude of the observed gender gap in knowledge.
Recent research has suggested that “don't know” responses pose threats to the validity of inferences derived from measures of political knowledge in the United States. Our results indicate that a similar risk exists with cross-national measures of knowledge of science and the environment. It follows that considerable caution must be exercised when comparing data drawn from different nations and cultures. - You will have to pay for a pdf file
Bereby-Meyer, Y., J. Meyer, and O.M. Flascher (2002). Prospect theory analysis of guessing in multiple choice tests. Journal of Behavioral Decision Making, 15, 313-327.
- Abstract The guessing of answers in multiple choice tests adds random error to the variance of the test scores, lowering their reliability. Formula scoring rules that penalize for wrong guesses are frequently used to solve this problem. This paper uses prospect theory to analyze scoring rules from a decision-making perspective and focuses on the effects of framing on the tendency to guess. In three experiments participants were presented with hypothetical test situations and were asked to indicate the degree of certainty that they thought was required for them to answer a question. In accordance with the framing hypothesis, participants tended to guess more when they anticipated a low grade and therefore considered themselves to be in the loss domain, or when the scoring rule caused the situation to be framed as entailing potential losses. The last experiment replicated these results with a task that resembles an actual test. (Used by Burgos, 2004)
Albert Burgos (2004). Guessing and gambling. Economics Bulletin, 4, No. 4 pp. 1-10. pdf
- The Burgos case is that of multiple-choice testing where the student may either leave unanswered the questions she is uncertain about or does not know the answer to, or guess the answer. This is a problematic case where the student has partial knowledge and at the same time is risk averse: the achievement test then becomes to some extent a test of personality. In the Netherlands this kind of situation is usually avoided by forcing students always to answer test items, if need be by guessing. In the US the GRE and the SAT follow different rules: the GRE counts the number correct (students therefore should mark all items), the SAT penalizes wrong answers (students may leave questions unmarked). Nevertheless, the article is quite insightful where it comes to problems of guessing on achievement test items, a problem not, of course, unique to the multiple-choice format.
-
abstract: Scoring methods in multiple-choice tests are usually designed as fair bets, and thus random guesswork yields zero expected return. This causes the undesired result of forcing risk averse test-takers to pay a premium in the sense of letting unmarked answers for which they have partial but not full knowledge. In this note I use a calibrated model of prospect theory [Tversky and Kahneman (1992, 1995)] to compute a fair rule which is also strategically neutral (i.e. under partial knowledge answering is beneficial for the representative calibrated agent, while under total uncertainty it is not). This rule is remarkably close to an old rule presented in 1969 by Traub et al. in which there is no penalty for wrong answers but omitted answers are rewarded by 1/M if M is the number of possible answers.
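A small expected-value check (my own illustration in Python; the function name is just for this sketch) of the 'fair bet' idea and of the Traub-style rule mentioned in the abstract, for one item with M alternatives:

# Expected score of blind guessing on one item with M alternatives, given the
# points awarded for a right and for a wrong answer.
def expected_guess(M, right, wrong):
    return right / M + wrong * (M - 1) / M

M = 4
# Classical formula scoring: +1 right, -1/(M-1) wrong, 0 for an omission.
print(expected_guess(M, 1, -1 / (M - 1)))   # 0.0, the same as omitting: a 'fair bet'
# Traub-style rule: +1 right, 0 wrong, 1/M for an omission.
print(expected_guess(M, 1, 0), 1 / M)       # 0.25 and 0.25: blind guessing gains nothing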
Lord, Frederic M., & Novick, Melvin R. (1968). Statistical theories of mental test scores. London: Addison-Wesley. (Chapter 23)
W. Molenaar (1977). On Bayesian formula scores for random guessing in multiple choice tests. British Journal of Mathematical and Statistical Psychology, 30, 79-89.
Gerardo Prieto and Ana R. Delgado (1999). The Effect of Instructions on Multiple-Choice Test Scores. European Journal of Psychological Assessment, 15 #2. - abstract Most standardized tests instruct subjects to guess under scoring procedures that do not correct for guessing or correct only for expected random guessing. Other scoring rules, such as offering a small reward for omissions or punishing errors by discounting more than expected from random guessing, have been proposed. This study was designed to test the effects of these four instruction/scoring conditions on performance indicators and on score reliability of multiple-choice tests. Some 240 participants were randomly assigned to four conditions differing in how much they discourage guessing. Subjects performed two psychometric computerized tests, which differed only in the instructions provided and the associated scoring procedure. For both tests, our hypotheses predicted (0) an increasing trend in omissions (showing that instructions were effective); (1) decreasing trends in wrong and right responses; (2) an increase in reliability estimates of both number right and scores. Predictions regarding performance indicators were mostly fulfilled, but expected differences in reliability failed to appear. The discussion of results takes into account not only psychometric issues related to guessing, but also the misleading educational implications of recommendations to guess in testing contexts.
N. Kogan and M. A. Wallach (1964). Risk taking: A study in cognition and personality. New York: Holt, Rinehart and Winston.
- on propensity to take risks in guessing on multiple choice items ...
A.H.G.S. van der Ven (1974). A Bayesian formula score for the simple knowledge or random guessing model. NTvdPs. pdf
James Diamond, William Evans (1973). The Correction for Guessing. Review of Educational Research, Vol. 43, No. 2 (Spring, 1973), pp. 181-191. Jstor
Marilyn W. Wang and Julian C. Stanley (1970). Differential weighting: a review of methods and empirical studies. Review of Educational Research, 40, 663-705.
- among other things, confidence weighting
Contributed to a LinkedIn discussion, 6 January 2015.
The discussion was opened with this formula for the cut-off score of a test of 40 multiple-choice questions:
-
- guessing: 25% correct = 10 questions
- of the remaining 30 questions, 60% correct = 18 questions
- the cut-off score lies at 10 + 18 = 28 questions correct or more.
Two points. Plus one more.
1) Only for blind horses is the probability of a correct guess 25%. Since those horses do not sit the test, the guessing probability is better put at 1/3. In your 'formula' (more on that in a moment): about 13 of the 40 'correct' by guessing, and of the remaining roughly 27 questions about 16 correct by 'knowing', so the cut-off is 13 + 16 = 29 correct (rounded in the student's 'favor'). That works out to slightly more than 70%. If the guessing probability were 1/2, this formula would put the cut-off at 32, i.e. 80%.
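A minimal numeric sketch (Python, my own addition; the function is only illustrative) of the recipe under discussion, reproducing the 28 of the opening post and the roughly 29 and the 32 of point 1:

# Cut-off recipe from the discussion: first credit r*n items as guessed correctly,
# then credit p of the remaining items as known correctly.
def cutoff(n, p, r):
    guessed = r * n
    known = p * (n - guessed)
    return guessed + known

for r in (0.25, 1/3, 0.5):
    print(round(cutoff(40, 0.6, r), 1))   # 28.0, 29.3, 32.0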
2) It is a mystery to me where your formula comes from. I know it is used often, but that does not make the mystery any smaller. The reasoning behind it is bizarre (no student starts by guessing everything).
r = guessing probability (the probability of a correct guess)
p = just-sufficient mastery
Your formula:
r • n + p • ( n - r • n )
= r • n + p • n • ( 1 - r )
= n • ( r + p • ( 1 - r ))
Apparently the intention is to set the cut-off at 60% mastery (of questions as they are usually asked about this material). According to Van Naerssen (in De Groot & Van Naerssen, 1969, Studietoetsen), the student can then guess the remaining questions: 1/3 of the remaining 40%, so (taking a short-cut through the stochastics) the cut-off for the 40-question test becomes: 24 questions answered correctly from knowledge + about 5 questions guessed correctly = 29.
My formula for this:
p • n + r • ( n - p • n )
= n • ( p + r • ( 1 - p ))
As you can see: p and r have swapped places. Written out, both expressions reduce to the same thing, n • ( p + r - p • r ), so the cut-off itself does not differ; what makes your version objectionable is the reasoning behind it, because no student first guesses all the questions and only then 'knows' part of the remainder.
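A quick numeric check (my own addition, with the values used above) that the two orderings indeed come to the same cut-off:

# Both orderings expand to n * (p + r - p*r).
n, p, r = 40, 0.6, 1/3
guess_first = r * n + p * (n - r * n)   # the ordering in the discussion
know_first = p * n + r * (n - p * n)    # the Van Naerssen ordering
print(guess_first, know_first)          # both about 29.33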
What does Cito say about it?
http://www.cito.nl/static/oenw/ttb/beglist1.htm#CORRECTIE%20VOOR%20RADEN
Our test experts in Arnhem give this formula for the 'corrected' score X':
1) X' = X - F / ( a - 1 )
X' = test score corrected for guessing
X = number of questions answered 'correctly', whether worked out or guessed
F = number of questions answered 'incorrectly', whether worked out or guessed
a = number of answer alternatives per question.
The formula expresses the number of questions known correctly in terms of the total number answered correctly, reduced by 1 / ( a - 1 ) times the number answered incorrectly. The (implicit) assumption, after all, is that all incorrectly answered questions were guessed incorrectly. The number guessed incorrectly then indicates the likely number guessed correctly: with four alternatives, for every three incorrectly guessed questions one question is expected to have been guessed correctly as well, and that one is included in X. So divide F by 3 and subtract that from X to obtain X'.
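As a small illustration (my own sketch with hypothetical numbers, not Cito code):

# Cito's correction for guessing: X' = X - F / (a - 1).
def corrected_score(X, F, a):
    return X - F / (a - 1)

# 28 questions answered correctly, 12 incorrectly, 4 alternatives per question:
print(corrected_score(28, 12, 4))   # 24.0, the estimated number of questions known correctly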
2) X' = X - F / ( a - 1 )
Formula 2 is not really helpful as it stands, because you want to know X: the number of questions that must at least be answered correctly when minimally sufficient mastery of the material corresponds to the X' that belongs to 60% mastery. So the cut-off X must be:
3) X = X' + F / ( a - 1 )
Does the Cito formula agree with your formula, or with mine? That is not obvious at a glance, so out comes a scrap of paper. With a little effort, and quite a few slips along the way, it does work out in the end:
Rewriting, for a cut-off at 60% of the material known, p = 0.6; n is the number of questions:
X' = n • p
F = n - X
4) X = n • p + ( n - X ) / ( a - 1 )
bring all terms in X to the left-hand side:
5) X - ( n - X ) / ( a - 1 ) = n • p
get rid of the denominator ( a - 1 ):
6) (( a - 1 ) • X - ( n - X )) / ( a - 1 ) = n • p
7) ( a - 1 ) • X - ( n - X ) = n • p • ( a - 1 )
simplify the left-hand side:
8) a • X - X - n + X = n • p • ( a - 1 )
9) a • X - n = n • p • ( a - 1 )
move n to the right-hand side and divide both sides by a:
10) X = ( n • p • ( a - 1 ) + n ) / a
rewrite the variable a in terms of the variable r (the guessing probability, i.e. the probability of a correct guess):
11) r = 1 / a
12) a = 1 / r
13) a - 1 = ( 1 - r ) / r
substitute:
14) X = r • n • ( p • (( 1 - r ) / r ) + 1 )
multiply the guessing probability r through inside the parentheses:
15) X = n • ( r • p • (( 1 - r ) / r ) + r )
multiply r through once more:
16) X = n • ( p • ( 1 - r ) + r )
We are almost there, since we are after a recognizable formulation; intermediate step:
17) X = n • ( p - r • p + r )
because this can be rewritten in the familiar form:
18) X = n • ( p + r • ( 1 - p ))
And this is my formula from above.
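As a cross-check of the scrap-paper algebra (my own sketch, assuming the Python library sympy is available), solving equation 4 for X and substituting a = 1/r gives exactly formula 18:

import sympy as sp

X, n, p, a, r = sp.symbols('X n p a r', positive=True)
# Equation 4: X = n*p + (n - X) / (a - 1); solve for X and substitute a = 1/r.
solution = sp.solve(sp.Eq(X, n * p + (n - X) / (a - 1)), X)[0]
solution_in_r = sp.simplify(solution.subs(a, 1 / r))
print(sp.simplify(solution_in_r - n * (p + r * (1 - p))))   # 0: identical to formula 18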
plus one) In all these formulas, including Cito's and Van Naerssen's from 1969, the tacit assumption is that incorrect alternatives marked on multiple-choice questions are the result of guessing. But that cannot possibly be true. Many wrong answers will simply be wrong answers, not unlucky guesses. This is not a minor point, because it means that the number of multiple-choice questions answered incorrectly is not a good indication of the possible number of questions guessed correctly. The term 1 / ( a - 1 ) times the number of wrong answers then definitely gives an OVERESTIMATE of that number.
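A small simulation sketch (my own, with made-up numbers) of this point: when part of the wrong answers stems from false knowledge rather than from guessing, F / ( a - 1 ) overestimates the number guessed correctly, and the 'corrected' score consequently underestimates what the student actually knows correctly:

import random

# Hypothetical student: 40 four-choice questions, 22 known correctly, 6 'known' falsely
# (answered wrongly with conviction), 12 guessed blindly.
def simulate(n_known=22, n_false=6, n_guess=12, a=4, trials=100_000, seed=1):
    rng = random.Random(seed)
    est_guessed = truly_guessed = corrected = 0.0
    for _ in range(trials):
        g = sum(rng.random() < 1 / a for _ in range(n_guess))   # correct guesses
        X = n_known + g                                         # observed number correct
        F = n_false + (n_guess - g)                             # observed number wrong
        est_guessed += F / (a - 1)     # formula's estimate of the number guessed correctly
        truly_guessed += g
        corrected += X - F / (a - 1)   # Cito-corrected score X'
    return est_guessed / trials, truly_guessed / trials, corrected / trials

print(simulate())   # about (5.0, 3.0, 20.0): guessing is overestimated, and the 22
                    # questions actually known correctly are estimated as only about 20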
Don't ask me how it is possible that De Groot and Van Naerssen, and Cito up to this very day, have spread a misconception on this point.
See also http://www.benwilbrink.nl/projecten/toetsvragen.2.htm#raden
http://www.benwilbrink.nl/projecten/raden.htm