Ben Wilbrink: literature on testing

a.o.:
Donald Bitzer & D. Skaperdas: The economics of a large-scale computer-based education system: Plato IV 17-29
Robert Glaser: Psychological questions in the development of computer-assisted instruction 74-93
Frederic M. Lord: Some test theory for tailored testing 139-183
Robert F. Simmons: Linguistic analysis of constructed student responses 203-21
Patrick Suppes and Mona Morningstar: Four programs in computer-assisted instruction 233-265 pdf
Gail S. Young: Comments on social, psychological, and mathematical aspects of the Suppes-Morningstar chapter 266-276
Stanford programs in arithmetic, logic, and Russian: Discussion 277-281
Carl B. Sheley & Vaughn Groom: The Apollo Flight Controller Training System concept and its educational implications 313-335
Allan B. Ellis & David D. Tiedemann: Can a machine counsel? 345-373
Emmanuel G. Mesthene: Computers and purposes of education 384-391
This leads me to perhaps the most dangerous pitfall of all, which is the unconscious reinforcement of the values of efficiency and achievement that can result from technological improvement of present educational processes.
( . . . )
The power of truth - of technology, science, knowledge - is very great these days. Those who seek after it, therefore, have a duty to measure their contribution in the context of truths that often transcend the two-valued logic of the computer.
p. 391 [Emmanuel G. Mesthene (1970). Computers and purposes of education, in Wayne H. Holtzman: Computer-assisted instruction, testing, and guidance. Harper & Row.]

Ronald K. Hambleton (Ed.) (1989). Applications of item response theory. Special Issue International Journal of Educational Research, 13 #2, 121-220.

- Gideon J. Mellenbergh: Item bias and item response theory 127-144 abstract
- Ronald K. Hambleton & H. Jane Rogers: Solving criterion-referenced measurement problems with item response models 145-160 abstract
- Wim van der Linden & Michael A. Zwarts: Some procedures for computerized ability testing 175-188 pdf
- Susan E. Embretson: Latent trait models as an information-processing approach to testing 189-204 abstract

Benoît Dompnier, Céline Darnon, Emanuele Meier, Catherine Brandner, Annique Smeding, Fabrizio Butera (2015 accepted). Improving Low Achievers' Academic Performance at University by Changing the Social Value of Mastery Goals. American Educational Research Journal, 52, 720-749. abstract

Charles W. Daves (Ed.) (1984). The uses and misuses of tests. Examining current issues in educational and psychological testing. Jossey-Bass.

- John T. Casteen III: The public stake in proper test use.
- Melvin R. Novick: Importance of professional standards for fair and appropriate test use.
- Anne Anastasi et al: Commentaries on the development of technical standards for educational and psychological testing.
- Anthony J. Alvarado: Role of testing in developing and assessing early childhood education programs.
- Diane Ravitch: Value of standardized tests in indicating how well students are learning.
- Fred A. Hargadon: Responding to charges of test misuse in higher education.
- Franklyn G. Jenifer: How test results affect college admissions of minorities.
- Donald N. Bersoff: Legal constraints on test use in the schools

David Owen (1999). None of the above. The truth behind the SATs. Revised and updated. New York: Rowman & Littlefield. isbn 0847695077 info

inserted:
CEEB (1965, 1968). Effects of coaching on Scholastic Aptitude Test scores. 28 pp brochure;
Robert L. Bangert-Drowns, James A. Kulik, and Chen-Lin C. Kulik (1983). Effects of coaching programs on achievement test performance. Review of Educational Research, 53, 571-585;
Donald E. Powers and Donald A. Rock (1999). Effects of coaching on SAT I: Reasoning test scores. Journal of Educational Measurement, 36, 93-118, fc of abstract page

W. James Popham (2005). America's 'failing' schools. How parents and teachers can cope with No Child Left Behind. Routledge. isbn 0415451283

The reason that educators test children is to make an inference about the knowledge or skills that a child possesses.

Popham 2005, p. 49
The remarkable thing in the above informal definition is that Popham knows bloody well that the kind of testing, and especially the kind of test questions, will determine what the students prepare for. Therefore, the purpose of testing would be to make sure that students learn the right kind of thing. Calling that 'inference making' does not seem to be one hundred percent truthful.

And the resulting inferences are then used by teachers to make instructional decisions about their students.

Popham 2005, p. 50
Popham is keeping all options open here. The restriction is to in-school testing.

... educational testing is far less precise than most parents (and numerous educators) think it is.

Popham 2005, p. 54-55
A most important point, and Popham is so right to mention it in the forceful way he does. He does not, however, try to explain that it is inherent in the character of assessment - sampling right-wrong items from the student's imperfect mastery - that relatively large swings are possible in the test result for the individual student. This differs radically and fundamentally from the prototypical kind of measurement in the physical world: that of length and weight.
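
The size of those swings can be sketched with a simple binomial model. This is only an illustration under assumptions that are not Popham's: the student answers each of `n_items` independently sampled right-wrong items correctly with probability `mastery`, and both values used below are hypothetical.

```python
from math import comb, sqrt


def score_distribution(n_items: int, mastery: float) -> list[float]:
    """Binomial probabilities of obtaining 0..n_items correct answers."""
    return [
        comb(n_items, k) * mastery**k * (1 - mastery) ** (n_items - k)
        for k in range(n_items + 1)
    ]


def score_sd(n_items: int, mastery: float) -> float:
    """Standard deviation of the number-correct score under the binomial model."""
    return sqrt(n_items * mastery * (1 - mastery))


# Hypothetical values: a 40-item test, a student who masters 70% of the domain.
n_items, mastery = 40, 0.7
probs = score_distribution(n_items, mastery)
expected_score = n_items * mastery
sd = score_sd(n_items, mastery)

# Probability that the observed score lands within 2 points of the expected score.
low, high = round(expected_score) - 2, round(expected_score) + 2
p_close = sum(probs[low : high + 1])

print(f"expected score {expected_score:.0f}, sd {sd:.2f}")
print(f"P({low} <= score <= {high}) = {p_close:.2f}")
```

Under these assumptions the standard deviation is about 2.9 points on a 40-item test, so the same student can easily score several points higher or lower on a parallel test. Nothing comparable happens when a length is measured twice with the same ruler.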

W. James Popham (2001). The truth about testing. An educator's call to action. Association for Supervision and Curriculum Development ASCD. isbn 0871205238, 167 pp. paperback; important chapters available online; also available in Questia

Martin V. Covington (1992). Making the grade: a self-worth perspective on motivation and school reform. Cambridge University Press. isbn 052134803X

Dominique Sluijsmans, Desirée Joosten-ten Brinke & Cees van der Vleuten (2013). Toetsen met leerwaarde. Een reviewstudie naar de effectieve kenmerken van formatief toetsen. pdf

Robert Reinier Gras (1967). Studietoetsen voor moderne talen. Doctoral dissertation, RU Utrecht (supervisor: A. D. de Groot).

The 'languages project' was supervised by A. D. de Groot.

"In addition to constructing and trying out a series of achievement tests, the research has also become an exploration of the possibilities and difficulties to be expected when achievement tests are introduced. (...) an introduction to the achievement-test method, and it may even yield a guide to constructing achievement tests oneself."
Twitter thread in which I note that, even this early, the ambition existed to replace the final examinations with achievement tests. And because these are language tests: the testing of reading comprehension gets plenty of attention, but without the idea of 'reading comprehension' being problematized. I find that rather remarkable, though of course not unexpected. Tests from the US serve here as examples to be followed. That is roughly how things turned out.

Alexander W. Astin (1993). Assessment for excellence: the philosophy and practice of assessment and evaluation in higher education. American Council on Education / Oryx series on higher education. isbn 0897748050

Nicholas Lemann (1999). The big test. The secret history of the American meritocracy. Farrar, Straus and Giroux. isbn 0374299846

The history of Educational Testing Service in Princeton, based in part on ETS's own archives. Nicholas Lemann interview on the html page 'secrets of the SAT.'

Henk van Berkel (1999). Zicht op toetsen. Toetsconstructie in het hoger onderwijs. Van Gorcum. isbn 9023234642

Dylan Wiliam (3 September 2015). On formative assessment. youtube 13:25 minutes

Here Dylan Wiliam emphasizes teacher work quality; flipside: students learn more, are more attentive, in this formative approach. The interview ends on the small ‘difference’ made by schools/teachers. Do not forget, though, the absolute level: take teachers out of school and results then will drop to near nothing. Flipside: there is unexpected room for better results. This is a clear exposition, in a few minutes, of some strong points of formative questioning in class. Must see.

P. van Duyvendijk, Joh. Janssen en L. van der Zweep (1934). Het pedagogisch opstel. Leidraad bij 't maken van pedagogische opstellen voor hoofdakte-candidaten en de hoogste klassen der kweekschool. Purmerend: J. Muusses.

Through its many topic outlines, it in effect contains an overview of pedagogical-didactic thinking in the Netherlands in the early 1930s.

Ton Luijten (1993). Het Cito tussen Schiermonnikoog en Maastricht. Ton Luijten in gesprek met A. D. de Groot en Wynand Wijnen over 25 jaar Cito en andere zaken. Cito. No isbn, no pdf on the Cito website.

We did indeed campaign hard for years for a more objective form of assessment. That was badly needed, too. But 'inseparable'? That was not our definition; rather, that is how people felt it, especially once the decision had been taken to abolish the less objective method of assessment in the modern-language examinations. When tests were introduced I did argue strongly for four-option multiple-choice tests only, partly on the basis of research, but above all to avoid unnecessary complications. We were not, however, in favour of completely replacing the translation with an objective reading-comprehension test. That proposal came from the inspectorate at the time. I actually think it is very bad that this happened. Such a translation can, after all, be assessed reasonably objectively. The text to be translated functions as a fairly objective criterion. In itself, a reading-comprehension test alongside, and not instead of, the translation would have been a good supplement. But a reliable assessment of the translation required at least two assessors. The government did not want that. Budget cuts always prevail in decisions of that kind.
p. 12, De Groot
The very first RITP project, in '57 that is, arose from a request by a committee concerned with the didactics of initial mathematics education. Something had to change there, they felt. I was then able to convince the committee, and with it O. en W., which subsidized the research, that such a problem could not be investigated without objective testing of learning outcomes. Within this framework the RITP in fact developed its first, still rather deficient achievement test. [No publication about this in the convolute of De Groot's publications http://www.ppsw.rug.nl/~biblio/convoluutdegroot.htm. Possibly Jan Timmer did that work at the RITP; see for example chapter 10, written by him, 'Enkele ervaringen met het schrijven van wiskunde-items voor de lagere klassen van het middelbaar onderwijs.' in De Groot & Van Naerssen 1969, Studietoetsen construeren, afnemen, analyseren, but that is much later, in 1969. I have no idea which committee that could have been at the time; perhaps something about it can be found in Euclides. b.w.]
p. 12, De Groot
School tests
TL Wasn't the Amsterdamse Schooltoets the forerunner of what later became the Citotoets at the end of primary education?

That is correct. Its origin lay in the then much-discussed problem of the transition of pupils from the sixth grade of primary school to the VHMO. There was a great deal of dissatisfaction about that. There was sometimes a large discrepancy between the recommendations of the head teacher and the results achieved in entrance examinations. I believe that in the mid-sixties it was laid down by Royal Decree that, alongside the head teacher's recommendation, another datum, as objective as possible, had to be available. I no longer know the exact wording, but P.J. Koets, who had by then been chairman of the RITP board for a few years, pointed out to us that the result of a school achievement test could fall under that wording. Grist to our mill, of course. Koets supported our putting together such a test in and for the Amsterdam schools. Speed was of the essence. We set to work with a few primary-school teachers as producers of item topics and as advisers. Only, without the cooperation of educationalists we could not present such a test. At the eleventh hour a solution was found, in that we were able to convince Professor Idenburg that his Kohnstamm Institute had to take part in the operation. That is how the first Amsterdamse schooltoets came about: rushed work and still rather amateurish in design. But a start had been made, and a better relationship with pedagogy too, for that matter.

There was, and had been for years, an unrelenting animosity between psychologists and educationalists. Educationalists felt that psychologists should keep their hands off education. That was their territory. That struggle lasted a very long time, until educational science (onderwijskunde) arrived, with two 'entrances', so to speak. That struggle still flares up now and then, for that matter.
All sorts of resistance had to be overcome. The Amsterdam school parliament, for example, was dead against it: 'the child in the computer', you know those slogans, they are still being uttered. There was little appreciation to be found among educationalists. Selection did not sit well politically either. Be that as it may: with Idenburg's support we still got a long way. We had good arguments; that was not the problem. In the end we succeeded. In interminable meetings the most important objections could be removed. Wonderful meetings; you really ought to read the minutes of them again now.
In that period Ko van Calcar had also joined us. In Amsterdam circles he was of unimpeachably left-wing persuasion, and he was a good support for us. He may not have advertised the new development in our way, but still. In the end the Amsterdam school parliament was willing to cooperate, albeit under protest.
p. 14, De Groot
A central institute
TL The Amsterdamse schooltoets was in fact still a regional affair. How did the Cito eventually come into the picture?

That was already in the picture behind the scenes. Don't forget that the Amsterdamse schooltoets was the umpteenth RITP project. Mathematics and language test projects, the latter aimed at final-examination level, had preceded it or were still under way. In the first half of the sixties the idea of tests was no longer so strange in VHMO circles and in The Hague. At the University of Amsterdam hard work was being done on psychometrics, and on test construction in particular. We had a department of Examen-Techniek there, where among other things the early work of Van Naerssen and Mellenbergh came about. We gained influence.
At another level the creation of SVO was important. That was mainly Idenburg's work, and I supported him in it. Among other things as a member of a kind of lobby group at O. en W., consisting of five professors. Once SVO existed (Idenburg became chairman, I a fellow board member), the founding of the Cito was taken up fairly quickly. At that moment I could bring out that old story from 1958 again; the acronym CITO was already in it!
In those years a lot was possible. Education was in motion, the Mammoetwet came into force, and suddenly there was also political support for the founding of the Cito. In politics you have to wait for the right moment to get something like that off the ground.
As for the Amsterdamse schooltoets: it could be taken over straight away. And it had to be. The RITP was an institute for educational research, and for us test development was first and foremost an aid. Moreover, we were not equipped to tackle it permanently and on a large scale, that is, nationally. Such a task does not belong at a university either. On top of that came the publication of 'Vijven en Zessen' in 1966. The coincidence of favourable factors was partly the result of our strategy. For another part, as always, of a dose of luck, particularly as regards the political climate.

TL Wasn't Grosheide State Secretary for Education at the time?

I believe so, yes. He was the politically responsible member of government. The Hague wanted a central institute, as detached as possible from the University of Amsterdam. Arnhem seemed a suitable neutral location for it. And so in 1968 (I myself was spending a year in America and, as is well known, Sjeng Kremers did the founding work) the Stichting Centraal Instituut voor Toetsontwikkeling came into being, based in Arnhem.

TL When you now look back on the ideas of that time, how do you feel about them now?

The Cito seemed, and still seems, to me a very successful undertaking. Under Solberg it tackled matters the way I had envisaged at the time. The final examinations came into the picture. First experimentally, later officially. At a later stage the periodic national assessment research (periodiek peilingsonderzoek) emerged. I thought that in particular a splendid initiative: policy information and public information about pupils' achievements. Something like that did mean a great gain for education, compared with the years before.
pp. 14-15, De Groot
Criticism
TL There is enthusiasm in his statements. It is therefore with some reserve that I ask him whether his feelings are purely positive.

Not across the board. I do have points of criticism as well. The most important question in this connection is whether, over the past twenty-five years, the Cito has done enough to promote the dissemination of the more general insights that the work has yielded. I am thinking here of insight into the large differences in pupils' capacity to achieve, which emerge time and again from score distributions. And then their unmistakable main cause: large differences in learning ability, in aptitude. The era of belief in 'everyone can learn everything', fervently professed by many unworldly intellectuals and bureaucrats, policymakers included, may be over, but the egalitarian pipe dream has by no means run its course. And it is precisely that dream that stands in the way of good education policy.
The hesitation to acknowledge different levels of achievement is still alive: in the basisvorming, but also in higher education; WO versus HBO is an example of it. The tendency to blame education when it turns out that large numbers of pupils have not learned very 'simple' things also springs from that dream. The Cito, which after all spends the whole day differentiating between pupils' achievements, also in a predictive sense, could well have pushed back more on that. For example, simply by providing more concrete information, at item level that is, about how difficult a number of 'simple' things turn out to be. Teachers in secondary education, from VBO up to and including Gymnasium, know this, but politicians still do not, I fear.

TL The Cito's reserved stance is partly a consequence of the position the institute occupies: helping to implement education policy. That is also how the Cito is viewed in Zoetermeer.

Yes, what you say is true, of course. 'The Cito should simply do what we want': that has become the government's line of thought, I understand. Not following education policy critically and, where necessary, supplying it with less welcome data, but propagating and implementing the policy, and no nonsense beyond that. Nevertheless: if little room for manoeuvre is offered, then the Cito should create it itself. And on the other hand I think the government should make better use of its own instruments.
What I also regret, and this is connected with the foregoing, is that the Cito itself has concerned itself so little with research into objectives. In the discussion about, and in the strange development of, the basisvorming I have not heard the voice of the Cito. From your expertise in differentiation and in questions about the attainability of objectives, surely something sensible, something cautionary, could have been said? For example about the initially promised 'general raising of the level of youth education', and about the emphasis on the middle of the 'learning-ability distribution' that is inherent in the design, without much attention to the 'top 10%' and the 'bottom 20%'.
p. 16, De Groot

Samenwerkende Instituten (1967). Amsterdamse schooltoetsen. Verslag van het eerste onderzoek 'L.O.-Schooltoets Amsterdam, 1966'. Groningen: Wolters.

D. J. Bos (1973). De Amsterdamse schooltoets en de differentiatie van brugklasleerlingen. Pedagogische Studiën, 50, 62-69. online

Gertrude N. Smit (1995). De beoordeling van professionele gespreksvaardigheden. Constructie en evaluatie van rollenspel, video- en schriftelijke toetsen. Baarn: Nelissen. Doctoral dissertation, RU Groningen. 195 pp. (supervisors include Hofstee) (inserted: Gertrude Smit (1994). De beoordeling van professionele gespreksvaardigheden. De Psycholoog, 266-269. "Training in conversational skills is part of the curriculum in many programmes. Tests to determine whether students, after completing the training, are able to apply the conversational skills they have learned adequately are often not available. This article discusses the construction of one possible test format: the role-play test. It also reports on a first study of the reliability and construct validity of this test.")

What surprises me: I see no discussion of the question whether testing is sensible at all. After all, this concerns a practical (practicum) activity, I may assume (testing book knowledge seems rather beside the point here, or am I mistaken?). The teachers of this conversational-skills course must be able to judge on the fly what the student still needs to work on. Why would that not be sufficient? In short: I miss an engagement with A.D. de Groot's view on P and H components (from memory: that is in his Selektie voor en in het hoger onderwijs, 1972. It is indeed not mentioned in Smit's reference list). It is surprising for another reason too: supervisor Wim Hofstee is not known as an advocate of unnecessary testing. But usefulness and necessity are not a subject of discussion or research in this dissertation.

William D. Hedges (1966). Testing and evaluation for the sciences in the secondary school. Wadsworth. lccc66-13465

loads of examples
the traditional psychometric approach is endorsed, as is Bloom and others' taxonomic system

W. H. F. W. Wijnen (1972). Onder of boven de maat; een methode voor het bepalen van de grens voldoende onvoldoende bij studietoetsen. Amsterdam: Swets & Zeitlinger.

A. D. de Groot (1966). Vijven en zessen. Cijfers en beslissingen: het selectieproces in ons onderwijs. Groningen: J. B. Wolters.

On transparency: What the pupil needs, by contrast, is information, comprehensible to him or her, about how he has done. There must be 'feedback', as we like to say nowadays, and the information fed back must be clear. Precisely in the absence of objective norms (achievement tests, for example), the information contained in the marks is essential for the pupils. To a large extent it is through his marks that the teacher communicates where his assessment norms lie, what he regards as minor matters and what as main points, where he draws the line between pass and fail, how he composes a report mark; in short: through his marks he communicates what the pupils have to go by. Now these communications are only really usable if they are 'transparent', i.e. if they are given consistently according to a system that is transparent to the pupil. That transparency is lacking: when the system is so complicated that it can no longer be seen through; when the teacher himself does not know precisely how he does it, and therefore cannot be consistent, but regards this as his sovereign 'freedom'; when part of the data is withheld from the pupils: the 'secret' mark book and the 'secret' formula, signs of a power conception of the office and/or of a bad conscience. Now that transparency is undermined in principle when one abandons the principle that a mark (a report mark in particular) must be as good a reflection as possible of the achievements delivered; that is: as soon as one starts taking additional factors into account. Those additional factors, based on pedagogical considerations, are after all largely unpredictable for the pupil. He can never learn to see through them properly as a method. Not only as regards the promotion decision, but even before that, quite unnecessarily, when the report marks are drawn up, he just has to 'wait and see what they will do with him.'
p. 149

Charles Tilly (2006). Why? What happens when people give reasons . . . and why. Princeton University Press. isbn 9780691125213 info

On what it is to explain.

Daniel Starch (1916). Educational measurements. New York: Macmillan. https://archive.org/details/educationalmeas01stargoog

Amusing book, contains many exercises, little text. The various chapters deal with the measurement of abilities (!), of writing, spelling, arithmetic, Latin, German, physics, etc. Overlap of abilities between grades (p. 41)

Banesh Hoffmann (1962/78). The tyranny of testing. Crowell-Collier. Reprint 1978. Westport, Connecticut: Greenwood Press. isbn 0313200971 kind of a review?

Obituary http://www.nytimes.com/1986/08/06/obituaries/banesh-hoffmann-an-author-and-collaborator-of-einstein.html

A. R. Gilliland, R. H. Jordan & Frank S. Freeman (1931 2nd). Educational measurements and the class-room teacher. The Century Co. archive.org online

Ch. I (The need for more adequate methods of grading)
Ch. IX Arithmetic 182-207.
- Importance and problems of instruction in arithmetic types of arithmetic tests
- Courtis Standard Research Tests, Series B
- Courtis Standard Practice Tests in Arithmetic
- Compass Diagnostic Tests in Arithmetic
- Cleveland Survey Arithmetic Tests
- Woody-McCall Mixed Fundamentals
- Monroe Diagnostic Tests in Arithmetic
- Monroe's Standardized Reasoning Tests in Arithmetic
Ch. XIII Secondary school mathematics 261-279
- The problem of measurement in mathematics
- Rogers test of Mathematical Ability
- Kelley Mathematical values Test Alpha
- Hotz First Year Algebra Scales
- Douglass Standard Diagnostic Tests for Elementary Algebra
- Illinois Standardized Algebra Tests
- Minnick Geometry Tests
- Schorling-Sanford Achievement Test in Plane Geometry
- Columbia Research Bureau Plane Geometry Test

Charles W. Odell (1927). Educational tests for use in high schools, second revision. University of Illinois Bulletin, 24 No. 33. pdf

Cor Sluijter (1998). Toetsen en beslissen: Toetsing bij doorstroombeslissingen in het voortgezet onderwijs. Doctoral dissertation, Universiteit van Amsterdam. pdf

G. M. Ruch (1924). The improvement of the written examination. New York: Scott, Foresman and Company. [not online]

Functions of written examinations 1-12
The criteria of a good examination 13-39
Sources of error in written examinations 40-64
Types and construction of the newer objective examinations 65-105
Experimental studies of several types of objective examinations 106-130
Statistical considerations related to examination technique 131-148
appendices 154-190
This book is a program to exchange the essay examination for the new objective type test. The test may be teacher-made.
p. 10: "The next step is obviously that of refining the examination to a point where it will begin to approach the accuracy of measurement in physics, chemistry, and the quantitative sciences generally." This is a momentous misconception, the kernel of Ruch's mental model on assessment.
The book, then, attempts to carry out the program hinted at in the p. 10 citation. The list of quality criteria on p. 11 looks familiar, except for the fourth item which expresses the contrast with labor intensive scoring and grading of essay-type tests:
1. Validity
2. Reliability
3. Objectivity
4. Ease of administration and scoring
5. Standards.

R. F. van Naerssen (1974). Psychometrische aspecten van de kernitemmethode. Nederlands Tijdschrift voor de Psychologie, 29, 421-430.

W. K. B. Hofstee (1971). Begripsvalidatie van studietoetsen: een aanbeveling. Nederlands Tijdschrift voor de Psychologie, 26, 491-500.

D. N. M. de Gruijter (1971). Het handhaven van normen bij studietoetsen door toetsvergelijking. Nederlands Tijdschrift voor de Psychologie, 26, 480-490.

What a lot of trouble we bring upon ourselves by pretending that achievement tests are psychological tests. A good achievement test is, after all, about the core of the subject matter, a core that students should of course master. Maintaining norms is not at issue here at all!

R. F. van Naerssen (1968). Het bepalen van de caesuur voldoende/onvoldoende. Memorandum AET-245. stencil in bak ex ces

R. F. van Naerssen (1968). Waarom de kernitemmethode faalt en hoe deze verbeterd kan worden. AET memorandum 253. Stencil in bak ex ces.

Tegenkracht organiseren. Lessen uit de kredietcrisis. RMO. pdf

On the Cito-Eindtoets-Basisonderwijs and on the arithmetic method Wizwijs, see pp. 28 ff.

Comparisons with the examples from the financial sector suggest themselves. For this example, too, we do not want to trivialize the value of the Cito method. It is, after all, in essence a productive instrument for measuring pupils' achievements. We do, however, draw attention to the way in which this method (and the same may hold for comparable pupil monitoring systems) dominates the structure of primary education. Precisely because everyone attaches so much value to the test, there is a risk that education will serve only the attainment of a high score. And that is something other than pursuing educational goals. The curriculum adapts to the structure of the testing system, and the teaching materials too are geared to achieving a good score on the test (Bokhove 2008). Moreover, the Cito score degenerates into a new reality with which to measure, assess and compare. The Cito score stands as a model, as it were, for the quality of a school or the abilities of a pupil. The focus on results in education policy does not weaken this tendency. Just as in the financial sector, the danger here is that an identification with the classification takes place: the child or the school is captured in the score, and the score is the basis for parents to choose a school or for secondary schools to admit children.
pp. 30-31

Hambleton & Powell (1981). Standards for standard setters. paper AERA. [I have dumped my hardcopy; it is of no use to me]

Hambleton, R. K., Swaminathan, H., Algina, J. & Coulson, D.B. Criterion referenced testing and measurement: a review of technical issues and developments. RER 1978, 48, 147.

Category mistake: thinking in terms of classification, while there are no classes (other than being artificially so defined).

Wim J. van der Linden (1980). Psychometric contributions to the analysis of criterion-referenced measurements. Doctoral dissertation, University of Amsterdam. (promotor: Don Mellenbergh)

Repeats the important misconceptions regarding utility functions, and classification as a model. The category mistake is that testees would belong to different categories; they do not. They get sorted into different categories (treatments), something different altogether. The category mistake does not help to identify the misconception regarding utility functions: not distinguishing between utility functions proper and expected utility functions; see, e.g., chapter 7 or Psychometrika p. 261: “For the purpose of this paper, it is sufficiently general to consider the utility U as a function of the criterion Y, which is allowed to assume a different shape for each treatment.” On the goal variable Y there can be, of course, only ONE utility function. Trying to specify treatment-dependent utility functions is messing up the one utility function on the goal variable with fuzzy costs or utilities really belonging to other goal variables. Van der Linden and any researchers with him have not been able to see that they artificially and cruelly are reducing problems with multiple goal variables (treated extensively by Keeney and Raiffa, 1976) to problems with only one goal variable. Unbelievable. Didn't I explain the problem to Van der Linden and Mellenbergh, then? Sure, I did, in discussing my own 1980 papers in the working group headed by Van der Linden.
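
The distinction between a utility function proper and an expected utility function can be made concrete in a small sketch. Everything in it is hypothetical: the shape of `utility`, the treatment names, and the outcome probabilities are assumptions for illustration, not taken from Van der Linden's papers. There is one utility function on the goal variable Y; the treatments differ only in the probability distribution over Y that they are assumed to produce for a testee with a given test score.

```python
from math import sqrt


def utility(y: float) -> float:
    """The single utility function on the goal variable Y (hypothetical shape)."""
    return sqrt(y)  # y is assumed to lie in [0, 1]


def expected_utility(outcome_dist: dict[float, float]) -> float:
    """Expected utility of a treatment: sum over outcomes y of P(y) * utility(y)."""
    return sum(p * utility(y) for y, p in outcome_dist.items())


# Hypothetical distributions of Y, given one test score, under two treatments.
# Keys are outcomes y on the goal variable, values are their probabilities.
treatments = {
    "advance": {0.2: 0.3, 0.8: 0.7},
    "remediate": {0.5: 0.6, 0.9: 0.4},
}

expected = {name: expected_utility(dist) for name, dist in treatments.items()}
best = max(expected, key=expected.get)

for name, value in expected.items():
    print(f"{name}: expected utility {value:.3f}")
print("preferred treatment:", best)
```

The treatments end up with different expected utilities although `utility` itself is never made treatment-dependent. Costs specific to a treatment would have to enter as a separate goal variable, in the multiple-goal-variable sense of Keeney and Raiffa (1976).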

1. Introduction and overview.
2. Decision models for mastery testing. [van der Linden, W. J. (1980). Decision models for use with criterion-referenced tests. Applied Psychological Measurement, 4, 469-492]
3. Binomial test models and item difficulty [van der Linden, W. J. (1979). Binomial test model and item difficulty. Applied Psychological Measurement, 3, 401-411.]
4. Estimating the parameters of Emrick's mastery testing model. [van der Linden, W. J. (1981). Estimating the parameters of Emrick's mastery testing model. Applied Psychological Measurement, 5, 517-530. pdf]
5. Forgetting, guessing and mastery: The Macready and Dayton models revisited and compared with a latent trait approach. [van der Linden, W. J. (1978). Forgetting, guessing, and mastery: The Macready and Dayton models revisited and compared with a latent trait approach. Journal of Educational Statistics, 3, 305-318. pdf]
6. A latent trait look at pretest-posttest validation of criterion-referenced test items. [van der Linden, W. J. (1981). A latent trait look at pretest-posttest validation of criterion-referenced test items. Review of Educational Research, 51, 379-402. ]
7. Using aptitude measurements for the optimal assignment of subjects to treatments with and without mastery score [van der Linden, W. J. (1981). Using aptitude measurements for the optimal assignment of subjects to treatments with and without mastery score. Psychometrika, 46, 257-274. pdf] “When several consequences of the decision outcomes have to be taken into account, among which less tangible consequences as, for instance, psychic well-being or societal effects, the assessment of utilities may be extremely complicated and not possible without additional assumptions. On the other hand, when only a few clearly defined consequences are deemed important, the situation resembles those in business applications of which many successful illustrations are available [e.g., Keeney & Raiffa, 1976]. Elsewhere [van der Linden, 1980] we have indicated that in choosing a utility function fit to the decisionmaker's utilities should not be the only requirement. The choice of a utility function ought to be a compromise between at least three requirements: (a) fit to the decisionmaker's utilities; (b) fit to the psychometric model relating test scores to true future states; and (c) robustness of results with respect to its parameters. ” [p. 270]

Robert Rothman (1995). Measuring up. Standards, assessment, and school reform. Jossey Bass. isbn 0787900559

About experiments with authentic assessment. The Americans are, of course, obsessed with their lag behind the rest of the world, so authentic testing is seen as a means of counting internationally again. A journalistic book that gives an overview of developments over the past decade.

Dato N. M. de Gruijter (1982). Tentamineren en beslissen. Tentamens met goed of fout gecodeerde itemantwoorden; een cijfermatige analyse. SVO Reeks 63.

At the time it was a search for a good form and content.

DOZ (1991). Toetsen en beoordelen. Culemborg: PHAEDON. isbn 9072456351

C. P. M. van der Vleuten (1989). Naar een rationeel systeem voor toetsing van studieprestaties in probleemgestuurd medisch onderwijs. Studies naar betrouwbaarheid en validiteit van toetsen voor praktische vaardigheden. Amsterdam: Thesis. Doctoral dissertation. isbn 9051700229

A collection of (submitted) articles. The first article begins by displaying the intellectual poverty of Problem Based Learning (PBL), and all the more so where the evaluation of PBL is concerned. That does not stop the authors from opening the abstract with jubilation: “Problem-based learning is now acknowledged to be a successful educational method, and it has been adopted in many institutions in higher education.” Pointing to the growth of the group of followers of an educational ideology does not, of course, prove that the ideology delivers what it promises. ‘Acknowledging success’ is a curious way of thinking, certainly for someone like Wijnand Wijnen; success has to be demonstrated, and that simply requires good research. As of 1989 there is none, as the authors themselves do report.

With Verwijnen, Wijnen & Imbos (submitted to Teaching and Learning in Medicine). Assessment in problem-based learning: The case of Maastricht. 7-34.

George Moerkerke (1996). Assessment for flexible learning. Performance assessment, prior knowledge state assessment and progress assessment as new tools. Doctoral dissertation, Open Universiteit. [supervisors: de Wolf & Wijnen; co-supervisor: Dochy; committee members: De Corte, Van der Molen, Plomp]

Chapter 4, on ill-structured problems. A disappointing study, from my (1983) perspective. Moerkerke does not use the line of thinking of Newell & Simon.

Chapter 4 is based on: Moerkerke, G. (1992). Toetsing van vaardigheid in probleemoplossen, onderzoek naar een hulpmiddel voor de constructie van opgaven. Tijdschrift voor Hoger Onderwijs, 10, 143-167. open access https://www.tvho.nl/edition.php?id=60
Chapter 10 is a revision of: Moerkerke, G., & Munsters, J. (1994). Over de inzet van adaptief testen in het onderwijs. Tijdschrift voor Hoger Onderwijs, 12, 33-43. open access https://www.tvho.nl/edition.php?id=60

Huub van den Bergh (1988), Examens geëxamineerd. 's-Gravenhage: SVO Selecta. isbn 9064721394

So this is about testing reading comprehension and writing.

Huub van den Bergh (1990). On the construct validity of multiple-choice items for reading comprehension. Applied Psychological Measurement, 14, 1-12. [question format, validity]
Huub van den Bergh (1989). Functionele schrijfopdrachten en stelopdrachten: een onderzoek naar de constructvaliditeit van schrijfvaardigheidsmetingen. Tijdschrift voor Onderwijsresearch, 14, 151-171. [skills]

Harold Berlak, Fred M. Newmann, Elizabeth Adams, Doug A. Archbald, Tyrrell Burgess, John Raven and Thomas A. Romberg (1992). Toward a new science of educational testing and assessment. Albany, NY: SUNY. isbn 0791408787 [Not in the Leiden University Library] info

The alternative crowd.

Tables and Figures
1. The Need for a New Science of Assessment  Harold Berlak
2. Assessing Mathematics Competence and Achievement  Thomas A. Romberg
3. The Assessment of Discourse in Social Studies  Fred M. Newmann
4. The Nature of Authentic Academic Achievement  Fred M. Newmann and Doug A. Archbald
5. A Model of Competence, Motivation, and Behavior, and a Paradigm for Assessment  John Raven
6. Recognizing Achievement  Elizabeth Adams and Tyrrell Burgess
7. Approaches to Assessing Academic Achievement  Doug A. Archbald and Fred M. Newmann
8. Toward the Development of a New Science of Educational Testing and Assessment  Harold Berlak

H. Wesdorp (1974). Het meten van de produktief-schriftelijke taalvaardigheid. Directe en indirecte methoden: 'opstelbeoordeling' versus 'schrijfvaardigheidstoetsen'. Muusses. isbn 9023171012

W. K. B. Hofstee (1983). Selectie: begrip, theorie, procedures en ethiek. Aula 736. isbn 9027455082

Ph. Hartog and E. C. Rhodes (1936). An examination of examinations. Being a Summary of Investigations on the Comparison of Marks allotted to Examination Scripts by Independent Examiners and Boards of Examiners, together with a Section on a Viva Voce Examination. International Institute Examinations Enquiry. London: MacMillan. online: https://dspace.gipe.ac.in/xmlui/bitstream/handle/10973/32779/GIPE-058037.pdf?sequence=3

Ph. Hartog and E. C. Rhodes (1936). The marks of examiners being a comparison of marks allotted to examination scripts by independent examiners & boards of examiners. London.

(Cox 1969 gives two pages with the most important results.) [UvA *F 6279] A fantastic book; fantastic data are printed in it too, and I would like to do a lot of nice things with them. See the separate file with relevant text and data. Also contains: C. Burt. Memorandum I The analysis of examination marks, 245-314. Tries to explain factor analysis and apply it to grading. Unreadable. Rhodes, E. C. Memorandum II does not seem important to me. Hartog, P. J. Memorandum III On certain points of difficulty in connection with school certificate examinations. 325-336. A nice memo on constant failure rates (Posthumus' law!).

Kenneth J. Arrow (1951/1963). Social choice and individual values. Yale University Press. isbn 0300013647

The nature of preference and choice
The social welfare function
The compensation principle
The general possibility theorem for social welfare functions 46-60
The individualistic assumptions
Similarity as the basis of social welfare judgments
Notes on the theory of social choice, 1963

C-A. Staël von Holstein (Ed.) (1974). The concept of probability in psychological experiments. Reidel. isbn 9022705232

- contents:
- De Finetti, B.: The value of studying subjective evaluations of probability. 1-14.
- De Finetti, B.: The true subjective probability problem. 15-24.
- Kahneman, D., & Tversky, A.: Subjective probability: A judgment of representativeness.
- Wallsten, Th. S.: The psychological concept of subjective probability; A measurement-theoretic view. 49-72.
- De Zeeuw, G., & Wagenaar, W. A.: Are subjective probabilities probabilities? 73-102.
- Winkler, R. L., & Murphy, A. H.: On the generalizability of experimental results. 103-126.
- Winkler, R. L.: Statistical analysis: Theory versus practice. 127-140

Robert Schlaifer (1959). Probability and statistics for business decisions. New York: McGraw-Hill.

Amartya Sen (Ed.) (1982/1997). Choice, welfare and measurement. Harvard University Press. isbn 0674127781

- a.o.
- Rational Fools: A Critique of the Behavioural Foundations of Economic Theory
- A Possibility Theorem on Majority Decisions
- Quasi-transitivity, Rational Choice and Collective Decisions
- Necessary and Sufficient Conditions for Rational Choice under Majority
- Social Choice Theory: A Re-examination
- Interpersonal Aggregation and Partial Comparability
- On Ignorance and Equal Distribution
- On Weights and Measures: Informational Constraints in Social Welfare Analysis
- Interpersonal Comparisons of Welfare
- Personal Utilities and Public Judgments: or What's Wrong with Welfare Economics?
- Equality of What?
- Ethical Measurement of Inequality: Some Difficulties

Irving LaValle (1978). Fundamentals of decision analysis. Holt, Rinehart and Winston. isbn 0030854083

Extensive form analysis. Fundamental approach.

Chapters, a.o.: Problem formulation: decisions in extensive form - Foundations of decision analysis - Analysis of decisions in extensive form - Quantification of preferences - Quantification of judgments - Decisions in normal form and sensitivity analysis - Introduction to Markovian decision processes - Monetary group decision problems - A glimpse at game theory

Jack Hirshleifer & John G. Riley (1992). The analytics of uncertainty and information. Cambridge University Press. isbn 0521283698

R. Duncan Luce and Howard Raiffa (1957). Games and decisions. Introduction and critical survey. A study of the Behavioral Models project, Bureau of Applied Social Research, Columbia University. Wiley.

Utility theory; Extensive and normal forms; etc.

Maynard W. Shelly II & Glenn L. Bryan (Eds) (1964). Human judgments and optimality. Wiley.

Much on utility and utility functions, optimality. Lord writes something nice about decision making (actually far, far better than what Wim van der Linden would write twenty years later about types of decision problems); there is a nice piece by Suppes on how to divide a list of words to be learned optimally into learning blocks, which is a nice paradigm for mastery learning, and which gave me the idea that not only the level of mastery required on partial tests is a variable in this philosophy, but also the size of the part that is tested in this way.

Howard Wainer, Neil J. Dorans & Ronald Flaugher (2000). Computerized adaptive testing: A primer. MIT. [eBook KB]

Cees A. W. Glas (2000). Computerized adaptive testing: Theory and practice. Kluwer. [eBook KB]

Wim J. van der Linden & Cees A. W. Glas (2010). Elements of adaptive testing. Springer. [eBook KB] previews

Duanli Yan, Alina A. von Davier & Charles Lewis (2014). Computerized multistage testing: Theory and applications. CRC Press. [eBook KB]

Robert H. Ennis (1969). Logic in teaching. Prentice Hall. LCCN 69-17479

Logic is what constrains the teacher in judging students' work. On the one hand, the teacher should be held accountable for using correct arguments; on the other hand, there is the danger that students are held accountable for arguing in a logically correct way even where the tested subject is not logic at all (as will be the case most of the time, if not always). How is this demarcation to be made? Students should not be forced to argue logically in work that is not itself logic. Test for knowledge, not for logic. Therefore, this book by Ennis might be quite useful for demarcating logic from knowledge, for example in designing achievement test items and deciding what will count as satisfactory answers.

A.o.: Part Three: EXPLANATION - Basic kinds of explanation - Gap-filling - Introductory evaluation and probing of reason-giving explanations - Types of explanation - Testability/Applicability - Explanation: An overview by way of application - Part Four: JUSTIFICATION - Early stages of justification decisions - Value statements - Material inferences - Justification: Two cases

Paul Black (2014). Assessment and the aims of the curriculum: An explorer’s journey. Prospects: quarterly review of comparative education. abstract

Cito (May 2015). Verschillende vormen van afname van de rekentoets. Eindrapportage. pdf

This Cito report is a bureaucratic (because anonymous) document, a bad omen.

What are we to make of this? An anonymous Cito report on a matter in which Cito itself is certainly not a disinterested party (large investments in digital administration of final examinations). The butcher is inspecting his own meat. I will pass over the text for now, because there is an underlying literature study that seems more important to me.

The conclusion of this study is that digital administration has no negative consequences for a candidate. Arithmetic knowledge and skills can be measured just as well with digital tests as with paper tests.
p. 4

Cito (May 2015). Prestaties op papieren en digitale examens: wat is het verschil? Verslag van een literatuurstudie. Eindrapportage. pdf

Googling for yourself (also on Scholar) on digital or paper tests turns up interesting material; the question is whether this anonymous Cito report has made a good selection from the literature. I will come back to that later, if there is reason to. Before looking at this literature study, first my own considerations: (1) Off the top of my head: the guideline (Standards APA/NCME/AERA) is that the provider must demonstrate that its digital test is equivalent to the paper version. The first problem is that the digital environment inevitably places an extra load on pupils, certainly in testing. A load in the sense that it demands capacity of short-term memory (STM), capacity that pupils badly need for the arithmetic itself. This problem is often aggravated by mindless design of both the interface and the items, as is quite evidently the case with the arithmetic test of the CvTE. Specifically for arithmetic, substantial problems are to be expected in digitizing, because the digital environment sits badly with the written working that arithmetic items simply require (on scrap paper, or directly in the test booklet). Examples of how this can go wrong can be seen in the US, in the mathematics tests of the Common Core State Standards program. That is horror; you do not want to know. Both digital and on paper. Some problems are not even visible, because topics are left out of the test. The arithmetic test does not test algorithmic skills, nor, for that matter, automatized basic knowledge (Rekentuin can do the latter, I believe). Research on digital versus paper runs the risk of drifting far away from what the test is actually meant for. An older but related question is whether multiple-choice and open-ended questions test the same thing. Extensive research literature here.

Now for the literature study on digital-paper differences itself. This document too is bureaucratic/anonymous; in other words, there is no Cito researcher who takes personal responsibility for the study. I start not with the text but with the reference list.

Abels et al. Basishandleiding DWO. From: Freudenthal Instituut. pdf This is the manual for the software. Apparently this is an example of a digital environment. No justification given.

AERA etc. Standards. Absolutely essential. Emphasis: psychological tests; educational testing is tacked on. 2014 edition open access. More: web page h

Ashton et al. abstract: On partial credit (a point raised by Bosker). Self-promotion. paywalled pdf

Béguin & Wools (Cito staff). Chapter in a book, available on loan as an eBook at the KB. Technical. Anyway, take a look: preview

Benjamin et al. On changing your answer on multiple-choice tests. Apparently an issue? paywalled [It is fun to research this, but its practical importance seems nil to me: for specific test situations you would have to research it all over again, a hopeless task.]

Bennett et al.: mathematics tests! Seems an important publication to me. free access. Have a look at the abstract. I will come back to this article by Bennett et al. later (among other things, to its references); for now I continue with the reference list.

This article describes selected results from the 2001 Math Online (MOL) study, one of three field investigations sponsored by the National Center for Education Statistics (NCES) to explore the use of new technology in NAEP. Of particular interest in the MOL study was the comparability of scores from paper- and computer-based tests. A nationally representative sample of eighth-grade students was administered a computer-based mathematics test and a test of computer facility, among other measures. In addition, a randomly parallel group of students was administered a paper-based test containing the same math items as the computer-based test. Results showed that the computer-based mathematics test was significantly harder statistically than the paper-based test. In addition, computer facility predicted online mathematics test performance after controlling for performance on a paper-based mathematics test, suggesting that degree of familiarity with computers may matter when taking a computer-based mathematics test in NAEP.

Bergstrom & Lunz is about adaptive testing, which is a different topic. NB: the usefulness of adaptive testing is disputed.

Braswell & Bridgeman. Irrelevant, it seems to me. Interesting: calculations on scrap paper (cf. Van Putten on PPON 2004, analyses of the pupils' scrap work). free

Bunderson, C.V., Inouye, D.K., & Olsen, J.B. (1989). The four generations of computerized educational measurement. In R.L. Linn (Ed.), Educational measurement, Third Edition (pp. 367-407). London: Collier Macmillan. Technical, interesting. The text of this pdf appears to be the same as the (more readable) one in the book. pdf

Cito (2015a). Resultaten vragenlijst Rekentoets VO 3-2015. Where can I find that report? I have asked Cito. I will ask once more. pdf

Cito answer analysis of the arithmetic test, download: Most questions are open-ended; see the appendix in particular. download

Csapó et al., p. 120 ff., in a book with an ominous title, which shows well what we are dealing with. In Friedrich Scheuermann & Julius Björnsson: The transition to computer-based assessment. New approaches to skills assessment and implications for large-scale testing. pdf of the whole book. The book has many more contributions on the theme of digital versus paper. [I did not yet know this book by Scheuermann & Björnsson; very useful! I did know a slightly earlier book: see above.]

Assessing and Teaching 21st Century Skills Assessment Call to Action; Robert Kozma
The European Coherent Framework of Indicators and Benchmarks and Implications for Computer-based Assessment; Oyvind Bjerkestrand
Experiences from Large-Scale Computer-Based Testing in the USA; Brent Bridgeman
Introducing Large-scale Computerized Assessment - Lessons Learned and Future Challenges; Eli Moe
All Issues in the Context of European Computer-based Assessment; Klaus Reich & Christian Petter
PART III: TRANSITION FROM PAPER-AND-PENCIL TO COMPUTER-BASED TESTING
Risks and Benefits of CBT versus PBT in High-Stakes Testing; Gerben van Lent
and other chapters
Transitioning to computer-based assessments: A question of costs; Matthieu Farcot & Thibaud Latour
Computerized Adaptive Testing of Arithmetic at the Entrance of Primary School Training College (WISCAT - pabo); Theo J.H.M. Eggen & Gerard J.J.M. Straetmans
Testing for Equivalence of Test Data across Media; Ulrich Schroeders

College voor Examens (2014). Tussenrapportage centraal ontwikkelde examens mbo en Rekentoets VO, 2013-2014. Utrecht: College voor Examens. pdf Useful, but not directly for digital vs paper; it is background information.

College voor Toetsen en Examens (2014). Rapportage referentiesets taal (lezen) en rekenen. Utrecht: CvTE. download. Fantastic, all those reports I had not seen before!

Commissie Bosker (2014). Advies over de uitwerking van de referentieniveaus 2F en 3F voor rekenen in toetsen en examens. Enschede: SLO. download. The committee that sat itself in the ministers' chair (by all means carry on with that arithmetic test).

Darrah, M., Fuller, E., & Miller, D. (2010). A comparative study of partial credit assessment and computer-based testing for mathematics. Journal of Computers in Mathematics and Science Teaching, 29, 4, 373-398. paywalled. The partial credit problem. College-level mathematics. Forget it.

Dillon, A. (1992). Reading from paper versus screens: A critical review of the empirical literature. Ergonomics, 35, 1297-1326. pdf. Seems very useful to me, even though it dates from 1992.

Dimock, P.H., & Cormier, P. (1991). The effects of format differences and computer experience on performance and anxiety on a computer-administered test. Measurement & Evaluation in Counseling & Development, 24, 119-126. research,net. On test anxiety. Is this important? Forget it.

Eaves, R. C., & Smith, E. (1986). The effect of media and amount of microcomputer experience on examination scores. Journal of Experimental Education, 55, 23-26. pdf A small study from 1986. Feel free to skip it. paywalled.

Evers, A., Lucassen, W., Meijer, R., & Sijtsma, K. (2009). COTAN beoordelingssysteem voor de kwaliteit van tests (geheel herziene versie). Amsterdam: Faculteit der Maatschappij- en Gedragswetenschappen. pdf. Do leaf through it! Interesting exercise: hold the digital arithmetic test up against the COTAN assessment criteria and checklists. Time-consuming, but see cito_voorbeeldtoets_3F.htm for the arithmetic test questions dissected.

Fiddes, D.J., Korabinski, A.A., McGuire, G.R., Youngson, M.A., & McMillan, D. (2002). Does the mode of delivery affect mathematics examination results? Alt-J, 10, 1, 60-69. pdf Strange study, few participants.

Gallagher, A., Bridgeman, B., & Calahan, C. (2000). The effect of computer-based tests on racial/ethnic, gender and language groups (RR-00-8). Princeton, NJ: Educational Testing Service. download. Effect of digital testing for subgroups (bias, skew). Yes, there are differences.

Gaskill, J., & Marshall, M. (2006). This publication cannot be found online.

Green, B.F., Bock, R.D., Humphreys, L.G., Linn, R.L, & Reckase, M.D. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21, 347-359. Technical; does not seem directly relevant to me (adaptive testing). Paywalled. preview

From the remaining references, those that do not seem directly relevant have been omitted.

Greaud, V., & Green, B. F. (1986). Equivalence of conventional and computer presentation of speed tests. Applied Psychological Measurement, 10, 23–34. abstract and download This is about tests of speed and accuracy, not directly relevant for education. Van de Ven did research on this earlier.

Hargreaves, M., Shorrocks-Taylor, D., Swinnerton, B., Tait, K., & Threlfall, J. (2004). Computer or paper? That is the question: Does the medium in which assessment questions are presented affect children’s performance in mathematics? Educational Research, 46, 29-42 paywalled. I suspect this is a small study with few pupils. Forget it. Bycatch on Google: Noyes & Garland (2008). Computer- vs. paper-based tasks: Are they equivalent? Ergonomics, 51, 1352-1375. pdf “.... reviews the literature over the last 15 years and contrasts the results of these more recent studies with Dillon's findings. It is concluded that total equivalence is not possible to achieve, ... ”

International Test Commission (2001). International guidelines for test use. International Journal of Testing, 1, 93–114.

International Test Commission (2013). International guidelines for test use. Final version. pdf
The domain covered by the Guidelines includes any procedure used for 'testing', regardless of its mode of administration; regardless of whether it was developed by a professional test developer; and regardless of whether it involves sets of questions, or requires the performance of tasks or operations (e.g., work samples, psycho-motor tracking tests). The test use Guidelines presented here should be considered as applying to all such procedures, whether or not they are labelled as 'psychological tests' or 'educational tests' and whether or not they are adequately supported by accessible technical evidence.

International Test Commission (2006). International guidelines on computer-based and Internet delivered testing. International Journal of Testing, 6, 143–172. published guidelines

Johnson, M., & Green, S. (2006). On-line mathematics assessment: The impact of mode on performance and question answering strategies. Journal of Technology, Learning, and Assessment, 4, 5, 1-34. get pdf “In this project 104 eleven-year-olds . . . . ”

Findings suggested that although there were no statistically significant differences between overall performances on paper and computer, there were enough differences at the individual question-level to warrant further investigation. Close analysis of the data suggests that it is possible that the question type, the way it is asked, and the numbers involved, might interact with mode to affect students’ willingness to show working methods. The findings also suggest that certain types of questions in certain domains might have different impacts according to mode.

Johnson, D.E., & Mihal, W.L. (1973). Performance of blacks and whites in computerized versus manual testing environments. American Psychologist, 28, 8, 694–699. abstract A study of no account, with 20 subjects. Forget it. Bycatch on Google: Michael Russell a.o. (2003). Computer-based testing and validity: A look back and into the future. This is a small literature review; it points to large digital-paper differences in relation to what pupils themselves are used to in school: working digitally or on paper. The emphasis here is on writing, and on high-stakes tests. www.intasc.org online

Keng, L., McClarty, K.L., & Davis, L.L. (2008). Item-level comparative analysis of online and paper administrations of the Texas assessment of knowledge and skills. Applied Measurement in Education, 21, 3, 207-226. abstract

. . . but significant differences were found for several items and objectives in all subjects at grade 8 and in mathematics and English language arts (ELA) at grade 11. Differences generally favored the paper group.

Kim, J. (1999, October). Meta-analysis of equivalence of computerized and P&P tests on ability measures. Paper presented at the annual meeting of the Midwestern Educational Research Association, Chicago. full text The abstract is abracadabra; apparently the paper is a methodological exercise, and that is borne out on reading it. Anyway, the extensive reference list may be informative.

Kingston (2009). Comparability of computer- and paper-administered multiple-choice tests for K-12 populations: A synthesis. Applied Measurement in Education, 22, 22-37. abstract

This study synthesizes the results of 81 studies performed between 1997 and 2007. ( . . . ) Subject did appear to affect comparability, with computer administration appearing to provide a small advantage for English Language Arts and Social Studies test (effect sizes of .11 and .15, respectively), and paper administration appearing to provide a small advantage for Mathematics tests (effect size of −.06).
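To get a feel for effect sizes of this magnitude, here is a minimal sketch. It assumes the effect sizes are standardized mean differences (mean on computer minus mean on paper, divided by the pooled standard deviation), so that positive values favor computer administration, as in the quoted signs; the abstract as quoted does not say which index was used. The function name `standardized_mean_difference` and the scores are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance

def standardized_mean_difference(computer_scores, paper_scores):
    """Cohen's d with pooled standard deviation.

    Positive values mean the computer group scored higher,
    negative values mean the paper group scored higher.
    """
    n_c, n_p = len(computer_scores), len(paper_scores)
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_variance = (
        (n_c - 1) * variance(computer_scores)
        + (n_p - 1) * variance(paper_scores)
    ) / (n_c + n_p - 2)
    return (mean(computer_scores) - mean(paper_scores)) / sqrt(pooled_variance)

# Invented scores: the paper group is one point (half a standard deviation) ahead.
computer = [10.0, 12.0, 14.0]
paper = [11.0, 13.0, 15.0]
d = standardized_mean_difference(computer, paper)
print(d)
```

On this reading, an effect size of −.06 for mathematics corresponds to a paper advantage of six hundredths of a standard deviation, far smaller than the difference in the invented data above.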

Kolen, M.J. (1999-2000). Threats to score comparability with applications to performance assessments and computerized adaptive tests. Educational Assessment 6, 73-96. abstract paywalled. I cannot find an online version.

This article develops a conceptual framework that addresses score comparability. The intent of the framework is to help identify and organize threats to comparability in a particular assessment situation.

Lee, J. (1986). The effects of past computer experience on computer aptitude test performance. Educational and Psychological Measurement, 46, 727–733. A paper by Lee and others, 1984: The Effects of Mode of Test Administration on Test Performance. txt Bycatch via Google: Carol Taylor, Joan Jamieson, Daniel Eignor and Irwin Kirsch (1998). The relationship between computer familiarity and performance on computer-based TOEFL Test tasks. ETS Research Report Series. free access The same problem here as with the Cito report: ETS has no interest in finding digital-paper differences.

Researchers concluded that there was no evidence of adverse effects on the computer-based TOEFL performance due to lack of prior computer experience.

Lee, J.A., Moreno, K.E, & Sympson, J.B. (1986). The effects of test administration on test performance. Educational and Psychological Measurement, 46, 2, 467-474. The title should be: The Effects of Mode of Test Administration on Test Performance. abstract

This study sought to determine whether the mean score on the computerized version of an arithmetic reasoning test would be significantly lower than that on the paper-and-pencil version when there was no time limit. ( . . . ) A significant main effect for Mode (p < .05) was found, with the mean score obtained by computer lower than that obtained by paper-and-pencil. No interaction between Mode and ability was found.

Leeson, H.V. (2006). The mode effect: A literature review of human and technological issues in computerized testing. International Journal of Testing, 6, 1, 1-24. abstract. A discussion of possible reasons for typically found differences.

Mason, B.J., Patry, M., & Bernstein, D.J. (2001). An examination of the equivalence between non-adaptive computer-based and traditional testing. Journal of Educational Computing Research, 24, 1, 29-39. abstract On 27 psychology students. Forget it.

Mazzeo, J., & Harvey, A. L. (1988). The equivalence of scores from automated and conventional educational and psychological tests: A review of the literature (College Board Rep. No. 88-8, ETS RR No. 88-21). Princeton, NJ: Educational Testing Service. pdf From another era. Mainly research with psychological tests. Final sentence: "Despite the tentative nature of our conclusions, it is clear that test publishers need to perform separate equating and/or norming studies when computer-administered versions of standardized tests are introduced."

Mueller, D.J., & Wasser, V. (1977). Implications of changing answers on objective test items. Journal of Educational Measurement, 14, 1, 9–14. abstract An overview of half a century of research.

Pass-it (2002). Good practice guide in question and test design. Luton: CAA Centre. pdf. Scotland. A kind of short guide to designing test questions. It seems a superficial piece to me (not related to relevant research, so without sources, except for the example questions).

Passmore, T., Brookshaw, L, & Butler, H. (2011). A flexible, extensible online testing system for mathematics. Australasian Journal of Educational Technology, 27, 6, 896-906. open access Interesting, but not a comparative study of digital versus paper.

Pearson (2009). Computer-based & paper-pencil test comparability studies. Test, Measurement & Research Services Bulletin, 9. Related, likewise from Pearson: research.net. Finds no problem whatsoever with computer versus paper, as is to be expected from a firm that believes it depends on computer-based test administration.

Poggio, J., Glasnapp, D. R., Yang, X., & Poggio, A. J. (2005). A comparative evaluation of score results from computerized and paper and pencil mathematics testing in a large scale state assessment program. Journal of Technology, Learning, and Assessment 3, 6. Available via http://www.jtla.org free access This concerns grade 7 students, roughly 13-year-olds. Finds hardly any differences between digital and paper.

Pommerich, M. (2004). Developing computerized versions of paper-and-pencil tests: Mode effects for passage-based tests. Journal of Technology, Learning, and Assessment 2, 6, 1-45. open access

Although the observed mode effects were in general small, overall the findings suggest that it would be beneficial to develop an understanding of factors that can influence examinee behavior and to design a computer interface accordingly, to ensure that examinees are responding to test content rather than features inherent in presenting the test on computer.

Sandene, B., Horkay, N., Bennett, R., Allen, N., Braswell, J., Kaplan, B., & Oranje, A. (2005). Online assessment in mathematics and writing: Reports from the NAEP technology-based assessment project (NCES 2005-457). Washington, DC: Department of Education, National Center for Education Statistics. download here

This document contains reports from the 2001 Math Online (MOL) study and the 2002 Writing Online (WOL) study, both field investigations in the National Assessment of Educational Progress (NAEP) Technology-Based Assessment Project, which explored the use of new technology in NAEP. ( . . . ) Results showed that the computer-based mathematics test was significantly harder than the paper-based test for eighth-grade students. At both grade levels, computer facility predicted online mathematics test performance after controlling for performance on a paper-based mathematics test, suggesting that degree of familiarity with computers may matter when taking a computer-based mathematics test in NAEP.

Scheltens, F., Hickendorff, Eggen, Th. & Hiddink, L. (2014). Hoofdrekenen met papier - hoe zit dat met leerlingen die scoreenen? Reken-wiskundeonderwijs: onderzoek, ontwikkeling, praktijk, 33, 128-140. pdf This is instructive: children may use rather different strategies for mental arithmetic in the Eindtoets or the PPON. That makes it complicated to get paper and digital versions comparable.

Scheltinga, F., Keuning, J., & Kuhlemeier, H. (2014). Gericht werken aan opbrengsten in taal- en leesonderwijs: Een systematische review naar toetsvormen. Cito/Expertisecentrum Nederlands: Arnhem/Nijmegen. pdf

By testing differently in language and reading education, teachers can help their pupils move forward in a more targeted way. That is the cautious conclusion of a literature study carried out by the Expertisecentrum Nederlands and Cito.

Cito does have to be able to sell tests! Anyway, 14 studies are discussed.

Spray, J.A., Ackerman, T.A., Reckase, M.D., & Carlson, J.E. (1989). Effect of the medium of item presentation on examinee performance and item characteristics. Journal of Educational Measurement, 26, 261–271. preview

Threlfall, J., Pool, P., Homer, M., & Swinnerton, B. (2007). Implicit aspects of paper and pencil mathematics assessment that come to light through the use of the computer. Educational Studies in Mathematics, 66, 335-348. preview

The conclusion is not only that translating paper and pencil items into the computer format sometimes undermines their validity as assessments, it is also that some paper and pencil items are less valid as assessments than their computer equivalents would be.

Traub, R. (1993). On the equivalence of the traits assessed by multiple-choice and constructed-response tests. In Bennett, R., & Ward, W. (eds.). Construction versus choice in cognitive measurement (pp. 29-44). Hillsdale, NJ: Lawrence Erlbaum Associates. preview of the book

Wim van den Broeck (22 January 2016). Examens op maat? Het kan anders. web

More and more students with some learning disorder or other are allowed to take an adapted exam at university colleges and universities, this newspaper reported (DM 21/1). This is an application of the UN Convention on the Rights of Persons with Disabilities, signed by Flanders. Yet these 'tailor-made exams' are entirely superfluous and discriminate against students without a diagnosis. There is another way.

Benjamin Herold (Feb 3, 2016). PARCC Scores Lower for Students Who Took Exams on Computers. Education Week web

Roediger, H. L., Putnam, A. L., & Smith, M. A. (2011). Ten benefits of testing and their applications to educational practice. In J. Mestre & B. Ross (Eds.), Psychology of Learning and Motivation: Cognition in Education (pp. 1-36). Oxford: Elsevier. pdf

Ch. E. Harris, M. C. Alkin & W. J. Popham (Eds). Problems in criterion referenced measurement. Los Angeles: Center for the study of evaluation, University of California.

- a.o.:
- W. James Popham: Selecting objectives and generating test items for objectives-based tests 3-25
- Chester W. Harris: Problems of objectives-based measurement 83-93
- Chester W. Harris: Some technical characteristics of mastery tests 98-115
- Melvin R. Novick & Charles Lewis: Prescribing test length for criterion-referenced measurement 139-158

Common Core State Standards Assessments in California: Concerns and Recommendations. CARE-ED Research Brief #1: CCSS Assessments pdf

Elizabeth Ligon Bjork, Nicholas C. Soderstrom & Jeri L. Little (2015). Can multiple-choice testing induce desirable difficulties? Evidence from the laboratory and the classroom. American Journal of Psychology, 128, 229-239. [researchgate.net] preview

Cormac O’Keeffe (2016). Producing data through e-assessment: A trace ethnographic investigation into e-assessment event. European Educational Research Journal, 15, 99-116 [researchgate.net] [academia.edu] (via Ben Williamson) abstract

Backwards Assessment Explanations: Implications for Teaching and Assessment Practice. D. Royce Sadler (2015) In D. Lebler et al. (eds.), Assessment in Music Education: from Policy to Practice, Landscapes: the Arts, Aesthetics, and Education 16, DOI 10.1007/978-3-319-10274-0_2 This chapter is based on a Keynote Address to the Assessment in Music Conference held at the Queensland Conservatorium, Griffith University, Brisbane on Tuesday 16 July 2013. pdf

Don Klinger (). Monitoring, accountability, and improvement, oh no! Assessment policies and practices in Canadian education. In book: Assessment in Education: Implications for Leadership, Chapter: 3, Publisher: Springer, Editors: Shelleyann Scott, Donal E. Scott, Charles F. Webber, pp.53-65 preview

David Carless (2015). Excellence in University Assessment: Learning from award-winning practice. info [eBook in KB]

T. Groenendijk, M. Damen, Folkert Haanstra & C. van Boxtel (2016). Beoordelingsinstrumenten in de kunstvakken - een review. Pedagogische Studiën, 93, 62-82.

K. D. J. M. van der Drift & P. Vos (1987). Anatomie van een leeromgeving. Een onderwijseconomische analyse van universitair onderwijs. Lisse: Swets en Zeitlinger. Doctoral dissertation, Rijksuniversiteit Leiden. With propositions, 290 pp., name on flyleaf, pages otherwise clean and tight.

Gordon Joughin (2010). The hidden curriculum revisited: a critical review of research into the influence of summative assessment on learning. Assessment & Evaluation in Higher Education Vol. 35, No. 3, May 2010, 335–345. pdf

Werkboek veilig toetsen. Hulpmiddel om het toetsproces veilig in te richten. SURF. pdf

Willem K. B. Hofstee & Frits E. Zegers (undated). Het minimum aantal items in een multiple-choice of open-antwoordtoets. Paper, Heymans Instituut.

W. K. B. Hofstee (1996). Beoordeel liever überhaupt niet (tenzij). Commentaar op W. A. Wagenaar. De Psycholoog, 31, 410-411. Wagenaar, W. A. (1996). Beoordeel psychologen niet naar hun successen. De Psycholoog, 31, 407-410. Here Wim once more briefly sets out his thoughts on steering by process versus steering by product, including feedforward as a reason not to assess: “If you assess on output, you corrupt the process.” The nice thing, of course, is that with my test model I make use of precisely that form of ‘corruption’. “When goal-rational thinking thus turns against itself .... ” “In line with the ‘overproduction of policy’ there is, in my impression, a tendency toward overproduction of assessment.” Although the piece is about assessing psychologists, there is nothing against applying it to assessment in education as well, is there?

Wendy McColskey & Mark R. Leary (1985). Differential effects of norm-referenced and self-referenced feedback on performance expectancies, attributions, and motivation. Contemporary Educational Psychology, 10, 275-284. 10.1016/0361-476X(85)90024-4 prediction, attribution abstract

When feedback is provided to students in a norm-referenced manner that compares the individual’s performance to that of others, people who perform poorly tend to attribute their failures to lack of ability, expect to perform poorly in the future, and demonstrate decreased motivation on subsequent tasks. The present study examined the hypothesis that the deleterious effects of failure might be attenuated when failure is expressed in self-referenced terms-relative to the individual’s known level of ability as assessed by other measures.

Egbert Warries (1970). Het relatief meten van leerprestaties in het onderwijs. Nederlands Tijdschrift voor de Psychologie, 25, 429-439. Reply: Wijnen (1971). Rejoinder: NTvdPs, 26, 135-139. Once more: Warries (1971). NTvdPs, 26, 596-598.

Don A. Klinger (). Monitoring, accountability, and improvement, oh no! Assessment policies and practices in Canadian education.

Refers to "(e.g., Delandshere, 2001; Ravitch, 2010; Wilbrink, 1997)". From Diane Ravitch: The death and life of the great American school system. How testing and choice are undermining education

Hunter M. Breland & Judith L. Gaynor (1979). A comparison of direct and indirect assessments of writing skill. Journal of Educational Measurement, 16, 119-128. preview

Direct assessment: essay. Indirect assessment: MC-questioning.

Coffman (1966) showed that the validities of direct assessment were often higher than their reliabilities, even though direct assessment has usually been questioned because of often demonstrated unreliability. Coffman thus emphasized the importance of comparing validities, not reliabilities, of direct and indirect assessment.
p. 119

Rob Schoonen (1998). De nieuwe samenvattingsopdracht in het Centraal Examen Nederlands. Taalbeheersing, 20, 20-38.

A nice starting point for treating the dilemma of knowability (kenbaarheid, M. J. Cohen) versus objectivity. Reliability, in its practical implementation, mainly has to do with objectivity, whereas task variation indicates that there is a knowability problem (at least when that variation leads to rather large differences in outcomes).

David M. Shoemaker (1975). Toward a framework for achievement testing. Review of Educational Research, 45, 127-147. 10.3102/00346543045001127 preview

Central to his account is the concept of the item universe. To make it workable, he introduces on p. 131 the concept of the item domain.
"An item domain is a clearly definable and enumerable subuniverse of items extracted through expert selection from the larger item universe."

Wynne Harlen & Ruth Deakin Crick (2003). Testing and motivation for learning. Assessment in Education, Vol. 10, No. 2, July 2003. pdf

Andrea Gingerich, Susan E. Ramlo, Cees P. M. van der Vleuten, Kevin W. Eva, Glenn Regehr (2016). Inter-rater variability as mutual disagreement: identifying raters’ divergent points of view. Adv in Health Sci Educ DOI 10.1007/s10459-016-9711-8 read

Gavin T. L. Brown & Lois R. Harris (Eds.) (2016). Handbook of human and social conditions in assessment. Routledge. info

Foreword by John Hattie.

Steven J. Howard, Stuart Woodcock, John Ehrich and Sahar Bokosmaty (2016). What are standardized literacy and numeracy tests testing? Evidence of the domain-general contributions to students' standardized educational test performance. British Journal of Educational Psychology abstract

Dominique Sluijsmans & René Kneyber (Eds.) (2016). Toetsrevolutie. Naar een feedbackcultuur in het voortgezet onderwijs. Phronese. A nice initiative: alongside the book, a pdf of it has been made available as open access: download open access pdf

Substantive contributions, interspersed with interviews with teachers. As soon as I can free up time for it, I would like to discuss several of the contributions here.

Marie-Josée Bisson, Camilla Gilmore, Matthew Inglis and Ian Jones (2016). [Mathematics Education Centre, Loughborough University] Measuring Conceptual Understanding Using Comparative Judgement preprint

Ian Jones & Matthew Inglis (2015). The problem of assessing problem solving: can comparative judgement help? Educ Stud Math (2015) 89:337–355 DOI 10.1007/s10649-015-9607-1

Bernard R. Gifford (Ed.) (1989). Test policy and the politics of opportunity allocation: the workplace and the law. National Commission on Testing and Public Policy. Kluwer Academic Publishers. isbn 0792390156

- a.o.:
- Bernard R. Gifford: The allocation of opportunities and the politics of testing: A policy analytic perspective 3-32
- Carolyn Webber: The mandarin mentality: civil service and university admissions testing in Europe and Asia 33-60
- John J. Fremer: Testing companies, trends, and policy issues: A current view from the testing industry 61-80
- Michael A. Rebell: Testing, public policy, and the courts 135-162
- Norman J. Chachkin: Testing in elementary and secondary schools 163-188
- Robert F. Adams: Economic models of discrimination, testing, and public policy 191-210
- Henry M. Levin: Ability testing for job selection: are the economic claims justified? 211-232
- John Sibley Butler: Test scores and evaluation: The military as data 265-292

Bernard R. Gifford (Ed.) (1989). Test policy and test performance: education, language and culture. National Commission on Testing and Public Policy. Kluwer Academic Publishers. isbn 0792390148

Neil J. Dorans & Linda L. Cook (Eds.) (2016). Fairness in educational assessment and measurement. NCME. [PEDAG 51.e.93] info

1. Historical Impetus for Fair Assessment - G. Hughes
2. Philosophical Considerations about Fairness - R. Zwick and N. Dorans
3. Legal Considerations - S. Phillips
4. The Profession's Perspective on Fairness - J. Herman
5. Commentary - F. Worrell
Ensuring Fairness in Test Design, Construction, and Administration
6. Test Design and Test Construction - M. Zieky
7. Test Administration - J. Wollack
8. Considerations for Special Populations - L. Cook
9. Commentary - B. Plake
Assessing the Fairness of Scoring and Score Use under Common Measurement Conditions
10. Scoring - R. Penfield
11. Score Interpretation - J. Liu
12. Score Use - E. Haertel and A. Ho
13. Commentary - S. Sinharay
Assessing the Fairness of Comparisons under Divergent Measurement Conditions
14. Comparing Test Scores Across Different Tests and Modes of Administration - M. Pommerich
15. Comparing Test Scores Across Grade Levels - M. Kolen
16. Comparing Test Scores from Tests Administered in Different Languages - S. Sireci
17. Commentary - D. Thissen
18. Synthesis and Future Directions - N. Dorans and L. Cook

Tim Paramour. The elephant in the primary school classroom: The data is made up. blog

W. James Popham (1999). Why Standardized Tests Don't Measure Educational Quality. Educational Leadership. pdf

Daisy Christodoulou (2017). Making good progress? The future of assessment for learning. Oxford University Press. isbn 9780198413608 info; free access Foreword and chapter 1: pdf

Greg Thompson (2013). NAPLAN, MySchool and Accountability: Teacher perceptions of the effects of testing. The International Education Journal: Comparative Perspectives, 12, 62–84. free

Phelps, R.P. (2016). Teaching to the test: A very large red herring. Nonpartisan Education Review/Essays, 12(1). http://nonpartisaneducation.org/Review/Essays/v12n1.htm pdf

Rob Coe, Cambridge Assessment (2016). What makes great assessment? download

Richard P Phelps [updated June, 2010]. The source of Lake Wobegon free access

Monica Bulger (2016). Personalized Learning: The Conversations We’re Not Having. Working Paper 07.22.2016 [data analytics] [via Ben Williamson] pdf

Dylan Wiliam (2007). Keeping learning on track: Formative assessment and the regulation of learning. researchgate

Gavin T. L. Brown (2017). The future of assessment as a human and social endeavor: Addressing the inconvenient truth of error. Frontiers in Education open access

Richard Phelps (2008). The Role and Importance of Standardized Testing in the World of Teaching and Training. Conference Paper May 2008 researchgate.net

Teije de Vos & Marisca Milikowski. Schoolvaardigheidstoets Rekenen-Wiskunde. Boom. web page

Richard Phelps (2017). The “Teaching to the Test” Family of Fallacies. Revista Iberoamericana de Evaluación Educativa, 2017, 10(1), 33-49. pdf

Karin J. Gerritsen-van Leeuwenkamp, Desirée Joosten-ten Brinke, Liesbeth Kester (2017). Assessment quality in tertiary education: An integrative literature review. Studies in Educational Evaluation, 55, 94-116. pdf

Test item quality is mostly left out. That results in the risk of ‘garbage in, garbage out’ analyses of validity etcetera.

Onmiddellijke Diagnose en Feedback voor Alle Vakken. Ed van den Berg pdf

Norman Verhelst, Gerrit Staphorsius & Frans Kleintjes (November 2003). Scholen langs de meetlat. Arnhem: Citogroep. download

Analysis of Attendance and Graduation Outcomes at Public High Schools in the District of Columbia. January 16, 2018. Blog: A Bit More on the Fraudulent Grades and Promotions in DC Schools. January 28, 2018, John Merrow

Assessment and learning: fields apart? Jo-Anne Baird, David Andrich, Therese N. Hopfenbeck & Gordon Stobart (2017). Assessment in Education: Principles, Policy & Practice, 24, 317-350. abstract

Maple T.A. is an online assessment system for STEM courses. site

Richard P. Phelps (2012) The Effect of Testing on Student Achievement, 1910–2010, International Journal of Testing, 12:1, 21-43, DOI: 10.1080/15305058.2011.602920 abstract

Mien Segers & Dominique Sluijsmans (Eds.) (2018). Toetsrevolutie. Phronese. pdf

Naerssen, R. F. van, Simpele items tegenover complexe. Tijdschrift voor Onderwijsresearch, 1980, 5, 193-198.

Mouly, G. J., & Walton, L. E. (1962). Schaum’s outline of test items in education. New York: McGraw-Hill.

Paul Black & Dylan Wiliam (2018): Classroom assessment and pedagogy, Assessment in Education: Principles, Policy & Practice. open

Boesman, Th. (1942). De examens in de chirurgijnsgilden. Utrecht: Kemink.

This is, after all, a form of higher education: in Leiden, for example, theses (stellingen) had to be answered (defended?) before a professor of the university. (A guild charter of Leiden, 1589 (see p. 26), mentions for the first time 'stellingen' on which candidates are questioned.) Boesman gives a considerable number of these theses, which candidates received a month in advance so that they could prepare. A separate chapter deals with examinations in France and Flanders.

Boevé, Anna Jannetje (2018). Implementing assessment innovations in higher education. Doctoral dissertation, RUG. pdf

O. O. Adesope, D. A. Trevisan & N. Sundararajan (2017). Rethinking the use of tests: A meta-analysis of practice testing. Review of Educational Research, 87, 659-701. DOI: 10.3102/0034654316689306 abstract

The testing effect. "Results reveal that practice tests are more beneficial for learning than restudying and all other comparison conditions."

J. D. Karpicke, A. C. Butler & H. L. Roediger (2009). Metacognitive strategies in student learning: Do students practise retrieval when they study on their own? Memory, 17, 471-479. abstract

C. L. Bae, D. J. Therriault & J. L. Redifer (2018 online first). Investigating the testing effect: Retrieval practice as a characteristic of effective study strategies. Learning and Instruction abstract

ETS (1977). Educational measurement & the law. Proceedings of the 1977 ETS invitational conference. Educational Testing Service.

Barbara Lerner: Equal protection and external screening: Davis, De Funis, and Bakke. 3-28 (+ discussion)
Melvin R. Novick: The influence of the law on professional measurement standards. 41-52 (+ discussion)
Wayne H. Holtzman: Validity and legality. 63-72
Charles L. Thomas: Some possible social implications of recent court decisions 73-86
Norman Frederiksen: There ought to be a law 87-98
Michael Scriven: The logic of judgement in evaluation and the law: Making hard decisions with soft data. 99-108.

Lukas K. Sotola & Marcus Credé (2020). Regarding Class Quizzes: a Meta-analytic Synthesis of Studies on the Relationship Between Frequent Low-Stakes Testing and Class Performance. [meta-analysis.] Educational Psychology Review

- Adesope, O., Trevisan, D., & Sundararajan, N. (2017). Rethinking the use of tests: a meta-analysis of practice testing. Review of Educational Research, 87(3), 659–701.
- Arnold, K. M., & McDermott, K. B. (2013). Test-potentiated learning: distinguishing between direct and indirect effects of tests. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39(3), 940–945. https://doi.org/10.1037/a0029199.
- Basol, G., & Johanson, G. (2009). Effectiveness of frequent testing over achievement: a meta analysis study. International Journal of Human Sciences, 6(2), 99–121.
- *Batsell, Jr., W. R., Perry, J. L., Hanley, E., & Hostetter, A. B. (2017). Ecological validity of the testing effect: the use of daily quizzes in introductory psychology. Teaching of Psychology, 44(1), 18–23.
- Bangert-Drowns, R. L., Kulik, J. A., & Kulik, C. C. (1991). Effects of frequent classroom testing. The Journal of Educational Research, 85(2), 89–99.
- *Bjork, E. L., Little, J. L., & Storm, B. C. (2014). Multiple-choice testing as a desirable difficulty in the classroom. Journal of Applied Research in Memory and Cognition, 3(3), 165-170. http://dx.doi.org.proxy.lib.iastate.edu/10.1016/j.jarmac.2014.03.002.
- Blaxton, T. A. (1989). Investigating dissociations among memory measures: support for a transfer-appropriate processing framework. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15(4), 657–668.
- Butler, A. C., Karpicke, J. D., & Roediger, H. L. (2007). The effect of type and timing of feedback on learning from multiple-choice tests. Journal of Experimental Psychology: Applied, 13(4), 273–281. https://doi.org/10.1037/1076-898X.13.4.273.
- Carpenter, S. K. (2009). Cue strength as a moderator of the testing effect: the benefits of elaborative retrieval. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35(6), 1563–1569. https://doi.org/10.1037/a0017021.
- Carpenter, S. K. (2011). Semantic information activated during retrieval contributes to later retention: support for the mediator effectiveness hypothesis of the testing effect. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37(6), 1547–1552. https://doi.org/10.1037/a0024140.
- Carpenter, S. K., & DeLosh, E. (2006). Impoverished cue support enhances subsequent retention: support for the elaborative retrieval explanation of the testing effect. Memory & Cognition, 34(2), 268–276.
- Carrier, M., & Pashler, H. (1992). The influence of retrieval on retention. Memory & Cognition, 20(6), 633–642. https://doi.org/10.3758/BF03202713.
- Chan, J. C. K., Manley, K. D., Davis, S. D., & Szpunar, K. K. (2018a). Testing potentiates new learning across a retention interval and a lag: a strategy change perspective. Journal of Memory and Language, 102, 83–96.
- Chan, J. C. K., Meissner, C. A., & Davis, S. D. (2018b). Retrieval potentiates new learning: a theoretical and meta-analytic review. Psychological Bulletin, 144(11), 1111–1146. https://doi.org/10.1037/bul0000166.
- Credé, M., Roch, S., & Kieszczynska, U. M. (2010). Class attendance in college: a meta-analytic review of the relationship of class attendance with grades and student characteristics. Review of Educational Research, 80(2), 272–295.
- Dahlke, J. A., & Wiernik, B. M. (2019). psychmeta: an R package for psychometric meta-analysis. Applied Psychological Measurement, 43(5), 415–416. https://doi.org/10.1177/0146621618795933.
- Godden, D., & Baddeley, A. (1975). Context-dependent memory in two natural environments: on land and underwater. British Journal of Psychology, 66(3), 325–331.
- *Geiger, O. G., & Bostow, D. E. (1976). Contingency-managed college instruction: effects of weekly quizzes on performance on examination. Psychological Reports, 39(3), 707–710.
- *Hertzberg, O. E., Heilman, J. D., & Leuenberger, H. W. (1932). The value of objective tests as teaching devices in educational psychology classes. Journal of Educational Psychology, 23(5), 371-380. http://dx.doi.org.proxy.lib.iastate.edu/10.1037/h0072896.
- Jacoby, L. L., Shimizu, Y., Daniels, K. A., & Rhodes, M. G. (2005). Modes of cognitive control in recognition and source memory: depth of retrieval. Psychonomic Bulletin & Review, 12(5), 852–857.
- Jang, Y., & Huber, D. E. (2008). Context retrieval and context change in free recall: recalling from long-term memory drives list isolation. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34(1), 112–127. https://doi.org/10.1037/0278-7393.34.1.112.
- Karpicke, J. D., & Roediger, H. L. (2007). Repeated retrieval during learning is the key to long-term retention. Journal of Memory and Language, 57(2), 151–162. http://dx.doi.org.proxy.lib.iastate.edu/10.1016/j.jml.2006.09.004.
- Kulik, J. A., & Kulik, C. L. C. (1988). Timing of feedback and verbal learning. Review of Educational Research, 58(1), 79–97.
- Little, J. L., Bjork, E. L., Bjork, R. A., & Angello, G. (2012). Multiple choice tests exonerated, at least of some charges: fostering test-induced learning and avoiding test-induced forgetting. Psychological Science, 23(11), 1337–1344. https://doi.org/10.1177/0956797612443370.
- Locke, E. A., & Latham, G. P. (2002). Building a practically useful theory of goal setting and task motivation: a 35-year odyssey. American Psychologist, 57(9), 705–717.
- Martinez, M. E. (1999). Cognition and the question of test item format. Educational Psychologist, 34(4), 207–218. https://doi.org/10.1207/s15326985ep3404_2.
- *McDaniel, M. A., Agarwal, P. K., Huelser, B. J., McDermott, K. B., & Roediger, H. L., III (2011). Test-enhanced learning in a middle school science classroom: the effects of quiz frequency and placement. Journal of Educational Psychology, 103(2), 399–414. http://dx.doi.org.proxy.lib.iastate.edu/10.1037/a0021782
- Morris, C. D., Bransford, J. D., & Franks, J. J. (1977). Levels of processing versus transfer appropriate processing. Journal of Verbal Learning and Verbal Behavior, 16(5), 519–533.
- Pastötter, B., Schicker, S., Niedernhuber, J., & Bäuml, K. T. (2011). Retrieval during learning facilitates subsequent memory encoding. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37(2), 287–297. https://doi.org/10.1037/a0021801.
- Pyc, M. A., & Rawson, K. A. (2012). Why is test–restudy practice beneficial for memory? An evaluation of the mediator shift hypothesis. Journal of Experimental Psychology: Learning, Memory, and Cognition, 38(2), 737–746. https://doi.org/10.1037/a0026166.
- Redick, T. (2019). The hype cycle of working memory training. Current Directions in Psychological Science, 28(5), 423–429. https://doi.org/10.1177/0963721419848668.
- Rickard, T. C., & Pan, S. C. (2018). A dual memory theory of the testing effect. Psychonomic Bulletin & Review, 25(3), 847–859. https://doi.org/10.3758/s13423-017-1298-4.
- Roediger, H. L., Agarwal, P. K., McDaniel, M. A., & McDermott, K. B. (2011). Test-enhanced learning in the classroom: long-term improvements from quizzing. Journal of Experimental Psychology: Applied, 17(4), 382–395.
- Roediger, H. L., & Karpicke, J. D. (2006). Test-enhanced learning: taking memory tests improves long-term retention. Psychological Science, 17(3), 249–255.
- *Ross, C. C., & Henry, L. K. (1939). The relation between frequency of testing and progress in learning psychology. Journal of Educational Psychology, 30(8), 604-611. http://dx.doi.org.proxy.lib.iastate.edu/10.1037/h0055717.
- Rowland, C. A. (2014). The effect of testing versus restudy on retention: a meta-analytic review of the testing effect. Psychological Bulletin, 140(6), 1432–1463.
- Schwieren, J., Barenberg, J., & Dutke, S. (2017). The testing effect in the psychology classroom: a meta-analytic perspective. Psychology Learning and Teaching, 16(2), 179–196. http://dx.doi.org.proxy.lib.iastate.edu/10.1177/1475725717695149.
- Sisk, V. E., Burgoyne, A. P., Sun, J., Butler, J. L., & MacNamara, B. N. (2017). To what extent and under which circumstances are growth mind-sets important to academic achievement? Two meta-analyses. Psychological Science, 29(4), 549–571. https://doi.org/10.1177/0956797617739704.
- Smith, M. A., & Karpicke, J. D. (2014). Retrieval practice with short-answer, multiple-choice, and hybrid tests. Memory, 22(7), 784–802.
- Thijssen, D. H. J., Hopman, M. T. E., van Wijngaarden, M. T., Hoenderop, J. G. J., Bindels, R. J. M., & Eijsvogel, T. M. J. (2019). The impact of feedback during formative testing on study behaviour and performance of (bio) medical students: a randomised controlled study. BMC Medical Education, 19(1), 97. https://doi.org/10.1186/s12909-019-1534-x.
- Vanhove, A. J., & Harms, P. D. (2015). Reconciling the two disciplines of organisational science: a comparison of findings from lab and field research. Applied Psychology, 64(4), 637–673. https://doi.org/10.1111/apps.12046.
- Wissman, K. T., Rawson, K. A., & Pyc, M. A. (2011). The interim test effect: testing prior material can facilitate the learning of new material. Psychonomic Bulletin & Review, 18(6), 1140–1147. https://doi.org/10.3758/s13423-011-0140-7.
- Zaromb, F. M., & Roediger, H. L. (2010). The testing effect in free recall is associated with enhanced organizational processes. Memory & Cognition, 38(8), 995–1008. https://doi.org/10.3758/MC.38.8.995.

Steven M. Downing & Thomas M. Haladyna (1996). A model for evaluating high-stakes testing programs: Why the fox should not guard the chicken coop. EM:IP spring 5-12. abstract and pdf

Recommended Articles about High-Stakes Tests. VAMboozled! A blog by Audrey Amrein-Beardsley page

Molenaar (1981). On Wilcox's latent structure model for guessing. BrJMStPs, 34, 224-228. With reply: Wilcox (1981). Methods and recent advances in measuring achievement: a response to Molenaar. BrJMStPs, 34, 229-237. guessing, guessing probability

= Molenaar, W. (1977). On Bayesian formula scores for random guessing in multiple choice tests. BrJMStPs, 30, 79-89. abstract

Among other things, p. 86 gives a mixture of Pólya distributions. He does milk the subject, and even then it only concerns the fraction known of the questions in the test, not of those in the domain. Research on the latter is in progress. Is that the article written together with Willink? There is an assumption in this article that has remained implicit: for every student a new selection of n items is assumed. See Molenaar, I. W. (1980). On Wilcox's latent structure model for guessing. BrJMStPs.

Wilmink, F. W. (1977). Publikatie van tentamenvragen en de tentamenskore. Tijdschrift voor Onderwijs Research, 2, 157-164. http://objects.library.uu.nl/reader/resolver.php?obj=000739914

Wilcox, R. R. (1977). Estimating the likelihood of false-positive and false-negative decisions in mastery testing: an empirical Bayes approach. Journal of Educational Statistics, 2, 289-307.

Wilcox, R. R. (1978). Estimating true score in the compound binomial error model. Psychometrika, 43, 245-258.

Wilcox, R. R. (1979). A lower bound to the probability of choosing the optimal passing score for a mastery test when there is an external criterion. Pm 1979, 44, 245-249. 10.1007/BF02293976 abstract

Wilcox, R. R. (1979). Applying ranking and selection techniques to determine the length of a mastery test. EPM, 39: 13 crm

Wilcox, R. R. (1979). Comparing examinees to a control. Psychometrika 44, 55-68 setting standards; indifference zone; strong true-score models. binomial model researchgate.net

Wilcox, R. R. (1979). On false-positive and false-negative decisions with a mastery test. JESt, 4, 59-73. crm

Wilcox, R. R. (1979). Prediction analysis and the reliability of a mastery test. EPM, 39: 825. crm

Wilcox, R. R. (1980). An approach to measuring the achievement of proficiency of an examinee. APM, 4, 241-251. 10.1177/014662168000400210 scihub pdf

Wilcox, R. R. (1981). Determining the length of a criterion-referenced test. APM, 5, 425-446. (latent trait models) crm

When determining how many items to include on a criterion-referenced test, practitioners must resolve various nonstatistical issues before a particular solution can be applied. A fundamental problem is deciding which of three true scores should be used. The first is based on the probability that an examinee is correct on a "typical" test item. The second is the probability of having acquired a typical skill among a domain of skills, and the third is based on latent trait models. Once a particular true score is settled upon, there are several perspectives that might be used to determine test length. The paper reviews and critiques these solutions. Some new results are described that apply when latent structure models are used to estimate an examinee's true score.

Wilcox, R. R. (1981). Solving measurement problems with an answer-until-correct scoring procedure. APM, 5, 399-414. guessing

Rand R. Wilcox (1982) Some new results on an answer-until-correct scoring procedure 10.1111/j.1745-3984.1982.tb00116.x abstract

Wilcox, R. R. (1982). Some empirical and theoretical results on an answer-until-correct scoring procedure. BrJMStPs, 35, 57-70. beta-binomial, guessing

Wilcox, R. R. (1982). Determining the length of multiple choice criterion-referenced tests when an answer-until-correct scoring procedure is used. EPM, 42: 789. (guessing) tvr crm

Wilcox, R. R. (1983). A simple model for diagnostic testing when there are several types of misinformation. JExE, 52(1), 57.

Wilcox, R. R. (1977). Estimating the likelihood of false positive and false negative decisions in mastery testing: an empirical Bayes approach. JESt, 2, 289-307. crm

Zegers, F.E., Hofstee, W.K.B. & Korbee, C.J.M. Een beleidsinstrument m.b.t. cesuurbepaling. Paper ORD 1978. R.U. Groningen, subfaculteit Psychologie, vakgroep

Nitko, A. J. (Ed.) (1991). The practical matter of setting standards. Special issue of Educational Measurement: Issues and Practice, 10(2). Including:
- R. M. Jaeger: Selection of judges for standard setting abstract
- Craig N. Mills, Gerald J. Melican, & Nancy Thomas Ahluwalia: Defining minimal competence abstract
- J. B. Reid: Training judges to generate standard-setting data abstract
- K. F. Geisinger: Using standard-setting data to establish cutoff scores
- W. A. Mehrens: Facts about samples, fantasies about domains abstract

Melvin R. Novick, Charles Lewis, & Paul H. Jackson (1973). The estimation of proportions in m groups. Pm, 38, 19-46. abstract

One major application would be in the use of criterion-referenced tests in individually prescribed instruction (IPI). Here students completing a curriculum module are evaluated on the basis of a very small number of observations (a short test). With conventional estimation procedures, the decision as to whether any given student has satisfactorily mastered the material of the module will be highly uncertain due to the shortness of the test. However, with the Bayesian method, that decision will be far more certain. The Bayesian Model II method, under certain reasonable conditions, effectively increases accuracy to an extent equivalent to adding between six and 25 items to an educational test that might consist of as few as five items.
p. 21
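
The "equivalent to adding items" idea in the quotation can be illustrated with a much simpler stand-in than the Model II procedure of Novick, Lewis and Jackson: a conjugate beta-binomial update, in which a Beta(a, b) prior behaves like a + b items that have already been observed. This is only a sketch under that assumption; the function names, the prior values and the 5-item test are hypothetical and are not taken from the paper.

```python
from fractions import Fraction


def posterior_mean(correct, n_items, prior_a, prior_b):
    """Posterior mean of the proportion correct under a Beta(prior_a, prior_b) prior."""
    return Fraction(prior_a + correct, prior_a + prior_b + n_items)


def posterior_variance(correct, n_items, prior_a, prior_b):
    """Posterior variance of the proportion correct (smaller means more certain)."""
    a = prior_a + correct
    b = prior_b + n_items - correct
    return Fraction(a * b, (a + b) ** 2 * (a + b + 1))


# Hypothetical short mastery test: 4 of 5 items correct.
n_items, correct = 5, 4

# Flat prior: the 5 observed items carry almost all of the weight.
flat_mean = posterior_mean(correct, n_items, 1, 1)
flat_var = posterior_variance(correct, n_items, 1, 1)

# Informative prior (assumed, e.g. from comparable students): worth 10 pseudo-items,
# centred on 0.8, so the 5-item test is judged as if it were a 15-item test.
informed_mean = posterior_mean(correct, n_items, 8, 2)
informed_var = posterior_variance(correct, n_items, 8, 2)

print(float(flat_mean), float(flat_var))
print(float(informed_mean), float(informed_var))
```

The posterior variance shrinks when the prior carries information, which is the sense in which collateral information can stand in for extra test items; how many items it is worth depends entirely on how strong a prior one is willing to defend.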

Gregory J. Cizek (1996). Standard-setting guidelines. EM:IP spring 10.1111/j.1745-3992.1996.tb00802.x abstract

Lord, F. M. (1975). Formula scoring and number right scoring. Journal of Educational Measurement, 12: 7-11. #guessing

Wim J. van der Linden (2005). Classical test theory. In Kimberley Kempf-Leonard: Encyclopedia of social measurement. Elsevier. 301-307. [not available online; I have a copy]

Julian C. Stanley & Marilyn D. Wang (1970). Weighting test items and test-item options, an overview of the analytical and empirical literature. EPM, 30, 21-35. preview

Starren, H. (1990). De beoordeling als hefboom voor onderwijsverbetering. Optimaliseren van leerresultaten via veranderen van tentamen- en examenregels. De Psycholoog, 25, 109-113. [I have a hard copy] [refers to my publications. Ha, that is rare]

Starren, H. (1996). De toets als hefboom voor gewenst leergedrag. De Psycholoog, 294-5. [I have a hard copy]

Starren, H. (1998). De toets als hefboom voor meer en beter leren. Academia. Theme? A nice piece that ought to tell me something about how I can present my test model to a broader audience, or also: how I can frame that model more broadly than I have been doing in recent years. Will add it to the literature for part 2.2. fc

Starren, H. (2001). Incompatibiliteit van toetsing in het hoger onderwijs. Tijdschrift voor Hoger Onderwijs, 19, 120-129. open access https://www.tvho.nl/edition.php?id=60

Starren, H. (2001). Infantilisering van de psychologieopleiding? De Psycholoog, 36, #12, 652-657. Original filed under t Hofstee (special issue of De Psycholoog, Toetsen in het onderwijs).

Starren, J. (1988). Uitspraken over onderwijsresultaten. In Starren et al. 1988, 151-228.

Starren, J., S. J. Bakker, & A. van der Wissel (Eds.) (1995). Inleiding in de onderwijspsychologie. Bussum: Coutinho. isbn 9062837158, 2nd edition, 330 pp. (Also covers, among other things, the Groningen (Hofstee) method for setting cutoff scores.)

Starren, J., S. J. Bakker, & A. van der Wissel (Eds.) (1988). Inleiding in de onderwijspsychologie. Muiderberg: Coutinho.

Richard J. Stiggins (1991). Assessment literacy. Phi Delta Kappan, 72, 534-539. (mentioned by Wiggins 1993) fc

Michael J. Subkoviak (1976). Estimating reliability from a single administration of a mastery test. Journal of Educational Measurement 13, 265-276. crm

Michael J. Subkoviak The reliability of mastery classification decisions. Unpublished paper, 1978.

Michael J. Subkoviak (19??). Decision-consistency approaches. In Berk, R. A. (Ed.). Criterion-referenced measurement: the state of the art (p. 129-185). Baltimore: The Johns Hopkins University Press. With an absurd number of tables. [not available online; keep the photocopy]

Tittle, C. K. (1994). Toward an educational psychology of assessment for teaching and learning: theories, contexts, and validation arguments. Ed. Psychologist, 29, 149-162. abstract

Howard Wainer & David Thissen (1996). How is reliability related to the quality of test scores? What is the effect of local dependence on reliability? Educational Measurement: Issues and Practice, spring, 22-29.

This is relevant for the analysis of reading comprehension (tekstbegrip.htm) in final examinations, where several questions are asked about a given case (= local dependence). Nice finger exercises, but what use are they to me? Support for my attempt to set up a model for individual students? For that, see p. 23:
Suppose we administered two parallel forms of a test and we assume that the examinees do not change between testing administrations. Thus we have two scores for each examinee that, under ideal circumstances, ought to be identical. How large might the difference between the two scores be, as a result of differences between the test forms due to chance?
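
Under one common simplifying assumption, the binomial error model (each of n items on either form is answered correctly with the same probability p, independently of the others), the chance difference asked about in the quotation can be computed directly. The sketch below is not Wainer and Thissen's own analysis; the function names and the values n = 40 and p = 0.7 are hypothetical.

```python
from math import comb, sqrt


def binomial_pmf(k, n, p):
    """Probability of exactly k correct answers out of n items."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def difference_distribution(n, p):
    """Exact distribution of X1 - X2 for two independent Binomial(n, p) scores."""
    pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
    dist = {}
    for x1, p1 in enumerate(pmf):
        for x2, p2 in enumerate(pmf):
            dist[x1 - x2] = dist.get(x1 - x2, 0.0) + p1 * p2
    return dist


# Hypothetical examinee: true proportion correct 0.7 on two parallel 40-item forms.
n, p = 40, 0.7
dist = difference_distribution(n, p)

# Standard deviation of the difference between the two observed scores.
sd_difference = sqrt(2 * n * p * (1 - p))

# Probability that the two forms differ by three or more points purely by chance.
prob_gap_3_or_more = sum(prob for d, prob in dist.items() if abs(d) >= 3)

print(round(sd_difference, 2), round(prob_gap_3_or_more, 3))
```

Local dependence between items, as with several questions on one text, violates the independence assumption used here, so this calculation should be read as a baseline rather than as a description of such examinations.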

Herbert J. Walberg, Bernadette F. Strykowski, Evangelina Rovai, & Samuel S. Hung. Exceptional performance. Review of Educational Research, 1984, 54, 87-112. JSTOR

Wiggins, G. (1993). Assessment: authenticity, context, and validity. Phi Delta Kappan, 75 no 3, 200-214.

Wiseman, S. (1949). The marking of English composition in grammar school selection. BrJEdPs, 19, 200-209. 10.1111/j.2044-8279.1949.tb01622.x abstract

Wijnen, W. H. F. W., & W. K. B. Hofstee (196?). Een poging tot tentamen-analyse. In ???? (conference proceedings on educational research or the like?) 171-177. [I have a copy]

Novick, M. R. (1980). Statistics as psychometrics. Psychometrika, 45, 411-424. abstract

Extensive on utility. E.g., p. 420:
It is evident that utility assessment is difficult and that there are many biases to be avoided. My only surprise is that there was ever any belief that simple methods would be adequate. Surely fifty years of work in opinion polling should have made us more sophisticated.

Novick & Jackson (1970). Bayesian guidance technology. RER, 40, 459-494. JSTOR

W. James Popham (1993). Educational testing in America: what's right, what's wrong? A criterion-referenced perspective. Ed Meas, 12 #1 11-15. 10.1111/j.1745-3992.1993.tb00517.x abstract

W. James Popham (1993). Circumventing the high costs of authentic assessment. Phi Delta Kappan, 74, 470-473. [I have a photocopy] preview

W. James Popham (1999). Where large scale educational assessment is heading and why it shouldn't. Educational Measurement: Issues and Practice, Fall, 13-17. 10.1111/J.1745-3992.1999.TB00268.X abstract

Linda Sturman (2003). Teaching to the test: science or intuition? Educational Research, 45, 261-273. abstract & pdf

Laura Hamilton (2003). Assessment as a Policy Tool. 10.3102/0091732X027001025 preview & references

Bernard Weiner (1994). Ability versus effort revisited: the moral determinants of achievement evaluation and achievement as a moral system. Ed. Psychologist, 29, 163-172. academia.edu

Valerie J. Shute (2008). Focus on formative feedback. Review of Educational Research, 78, 153-189. ETS Research Report 2007 [the ETS report is not exactly identical to the article]

"The premise underlying most of the research conducted in this area is that good feedback can significantly improve learning processes and outcomes, if delivered correctly. Those last three words—if delivered correctly—comprise the crux of this review. "
p. 154

C. C. Ross (1947 2nd). Measurement in Today's Schools. hathitrust

Richard P. Phelps (2020). Down the Memory Hole: Evidence on Educational Testing. Academic Questions, 33, 269–278 10.1007/s12129-020-09876-9 info 556

Willem K. B. Hofstee (2001). Beoordeling in het onderwijs - of niet? De Psycholoog, special issue on testing, December 2001, 640-644. Not available online. [I have this issue of De Psycholoog, on testing]

van der Linden, W. J. (1983). Van standaardtest naar itembank (Inaugural address). Enschede, The Netherlands: University of Twente. (In Dutch)

[Not available online, even though it is a nice presentation. I have the address, but it is of little or no use to me because referring to it is fairly pointless.]

Chunliang Yang, Liang Luo, Miguel A. Vadillo, Rongjun Yu, & David R. Shanks (2020). Testing (Quizzing) Boosts Classroom Learning: A Systematic and Meta-Analytic Review. pdf, via Dan Willingham https://twitter.com/DTWillingham/status/1379898918700466177

Mirjam Remie (7 April 2021). De digitale surveillant staat naast je bed. article

Van toets naar toets. Bea Ros & Monique Marreveld (21-10-2021) (previously published in De Groene). open

Salvador Algarabel and Carmen Dasi (2001). The definition of achievement and the construction of tests for its measurement: A review of the main trends. Psicológica, 22, 43-66. download

Robert J. Mislevy, Mark R. Wilson, & Kadriye Ercikan (2001). Psychometric Principles in Student Assessment. To appear in D. Stufflebeam & T. Kellaghan (Eds.), International Handbook of Educational Evaluation. Dordrecht, the Netherlands: Kluwer Academic Press. academia.edu

Schmeiser, C. B. (1992). Ethical codes in the professions. Educational Measurement: Issues and Practice, 11, #3, 5-11. abstract and pdf

James W. Pellegrino & Naomi Chudowsky (2003). The Foundations of Assessment. Measurement: Interdisciplinary Research and Perspectives. academia.edu

Pieter Gordts (17 December 2021). Pedagoog Pedro De Bruyckere over toetsen: ‘Het gaat ook over de macht over ons onderwijs’. De Morgen. open

Zahra Javidanmehr & Mohammad Reza Anani Sarab (2017). Cognitive Diagnostic Assessment: Issues and Considerations. International Journal of Language Testing. academia [applying ideas of Leighton]

Frans J. G. Janssens (1985). Toetsgebruik in de onderwijspraktijk: stand van zaken. Tijdschrift voor Onderwijsresearch 10 (1985), nr. 6, pp. 2-291. open

Dave Bartram & Ronald K. Hambleton (Eds.) (2006). Computer-based testing and the internet. Issues and advances. academia.edu

Contents
- Introduction: The International Test Commission and its Role in Advancing Measurement Practices and International Guidelines; Thomas Oakland 1
- 1 Testing on the Internet: Issues, Challenges and Opportunities in the Field of Occupational Assessment; Dave Bartram 13
- 2 Model-Based Innovations in Computer-Based Testing; Wim J. van der Linden 39
- 3 New Tests and New Items: Opportunities and Issues; Fritz Drasgow and Krista Mattern 59
- 4 Psychometric Models, Test Designs and Item Types for the Next Generation of Educational and Psychological Tests; Ronald K. Hambleton 77
- 5 Operational Issues in Computer-Based Testing; Richard M. Luecht 91
- 6 Internet Testing: The Examinee Perspective; Michael M. Harris 115
- 7 The Impact of Technology on Test Manufacture, Delivery and Use and on the Test Taker; Dave Bartram 135
- 8 Optimizing Quality in the Use of Web-Based and Computer-Based Testing for Personnel Selection; Lutz F. Hornke and Martin Kerstin 149
- 9 Computer-Based Testing for Professional Licensing and Certification of Health Professionals; Donald E. Melnick and Brian E. Clauser 163
- 10 Issues that Simulations Face as Assessment Tools; Charles Johnson 187
- 11 Inexorable and Inevitable: The Continuing Story of Technology and Assessment; Randy Elliot Bennett 201
- 12 Facing the Opportunities of the Future; Krista J. Breithaupt, Craig N. Mills and Gerald J. Melican 219

July 2022 \ contact ben at at at benwilbrink.nl

http://www.benwilbrink.nl/literature/toetsen.htm http://goo.gl/1K3Uc

Literature on assessment (toetsen, as opposed to tests)

Ben Wilbrink