Literature on test psychology (psychometrics, methodology)


Ben Wilbrink

See also projecten/raden.htm

See also projecten/geheimhouding.htm




Peter Herriot (Ed.) (1989). Assessment and selection in organizations: Methods and practice for recruitment and appraisal. Chichester: Wiley. isbn 0471916404




K. Anders Ericsson & Herbert A. Simon (1981). Cognition: An historical overview. In Thomas V. Merluzzi, Carol R. Glass and Myles Genest (Eds.) (1981). Cognitive assessment. New York: Guilford Press. [scan available in Simon archive http://diva.library.cmu.edu/Simon/ ] [never reprinted]




Earl Hunt (2011). Human Intelligence. Cambridge University Press.




Kofi Kissi Dompere (2014). Fuzziness, Democracy, Control and Collective Decision-choice System: A Theory on Political Economy of Rent-Seeking and Profit-Harvesting. Springer. [eBook in KB] info


I do not have much literature on rent-seeking outside the world of selection at the gate, so I have simply noted this one.



Helga A. H. Rowe (Ed.) (1991). Intelligence: Reconceptualization and Measurement. Erlbaum. [eBook in KB] preview Questia




Maria Elena Oliveri & Matthias von Davier (2014). Toward Increasing Fairness in Score Scale Calibrations Employed in International Large-Scale Assessments. International Journal of Testing, 14, 1-21. open access


Data used: PIRLS.



Arne Evers, Klaas Sijtsma, Wouter Lucassen & Rob R. Meijer (2010). The Dutch Review Process for Evaluating the Quality of Psychological Tests: History, Procedure, and Results. International Journal of Testing, 10. abstract [paywall]


History and working procedure of the Cotan.



Sanja Franić, Conor V. Dolan, Denny Borsboom, James J. Hudziak, Catherina E. M. van Beijsterveldt & Dorret I. Boomsma (2013). Can genetics help psychometrics? Improving dimensionality assessment through genetic factor modeling. Psychological Methods, 18, 406-433. abstract




Wim J. van der Linden (1998). A discussion of some methodological issues in international assessments. International Journal of Educational Research, 29, 569-577. abstract




Stephen G. Sireci and Polly Parker (2006). Validity on Trial: Psychometric and Legal Conceptualizations of Validity. Educational Measurement: Issues and Practice, fall, 27-34. abstract




Shudong Wang, Hong Jiao, Michael J. Young, Thomas Brooks and John Olson (2008). Comparability of Computer-Based and Paper-and-Pencil Testing in K-12 Reading Assessments: A Meta-Analysis of Testing Mode Effects. Educational and Psychological Measurement, 68, 5-24. abstract




Fadia Nasser-Abu Alhija & Adi Levy (2009). Effect Size Reporting Practices in Published Articles. Educational and Psychological Measurement, 69, 245-265. abstract




Alvaro J. Arce-Ferrer and Elvira Martínez Guzmán (2009). Studying the Equivalence of Computer-Delivered and Paper-Based Administrations of the Raven Standard Progressive Matrices Test. Educational and Psychological Measurement, 69, 855-867. abstract


Finds no differences, in contrast to an earlier review by Kubinger (1991).



Anneke C. Timmermans, Tom A. B. Snijders and Roel J. Bosker (2013). In Search of Value Added in the Case of Complex School Effects. Educational and Psychological Measurement, 73, 210-228. abstract


I see this article primarily as a technical analysis: specify a model, take an available dataset, and start computing. Examine how model A leads to different outcomes than model B. At least in this article, the authors hardly address the question of whether estimating value added is a meaningful enterprise with which society may be burdened. They simply run the model computations, and, as is then guaranteed to be the case, that yields particular outcomes. Yet quite a few tacit and not so tacit presuppositions underlie this way of working.



Robert W. Lissitz (2009). Validity. Revisions, new directions, and applications. Information Age Publishing. [not yet seen]



Wim J. van der Linden & Minjeong Jeon (2012). Modeling Answer Changes on Test Items. Journal of Educational and Behavioral Statistics, 37, 180-199. abstract pdf

On fraudulent changes.



Wim J. van der Linden, Minjeong Jeon & Steve Ferrara (2011). A Paradox in the Study of the Benefits of Test-Item Review. Journal of Educational Measurement, 48, 380-398. pdf



Kristian E. Markon (2013). Information Utility: Quantifying the Total Psychometric Information Provided by a Measure. Psychological Methods, 18, 15-35. abstract



Gregory J. Cizek (2012). Defining and Distinguishing Validity: Interpretations of Score Meaning and Justifications of Test Use. Psychological Methods, 17, 31-43. abstract



Ken Kelley & Kristopher J. Preacher (2012). On effect size. Psychological Methods, 17, 137-172. accepted concept



Michèle Nuijten, Marie Deserno, Angélique Cramer & Denny Borsboom (2013). Psychologische stoornissen als complexe netwerken. De Psycholoog, January, 12-23. [corrected reference, see De Psycholoog, February 2013, p. 4]



Han L. J. van der Maas, Conor V. Dolan, Raoul P. P. P. Grasman, Jelte M. Wicherts, Hilde M. Huizenga, and Maartje E. J. Raijmakers (2006). A Dynamical Model of General Intelligence: The Positive Manifold of Intelligence by Mutualism. Psychological Review, 113, 842-861. pdf




Nate Silver (2012). The signal and the noise. Why so many predictions fail -- but some don't. The Penguin Press. isbn 9781594204111 http://www.nytimes.com/2012/10/24/books/nate-silvers-signal-and-the-noise-examines-predictions.html http://www.npr.org/2012/10/10/162594751/signal-and-noise-prediction-as-art-and-science



Cynthia G. Parshall, Judith A. Spray, John C. Kalohn, and Tim Davey (2002). Practical Considerations in Computer-Based Testing. Springer. [Not yet seen. Reviewed by Rob Meijer: Applied Psychological Measurement, 27, 78-80.]



David Thissen & Howard Wainer (Eds.) (2001). Test Scoring. Springer. [Not yet seen. Reviewed by Rob Meijer: Applied Psychological Measurement, 27, 75-77.]



Ronald K. Hambleton (2000). Advances in Performance Assessment Methodology. Applied Psychological Measurement, 24, 291-293. [Introduction to special issue]



Randy Elliot Bennett, Mary Morley & Dennis Quardt (2000). Three Response Types for Broadening the Conception of Mathematical Problem Solving in Computerized Tests. Applied Psychological Measurement, 24, 294-309. abstract



M. David Miller & Robert L. Linn (2000). Validation of performance-based assessments. Applied Psychological Measurement, 24, 367-378. abstract



Wim J. van der Linden (2000). Optimal Assembly of Tests with Item Sets. Applied Psychological Measurement, 24, 225-240. abstract


This concerns the type of exam item in which a text is given, about which several questions are then asked - a format much used in final exams. The word ‘optimal’ naturally has only a limited meaning: optimal within given constraints. If those constraints are poor, such as the quality of the questions in the pool being drawn from, then that ‘optimal’ is a euphemism.



Tom Verguts & Paul de Boeck (2000). A Rasch Model for Detecting Learning While Solving an Intelligence Test. Applied Psychological Measurement, 24, 151-162. abstract


A striking title. Intriguing.



E. Matthew Schulz, Michael J. Kolen & W. Alan Nicewander (1999). A Rationale for Defining Achievement Levels Using IRT-Estimated Domain Scores. Applied Psychological Measurement, 23, 347-362. abstract



Rob R. Meijer & Michael L. Nering (1999). Computerized Adaptive Testing: Overview and Introduction. Applied Psychological Measurement, 23, 187-194. abstract



Chi-Keung Leung, Hua-Hua Chang & Kit-Tai Hau (2005). Computerized adaptive testing: A mixture item selection approach for constrained situations. British Journal of Mathematical and Statistical Psychology, 58, 239-257. abstract



T. J. H. M. Eggen (1999). Item Selection in Adaptive Testing with the Sequential Probability Ratio Test. Applied Psychological Measurement, 23, 249-261. abstract



Almond & Mislevy (1999). Graphical Models and Computerized Adaptive Testing. Applied Psychological Measurement, 23, 223-237. abstract



Tenko Raykov (1999). Are Simple Change Scores Obsolete? An Approach to Studying Correlates and Predictors of Change. Applied Psychological Measurement, 23, 120-126. abstract



Nambury S. Raju, Reyhan Bilgic, Jack E. Edwards & Paul F. Fleer (1999). Accuracy of Population Validity and Cross-Validity Estimation: An Empirical Comparison of Formula-Based, Traditional Empirical, and Equal Weights Procedures. Applied Psychological Measurement, 23, 99-115. abstract



Wim J. van der Linden (1999). Empirical Initialization of the Trait Estimator in Adaptive Testing. Applied Psychological Measurement, 23, 21-29. abstract



Gideon J. Mellenbergh (1999). A Note on Simple Gain Score Precision. Applied Psychological Measurement, 23, 87-89. abstract



John R. Bergan, Richard D. Schwarz & Linda A. Reddy (1999). Latent Structure Analysis of Classification Errors in Screening and Clinical Diagnosis: An Alternative to Classification Analysis. Applied Psychological Measurement, 23, 69-86. abstract



Klaas Sijtsma & Anton C. Verweij (1999). Knowledge of Solution Strategies and IRT Modeling of Items for Transitive Reasoning. Applied Psychological Measurement, 23, 55-68. abstract


Research in which the pupils had to justify their answers on the test. See also chapter 2 of Toetsvragen ontwerpen.



Wim van der Linden (1998). Optimal assembly of psychological and educational tests. Applied Psychological Measurement, 22, 195-211. abstract



Anat Ben-Simon, David V. Budescu and Baruch Nevo (1997). A Comparative Study of Measures of Partial Knowledge in Multiple-Choice Tests. Applied Psychological Measurement, 21, 65-88. abstract



Craig W. Deville (1996). An empirical link of content and construct validity evidence. Applied Psychological Measurement, 20, 127-139. abstract



Richard H. Williams & Donald W. Zimmerman (1996). Are simple gain scores obsolete? Applied Psychological Measurement, 20, 59-69. abstract



Rolf Langeheine, Elsbeth Stern & Frank van de Pol (1994). State Mastery Learning: Dynamic Models for Longitudinal Data. Applied Psychological Measurement, 18, 277-291. abstract



Menucha Birenbaum, Kikumi K. Tatsuoka & Yaffa Gutvirtz (1992). Effects of Response Format on Diagnostic Assessment of Scholastic Achievement. Applied Psychological Measurement, 16, 353-363. abstract


In the case of algebra items.



A.H.G.S. van der Ven & F.M. Gremmen (1992). The Knowledge or Random Guessing Model for Matching Tests. Applied Psychological Measurement, 16, 177-194. abstract



Mary E. Lunz, Betty A. Bergstrom & Benjamin D. Wright (1992). The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Test. Applied Psychological Measurement, 16, 33-40. abstract



Frits E. Zegers (1991). Coefficients for interrater agreement. Applied Psychological Measurement, 15, 321-333. abstract



Frits E. Zegers (1989). Het meten van overeenstemming. Nederlands Tijdschrift voor de Psychologie, 44, 145-156.

“. . . one teacher gives the grades 7, 8 and 9, while the other gives the same essays the grades 2, 3 and 4 respectively. The pmc between these sets of scores is maximal (+1), but it is hard to defend the claim that the teachers agree with each other completely.”

p. 145
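
A minimal numerical sketch of Zegers' point (my illustration, not from the article): the product-moment correlation is blind to the additive shift of five grade points, whereas an agreement coefficient computed on the raw scores, such as the identity coefficient 2*Sum(xy) / (Sum(x^2) + Sum(y^2)), is not.

# Zegers' example: two teachers grade the same three essays.
x = [7, 8, 9]  # teacher 1
y = [2, 3, 4]  # teacher 2

n = len(x)
mx, my = sum(x) / n, sum(y) / n
# Product-moment correlation: insensitive to the additive shift.
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
sx = sum((a - mx) ** 2 for a in x) ** 0.5
sy = sum((b - my) ** 2 for b in y) ** 0.5
print(cov / (sx * sy))  # 1.0: 'perfect' correlation, yet no agreement

# Identity coefficient on the raw scores: the 5-point shift now counts.
ident = 2 * sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + sum(b * b for b in y))
print(ident)  # about 0.66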



W. K. B. Hofstee & F. E. Zegers (1991). Idiographic correlation: modeling judgments of agreement between school grades. Tijdschrift voor Onderwijsresearch, 16, 331-336.



John B. Carroll (1990). Estimating Item and Ability Parameters in Homogeneous Tests With the Person Characteristic Function. Applied Psychological Measurement, 14, 109-125. abstract



Huub van den Bergh (1990). On the Construct Validity of Multiple- Choice Items for Reading Comprehension. Applied Psychological Measurement, 14, 1-12. abstract



Michael I. Waller (1990). Modeling Guessing Behavior: A Comparison of Two IRT Models. Applied Psychological Measurement, 13, 233-243. abstract



Jerry S. Gilmer (1989). The Effects of Test Disclosure on Equated Scores and Pass Rates. Applied Psychological Measurement, 13, 245-255. abstract



Terry A. Ackerman (1989). Unidimensional IRT Calibration of Compensatory and Noncompensatory Multidimensional Items. Applied Psychological Measurement, 13, 113-127. abstract



Marion S. Aftanas (1988). Theories, Models, and Standard Systems of Measurement. Applied Psychological Measurement, 12, 325-338. abstract



Terry A. Ackerman & Philip L. Smith (1988). A Comparison of the Information Provided by Essay, Multiple-Choice, and Free-Response Writing Tests. Applied Psychological Measurement, 12, 117-128. abstract



David V. Budescu (1988). On the Feasibility of Multiple Matching Tests — Variations on a Theme by Gulliksen. Applied Psychological Measurement, 12, 5-14. abstract



David V. Budescu (1987). Open-Ended Versus Multiple-Choice Response Formats—It Does Make a Difference for Diagnostic Purposes. Applied Psychological Measurement, 11, 385-395. abstract



Wim J. van der Linden (1986). The Changing Conception of Measurement in Education and Psychology. Applied Psychological Measurement, 10, 325-332. abstract


Technocratic.



Catharina C. van Thiel & Michel A. Zwarts (1986). Development of a Testing Service System. Applied Psychological Measurement, 10, 391-403. abstract



Ronald K. Hambleton & Richard J. Rovinelli (1986). Assessing the Dimensionality of a Set of Test Items. Applied Psychological Measurement, 10, 287-302. abstract



Harold Gulliksen (1986). Perspective on Educational Measurement. Applied Psychological Measurement, 10, 109-132. abstract



J. P. Guilford (1985). A Sixty-Year Perspective on Psychological Measurement. Applied Psychological Measurement, 9, 341-349. abstract



Anne Anastasi (1985). Some Emerging Trends in Psychological Measurement: A Fifty-Year Perspective. Applied Psychological Measurement, 9, 121-138. abstract



Gail Ironson, Susan Homan & Ruth Willis (1984). The Validity of Item Bias Techniques with Math Word Problems. Applied Psychological Measurement, 8, 391-396. abstract



Albert C. Oosterhof & Pamela K. Coats (1984). Comparison of Difficulties and Reliabilities of Quantitative Word Problems in Completion and Multiple-Choice Item Formats. Applied Psychological Measurement, 8, 287-294. abstract



Robert L. Linn & C. Nicholas Hastings (1984). Group differentiated prediction. Applied Psychological Measurement, 8, 165-172. abstract



Michael Kane & Jennifer Wilson (1984). Errors of Measurement and Standard Setting in Mastery Testing. Applied Psychological Measurement, 8, 107-115. abstract



Isaac I. Bejar (1983). Subject Matter Experts' Assessment of Item Statistics. Applied Psychological Measurement, 7, 303-310. abstract



Henk Blok & Wim E. Saris (1983). Using Longitudinal Data to Estimate Reliability. Applied Psychological Measurement 7, 295-301. abstract



Anne R. Fitzpatrick (1983). The Meaning of Content Validity. Applied Psychological Measurement 7, 3-13. abstract



Ronald K. Hambleton (1983). Application of Item Response Models to Criterion-Referenced Assessment. Applied Psychological Measurement 7, 33-44. abstract



R. A. Weitzman (1982). Sequential Testing for Selection. Applied Psychological Measurement 6, 337-51. abstract



Jo P. M. Pieters & Ad H. G. S. van der Ven (1982). Precision, Speed, and Distraction in Time-Limit Tests. Applied Psychological Measurement 6, 93-103. abstract



Rand R. Wilcox (1981). A Cautionary Note on Estimating the Reliability of a Mastery Test with the Beta-Binomial Model. Applied Psychological Measurement, 5, 531-537. abstract



Lawrence J. Stricker (1981). The Role of Noncognitive Measures in Medical School Admissions. Applied Psychological Measurement 5, 313-323. abstract



Gary B. Forbach & Ronald G. Evans (1981). The Remote Associates Test as a Predictor of Productivity in Brainstorming Groups. Applied Psychological Measurement 5, 333-339. abstract



Susan E. Whitely & Lisa M. Schneider (1981). Information Structure for Geometric Analogies: A Test Theory Approach. Applied Psychological Measurement 5, 383-397. abstract



Robert L. Linn, Michael V. Levine, C. Nicholas Hastings & James L. Wardrop (1981). Item Bias in a Test of Reading Comprehension. Applied Psychological Measurement 5, 159-173. abstract



Ronald K. Hambleton (1980). Contributions to Criterion-Referenced Testing Technology: An Introduction. Applied Psychological Measurement 4, 421-424. abstract



Rand R. Wilcox (1980). Determining the Length of a Criterion-Referenced Test. Applied Psychological Measurement 4, 425-446. abstract



Lorrie Shepard (1980). Standard Setting Issues and Methods. Applied Psychological Measurement 4, 447-467. abstract



Wim J. van der Linden (1980). Decision Models for Use with Criterion-Referenced Tests. Applied Psychological Measurement 4, 469-492. abstract pdf



George B. Macready & C. Mitchell Dayton (1980). The Nature and Use of State Mastery Models. Applied Psychological Measurement 4, 493-516. abstract



Ross E. Traub & Glenn L. Rowley (1980). Reliability of Test Scores and Decisions. Applied Psychological Measurement 4, 517-545. abstract



Robert L. Linn (1980). Issues of Validity for Criterion-Referenced Measure. Applied Psychological Measurement 4, 547-561. abstract



Ronald A. Berk (1980). A Framework for Methodological Advances in Criterion-Referenced Testing. Applied Psychological Measurement 4, 563-573. abstract



Samuel Livingston (1980). Comments on Criterion-Referenced Testing. Applied Psychological Measurement 4, 575-581. abstract



Howard Wainer (1980). A Test of Graphicacy in Children. Applied Psychological Measurement 4, 331-340. abstract



Luis M. Laosa (1980). Measures for the Study of Maternal Teaching Strategies. Applied Psychological Measurement 4, 355-366. abstract



Robert B. Frary (1980). The Effect of Misinformation, Partial Information, and Guessing on Expected Multiple-Choice Test Item Scores. Applied Psychological Measurement 4, 79-90. abstract



Wim J. van der Linden (1979). Binomial Test Models and Item Difficulty. Applied Psychological Measurement 3, 401-411. abstract



D. Magnusson & G. Backteman (1978). Longitudinal Stability of Person Characteristics: Intelligence and Creativity. Applied Psychological Measurement 2, 481-490. abstract



R. R. Schmeck & F. D. Ribich (1978). Construct Validation of the Inventory of Learning Processes. Applied Psychological Measurement 2, 551-562. abstract



Robert T. Keller & Winford E. Holland (1978). A Cross-Validation Study of the Kirton Adaption-Innovation Inventory in Three Research and Development Organizations. Applied Psychological Measurement 2, 563-570. abstract



Wim J. van der Linden & Gideon J. Mellenbergh (1978). Coefficients for Tests from a Decision Theoretic Point of View. Applied Psychological Measurement 2, 119-134. abstract



Wim J. van der Linden & Gideon J. Mellenbergh (1977). Optimal Cutting Scores Using A Linear Loss Function. Applied Psychological Measurement 2, 593-599. abstract



Norman Frederiksen & William C. Ward (1978). Measures for the Study of Creativity in Scientific Problem-Solving. Applied Psychological Measurement 2, 1-24. abstract



Susan E. Whitely (1977). Information-Processing on Intelligence Test Items: Some Response Components. Applied Psychological Measurement 1, 465-476. abstract



Robyn M. Dawes (1977). Suppose We Measured Height With Rating Scales Instead of Rulers. Applied Psychological Measurement 1, 267-273. abstract; pdf






P. W. Van Rijn, T. J. H. M. Eggen, B. T. Hemker & P. F. Sanders (2002). Evaluation of Selection Procedures for Computerized Adaptive Testing with Polytomous Items. Applied Psychological Measurement, 26, 393-411. abstract



Dimiter M. Dimitrov (2007). Least Squares Distance Method of Cognitive Validation and Analysis for Binary Items Using Their Item Response Theory Parameters. Applied Psychological Measurement, 31, 367-387. abstract

I fear this is all tremendously complicated, and completely irrelevant. The theoretical background is stimulus-response theory, which in itself need not be wrong. I have no time to sort this out now.



Donald W. Zimmerman & Richard H. Williams (2003). A New Look at the Influence of Guessing on the Reliability of Multiple-Choice Tests. Applied Psychological Measurement, 27, 357-371. abstract



Theo J. J. M. Eggen & Angela J. Verschoor (2006). Optimal Testing With Easy or Difficult Items in Computerized Adaptive Testing. Applied Psychological Measurement, 30, 379-393. abstract



Wim van der Linden (2006). Equating Error in Observed-Score Equating. Applied Psychological Measurement, 30, 355-378. abstract



Wim van der Linden (2006). Equating Scores From Adaptive to Linear Tests. Applied Psychological Measurement, 30, 493-. abstract



Neil J. Dorans, Jinghua Liu & Shelby Hammond (2008). Anchor Test Type and Population Invariance: An Exploration Across Subpopulations and Test Administrations. Applied Psychological Measurement, 32, 81-97. abstract



Robert L. Brennan (2008). A Discussion of Population Invariance. Applied Psychological Measurement, 32, 102-114. abstract



Qing Yi, Deborah J. Harris & Xiaohong Gao (2008). A Discussion of Population Invariance of Equating. Applied Psychological Measurement, 32, 98-101. abstract

“If the conversions for various subgroups of interest are not comparable or population invariant, then the psychometric implication is that different conversions should be used for different groups. However, in practice, testing programs cannot use different linkings for different groups. In today’s social and political climate, it would be very difficult for a testing program to justify assigning different reported scores to two candidates from different groups who have the same number-correct score on the test. So if the results of population invariance studies show indications of population sensitivity, then great care needs to be taken in selecting a data collection design and a subpopulation (of the total testing population) for use for all item and test analyses and for score equating. And the subpopulation for which score comparability is expected to hold should be specified in the programs’ technical manual. Careful specification of the analysis population used for a test will improve score equity and improve scale stability across test administrations and test forms.”



Qing Yi, Deborah J. Harris & Xiaohong Gao (2008). Invariance of Equating Functions Across Different Subgroups of Examinees Taking a Science Achievement Test. Applied Psychological Measurement, 32, 62-80. abstract



Robert Semmes, Mark L. Davison & Catherine Close (2011). Modeling individual differences in numerical reasoning speed as a random effect of response time limits. Applied Psychological Measurement, 35, 433-446. abstract


With arithmetic tests the question is: are we testing (differences in) arithmetic skill, intelligence, or what? The answer to that question also depends on the time available to take the test: does everyone have ample time to finish the work, or is time so limited that a non-negligible number of participants never get around to properly attempting all the items? Translated to the Dutch situation of the arithmetic tests added to the secondary school exams: does the technology of digital test administration put pupils in a situation of too little time to answer all items properly? If so, a time factor is in play. With digital administration there is a complicated situation that is not the same as limited time for the test as a whole: if an item cannot be solved immediately, the pupil cannot (in the current software used by Cito) return to such an item later. From the reference list:



Wim J. van der Linden (2011). Setting time limits on tests. Applied Psychological Measurement, 35, 183-199. abstract




Jihyun Lee & James Corter (2011). Diagnosis of subtraction bugs using Bayesian networks. Applied Psychological Measurement, 35, 27-47. abstract




Timo M. Bechger, Gunter Maris & Ya Ping Hsiao (2010). Detecting Halo Effects in Performance-Based Examinations. Applied Psychological Measurement, 35, 27-47. abstract




Wim J. van der Linden & Marie Wiberg (2010). Local Observed-Score Equating With Anchor-Test Designs. Applied Psychological Measurement, 35, 27-47. abstract




Robert C. Daniel & Susan E. Embretson (2010). Designing Cognitive Complexity in Mathematical Problem-Solving Items. Applied Psychological Measurement, 35, 27-47. abstract


Susan Embretson & Joanna Gorin (2001). Improving Construct Validity With Cognitive Psychology Principles. Journal of Educational Measurement, 38, 343-368. https://doi.org/10.1111/j.1745-3984.2001.tb01131.x abstract: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1745-3984.2001.tb01131.x pdf: https://smartech.gatech.edu/bitstream/handle/1853/34248/embretson_JEM_2001.pdf?sequence=1



Susan Embretson (Ed.) (2010). Measuring psychological constructs. Advances in model-based approaches. American Psychological Association. site



William W. Cooley & Paul R. Lohnes (1976). Evaluation Research in Education. Irvington Publishers.



William W. Cooley & Paul R. Lohnes (1962). Multivariate Procedures for the Behavioral Sciences. Wiley. Lib. Congress 62-18990.



Daniel H. Robinson, Joel R. Levin, Leslie O'Ryan & Duane Halbur-Ramseyer (2001). Does Statistical Language Constitute a "Significant" Roadblock to Readers' Interpretations of Research Results? Journal of Educational Psychology, 93, 646-654. abstract




AERA, APA & NCME (1999). The Standards for Educational and Psychological Testing. see here - unauthorized summary



Anne E. Magurran & Brian J. McGill (Eds.) (2011). Biological Diversity. Frontiers in measurement and assessment. Oxford University Press.



Alexander W. Wiseman (2010). The uses of evidence for educational policy making: global contexts and international trends. Review of Research in Education, 34, 1-24.



Richard J. Murnane & John B. Willett (2011). Methods Matter. Improving Causal Inference in Educational and Social Science Research. Oxford University Press. [mentioned in blog 11 of the series on realistic arithmetic]



Robert F. Dedrick, John M. Ferron, Melinda R. Hess, Kristine Y. Hogarty, Jeffrey D. Kromrey, Thomas R. Lang, John D. Niles, and Reginald S. Lee (2009). Multilevel Modeling: A Review of Methodological Issues and Applications. Review of Educational Research, 79, 69-102



G. van den Berg (1981). Onderwijskundig onderzoek: twee doelstellingen, één onderzoeksmodel. Pedagogische Studiën, 58, 213-225.



W. Wardekker (1979). Interdisciplinaire onderwijskunde: modellen voor een wetenschap. Pedagogische Studiën, 56, 183-196. Not important.



Theresa Ann Sipe & William L. Curlette (Guest eds.) (1997). A meta-synthesis of factors related to educational achievement: A methodological approach to summarizing and synthesizing. International Journal of Educational Research, 25 #7, 583-698.



Richard P. Phelps (Ed.) (2009). Correcting fallacies about educational and psychological testing. APA. [eBook in KB] [ UBL PEDAG. 64.b.44 ] contents




J. Tinbergen (1936). Grondproblemen der theoretische statistiek. De Erven F. Bohn.



Donald W. Zimmerman (2009). The Reliability of Difference Scores in Populations and Samples. Journal of Educational Measurement, 46, 19-42.



Stephen Gorard (2010). All evidence is equal: the flaw in statistical reasoning. Oxford Review of Education, 36, 63-77.



Stephen B. Broomell & David V. Budescu (2009). Why are experts correlated? Decomposing correlations between judges. Psychometrika, 74, 531-553.



Rolf Haenni (2008). Aggregating referee scores: An algebraic approach. In U. Endriss and P. W. Goldberg: COMSOC'08, 2nd International Workshop on Computational Social Choice, 277-288, 2008. pdf



F. Roels (Ed.) (1928). Cinquième Conférence Internationale de Psychotechnique Tenue à Utrecht. Comptes Rendus. Dekker & v.d. Vegt.



Andrew H. Jazwinski (1970). Stochastic Processes and Filtering Theory. Academic Press. (Applications: measurement procedures, error correction.)



Willem K. B. Hofstee (2009). Promoting intersubjectivity: a recursive-betting model of evaluative judgments. Netherlands Journal of Psychology, 65. abstract


Notes: toetsmodellen.htm#Hofstee_intersubjectivity



Tilmann Gneiting & Adrian E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359-378. pdf



William Meredith (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525-543.




Audrey Amrein-Beardsley (2008). Methodological concerns about the education value-added assessment system. Educational Researcher, 37, 65-75.



Robert E. Slavin (2008). Perspectives on evidence-based research in education. Educational Researcher, 37, 5-14.



Kenneth R. Howe (2009). Epistemology, methodology, and education sciences. Positivist dogmas, rhetoric, and the education science question. Educational Researcher, 38, 428-440. pdf




David J. Bartholomew, Ian J. Deary & Martin Lawn (2009). The origin of factor scores: Spearman, Thomson and Bartlett. British Journal of Mathematical and Statistical Psychology, 62, 569-582. Nice, historical.



A. Shreider (1964). Method of statistical testing. Monte Carlo method. Elsevier.




Keith Morrison (2009). Causation in Educational Research. Routledge.



Thomas D. Cook & Donald T. Campbell (1979). Quasi-Experimentation. Design & Analysis Issues for Field Settings. Rand McNally.



George A. Anastassiou (2010). Probabilistic Inequalities. World Scientific. isbn 9789814280785 981428078X



Stephen Stark, Oleksandr S. Chernyshenko & Fritz Drasgow (2004). Examining the Effects of Differential Item (Functioning and Differential) Test Functioning on Selection Decisions: When Are Statistically Significant Effects Practically Important? Journal of Applied Psychology, 89, 497-508. abstract



Frank L. Schmidt & Ryan D. Zimmerman (2004). A counterintuitive hypothesis about employment interview validity and some supporting evidence. Journal of Applied Psychology, 89, 553-561. abstract



Timothy A. Judge, Amy E. Colbert & Remus Ilies (2004). Intelligence and Leadership: A Quantitative Review and Test of Theoretical Propositions. Journal of Applied Psychology, 89, 542-552.



Frederick L. Oswald, Neal Schmitt, Brian H. Kim, Lauren J. Ramsay, and Michael A. Gillespie (2004). Developing a Biodata Measure and Situational Judgment Inventory as Predictors of College Student Performance. Journal of Applied Psychology, 89, 187-207. pdf



Gilian B. Yeo & Andrew Neal (2004). A Multilevel Analysis of Effort, Practice, and Performance: Effects of Ability, Conscientiousness, and Goal Orientation. Journal of Applied Psychology, 89, 231-247.abstract



Jeff A. Weekley, Frank Blake, Edward J. O'Connor & Lawrence H. Peters (1985). A comparison of three methods of estimating the standard deviation of performance in dollars. Journal of Applied Psychology, 70, 122-126.



Richard R. Reilly & James W. Smither (1985). An Examination of Two Alternative Techniques to Estimate the Standard Deviation of Job Performance in Dollars. Journal of Applied Psychology, 70, 651-661. abstract



Michael J. Burke & James T. Fredrick (1986). A comparison of economic utility estimates for alternative SDy estimation procedures. Journal of Applied Psychology, 71, 334-339.



C-L. C. Kulik & J. A. Kulik (1982). Effects of ability grouping on secondary school students: a meta-analysis of evaluation findings. American Educational Research Journal, 19, 415-428.



J. A. Kulik, R. L. Bangert-Drowns & C-L. C. Kulik (1984). Effectiveness of coaching for aptitude tests. Psychological Bulletin, 95, 179-188.



Robert L. Bangert-Drowns, James A. Kulik & Chen-Lin C. Kulik (1983). Effects of coaching programs on achievement test performance. Review of Educational Research, 53, 571-585. abstract



Robert J. Mislevy (1993). A framework for studying differences between multiple-choice and free-response test items. In Randy Elliot Bennett and William C. Ward (Eds.), Construction versus choice in cognitive measurement (pp. 75-106). Erlbaum.

This really is a miserable definition; Mislevy can do much better than this. It is miserable because it does not exclude anything; anything goes here. Nevertheless, it has appeared in print, and as such it reveals a kind of over-simplifying that tends to be typical of a lot of psychometric work. It seems the thinking goes into the models themselves, not into the situations they supposedly represent.
The educational decisions, by the way, are strictly reserved to institutional representatives. Mislevy does not see students as making their own decisions, whether on the basis of test results or any other information. This, in my opinion, is unprofessional neglect that has somehow come to be regarded as professional - notwithstanding the one chapter in Cronbach and Gleser, 1957, emphasizing the individual decision maker.



Gideon J. Mellenbergh & Wulfert P. van den Brink (1998). The Measurement of Individual Change. Psychological Methods, 3, 470-485. abstract



David J. Weiss and Shannon Von Minden (2011). Measuring Individual Growth With Conventional and Adaptive Tests. Journal of Methods and Measurement in the Social Sciences Vol. 2, No. 1, 80-101. pdf



J. B. Carlin & D. B. Rubin (1991). Summarizing multiple-choice tests using three informative statistics. Psychological Bulletin, 110, 338-349. [beta-binomiaal]



J. G. C. Verheij (1994, submitted for publication). An improved maximum likelihood procedure for estimating the parameters of the beta-binomial distribution. [beta-binomiaal] [Submitted for publication, but I do not know to which journal; googling turns up nothing, 2020] A lot of work went into it. It is of no direct use to me, since I cannot cite it, but reading it through once may be worthwhile; so keep it.



Henry Rouanet (1996). Bayesian Methods for Assessing Importance of Effects. Psychological Bulletin, 119, 149-158. pdf






Michael T. Kane (1992). An Argument-Based Approach to Validity. Psychological Bulletin, 112, 527-535. abstract




Ju-Whei Lee & J. Frank Yates (1992). How Quantity Judgment Changes as the Number of Cues Increases: An Analytical Framework and Review. Psychological Bulletin, 112, 363-377. abstract



Deborah A. Prentice and Dale T. Miller (1992). When Small Effects Are Impressive. Psychological Bulletin, 112, 160-164. pdf



Peter Z. Schochet & Hanley S. Chiang (online first 2012). What Are Error Rates for Classifying Teacher and School Performance Using Value-Added Models? Journal of Educational and Behavioral Statistics. abstract



Lotte Schenk-Danzinger (1953). Entwicklungstests für das Schulalter. I. Teil Altersstufe 5-11 Jahre. Wien: Verlag für Jugend und Volk. A curious book: not psychology but pedagogy. Masses of little tests, everything subjective.



Barbara S. Plake (Ed.) (1984). Social and Technical Issues in Testing. Implications for Test Construction and Usage. Erlbaum. pdf's




Edward F. Alf & Donald D. Dorfman (1967). The classification of individuals into two criterion groups on the basis of a discontinuous payoff function. Psychometrika, 32, 115-123.



Ellen Condliffe Lagemann (2000). An Elusive Science: The Troubling History of Education Research. University of Chicago Press. isbn 0226467724 review short review 1997 article: http://www.jstor.org/discover/10.2307/1176271



F. Allan Hanson (1993). Testing testing. Social consequences of the examined life. University of California Press. online


p. 81: APA statement on selection with the lie detector! If 85% correct, then more candidates are wrongly branded liars than are correctly identified as liars. p. 114: Bentham's panopticon!
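
The arithmetic behind that claim, as a sketch; the 85% accuracy is the figure quoted, the 10% base rate of liars is my own illustrative assumption:

# Base-rate arithmetic for lie detector selection.
candidates = 1000
base_rate = 0.10   # assumed proportion of actual liars (illustrative)
accuracy = 0.85    # quoted: 85% correct for liars and truth-tellers alike

liars = base_rate * candidates             # 100
honest = candidates - liars                # 900
true_positives = accuracy * liars          # 85 liars correctly identified
false_positives = (1 - accuracy) * honest  # 135 honest candidates branded liars
print(true_positives, false_positives)
# False accusations outnumber correct identifications whenever the
# base rate is below 15% (here: 135 versus 85).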



Henry S. Dyer (1972). Recycling the problems in testing. In Proceedings of the 1972 Invitational Conference on Testing Problems. Educational Testing Service site. [not online]


Dyer sketches some perennial abuses around test use, but holds that the educational tests themselves are just fine (p. 87).



Robert M. Guion (1977). Content validity: the source of my discontent. Applied Psychological Measurement, 1, 1-11. abstract




L. Heyerick (1980). Het gebruik van foute woordbeelden in spellingtoetsen. Pedagogische Studiën, 57, 268-272.




Wilco H. M. Emons, Klaas Sijtsma & Rob R. Meijer (2007). On the consistency of individual classification using short scales. Psychological Methods, 12, 105-120. pdf




Cyril Burt (1921/1933). Mental and scholastic tests. London: P. S. King. fourth impression of the original text. archive.org


A voluminous work, covering everything on this subject. So also: little arithmetic tests, with norm tables for different ages. Burt regards the arithmetic tests as attainment tests (scholastic tests), not as part of intelligence tests. But the dividing line here is really unclear. It was the era of general enthusiasm about tests, a novelty after all. The point of it all remains rather in the dark.



William A. Mehrens and Robert L. Ebel (eds) (1967). Principles of educational and psychological measurement. A book of selected readings. Rand McNally. lccc 67-14694




D. N. Jackson & S. Messick (Eds.) (1967). Problems in human assessment. McGraw-Hill. lccc 66-15835




Kenneth A. Bollen (2002). Latent variables in psychology and the social sciences. Annual Review of Psychology, 53, 605-634. abstract




Ronald K. Hambleton & Stephen G. Sireci (1997). Future directions for norm-referenced and criterion-referenced testing. In Lorin W. Anderson (Ed.) (1997). Educational testing and assessment: lessons from the past, directions for the future (p. 379-394). Int. J. Educ. Res., 27, 355-445. pdf




Richard Phelps (Ed.) (2009). Correcting Fallacies about Educational and Psychological Testing [UB Leiden PEDAG. 64.B.44 ZICHTKAST] info




David A. Freedman (2009). Statistical models and causal inference. A dialogue with the social sciences. Cambridge University Press.


Mentioned by Jon Elster in publications on obscurantism in science. Presumably a reader of work by Freedman, edited by David Collier, Jasjeet S. Sekhon & Philip B. Stark. Probably a second edition of the work published in 2005?



Chester W. Harris (Ed.) (1963/1967 2nd). Problems in measuring change. The University of Wisconsin Press.




John M. Gottman and Regina H. Rushe (1993). The Analysis of Change: Issues, Fallacies, and New Ideas. Journal of Consulting and Clinical Psychology, 61, 907-910. pdf




Scott T. Meier (1994). The chronic crisis in psychological measurement and assessment. A historical survey. Academic Press. isbn 0124884407 info




Charles D. Spielberger & Peter R. Vagg (eds) (1995). Test anxiety. Theory, assessment and treatment. Taylor & Francis. isbn 0891162127




Benjamin S. Bloom (Chairman) (1967). Proceedings of the 1967 Invitational Conference on Testing problems. Educational Testing Service.




Ronald A. Berk (Ed.) (1982). Handbook of methods for detecting test bias. Baltimore: The Johns Hopkins University Press. isbn 0801826624.


a.o.: Lorrie Shepard: Definitions of bias. - Janice Dowd Scheuneman: A posteriori analysis of biased items. - Donald Ross Green et al.: Methods used by test publishers to 'debias' standardized tests.



W. K. B. Hofstee (1983). 'Dood hout' kappen. Psychologie, oktober 1983, 8-9.




W. K. B. Hofstee (1980?). Stellingen over docentbeoordeling. [paper for a seminar on this theme]




W. K. B. Hofstee (1995). Beoordelen: wetenschap of kunst? In: KNAW, Verslag van de Verenigde Vergadering van de beide Afdelingen der Akademie, 1995, 15-34.




Willem K.B. Hofstee (2009). It Ain't Necessarily So.


Wim calls this piece (unpublished?) a ‘scientific-autobiographical essay’.



Willem K. B. Hofstee (2002). Kwaliteit van beoordelingen. Conferentie Beoordeling in het kunstbeleid, Kees Kolthoff, Herman Marres, Werkgroep Cultuurbeleid PvdA, 11 juni 2002, Boekmanstichting, Amsterdam. Also published as 'Kwaliteit van beoordelingen in de context van kunstbeleid' in Boekmancahier: kwartaalschrift over kunst, onderzoek en beleid, 2002, 422-431. Online: https://www.boekman.nl/wp-content/uploads/2012/01/54.pdf


My note on the conference paper: the implicit presupposition is that only this one assessment is at stake, whereas in reality the person who has work or applications assessed does so throughout a working life. Can everything said so nicely about the individual assessment also be maintained when it comes to permanent assessment? Surely, from the fact that many are willing to submit to that ever-recurring ritual, one may not conclude that it is a just ritual? Amartya Sen has philosophized about this ['the tamed housewife'].



W. K. B. Hofstee (1964). De methode der onderlinge beoordelingen. NTvdPs 59-78.




W. K. B. Hofstee (1999). Principes van beoordeling: Methodiek en ethiek van selectie, examinering en evaluatie. Lisse: Swets & Zeitlinger. Reviewed by Van der Maessen de Sombreff.




Wim Bloemers (2014). De nieuwe assessment gids [het psychologisch onderzoek]. Een oefenboek. Ambo. isbn 9789026327346




Wim Bloemers (2003). Higher Education Interactive Diagnostic Inventory (HEIDI). The prediction of first year academic success for psychology students at the Open University of the Netherlands (OUNL). Dissertation, University of Rotterdam. pdf pdf


Half of them drop out within three months. So there is something to predict, you would think. Text box: Schouwenburg. Incidentally, I do not understand why in this particular situation one would look for personality characteristics as predictors (on p. 6 Bloemers strikes out a series of variables and is left with personal characteristics). Wim Bloemers will no doubt have realized that himself by now.



Trudy Dehue (2008). De depressie-epidemie. Amsterdam: Uitgeverij Augustus. isbn 9789045700953




Leila Zenderland (1998). Measuring minds. Henry Herbert Goddard and the origins of American intelligence testing. Cambridge University Press. isbn 0521003636




George Rasch (1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press. Expanded edition of the original 1960 text.




Denny Borsboom (2003). Conceptual issues in psychological measurement. Dissertation University of Amsterdam. download




Ivo Molenaar (1972). Dit is een uitdaging. Inaugural lecture.




David Berliner (2014). Exogenous Variables and Value-Added Assessments: A Fatal Flaw. Teachers College Record. webpage




R. F. van Naerssen (1962). Selectie van chauffeurs: onderzoekingen ten behoeve van de selectie van chauffeurs bij de Koninklijke landmacht. Groningen: Wolters. Dissertation, University of Amsterdam.




Nathaniel H. Hartshorne (Ed.) (1978). Educational measurement & the law (3-28). Proceedings of the 1977 ETS invitational conference. pdf




W. J. van der Linden & E. E. Ch. I. Roskam (Red.) (1985). Testtheorie. Special issue, Nederlands Tijdschrift voor de Psychologie en haar Grensgebieden, 40, 379-451. [hardcopy]




W. P. Fisher Jr., & Wright, B. D. (eds) (1994). Applications of probabilistic conjoint measurement. Themanummer International Journal of Educational Research, 21 (6), 557-664.


The editors' introductory article is very illuminating. It all stays within the measurement paradigm, and I do miss some perspective there (the function of tests is not exclusively and not always to measure; other functions are often more important).



Paul Davis Chapman (1988). Schools as sorters. Lewis M. Terman, Applied Psychology, and the Intelligence Testing Movement, 1890-1930. New York: New York University Press. isbn 0814714366




W. A. F. Hepburn, et al. (chair of Mental Survey Committee) (1933). The intelligence of Scottish children. A national survey of an age-group. University of London Press. [not in UB Leiden]




Frederic M. Lord and Melvin R. Novick (1968). Statistical theories of mental test scores. With contributions by A. Birnbaum. London: Addison-Wesley. lccc 68-11394




M. Groen (1967). De voorspelbaarheid van schoolcarrières in het voortgezet onderwijs. Groningen: Wolters. Dissertation, University of Amsterdam, with propositions. [Several longitudinal studies of 1950s cohorts; supervisor A. D. de Groot; uses a.o. Cronbach & Gleser (1965) as its methodological basis]


See e.g. 13.3 (p. 141 ff.), on the value of validity coefficients for practical decisions (cf. Wiegersma, 1964, pp. 36 ff.). A decision-theoretic suitable-unsuitable versus rejected-admitted.



Lee J. Cronbach & Goldine C. Gleser (1957/1965). Psychological tests and personnel decisions. University of Illinois Press.




Edward L. Thorndike (1904). Theory of mental and social measurements. New York: The Science Press. https://archive.org/details/theoryofmentalso00thor [the second revised edition (1916) is also available online https://archive.org/details/anintroductiont00thorgoog]




Edward L. Thorndike, E. O. Bregman, M. V. Cobb, Ella Woodyard and the Staff of the Division of Psychology of the Institute of Educational research of Teachers College, Columbia University (n.d. prob. 1925). The measurement of intelligence. Teachers College Bureau of Publications, Columbia University archive.org




Johnson O'Connor (1934). Psychometrics. A Study of Psychological Measurements. Harvard University Press.



Jum C. Nunnally Jr. (1959). Tests and measurements: Assessment and prediction. McGraw-Hill.




A. D. de Groot (1961). Methodologie. Grondslagen van onderzoek en denken in de gedragswetenschappen. Den Haag: Mouton. The 12th edition, largely identical to that of 1961, is fully available online: http://www.dbnl.org/tekst/groo004meth01_01/




A. H. G. S. van der Ven (). Time-limit tests. Nederlands Tijdschrift voor de Psychologie, 26, 580-591.




W. Molenaar (1974). De logistische en de normale kromme. Nederlands Tijdschrift voor de Psychologie, 29, 415-420.








Frederic M. Lord (1981). Problems arising from the unreliability of the measuring instrument. In Philip H. DuBois and G. Douglas Mayo: Research strategies for evaluating training (79-94). AERA Monograph Series on Curriculum Evaluation. Chicago: Rand McNally.




Ken Alder (2007). Lie detectors. The history of an American obsession. Free Press. isbn 0743259882




Sir W. H. Hadow (chair) (1924). Board of Education. Report of the consultative committee on psychological tests of educable capacity and their possible use in the public system of education. London: His Majesty's Stationery Office. paper The Committee's Report pp. 1-145; Appendices pp. 146-238, a.o. by Cyril Burt. Full text (after August 21 2007) on www.dg.dial.pipex.com/documents/


The report itself is of course historical, but it also contains a probably interesting historical chapter, 'Historical sketch of the development of psychological tests'. I note that it offers a history of tests, but not of assessment/examination.



James W. Pellegrino & Mark Wilson (2015). Assessment of Complex Cognition: Commentary on the Design and Validation of Assessments. Theory into Practice. researchgate.net


Introduction to the issue on assessment of complex cognition.



Paul E. Newton & Stuart D. Shaw (2014). Validity in educational and psychological assessment. Sage. [eBook in KB] info




Paul E. Newton & Stuart D. Shaw (2013). Standards for talking and thinking about validity. Psychological Methods, 18, 301-319. [I have no access] abstract








Samuel Messick (1993). Trait equivalence as construct validity of score interpretation across multiple methods of measurement. In Randy Elliot Bennett and William C. Ward (Eds.), Construction versus choice in cognitive measurement: Issues in constructed response, performance testing, and portfolio assessment. Erlbaum.


Of course, if different instruments intended to measure the same construct do not do so (Stevens and Clauser, 2006, see below), then there is a serious validity problem at hand. This kind of situation does not validate Messick's construct validity approach, however, because absence of conflicting results does not indicate that the intended construct is being validly measured by both instruments. Sounds terribly Popperian.



Joseph J. Stevens and Patricia S. Clauser (www accessed 2006). Multitrait-Multimethod Comparisons of Selected and Constructed Response Assessments of Language Achievement. pdf




Han L. J. van der Maas, Kees-Jan Kan & Denny Borsboom (2014). Intelligence is what the intelligence test measures. Seriously. Journal of Intelligence, 2(1), 12-15. doi:10.3390/jintelligence2010012 pdf open access




Journal of Intelligence open access




Intelligence supports open access




A. Myrick Freeman III (1993). The measurement of environmental and resource values. Theory and methods. Washington, D.C.: Resources for the Future. isbn 0915707691




A. W. F. Edwards (1972). Likelihood. An account of the statistical concept of likelihood and its application to scientific inference. Cambridge: Cambridge University Press. isbn 0521082994


My item-pool concept fits the view of this Edwards excellently. A direct analogy for my item pool is the bag of red and white marbles from which no more than a few handfuls may and can be drawn, while inferences are nevertheless made about the ratio of red to white marbles in the bag. This marble analogy seems very useful for presenting the idea of the item pool as the core of my approach.

Edwards' position means that true-score models are seldom justified in psychology, at least when attempts are made to estimate supposed true scores as well as possible (both classical true-score theory and latent trait models could be 'misused' in that way). Note the distinction Edwards draws between model and hypothesis: a model is, for example, the beta-binomial model for test scores; the hypothesis could be that the parameters of the generating beta distribution have certain 'true' values. See also Edwards, chapter 4, where all this is worked out further in the context of Bayes' theorem. The topic is very important to me, because my whole approach is steeped in implicit assumptions that have long been the subject of bitter struggle in the statistical world. Perhaps I should avoid the term 'true mastery'. What other options are there? 'Assumed mastery' is a nice candidate in the context of simulations: make a particular assumption about a state of affairs, derive from a specified model what consequences that leads to in the world, and compare that with empirical observations. That might well be an approach that Fisher's likelihood suits eminently! Here too, both a particular model and the values its parameters take are at issue. That can lead to confusion, so I must create clarity about it from the outset.
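
A minimal sketch of this likelihood view applied to the marble analogy above (my illustration, not Edwards'): the model is binomial sampling from the bag, a hypothesis is a value for the proportion of red marbles, and the likelihood function ranks hypotheses by how well they account for the handfuls actually drawn.

from math import comb

def binomial_likelihood(p, red, draws):
    # Likelihood of the hypothesis 'proportion red = p', given the draws.
    return comb(draws, red) * p**red * (1 - p)**(draws - red)

# A few handfuls from the bag: 30 marbles drawn, 12 of them red.
draws, red = 30, 12
for p in (0.2, 0.3, 0.4, 0.5, 0.6):
    print(p, binomial_likelihood(p, red, draws))
# The likelihood peaks near p = 12/30 = 0.4; the ratios between these
# values, not their absolute sizes, carry the inferential weight.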



Paul G. Hoel (1962). Introduction to mathematical statistics. Wiley. LCCC 62-18992




Herbert Hoijtink en Klaas Sijtsma (2009). Meten Onder Druk. Advies aan de CEVO Inzake de Normering van Eindexamens Voortgezet Onderwijs. pdf




Fritz Drasgow (Ed.) (2016). Technology and testing. Improving Educational and Psychological Measurement. NCME. (also: Routledge). info


Very, very technocratic. In the service of big business.



Neil J. Dorans, Linda L. Cook (Eds.) (2016). Fairness in Educational Assessment and Measurement. Routledge. [PEDAG.51.e.93] info




Henry Braun (Ed.) (2016). Meeting the Challenges to Measurement in an Era of Accountability. Routledge. info




Randy Elliot Bennett, Marc M. Sebrechts & Donald A. Rock (1991). Expert-System Scores for Complex Constructed-Response Quantitative Items: A Study of Convergent Validity. Applied Psychological Measurement, 15, 227-239. abstract




R. E. Bennett, D. A. Rock & M. Wang (1991). Equivalence of free-response and multiple-choice items. Journal of Educational Measurement, 28, 77-92.




R. E. Bennett, D. A. Rock, H. I. Braun, D. Frye, J. C. Spohrer & E. Soloway (1990). The relationship of expert-system scored constrained free-response items to multiple-choice and open-ended items. Applied Psychological Measurement, 14, 151-162. abstract




Randy Elliot Bennett, James Braswell, Andreas Oranje, Brent Sandene, Bruce Kaplan, & Fred Yan (2008). Does it Matter if I Take My Mathematics Test on Computer? A Second Empirical Study of Mode Effects in NAEP. open access




Suzanne Lane, Mark R. Raymond & Thomas M. Haladyna (Eds.) (2016). Handbook of test development. Routledge. info




Linda Darling-Hammond, Frank Adamson (Eds.) (2014). Beyond the Bubble Test: How Performance Assessments Support 21st Century Learning. Jossey-Bass. [KB eBook] info



R. E. Millsap, D. M. Bolt, A. L. van der Ark & W. C. Wang (Eds.). Quantitative Psychology Research. Springer. [eBook in KB] preview


Looks like an outlet for publishing.



ETS Research Report Series web




J. Maynard Smith (1974). Models in ecology. Cambridge University Press. Keep. Han van der Maas a.o. use such models to explain 'g'.




James S. Coleman (1990). Foundations of social theory. Cambridge, Massachusetts: The Belknap Press of Harvard University Press. isbn 0674312252








P. J. van Strien (Red.) (1976). Personeelsselectie in discussie. Meppel: Boom. isbn 9060092201




Robert Coe (2002). It's the Effect Size, Stupid. What effect size is and why it is important. webpage




Saskia Wools (2015). All About Validity. An evaluation system for the quality of educational assessment. Dissertation Twente University researchgate.net




Carol Burris (2016). A master teacher went to court to challenge her low evaluation. What her win means for her profession. post by Valerie Strauss


VAM: value-added measurement.



American Statistical Association (2014). ASA Statement on Using Value-Added Models for Educational Assessment. pdf


http://nycpublicschoolparents.blogspot.nl/2016/05/breaking-lederman-decision-and-gallup.html via @dylanwiliam



Adaptive comparative judgement wiki




James T. Austin & Peter Villanova (1992). The criterion problem: 1917-1992. Journal of Applied Psychology, 77, 836-874. pdf




Susan Athey & Guido W. Imbens (July 2016) The State of Applied Econometrics - Causality and Policy Evaluation paper




Henry Braun (Eds.) (2016). Meeting the Challenges to Measurement in an Era of Accountability. Routledge. info




Samuel A. Livingston & Marilyn S. Wingersky (1979). Assessing the reliability of tests used to make pass/fail decisions. Journal of Educational Measurement, 16, 247-260. 10.1111/j.1745-3984.1979.tb00106.x preview


This is the first article, as far as I can see, in which utility functions are brought into thinking about reliability. And that is of course where it has to go. Another small point: I suspect that in the formulas they use, in which a mastery score assumed to be known also figures, either that mastery score is mathematically superfluous, or it has to be coherent with the specified utility function.
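
A compact sketch of what that coherence amounts to (my own construction, in the spirit of the linear-loss cutting scores of van der Linden & Mellenbergh, 1977, listed above, not Livingston & Wingersky's formulas): once a loss function and a posterior for true mastery given the observed score are specified, the pass/fail cutoff follows; a separately postulated mastery score is then either redundant or potentially incoherent. The uniform prior, the threshold 0.7, and the equal loss weights are illustrative assumptions.

def expected_losses(x, n, t0=0.7, grid=2001):
    # Posterior for true mastery t, given x correct out of n items,
    # under a uniform prior: Beta(x + 1, n - x + 1), on a discrete grid.
    ts = [i / (grid - 1) for i in range(grid)]
    dens = [t**x * (1 - t)**(n - x) for t in ts]
    z = sum(dens)
    post = [d / z for d in dens]
    # Linear losses around the mastery threshold t0.
    e_pass = sum(p * max(0.0, t0 - t) for p, t in zip(post, ts))  # passing a non-master
    e_fail = sum(p * max(0.0, t - t0) for p, t in zip(post, ts))  # failing a master
    return e_pass, e_fail

n = 20
for x in range(n + 1):
    e_pass, e_fail = expected_losses(x, n)
    if e_pass < e_fail:                    # passing now has the lower expected loss
        print("lowest passing score:", x)  # 15 with these settings
        break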



Fons van de Vijver (Ed.) (2001). Deskundigen over het testen van etnische minderheden. pdf




W. H. P. den Brinker (May 2002). Validering Nederlandse Differentiatie Testserie 2001. SCO-Kohnstamm Instituut.




I. W. Molenaar (1980). An insurance policy against unexpected data (second version). Heymans Bulletins Psychologische Instituten R.U. Groningen, HB-79-447-EX. [unpublished] [hardcopy] [beta-binomial]



W. Molenaar (1973). Simple approximations to the Poisson, binomial and hypergeometric distributions. Biometrics, 29, 403-407. 10.2307/2529405 JSTOR read online [beta-binomial]



Denny Borsboom, Gideon J. Mellenbergh and Jaap van Heerden (2004). The concept of validity. Psychological Review, 111, 1061-1071. pdf abstract


A fundamental discussion of the concept of validity. It criticizes the idea of construct validity as strongly advocated by Samuel Messick, and tries to establish validity as theory-based measurement, analogous to measurement in the physical sciences. The example mentioned is the testing of psychological concepts and development on the basis of Piagetian developmental psychology.



Jorg Huijding, Bas Hemker & Remko van den Berg (2012). Verantwoord en fair testgebruik. Welke rol heeft de Cotan? De Psycholoog, april, 47-52. closed archive




Peter C. M. Molenaar (2004). A Manifesto on Psychology as Idiographic Science: Bringing the Person Back Into Scientific Psychology, This Time Forever. Measurement, 2, 201-218. pdf




Steven J. Howard, Stuart Woodcock, John Ehrich and Sahar Bokosmaty (2016). What are standardized literacy and numeracy tests testing? Evidence of the domain-general contributions to students' standardized educational test performance. British Journal of Educational Psychology.


The results are of limited value, because of the small number of subjects (N=91) and their age (grade 2 pupils). More interesting might be the theoretical framework.



W. James Popham (1999). Why Standardized Tests Don't Measure Educational Quality. Educational Leadership, 56, 8-15. webpage



S. J. Howard et al. (2015). Behavioral and fMRI evidence of the differing cognitive load of domain-specific assessments. open access




Lynn S. Fuchs et al. (2012). Contributions of Domain-General Cognitive Resources and Different Forms of Arithmetic Development to Pre-Algebraic Knowledge. Developmental Psychology. pdf




Alvan R. Feinstein (1967). Clinical judgment. [UBL]


Ferdinand Mertens mentions it in his https://didactiefonline.nl/blog/blonz/de-centrale-eindtoets-en-het-schooladvies. I am doing nothing with it: it lies entirely outside psychology.



John P. A. Ioannidis, T. D. Stanley & Hristos Doucouliagos (2017). The power of bias in economics research. The Economic Journal. pdf




Jay P. Heubert & Robert M. Hauser (Eds.); Committee on Appropriate Test Use, National Research Council (1999). High Stakes: Testing for Tracking, Promotion, and Graduation. National Academies Press. isbn 0309524954 [352 pp] free pdf: http://www.nap.edu/catalog/6336.html








de Leeuw, J., & Meester, A. C. (1984). Over het intelligentie-onderzoek bij de militaire keuringen vanaf 1925 tot heden. Mens en Maatschappij, 59, 5-26. [Flynn effect] pdf


[found in gahetna: 93, Handleiding ten dienste van het onderzoek naar de Algemeene praktische intelligentie bij de keuringsraden, 1925, 1 volume (no scans available yet)]



H. Y. Groenewegen Jr (1926). Het onderzoek naar het algemene praktische intelligentie bij de keuringsraden in 1925. De Militaire Spectator, 95, 634-645. [hardcopy map diff]




J. Stroomberg (1925). De beteekenis der psychotechniek voor het bedrijf. Dissertation H.H. Rotterdam. Reprinted in Mededeelingen uit het Psychologisch Laboratorium der Rijksuniversiteit te Utrecht, 1926.




Eric Haas (1995). Op de juiste plaats. De opkomst van de bedrijfs- en schoolpsychologische beroepspraktijk in Nederland. Hilversum: Verloren. isbn 9065504222




Richard Phelps (Ed.) (2005). Defending Standardized Testing. Taylor and Francis. [eBook in KB]




Collection: Ofqual's reliability research page




James H. Capshew (1999). Psychologists on the march. Science, practice, and professional identity in America, 1929-1969. Cambridge UP. isbn 0521565855 Ch 4. Sorting soldiers, psychology as personnel management


WWII, and also WWI (Army Alpha) of course.



Hickendorff, M., Edelsbrunner, P. A., Schneider, M., Trezise, K., & McMullen, J. (in press). Informative tools for characterizing individual differences in learning: Latent class, latent profile, and latent transition analyses. Learning and Individual Differences. doi:10.1016/j.lindif.2017.11.001 preprint


Theo J. H. M. Eggen (2004). Contributions to the theory and practice of computerized adaptive testing. Dissertation Twente. isbn 9058340569 pdf




Sanders, P.F., Brouwer, A.J., & Veldkamp, B.P. (2017). Onderzoek naar de inhoudsvaliditeit van de centrale examens en de afhandeling van onvolkomenheden bij de centrale examens. RCEC. download


A disappointing report, because it is not research into validity but into procedures and whether they are followed. How can that be? Presumably the CvTE formulated the commission in those terms, and the RCEC accepted it in those terms. This way the Netherlands will never get out of its examination problems.



Kadriye Ercikan & James W. Pellegrino (Eds.) (2017). Validation of Score Meaning for the Next Generation of Assessments: The Use of Response Processes. Routledge. [not yet found]




Craig Hedge, Georgina Powell & Petroc Sumner (2018). The reliability paradox: Why robust cognitive tasks do not produce reliable individual differences. Behavior Research Methods. open access




Michael Kane & Brent Bridgeman (2017). Research on Validity Theory and Practice at ETS. In R. E. Bennett & M. von Davier (Eds.), Advancing human assessment: The methodological, psychological and policy contributions of ETS (pp. 489-552). Springer. open access




Isaac I. Bejar (2017). A Historical Survey of Research Regarding Constructed-Response Formats. In R. E. Bennett & M. von Davier (Eds.), Advancing human assessment: The methodological, psychological and policy contributions of ETS (pp. 565-633). Springer. open access




Norman Frederiksen, Robert J. Mislevy & Isaac I. Bejar (Eds.) (1993). Test theory for a new generation of tests. Erlbaum. isbn 0805805931


Reviewed by Michael J. Kane in Applied Psychological Measurement 1993, 17, 389-391. The emphasis is on incorporating developments from cognitive psychology, which leads to criticism of the psychometric model. See also Contemporary Psychology 1994, 39, p. 425. A rather theoretical book; many of the authors are at ETS.

Snow, Richard E., & Lohman, David F.: Cognitive psychology, new test design, and new test theory: an introduction. 1-18. An essentially academic approach: measuring rather than steering.

Mislevy, Robert J.: Foundations of a new test theory. 19-40. p. 19: "It is only a slight exaggeration to describe the test theory that dominates educational measurement today as the application of 20th century statistics to 19th century psychology. Sophisticated estimation procedures, new techniques for missing-data problems, and theoretical advances into latent-variable modeling have appeared - all applied with psychological models that explain problem-solving ability in terms of a single, continuous variable. This caricature suffices for many practical prediction and selection problems because it expresses patterns in data that are pertinent to the decisions that must be made. It falls short for placement and instruction problems based on students' internal representations of systems, problem-solving strategies, or reconfigurations of knowledge as they learn. Such applications demand different caricatures of ability - more realistic ones that can express patterns suggested by recent developments in cognitive and educational psychology. The application of modern methods with modern psychological models constitutes the foundation of a new test theory." The problem with that approach, then, is that the measurement paradigm itself is never called into question.

p. 22: "Charles Spearman (...) is generally credited with the central idea of classical test theory (CTT): a test score can be viewed as the sum of two components, a true score and a random error term." That is remarkable, because this formulation involves no comparison of the scores of different persons. An omission? Further on, Mislevy at any rate takes it for granted that there are always multiple scores from different persons.

p. 23: "Like classical test theory, IRT concerns examinees' overall proficiency in a domain of tasks. Whereas CTT makes no statement about the genesis of performance, IRT posits a single, unobservable, proficiency variable." p. 24: "At the heart of IRT is a mathematical model for the probability that a given person will respond correctly to a given item, a function of that person's proficiency parameter and one or more parameters for the item. The item's parameters express properties such as difficulty or sensitivity to proficiency. The item response, rather than the test score, is the fundamental unit of observation. If an IRT model holds, responses to any subset of items support inferences on the same scale of measurement."

p. 26: "Measuring learning is one application where IRT models can fail, because their characterization is complete for only a highly constrained type of change: an examinee's chances of success on all items must increase or decrease by exactly the same amount (in an appropriate metric). A single IRT model applied to pretest and posttest data cannot reveal how different students learn different topics to different degrees - patterns that could be at the crux of an instructional decision." I do not quite follow this: is it merely a criticism of IRT because Mislevy wants to move toward those instructional models?

p. 28: "The challenge to education is to discover what experiences help a learner with a given configuration of propositions, skills, and connections to reconfigure that knowledge into a more powerful arrangement. Vosniadou and Brewer (1987) point to Socratic dialogue and analogy as mechanisms that facilitate such learning. To apply them effectively, one must take into account not simply target configurations, such as the experts' model, but the individual learners' current configurations. The challenge to test theory is to provide models and methods to assess knowledge, and to guide instruction, as seen in this new light." Well, that is quite something. It strikes me as a megalomaniac conception of what test theory could be. Such a test theory gets in the way of education, much as the package of Cito tests in the 'basisvorming' hinders education rather than helps it. And Mislevy remains caught in the measurement paradigm. One could say that Mislevy does not see the student at all, and therefore cannot give the student's own responsibility any place in his test-theory scenario. His test-theory ideal is even more absolutely authoritarian than CTT, because he imagines the student being surrendered to it for an even larger share of school time.

Lohman, David F., & Ippel, Martin J.: Cognitive diagnosis: from statistically based assessment toward theory-based assessment. 41-71.

Thissen, David: Repealing rules that no longer apply to psychological measurement. 79-98.

Bennett, Randy Elliot: Toward intelligent assessment: an integration of constructed-response testing, artificial intelligence, and model-based measurement. 99-124. "This chapter presented a conceptualization of intelligent assessment as an integration of constructed-response testing, scoring methods based on artificial intelligence, and cognitively motivated measurement models. To illustrate progress toward this conception, two intelligent scoring systems - Micro-PROUST and GIDE - and two measurement models - HOST and HYBRID - were described. It is worth emphasizing that these approaches take particular perspectives, especially scoring systems, which derive from the same theoretical base. Other approaches to both scoring and response modelling exist, and it is likely to be some time before any individual method becomes generally accepted. Second, it should be evident that many unresolved issues are associated with intelligent assessment. The development of even the least ambitious realization implies a considerable effort - in domain understanding and knowledge-base development, item writing, scoring rules, feedback content and processes, programming, pilot testing, and validation research, among other things - with no certainty that the result will prove substantially better than current testing approaches."

Embretson, Susan: Psychometric models for learning and cognitive processes. 125-150. Not relevant for me. She has a 1991 Psychometrika article: A multidimensional item response model for learning processes. Psychometrika, 56, 495-515.

Marshall, Sandra P.: Assessing schema knowledge. 155-180. Quotation from the summary (see file tvr): "This chapter described how schema knowledge may be assessed in individuals, and it has outlined an approach for making the assessment. The results obtained thus far using schema theory are encouraging. Test items can be successfully parsed by using the components of schema knowledge. Student profiles of schema knowledge can be developed on the basis of their performance on collections of these items. Validity of the schema approach is demonstrated by comparing test performance with interview data." In any case not self-evident enough to exploit in a book on writing test items for teachers.

Feltovich, Paul J., Spiro, Rand J., & Coulson, Richard L.: Learning, teaching, and testing for complex conceptual understanding. 181-218. This is interesting, and has practical relevance. fc

Masters, Geoffrey N., & Mislevy, Robert J.: New views on student learning: implications for educational measurement. 219-242. A theoretical piece; I do not immediately see possibilities for application. Mental-models approach.

Gitomer, Drew H., & Rock, Don: Addressing process variables in test analysis. 243-268.

Yamamoto, Kentaro, & Gitomer, Drew H.: Application of a HYBRID model to a test of cognitive skill representation. 275-296.

Carroll, John B.: Test theory and the behavioral scaling of test performance. 297-322.

Haertel, Edward H., & Wiley, David E.: Representations of ability structures: implications for ability testing. 359-384.

p. IX (Robert J. Mislevy: Introduction): "We would concur with Gulliksen [Gulliksen, H. (1961). Measurement of learning and mental abilities. Psychometrika, 26, 93-107] that the heart of test theory is connecting what we can observe with a more general, inherently unobservable, conception of what a student knows or can do. This is essentially a statistical problem - given a framework in which this conception is to be erected. The framework implicit in Gulliksen's description, and throughout the papers and books listed [Edgeworth, Spearman], is that of a measure of a quantity he calls ability. (...) Useful as the ability level paradigm has proven in large-scale selection and prediction problems, it represents but one of many possible perspectives. Its legacy includes, most obviously, a collection of testing practices and statistical techniques suited to educational questions cast in its terms. More subtly, yet more profoundly, it has defined the universe of discourse within which discussions of educational options take place, at both the instructional and policy-making levels. Unrecognized assumptions underlie analyses about what should be tested, how outcomes might be used, and how effectiveness of the effort should be evaluated. What is missing from the conceptualization upon which standard test theory is based are models for just how people know what they know and do what they can do, and the ways in which they increase these capacities. The impetus for an examination of the foundations of test theory arises from psychologists and educators, who, as they extend the frontiers of understanding about human cognition, ask questions that fall outside the universe of discourse that standard test theory generates."
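A minimal formal rendering of the two models Mislevy contrasts may be useful here; the two-parameter logistic form is my own choice of illustration, the chapter itself does not commit to that parameterization:

    X = T + E, \qquad \operatorname{Cov}(T, E) = 0 \qquad \text{(CTT: observed score = true score + random error)}

    P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp\{-a_j(\theta_i - b_j)\}} \qquad \text{(IRT: response probability as a function of proficiency } \theta_i \text{ and item parameters } a_j, b_j\text{)}

This also makes the p. 26 objection concrete: in the Rasch case (all a_j equal) learning can only shift every success probability through one and the same change in \theta_i, which is exactly the 'highly constrained type of change' Mislevy points to.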



Lewis R. Goldberg (1970). Man versus model of man: A rationale, plus some evidence, for a method of improving on clinical inferences. Psychological Bulletin, 73, 422-432.


This is a key publication where the objectivity principle is concerned. It deals with clinical diagnosis: MMPI profiles of 861 patients who were clearly either psychotic or neurotic. Clinical judgments based on MMPI profiles can thus be tied to this clear criterion, so what is at stake is the validity of the judgments (which in itself is much broader than objectivity, of course). The trick is that from a number of such judgments by a given judge, a statistical model of that judge can be specified. The judge can then be compared against his own model, and the model turns out to do better. Put enough judges to work on the same judgment tasks, though, and the trick no longer works (in this experiment the 29 judges together were better, or at least no worse, than the model based on those 29). It may be assumed that this classic publication is discussed at length in the later literature on models of man, the lens model, and the like.
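The 'model of man' trick lends itself to a minimal simulation; everything below (the simulated profiles, the noisily executed linear judgment policy, the hit-rate comparison) is an illustrative assumption of mine, not a reconstruction of Goldberg's data.

    # Sketch: fit a linear "model of the judge" to the judge's own diagnoses,
    # then compare judge and model against the criterion. All data simulated.
    import numpy as np

    rng = np.random.default_rng(1)
    n, k = 861, 11                                  # cases, MMPI-like scale scores
    profiles = rng.normal(size=(n, k))
    w_true = rng.normal(size=k)                     # valid cue weights
    criterion = profiles @ w_true + rng.normal(size=n) > 0  # psychotic vs neurotic

    # A judge with a partly valid policy, executed inconsistently:
    w_judge = w_true + 0.7 * rng.normal(size=k)
    judge_calls = profiles @ w_judge + 2.0 * rng.normal(size=n) > 0

    # "Model of man": least squares on the judge's own calls, applied consistently.
    X = np.column_stack([np.ones(n), profiles])
    b, *_ = np.linalg.lstsq(X, judge_calls.astype(float), rcond=None)
    model_calls = X @ b > 0.5

    print('judge agrees with criterion:', (judge_calls == criterion).mean())
    print('model of judge agrees:      ', (model_calls == criterion).mean())

The model tends to beat the judge because it applies the judge's own cue weights without the trial-to-trial inconsistency; pooling many judges removes much of that execution noise as well, which is the point of the 29-judge comparison above.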



Harold Gulliksen (1986). Perspective on educational measurement. Applied Psychological Measurement, 10, 109-132. pdf


p. 109: "It is the purpose of this paper to make two recommendations related to testing: (1) that teacher training institutions should emphasize and expand the teaching of test construction for classroom use, and (2) that makers of academic tests, both standardized and individualized, should develop tests on the basis of valid, reliable, and unbiased performance criteria, as in occupational and professional testing programs. [...] The failure to distinguish between the requirements of standardized testing and classroom testing seems to be responsible for the lack of improvement - and perhaps even a decline - in the quality of teacher-made classroom tests over the last 40 years."

Test construction - the University of Chicago Board of Examinations. About 1930, President Robert Maynard Hutchins introduced an examination system at the University of Chicago. The procedures developed by the Chicago Board of Examinations during the 1930s for the first two years of college are also applicable at lower grade levels. The curriculum for the freshman and sophomore years consisted of 5 one-year courses in biological science, physical science, social science, humanities, and English. Passing each of these courses required successful completion of a six-hour exam. Initially in 1930, Louis L. Thurstone was appointed chief examiner; Marion Richardson was examiner in physical sciences, James Thomas Russell in biological sciences, and John Stalnaker in humanities and English (Russell and Stalnaker were examiners from 1931-1936). I [Gulliksen] was examiner in social sciences from 1934 to 1940. Others later associated with the examining office were Dael Lee Wolfle (biological sciences), George Frederic Kuder, Dorothy Adkins, and Ben Bloom. In the late 1930s, Ralph Tyler replaced Thurstone as chief examiner. It should be noted that in 1947, after Dorothy Adkins had gone to work for the Civil Service Commission, she and others prepared a very good book (Adkins et al., 1947) on constructing objective and performance tests.

Test security versus disclosure. One of the first rules Thurstone established was: "The day after an exam is given it goes on sale in the University of Chicago Bookstore." The issue of "test security" versus "disclosure" has now become a legal issue. In general, for a given instructor's testing of his or her own classes, maintenance of "secure" items (e.g., items used on previous tests where difficulty and correlation with total test score are known, but have not become available to students) is an inappropriate policy. Students should be informed as fully as possible regarding the course requirements, the nature of the tests, and the skills and knowledge they are expected to have gained as a result of taking the course. Making previous exams available to both present and future students is one way to achieve this objective. Also, making previous exams available prevents special advantages from accruing to certain groups (e.g., fraternities, sororities, special tutors, or coaching schools) that will, from time to time, be able to obtain access to test material that the instructor is attempting to keep secure. When instructors do not depend on secure items for equating of tests or grades from year to year, then it is necessary to depend on instructor judgment regarding similar difficulty of parallel items.

The problem of equating grades and tests that show improvement (increased scores) from one year to another also taxes the instructor's judgment, as in cases where certain material is missed by many students one year and special attention is paid to ensure that students learn the material in subsequent years. When such improvement (increase) in scores occurs, the instructors must decide the extent to which grades will be increased to reflect this improvement versus raising of standards. Of course, a corresponding decision in the reverse direction is necessary whenever student performance declines.



Royce R. Ronning, Jane C. Conoley, John A. Glover, and Joseph C. Witt (Eds.) (1987). The influence of cognitive psychology on testing. Buros-Nebraska Symposium on Measurement and Testing. Volume 3. Erlbaum. isbn 0898598982 open access



The chapters of all volumes of this series are available online as PDFs: http://digitalcommons.unl.edu/buroscogpsych/



Earl Hunt (1974): Quote the Raven? Nevermore! pp 129-158 in Lee W. Gregg (Ed.) (1974). Knowledge and Cognition. Erlbaum. goo.gl/a63ThD




Alfred Binet & Théodore Simon (1916/1973 reprint). The development of intelligence in children. (The Binet-Simon Scale). Translated by Elizabeth S. Kite. Reprint: New York Arno Press. isbn 0405051350 https://archive.org/details/developmentofint00bineuoft




R. W. van der Giessen (1957). Enkele aspecten van het probleem der predictie in de psychologie, speciaal met het oog op de selectie van militair personeel. Swets en Zeitlinger. Dissertation VU, propositions


A good overview of this field, in the Netherlands but also in the UK and the US.



Abe D. Hofman, Brenda R. J. Jansen, Susanne M. M. de Mooij, Claire E. Stevenson and Han L. J. van der Maas (2018). A Solution to the Measurement Problem in the Idiographic Approach Using Computer Adaptive Practicing. Journal of Intelligence, 6(1), 14. open








Janne Adolf, Noémi K. Schuurman, Peter Borkenau, Denny Borsboom and Conor V. Dolan (2014). Measurement invariance within and between individuals: a distinct problem in testing the equivalence of intra- and inter-individual model structures. Frontiers in Psychology, 19 September. https://doi.org/10.3389/fpsyg.2014.00883 open




Theo J.H.M. Eggen and Bernard P. Veldkamp (Editors) (2012). Psychometrics in Practice at RCEC. academia.edu




Denny Borsboom, Gideon J. Mellenbergh, and Jaap van Heerden (2003). The Theoretical Status of Latent Variables. Psychological Review, 110, 203-219. pdf




Thomas M. Haladyna & Steven M. Downing (2004). Construct-irrelevant Variance in High-Stakes Testing. Educational Measurement: Issues and Practice




Ross E. Traub (Winter 1997). Classical test theory in historical perspective. Educational Measurement: Issues and Practice, 8-14. pdf




Neil J. Dorans (2012). The Contestant Perspective on Taking Tests: Emanations From the Statue Within. Educational Measurement: Issues and Practice, December. https://doi.org/10.1111/j.1745-3992.2012.00250.x




R. Darrell Bock (1997). A brief history of item response theory. Educational Measurement: Issues and Practice, 16(4), 21-33. 10.1111/j.1745-3992.1997.tb00605.x abstract & scihub pdf




Co van Calcar & Bert Tellegen (1967). Gedragsbeoordeling en prestatie. Enschede: Pedagogisch Centrum. [report; my copy may well be very rare]


Views of first-grade teachers on, among other things, weak readers.



Walt Haney (1984). Testing reasoning and reasoning about testing. Review of Educational Research, 54, 597-654. abstract & scihub




Wim Pesch & Albert Ponsioen (2004). Flinterdunne en flagrante Flynn-effecten bij licht verstandelijk gehandicapte kinderen. Aanbevelingen voor het gebruik van de WISC-III. De Psycholoog.




Robert J. Sternberg (1981). Testing and cognitive psychology. AP, 36, 1181-1189. abstract




Sandra Scarr (1981). Testing for children. Assessment and the many determinants of intellectual competence. AP, 36, 1159-1166. abstract




Barbara Lerner (1981). The minimum competence testing movement. Social, scientific, and legal implications. AP, 36 abstract




Daniel J. Reschly (1981). Psychological testing in educational classification and placement. AP, 36, 1094-1102. abstract




Sternberg, R. J., Wagner, R. K., Williams, W. M., & Horvath, J. A. (1995). Testing common sense. American Psychologist, 50(11), 912-927. https://doi.org/10.1037/0003-066X.50.11.912 abstract




Riet van Bork (2019). Interpreting psychometric models. Dissertation UvA (Denny Borsboom). read online




Wim J. van der Linden (2005). Classical test theory. In K. Kempf-Leonard (Ed.), Encyclopaedia of social measurement (Vol. 1) (pp. 301‑307). Academic Press.




Randy Bennett & Matthias von Davier (Eds.) (2017). Advancing Human Assessment: The Methodological, Psychological and Policy Contributions of ETS. Springer. open access




Matthew J. Salganik and many others (2020). Measuring the predictability of life outcomes with a scientific mass collaboration. PNAS, 117(15), 8398-8403. www.pnas.org/cgi/doi/10.1073/pnas.1915006117 open access; via Paige Harden, 'The genetic lottery'




Denny Borsboom, Jan-Willem Romeijn and Jelte M. Wicherts (2008). Measurement invariance versus selection invariance: Is fair selection possible? Psychological Methods, 13, 75-98 pdf




Andreas Demetriou, Smaragda Kazi, George Spanoudis, Nikolaos Makris (2019). Predicting school performance from cognitive ability, self-representation, and personality from primary school to senior high school. Intelligence open




Maarten Marsman & Mijke Rhemtulla (2022). Guest editors' introduction to the special issue 'Network Psychometrics in Action': Methodological innovations inspired by practical problems. Psychometrika. https://link.springer.com/content/pdf/10.1007/s11336-022-09861-x.pdf open access




Denny Borsboom (25 March 2022). Possible futures for network psychometrics. Psychometrika. open access




Alexander O. Savi, Maarten Marsman, Han L. J. van der Maas, Gunter K. J. Maris (2019). The Wiring of Intelligence. Perspectives on Psychological Science. open access & PsyArXiv Preprints




Lisa D. Wijsen, Denny Borsboom, Anna Alexandrova (2022). Values in Psychometrics. Perspectives on Psychological Science. open access




David C. Geary (2022). Spatial ability as a distinct domain of human cognition: An evolutionary perspective. abstract




Phelps, R. P. (Ed.). (2009). Correcting fallacies about educational and psychological testing. American Psychological Association. info; Supplementary materials to the book here [chapters not available online]




Susan E. Embretson. The Second Century of Ability Testing: Some Predictions and Speculations. ETS.




Susan Embretson (2014). Additive Multilevel Item Structure Models with Random Residuals: Item Modeling for Explanation and Item Generation. Psychometrika, 79, 84-104. pdf via: academia.edu









November 2022 \ contact ben at at at benwilbrink.nl


http://www.benwilbrink.nl/literature/testpsychologie.htm http://goo.gl/BFQfp

