Literature on test psychology (psychometrics, methodology)


Ben Wilbrink

See also projecten/raden.htm

See also projecten/geheimhouding.htm




Peter Herriot (Ed.) (1989). Assessment and selection in organizations. Methods and practice for recruitment and appraisal. Chichester: Wiley. isbn 0471916404




Thomas V. Merluzzi, Carol R. Glass and Myles Genest (Eds.) (1981). Cognitive assessment. New York: Guilford Press. isbn 0898620015




N. J. Mackintosh (Ed.) (1995). Cyril Burt, fraud or framed? Oxford: Oxford University Press. isbn 019852336X




Earl Hunt (2011). Human Intelligence.




Kofi Kissi Dompere (2014). Fuzziness, Democracy, Control and Collective Decision-choice System: A Theory on Political Economy of Rent-Seeking and Profit-Harvesting. Springer. [eBook in KB] info


I do not have much literature on rent-seeking outside the world of selection at the gate, so I have simply noted this one.



Helga A. H. Rowe (Ed.) (1991). Intelligence: Reconceptualization and Measurement. Erlbaum. [available as eBook in KB] preview Questia




Maria Elena Oliveri & Matthias von Davier (2014). Toward Increasing Fairness in Score Scale Calibrations Employed in International Large-Scale Assessments. International Journal of Testing, 14, 1-21. open access


Data used: PIRLS



Arne Evers, Klaas Sijtsma, Wouter Lucassen & Rob R. Meijer (2010). The Dutch Review Process for Evaluating the Quality of Psychological Tests: History, Procedure, and Results. International Journal of Testing, 10. abstract [paywall]


History and working method of COTAN.



Franić, Sanja; Dolan, Conor V.; Borsboom, Denny; Hudziak, James J.; van Beijsterveldt, Catherina E. M.; Boomsma, Dorret I. (2013). Can genetics help psychometrics? Improving dimensionality assessment through genetic factor modeling. Psychological Methods, 18, 406-433. abstract




Wim J. van der Linden (1998). A discussion of some methodological issues in international assessments. International Journal of Educational Research, 29, 569-577. abstract




Stephen G. Sireci and Polly Parker (2006). Validity on Trial: Psychometric and Legal Conceptualizations of Validity. Educational Measurement: Issues and Practice, fall, 27-34. abstract




Shudong Wang, Hong Jiao, Michael J. Young, Thomas Brooks and John Olson (2008). Comparability of Computer-Based and Paper-and-Pencil Testing in K-12 Reading Assessments: A Meta-Analysis of Testing Mode Effects. Educational and Psychological Measurement, 68, 5-. abstract




Fadia Nasser-Abu Alhija & Adi Levy (2009). Effect Size Reporting Practices in Published Articles. Educational and Psychological Measurement, 69, 245-265. abstract




Alvaro J. Arce-Ferrer and Elvira Martínez Guzmán (2009). Studying the Equivalence of Computer-Delivered and Paper-Based Administrations of the Raven Standard Progressive Matrices Test. Educational and Psychological Measurement, 69, 855-867. abstract


Finds no differences, in contrast to the earlier review by Kubinger (1991).



Anneke C. Timmermans, Tom A. B. Snijders and Roel J. Bosker (2013). In Search of Value Added in the Case of Complex School Effects. Educational and Psychological Measurement, 73, 210-228. abstract


I see this article mainly as a technical analysis: specify a model, take an available dataset, and start computing. Check how model A leads to different outcomes than model B. In this article at least, the authors hardly address the question whether estimating value added is a meaningful undertaking with which society may be bothered. They simply compute with models, and as is then guaranteed to happen, that yields certain outcomes. Quite a few tacit and less tacit presuppositions nevertheless underlie this way of working.



Robert W. Lissitz (2009). Validity. Revisions, new directions, and applications. Information Age Publishing. [not yet seen]



Wim J. van der Linden & Minjeong Jeon (2012). Modeling Answer Changes on Test Items. Journal of Educational and Behavioral Statistics, 37, 180-199. abstract pdf

On fraudulent changes.



Wim J. van der Linden, Minjeong Jeon & Steve Ferrara (2011). A Paradox in the Study of the Benefits of Test-Item Review. Journal of Educational Measurement, 48, 380-398. pdf



Kristian E. Markon (2013). Information Utility: Quantifying the Total Psychometric Information Provided by a Measure. Psychological Methods, 18, 15-35. abstract



Gregory J. Cizek (2012). Defining and Distinguishing Validity: Interpretations of Score Meaning and Justifications of Test Use. Psychological Methods, 17, 31-43. abstract



Ken Kelley & Kristopher J. Preacher (2012). On effect size. Psychological Methods, 17, 137-172. accepted concept



Michèle Nuijten, Marie Deserno, Angélique Cramer & Denny Borsboom (2013). Psychologische stoornissen als complexe netwerken. De Psycholoog, January, 12-23 [corrected reference, see De Psycholoog, February 2013, p. 4]



Han L. J. van der Maas, Conor V. Dolan, Raoul P. P. P. Grasman, Jelte M. Wicherts, Hilde M. Huizenga, and Maartje E. J. Raijmakers (2006). A Dynamical Model of General Intelligence: The Positive Manifold of Intelligence by Mutualism. Psychological Review, 113, 842-861. pdf




Nate Silver (2012). The signal and the noise. Why so many predictions fail -- but some don't. The Penguin Press. isbn 9781594204111 http://www.nytimes.com/2012/10/24/books/nate-silvers-signal-and-the-noise-examines-predictions.html http://www.npr.org/2012/10/10/162594751/signal-and-noise-prediction-as-art-and-science



Cynthia G. Parshall, Judith A. Spray, John C. Kalohn, and Tim Davey (2002). Practical Considerations in Computer-Based Testing. Springer. [Not yet seen. Reviewed by Rob Meijer: Applied Psychological Measurement, Vol. 27 No. 1, January 2003, 78-80]



David Thissen & Howard Wainer (Eds.) (2001). Test Scoring. Springer. [Not yet seen. Reviewed by Rob Meijer: Applied Psychological Measurement, Vol. 27 No. 1, January 2003, 75-77]



Ronald K. Hambleton (2000). Advances in Performance Assessment Methodology. Applied Psychological Measurement, 24, 291-293. [Introduction to special issue]



Randy Elliot Bennett, Mary Morley & Dennis Quardt (2000). Three Response Types for Broadening the Conception of Mathematical Problem Solving in Computerized Tests. Applied Psychological Measurement, 24, 294-309. abstract



M. David Miller & Robert L. Linn (2000). Validation of performance-based assessments. Applied Psychological Measurement, 24, 367-378. abstract



Wim J. van der Linden (2000). Optimal Assembly of Tests with Item Sets. Applied Psychological Measurement, 24, 225-240. abstract


This concerns the type of exam item in which a text is given about which several questions are then asked, a format that is used a lot in final examinations. The word ‘optimal’ has only a limited meaning, of course: optimal within given constraints. If those constraints are poor, such as the quality of the questions in the item pool from which the test is drawn, then ‘optimal’ is a euphemism.



Tom Verguts & Paul de Boeck (2000). A Rasch Model for Detecting Learning While Solving an Intelligence Test. Applied Psychological Measurement, 24, 151-162. abstract


A striking title. Intriguing.



E. Matthew Schulz, Michael J. Kolen & W. Alan Nicewander (1999). A Rationale for Defining Achievement Levels Using IRT-Estimated Domain Scores. Applied Psychological Measurement, 23, 347-362. abstract



Rob R. Meijer & Michael L. Nering (1999). Computerized Adaptive Testing: Overview and Introduction. Applied Psychological Measurement, 23, 187-194. abstract



Chi-Keung Leung, Hua-Hua Chang & Kit-Tai Hau (2005). Computerized adaptive testing: A mixture item selection approach for constrained situations. British Journal of Mathematical and Statistical Psychology, 58, 239-257. abstract



T. J. H. M. Eggen (1999). Item Selection in Adaptive Testing with the Sequential Probability Ratio Test. Applied Psychological Measurement, 23, 249-261. abstract



Almond & Mislevy (1999). Graphical Models and Computerized Adaptive Testing. Applied Psychological Measurement, 23, 223-237. abstract



Tenko Raykov (1999). Are Simple Change Scores Obsolete? An Approach to Studying Correlates and Predictors of Change. Applied Psychological Measurement, 23, 120-126. abstract



Nambury S. Raju, Reyhan Bilgic, Jack E. Edwards & Paul F. Fleer (1999). Accuracy of Population Validity and Cross-Validity Estimation: An Empirical Comparison of Formula-Based, Traditional Empirical, and Equal Weights Procedures. Applied Psychological Measurement, 23, 99-115. abstract



Wim J. van der Linden (1999). Empirical Initialization of the Trait Estimator in Adaptive Testing. Applied Psychological Measurement, 23, 21-29. abstract



Gideon J. Mellenbergh (1999). A Note on Simple Gain Score Precision. Applied Psychological Measurement, 23, 87-89. abstract



John R. Bergan, Richard D. Schwarz & Linda A. Reddy (1999). Latent Structure Analysis of Classification Errors in Screening and Clinical Diagnosis: An Alternative to Classification Analysis. Applied Psychological Measurement, 23, 69-86. abstract



Klaas Sijtsma & Anton C. Verweij (1999). Knowledge of Solution Strategies and IRT Modeling of Items for Transitive Reasoning. Applied Psychological Measurement, 23, 55-68. abstract


Research in which the pupils had to justify their answers on the test. See also chapter 2 of Toetsvragen ontwerpen.



Wim van der Linden (1998). Optimal assembly of psychological and educational tests. Applied Psychological Measurement, 22, 195-211. abstract



Anat Ben-Simon, David V. Budescu and Baruch Nevo (1997). A Comparative Study of Measures of Partial Knowledge in Multiple-Choice Tests. Applied Psychological Measurement, 21, 65-88. abstract



Craig W. Deville (1996). An empirical link of content and construct validity evidence. Applied Psychological Measurement, 20, 127-139. abstract



Richard H. Williams & Donald W. Zimmerman (1996). Are simple gain scores obsolete? Applied Psychological Measurement, 20, 59-69. abstract



Rolf Langeheine, Elsbeth Stern & Frank van de Pol (1994). State Mastery Learning: Dynamic Models for Longitudinal Data. Applied Psychological Measurement, 18, 277-291. abstract



Menucha Birenbaum, Kikumi K. Tatsuoka & Yaffa Gutvirtz (1992). Effects of Response Format on Diagnostic Assessment of Scholastic Achievement. Applied Psychological Measurement, 16, 353-363. abstract


For the case of algebra items.



A.H.G.S. van der Ven & F.M. Gremmen (1992). The Knowledge or Random Guessing Model for Matching Tests. Applied Psychological Measurement, 16, 177-194. abstract



Mary E. Lunz, Betty A. Bergstrom & Benjamin D. Wright (1992). The Effect of Review on Student Ability and Test Efficiency for Computerized Adaptive Tests. Applied Psychological Measurement, 16, 33-40. abstract



Frits E. Zegers (1991). Coefficients for interrater agreement. Applied Psychological Measurement, 15, 321-333. abstract



Frits E. Zegers (1989). Het meten van overeenstemming. Nederlands Tijdschrift voor de Psychologie, 44, 145-156.

“. . . one teacher gives the grades 7, 8 and 9, while the other gives the same essays the grades 2, 3 and 4 respectively. The product-moment correlation (pmc) between these sets of scores is maximal (+1), but it is hard to maintain that the teachers are in complete agreement.”

p. 145
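
The arithmetic behind the grades example in the Zegers quotation can be checked with a minimal sketch. The function names `pearson_r` and `mean_abs_diff` are illustrative choices, not taken from Zegers, and the mean absolute difference is used here only as a simple stand-in for an agreement measure; it is not the coefficient Zegers proposes.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation of two equally long lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def mean_abs_diff(x, y):
    """Average absolute gap between paired scores (illustrative agreement check)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Grades from the quoted example: same three essays, two teachers.
teacher_a = [7, 8, 9]
teacher_b = [2, 3, 4]

r = pearson_r(teacher_a, teacher_b)        # correlation ignores the level difference
gap = mean_abs_diff(teacher_a, teacher_b)  # the level difference itself
print(r, gap)  # prints: 1.0 5.0
```

The correlation is at its maximum although the two teachers are five grade points apart on every essay, which is the point the quotation makes.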



W. K. B. Hofstee & F. E. Zegers (1991). Idiographic correlation: modeling judgments of agreement between school grades. Tijdschrift voor Onderwijsresearch, 16, 331-336.



John B. Carroll (1990). Estimating Item and Ability Parameters in Homogeneous Tests With the Person Characteristic Function. Applied Psychological Measurement, 14, 109-125. abstract



Huub van den Bergh (1990). On the Construct Validity of Multiple-Choice Items for Reading Comprehension. Applied Psychological Measurement, 14, 1-12. abstract



Michael I. Waller (1989). Modeling Guessing Behavior: A Comparison of Two IRT Models. Applied Psychological Measurement, 13, 233-243. abstract



Jerry S. Gilmer (1989). The Effects of Test Disclosure on Equated Scores and Pass Rates. Applied Psychological Measurement, 13, 245-255. abstract



Terry A. Ackerman (1989). Unidimensional IRT Calibration of Compensatory and Noncompensatory Multidimensional Items. Applied Psychological Measurement, 13, 113-127. abstract



Marion S. Aftanas (1988). Theories, Models, and Standard Systems of Measurement. Applied Psychological Measurement, 12, 325-338. abstract



Terry A. Ackerman & Philip L. Smith (1988). A Comparison of the Information Provided by Essay, Multiple-Choice, and Free-Response Writing Tests. Applied Psychological Measurement, 12, 117-128. abstract



David V. Budescu (1988). On the Feasibility of Multiple Matching Tests: Variations on a Theme by Gulliksen. Applied Psychological Measurement, 12, 5-14. abstract



David V. Budescu (1987). Open-Ended Versus Multiple-Choice Response Formats—It Does Make a Difference for Diagnostic Purposes. Applied Psychological Measurement, 11, 385-395. abstract



Wim J. van der Linden (1986). The Changing Conception of Measurement in Education and Psychology. Applied Psychological Measurement, 10, 325-332. abstract


Technocratic.



Catharina C. van Thiel & Michel A. Zwarts (1986). Development of a Testing Service System. Applied Psychological Measurement, 10, 391-403. abstract



Ronald K. Hambleton & Richard J. Rovinelli (1986). Assessing the Dimensionality of a Set of Test Items. Applied Psychological Measurement, 10, 287-302. abstract



Harold Gulliksen (1986). Perspective on Educational Measurement. Applied Psychological Measurement, 10, 109-132. abstract



Neal Schmitt & Daniel M. Stults (1986). Methodology Review: Analysis of Multitrait-Multimethod Matrices. Applied Psychological Measurement, 10, 1-22. abstract



J. P. Guilford (1985). A Sixty-Year Perspective on Psychological Measurement. Applied Psychological Measurement, 9, 341-349. abstract



Anne Anastasi (1985). Some Emerging Trends in Psychological Measurement: A Fifty-Year Perspective. Applied Psychological Measurement, 9, 121-138. abstract



Gail Ironson, Susan Homan & Ruth Willis (1984). The Validity of Item Bias Techniques with Math Word Problems. Applied Psychological Measurement, 8, 391-396. abstract



Albert C. Oosterhof & Pamela K. Coats (1984). Comparison of Difficulties and Reliabilities of Quantitative Word Problems in Completion and Multiple-Choice Item Formats. Applied Psychological Measurement, 8, 287-294. abstract



Robert L. Linn & C. Nicholas Hastings (1984). Group differentiated prediction. Applied Psychological Measurement, 8, 165-172. abstract



Michael Kane & Jennifer Wilson (1984). Errors of Measurement and Standard Setting in Mastery Testing. Applied Psychological Measurement, 8, 107-115. abstract



Isaac I. Bejar (1983). Subject Matter Experts' Assessment of Item Statistics. Applied Psychological Measurement, 7, 303-310. abstract



Henk Blok & Wim E. Saris (1983). Using Longitudinal Data to Estimate Reliability. Applied Psychological Measurement 7, 295-301. abstract



Anne R. Fitzpatrick (1983). The Meaning of Content Validity. Applied Psychological Measurement 7, 3-13. abstract



Ronald K. Hambleton (1983). Application of Item Response Models to Criterion-Referenced Assessment. Applied Psychological Measurement 7, 33-44. abstract



R. A. Weitzman (1982). Sequential Testing for Selection. Applied Psychological Measurement 6, 337-351. abstract



Jo P. M. Pieters & Ad H. G. S. van der Ven (1982). Precision, Speed, and Distraction in Time-Limit Tests. Applied Psychological Measurement 6, 93-103. abstract



Rand R. Wilcox (1981). A Cautionary Note on Estimating the Reliability of a Mastery Test with the Beta-Binomial Model. Applied Psychological Measurement 5, 531-537. abstract



Lawrence J. Stricker (1981). The Role of Noncognitive Measures in Medical School Admissions. Applied Psychological Measurement 5, 313-323. abstract



Gary B. Forbach & Ronald G. Evans (1981). The Remote Associates Test as a Predictor of Productivity in Brainstorming Groups. Applied Psychological Measurement 5, 333-339. abstract



Susan E. Whitely & Lisa M. Schneider (1981). Information Structure for Geometric Analogies: A Test Theory Approach. Applied Psychological Measurement 5, 383-397. abstract



Robert L. Linn, Michael V. Levine, C. Nicholas Hastings & James L. Wardrop (1981). Item Bias in a Test of Reading Comprehension. Applied Psychological Measurement 5, 159-173. abstract



Ronald K. Hambleton (1980). Contributions to Criterion-Referenced Testing Technology: An Introduction. Applied Psychological Measurement 4, 421-424. abstract



Rand R. Wilcox (1980). Determining the Length of a Criterion-Referenced Test. Applied Psychological Measurement 4, 425-446. abstract



Lorrie Shepard (1980). Standard Setting Issues and Methods. Applied Psychological Measurement 4, 447-467. abstract



Wim J. van der Linden (1980). Decision Models for Use with Criterion-Referenced Tests. Applied Psychological Measurement 4, 469-492. abstract



George B. Macready & C. Mitchell Dayton (1980). The Nature and Use of State Mastery Models. Applied Psychological Measurement 4, 493-516. abstract



Ross E. Traub & Glenn L. Rowley (1980). Reliability of Test Scores and Decisions. Applied Psychological Measurement 4, 517-545. abstract



Robert L. Linn (1980). Issues of Validity for Criterion-Referenced Measures. Applied Psychological Measurement 4, 547-561. abstract



Ronald A. Berk (1980). A Framework for Methodological Advances in Criterion-Referenced Testing. Applied Psychological Measurement 4, 563-573. abstract



Samuel Livingston (1980). Comments on Criterion-Referenced Testing. Applied Psychological Measurement 4, 575-581. abstract



Howard Wainer (1980). A Test of Graphicacy in Children. Applied Psychological Measurement 4, 331-340. abstract



Luis M. Laosa (1980). Measures for the Study of Maternal Teaching Strategies. Applied Psychological Measurement 4, 355-366. abstract



Robert B. Frary (1980). The Effect of Misinformation, Partial Information, and Guessing on Expected Multiple-Choice Test Item Scores. Applied Psychological Measurement 4, 79-90. abstract



Wim J. van der Linden (1979). Binomial Test Models and Item Difficulty. Applied Psychological Measurement 3, 401-411. abstract



D. Magnusson & G. Backteman (1978). Longitudinal Stability of Person Characteristics: Intelligence and Creativity. Applied Psychological Measurement 2, 481-490. abstract



R. R. Schmeck & F. D. Ribich (1978). Construct Validation of the Inventory of Learning Processes. Applied Psychological Measurement 2, 551-562. abstract



Robert T. Keller & Winford E. Holland (1978). A Cross-Validation Study of the Kirton Adaption-Innovation Inventory in Three Research and Development Organizations. Applied Psychological Measurement 2, 563-570. abstract



Wim J. van der Linden & Gideon J. Mellenbergh (1978). Coefficients for Tests from a Decision Theoretic Point of View. Applied Psychological Measurement 2, 119-134. abstract



Wim J. van der Linden & Gideon J. Mellenbergh (1977). Optimal Cutting Scores Using A Linear Loss Function. Applied Psychological Measurement 1, 593-599. abstract



Norman Frederiksen & William C. Ward (1978). Measures for the Study of Creativity in Scientific Problem-Solving. Applied Psychological Measurement 2, 1-24. abstract



Susan E. Whitely (1977). Information-Processing on Intelligence Test Items: Some Response Components. Applied Psychological Measurement 1, 465-476. abstract



Robyn M. Dawes (1977). Suppose We Measured Height With Rating Scales Instead of Rulers. Applied Psychological Measurement 1, 267-273. abstract; pdf



P. W. Van Rijn, T. J. H. M. Eggen, B. T. Hemker & P. F. Sanders (2002). Evaluation of Selection Procedures for Computerized Adaptive Testing with Polytomous Items. Applied Psychological Measurement, 26, 393-411. abstract



Dimiter M. Dimitrov (2007). Least Squares Distance Method of Cognitive Validation and Analysis for Binary Items Using Their Item Response Theory Parameters. Applied Psychological Measurement, 31, 367-387. abstract

I fear that this is all tremendously complicated, and completely irrelevant. The theoretical background is stimulus-response theory, but that in itself need not be wrong. I have no time to sort this out now.



Donald W. Zimmerman & Richard H. Williams (2003). A New Look at the Influence of Guessing on the Reliability of Multiple-Choice Tests. Applied Psychological Measurement, 27, 357-371. abstract



Theo J. J. M. Eggen & Angela J. Verschoor (2006). Optimal Testing With Easy or Difficult Items in Computerized Adaptive Testing. Applied Psychological Measurement, 30, 379-393. abstract



Wim van der Linden (2006). Equating Error in Observed-Score Equating. Applied Psychological Measurement, 30, 355-378. abstract



Wim van der Linden (2006). Equating Scores From Adaptive to Linear Tests. Applied Psychological Measurement, 30, 493-. abstract



Neil J. Dorans, Jinghua Liu & Shelby Hammond (2008). Anchor Test Type and Population Invariance: An Exploration Across Subpopulations and Test Administrations. Applied Psychological Measurement, 32, 81-97. abstract



Robert L. Brennan (2008). A Discussion of Population Invariance. Applied Psychological Measurement, 32, 102-114. abstract



Qing Yi, Deborah J. Harris & Xiaohong Gao (2008). A Discussion of Population Invariance of Equating. Applied Psychological Measurement, 32, 98-101. abstract

“If the conversions for various subgroups of interest are not comparable or population invariant, then the psychometric implication is that different conversions should be used for different groups. However, in practice, testing programs cannot use different linkings for different groups. In today’s social and political climate, it would be very difficult for a testing program to justify assigning different reported scores to two candidates from different groups who have the same number-correct score on the test. So if the results of population invariance studies show indications of population sensitivity, then great care needs to be taken in selecting a data collection design and a subpopulation (of the total testing population) for use for all item and test analyses and for score equating. And the subpopulation for which score comparability is expected to hold should be specified in the programs’ technical manual. Careful specification of the analysis population used for a test will improve score equity and improve scale stability across test administrations and test forms.”



Qing Yi, Deborah J. Harris & Xiaohong Gao (2008). Invariance of Equating Functions Across Different Subgroups of Examinees Taking a Science Achievement Test. Applied Psychological Measurement, 32, 62-80. abstract



Robert Semmes, Mark L. Davison & Catherine Close (2011). Modeling individual differences in numerical reasoning speed as a random effect of response time limits. Applied Psychological Measurement, 35, 433-446. abstract


With arithmetic tests the question is: are we testing (differences in) arithmetic skill, intelligence, or what? The answer also depends on the time available for taking the test: does everyone have ample time to finish the work, or is time so limited that a non-negligible number of participants cannot make a proper attempt at all items? Translated to the Dutch situation of the arithmetic tests that are being added to the examinations in secondary education: does the technique of digital test administration put pupils in a situation of too little time to answer all items properly? If so, a time factor is in play. Digital administration creates a complicated situation that is not the same as limited time for the whole test: if an item cannot be done immediately, the pupil cannot (in the software currently used by Cito) return to that item later. From the article's reference list:



Wim J. van der Linden (2011). Setting time limits on tests. Applied Psychological Measurement, 35, 183-199. abstract




Jihyun Lee & James Corter (2011). Diagnosis of subtraction bugs using Bayesian networks. Applied Psychological Measurement, 35, 27-47. abstract




Timo M. Bechger, Gunter Maris & Ya Ping Hsiao (2010). Detecting Halo Effects in Performance-Based Examinations. Applied Psychological Measurement, 35, 27-47. abstract




Wim J. van der Linden & Marie Wiberg (2010). Local Observed-Score Equating With Anchor-Test Designs. Applied Psychological Measurement, 35, 27-47. abstract




Robert C. Daniel & Susan E. Embretson (2010). Designing Cognitive Complexity in Mathematical Problem-Solving Items. Applied Psychological Measurement, 35, 27-47. abstract




Susan Embretson (Ed.) (2010). Measuring psychological constructs. Advances in model-based approaches. American Psychological Association. site



William W. Cooley & Paul R. Lohnes (1976). Evaluation Research in Education. Irvington Publishers.



William W. Cooley & Paul R. Lohnes (1962). Multivariate Procedures for the Behavioral Sciences. Wiley. Lib. Congress 62-18990.



Daniel H. Robinson, Joel R. Levin, Leslie O'Ryan & Duane Halbur-Ramseyer (2001). Does Statistical Language Constitute a "Significant" Roadblock to Readers' Interpretations of Research Results? Journal of Educational Psychology, 93, 646-654. abstract




AERA, APA & NCME (1999). The Standards for Educational and Psychological Testing. see here - unauthorized summary



Anne E. Magurran & Brian J. McGill (Eds.) (2011). Biological Diversity. Frontiers in measurement and assessment. Oxford University Press.



Alexander W. Wiseman (2010). The uses of evidence for educational policy making: global contexts and international trends. Review of Research in Education, 34, 1-24.



Richard J. Murnane & John B. Willett (2011). Methods Matter. Improving Causal Inference in Educational and Social Science Research. Oxford University Press. [mentioned in blog 11 of the series on realistic arithmetic education]



Robert F. Dedrick, John M. Ferron, Melinda R. Hess, Kristine Y. Hogarty, Jeffrey D. Kromrey, Thomas R. Lang, John D. Niles, and Reginald S. Lee (2009). Multilevel Modeling: A Review of Methodological Issues and Applications. Review of Educational Research, 79, 69-102



G. van den Berg (1981). Onderwijskundig onderzoek: twee doelstellingen, één onderzoeksmodel. Pedagogische Studiën, 58, 213-225.



W. Wardekker (1979). Interdisciplinaire onderwijskunde: modellen voor een wetenschap. Pedagogische Studiën, 56, 183-196. Not of importance.



Theresa Ann Sipe & William L. Curlette (Guest eds.) (1997). A meta-synthesis of factors related to educational achievement: A methodological approach to summarizing and synthesizing. International Journal of Educational Research, 25 #7, 583-698.



Richard P. Phelps (Ed.) (2009). Correcting fallacies about educational and psychological testing. APA. [in KB as eBook] [ UBL PEDAG. 64.b.44 ]




J. Tinbergen (1936). Grondproblemen der theoretische statistiek. De Erven F. Bohn.



Donald W. Zimmerman (2009). The Reliability of Difference Scores in Populations and Samples. Journal of Educational Measurement, 46, 19-42.



Stephen Gorard (2010). All evidence is equal: the flaw in statistical reasoning. Oxford Review of Education, 36, 63-77.



Stephen B. Broomell & David V. Budescu (2009). Why are experts correlated? Decomposing correlations between judges. Psychometrika, 74, 531-553.



Rolf Haenni (2008). Aggregating referee scores: An algebraic approach. In U. Endriss and P. W. Goldberg: COMSOC'08, 2nd International Workshop on Computational Social Choice, 277-288, 2008. pdf



F. Roels (Ed.) (1928). Cinquième Conférence Internationale de Psychotechnique Tenue à Utrecht. Comptes-Rendus. Dekker & v.d. Vegt.



Andrew H. Jazwinski (1970). Stochastic Processes and Filtering Theory. Academic Press. (Applications: measurement procedures, error correction.)



Willem K. B. Hofstee (2009). Promoting intersubjectivity: a recursive-betting model of evaluative judgments. Netherlands Journal of Psychology, 65. abstract


Notes: toetsmodellen.htm#Hofstee_intersubjectivity



Tilmann Gneiting & Adrian E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102, 359-378. pdf



Rebecca Zwick & Lubella Lenaburg (2009). Using discrete loss functions and weighted kappa for classification: An illustration based on Bayesian network analysis. Journal of Educational and Behavioral Statistics, 34, 190-200.



William Meredith (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525-543.




Audrey Amrein-Beardsley (2008). Methodological concerns about the education value-added system. Educational Researcher, 37, 65-75.



Robert E. Slavin (2008). Perspectives on evidence-based research in education. Educational Researcher, 37, 5-14.



Kenneth R. Howe (2009). Epistemology, methodology, and education sciences. Positivist dogmas, rhetoric, and the education science question. Educational Researcher, 38, 428-440. pdf




David J. Bartholomew, Ian J. Deary & Martin Lawn (2009). The origin of factor scores: Spearman, Thomson and Bartlett. British Journal of Mathematical and Statistical Psychology, 62, 569-582. Nice, historical.



A. Shreider (1964). Method of statistical testing. Monte Carlo method. Elsevier.




Keith Morrison (2009). Causation in Educational Research. Routledge.



Johnson O'Connor (1934). Psychometrics. A Study of Psychological Measurements. Harvard University Press.



Thomas D. Cook & Donald T. Campbell (1979). Quasi-Experimentation. Design & Analysis Issues for Field Settings. Rand McNally.



George A. Anastassiou (2010). Probabilistic Inequalities. World Scientific. isbn 9789814280785 981428078X



Stephen Stark, Oleksandr S. Chernyshenko & Fritz Drasgow (2004). Examining the Effects of Differential Item (Functioning and Differential) Test Functioning on Selection Decisions: When Are Statistically Significant Effects Practically Important? Journal of Applied Psychology, 89, 497-508. abstract



Schmidt, Frank L., and Ryan D. Zimmerman (2004). A counterintuitive hypothesis about employment interview validity and some supporting evidence. Journal of Applied Psychology, 89, 553-561. abstract



Timothy A. Judge, Amy E. Colbert & Remus Ilies (2004). Intelligence and Leadership: A Quantitative Review and Test of Theoretical Propositions. Journal of Applied Psychology, 89, 542-552.



Frederick L. Oswald, Neal Schmitt, Brian H. Kim, Lauren J. Ramsay, and Michael A. Gillespie (2004). Developing a Biodata Measure and Situational Judgment Inventory as Predictors of College Student Performance. Journal of Applied Psychology, 89, 187-207. pdf



Gillian B. Yeo & Andrew Neal (2004). A Multilevel Analysis of Effort, Practice, and Performance: Effects of Ability, Conscientiousness, and Goal Orientation. Journal of Applied Psychology, 89, 231-247. abstract



Weekley, Jeff A., Frank Blake, Edward J. O'Connor, and Lawrence H. Peters (1985). A comparison of three methods of estimating the standard deviation of performance in dollars. Journal of Applied Psychology, 70, 122-126.



Richard R. Reilly & James W. Smither (1985). An Examination of Two Alternative Techniques to Estimate the Standard Deviation of Job Performance in Dollars. Journal of Applied Psychology, 70, 651-661. abstract



Burke, Michael J., and James T. Fredrick (1986). A comparison of economic utility estimates for alternative SDy estimation procedures. Journal of Applied Psychology, 71, 334-339.



C-L. C. Kulik & J. A. Kulik (1982). Effects of ability grouping on secondary school students: a meta-analysis of evaluation findings. American Educational Research Journal, 19, 415-428.



J. A. Kulik, R. L. Bangert-Drowns & C-L. C. Kulik (1984). Effectiveness of coaching for aptitude tests. Psychological Bulletin, 95, 179-188.



Robert L. Bangert-Drowns, James A. Kulik & Chen-Lin C. Kulik (1983). Effects of coaching programs on achievement test performance. Review of Educational Research, 53, 571-585. abstract



Robert J. Mislevy (1993). A framework for studying differences between multiple-choice and free-response test items. In Randy Elliot Bennett and William C. Ward (Eds.), Construction versus choice in cognitive measurement (p. 75-106). Erlbaum.

This really is a miserable definition; Mislevy can do much better than this. It is miserable because it does not exclude anything; anything goes here. Nevertheless, it has appeared in print, and as such it reveals a kind of oversimplification that tends to be typical of a lot of psychometric work. It seems the thinking goes into the models themselves, not into the situations they are supposedly representing.
The educational decisions, by the way, are strictly reserved for institutional representatives. Mislevy does not see students as making their own decisions, whether on the basis of test results or any other information. This, in my opinion, is unprofessional neglect that has somehow come to be regarded as professional, notwithstanding that one chapter in Cronbach and Gleser, 1957, emphasizing the individual decision maker.



Gideon J. Mellenbergh & Wulfert P. van den Brink (1998). The Measurement of Individual Change. Psychological Methods, 3, 470-485. abstract



David J. Weiss and Shannon Von Minden (2011). Measuring Individual Growth With Conventional and Adaptive Tests. Journal of Methods and Measurement in the Social Sciences Vol. 2, No. 1, 80-101. pdf



J. B. Carlin & D. B. Rubin (1991). Summarizing multiple-choice tests using three informative statistics. Psychological Bulletin, 110, 338-349. [beta-binomial model]



J. G. C. Verheij (1994, submitted for publication). An improved maximum likelihood procedure for estimating the parameters of the beta-binomial distribution. [beta-binomial] [Submitted for publication, but I do not know to which journal; googling turns up nothing]



Henry Rouanet (1996). Bayesian Methods for Assessing Importance of Effects. Psychological Bulletin, 119, 149-158. pdf



Michael T. Kane (1992). An Argument-Based Approach to Validity. Psychological Bulletin, 112, 527-535. abstract




Ju-Whei Lee & J. Frank Yates (1992). How Quantity Judgment Changes as the Number of Cues Increases: An Analytical Framework and Review. Psychological Bulletin, 112, 363-377. abstract



Deborah A. Prentice and Dale T. Miller (1992). When Small Effects Are Impressive. Psychological Bulletin, 112, 160-164. pdf



Peter Z. Schochet & Hanley S. Chiang (online first 2012). What Are Error Rates for Classifying Teacher and School Performance Using Value-Added Models? Journal of Educational and Behavioral Statistics. abstract



Barry R. Nathan & Wayne F. Cascio: Introduction. Technical & legal standards. In Ronald A. Berk (Ed.) (1986). Performance Assessment. Methods & Applications (1-50). Johns Hopkins University Press. [UB Leiden: PSYCHO B7.1.-14] book abstract




Richard J. Stiggins & Nancy J. Bridgeford: Student evaluation. In Ronald A. Berk (Ed.) (1986). Performance Assessment. Methods & Applications (469-491). Johns Hopkins University Press. [UB Leiden: PSYCHO B7.1.-14] book abstract




Walter van Dyke Bingham (1937/1942). Aptitudes and Aptitude Testing. Harper & Brothers Publishers. abstract


(Not mentioned by Anne Anastasi, 1984.)



Lotte Schenk-Danzinger (1953). Entwicklungstests für das Schulalter. I. Teil Altersstufe 5-11 Jahre. Wien: Verlag für Jugend und Volk. A curious book, pedagogy rather than psychology. A huge number of little tests, all of them subjective.



Barbara S. Plake (Ed.) (1984). Social and Technical Issues in Testing. Implications for Test Construction and Usage. Erlbaum. pdf's




Edward F. Alf & Donald D. Dorfman (1967). The classification of individuals into two criterion groups on the basis of a discontinuous payoff function. Psychometrika, 32, 115-123.



Ellen Condliffe Lagemann (2000). An Elusive Science: The Troubling History of Education Research. University of Chicago Press. isbn 0226467724 review short review 1997 article: http://www.jstor.org/discover/10.2307/1176271



F. Allan Hanson (1993). Testing testing. Social consequences of the examined life. University of California Press. online


p. 81 APA statement on selection using the lie detector! At 85% accuracy, more candidates are wrongly branded liars than are correctly identified as liars. p. 114 Bentham's panopticon!
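
The arithmetic behind the p. 81 remark can be made explicit. What follows is only a minimal sketch under stated assumptions: it reads '85% correct' as meaning the test is right 85% of the time for liars and truth-tellers alike, and the base rates of lying are hypothetical; neither the function name nor the numbers come from Hanson's text.

```python
def screening_counts(n_candidates, base_rate, accuracy=0.85):
    """Expected true and false 'liar' verdicts for a screening test.

    Assumption (not from the source): the test is right with probability
    `accuracy` for liars and for truth-tellers alike.
    """
    liars = n_candidates * base_rate
    truthful = n_candidates - liars
    true_positives = liars * accuracy             # liars correctly flagged
    false_positives = truthful * (1 - accuracy)   # truthful candidates flagged
    return true_positives, false_positives


# Hypothetical base rates of lying among 1000 candidates.
for base_rate in (0.05, 0.10, 0.15, 0.30):
    tp, fp = screening_counts(1000, base_rate)
    print(f"base rate {base_rate:.2f}: {tp:.1f} correctly flagged, {fp:.1f} wrongly flagged")
```

Under these assumptions, a base rate of 0.05 gives 42.5 correctly flagged against 142.5 wrongly flagged; the two counts break even at a base rate of 0.15, so the claim holds whenever fewer than 15% of candidates lie.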



Henry S. Dyer (1972). Recycling the problems in testing. In Proceedings of the 1972 Invitational Conference on Testing Problems. Educational Testing Service site. [not online]


Dyer sketches some perennial abuses surrounding test use, but holds that the educational tests themselves are perfectly in order (p. 87).



Robert M. Guion (1977). Content validity: the source of my discontent. Applied Psychological Measurement, 1, 1-11. abstract




L. Heyerick (1980). Het gebruik van foute woordbeelden in spellingtoetsen. Pedagogische Studiën, 57, 268-272.




Wilco H. M. Emons, Klaas Sijtsma & Rob R. Meijer (2007). On the consistency of individual classification using short scales. Psychological Methods, 12, 105-120. pdf




Cyril Burt (1921/1933). Mental and scholastic tests. London: P. S. King. fourth impression of the original text. archive.org


A voluminous work, covering everything on this subject. So it also includes arithmetic tests, with norm tables for different ages. Burt treats the arithmetic tests as achievement tests (scholastic tests), not as part of intelligence tests. But the dividing line here is really unclear. It was a time of general enthusiasm for tests of every kind, a novelty after all. The point of it all remains rather obscure.



William A. Mehrens and Robert L. Ebel (eds) (1967). Principles of educational and psychological measurement. A book of selected readings. Rand McNally. lccc 67-14694




Jackson, D. N., & Messick, S. (Eds) (1967). Problems in human assessment. McGraw-Hill.




Kenneth A. Bollen (2002). Latent variables in psychology and the social sciences. Annual Review of Psychology, 53, 605-634. abstract




Ronald K. Hambleton & Stephen G. Sireci (1997). Future directions for norm-referenced and criterion-referenced testing. In Lorin W. Anderson (Ed.) (1997). Educational testing and assessment: lessons from the past, directions for the future (p. 379-394). Int. J. Educ. Res., 27, 355-445. pdf




Richard Phelps (Ed.) (2009). Correcting Fallacies about Educational and Psychological Testing [UB Leiden PEDAG. 64.B.44 ZICHTKAST] info




David A. Freedman (2009). Statistical models and causal inference. A dialogue with the social sciences. Cambridge University Press.


Mentioned by Jon Elster in publications on obscurantism in science. Presumably a reader of work by Freedman, edited by David Collier, Jasjeet S. Sekhon & Philip B. Stark. Probably a second edition of the work published in 2005?



Chester W. Harris (Ed.) (1963/1967 2nd). Problems in measuring change. The University of Wisconsin Press.




John M. Gottman and Regina H. Rushe (1993). The Analysis of Change: Issues, Fallacies, and New Ideas. Journal of Consulting and Clinical Psychology, 61, 907-910. pdf




Scott T. Meier (1994). The chronic crisis in psychological measurement and assessment. A historical survey. Academic Press. isbn 0124884407

info




Jum C. Nunnally, Jr. (1959). Tests and measurements: Assessment and prediction. McGraw-Hill.




Charles D. Spielberger & Peter R. Vagg (eds) (1995). Test anxiety. Theory, assessment and treatment. Taylor & Francis. isbn 0891162127




Benjamin S. Bloom (Chairman) (1967). Proceedings of the 1967 Invitational Conference on Testing problems. Educational Testing Service.




Ronald A. Berk (Ed.) (1982). Handbook of methods for detecting test bias. Baltimore: The Johns Hopkins University Press. isbn 0801826624.


Among others: Lorrie Shepard: Definitions of bias. Janice Dowd Scheuneman: A posteriori analysis of biased items. Donald Ross Green et al.: Methods used by test publishers to 'debias' standardized tests.



W. K. B. Hofstee (1983). 'Dood hout' kappen. Psychologie, October 1983, 8-9.




W. K. B. Hofstee (1980?). Stellingen over docentbeoordeling. [paper for a seminar on this theme]




W. K. B. Hofstee (1995). Beoordelen: wetenschap of kunst? In: KNAW, Verslag van de Verenigde Vergadering van de beide Afdelingen der Akademie, 1995, 15-34.




Willem K.B. Hofstee (2009). It Ain't Necessarily So.


Wim calls this piece (unpublished?) a 'science-autobiographical essay'.



Willem K. B. Hofstee (2002). Kwaliteit van beoordelingen. Conferentie Beoordeling in het kunstbeleid, Kees Kolthoff, Herman Marres, Werkgroep Cultuurbeleid PvdA. 11 June 2002, Boekmanstichting, Amsterdam.




W. K. B. Hofstee (1964). De methode der onderlinge beoordelingen. NTvdPs 59-78.




W. K. B. Hofstee (1999). Principes van beoordeling: Methodiek en ethiek van selectie, examinering en evaluatie. Lisse: Swets & Zeitlinger. reviewed by Van der Maessen de Sombreff




Sohan Modgil & Celia Modgil (Eds.) (1987). Arthur Jensen; consensus and controversy. Lewes, East Sussex: Falmer. isbn 185000093X




Wim Bloemers (2014). De nieuwe assessment gids [het psychologisch onderzoek]. Een oefenboek. Ambo. isbn 9789026327346




Wim Bloemers (2003). Higher Education Interactive Diagnostic Inventory (HEIDI). The prediction of first year academic success for psychology students at the Open University of the Netherlands (OUNL). Dissertation, Universiteit Rotterdam. pdf pdf


Half of the students drop out within three months. So there is something to predict, one would think. Framework: Schouwenburg. Incidentally, I do not understand why, in this particular situation, one would look for personality characteristics as predictors (on p. 6 Bloemers strikes off a series of variables and is then left with personal characteristics). Wim Bloemers will meanwhile have thought of that himself.



Trudy Dehue (2008). De depressie-epidemie. Amsterdam: Uitgeverij Augustus. isbn 9789045700953




Leila Zenderland (1998). Measuring minds. Henry Herbert Goddard and the origins of American intelligence testing. Cambridge University Press. isbn 0521003636




Georg Rasch (1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press. Expanded edition of the original 1960 text.




Denny Borsboom (2003). Conceptual issues in psychological measurement. Dissertation University of Amsterdam. download




Ivo Molenaar (1972). Dit is een uitdaging. Inaugural lecture.




David Berliner (2014). Exogenous Variables and Value-Added Assessments: A Fatal Flaw. Teachers College Record. webpage




R. F. van Naerssen (1962). Selectie van chauffeurs: onderzoekingen ten behoeve van de selectie van chauffeurs bij de Koninklijke landmacht. Groningen: Wolters. Dissertation, Universiteit van Amsterdam.




Nathaniel H. Hartshorne (Ed.) (1978). Educational measurement & the law (3-28). Proceedings of the 1977 ETS invitational conference. pdf




W. J. van der Linden & E. E. Ch. I. Roskam (1985). Testtheorie. Special issue of Nederlands Tijdschrift voor de Psychologie en haar Grensgebieden, 40, 379-451.




W. P. Fisher Jr., & Wright, B. D. (eds) (1994). Applications of probabilistic conjoint measurement. Special issue of International Journal of Educational Research, 21 (6), 557-664.


The editors' introductory article is very illuminating. It all stays within the measurement paradigm, and I do miss some qualification there (the function of tests is not exclusively, and not always, to measure; other functions are often more important).



Paul Davis Chapman (1988). Schools as sorters. Lewis M. Terman, Applied Psychology, and the Intelligence Testing Movement, 1890-1930. New York: New York University Press. isbn 0814714366




W. A. F. Hepburn, et al. (chair of Mental Survey Committee) (1933). The intelligence of Scottish children. A national survey of an age-group. University of London Press. [not in UB Leiden]




Frederic M. Lord and Melvin R. Novick (1968). Statistical theories of mental test scores. With contributions by A. Birnbaum. London: Addison-Wesley. lccc 68-11394




M. Groen (1967). De voorspelbaarheid van schoolcarrières in het voortgezet onderwijs. Groningen: Wolters. Dissertation, Universiteit van Amsterdam, with propositions. [Some longitudinal studies of cohorts from the 1950s; dissertation, supervisor A. D. de Groot; uses, among others, Cronbach & Gleser (1965) as its methodological basis]


See e.g. section 13.3 (pp. 141 ff.), De waarde van validiteitscoëfficiënten voor praktische beslissingen [The value of validity coefficients for practical decisions] (cf. Wiegersma, 1964, pp. 36 ff.). A decision-theoretic treatment: suitable-unsuitable versus rejected-admitted.



Lee J. Cronbach & Goldine C. Gleser (1957/1965). Psychological tests and personnel decisions. University of Illinois Press.




Edward L. Thorndike (1904). Theory of mental and social measurements. New York: The Science Press. https://archive.org/details/theoryofmentalso00thor [the second revised edition 1916 is also available online https://archive.org/details/anintroductiont00thorgoog]




A. D. de Groot (1961). Methodologie. Grondslagen van onderzoek en denken in de gedragswetenschappen. Den Haag: Mouton. The 12th edition, largely identical to that of 1961, is available online in full http://www.dbnl.org/tekst/groo004meth01_01/




A. H. G. S. van der Ven (). Time-limit tests. Nederlands Tijdschrift voor de Psychologie, 26, 580-591.




W. Molenaar (1974). De logistische en de normale kromme. Nederlands Tijdschrift voor de Psychologie, 29, 415-420.




W. J. van der Linden & E. E. Ch. I. Roskam (Eds.) (1985). Testtheorie. Special issue of Nederlands Tijdschrift voor de Psychologie, 40, 379-451. [hardcopy]




Frederic M. Lord (1981). Problems arising from the unreliability of the measuring instrument. In Philip H. DuBois and G. Douglas Mayo: Research strategies for evaluating training (79-94). AERA Monograph Series on Curriculum Evaluation. Chicago: Rand McNally.




Ken Alder (2007). Lie detectors. The history of an American obsession. Free Press. isbn 0743259882




Sir W. H. Hadow (chair) (1924). Board of Education. Report of the consultative committee on psychological tests of educable capacity and their possible use in the public system of education. London: His Majesty's Stationery Office. paper The Committee's Report pp. 1-145. Appendices 146-238 a.o. by Cyril Burt. Full text (after August 21 2007) on www.dg.dial.pipex.com/documents/


The report itself is of course historical, but it also contains a probably interesting historical chapter, Historical sketch of the development of psychological tests. I note that it contains a history of tests, but not one of assessing/examining.



James W. Pellegrino & Mark Wilson (2015). Assessment of Complex Cognition: Commentary on the Design and Validation of Assessments. Theory into Practice researchgate.net


Introduction to the issue on assessment of complex cognition.



Paul E. Newton & Stuart D. Shaw (2014). Validity in educational and psychological assessment. Sage. [as eBook in KB] info




Paul E. Newton & Stuart D. Shaw (2013). Standards for talking and thinking about validity. Psychological Methods, 18, 301-319. [I have no access] abstract




M. David Miller & Robert L. Linn (2000). Validation of performance-based assessments. Applied Psychological Measurement, 24, 367-378. abstract



Nambury S. Raju, Reyhan Bilgic, Jack E. Edwards & Paul F. Fleer (1999). Accuracy of Population Validity and Cross-Validity Estimation: An Empirical Comparison of Formula-Based, Traditional Empirical, and Equal Weights Procedures. Applied Psychological Measurement, 23, 99-115. abstract



Craig W. Deville (1996). An empirical link of content and construct validity evidence. Applied Psychological Measurement, 20, 127-139. abstract




Samuel Messick (1993). Trait equivalence as construct validity of score interpretation across multiple methods of measurement. In Randy Elliot Bennett and William C. Ward: Construction versus choice in measurement: Issues in Constructed Response, Performance Testing, and Portfolio Assessment. Erlbaum.


Of course, if different instruments intended to measure the same construct do not do so (Stevens and Clauser, 2006, see below), then there is a serious validity problem at hand. This kind of situation does not validate Messick's construct validity approach, however, because absence of conflicting results does not indicate that the intended construct is being validly measured by both instruments. Sounds terribly Popperian.



Joseph J. Stevens and Patricia S. Clauser (www accessed 2006). Multitrait-Multimethod Comparisons of Selected and Constructed Response Assessments of Language Achievement. pdf




H.L.J. van der Maas, K.-J. Kan, D. Borsboom (2014). Intelligence is what the intelligence test measures. Seriously. Journal of Intelligence, 2, 12-15. download




Journal of Intelligence open access




Intelligence supports open access




A. Myrick Freeman, III (1993). The measurement of environmental and resource values. Theory and methods. Washington, D.C.: Resources for the Future. isbn 0915707691




A. W. F. Edwards (1972). Likelihood. An account of the statistical concept of likelihood and its application to scientific inference. Cambridge: Cambridge University Press. isbn 0521082994


My concept of the test-item collection fits Edwards's view very well. A direct analogy for my item collection is the bag of red and white marbles, from which no more than a few handfuls may and can be drawn, while inferences are nevertheless made about the ratio of red to white marbles in the bag. This marble analogy seems very useful to me for presenting the idea of the item collection as the core of my approach.

Edwards's position implies that true-score models in psychology are seldom justified, at least when attempts are made to estimate presumed true scores as well as possible (both classical true-score theory and latent trait models could be 'wrongly' used in that way). Note the distinction Edwards makes between model and hypothesis: a model is, for example, the beta-binomial model for test scores; the hypothesis could be that the parameters of the generating beta distribution have particular 'true' values. See also Edwards, chapter 4, where this is worked out further in the context of Bayes' theorem. The subject is very important to me, because my whole approach is steeped in implicit assumptions that have long been the subject of bitter dispute in the statistical world. Perhaps I should avoid the term 'true mastery'. What other options are there? 'Assumed mastery' is a nice candidate in the context of simulations: make a particular assumption about a state of affairs, derive from a specified model what consequences this has in the world, and compare those with empirical observations. That might well be an approach that Fisher's likelihood fits particularly well! Here again, both a particular model and the values taken by the parameters in that model are at issue. That can lead to confusion, and so I must be clear about it from the outset.
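
The marble analogy can be illustrated with a small likelihood calculation in the spirit of Edwards's 'support' (the log-likelihood ratio). This is only a sketch under assumptions of my own: the bag is taken to be large relative to the sample, so that a binomial model applies (a small bag would call for the hypergeometric), and the sample of 14 red marbles out of 20 as well as the function names are hypothetical. Read 'red marble' as 'item answered correctly' for the item-collection case.

```python
import math


def log_likelihood(theta, red, drawn):
    """Binomial log-likelihood of a proportion `theta` of red marbles,
    given `red` red marbles among `drawn` (constant term omitted).

    Assumption: draws are treated as independent, i.e. the bag is large
    relative to the sample.
    """
    return red * math.log(theta) + (drawn - red) * math.log(1 - theta)


def relative_support(theta, red, drawn):
    """Log-likelihood of `theta` minus that of the best-supported value red/drawn."""
    best = red / drawn
    return log_likelihood(theta, red, drawn) - log_likelihood(best, red, drawn)


# Hypothetical sample: 14 red marbles in a handful of 20.
for theta in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"theta = {theta:.1f}: support = {relative_support(theta, 14, 20):.2f}")
```

The model (binomial sampling) is fixed throughout; what varies is the hypothesis about the parameter value, which is the distinction between model and hypothesis drawn above. The sketch compares hypotheses without claiming to estimate a 'true' proportion.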



Paul G. Hoel (1962). Introduction to mathematical statistics. Wiley. LCCC 62-18992




Herbert Hoijtink en Klaas Sijtsma (2009). Meten Onder Druk. Advies aan de CEVO Inzake de Normering van Eindexamens Voortgezet Onderwijs. pdf




Fritz Drasgow (Ed.) (2016). Technology and testing: Improving Educational and Psychological Measurement. NCME. (also: Routledge). info


Very, very technocratic. In the service of big business.



Neil J. Dorans, Linda L. Cook (Eds.) (2016). Fairness in Educational Assessment and Measurement. Routledge. [PEDAG.51.e.93] info




Henry Braun (Ed.) (2016). Meeting the Challenges to Measurement in an Era of Accountability. Routledge. info




Randy Elliot Bennett, Marc M. Sebrechts & Donald A. Rock (1991). Expert-System Scores for Complex Constructed-Response Quantitative Items: A Study of Convergent Validity. Applied Psychological Measurement, 15, 227-239. abstract




Bennett, R. E., Rock, D. A., & Wang, M. (1991). Equivalence of free-response and multiple-choice items. J. of Educ. Meas., 28, 77-92.




Bennett, R. E., Rock, D. A., Braun, H. I., Frye, D., Spohrer, J. C., & Soloway, E. (1990). The relationship of expert-system scored constrained free-response items to multiple-choice and open-ended items. Applied Psychological Measurement, 14, 151-162. abstract




Randy Elliot Bennett, James Braswell, Andreas Oranje, Brent Sandene, Bruce Kaplan, & Fred Yan (2008). Does it Matter if I Take My Mathematics Test on Computer? A Second Empirical Study of Mode Effects in NAEP. open access




Suzanne Lane, Mark R. Raymond & Thomas M. Haladyna (Eds.) (2016). Handbook of test development. Routledge. info




Linda Darling-Hammond, Frank Adamson (Eds.) (2014). Beyond the Bubble Test: How Performance Assessments Support 21st Century Learning. Jossey-Bass. [KB eBook] info



Millsap, R.E., Bolt, D.M., van der Ark, A.L., Wang, W.C. (Eds.) Quantitative Psychology Research. Springer. [as eBook in KB] preview


Looks like an outlet for publishing.



ETS Research Report Series web




J. Maynard Smith (1974). Models in ecology. Cambridge University Press. Keep. Han van der Maas et al. use such models to explain 'g'.




James S. Coleman (1990). Foundations of social theory. Cambridge, Massachusetts: The Belknap Press of Harvard University Press. isbn 0674312252




Han L. J. van der Maas, Kees-Jan Kan & Denny Borsboom (2014). Intelligence Is What the Intelligence Test Measures. Seriously. J. Intell. 2014, 2(1), 12-15; doi:10.3390/jintelligence2010012 open access




P. J. van Strien (Red.) (1976). Personeelsselectie in discussie. Meppel: Boom. isbn 9060092201




Robert Coe (2002). It's the Effect Size, Stupid. What effect size is and why it is important. webpage




Saskia Wools (2015). All About Validity. An evaluation system for the quality of educational assessment. Dissertation Twente University researchgate.net




Carol Burris (2016). A master teacher went to court to challenge her low evaluation. What her win means for her profession. post by Valerie Strauss


VAM Value-added measurement.



American Statistical Association (2014). ASA Statement on Using Value-Added Models for Educational Assessment. pdf


http://nycpublicschoolparents.blogspot.nl/2016/05/breaking-lederman-decision-and-gallup.html via @dylanwiliam



Adaptive comparative judgement wiki




James T. Austin & Peter Villanova (1992). The criterion problem: 1917-1992. Journal of Applied Psychology, 77, 836-874. pdf




Susan Athey & Guido W. Imbens (July 2016) The State of Applied Econometrics - Causality and Policy Evaluation paper




Samuel A. Livingston & Marilyn S. Wingersky (1979). Assessing the reliability of tests to make pass/fail decisions. Journal of Educational Measurement, 16, 247-260. preview


As far as I can see, this is the first article in which utility functions are brought into thinking about reliability. And that is of course where things should be heading. Another point: I suspect that in the formulas they use, in which a mastery score assumed to be known once again plays a role, either that mastery score is mathematically superfluous, or it must be coherent with the specified utility function.



Miao-Hsiang Lin & Chao A. Hsiung (1994). Empirical Bayes estimates of domain scores under binomial and hypergeometric distributions for test scores. Psychometrika, 59, 331-359. [beta-binomial]


It is interesting to see how relative outsiders deal with the prevailing paradigm of Hambleton, Novick, Mellenbergh and Van der Linden. A typical Psychometrika-style article. Practical relevance has long ceased to be the issue here. preview




Fons van de Vijver (Ed.) (2001). Deskundigen over het testen van etnische minderheden. pdf




W. H. P. den Brinker (May 2002). Validering Nederlandse Differentiatie Testserie 2001. SCO-Kohnstamm Instituut.




I. W. Molenaar (1980). An insurance policy against unexpected data. (second version). Heymans Bulletins Psychologische Instituten R.U. Groningen. HB-79-447-EX. [beta-binomial]



W. Molenaar (1973). Simple approximations to the poisson, binomial and hypergeometric distributions. Amsterdam: Stichting Mathematisch Centrum. Also published in Biometrics, 1973, 29, 403-407. [beta-binomial]



Denny Borsboom, Gideon J. Mellenbergh and Jaap van Heerden (2004). The concept of validity. Psychological Review, 111, 1061-1071. pdf abstract


A fundamental discussion on the concept of validity. Criticizes the idea of construct validity as strongly advocated by Samuel Messick, and tries to establish the idea of validity as theory-based measurement, analogous to measurement in the physical sciences. The example mentioned is the testing of psychological concepts and development on the basis of Piagetian developmental psychology.



Floor Thijs, Jolanda Hoogervorst, Wim Pesch & Albert Ponsioen (2010). Vissen in troebel water. Het gebruik van de WAIS-III-NL bij (jong)volwassenen met lagere IQ's. De Psycholoog, March 2010, 38-45.




6471D&AY Leiden University. Selected chapters from Measurement and evaluation in psychology and education [Robert M. Thorndike & Tracy Thorndike-Christ], and Measurement and assessment in education [Item analysis for teachers. Cecil R. Reynolds, Ronald B. Livingston & Victor Wilson]. Pearson Custom Publishing. isbn 9781781343432


I do not think much of it, but this is apparently a book that is on the reading list in Leiden.



Jorg Huijding, Bas Hemker & Remko van den Berg (2012). Verantwoord en fair testgebruik. Welke rol heeft de Cotan? De Psycholoog, April, 47-52. restricted archive




Peter C. M. Molenaar (2004). A Manifesto on Psychology as Idiographic Science: Bringing the Person Back Into Scientific Psychology, This Time Forever. Measurement, 2, 201-218. pdf




Steven J. Howard, Stuart Woodcock, John Ehrich and Sahar Bokosmaty (2016). What are standardized literacy and numeracy tests testing? Evidence of the domain-general contributions to students’ standardized educational test performance. British Journal of Educational Psychology.


The results are of limited value, because of the small number of subjects (N=91) and their age (grade 2 pupils). More interesting might be the theoretical framework.



W. James Popham (1999). Why Standardized Tests Don't Measure Educational Quality. Educational Leadership, 56, 8-15. webpage



S. J. Howard et al. (2015). Behavioral and fMRI evidence of the differing cognitive load of domain-specific assessments. open access




Lynn S. Fuchs et al. (2012). Contributions of Domain-General Cognitive Resources and Different Forms of Arithmetic Development to Pre-Algebraic Knowledge. Dev Psychol. pdf




Feinstein (1967). Clinical judgment [UBL]


Ferdinand Mertens mentions it in his https://didactiefonline.nl/blog/blonz/de-centrale-eindtoets-en-het-schooladvies I am not doing anything with it: it lies entirely outside psychology.



John P. A. Ioannidis, T. D. Stanley and Hristos Doucouliagos. The power of bias in economics research. pdf




Jay P. Heubert and Robert M. Hauser (Eds.); Committee on Appropriate Test Use, National Research Council (1999). High Stakes: Testing for Tracking, Promotion, and Graduation. National Academies Press. isbn 0309524954 [352 pp] http://www.nap.edu/catalog/6336.html




Jay P. Heubert and Robert M. Hauser (Eds.) (1999). High Stakes: Testing for Tracking, Promotion, and Graduation. [352 pp] pdf free pdf




de Leeuw, J., & Meester, A. C. (1984). Over het intelligentie-onderzoek bij de militaire keuringen vanaf 1925 tot heden. Mens en Maatschappij, 59, 5-26. [Flynn effect] pdf


[found in gehetna: 93, Handleiding ten dienste van het onderzoek naar de Algemeene praktische intelligentie bij de keuringsraden. 1925, 1 volume (no scans available yet).]



H. Y. Groenewegen Jr (1926). Het onderzoek naar het algemene praktische intelligentie bij de keuringsraden in 1925. De militaire spectator, 95, 634-645. [hardcopy, folder diff]




J. Stroomberg (1925). De beteekenis der psychotechniek voor het bedrijf. Dissertation, H.H. Rotterdam. Reprint: Mededeelingen uit het psychologisch laboratorium der rijksuniversiteit te Utrecht, 1926




Eric Haas (1995). Op de juiste plaats. De opkomst van de bedrijfs- en schoolpsychologische beroepspraktijk in Nederland. Hilversum: Verloren. isbn 9065504222




Anne Anastasi (Ed.) (1966). Testing problems in perspective. Twenty-fifth anniversary volume of topical readings from the invitational conference on testing problems. American Council on Education. lccc 66-23128 [UBL: Closed Stack 3A 2043 B 1]


A collection of contributions to various ETS invitational conferences. Handy: an index to the contents. So these contributions are not also available as journal articles!



Richard Phelps (2005). Defending Standardized Testing. Taylor and Francis. [as eBook in KB]




Collection: Ofqual's reliability research page




James H. Capshew (1999). Psychologists on the march. Science, practice, and professional identity in America, 1929-1969. Cambridge UP. isbn 0521565855 Ch 4. Sorting soldiers, psychology as personnel management


WWII, and also WWI (Army Alpha) of course



Hickendorff, M., Edelsbrunner, P. A., Schneider, M., Trezise, K., & McMullen, J. (in press). Informative tools for characterizing individual differences in learning: Latent class, latent profile, and latent transition analyses. Learning and Individual Differences. doi:10.1016/j.lindif.2017.11.001
 preprint


Theo J. H. M. Eggen (2004). Contributions to the theory and practice of computerized adaptive testing. Dissertation, Twente. isbn 9058340569 pdf









November 2017 \ contact ben at at at benwilbrink.nl


http://www.benwilbrink.nl/literature/testpsychologie.htm http://goo.gl/BFQfp