Paper, 21 January, ResearchEd Amsterdam [YouTube livestream blocked, the organisation is working on it]

Note. My ResearchEd paper in Dutch is an outline only, following more or less the same line of thinking, except for its emphasis on Dutch publications, situations, institutions and politics, while this English paper uses international examples. It is a twin to my paper on knowledge, for a presentation on 18 January [in Dutch].

proposal & CV


I have spent a good part of my working life in deliberate practice (Ericsson & Pool, 2016) on issues of assessment in one way or another: from the design of achievement test items to transitions from education to the labour market; from ethical issues bound up with assessment to human capital theory; from Imperial exams in China to higher education admissions by lottery. I might be exposed as a fraud any moment now ;-) Luckily I have been educated as a psychologist, which has always given my sometimes adventurous work a solid footing. A disciplinary one, that’s a better characterization. As a psychologist it is fairly easy to see how many ideas and reports on education are just folk psychology, psychological misconceptions, or simply bonkers. For example, the idea of teaching for creativity is just crazy. So is reckoning with learning styles [Kirschner 2016]. The challenge then is to correct misconceptions, to replace folk psychology with sound science, to debunk sheer nonsense.

My disciplined vision on assessment is psychology: test psychology and psychometrics, the psychology of individual differences, experimental psychology, and cognitive psychology. It is my conviction that without a grounding in these subdisciplines, work on assessment risks being misleading and damaging to the vulnerable parties to assessment, especially pupils and students, and ultimately to individual prosperity and to the economy at large. There are lots of Dunning-Kruger type misconceptions here, because most people deem themselves experts in education, assessment and psychology alike, on the self-evident ground of extensive personal experience. Life has taught us psychology, hasn’t it? Or has it?

To handle the complexity, I’d like to point out four characteristics or strands of assessment, assessment taken in a broad sense, not just acts or instruments of assessment.

1. assessment is deeply political
It is not possible to work on or with assessment without making choices (or avoiding them) that are consequential for individual persons or for society at large. It is a strange thing to have to say this, for in other sectors, take for example medicine, it is self-evident that one’s professional work is for the good of one’s patients, distanced from political intervention. Professional associations have developed standards of quality in educational testing, but are they followed up? Some testing organizations, perhaps many, do not sport any psychological testing expertise at all, or have their psychometricians isolated in methodological units. So, who is going to protect public education from the testing industry?

2. assessment will have backwash effects
Assessment is not the same as clinical psychological testing. Assessment will have backwash effects, washback, or feedforward, for better or for worse. The general principle in the social sciences as well as in psychology is that subjects will talk back to whoever is addressing or assessing them: they will change their behaviour strategically.

3. never disregard assessment content
It is shameful for the discipline of educational measurement not to have even the beginnings of a theory of achievement test item writing or design, except for my own, in Dutch, based on a marriage between cognitive psychology and (Wittgensteinian) epistemology. Achievement and aptitude tests are heavily overrated as measurement instruments. They are not measurements at all, only samples from particular domains. More importantly, most of them do not comply with standards of quality, especially curricular alignment. [Hirsch] [Shulman 1986; esp.: PISA tests]
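The point that a test is a sample, not a measurement, is easy to demonstrate in a toy simulation (all numbers here are hypothetical, chosen by me for illustration): even a pupil whose true mastery of a domain is fixed will get visibly different scores on repeated 40-item samples from that domain.

```python
import random

random.seed(1)

MASTERY = 0.80   # hypothetical: the pupil truly masters 80% of the domain
ITEMS = 40       # hypothetical test length

def sit_test(mastery=MASTERY, items=ITEMS):
    """One test sitting = one random sample of items from the domain."""
    return sum(random.random() < mastery for _ in range(items))

scores = [sit_test() for _ in range(10_000)]
mean = sum(scores) / len(scores)
print(f"mean {mean:.1f}/{ITEMS}, range {min(scores)}-{max(scores)}")
```

The spread (binomially, a standard deviation of about 2.5 items here) is pure sampling error, arising before any question of item quality or curricular alignment is even raised.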

4. assessment is always more complex than estimated
It is almost not humanly possible, in any important case, to grasp all the important contingencies in these three domains of assessment impact. The specialist might stay ignorant of somewhat remote negative impacts; the generalist might not quite understand the significance of specialist theories. All the more important, then, to solicit debate, seek out counterchecks, and try one’s best. The book by Hirsch, Why knowledge matters: Rescuing our children from failed educational theories, is an inspiring example of the almost impossible becoming possible: addressing all those contingencies, not only of assessment programs but also of educational theories. The two are intimately linked, of course.

My vision on assessment is therefore, in catchwords: 1) it is deeply political, 2) the assessed will talk back, 3) it always matters exactly what is assessed, 4) it is always complex, therefore solicit counterchecks. Twitter thread. Most stakeholders pretend that assessment is not political, that the assessed will not talk back, and that content isn’t that critical. Those stakeholder attitudes will ruin public education; at the least, they will not stop ideologues and politicians from ruining it.

Rereading Hirsch’s Ch. 1, ‘The invalid testing of students’, is then an amazing experience, every page presenting many examples of the above. I will present some examples from the literature as well as from my own work on each of those four characteristics.

assessment is deeply political

Gene Glass, a top educational influencer in the US, is ashamed of what is happening in the field of educational measurement (blog). A good question, therefore: what is happening?

Even worse was to come, in the US as well as elsewhere, as we all will have experienced. Glass decided to distance himself from the moneychangers in testing:

As you surely must have noticed, there is now a huge cottage industry of home workers thinking up achievement test items to be used in tests produced by testing companies and foundations such as the American ACT, ETS, Pearson, McGraw-Hill, or the Dutch Cito. Without exception, these moneymakers (yes, the foundations too) are now developing digital products that will be forced on schools by governmental bodies and governments. All for the good of pupils and their education, of course. I do not believe one bit of this. It is a matter for political science and investigative journalism to research the networked connections between these corporate interests, governmental bodies and politicians. One of the researchers in this field is Ben Williamson (2016); you can follow him on Twitter.

Chinese Imperial exams
Another example of politicized assessment is furnished by the Chinese Imperial exams, at least some 1,200 years old. They were civil service exams, plainly intended to prevent particular families from building dynasties of power in China. (Wilbrink 1997)

Dutch lottery admissions
Some assessments are not assessments at all: take for example the case of the Dutch lottery admissions to numerus clausus studies, in particular to medicine and dentistry. Because of a lost lawsuit in the early seventies about waiting lists for students in medicine, admissions to all studies of medicine in the Netherlands were restricted to a fixed number of open places, a numerus clausus. Because pupils in secondary education were totally unprepared, a straight lottery was deemed to be the only justifiable, equitable means of selection. In 1975 parliament voted unanimously for a temporary way of selecting students for the numerus clausus by means of a weighted lottery: a straight lottery giving better chances to candidates with higher grade point averages on their exit examination. Just this year this lottery system will be terminated. I was heavily involved then, and throughout my career, together with a small group of psychologists, one of them a member of parliament in ’75. The whole idea of a lottery in this situation was that yet another examination on top of the regular exit exams would be very costly and yet would not deliver valid new information. More info on the Dutch lottery.

assessment will have backwash effects

I have spent some time in historical searches (Wilbrink, 1997) for developments in assessment practices in the world, in Western Europe in particular, up to about 1900. An interesting find was how, in the French examination of prospective teachers in secondary education, the Agrégation, the ranking of candidates was changed within a small number of years into the pseudo-standardized form of ranking that we use today: grading. It is an important insight that the ancestry of our grading habits is the old system of ranking students.

Of course, the expert in educational measurement knows very well that grading is primarily grounded in the idea of differentiating students on the basis of their achievements. It is an educational tradition, an ideology if you will. There is nothing in psychometrics or educational psychology that prescribes this stylized form of ranking students as the one and only right thing to do. Worse: this constant ranking of pupils washes back on instructional practices as well as on student strategies. On pupil welfare also: in my humble opinion the grading and standardized testing of children, and the reporting of their standing as percentile scores, is in violation of, for example, the United Nations Convention on the Rights of the Child. Routinely labeling children with percentile scores, as happens in Dutch primary education (leerlingvolgsysteem), is a form of abuse, creating for many children an unsafe environment in the school that they are forced by law to attend.

Back to test development itself. Typically, test item writers work on the assumption that pupils who have studied X should be able to answer, with some probability, whatever they, the item writers, think up as reasonable questions on X. This mirrors the assumption of psychological tests: testees are not specifically prepared. Educational reality is different: the core business of education is to prepare pupils for the achievement tests they have to sit. It matters exactly how they will be questioned, and what the quality of the test items is. Test preparation strategies need not be a problem where test questions are curriculum aligned. Adriaan de Groot (1970) therefore proposed acceptability or equitability as a new criterion, next to reliability and validity.

In plain English, the acceptability principle entails that the pupil should be able to predict his or her test score, and therefore be able to strategically prepare for the test, and subsequently accept the outcome as fair.

‘Profitability’ in the quote is what in a decision-theoretic framework would be called utility or expected utility, in this case the utility to the sponsor of the test (teacher, school, government).

Also in 1970 a colleague of De Groot, Bob van Naerssen, emphasized the importance of enabling pupils and students to prepare efficiently for sitting assessments, by presenting a mathematical model of what it is to prepare efficiently, calling it a tentamen model. The approach chosen by Van Naerssen resembled that of De Groot in choosing the pupil as the main actor, only more explicitly so. That is a quite remarkable step to take, for psychometrics typically is a science in the service of the powers that be: employers, governments, schools, teachers. Not pupils, nor their parents. Van Naerssen made good use of pioneering work by Lee Cronbach and Goldine Gleser on the use of decision theory in psychological testing. Van Naerssen ran into difficulties in the further development of his tentamen model. In later years I took over and was able to solve most of the problems, see here; it is still a project that is not quite finished, however. It is my conviction that this kind of decision-theoretic modelling, taking the pupil as the main actor in his or her own learning, is the only correct approach to modelling student behaviour under specific conditions of assessment.
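Not Van Naerssen’s actual model, which I cannot reproduce here, but a toy sketch in its decision-theoretic spirit: the pupil as main actor chooses the number of preparation hours that maximizes his or her expected utility, trading the cost of study time against the probability of passing. The utility figures and the learning curve are my own illustrative assumptions.

```python
import math

PASS_UTILITY = 100.0   # assumed value of passing the tentamen
COST_PER_HOUR = 1.0    # assumed disutility of one hour of preparation

def p_pass(hours):
    """Assumed learning curve: pass probability rises with preparation."""
    return 1.0 - math.exp(-hours / 20.0)

def expected_utility(hours):
    return p_pass(hours) * PASS_UTILITY - COST_PER_HOUR * hours

# The pupil's optimal strategy under these assumptions:
best = max(range(201), key=expected_utility)
print(best, round(expected_utility(best), 1))
```

The interesting question is then how changes in the assessment regime (cut-off scores, resit rules, item quality) shift this optimum, that is, how assessment feeds forward into study behaviour.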

Yet another take on backwash effects is by James Coleman (1990), using the framework of his social systems theory. It is possible to see pupils and teachers as two parties entangled in implicit negotiation over the scarce resources of time as well as grades. There is a nice book by Adriaan de Groot (1966), in Dutch, called ‘Vijven en zessen’, a pun that I cannot translate; it is about how teachers in secondary education use grades in competing with their fellow teachers for the time and attention of their pupils. That is more or less the idea of the negotiation in James Coleman’s mathematical model. Using my own dataset on students in the first year of the study of law at the University of Amsterdam, I was able to demonstrate the high validity of the Coleman negotiation model in this particular situation (Wilbrink 1992).
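For readers who want the flavour of Coleman’s linear system of action, here is a minimal sketch with made-up numbers (two actors, two resources; the interest and control figures are mine, not Coleman’s and not from the law-faculty dataset). Power and value are solved as a fixed point: an actor’s power is the value of the resources he or she controls, and a resource’s value is the power-weighted interest in it.

```python
# Actors: 0 = teacher, 1 = pupils. Resources: 0 = grades, 1 = study time.
interests = [[0.3, 0.7],   # teacher is mostly interested in pupils' time
             [0.8, 0.2]]   # pupils are mostly interested in grades
control = [[1.0, 0.0],     # teacher fully controls the grades
           [0.0, 1.0]]     # pupils fully control their study time

# Iterate Coleman's fixed point:
#   power_i = sum_j control[i][j] * value[j]
#   value_j = sum_i interests[i][j] * power[i]
value = [0.5, 0.5]
for _ in range(200):
    power = [sum(control[i][j] * value[j] for j in range(2)) for i in range(2)]
    value = [sum(interests[i][j] * power[i] for i in range(2)) for j in range(2)]
    total = sum(value)
    value = [v / total for v in value]

print([round(v, 3) for v in value])  # relative value of grades vs study time
```

With these numbers the grades come out only slightly more valuable (about 0.53) than the pupils’ study time (about 0.47): each side controls something the other wants, which is exactly the implicit-negotiation point.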

I have not mentioned the well-known backwash effects of high-stakes tests used to hold schools and even teachers accountable for the achievements of their pupils, resulting in too many cases in the driving out of not-to-be-tested cultural content from the curriculum, or of play time in Kindergarten. It is easy to see how destructive assessment can be; no need to consult experts like Gene Glass on this kind of problem.

never disregard assessment content

On the strand of issues with content I’d like to discuss reading comprehension and PISA Math. But let me say a few things about achievement test item writing first; after all, that is where item content materializes. Test item writing for me is equal to test item design; I will use both terms. I want to emphasize the design aspect: quality is at stake here. ‘Item writing’ is the usual term, and it suggests that there is nothing special about it, that everybody can do it. And so it goes, getting paid for every item written.

The mainstream idea about item writing is that it is an art, every new item the result of a mysterious creative process. Mainstream, here, is what handbooks on educational measurement present in their chapters on writing the achievement test item.

It must surely be possible to develop a technology or theory of achievement test item writing. Such a technology would put an end to the idea of item writing as an art; it should be a craft, delivering items with well-known specifications of what it is exactly that the particular item tests for. I’d like to do away with the idea that what is tested is knowledge in the heads of pupils. After all, we can’t look inside those brains to see what is happening there, except on the basis of cognitive architecture models and fMRI scanning. I’d prefer an epistemological approach, starting with descriptions of what exactly is the knowledge that we expect the pupils to master. What are the concepts they should learn, and which relations between those concepts?

The Newtonian law F = ma, force is mass times acceleration, is a relation between three concepts. Vocabulary is not a loose collection of words; words have meaningful associations too. Vocabulary is a big thing in education, as it is in the recent book by Hirsch. Yet vocabulary is just vocabulary: there is nothing to understand, knowing is all there is to it. The same with F = ma: there is nothing to understand here. Talking about understanding this Newtonian law is talking about other things than the law itself. Newton himself did not speculate about what that force might be, what might explain it (hypotheses non fingo). Erik Verlinde has a theory about it (it is information, dummy) (NRC, guess what: January 21). So did Albert Einstein (the bending of space-time).

I mentioned earlier that I wrote a book on the design of achievement test items. It has chapters on concepts and their relations, and one on problem solving, based on Newell and Simon’s information-processing theory of problem solving. This problem solving is not a generic skill; generic skills do not exist. Problem solving is domain specific, conditional on knowledge of the relations between concepts; the technical term is ‘productions’. The book has another chapter on questions about texts, about text fragments. I should have placed that chapter after the one on problem solving; after all, analytical and inferential questions on a given text are a subcategory of problems to solve, and those questions should be curriculum aligned as well. And that’s it. In my book on item writing there is no place for so-called generic skills, among them 21st century skills, reading comprehension, or contextual math items such as one finds in the PISA Math tests. In psychology, too, there is no place for what in the 19th century were called psychological faculties, no place for mental muscles that one might train to become more proficient. But that is exactly what those 21st century skills pretend to be: mental muscles. I have another paper on knowledge and skills, in Dutch; its references are mainly English ones, however.

Let’s take a look now at PISA Math. One of the problems with the PISA tests is that their validity is unknown. What we do know, from the PISA starting document by Andreas Schleicher, is that PISA is grounded in the ideology of constructivism, just another manifestation of progressivism, emphasizing the individual learner as well as general skills that enable the discovery of knowledge. PISA Math has its roots in Dutch RME, Realistic Mathematics Education. An uncontroversial thing to say is that constructivist math is controversial. One would expect a justification of this choice for constructivist math, or at least a warning (not just a mention) that PISA Math is a test based on constructivist math, not conventional math. None of this is to be found in OECD documents on PISA; at least I have not found it. The general idea of PISA Math is that pupils should learn to use math in situations of daily life. That is a rather muddled concept, fitting the concept of assessment centers in personnel selection better than that of achievement in education. I will leave it here, just noting that our Dutch ministers spent 500 million euro on a math exit exam that is just a look-alike of PISA Math, and it is a wreck that will be abolished soon after the upcoming election.

I must mention one specific problem with content in education: most educational research does not consider specific content at all, because the researchers do not have the expertise to consider the specifics of the math or language curriculum, etcetera. Adding to this worrisome state of affairs is the fact that American curricula differ from school to school. As Shulman (1986) pointed out, almost a century of this kind of educational research has resulted in politicians thinking content is not that important. I am not sure Shulman is right here; neglect of the importance of content is also a hallmark of the progressivism that has swayed American education since the nineteen-thirties, as described by E.D. Hirsch in his Why knowledge matters.

I promised to say something clever on reading comprehension. I do not know enough about language education; luckily E.D. Hirsch does, so read his impressive book! Dan Willingham and Gail Lovette (2014) wrote an informative piece on reading comprehension.

The general admonition is: do not try to teach skills that do not in fact exist; you are wasting the time of pupils. Generic skills, and reading comprehension is one of them, do not exist. Willingham and Lovette: “When it comes to improving reading comprehension, strategy instruction may have an upper limit, but building background knowledge does not; the more students know, the broader the range of texts they can comprehend.”

assessment is always more complex than estimated
I must be running out of time by now. A good illustration of the complexities of assessment is furnished by what is happening in the US with the Common Core State Standards and their associated large-scale testing programs: parents opting their children out, CCSS committee members opting out of the committee, disturbances of school life and curricula.

A Dutch equivalent is the 2010 law on language and math state standards (referentieniveaus) and the associated attempts to force simplistic test arrangements onto exit exams in secondary education.

The UK will have its own problems with large scale testing. For foreigners it is quite difficult to follow what exactly is happening in education in Brexitania.

I will leave it here.

not used

André Chervel (1993). Histoire de l’Agrégation. Contribution à l’histoire de la culture scolaire. Paris: INRP Editions Kime.


The idea of achievement tests as measuring instruments is at fault. Better to think of a test as a sample. Better still: as a knife, cutting through the bush to get useful samples of whatever it is that counts at the end of a course, a lecture, or an explanation.

Tweet thread: Never think of achievement tests as measurements, they’re just samples. Edgeworth already told you so, didn’t he? (read free) In search of a metaphor for achievement testing, the (medical) puncture might be useful: short duration, painful, samples suspect tissue. Aha, why puncture healthy tissue at all? #insight [educational measurement] For an application of the puncture principle see: the case of lottery-based admissions in the Netherlands (webpage).

[PISA, TIMSS] The question then also becomes one of ‘validity’ of curricula, methods, and educational structure. Good question, imho. (tweet)

Thousands of primary schools’ rankings upended by new Sats. School leaders say volatile results have vindicated their concerns over rushed implementation of tough new exams. Richard Adams and Helena Bengtsson, Thursday 15 December 2016, The Guardian (article).

See also the article by schools minister Nick Gibb here.

I do have a photocopy of Diana Pullin (1983). Debra P. v. Turlington: Judicial standards for assessing the validity of minimum competency tests. In George F. Madaus: The courts, validity, and minimum competency testing. Kluwer-Nijhoff.

Also: the 1977 ETS Invitational Conference, ‘Educational measurement & the law’ [among others a piece by Barbara Lerner]. Also by Lerner:
Barbara Lerner (1978). The Supreme Court and the APA, AERA, NCME Test Standards: past references and future possibilities. American Psychologist, 33, 915-919.
Lerner, B. (1979). Tests and standards today: attacks, counterattacks, and responses. New Directions for Testing and Measurement, 3, 15-31.
Lerner, B. (1979). Legal issues in construct validity. In: Construct validity in psychological measurement. Proceedings of a colloquium on theory and application in education and employment. Princeton, N.J.: ETS. 

In the seventies I learned a lot about principles of law in connection with assessment [Ch. 6 in ]. The idea was picked up by Job Cohen for his PhD [scan of the book made available on my website], resulting in the only Dutch treatment of the rights of students (in higher education, because of the available jurisprudence; the principles are applicable in primary and secondary education also, of course). Job Cohen had a further career in politics and as mayor of Amsterdam.

The report was written by Prof. Eric A. Hanushek from the Hoover Institution at Stanford University and CESifo, and by Prof. Ludger Woessmann from the Ifo Institute for Economic Research, CESifo, and the University of Munich, in consultation with members of the PISA Governing Board as well as Andreas Schleicher, Romain Duval and Maciej Jakubowski from the OECD Secretariat.

Jakubowski, M. (2013), Analysis of the Predictive Power of PISA Test Items, OECD Education Working Papers, No. 87, OECD Publishing, Paris.
The predictive power of the PISA test items for future student success is examined based on data from the Longitudinal Surveys of Australian Youth (LSAY) for the PISA 2003 cohort. This working paper analyses how students’ responses to mathematics and problem-solving items in PISA 2003 are related to the students’ qualifications in education in 2007 and 2010. The results show that items do differ in their predictive power, depending on some of their deep qualities. PISA mathematics and problem-solving items are grouped into various classifications according to their qualities. This paper proposes 16 new classifications of items. Among mathematics-specific item classifications, two are found to be significantly related to future student success: those that assess knowledge, understanding, and application of statistics; and those related to rates, ratios, proportions, and/or percent. These items frequently require students to apply common mathematical concepts to solve multi-step, non-routine problems, think flexibly, and understand and interpret information presented in an unfamiliar format or context. Among classifications that are not specific to mathematics, items that were classified as using reverse or flexible thinking are found to be related to student qualifications in both mathematics and problem solving. These items require students to be able to think through a solution at various points during the problem-solving process, not just at the start.

Peter W. Airasian & George F. Madaus (1983). Linking testing and instruction: policy issues. JEM 20, 103-..

Pearson, IBM Watson and cognitive enhancement technologies in education Posted on November 4, 2016

by Ben Williamson Ben Williamson blog

Ou Lydia Liu, Lois Frankel, Katrina Crotts Roohr (2 June 2014). Assessing Critical Thinking in Higher Education: Current State and Directions for Next-Generation Assessment. text

Michael Fordham (20 June 2015). Why don't we have set texts in history? blog

Julia Lerch, Patricia Bromley, Francisco O Ramirez, John W Meyer (December 13, 2016). The rise of individual agency in conceptions of society: Textbooks worldwide, 1950–2011 research-article free access


Richard P. Phelps (2014). Synergies for Better Learning: An International Perspective on Evaluation and Assessment. Article in Assessment in Education Principles Policy and Practice · October 2014

Han L. J. van der Maas, Conor V. Dolan, Raoul P. P. P. Grasman, Jelte M. Wicherts, Hilde M. Huizenga, and Maartje E. J. Raijmakers (2006). A Dynamical Model of General Intelligence: The Positive Manifold of Intelligence by Mutualism. Psychological Review, 113, 842-861. pdf

Daisy Christodoulou, Making Good Progress? The future of Assessment for Learning. ISBN: 978-0-19-841360-8. Publication date: 02/02/2017. Paperback: 224 pages.

Christodoulou preview, among other things on the failure of AfL and what Wiliam thinks of it: politics has twisted it into AoL, assessment of learning, and into the monitoring of pupil progress, instead of using it to empower students. preview

Poet: I can’t answer questions on Texas standardized tests about my own poems

Does Emotive Computing Belong in the Classroom? By Ron Spreeuwenberg Jan 4, 2017 blog

Richard Phelps (13 January 2017). Study Highlights Best Practices For Establishing and Updating K-12 History Standards. Via Pioneer Institute. tweet

Katy Murphy (October 9, 2016). Grim dropout stats force California colleges to rethink remedial education The Mercury News blog

Placement tests that misplace students. It is a classic methodological point that anything intended to be remedial has to be evaluated for whether it actually works. Think decision-theoretically. For the methodology, see my 1980 articles (in Dutch) in Tijdschrift voor Onderwijsresearch.

Are schools ready for the power and problems of big data? By Benjamin Herold . (The Future of Big Data and Analytics in K-12 Education) EdWeek webpage

Helen Ward (18th January 2017). Primary assessment: 5 reforms proposed by experts today tes article online

Report of the ASSESSMENT REVIEW GROUP (January 2017) Redressing the balance. download

David R. Krathwohl (2002). A revision of Bloom’s taxonomy: an overview. Theory into Practice, 41 #4, 212-218. pdf

20 January 2017. Contact: ben [at-sign]
