Nederlands Tijdschrift voor de Psychologie en haar Grensgebieden, 26, 360-376. Didakometrisch en Psychometrisch Onderzoek, juni 1970. see here for a transcript.


De Groot (1970). Some badly needed non-statistical concepts in applied psychometrics


The article contains a number of controversial statements. De Groot was perfectly able to see how his statements would be controversial, but it was his debating style to deliberately use controversial statements.

De Groot's ideas on acceptability are important in themselves. They were never elaborated further by him, nor by others, with two exceptions. In a non-trivial way the tentamen model articulated by Van Naerssen (1970) may be regarded as an application of the idea of acceptability in a pragmatic, decision-theoretic way. The dissertation by Job Cohen (now the mayor of Amsterdam) on 'study rights' shows another way to make acceptability a reality in university life. This book therefore should be available in the public domain.

Adriaan de Groot died in 2006. He made important contributions to cognitive psychology as well as to methodology in psychology, and almost single-handedly introduced multiple choice testing in education in the Netherlands.


Ben Wilbrink

This article may seem to have been written in an off-hand way. In reality, though, it contains a lot of ideas on critical issues in education and educational assessment, implicitly referring to earlier work, or containing the seeds of promising future work on the issues involved. The article therefore is an important key to the thinking of Adriaan D. de Groot, and that of many of his colleagues in the Netherlands, on issues of assessment, especially so in education.

As an annotator I will try to be objective in my observations. That objectivity will however be frustrated by the fact that I am not a neutral observer here. I have participated in discussion and debate on most of the issues mentioned in the article. Therefore I will disclose my relation to De Groot, and my position on the important issues.

I have not been a student of De Groot, having studied in Utrecht. His 'Methodology' (translated in 1969) must have been on my list. I began my career in a small research center of his University of Amsterdam. De Groot and Van Naerssen, among many others, were on the center's advisory board then, De Groot having strongly supported Kees Kolthoff's idea to establish such an educational research center. These were years of student protests and even revolutions, especially so at the University of Amsterdam. There were fierce discussions in the nation's papers on selection, heritability of differences in intelligence, and lotteries to decide admissions to numerus clausus studies such as medicine, discussions triggered more often than not by positions taken publicly by De Groot. In 1972 I publicly supported his position on heritability. In 1975 De Groot publicly declared my position on admission by lottery as defeated by arguments from a mathematician and a statistician, quod non. Etcetera. In 1977 I publicly attacked De Groot's idea of multiple choice items as being 'objective': selecting the alternatives and declaring one of them the 'best' is a subjective act, isn't it? Later I asked him why he always said multiple choice items (they should be four-choice as well!) were objective. The reason was pragmatic: the newly established Cito (a Dutch ETS, let's say) should not be bothered with a host of different item formats! Well, not quite the standard of science upheld in his Methodology! De Groot gave me permission to publicize his confession, so here it is. In 1984 he tried to come to the rescue of the educational center, now in danger of being abolished, regrettably in vain. In the nineties we regularly exchanged letters on issues of assessment, until he indicated his health did not allow him to continue writing. One such issue: what is the history of our habits of grading students' work? De Groot, of course, wrote a famous work (1966) on grading: 'Vijven en zessen.' He never even thought there might be some history involved here. Good question Ben, amazing! While we did not know each other very well personally, there was a fine understanding of each other's work, as well as of the differences in opinion we had. The last time I saw him was in a 2006 documentary on the island of Schiermonnikoog. He figured there, among other inhabitants, as a friendly old man in his nineties, being cared for by his wife and the island's nurse.

Down to the nitty gritty. The annotations will more often than not be at the level of individual sentences. I will count sentences within paragraphs. Therefore, 3.2.7 refers to the seventh sentence in paragraph 3.2, note 1.2 to the second sentence in note 1. If you want to return regularly to the original text, you might open a second window to keep the original text at hand.

note 1.1.

Was there something in the air making De Groot sensitive to the issue of acceptability? It is undoubtedly the case that this idea of acceptability already made him write his 1966 'Vijven en zessen.' One may suspect two important influences here: the work of Hartog and Rhodes (1936) on the grading of essays, and Posthumus (1940) on the constant percentage of pupils flunking grades in the HBS, in the period roughly from 1875 until 1940 (the world changed multiple times over, yet this percentage stood as a kind of gold standard ... unbelievable). Even more important must have been his personal experiences as a math teacher. On top of that, in the sixties the times were changing: acceptability became an issue on the streets of Amsterdam itself.

During the 1966 'Holland Festival' in Amsterdam, the police did not know how to react to playful provocations other than by using force. So yet another festival happened beneath the 'Holland Festival' banners on the Damrak and Rokin. The rule of the 'regenten' was no longer accepted or taken for granted. (There does not seem to be a good translation of this term 'regent.' It is an aristocratic governor, being chosen in that position by the aristocracy not on the basis of merit but on that of birth. That is a caricature, of course, but the unhappy mayor of Amsterdam, Van Hall, did just the things to strengthen exactly this kind of image.) University professors also could be vulnerable in this respect, very much because of their own behavior. Authority, of course, is authority only in as far as it is accepted as such. People, especially students, no longer unthinkingly accepted authority. On the streets of the world in 1967 America's president Johnson was called a murderer. In Paris 1968 students nearly toppled Charles de Gaulle from his presidential seat. The Dutch government gave students a voice in the government of the universities and their departments. Nice country. Kees Kolthoff researched the voice students and assistant-professors had in governing the departmental affairs; his supervisor was Adriaan de Groot. This research started before the occupation of the administrative center of the university (het Maagdenhuis) and was subsequently not fully published: the research itself was made obsolete by the events of spring 1969.

note 1.2.

These must have been memos from the 'Afdeling Examentechnieken,' stencilled reports distributed in small numbers. I do not have any of them. It might be the case that De Groot read some papers at the big national congresses on educational research in 1966 and 1968 (the one with a lot of critical students in the audience); I will have to check on that.

note 1.4.

Empirical research on questions of acceptability was not undertaken, at least I am not aware of its existence. The Ph.D. research by Job Cohen uses empirical data, of course, being the verdicts of commissions of appeal in matters of assessment, not quite the kind of empirical research De Groot had in mind.

1.1.1.

'Psychometrics is a social science.' Psychometrics, to De Groot, is an inclusive concept: it should include the ways its techniques are applied, include questions of equity and ethics of those applications as well. It is everything the APA Standards (1999) are about—that's a lot. It is everything that psychologists deem important and label as (construct) validity.

This psychometrics—construct validity—might be a powerless concept because it might mean anything to anybody. That is exactly the point taken by successors of De Groot at the psychology department in Amsterdam. Borsboom, Mellenbergh and Van Heerden (2004) take construct validity to task, and promote exactly what De Groot in his par. 1.2 deplores: they choose a realistic, natural science, position. If one tries to measure something psychological, the assumption is for that something to really exist. If one cannot assume real existence, then why measure that trait etc. at all?

The clever thing to do, of course, is to distinguish measurement proper from the uses of tests and test results. This article does not consistently do so, and neither do the APA Standards. Borsboom et al. do not make the distinction either; e.g., educational 'measurement' being more like a 'social contract' between stakeholders in education does not fit nicely in their conception of validity, except on the level of individual items being valid or not to the purpose at hand.


Measurement theories: De Groot discusses them in his Methodology. Test theories: Lord and Novick (1968) is exactly what De Groot must have had in mind. In those years De Groot and Van Naerssen worked on their book on educational assessment, which appeared in 1969, and was in fact this test theory applied to the construction and use of multiple choice tests in education. Test theory in the latter book was so dominant, that issues of acceptability were not discussed seriously.


This statement is somewhat ambiguous. Of course De Groot himself endorses strong measurement and strong prediction. The point he wants to make probably is this: More important than strong measurement and strong prediction is a strong conception of what it is that is to be measured etcetera.

The problem throughout this paper is that the distinction made between different stakeholders remains somewhat ambiguous. Surely, testees are the focus of the article, but the article somehow does not quite succeed in regarding them as the primary decision makers, as Cronbach and Gleser would say. De Groot tries to protect them; he does not give them a voice. There is a flavor of authoritarianism here, even though the intentions are commendable. I have to add immediately that even forty years later the thinking of educational measurement experts remains fundamentally authoritarian. Assessment in the medieval universities was more equitable than it is today. Do not misunderstand me: the issue is not whether students should formally assess their own accomplishments (generally they should not) but exactly their empowerment to be able to choose optimal strategies. The 'empowerment' is just what is missing from De Groot's transparency definition (3.4.1); he does not see the testee as possibly the primary actor in education. In this respect Van Naerssen (1970) in his tentamen model took a somewhat stronger position, while not quite attaining the concept of empowering the testees themselves.


One of the first serious applications of the decision-theoretic framework as outlined by Cronbach and Gleser (1957) was in the selection of drivers in the Dutch army. A summary of this work by Van Naerssen appeared as an appendix in the 1965 edition of Cronbach and Gleser's book. Personnel selection is more or less the only field where decision theory has been applied successfully; see in particular the work of Schmidt around 1980 on the quantification (in dollar value) of expected utility, and my own 1990 simulation of expected utilities of complex selection procedures.

How about using decision theory in educational measurement? The remarkable fact is that a lot of work in this field has been done by Dutch psychologists in the seventies (reviewed by Wim van der Linden in 1980, Applied Psychological Measurement), using statistical decision theory while neglecting the economic decision theory as used by Cronbach and Gleser. The exception is my own work, published in 1980; it was not understood by my friends using concepts from statistical decision theory. Economic and statistical decision theory, of course, are related to each other; the main difference is in the use of the normal or extensive method of evaluating expected utilities, a crucial difference because the mathematics of the statistical approach is dazzling.

De Groot therefore was in a position to make good use of the Cronbach and Gleser approach. Especially so because in their book an important issue is the difference between institutional and individual decision making. The testee as decision maker, was De Groot aware of this difference? The article leaves one with the impression that De Groot missed the opportunity to develop his concept of acceptability in this direction of the testee as individual decision maker.

note 2.

This is not clear. A source of misunderstanding is the distinction between utility and expected utility. Distinguishing between utility and expected utility would have made it possible for De Groot to make his acceptability and transparency more precise and quantify them. Instead, the article gets stuck in qualitative analyses of pass-fail scoring etcetera. However, to do so De Groot would have to develop the kind of model Bob van Naerssen was working on: the tentamen model. Nevertheless, it might have been better for De Groot to use the concept of utility, instead of inventing his own terminology.


Sloppy analysis here makes De Groot blind to the important issues in a dramatic way. He associates 'utility' and 'values', uses the dictum that it is not useful to compare values of different persons, and declares the decision-theoretic way closed.


Of course, profitable to the testee should have been De Groot's answer to this question. The other stakeholders would and should agree with this choice. If preparing for the test would not be profitable to the testee, why would he do so? It is in the interest of all stakeholders to give the testee a big stake in preparing for the test, at least if that test is an achievement test. The case of intelligence testing is a game in quite another league. Well, this is exactly the position taken by Van Naerssen in his (1970) tentamen model: the testee is the primary decision maker, the other stakeholders have to decide secondarily on examination rules, quality of tests, etcetera.


An important omission here is the probability function over the utility. The utility function over possible outcomes, however weighted, in itself does not contribute anything useful to the decision making of the individual involved, the testee. The testee also needs information on the probability of each of the possible outcomes becoming manifest. Weighting the utility function with the probability function results in the expected utility. This expected utility may be better or worse than the expected utility of an alternative, e.g., the alternative of investing more preparation time. For the technique see my 1980 publication.
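
The weighting of utilities by probabilities can be sketched in a few lines of Python. This is only an illustration under loudly stated assumptions: the two outcomes, the utility values, the outcome probabilities for the two preparation strategies, and the cost of extra preparation time are all hypothetical numbers chosen for the example, not values taken from any of the models cited here.

```python
def expected_utility(utilities, probabilities):
    """Weight each outcome's utility by its probability and sum the results."""
    if set(utilities) != set(probabilities):
        raise ValueError("utilities and probabilities must cover the same outcomes")
    if abs(sum(probabilities.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(utilities[outcome] * probabilities[outcome] for outcome in utilities)


# Hypothetical utility function of one testee over two possible outcomes.
utilities = {"fail": 0.0, "pass": 1.0}

# Hypothetical outcome probabilities under two preparation strategies.
current_preparation = {"fail": 0.4, "pass": 0.6}
extra_preparation = {"fail": 0.2, "pass": 0.8}

# Hypothetical cost of the extra preparation time, in the same utility units.
cost_of_extra_time = 0.1

eu_current = expected_utility(utilities, current_preparation)
eu_extra = expected_utility(utilities, extra_preparation) - cost_of_extra_time

print("current preparation:", eu_current)
print("extra preparation, net of cost:", eu_extra)
```

With these made-up numbers the extra preparation comes out ahead; with a higher cost of time or a smaller gain in pass probability it would not. The utility function alone cannot settle that comparison, which is the point of the annotation.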

"Obviously, however, this is impossible." Is that a fact? No. These things can be done. They have been done. The most elaborate model is my Strategic Preparation for Achievement testing model, the SPA model. This model is a variant and extension of the Van Naerssen tentamen model.


Determining utility functions is a routine procedure, already described in student handbooks such as Schlaifer (1959). The other assumption, that value systems of individuals should be combined, is curious. Why should it be necessary to consider only combinations of individual utility functions? It is perfectly possible, e.g., to consider only the utility function of one particular individual, or of some rather typical individuals separately. There is no necessity first somehow to combine values that are not commensurable. This misconception resembles that of the statistical decision maker who seems not to be aware of the possibility to consider utilities at the possible decision point only, instead of combining somehow (expected) utilities over the whole length of the score dimension (normal form versus extensive form analysis, see my 1980 publication).


The discipline of law. At the time of writing, the idea that the discipline of law might offer solutions to particular problems haunting psychologists and psychometricians was quite unusual. My own confrontations with legal restrictions or the limitations of justice were in the early seventies (admissions and numerus clausus) and the later seventies (the law in a general sense limiting what teachers are free to do in assessing students, a theme later to be elaborated by Job Cohen in his dissertation). In America it was only in the seventies that a lawsuit against the University of California (Bakke) was a wakeup call to psychologists involved in selection.

note 3

I have not seen the memorandum 'Psychology and democracy' of 1966.


Equitability is not synonymous with acceptability. But then, in the sixties equity theory was not yet present, neither had John Rawls published his Theory of justice. De Groot was looking for the right terms. Job Cohen was to substitute some of his terms by ones that had a clear meaning in law or jurisprudence. E.g., in Dutch: 'kenbaarheid' (what is knowable) rather than 'doorzichtigheid' (transparency).

note 4

The literature mentioned might be brought up to date [e.g., Owen (1986) None of the above; Hanson (1993) Testing testing; Wiggins (1993) Assessing student performance; Lemann (1999) The big test] but the analysis of psychologists taking a defensive attitude against criticisms of testing, still is valid in 2007.


De Groot fails to address the question of costs of more acceptable procedures to the parties and individuals involved. More acceptable procedures might cost lots and lots of time of the teachers. Teachers needing weeks or even months every year for the scoring and grading of tests and papers therefore do not use that time to instruct students directly. This does not seem to be a tradeoff that in itself is equitable. It is in the interests of the students first and foremost to get good instruction. However this may be, the point is that acceptability comes at a price, it might be a snake biting its own tail. It is not obvious at all that 'more' acceptable procedures in this wider sense always are acceptable.

Of course, massive costs of testing were highly unusual in the sixties and therefore of no concern to De Groot. Not any more, though, in the nineties or in this new century, in the Netherlands (e.g., 'leerlingvolgsystemen') or in the US (the No Child Left Behind Act and its massive use of high stakes testing).


Note the use of the concept of a contract, meant to be taken literally as a contract between the sponsor of the test or examination, and the testee or examinee. The analogy here definitely is the contract between employer and employee. De Groot later repeatedly used the contract idea in analyses of selective admissions: the institution admitting the candidate offers her a contract containing the rights and duties of both parties. The duty of the institution is to offer instruction of good quality and to guarantee a successful educational career; the duty of the student is to make appropriate use of the instructional opportunities offered. In recent times the idea has surfaced again in a rather one-sided form in Dutch higher education: students not earning a certain level of grades at the end of the first year of study should leave the institution. This kind of coercive contract would not get the approval of De Groot.

3.2.1 objectivity

Objective procedures need not be free of subjective bias (see annotation note 6). De Groot is claiming way too much here. Historically, De Groot might have been motivated especially by abuses in oral examinations to advocate multiple choice tests: he never seems to contrast multiple choice testing and short open-ended testing. Indeed, these two forms are generally regarded as approximately equally objective.

note 6 objectivity

De Groot had a sense of urgency promoting the use of what he used to call 'objective' achievement tests, explained to be tests consisting of four-choice items exclusively. The exaggeration was politically motivated, he confided to me in writing (and allowed me to publish it), in order to avoid the newly established testing institute in Arnhem, the Cito, being drowned in technicalities and discussions of a multitude of different item formats that otherwise might be found attractive to use also. De Groot must have known that already the use of the term 'objective' is an exaggeration here: American usage of the term 'objective' included short open-ended items that could not be scored by machine. The closed form of multiple-choice was nicknamed 'frozen subjectivity' in the US. The latter term eminently indicates the problem with the definition De Groot gives in this note 6, a definition he gave elsewhere also. Of course, the items themselves have to be designed by human effort, the wrong alternatives have to be chosen by humans, humans have to choose the best alternative in each multiple-choice item. There simply is no way in which it ever would be possible for a machine to 'objectively' execute all these design decisions also. If it did, it would have been cleverly programmed to do so, by whom?

In the Netherlands it was an article of faith, a dogma, for all humans other than Adriaan de Groot, that multiple-choice tests were objective and therefore of the highest quality possible. The Cito in the seventies still used four-choice items exclusively. I told them in 1977 to loosen up somewhat, because the holy MC was not holy after all, an insight that would allow them to wholeheartedly explore the use of open ended achievement test items. Boy, this 'complete objectivity' really was a big issue in the Netherlands! A drag on progress, really.

3.2.2 objective, therefore equitable?

De Groot claims: "objectivity of procedures is a prerequisite to equitable decision making." The assumption, then, is that decisions have to be made, and therefore assessment procedures are needed.

In 1972 (Selectie) De Groot emphasizes that practicals ('H-onderdelen' he called them) need not be completed by having students take a summative test also. That is an interesting position taken: students need not always be tested, lab work may be assessed on the fly, subjectively. The idea of subjectivity being allowed in the lab situation, of course, is that students will not be failed for their practical unless they never did turn up, burned down the classroom, etc.

3.2.5-6: objectivity? it depends

The argument therefore seems to be: Whether objectivity is required depends on the contingencies.

De Groot is thinking hard about what acceptability might mean to students. He knows there are no empirical data on hand, yet he continues thinking about what might be advisable to implement from an institutional standpoint (such as more objectivity), instead of how exactly we might obtain that empirical information, what questions one then should ask of future testees, or how one might control for different aspects of acceptability in realistic situations.

3.3.5 the next draft, flunking: does it matter?

Minor point: the same judges already were responsible for guiding the student to produce an acceptable result the first time! This is sloppy assessment.

Prolonged training because of flunking a grade, course, or project surely will result in some learning effects. But will these be enough to compensate for the time lost? De Groot says so, but this is exactly the question in a lot of rather entangled research on the effects of 'flunking the grade'—generally being answered in the negative, surely so in systems where flunking the grade is massive, such being the case in the Netherlands.

Complete transparency

'Complete transparency' of a testing and decision situation obtains, by definition, if the subject has available to him all the information he needs for developing his personal, best possible test preparatory and test taking strategies.

3.4.1 complete transparency

The problem with the definition given by De Groot is that it is contingent on whatever test and test situation is offered the student: As long as test and test situation are made transparent to him, the test and test situation might even be abominable. Can one get transparency without acceptability? That evidently seems to be the case.

3.4.6 no rights?

According to De Groot applicants, also applicants for admission to private institutions of education, might have 'no official rights' at all. One wonders how much De Groot really discussed points like this with his acquaintances in the Netherlands or in the US (e.g., Brown v. Board of Education 347 US 483 (1954)!). Of course applicants have rights, or might claim rights, or even take rights. However this may be, in the following he voices his expectation that applicants in the near future indeed will demand transparency. Did they? Almost fifty years later one cannot but answer the question in the negative.

3.5 implications of transparency

This will not do. De Groot sets out to sum up the implications of complete transparency, but instead presents sloppy formulations of particular states of affairs.

(1) "Non-objective procedures are non-acceptable." How come? Objectivity is a matter of degree, isn't it? What is non-objective, then? Is it 100%, or can it be 51.3% also? Being non-acceptable, is that an absolute thing also? Or could it be, e.g., 1%? A little bit? Once in a hundred times?

(2) "Questionnaires are non-acceptable." As far as the scoring keys are secret. But why is a secret scoring key in itself a threat to transparency? If my life depends on my true answers to a questionnaire, e.g., a medical one, wouldn't it be dangerous for me to know the scoring keys?

(3) Here De Groot presents the clear case of achievement testing, the kind of testing one should prepare oneself adequately for. In an important sense, Van Naerssen's tentamen model offers the possibility to quantify what transparency might mean in this kind of situation: the situation is to be characterized in a number of variables that might take different values according to whatever the teacher's or institution's policies in achievement testing and examining are.

(4) Obviously what De Groot is writing about here is the test instruction, e.g., what to do if one does not know the right answer or the best answer to a multiple choice item, not the test scoring key.

(5) The point simply is this. Take for example a test consisting of 20 four-choice items and 1 essay type question. The essay type question might be scored on a scale from 0 to 20, while each MC-item scores 0,3 or 1. Such information should be known to the student, he should understand its meaning thoroughly.

De Groot here is thoroughly confused on the relation between different weights of items and their difficulty. The phrase 'of equal average difficulty' is testimony to that. Difficulty of item A is a group characteristic. What might be a difficult item for 'the group' of students will be a known item to a number of students from the group. Transparency always is transparency to this or that student, or to most of the students, preferably to all of them.

(6) Here De Groot is not quite transparent in his explanation of what it is about cutting scores that should be transparent to students. As it stands, these three sentences had better be removed from the article.

Yet this failure is remarkable, for De Groot had published somewhat earlier his own method for setting cutting scores: the kernitemmethode, the method of core items. This method is almost impossible to explain to students, I think. The idea is that some achievement test items represent the core of the subject matter better than others. Identify a number of those 'core items', and use their difficulties to determine what the cutting score should be. This is highly unsatisfactory. It does not explain why it is that some achievement test items should be acceptable as test items, while not representing the core of the subject matter. Identifying which items belong to the core, which items do not, evidently is 'non-objective' and therefore 'non-acceptable' (point (1) above). It does not explain how to translate the mean difficulty of these so-called core items into an acceptable cutting score. And so on, and so forth.

What does not seem acceptable to De Groot is grading on the curve, Downie's method, because in those cases the resulting cutting score will depend on whatever other students also will take this same test. Yet, in the case of selection, a numerus clausus, e.g. in admissions, it would suffice to tell the applicants what the quota is ... That information does not tell them how well they should prepare to win a seat! Yet De Groot does not call this kind of situation one that inherently is non-transparent.

Students must have a clear idea of the level of achievement required for a pass. This sounds nice, of course they should have a clear idea. But what does it mean? Just ask the student what she thinks her current capability is, and how it relates to the level of mastery (a number correct score) required for a pass. One critical ingredient here is the chance aspect in sitting the test: given the student's mastery, the test score follows a binomial distribution, doesn't it? What do students know of chances on tests?
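
To make the chance aspect concrete, here is a minimal sketch of the probability of passing under the binomial model. It assumes that each item is answered correctly with a probability equal to the student's mastery, independently of the other items; the test length of 20 items and the cutting score of 12 are hypothetical values chosen for illustration only.

```python
from math import comb


def pass_probability(n_items, mastery, cutting_score):
    """Probability of at least `cutting_score` items correct out of `n_items`,
    assuming each item is answered correctly with probability `mastery`,
    independently of the other items (binomial model)."""
    return sum(
        comb(n_items, k) * mastery**k * (1 - mastery) ** (n_items - k)
        for k in range(cutting_score, n_items + 1)
    )


# Hypothetical test: 20 items, 12 correct required for a pass.
for mastery in (0.5, 0.6, 0.7, 0.8):
    print(mastery, round(pass_probability(20, mastery, 12), 3))
```

Under these assumptions a student whose mastery sits exactly at the required level (0.6, i.e. 12 out of 20) still has a substantial chance of failing, which is the kind of information a 'clear idea of the level required' would have to include.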

I am not finished yet. There is this little problem of quota selection. Terribly complex from the point of view of the candidate: here her chances will depend on whoever else is trying to get in, on how multiple test results will be combined or weighed to determine a rank order. The selection ratio is only one of the critical aspects here. What is worse, merit ranking in a quota situation cannot be done in a consistent and therefore fair and transparent way. The usual approach to tackle this problem is to try to make the assessment more 'objective,' a kind of fairness by fiat, which is make-believe fairness or transparency.

The question then becomes this: If assessments are so badly transparent, what does it mean to ask for transparency? There are no easy answers to this question. De Groot gives one the impression there are. Maybe a more sensible approach would have been to recognize the impossibilities here, and work out acceptable procedures on that premiss. I am not aware of later work by De Groot taking up one or more of the issues pointed out here.

(7) 'Certainty scoring' (zekerheidsscoring) has been shown to be affected by characteristics of personality, or at least people like me think that has been shown to be the case. That surely poses an acceptability problem for many kinds of complex scoring, or even for choice items in general because the tendency to guess or not surely is not equally distributed across all types of personality. Previous coaching (or lots of experience) might attenuate the connection. Today, certainty scoring does not seem to be used in the Netherlands. It is in the UK, though.

The point here seems to be that whatever scoring method is used, even in the most transparent of ways, there always is a darker side to it. How dark must it get to become non-acceptable? Again, it depends, in part it depends on the personality of the testee.

Decision procedures also may be more transparent or less transparent, and independently of their transparency they may be more acceptable or less acceptable, depending on the personality and other characteristics of individual testees and on a host of particulars of the testing and decision situation. At least one needs a full-blown tentamen model here, i.e. a model that allows utilities to be specified in a coherent way (not any wild guesses, as was the case in the dissertation of Van der Gaag).

3.6 justifiability of differential decisions

One really needs a good theory of justice here. The one by Rawls would do; surely he had published enough material by 1970 that would have been useful to De Groot. Instead, De Groot goes for a catch-as-catch-can with psychometrics. That might be interesting too.

De Groot is unnecessarily restrictive in speaking of differential decisions only. The broader concept here is grading, Vijven en zessen as De Groot wrote himself (his 1966 book under that title, impossible to translate but meaning 'shilly-shallying' about grading). In education, almost all grading essentially is ranking. Through gradual evolution, old-fashioned ranking turned into the kind of grading used in the 20th century (for details see my (1997 pdf) Assessment in historical perspective). Grading, therefore, is a form of merit ranking (for some literature on the subject and a few fascinating issues in relation to grading, see my page ../projecten/meritranking.htm). Contemporary literature on desert treats the justice aspects of merit ranking (e.g., Serena Olsaretti (Ed.) (2003). Desert and justice. Oxford University Press).

3.7 a minimal difference might be decisive

Sure, 'psychometric arguments are never sufficient for a justification of differential decisions near the border line.' What arguments will be sufficient is a question not even mentioned by De Groot. We are still in the domain of theory of justice here, and De Groot does not warn the reader that such is the state of affairs.

Instead, he starts an argument about good and bad luck, and even so fails to distinguish plain error in the testing from the sampling characteristics of testing, which are two totally different things. Achievement tests are samples of items drawn from an appropriate domain of test items; repeatedly testing the same testee using new randomly sampled tests will result in a test score distribution known as the binomial distribution. Scores 'deviating' from the mean are not in error. Some of them might be lower than the passing score set on the test; the procedure of using passing scores thus introduces a chance element in the decisions to be taken on the student's progress. True errors result from, for example, mistakes in the design of particular items, the hour of the day, noises on the street, having had a sleepless night, a wrongly keyed item, a prejudiced jury member.
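The sampling argument can be made concrete with a small sketch. It assumes the simplest binomial model: a testee with a fixed mastery level (the proportion of the item domain she knows) sits a test of randomly sampled items. The function names and the numbers (40 items, mastery 0.7, passing score 24) are hypothetical illustrations, not taken from De Groot's article.

```python
from math import comb


def score_distribution(n_items: int, mastery: float) -> list[float]:
    """Probability of each number-correct score 0..n_items (binomial model)."""
    return [
        comb(n_items, k) * mastery**k * (1 - mastery) ** (n_items - k)
        for k in range(n_items + 1)
    ]


def fail_probability(n_items: int, mastery: float, passing_score: int) -> float:
    """Chance that a randomly sampled test yields a score below passing_score."""
    return sum(score_distribution(n_items, mastery)[:passing_score])


# Hypothetical case: the expected score is 0.7 * 40 = 28, well above the
# passing score of 24, yet some sampled tests will still come out below it.
print(fail_probability(n_items=40, mastery=0.7, passing_score=24))
```

None of the scores below 24 is an 'error' in this model; the chance of failing stems purely from item sampling combined with the use of a passing score.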

Standing as a rock is De Groot's observation that testees themselves will have to bear the risk of whatever variability of test scores they may encounter on their journey through education. Or in situations of psychological testing, for that matter. In his analysis one misses two things: first, that testees, human beings in general and even statisticians typically are not very good at estimating these risks, to say the least; and second, that some institutional arrangements might be a lot less vulnerable to these kinds of risks than others. I am hinting here at the objectionable custom in many European countries of setting passing scores on separate tests, grade results, exams, or whatever (except on separate items in tests, etcetera). Terrible amounts of valuable student time get wasted by requiring students to repeat tests, grades, or whatever it is that they did not 'pass.' The problem, called conjunctive (passing scores set) versus compensatory (grade point average) regulation of exams etcetera, was recognized as such by De Groot and Van Naerssen, yet only Bob van Naerssen tried to deal with it explicitly in a number of his publications. Of course, in practice the conjunctive method gets compromised by allowing a little compensation here and there (as well as within separate tests, but somehow teachers seem unable to recognize this simple fact).
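A minimal sketch of the two kinds of regulation, assuming grades on a 1 to 10 scale and a passing grade of 5.5. Both the scale and the threshold are illustrative assumptions, and the function names are hypothetical.

```python
def passes_conjunctive(grades: list[float], passing_grade: float = 5.5) -> bool:
    """Conjunctive rule: every separate result must reach the passing grade."""
    return all(grade >= passing_grade for grade in grades)


def passes_compensatory(grades: list[float], passing_grade: float = 5.5) -> bool:
    """Compensatory rule: only the average has to reach the passing grade."""
    return sum(grades) / len(grades) >= passing_grade


# Hypothetical student: strong overall, one weak result.
grades = [8.0, 7.0, 5.0, 9.0]
print(passes_conjunctive(grades))   # False: the 5.0 has to be repeated
print(passes_compensatory(grades))  # True: the average is 7.25
```

Under the conjunctive rule every separate test carries its own chance of an unlucky score, so the risk borne by the student accumulates over tests; under the compensatory rule a single low score can be offset by the others.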

Quite another approach is explicitly showing students what the odds are. I have tried to do so in the dental curriculum as well as in the first year of the study of law at the University of Amsterdam (see Voorthuis and Wilbrink (1987 html); the publication is in Dutch, but the graphics will tell the story). Analysis of the data from the law project did show students to be able to make a good guess at what their result would be on the test they were about to sit, except, that is, for the very first test, the one where they still were not sure about their own standing in the group.

3.8 what students might deem acceptable, or not

This paragraph is rather weak. It would have helped for De Groot to present empirical data on what students in fact do think about these things, but such data have never been gathered. It does not seem useful to annotate the wild guesses De Groot penned down here.

3.9 face reliability

Better forget this paragraph too. It did not even occur to De Groot that, more often than not, the cutting score on a test will be in a place where there are a lot of students, because there is no such thing here as a natural dichotomy between those who master the course content and those who do not. Mastery is a graded characteristic, not a categorical divide as there is between being a man or a woman, psychotic or not, etcetera. One more reason why passing scores are out of place in educational assessment.

3.10 near-equality

Meant as an illustration of the kind of reasoning 'acceptability analysis' is likely to produce, it is however rather confusing. De Groot here seems to use a kind of argument that led him, earlier on in the sixties, to devise his cutting score method using 'kernel items', making a rather artificial divide between 'important' and 'marginal' items used in a particular test. Just another example of trying to force natural categories on a graded thing?

note 8

A note on the fact that younger pupils do not yet choose strategies themselves; for them acceptability is less of an issue than for older pupils. He warns explicitly against discriminating against the educationally handicapped by subjecting them to the same tests (and norms) the 'normal' students have to sit.

note 9

A warning against overestimating the usefulness of reliability coefficients. De Groot forgets to mention here the fact that the reliability of tests is a function of the heterogeneity in the group of students. Remember: all grading is ranking.
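The dependence on heterogeneity follows from the classical test theory definition of reliability as the ratio of true-score variance to observed-score variance, a definition the note itself does not spell out. The sketch below assumes that definition and a fixed error of measurement; the standard deviations used are hypothetical.

```python
def reliability(true_sd: float, error_sd: float) -> float:
    """Classical reliability: true-score variance over observed-score variance."""
    true_var = true_sd**2
    error_var = error_sd**2
    return true_var / (true_var + error_var)


# Same test, same measurement error (sd 3.0), two hypothetical groups:
# a heterogeneous one (true-score sd 6.0) and a homogeneous one (sd 2.0).
for true_sd in (6.0, 2.0):
    print(true_sd, reliability(true_sd, error_sd=3.0))
```

The same test, with the same error of measurement, thus shows a much lower reliability coefficient in the homogeneous group, because there is less ranking to be done.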

4 Procedures in acceptability analysis



note 10



What to do in the case of a personnel selection test battery containing a personality test? Offer the candidates the option to draw a lottery ticket instead of sitting the personality test? I came to think of this case because of my analysis of the selective admission to the Dutch Police Academy (see the report here). To begin with, the chance offered to the candidate should equal the selection ratio on the use of the personality test; on reflection, one might want to correct this chance in order not to invite the stray psychopaths to take the lottery option instead of the test. All of this assumes that the selection ratio on the use of the personality test is known, for example a 95% ratio in order to filter out extreme personality types only.


titles mentioned in the annotations

I will not provide a list here. The information given should enable the reader to locate books and articles; in other cases links have been provided. If in trouble, mail me.

April 25, 2010 \ contact ben apenstaartje
