Annotation of Borsboom 2003: Psychological measurement

Why annotate this book? Because it is about a host of problems that have haunted me since the sixties. The problems are pressing, because they touch on every test and examination, and especially on the uses typically made of test results.

For example, the question of what it is that makes a test (better: a test item) valid. The design of test items, especially of items in achievement tests, critically depends on what one views as their validity.

Another issue is that psychometrics does not seem to be appropriate to much that pertains to assessment in education. It would be much better to have adequate theory based on decision theory. Borsboom, of course, restricts his analyses to psychometrics proper. Nevertheless, his analyses might tell a story that is directly relevant to the question whether psychometric techniques are appropriate in educational assessment.

According to the title, the hunt is on: the hunt on measurement. What is happening in educational assessment, however, is not primarily 'measurement' at all, notwithstanding the fact that items in examinations etcetera should be valid in the sense Borsboom elucidates in his chapter 6. In educational assessment a complex kind of social contract between the parties and individuals involved is being worked out. Yet politicians and major institutions like ETS in the US and Cito in the Netherlands pretend that measurement is their business, and that they are doing a perfect job of it.

In fact, that is a very old idea. Founding father Edward Thorndike already had 'measurements' in the title of his (1904) book, and indeed it is about the use of statistics in educational assessments. There is mention of what would later be called validity, but it takes up a mere two pages (160, 161); let's say he then gives approximately two percent of his attention to issues of validity. Thorndike, of course, strongly promoted quantification of just about everything. Do not blame Thorndike, however, for he has done American education an enormous favor by analysing the lack of validity in arithmetic education, and in arithmetic testing especially (his 1924), and by proposing methods to design valid test items, as we would say today.

Ch. 1. Introduction

The reference to Lorrie Shepard (1993) is incomplete. It is a chapter in an edited book. See below, Literature.

Ch. 2. True scores

Ch. 3. Latent variables

Ch. 4. Scales

Ch. 5. Relations between the models

"The last part is based on Borsboom, van Heerden, & Mellenbergh [2003]; I must, however, say that I no longer subscribe to the conclusion of that paper."

Ch. 6. The problem of validity

6.1 Introduction

... the most central question one can ask about psychological measurement, which is the question of validity.

p. 127

And yet, the way Borsboom is going to define validity in these pages might make it somewhat less central to measurement, because a valid instrument (they only come in the color of 100% valid) might be inaccurate, and might just as validly measure characteristics other than the one it is intended for. So, let us be careful here.

Simple question: "whether a test measures what it should measure."

p. 127

This question, according to Borsboom, "is legitimate and crucial." Is it also the validity question? Not in the literature Borsboom cites here, and that precisely is the point he wants to make.

The essential tension, then, is that a test places the testee in a somewhat unnatural situation, making it problematic whether, for example, a simple typing test truthfully represents the testee's current typing capability as manifested in achievement in the office. Calling the latter achievement the criterion to be used to evaluate the appropriateness of the typing test is legitimate, honest, and must in this case necessarily be so. Of course, this example is not scientific research; it is applied psychology. I do not think Borsboom has a serious argument with this line of reasoning, assuming the existence of this typing capability characteristic. Calling the correlation with such a criterion the 'criterion validity' of the test is what he objects to. The validity question to Borsboom is whether the typing test measures typing capability, assuming this characteristic to exist and to causally affect the test's outcomes.

"Cureton took the essential question of validity to be 'how well a test does the job it is employed to do' (p. 621)..."

That is p. 621 in Cureton's 1950 Handbook of Educational Measurement.

The quote is from Kane, 2001, p. 319

What difference is there between the vision of Borsboom and that of Cureton? Not very much in the ideal posed, or so it seems, but a good deal in the way it has been developed concretely. Is developing a good test a measurement problem, or is the problem to devise a selection instrument that is acceptable to the parties involved, that is, to work out a social contract? Borsboom does not go into these kinds of questions in this section. What I would like to know about Borsboom's analysis of the problem of validity is whether it is restricted to what happens in the psychological laboratory. If not, what about applied psychological testing: is it the same as testing in laboratory situations, or is it somewhat more involved, more complex? If only because, unlike in the laboratory situation, testees in applied settings have a clear interest in the outcome of the testing.

My working hypothesis is the following. There is an important difference between tests and test items. The test tends to be in the applied domain: diagnosis, selection, placement, grading, etcetera. The individual test item is a miniature experiment; it should be valid in the sense that the experiment ascertains ('measures') what it is meant to measure. Turning things around, meaning that items get designed to better serve the testing goals, runs the risk that ultimately nobody knows what an individual item measures or even should measure. The position of Borsboom on the last issue is perfectly clear: do not turn those things around (p. ... in the later sections of the chapter, or the Borsboom, Mellenbergh and Van Heerden, 2004, article).

Borsboom could have made a better case by explicitly recognizing the difference between test items and tests, at least in this chapter (some of the measurement models might break down in the single test item case, but then the concept of unidimensionality might have been called to the rescue).

"If something does not exist, then one cannot measure it. If it exists, but does not causally produce variations in the outcomes of the measurement procedure, then one is either measuring nothing at all or something different altogether. In these two cases, a test does not possess validity. In all other cases, it is valid."

p. 128

Validity is not complex, faceted, or dependent on nomological networks. It is a very basic concept and was correctly formulated, for instance, by Kelley (1927, p. 14), when he stated that a test is valid if it measures what it purports to measure.

p. 128

Denying complexity in the face of ambiguity might not be a very good idea here. Good measurement typically is not something as straightforward as the former quote seems to suggest. More often than not the characteristic (take heat as an example) is 'measured' by making use of its known relation to other characteristics that do allow direct measurement (the duration of a constant heating process in combination with temperature). [I am referring here to the work of Black on heat phenomena in the 18th century, the beginning of thermodynamics (heat is movement)] 'A test measuring what it purports to measure' is not a simple idea at all; it is a clear principle to get things moving. The moving will be an intellectual challenge, let's produce some heat.

In his work Borsboom loves to contrast psychological measurement and measurement in physics. I'd rather like to emphasize what both might have in common. If possible, I will use the physics of free fall to present examples. See my literature/freefal.htm page, also for the literature on free fall mentioned below (and above, excuse me).

There is a famous experiment by Galileo Galilei, the exact details of which have become available only recently in the work of Stillman Drake (1990). To study the regularities in falling bodies, he constructed a perfectly polished plane, inclined at 7 degrees, to let a bronze ball roll (fall) down. He marked the places where the ball was after equal time periods (on the beat of a song). The remarkable result: the lengths travelled by the ball during the 1st, 2nd, etc. periods were 1, 3, 5, 7, 9, 11, etcetera. The perceptive reader will notice, as Galilei did, that these distances are the differences between successive squares of the period numbers: 3 = 2² - 1²; 5 = 3² - 2²; 7 = 4² - 3²; etcetera. In other words, the total distance travelled after n periods is n². Imagine the impact this discovery must have had on Galilei!
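The arithmetic of this odd-number rule is easy to check. The following is only an illustrative sketch, not a reconstruction of Galilei's procedure: the function names are hypothetical, and distances are expressed in units of the length covered during the first period.

```python
def distance_in_period(n: int) -> int:
    """Distance covered during the n-th equal time period (n = 1, 2, ...).

    Assumption: the total distance after n periods is n squared, so the
    distance covered in period n is n^2 - (n - 1)^2, which equals 2n - 1.
    """
    return n * n - (n - 1) * (n - 1)


def cumulative_distance(n: int) -> int:
    """Total distance after n periods, obtained by adding up the odd numbers."""
    return sum(distance_in_period(k) for k in range(1, n + 1))


periods = range(1, 7)
print([distance_in_period(n) for n in periods])   # [1, 3, 5, 7, 9, 11]
print([cumulative_distance(n) for n in periods])  # [1, 4, 9, 16, 25, 36]
```

The first list reproduces the marks on the plane; the second shows that adding up the odd numbers gives the squares of the period numbers.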

The link to the above quote from Borsboom now is the following. It is known that Galilei designed the experiment explicitly without considering any theory whatsoever (there were a lot of views on free fall at the time). Maybe there is one exception, if you would like to call it a theory: the idea of nature being regular. The point being, of course, that the Galilei experiment is an important example of measurement not being driven by or defined by theory.

There is even more to it: there was no clear theory at the time as to what caused the freed ball to roll down, let alone how exactly such a cause could work out. Neither is there today; the reader is referred to the works of Max Jammer on force and mass (two successive books on mass, to be precise, the latter one especially on late 20th century developments). The point here is: even supposing causes to be at work, it will not always be possible to fruitfully theorize on the 'mechanisms' of the causes (as Dijksterhuis would say), there evidently being no 'mechanisms' in the substantive sense of the term. Does this sound strange? Then watch your kids or grandchildren play with magnetic toys: what substantial causes do you see?

The above observations may be important to the theorizing of Borsboom also, for example in the difficult case of (differences in) intelligence, an example he uses in many places in his book. Important, not exactly in the sense of offering a model, but by emphasizing that it may pay off to be critical of everything that might force itself upon us as logical and therefore true, possibly without any good reasons at all.

Borsboom (p. 128) quotes Kelley (1927, p. 14) stating "that a test is valid if it measures what it purports to measure." Is this opinion restricted to tests purporting to measure, or might it be extended to tests 'employed to do a job' and doing it well? Or is there some middle ground, such as tests designed to do a job well, using test items measuring what they are intended to measure?

validity avant la lettre

"Measuring precisely a fact which you do not want is worse than measuring inexactly the fact that you do want."

"It is not undesirable to make inferences, but it is highly undesirable to confuse them with measurements or to leave them without critical scrutiny."

Thorndike, 1904, p. 160, 161.

Thorndike (1904) is still very close to the psychological laboratory, even while here primarily addressing school teachers! The danger of measuring in the psychological laboratory, especially at the time, was that testees were brought into highly artificial situations, resulting in the precise measurement of things one did not want to measure in the first place. Thorndike must have used his experiences with this particular kind of difficulty to appreciate that the same kind of thing might be happening in educational assessment. His very influential 1924 book on arithmetic surely is testimony to that.

Making inferences is, for Thorndike, something entirely different from measurement. Borsboom observes that current conceptions of validity are exactly what Thorndike does not want to endorse: validity "as applicable to test score interpretations only." (p. 128).

"unreliability, item bias, and other supposedly undesirable characteristics of tests bear no direct relation to validity.

p. 129

The verdict on the quoted strong statement will have to wait for the analyses in the next sections. But let us take the issue of item bias. By definition it might be the case that a biased item does not validly measure what it purports to measure in at least one non-trivial subpopulation. Is this not a direct relation to the issue of validity?

Sure, reliability is wholly about the question of measuring whatever it is that the item or test is measuring, and therefore not a case of measurement at all, one might say. At least there should be a definite idea of what it is that the test or test item should measure. It took Thorndike (1904) 159 pages on all kinds of reliability issues to finally propose the absolute priority of the measurement question above that of reliability.

It is proposed that the question of validity must be taken to apply only to the question whether one is measuring the right attribute, not to the question how well one is measuring that attribute.

p. 129

Well, this quote resembles the ones from Thorndike, 1904. There is a problem here, though, and a serious one as well, I suspect. One may be convinced of measuring the right attribute, be it not very well, and be proved wrong in that conviction. A famous case that might be applicable here is the measurement of temperature and heat; in the eighteenth century these concepts were not yet sharply differentiated. How about applying the validity question here in the sense proposed by Borsboom? (Slotta, Chi, & Joram, 200; James D. Slotta and Michelene T. H. Chi (2006); Russell McCormmach (2004); and Duane Roller (1950). See the projecten/physicseducation.htm page for these references.)

I have already made a distinction between the individual test item and the multi-item test. The level of the individual test item seems to be what Borsboom really is talking about. My own interest in the individual test item is its design (in Dutch mainly), where the highest priority is to design items to be valid.

Another distinction not made by Borsboom, yet especially important in educational assessment, is that between institutional and individual interests or viewpoints. The measurement perspective is an 'institutional' one. The individual student preparing for the next test or exam is quite another story. His or her viewpoint is extremely important. Students not having the slightest idea how best to prepare themselves for the next test is a disastrous situation. Valid measurement from the perspective of the individual student therefore is extremely important in education. In a roundabout way it also is for institutional interests. The kind of modeling involved here was introduced by Bob van Naerssen in 1970 as the tentamen model. I have built further on the foundations laid by Van Naerssen. The reader can find an exposition of the model, as well as the collection of Java-applets used to 'run' the model, here.

In the kind of student model mentioned, the concept of mastery figures: it is defined on a collection of (constructable) test items from which individual tests will be randomly selected. Does this mastery, the score that would be obtained if the individual were to answer all items in the item set, 'really exist'? The question answers itself: No. Should 'mastery' really exist in order to build a theory using this concept? No. The concept is instrumental only. The process involved, from the perspective of the individual student, is a binomial process. Each of the items selected into the test might be one he masters, or not. As far as this student is concerned, the selection of items from the item set is strictly random. [If the set consists of subsets, consider only a particular subset]. Does the binomial process really exist? Not in the view of the constructor of the test. Surely in the view of the individual student. Anybody doubting this will lose money betting on its non-existence.
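What the binomial process means for the student can be made concrete in a few lines. This is a minimal sketch, not the tentamen model itself: it assumes that mastery is the proportion of items in the item set the student masters, that the item set is large enough for random selection to behave as independent draws, and that the test length of 20 and the cutoff of 11 are merely illustrative choices. All names are hypothetical.

```python
from math import comb


def binomial_pmf(k: int, n: int, mastery: float) -> float:
    """Probability of exactly k mastered items in a test of n randomly selected items."""
    return comb(n, k) * mastery**k * (1.0 - mastery) ** (n - k)


def pass_probability(n: int, cutoff: int, mastery: float) -> float:
    """Probability that the test score is at least `cutoff` out of n items."""
    return sum(binomial_pmf(k, n, mastery) for k in range(cutoff, n + 1))


if __name__ == "__main__":
    # Illustrative only: a 20-item test with a pass mark of 11.
    for mastery in (0.5, 0.6, 0.7, 0.8):
        p = pass_probability(20, 11, mastery)
        print(f"mastery {mastery:.1f}: P(score >= 11 of 20) = {p:.3f}")
```

For the student the relevant question is how the probability of passing changes with additional preparation, that is, with higher mastery. The sketch gives that relation without any claim that mastery 'really exists'.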

My annotations on the following paragraphs of chapter 6 will reflect my special interests in the design of individual test items and in the binomial process confronting the student preparing for educational assessments.

6.2 Ontology versus epistemology

Literature

Denny Borsboom (2003). Conceptual issues in psychological measurement. Dissertation University of Amsterdam.

Denny Borsboom (2005). Measuring the mind. Conceptual issues in contemporary psychometrics. Cambridge University Press.

Denny Borsboom and Gideon J. Mellenbergh (2004). Why psychometrics is not pathological. Theory & Psychology, 14, 105-120.

Denny Borsboom, Gideon J. Mellenbergh and Jaap van Heerden (2002). Functional thought experiments. Synthese, 130, 379-387.

Denny Borsboom, Gideon J. Mellenbergh and Jaap van Heerden (2003). The theoretical status of latent variables. Psychological Review, 110, 203-219.

Denny Borsboom, Gideon J. Mellenbergh and Jaap van Heerden (2004). The concept of validity. Psychological Review, 111, 1061-1071.

M. T. Kane (1995). An argument-based approach to validity. American Psychologist, 50, 741-749.

M. T. Kane (2001). Current concerns in validity theory. Journal of Educational Measurement, 38, 319-342.

Truman Lee Kelley (1927). Interpretation of educational measurements. World Book Company.

Lorrie A. Shepard (1993). Evaluating test validity. In L. Darling-Hammond (Ed.), Review of research in education, vol. 19 (pp. 405-450).

Annotated by Ben Wilbrink