Warning. The Java applets were compiled under Java version 6, which has since been declared obsolete because of security problems; Java no longer supports applets at all. I have not yet (as of 2023, though I am now working on it) been able to redesign the software.
For simple analyses it is of course possible to use Wolfram's generator for the binomial distribution (score distribution given mastery), the beta density (likelihood for mastery), and the betabinomial distribution (predictive score distribution); I will present the necessary links and documentation.
Any questions: contact me. It is my experience that this innovative work attracts zero attention (nada, niente); if you are interested, then you are the exception, so do not hesitate to contact me.
I have begun to immerse myself again in Java development, using the BlueJ platform. It should be possible to develop an application with a simple binomial simulation and analysis program. I'll keep you informed. Once that hurdle has been taken, it must be possible to replace the applets with runnable applications in a relatively short time. Fingers crossed. After all, the main function of the applets is input and output; the machinery of the models itself need not be updated at all.
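As a taste of what such a runnable replacement might look like, here is a minimal sketch of a binomial simulation in plain Java. All names (Generator, simulateScore, simulateDistribution) are mine, chosen for illustration; this is a sketch, not the applets' actual code.

    import java.util.Random;

    /** Sketch of the SPA 'generator': simulate number-correct scores
        on an n-item test for a student with a given level of mastery. */
    public class Generator {
        private final Random random = new Random();

        /** One test score: count the items whose random draw falls at or below mastery. */
        public int simulateScore(int items, double mastery) {
            int correct = 0;
            for (int i = 0; i < items; i++) {
                if (random.nextDouble() <= mastery) correct++;
            }
            return correct;
        }

        /** Many runs, tallied into a score distribution. */
        public int[] simulateDistribution(int runs, int items, double mastery) {
            int[] frequency = new int[items + 1];
            for (int r = 0; r < runs; r++) {
                frequency[simulateScore(items, mastery)]++;
            }
            return frequency;
        }

        public static void main(String[] args) {
            int[] f = new Generator().simulateDistribution(10000, 20, 0.6);
            for (int score = 0; score <= 20; score++) {
                System.out.printf("%2d: %d%n", score, f[score]);
            }
        }
    }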
In the meantime, it should still be possible to run the applets on machines still using Java 6, or to run them from within BlueJ 3.1.7 (the last version that supports applets) on such a machine.
This chapter presents the first module of the SPA model, ‘SPA’ standing for Strategic Preparation for Achievement tests—see the introductory chapter.
The SPA model consists of a series—partially a cumulative series—of modules dedicated to particular functions, such as:
- generating binomial score distributions given the mastery level;
- generating the likelihood of mastery given a score on a preliminary test;
- given that likelihood, generating the predictive score distribution for the test one has to sit;
- specifying objective—first generation—utility functions on test scores;
- specifying learning curves;
- evaluating expected utility along the learning path;
- given the expected utility function, evaluating the optimal investment of study time in preparation for the next achievement test;
- and, using the last results, specifying the second generation utility function on test scores.
advance organizer
Technically this first module is a rather simple one, compared to the modules following it. Conceptually, however, there are some fine points that should be understood in order to fully appreciate how the SPA model applies to most kinds of realistic assessment situations.
An assessment situation here is understood to be primarily the situation confronting the individual student. The SPA model is about the strategic options available to the individual student in preparation for this achievement test or that examination. Therefore the items in the test may be regarded as having been drawn randomly from a large collection of appropriate test items, meaning only that the individual student does not know in advance exactly which items the test will consist of.
The model will assume mastery to be defined on the collection of test items that every particular test is sampled from, disregarding whatever differences between particular items might exist. In a way, that is what models should do, isn't it: abstract away from the petty details to shed a clear light on the main issues. Yet it will be possible to use the model flexibly in situations that patently will not fit into this straitjacket. Test questions come in different kinds: multiple-choice as well as essay questions, easy as well as difficult ones, or on two different subjects. The point is: apply the model to each kind or subdomain separately, then combine the results. The instruments of the SPA model will allow you to do so, provided the number of subdomains is limited to two. Finer subdivisions probably will not result in better model studies; if more subdivisions should be needed, apply the model repeatedly to each of them separately.
Is the SPA model restricted to the analysis of summative tests? No. It is possible to analyze the effects on students' strategic behavior of splitting up a summative test into two or more consecutive tests, in a way turning it into two or more formative tests whose combination counts as the summative one. Therefore the SPA model will apply also to situations of formative testing, instructional feedback, etcetera.
Is the SPA model restricted to ‘objective’ tests? No, as long as particular scores can be assigned to particular test items. Portfolio assessment is an example. The problem here might be that the number of observations is small compared to the number of items in an objective-type achievement test. But then the unique power of the SPA model is to analyze the strategic situation regarding the combination (over time) of a number of separate tests, portfolios, or what have you.
Objective tests might come in the format of multiple-choice questions asking the student to explain the answer given. The SPA model need not know about that complication, as long as every question is reported as being answered adequately, or not. A good answer accompanied by an inadequate justification might be scored as zero, and vice versa.
Most assessment situations will be amenable to analysis using the SPA model. Yet technically the model handles only the number correct on a test of given length. A start has been made to model the three-valued case: good, wrong, not attempted (see applet 1m).
The basics are rather simple; it is through combinations that the model is able to handle many alternative kinds of assessment situations as well.
The first module is the single most important building block of the SPA model: supposing your mastery, or that of a particular student, to have a known value, it evaluates what the distribution of scores on a particular test would be if the test were repeatedly sampled from the same domain of test questions, and one were to sit every such test. The fiction here is that one knows what the level of mastery is, mastery being defined on the domain of test questions. In the sequel it will become clear that this fiction nevertheless is functional in evaluating real-world situations, simply by making an inventory of what would happen if the assumed mastery takes on a series of specific values. It works more or less as a catalyst. The building block, then, is the routine to generate a score distribution for a given value of mastery. This generation may take the form of either an analytical evaluation or a simulation (or one may use both, for illustrative purposes, or as an independent check of the results of one method against those of the other).
The general model of achievement testing is not only implemented as a mathematical model, but—largely independently—as a computer simulation as well. The generator that powers the simulation is the simulation of test scores for a given level of mastery. The mathematical model and the computer simulation are instruments that together have been made available in the form of Java applets that may be run in your browser. At the bottom of this page you will find the module 1a advanced applet. The module 1 straight applet is here: applet 1; its 3-valued extension is applet 1m; its advanced version is here: applet 1a advanced. All applets of the SPA model have been collected in the applets page [note that this page will load slowly because of the many applets in it] as well as in separate pages for every applet.
Figure 1 illustrates the main points of the search for what the test score might be, given the—secretly known—value of mastery on the questions belonging to the course content domain.
The assumption that mastery has a known value will prove to be extremely productive, because it is possible to systematically change the assumed value, and see what happens. Let us, for the time being, patiently assume we live in a somewhat platonic world of make believe.
A recent development is the availability of WolframAlpha, which makes it possible to evaluate statistical distributions, such as the binomial distribution, number of items 20, mastery 0.6:
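For example, queries along the following lines will work in WolframAlpha (or Mathematica); the exact phrasing may need some adjustment:

    BinomialDistribution[20, 0.6]
    PDF[BinomialDistribution[20, 0.6], 12]    (probability of exactly 12 correct)
    CDF[BinomialDistribution[20, 0.6], 11]    (probability of at most 11 correct)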
The instrument—available as applet 1—can be used for all kinds of purposes. Other SPA modules will use it as a building block. More directly, it allows the estimation of the probability that a score will be at least equal to a given reference or cutoff score. In figure 1 the probability of scores falling above the vertical line drawn is shown. If that is a passing score, the probability is that of a pass for a student having mastery 0.6.
On some tests it is possible, if an item is not known either correctly or incorrectly, to guess an answer from a number of alternatives.
May 13, 2009. Roughly there are three possibilities: knowing the answer correctly, knowing it incorrectly, or not knowing it. Knowing the answer incorrectly is not a rare case; on the contrary, it happens rather frequently. I have not yet accommodated this possibility in the model. Therefore, if you would like to use the guessing option, absorb the probability of answering incorrectly into the probability of guessing incorrectly. For example, if the probability of knowing an item from the domain incorrectly is 0.2, then the probability of answering or guessing a three-choice item incorrectly is 0.2 + 0.67 times (1 - mastery - 0.2); therefore the probability of either knowing it correctly or guessing it correctly is mastery + 0.33 times (1 - mastery - 0.2).
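A worked example of this absorption rule, with numbers chosen for illustration only: take mastery 0.6 and a probability of 0.2 of knowing an item incorrectly, on three-choice items. Then the probability of a correct answer is 0.6 + 0.33 times (1 - 0.6 - 0.2) ≈ 0.67, the probability of an incorrect answer is 0.2 + 0.67 times (1 - 0.6 - 0.2) ≈ 0.33, and the two indeed sum to one.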
May 13, 2009. Without the guessing parameter, the model effectively assumes wrong answers as well as answers not known to be scored zero. That seems to be a reasonable assumption in many educational situations. Wrongly knowing an answer might be worse than knowing that one does not know the answer. On the other hand, learning without making mistakes is not really possible. On balance, what can one say about evaluating wrong answers versus knowing not to know answers? There will be exceptions, of course. For physicians, for example, it is probably a serious matter to make mistakes without knowing it. In testing situations it seems appropriate to allow candidates to make educated guesses whenever they are not quite sure what the right answer is. Random guessing, however, definitely is detrimental to whatever reasonable purposes testing has. There is a vague boundary between what still might be an educated guess, and what not. Therefore, allow candidates a bonus score on items not attempted; do not tempt them to guess wildly. A good idea might be to allow students to justify their educated guesses, as well as the other answers they provide. The three-valued model will accommodate the bonus option.
Specify a reasonable guessing probability, and the plot of test scores will shift somewhat (or a lot, that depends) to the right. Guessing is not (yet) implemented in all modules; after all, it should be of marginal significance only. Proper procedure in multiple-choice achievement testing forbids instructing students to always guess on items not known. Items left unanswered should earn a fixed bonus, or alternatively items answered wrongly should be scored at, for example, minus a reasonable fraction of a point. However, allowing students a bonus on items not attempted changes the model to a three-valued one, implemented in applet 1m.
The three-valued model—right answer, wrong answer, or not attempted—has been implemented, it is however not yet explained and discussed in this chapter. I am working on it.
The special thing about this way of building an assessment model is that it models the question-answering process (the modelling of strategy is the subject of 'higher' modules in the SPA model). This approach is radically different from that found in almost all books on educational measurement, but then the educational measurement literature is not concerned at all with student strategies, nor with empowering students to attain the results their talents allow them.
Education is about acquiring new knowledge and insights. Insisting on high levels of mastery would result in a terrible waste of time and motivation. In educational assessment, therefore, levels of mastery typically center on rather low values of 0.5 to 0.8. Think about what this means: the student having mastery 0.6 does have exactly that chance to correctly answer the next question that is presented to her. And the second question, as well as the third, etcetera.
Low levels of mastery are here to stay. Educational assessments then of necessity are games of chance. Therefore, the actors involved will have difficulty understanding or predicting the outcomes of tests in this respect. One of the oldest tracts on probabilities in games of chance was written by Christiaan Huygens (1660) in the middle of the seventeenth century, rather late in the cultural history of Europe. Today most people, and even statisticians themselves, have difficulty evaluating or predicting outcomes in complicated chance events, such as the total of correctly answered questions in educational assessments (the Monty Hall problem is an example; see Zomorodian, 1998). In this respect statistics is not special; physics, for example, is a discipline where people's mental models tend to differ from the laws that empirically have been established since the late middle ages (Hestenes, 1992).
The generator applet is an instrument that will assist in understanding how test scores are generated from chance events on the item level. It can be used, for example, to contrast intuitive predictions—resulting from one's own mental model—with the real results for a great number of tests—games of chance—combined. In the SPA model the chance process that generates scores, given the level of mastery, is the fundamental process that will allow rather precise predictions and therefore precise strategic choices as well.
Implementing a testing model in the form of a computer simulation means that somewhere, somehow, item scores must be generated using a random number generator; the item scores will be summed to give test scores. Random numbers on the interval from zero to one will be used to determine whether a particular item will be answered correctly or not. It is customary to indicate a proportional value as the demarcation point on this interval; call it the level of mastery. Generated random numbers equal to or below the assumed mastery level will be counted as correctly answered or known items. The latter term is somewhat imprecise, but will nevertheless be used freely, always understanding it as behaviorally defined: correctly answered. This much has been implemented as 'the generator'; it is the instrument applet 1, the interface of which is shown in figure 1 above.
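To check such a simulation against the analytical model (the independent cross-check mentioned earlier), the binomial probabilities can also be computed directly. A minimal sketch in Java; the class and method names are mine, for illustration only:

    /** Illustrative helper: the binomial probability of x items correct
        out of n, given mastery m (assumes 0 < m < 1). Working with
        logarithms avoids overflow for large n. */
    public final class BinomialMath {
        public static double binomialPmf(int n, int x, double m) {
            double logCoef = 0.0;                     // log of (n over x)
            for (int i = 1; i <= x; i++) {
                logCoef += Math.log(n - x + i) - Math.log(i);
            }
            return Math.exp(logCoef + x * Math.log(m) + (n - x) * Math.log(1 - m));
        }
    }

Comparing these probabilities (times the number of runs) with the tallies of the simulation gives exactly the kind of independent check of one method against the other mentioned above.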
An achievement test takes a sample of the knowledge of the testee. Being only a sample, the test score depends on the particular sample used. Everybody in the educational world knows this. Few, however, thoroughly understand its implications.
Let us make this notion precise. Suppose that you know your mastery to be exactly 0.8. A fair sample of test items will give you for every item a chance of 0.8 to answer the item correctly, or to know the right answer. Using a lottery machine or a random number generator on your computer, it is quite simple to simulate a test score—thousands of test scores if you want—always supposing your mastery to be exactly 0.8. Most of the time the resulting test score will not be exactly 80 out of a hundred; in fact, on a hundred-item test the probability of scoring exactly 80 is only about 0.1. The program doing this for you—it will run in your browser—is presented in this page. Experiment a little; choose other values for number of runs, number of items, or mastery. To plot the kind of schemes in figures 1.2 and 1.3, use the older applet on that same webpage, choosing options 100 or 101.
This small program is called the generator because it is the building block that enables the simulation of simple as well as more complex assessment situations in ways to be treated in the following modules of the SPA model.
To be really useful the model has to deal with the fact that nobody ever knows her or his 'true' mastery level. While one will never know whether true mastery is 0.8 or 0.6, or some other value, it is possible to decide whether 0.8 or 0.6 is the more likely value, given some observations. This is very much the same method as using a balance to decide whether one diamond weighs more than another, without needing to know the exact weight of either one. How to determine this likelihood will be handled in the next module, called the mastery envelope. The generator will enable the model to do the trick, in combination with some real-world information on your current level of mastery as revealed in preliminary testing or in answering textbook questions.
Plain and simple as the generator's binomial model may seem, there are nevertheless a number of issues at stake here, among them a difference in approach between the SPA model and the received view.
The received view in educational measurement is that measurement is about estimating the 'true mastery' of students, given their test scores. In that view the teacher is the decision-maker. The teacher supposedly wants to make an educated guess whether or not every individual student 'truly' deserves grade A or B, a pass, or whatever. That approach does not recognize the simple fact that in education it is the observed score itself—possibly translated into one or another kind of grade—that the game is about. 'True scores' do not count. How could they: they are not known, and estimated true scores simply are not a valid currency in this field. The received view also has little or nothing to say about student strategies in the face of tests or examinations, except as remarks made outside the body of the theory itself, such as admonitions to students, or restrictions in the application of theory.
Time and again I will spell out the contrasting characteristics of the SPA model and the received view in educational measurement. This is absolutely necessary in order to keep an appropriate perspective on what the SPA model is trying to accomplish and how this differs from what educational measurement theory offers actors in education.
The received view in psychometrics has been sharply criticised by Joel Michell (1999), and in educational measurement by Ellen Condliffe Lagemann (2000). I have no issue with that. What I am worried about is whether the Michell critique applies equally well to the SPA model approach, in particular the mastery construct and the application of the binomial model. I will have to show that the SPA model does not claim anything about measuring attributes, not even mastery. The last formulation may well be too defensive. I will have to look into the work of Denny Borsboom (2005), among others.
What is it that is 'received' in the psychometric view? The received character of psychometrics, from the perspective of an individual model, resides in its institutional character, its 'measurement' of differences between individuals. This is less a problem of the technical machinery of psychometrics, than it is a cultural one of what it is that test scores typically are used for. In the United States the educational climate is one of fierce competition, at least among the (parents of the) children that have any chance at all in the educational system. Standardized testing is the way the competition is run—or does it run the competition? On the European continent it should be less difficult to disentangle psychometrics from a competitive educational culture, and indeed there are recent high profile publications that give attention to intra-individual processes and therefore invite the formulation of models at the individual level (Borsboom, Mellenbergh and Van Heerden, 2003, 2004).
Mastery seems to be a platonic concept; why then try to define it? Because the concept of mastery plays such a significant role in the model—and in education, for that matter—an operational definition is necessary, even if the 'operation' might be somewhat platonic itself. There is nothing wrong with an abstract operational definition. After all, important physical concepts such as 'mass' and 'time' do not even have satisfactory definitions.
This definition obviously cannot be literally true, but it will do for the development of the model. Some qualifiers are self-evident: mastery is understood to be mastery at a specific moment, because it will not stay the same from one moment to the next; there may or may not physically be a sizeable collection of items to be sampled for every test that has to be constructed, and if there isn't, it can be imaginary (see Borsboom, Mellenbergh and Van Heerden (2002 pdf) on thought experiments); and the specific criteria mentioned need not be educational objectives but might be established in educational practice, the usual meaning being that items will be sampled from a specified item domain or knowledge domain.
Adapted from Alan H. Schoenfeld (2007). What is mathematical proficiency and how can it be assessed? In Alan H. Schoenfeld (Ed.), Assessing mathematical proficiency (pp. 59-73). Cambridge University Press, pp. 69-70. pdf
What about items the student answers incorrectly, believing the answer to be the correct one? Lots of achievement test items get answered incorrectly, not because students guess wrongly, but because they wrongly think they ‘know’ the right answer. The box supplies a somewhat special example. For an analysis of how students get their answers wrong in tests of arithmetic, see Hickendorff, Heiser and Van Putten (2009, Table 1, p. 335 pdf).
Misunderstanding as a variable is needed where guessing is a possibility. Think of it: on a test of four-choice items, either the testee thinks she knows the answer, or she might guess it, depending on the scoring formula used. Hopefully guessing is discouraged by granting a bonus for every item left unanswered. Now the point to remark is that thinking to know the answer may result in either a correct answer, or an incorrect one. Regrettably, in the technical literature it typically is assumed that testees either know the correct answer, or have to guess it (for example, see how Lord and Novick, 1968, chapter 14, and section 15.11, handle the issue; the closest they get is on p. 352: "... examinees know the answer to an item and actually answer it correctly ...", but they do not follow up on what would be needed to model the case of knowing the answer wrongly). The danger of this oversight lies in the use of formula scores that correct for chance success. Not recognizing that testees may get items wrong because of misunderstanding the content, they will get punished for allegedly having guessed their wrong answers. If the wrong answers get the guessing treatment, they might be scored -1/3, assuming the guessing is on four options, effectively wiping out one right answer for every three wrong answers. In the SPA model, using the guessing parameter will imply that a value has to be substituted for the misunderstanding parameter.
May 13, 2009. The parameter ‘misunderstanding’ has not yet been implemented, except in the ‘old’ applet 1. In the meantime, absorb the probability of misunderstanding into the guessing probability, as indicated near the beginning of this chapter in another May 13 note.
December 2, 2017. An important possibility is for students to give wrong answers for valid reasons.
The concept of mastery in education is somewhat analogous to that of mass in physics (Jammer, 1991); there are many kinds of definition possible, and many proposals have been made.
Eventually I will have to say something about the concept of true score in psychometrics (Lord and Novick, 1968, p. 1), about that of mastery in educational theory (Benjamin Bloom), and about what it is to 'know' something, in psychology (learning theory, expert knowledge) as well as in philosophy (I will probably mention only what Carl Hempel has to say on the concept that is important in the development of the SPA model).
The concept of mastery may be intuitively understandable in the case of multiple-choice tests, but even in this case there is the problem of how the alternatives should be keyed (my 1977). Consider the case of the assessment of essay questions, a research question that is taken up again and again, one instance being the monumental study of Hartog and Rhodes published in 1936 (see illustrative data from this research below). The problem is that so many circumstances influence the score given to a particular answer that even the notion of students having a particular 'mastery' becomes rather foggy. There does not seem to be an easy way out by defining mastery on the domain of questions (and possible answers) as independently assessed by a host of knowledgeable assessors. A concrete example in the Dutch literature is the research of Don Mellenbergh on the assessment of essay exams, for example his (1971). A real test of 8 essay questions on physiology got the 'Chinese' treatment: answers were uniformly typed, drawings were redrawn by one person, and four assistant professors independently assessed all anonymized answers in random sequence. On procedures in Imperial China's civil service exams, see for example my 1997. The consistency of the physiology exam assessments was analyzed in a number of sophisticated ways.
It is probably a very good idea to consult Judi Randi, Elena L. Grigorenko, and Robert J. Sternberg (2005). Revisiting definitions of reading comprehension: Just what is reading comprehension anyway? In Susan E. Israel, Cathy Collins Block, Kathryn L. Bauserman, and Kathryn Kinnucan-Welsch (Eds.), Metacognition in literacy learning: theory, assessment, instruction, and professional development. Erlbaum.
The point of mentioning this kind of research is that the processes getting one from a given mastery all the way to the assessed answers on an achievement test will be rather complicated, and the results therefore somewhat fuzzy. The process is a randomizing process, while the assessment of answers on top of that is distorted by a host of different kinds of factors. How to model this randomizing process? It is not the kind of randomization involved in guessing on questions not known: answers produced, given a particular mastery, are not clearly recognized for what they are worth. It must be possible to use a simple model for the distortion, using quantifications from the research literature the study of Mellenbergh belongs to. Distortions at the item level may be assumed to be independent from item to item; because their expected value is zero, they do not affect the binomial model given mastery. The real problem, therefore, is distortions applying to a series of items in the test, or to the test as a whole. To have a lenient instead of a stringent assessor will make a large difference. The way to introduce these distortions into the model is to put a distribution on the mastery given, because the assumption of a point value no longer is a natural one to make. Nothing much turns on the choice of a particular distribution; the binomial distribution traditionally has been used to model random fluctuation, as has the normal distribution. In the context of a binomial model on given mastery, the preferred function to use is the beta density on parameters a and b. The mean of the density, a / (a + b), equals the mastery assumed given; its variance then depends on the sum a + b, so let us call that sum the 'consistency.' The higher the number, the better the consistency. Think of the number (a + b - 2) as the number of items in a (preliminary) test; chapter two will make perfectly clear what is meant by this.
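For the record, the relevant facts about the beta density are standard: its mean is a / (a + b), and its variance is ab / ( (a + b)^2 (a + b + 1) ), so that for a fixed mean the variance shrinks as the consistency a + b grows. For example, a = 12, b = 8 gives mean 0.6 and a standard deviation of about 0.11; doubling both parameters to a = 24, b = 16 keeps the mean at 0.6 but lowers the standard deviation to about 0.08.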
The binomial model now has become a marginal binomial model: the score distribution is the marginal projection of a bivariate distribution. It is known to be the betabinomial distribution, on parameters n, a and b, n being the number of items in the test. As far as the mathematics and the simulation go, it has in a way already been implemented in the prediction model treated in module three. The full prediction model, then, allowing a distortion on the mastery given, could use a bivariate beta density as likelihood. However, we are here in the land of insecurity and fuzziness; there is no clean process such as sampling items from a domain, therefore it is no use complicating the program this way. A satisficing approach would be to discount some of the information obtained from the score on a preliminary test; the practical thing to do is to virtually reduce the number of items in the preliminary test. By how much? I will have to study available research—such as Mellenbergh (1971)—to get reasonable percentages for doing this.
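For reference, the betabinomial probability function on parameters n, a and b is standard as well:

P(X = x) = [ n! / ( x! (n - x)! ) ] B(a + x, b + n - x) / B(a, b),   x = 0, 1, ..., n,

where B is the beta function. It reduces to the binomial distribution when the beta density collapses onto a point value of mastery.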
This paragraph has yet to be written. The concept of an item domain should provide a demarcation between achievement test items that can be considered to be appropriate to the testing purpose at hand, and others that are not appropriate. It would be nice to have an explicit item design theory to assist in defining demarcation rules, if only by providing examples of items definitely belonging to the item domain. Of course, on the design of achievement test items I have written my (1983), now being updated (in Dutch) here and eventually also translated or complemented by an English version here.
The item domain need not exist physically in the form of an item pool. To the individual student it is irrelevant how items in the test are chosen, as long as they look to her as being independently chosen in the same way for every ordinary test and its preliminary equivalent (see chapter two on preliminary tests as a source of data on mastery). Of course, coaching on a specific test is ruled out in the model, although coaching on a specific test format—such as coaching on the SAT I—is no problem, as far as the SPA model goes.
An example of a highly sophisticated use of the domain concept is to be found in Schulz, Lee and Mullen (2005 html) 'A domain-level approach to describing growth in achievement.'
The binomial model is rather neutral regarding the question of who is assumed to be the decision-maker. Lord and Novick (1968) only looked at institutional decision-making, and therefore missed the potential uses of the model in individual decision-making. Van Naerssen (1970 html) truly saw the importance of regarding the student as the primary decision-maker in educational assessments, and their teachers as secondary decision-makers. After all, the students themselves decide how much to invest in test preparation, possibly taking into account the characteristics of the examination and its rules, as well as the strength of the competition. Teachers decide on test length, combination rules, passing scores, test item quality, and possibly a host of other issues that might influence student preparation strategies.
In psychometrics the testee or student is almost never—but see Cronbach and Gleser, 1957—regarded as the primary actor. Authors working in economic theory do, however; see for example the literature on human capital theory.
Teachers, test developers, institutions, they all belong to the category of secondary decision makers. Their actions, in particular the way they will decide once the data are in, will influence the strategic choices made by their students in preparation for these same tests, and therefore they had better be aware of how and why students choose their preparation strategies.
It is not clear to me yet how the choice of the student as primary decision-maker in the SPA model affects the model choice in this chapter. It looks like assuming her mastery known, instead of trying to estimate it, needs some explanation; but then in chapter two the SPA model also uses empirical data to estimate this individual student's mastery. Individuals do use data too, don't they? The demarcation point therefore will probably be that the SPA model assumes an almost physical process—almost the same or plainly the same as in item sampling models—directly regarding the student's test preparation, while the received view hesitates to do so and uses the process for a posteriori analyses only, and mainly for assisting institutional decision-making.
Mastery being assumed as given, the distribution of repeatedly generated test scores—obtained by randomly sampling items from the domain used for the definition of mastery—is the binomial distribution. This much is granted in the received view, because it is standard statistics.
Assuming mastery to be m, and the test having n items randomly sampled from the domain, the probability of knowing the answers to only the first x of the n items is

m^x (1 - m)^(n - x).

Of course, knowing x items in a test of n items is possible in many different ways other than knowing only the first x items. The exact number of ways x items can be known out of n is

n! / ( x! (n - x)! ),

therefore the probability of knowing x out of n items is

[ n! / ( x! (n - x)! ) ] m^x (1 - m)^(n - x),

where n! = n * (n - 1) * (n - 2) * ... * 2 * 1, and 0! = 1.

Collecting these probabilities for the possible values of x from zero to n results in the binomial distribution

P(X = x) = [ n! / ( x! (n - x)! ) ] m^x (1 - m)^(n - x),   x = 0, 1, 2, ..., n.
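A worked example: for a test of n = 20 items and mastery m = 0.6, the probability of a score of exactly 12 is

[ 20! / ( 12! 8! ) ] 0.6^12 0.4^8 = 125970 * 0.6^12 * 0.4^8 ≈ 0.18,

which is the single most probable score, yet far from a certainty.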
The boxed information is highly abstract, because it is the distribution that would empirically be generated by a number of runs going to infinity, a fact that is hidden in the concept of probability used here. Ever met infinity? Even apart from that, the formula for the binomial distribution may be intimidating to some readers. They will be saved by the simulation, which is designed to exactly model the process of testing. This process is what we are concerned about, and it helps enormously to know that the simulation will approximate the binomial distribution, and the better so the larger the number of observations or runs. The formula is a convenient way to describe the kind of distribution that is generated by the process used to test students. It is not the other way around; nothing is fabricated here so as to get binomial distributions.
Lord & Novick (1968) use the binomial model only once to model a truly binomial process: that of guessing. The idea of a binomial model in test theory therefore is not a new one. They did not, however, see how the binomial model could otherwise be of use. They did look into a number of possibilities to use the binomial model for its mathematical convenience, but their verdict was negative. The stumbling block here has been the assumption that every test is a group-administered one; they were unable to see how for the individual student the test could be regarded as a binomial process, i.e. its items as being randomly sampled for the individual student. Switching the standpoint from that of the institution to the student herself should make it easy to see that from the student's perspective the test in all essential respects is one that can be regarded as randomly sampled from a domain of questions especially for her, while the same test has that characteristic also for all other students taking it. Psychometricians tend to be bewildered by the last notion, because they interpret random sampling in a literal sense, as they should in an institutional model: random sampling for student A means that A's test is another sample than B's randomly sampled test.
So far, this is a fair description of the difference between the SPA model and the received view in psychometrics. This difference is small, even smaller than the attention Cronbach and Gleser (1957) gave to the difference between institutional and individual decision-making, yet it entails enormous consequences in the development and application of theory. It is not the case that the received view is amiss in the field of institutional decision-making; the point is that it has left the needs of individual testees unattended: it is an unbalanced theory. Even the last reproach is not at all a serious one, unless one looks at the educational sector. There, testing is not the somewhat mysterious instrument of the diagnostician, used to the good of the testee only; assessment is at the heart of the educational mission, instrumental to that mission. In education, everything should be done to get the student's attention for her own interest in her education, and therefore in the preparation for tests. This is quite the reverse of the received view's assumption that testees in no way have been coached to the test, because that would invalidate each and every interpretation of test results. This basic tension between psychological diagnostics and educational assessment, between the measurement approach of the received view and the individual decision-making approach of the SPA model, is what will concern us in this and the following chapters.
Binomial processes were already well understood in the seventeenth century, when games of chance were enthusiastically studied by men such as Pascal, Huygens and (Jacob) Bernoulli. Tossing a coin or a die results in equal chances on specified outcomes, and combinations and permutations of outcomes could be evaluated then. The reverse problem was not yet solved: given a number of outcomes, specify the chance process they would likely be the result of. The difference between the SPA model and the received view does resemble this old difference between a priori statistical reasoning and a posteriori statistical inference. The typical reasoning in the received view is from given results to the mastery that might have given rise to them, such as Lord and Novick (1968) using the binomial model to get an estimate of the number of items known instead of guessed correctly. In the SPA model the reasoning is the other way around: estimating current mastery, then predicting what the test outcome could be. The basic statistical theory in both cases, of course, is the same; it is the interpretation that will differ.
The SPA model being a model of students taken individually, it follows immediately from the definition of mastery that the traditional notion of item difficulty is absent from this model, as is the notion of reliability of test scores. Of course a group of students will produce item scores that allow item difficulties to be computed. In the SPA model, however, these item statistics fall outside the model. Another way to describe the situation is to say that item difficulties have been absorbed into the (definition of) mastery. Changing the composition of the item bank by adding more difficult items—'difficult' in the sense of, for example, Carroll (1987)—changes mastery as defined on that item bank. The composition of the item bank is assumed to be fixed, at least during the period students are preparing for the test.
With regard to the variance of item difficulties in the item bank, as measured for a group of students, the important remark is that this variance should be reflected in the learning curve on this domain of knowledge and the way it is tested. The smaller this variance, the more linear learning will be. This topic will be treated further in the module on the learning model.
It is said that the binomial model is not an appropriate model to use in educational settings because obviously test items differ from each other in difficulty, and the binomial model does not seem to recognize this. The term 'binomial model,' however, is used for a mathematical formula, and that formula itself, of course, is appropriate. The point is: what it is that the binomial model is supposed to model. In the SPA model it is used to model the generation of item and test scores given the mastery of the individual student. The concept of item difficulty in educational measurement is defined on the results of groups of testees; one student does not count as a group, so there is no meaningful definition of item difficulty for that student.
Of course it is possible for questions in the domain to meaningfully differ in how difficult they are in the normal meaning of the word 'difficult.' If these differences really are too big to disregard them, the domain can be split into two subdomains of approximately equally 'difficult' questions, and the SPA model applied to the two subdomains.
It might seem a false start for the model to be built on the notion of mastery levels known to the student without error. But that is not exactly the way the binomial model will be used in the SPA model. The approach chosen in the SPA model will be to test a number of possible levels of mastery, and see what happens. In no way is the traditional psychometric notion of true mastery involved here. (On the relation between classical theory's true score and construct scores as implied in construct validity, see Borsboom and Mellenbergh, 2002 pdf.)
Nevertheless, I will grant there is something platonic involved in the exact level of mastery as something given. There will be more abstract notions in the model. Being a model of strategic behavior, the SPA model is a predictive model, and what can be more abstract than the future? The problem of induction is involved here: the sun always having risen in the East, will it do so tomorrow? What is worse, it is not at all sure how to interpret outcomes on the test or examination as a success or a failure for the prediction, except by aggregating the results for a group of students. But the model is one of individual decision-making, remember? Individuals can't wait for the answers to these philosophical objections; they have to act now, and they had better act on a clear model than on no model at all. See also Hofstee (1979) on the issues involved here. Observing the regularity in strategic student behaviors, it is certain they do have their own models, unarticulated as they are (Becker, Geer and Hughes, 1968, did some articulating). The SPA model might just be the kind of model that allows quantification of the experience they have accumulated in many years of schooling and of the tested life. The issues at stake here have been articulated in research on decision-making; see for example Koehler and Harvey's (2004) Blackwell handbook of judgement and decision making for a comprehensive treatment of the normative, descriptive and prescriptive decision-making issues that the SPA model also is concerned with.
Assessment, more often than not, is not a simple activity. It consists of scoring the correctness of answers, weighing the worth of this or that correct answer against each other, and translating these scores into grades, ranks, or whatever. Scoring the correctness of answers in itself might simply be a question of expert judgment, but in the educational situation the expert judgment inevitably will get mixed up with judgments on the progression of the group of students or of this particular student. To keep things manageable, assume that expert judgment dominates, and that experts agree among each other in this kind of judgment. For the same reason, assume that differences in the value of correct answers can be expressed in the form of weights assigned to the items or the answers involved. The implementation of the SPA model does not allow weights of individual items to differ from each other. What it does allow, however, is that there are subdomains containing items having the same weight, possibly differing from the weight of items in another subdomain. Grading can be regarded in at least two complementary ways. The first aspect of grading is captured in the form of utility curves, treated in chapter four. The second aspect of grading is its possibly subjective nature, one assessor being more or less lenient than another in the grading process. This leniency affects the grading at the test level, and might be accommodated in model studies by reducing the strength of the information on mastery that otherwise is available, as already suggested earlier. The magnitude of the reduction in some typical situations will have to be indicated yet, based on the results of available empirical studies.
the heterogeneous item domain
If one or another actor should want to divide the knowledge domain of the item bank into different parts, differing in item difficulties or whatnot, the SPA model can be applied to each of the parts separately, partial mastery now being defined on the particular part. It is an empirical or analytical question whether or not such splitting up of the domain makes any sense to the student choosing a preparation strategy. It is easy to see that stratified sampling of items from the subdomains might make a small difference, but it will be small indeed. Giving subdomains different weights by fixing the relative numbers of items in each subdomain, and randomly sampling from the lot, still is a binomial procedure.
- the received view, such as Lord and Novick p. 250-251
Simonton borrows from Csikszentmihalyi the terms domain and field. The domain of a scientific discipline "consists of a large but finite set of facts, techniques, heuristics, themes, questions, goals, and criteria," together "the population of ideas that make up a given domain." "The field consists of all those individuals who are working with the set of ideas that define the domain."
The point of the contribution by Simonton is that "the central features of productivity across and within careers can be explicated by assuming that creativity operates like a stochastic combinatorial procedure." In combination with recent neurological models of learning (Anderson's ACT model; parallel distributed processing, Rumelhart and McClelland, 1988), Simonton's idea applies to achievement in education as well, provided assessment is not abused for all kinds of purposes alien to the growth of the individual pupil.
The definition of mastery given earlier may be extended to subdomains of the domain of test items. The individual student then might have a different level of mastery according to which subdomain is considered. Criteria for subdivision of the item domain may be anything that the user of the SPA model deems important. One such criterion might be item difficulty in the classical sense. Recognizing subdomains then is a broader concept of categorizing test items than by their group-defined difficulties alone.
Suppose two subdomains have been identified, the student's mastery of the first subdomain being a and that of the second b. In the random sampling case, let the probability that the next item comes from the first subdomain be p, and from the second subdomain q. The probability that the next item will be answered correctly by the student is p * a + q * b. The probability of also answering correctly the second item drawn is again p * a + q * b. Etcetera. In other words, in the case of random sampling from subdomains the appropriate model still is the binomial model, mastery of the domain now being a weighted sum of subdomain mastery levels.
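A quick numerical illustration, with values chosen only for the example: with p = q = 0.5, a = 0.8 and b = 0.6, every sampled item is answered correctly with probability 0.5 * 0.8 + 0.5 * 0.6 = 0.7, so the test score simply follows the binomial distribution for mastery 0.7.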
stratified sampling from subdomains
The subdomain concept has been described above. Stratified sampling differs radically from random sampling; in fact, stratified sampling reduces to the case of multiple tests, let us call them subtests. In the case of two subdomains, the student having mastery a and b respectively, one subtest of x items is sampled from the first subdomain, another subtest of y items from the second.
The complication of the stratified sampling case is that the combined test score no longer follows the binomial distribution: the sum of two binomial scores is not itself binomial, unless the mastery parameters have equal values. The simulation, however, is quite straightforward: add the two simulated subtest scores. The theoretical analysis in the case of two subtests uses the two binomial distributions and the fact that they are independent of each other, because their mastery levels are assumed to be known. The bivariate distribution is constructed by evaluating the probability of every possible combination of subtest scores, and summing the probabilities into the vector of possible test scores. The advanced applet at the end of this webpage offers the opportunity to do this kind of exercise by choosing 'subdomains' instead of 'second set'.
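The convolution just described is easy to sketch in Java, reusing the hypothetical binomialPmf helper shown earlier in this chapter:

    /** Score distribution of two combined subtests: subtest 1 has n1 items
        and mastery m1, subtest 2 has n2 items and mastery m2. */
    static double[] combinedDistribution(int n1, double m1, int n2, double m2) {
        double[] total = new double[n1 + n2 + 1];
        for (int x = 0; x <= n1; x++) {
            for (int y = 0; y <= n2; y++) {
                // independence of the subtests: multiply the two binomial probabilities
                total[x + y] += BinomialMath.binomialPmf(n1, x, m1)
                              * BinomialMath.binomialPmf(n2, y, m2);
            }
        }
        return total;
    }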
item weights in 2nd subdomain
It is not unusual for tests to be constructed from two or more different kinds of items, for example a number of multiple choice items, and a smaller number of constructed response questions or essay questions. This is another reason to regard the test domain as composed of two or more different subdomains, and sampling to be stratified from the subdomains.
The advanced applet offers the opportunity to assign a special weight to items in the second subdomain. The illustration shows the result for a test composed of 40 multiple-choice questions and 4 problems that are scored either 0 or 10. In the learning applet 5 the concept of item complexity will be treated: problems that are scored 0 or 10 must be rather complex items. Complex items can be scored in either of two ways: 1) giving credit for partial knowledge, in which case the difference between them and less complex multiple-choice questions will be blurred (one can treat them as collections of simpler items), or 2) giving credit only to answers that are complete and correct. The two scoring methods have an important differential impact on the strategic situation students find themselves in, and thereby on the results and the quality of education.
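Assigning a weight to second-subdomain items only changes the index at which the probabilities accumulate in the convolution sketched above. For the illustration in the text one would call this hypothetical variant with arguments (40, m, 4, m, 10):

    /** As the previous sketch, but items of the second subdomain carry weight w,
        e.g. four problems scored 0 or 10 next to forty one-point questions. */
    static double[] weightedDistribution(int n1, double m1, int n2, double m2, int w) {
        double[] total = new double[n1 + w * n2 + 1];
        for (int x = 0; x <= n1; x++) {
            for (int y = 0; y <= n2; y++) {
                total[x + w * y] += BinomialMath.binomialPmf(n1, x, m1)
                                  * BinomialMath.binomialPmf(n2, y, m2);
            }
        }
        return total;
    }

The lumpy shapes remarked on below arise because the total score is a mixture of five shifted distributions, one for each possible number of problems solved.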
The remarkable shape of the distributions in the illustration follows directly from the special composition of the test, and may serve as a reminder that it should never be taken for granted that score distributions have a simple bell-shaped form.
closing remarks on stratified sampling of subdomains
The option of stratified sampling from two subdomains will be implemented also in the remaining six modules of the SPA. Basically this is possible because this option is just a special kind of combination of two separate tests.
Because scores on the two subtests will be added to give a total score (full compensation), the utility function (applet 4 The Ruling) will be a single one on these total scores.
Another special characteristic of the two subdomains case is that it allows the student to strategically assign her time more to one of the subdomains, less to the other. The technique of indifference curves can be used to model the strategic situation (in the strategist module 7a, yet to be implemented; the indifference curves technique has been used already in my 1995).
group statistics
If you are prepared to use a beta function to represent the distribution of mastery in a group or a subgroup of students, then the betabinomial model applies, and you may use applets 2 and 3 to evaluate this case. Note that, again, the concept of item difficulty will not be needed.
Assessment of achievement is a thoroughly contingent process. Scores or grades obtained tend to be contingent on a myriad of events and circumstances, the more important of them being 'who does the assessing' and 'who else will be assessed together with me.' There is no such thing as perfectly fair grading of achievements, as Arrow has proved—for Arrow's Impossibility Theorem see Vassiloglou and French (1982). The problem then is how to deal with these contingencies. The first option is to take all particular circumstances as given and as somehow absorbed into the definition of mastery. The second option is to regard things such as the differential weighting of items in particular subdomains and the 'reference point' on the total score scale as the operational definition of whatever is contingent in this particular test situation. In defence of these exercises in closing one's eyes to important contingencies, one might take this assumption simply as a first step in the analysis of the strategic situation presenting itself to the individual student, the analysis itself taking on a kind of iterative character, eventually resulting in more realistic assumptions as more information comes in. A good example of this kind of iteration is the analysis (module four) starting with objective utility functions defined by the rules of the examination—in particular the rules on the combination of scores or grades into total scores or GPAs—and using these to develop second generation, truly individual utility functions (modules eight and nine) on whatever will be regarded to be the results of the assessment or examination.
A special situation is that where the student has a choice of the items to answer, such as choosing the themes on which to write short essays from a larger number of themes offered. For a preliminary analysis using instruments of the SPA model, just ignore this particular circumstance. This kind of testing procedure might offer the student special strategic opportunities of choice, depending on the way the results of participating students will be treated or ranked. See Wood and Wilson (1980), and the literature referred to there, on the intricacies of this kind of free-choice situation. In module four, the ruling, more will have to be said about the kind of problems facing the student aware of the strategic opportunities and difficulties.
Playing around somewhat with the generator applet 1, you will notice that the baseline of the plot varies with the number of items you declare the test to have. In this way it is possible to give every bar of the chart the same width. The other way around would have been possible also, keeping the baseline the same length no matter the number of items. Because the number of screen pixels is finite, it would not then be possible to give every bar the same width, and the plot would look ugly and in some cases might even be misleading.
Just in case you might not already have noticed: the world's events, things and beings tend to be discrete in character, grainy if you want. Continuity is a thoroughly metaphysical concept. Modelling in the field of achievement is a continuing struggle with the discrete structure of the object of the modelling effort.
The concept of probability is a tricky one. I will use a short introduction by Kyburg (1990, chapter three) as my basic reference. Since the SPA model does not make use of subjective probabilities, these are not an issue here (the interested reader might consult Staël von Holstein, 1974). The binomial distribution, of course, is a concept in classical statistics. Special points will arise in other modules, for example the concept of likelihood in the module of that name.
I will have to explain here what it is, exactly, to ‘randomly sample’, whether from a binomial distribution, or from a set of items.
Henry E. Kyburg, Jr. (1990). Science & reason. Oxford University Press.
C-A. Staël von Holstein (Ed.) (1974). The concept of probability in psychological experiments. Reidel.
Just for the record: the main use of the binomial model is as a workhorse in more involved models, such as the SPA model. Even so, it is an important instrument in itself. Maybe the most important application is in introducing the student to the world of statistics, and in introducing the teacher to the chance aspects of testing for achievement.
To assume mastery to have a known value seems a rather platonic thing to do. It may help to study a situation where one and the same concrete piece of work, such as a Latin translation, is rated independently by a number of raters. Lots of data of this kind have been produced in the research reported by Hartog and Rhodes (1936). One of these data sets is presented below. As the ranked version shows, raters tend to agree on the 'best' and 'worst' work delivered, showing less agreement on the bunch of intermediate cases. Ranking the works within raters takes away some of the differences in leniency between raters, which seems fair to the raters, and at the same time unfair to the pupils, who in their daily school life will have to deal with their one and only Latin teacher. Yes, it is also possible to rank raters within pupils, which will show differences in leniency between raters (not shown here). The dataset is available in this applet 1: option 193600 as 'simulation' will plot all data; option 193601 will plot the data of pupil 1, etc., up to 193615; 193621 will plot the data of rater 1, etc., up to 193633.
Marks awarded by thirteen independent raters to the same fifteen pieces of work (left half: raw marks, maximum 60; right half: the same marks ranked within raters):

    Obj. | Marks by raters 1-13                    | Ranks within raters 1-13
    -----+-----------------------------------------+----------------------------------------
      1  | 27 32 45 27 32 30 33 40 41 35 44 35 27  |  9 11  5  8 10 10  6  6 13  6  9  6 10
      2  | 32 40 47 26 35 41 29 40 48 36 45 34 33  |  7  5  4  9  8  4 10  6  8  4  8  8  9
      3  | 34 40 44 36 38 39 28 42 55 30 52 43 42  |  6  5  6  5  5  5 12  4  2 10  2  3  3
      4  | 08 27 26 16 24 19 18 21 35 18 20 21 23  | 13 12 14 12 13 13 14 14 15 13 15 13 13
      5  | 23 39 42 26 35 36 40 35 50 31 48 32 42  | 10  8  8  9  8  7  3 10  6  9  6 10  3
      6  | 32 40 40 29 36 36 35 37 47 35 49 38 41  |  7  5 10  7  7  7  5  9  9  6  4  5  5
      7  | 40 45 51 42 46 44 46 43 54 47 52 47 48  |  4  2  3  4  2  2  1  3  4  2  2  2  2
      8  | 55 56 59 51 53 49 45 51 57 52 54 52 53  |  1  1  1  1  1  1  2  1  1  1  1  1  1
      9  | 10 33 35 22 31 29 32 31 47 24 41 30 25  | 12 10 11 11 11 11  8 11  9 11 11 11 11
     10  | 35 40 43 45 41 39 29 38 50 32 42 34 38  |  5  5  7  3  3  6 10  8  6  8 10  8  7
     11  | 15 26 33 15 20 14 29 22 44 24 36 19 23  | 11 14 12 13 14 14 10 13 11 11 12 14 13
     12  | 43 40 53 46 40 44 33 47 55 41 47 41 38  |  2  5  2  2  4  2  6  2  2  3  7  4  7
     13  | 43 37 42 36 28 40 38 40 50 35 49 35 38  |  2  9  8  5 12  5  4  6  6  6  4  6  7
     14  | 07 27 31 13 19 22 23 26 43 10 35 23 23  | 15 12 13 14 13 12 13 12 12 15 13 12 13
     15  | 08 20 20 08 13 12 16 18 37 13 31 15 15  | 13 15 15 15 15 15 15 15 14 14 14 15 15
The point I want to make in showing the variability of ratings of the same work by independent assessors is this. The outcome of the process of grading one particular translation is heavily influenced by chance. In this example there is nothing mysterious about what is given: it is the translation delivered by, say, student #1, not some platonic 'mastery.' It is also perfectly clear that independent raters do not agree very well with each other in the rating of this particular translation, yet they have clear instructions on how to rate the works and they are experienced raters. In daily practice the translation of this student will be graded by her teacher only, making the grade obtained partly the result of a chance process: which of these thirteen raters in the Hartog and Rhodes experiment happens to be her teacher?
It is possible to use the applet to produce the kind of score distribution shown here for individual students/works. Look first at the scale of the ratings: the maximum is sixty, and raters use the whole range of the sixty-point scale. It is immediately evident that a scale of sixty points does not adequately reflect the precision of the ratings: a scale of ten points would be better. Using sixty points in a situation where ratings are seen to scatter wildly, and therefore to be imprecise, is not a wise thing to do. It should be possible to use the generator applet to produce the kind of score distribution in the figure: declare the test to have ten items, and simulate thirteen times, assuming mastery to be, for example, 0.50. Figure 2 shows one result. What does the imitation tell you about the precision of the ratings in the Hartog and Rhodes experiment? A Latin verse translation of 22 lines seems to be rated approximately as accurately as a ten-item test indicates the achievement of a student known to have mastery 0.50. In a nutshell this example illustrates an important reason—next to sheer cost—why essays and translations have been replaced with objective tests on such a massive scale in the twentieth century. Yet this flight into objectivity might be a case of myopia, for it remains to be seen whether the combination of a larger number of essays—over the year, or even over more school years—is not as good as or even better than the combination of an equal number of objective tests. The SPA model is built to answer just that kind of question.
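For readers without the applet at hand, the exercise is easy to imitate in a few lines of code. The following is a minimal sketch of my own (class and variable names are mine, it is not the applet's code): thirteen independent raters, each modelled as one run of a ten-item test at an assumed mastery of 0.50.

    // A sketch, not the applet's code: imitate the Hartog and Rhodes ratings by
    // simulating thirteen independent raters, each one modelled as a ten-item
    // test taken on a work of assumed mastery 0.50.
    import java.util.Random;

    public class RaterImitation {
        public static void main(String[] args) {
            Random random = new Random();   // seeded from the clock
            int items = 10;                 // the ten-point scale suggested above
            double mastery = 0.50;          // assumed mastery of this particular work
            for (int rater = 1; rater <= 13; rater++) {
                int score = 0;
                for (int item = 0; item < items; item++) {
                    if (random.nextDouble() < mastery) score++;   // one Bernoulli trial per item
                }
                System.out.println("rater " + rater + ": " + score + " out of " + items);
            }
        }
    }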
These data from Hartog and Rhodes (1936) will be used for illustrative applications of the second and third modules as well.
Misunderstanding not having been made use of here, it effectively is assumed to be zero. I will have to redo the exercise, assuming misunderstanding to have a realistic value.
While guessing is known to lower the validity of tests (Lord & Novick, 1968, p. 304), other things being equal, it is not generally known that guessing heightens the risk of failing under pass-fail scoring for students having satisfactory mastery. The figure shows a typical situation. The test has 40 three-choice items; its cut-off score in the no-guessing condition is 25, in the forced-guessing condition 30. The remarkable thing is that the probability for a student having mastery 0.7 to fail the 25 score limit is 0.115, while the probability to fail the 30 score limit under forced guessing is 0.165. Of course, the 0.7 mastery student might be better at guessing than the 33% rule assumes, but that is not the point here; in that case the cut-off really should be higher than 30, aggravating the problem yet further.
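The two probabilities are easy to verify. Here is a minimal sketch of my own (not the applet's code; the recursive evaluation of the binomial distribution is explained in the Java code section below). Under forced guessing on three-choice items the success probability per item becomes p + (1 - p)/3, here 0.7 + 0.3/3 = 0.8.

    // A sketch, not the applet's code: the failure probabilities of the
    // guessing example, computed with the recursive evaluation of the
    // binomial distribution (valid for 0 < p < 1).
    public class GuessingRisk {

        // P(X < cutoff) for X ~ binomial(n, p): the probability to fail
        static double failProbability(int n, double p, int cutoff) {
            double term = Math.pow(1.0 - p, n);   // f(0) = (1 - p)^n
            double sum = 0.0;
            for (int x = 0; x < cutoff; x++) {
                sum += term;
                term *= (double) (n - x) / (x + 1) * p / (1.0 - p);   // f(x+1) from f(x)
            }
            return sum;
        }

        public static void main(String[] args) {
            int n = 40;
            double mastery = 0.7;
            double withGuessing = mastery + (1.0 - mastery) / 3.0;    // 0.8 on three-choice items
            System.out.println("fail at cutoff 25, no guessing:     "
                    + failProbability(n, mastery, 25));               // about 0.115
            System.out.println("fail at cutoff 30, forced guessing: "
                    + failProbability(n, withGuessing, 30));          // about 0.165
        }
    }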
The binomial model is not strictly necessary to argue the case, of course, but it helps to be able to quantify the argument. Suppose the student is allowed to omit questions she does not know, meaning she will not be punished for this behavior but instead will obtain a bonus of 1/3 point for every question left unanswered. Students having satisfactory mastery will have a reasonable chance to pass the test. Those passing will do so while omitting a certain number of questions. It is perfectly clear that some of these students would fail the test if they nevertheless had to guess on those questions. In the same way, some mastery students initially having failed the test might pass it by guessing luckily. This second group is, however, much smaller than the first one, and they still have the option to guess. The propensity to guess is higher the lower the expected score on tests is; see Bereby-Meyer, Meyer, and Flascher (2002).
The amazing thing about this argument is that I do not know of a place in the literature where it is mentioned. There has of course been a lot of research on guessing, on omissiveness, and on methods to 'correct' for guessing, but none whatsoever on this particular problem. That is remarkable, because students failing a test might claim they have been put at a disadvantage by the scoring rule that answers left open will be scored as wrong. This is the kind of problem that should have been mentioned in every edition of the Educational Measurement handbook (its last, excuse me, next to last edition: Linn, 1989). Lord and Novick (1968, p. 304) mention the problem of examinees differing widely in their willingness to omit items; the interesting thing here is their warning that requiring every examinee to answer every item in the test introduces "a considerable amount of error in the test scores." The analysis above shows that in the particular situation of pass-fail scoring this added error puts mastery students at a disadvantage.
Understanding probability is difficult. Even statisticians fail to understand the chances involved in taking tests and examinations in education, let alone students and teachers, or parents and politicians. Intuitions about binomial chances in throwing dice or playing cards can cost one dearly. The same will be the case in preparation for an achievement test of n items, which is equivalent to n throws of a coin loaded with the mastery of the student involved. Therefore the literature on naive statistics, and on how to change from naive concepts to scientific ones, will be relevant here.
This paragraph asks the reader to answer a series of questions, or discuss them with colleagues, to test for and develop further understanding of the binomial process as applied to model important aspects of achievement testing. Many questions do not have definite answers, or should be answered in an intuitive way, or by estimating what a good answer might be. The idea is that coming to understand the probability aspects of assessment involves change of naive concepts about them—what chances are you talking about, then?—to clear, definite and testable concepts. It will be a struggle of the mind, tentatively formulating what might be the case, refining insights, confronting them with actual facts, integrating them with whatever it is that you have learned in a lifetime about the tested life.
The boxed questions have been discussed by Paul Newton (2005). Having thought about the questions yourself, it might be revealing to read the abstract of his article, or the article itself.
The boxed questions have been labelled, suggestively, 'chance, probability, error, or what?' Well, what do you think; have you played around somewhat with the simulations of tests offered by applet 1? Do one such simulation, then look at where 'error' might be. Surely you will see chance working out; you might interpret the resulting frequency plot as probabilities to get this or that particular score, or a pass on this test. But 'error'? Is it an error for the mean of the simulated score distribution to depart from what mastery has been assumed to be? How would you know? In this particular case you have assumed mastery to be known, therefore you are able to compare the observed mean with the assumed mastery, and might be tempted to call the difference error. Well, don't. There is no error whatsoever in generating test scores on a mastery assumed to be given, unless I have made an error in the computer program. All variability that you will see is the result of sampling items from the item domain, a perfectly legitimate procedure. There is no error involved in this sampling itself. Of course, much larger samples might be taken, resulting in tests being more predictable to the students. Never forget this: the sampling of test items from the item domain results in variability that definitely is not 'in error.'
The items themselves might be at fault, however. That would count as real error. In the binomial model that kind of error is absorbed in the definition of mastery: mastery is the proportion of items in the domain that would be answered correctly. Whatever the quality of those items.
The student might have a severe headache, harming her test results. That would count as real error. In the binomial model that kind of error is also absorbed in the definition of mastery, which should be read to mean that the items in the domain would be answered under the very same circumstances—including personal health—under which the test itself is taken. This is part of the platonic scheme; nothing much turns on it but your understanding this to be so.
The student's work might be mistaken to be that of another student. That will count as serious error too, a kind of error that happens every now and then. This kind of error is not in the binomial model at all.
The list might be continued, for the number of different kinds of possible errors might be infinite. The point is: in achievement testing there will always be errors, lots of small ones and some serious ones, and there will always be sampling variability, a lot of variability for small tests, less variability for large ones or for combinations of smaller ones.
The applet, simple as it is, nevertheless is in itself a powerful instrument to probe one's implicit ideas about the probabilistic character of tests and testing, see for example Tversky and Kahneman (1971 html). To this purpose the reference point has been added to the applet, together with the proportion of cases scoring equal to or above the reference, to be interpreted as the probability of scoring at least equal to the reference.
It will be a sobering exercise for a specific situation to do the following:
In pass-fail testing, were the student to have mastery equal to the reference (= the cutoff score level, taken as a proportion), the probability to pass the test will be close to .5, no matter how short or long the test. Throwing a coin would do the job about as well.
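A quick check by simulation, in a sketch of my own making (not the applet's code): set mastery equal to the cutoff proportion and watch the pass rate stay in the neighborhood of one half, a little above it for short tests, whatever the test length.

    // A sketch, not the applet's code: simulated pass rates when mastery
    // equals the cutoff proportion, for tests of increasing length.
    import java.util.Random;

    public class CoinFlipCheck {
        public static void main(String[] args) {
            Random random = new Random();
            double mastery = 0.6;            // cutoff at 60 percent of the items
            int runs = 100000;               // simulated test takers per test length
            for (int n : new int[] {10, 20, 40, 80}) {
                int cutoff = (int) Math.round(mastery * n);
                int passes = 0;
                for (int run = 0; run < runs; run++) {
                    int score = 0;
                    for (int item = 0; item < n; item++) {
                        if (random.nextDouble() < mastery) score++;
                    }
                    if (score >= cutoff) passes++;
                }
                System.out.printf("n = %2d, cutoff = %2d: pass rate %.3f%n",
                        n, cutoff, (double) passes / runs);
            }
        }
    }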
The assumption that the mastery of the student is known, to you, or to the student herself, is a rather platonic one. To get more realistic assumptions, the next applets in the SPA model will be needed.
[I applied the binomial concept in the first attempt to formulate the SPA model, in 1978. Because it describes a basic process in achievement testing, every serious model probing behind the brute empirical facts of testing must allow the binomial model a place in its inner workings.]
[In the 1980s I applied this binomial concept in projects in the departments of law and dentistry at the University of Amsterdam, to give first-year students a global idea of the risks involved in strategies that deny the statistical insecurities following from tests composed of a limited number of items, sometimes items of bad quality on top of that.]
strong true-score theory's neglect of the student decision maker
In psychometrics the binomial model is used in strong true-score theory, for example in Lord and Stocking (1976). Assuming the binomial model to hold allows strong inferences from observed score distributions. Loosely speaking, observed scores from groups of students have a larger standard deviation than would result if all students were equal in mastery. Using operations research techniques, Lord and Stocking were able to use the observed difference to obtain good intervals for true scores. It is quite remarkable to see how much energy and intelligence is invested in this kind of after-the-fact data mining, while at the same time the position of the students as prime actors is completely disregarded. Read that article, both for its emphasis on the binomial model and for its neglect of students as strategic actors.
sample fluctuation
In the received view, sample fluctuations in individual scores detract from the reliability of the test scores in the interindividual interpretation. They are even the biggest detractor of reliability in almost all of its variants in the received view.
There is something rotten here, however. The point is, there is nothing erroneous in sampling questions from the domain, however small the sample may be. The sampling does not in itself introduce any kind of error. The individual total test score may contain some error, sure, but it is not the sample itself that is 'in error.' Faulty questions, wrong keys, ambiguous wording of questions, questions not being specific to the course content: all of these are errors. Assuming the total score to be free of this kind of error, it is an exact score. The mainstream psychometric literature does not recognize this fact. In a recent article, Newton (2005) addresses issues of measurement inaccuracy, carefully using the term 'inaccuracy' instead of 'error.' I have yet to study the article.
Refusing to view sample fluctuations as measurement errors implies that analyses of the quality of selective decisions in terms of 'right' and 'wrong' decisions are vacuous. This is a kind of issue that has been known, of course, in the decision-making literature, because selective decisions—as well as pass-fail scoring—seem to imply the use of threshold utility functions. Wrongly so, see my 1980a and 1980b.
The point to emphasize here, however, is that achievement tests—being combinations of two or more test items—are trying to 'measure' proportions. No, the individual test item in the individual student model does not try to do so; it simply but validly measures—if designed well—whether or not the student masters the specific content the item is about. This item validity is the kind of validity conceived by Borsboom, Mellenbergh and van Heerden (2004). Whether the item is reliable depends on the quality of its design, wording, layout, printing; all of these risk being in error in one way or another.
The idea that the generation of test scores is a binomial process is tightly connected with the choice of the student as the main actor. Once the student is seen as an active decision maker confronting a situation whose results are not deterministic, it is evident that the test score can be conceived as the result of a binomial process.
This idea was presented in my (1978), and it must implicitly have been present in the initial presentation of the tentamen model by Bob van Naerssen (1970 html).
Also from 1978 is the idea to use simulation as a didactic device. During the 1980s the idea was used in projects in the departments of dentistry and law of the University of Amsterdam, informing students in their first year of the insecurities and ambiguities in the propaedeutic examination (schemes like those in figures 1.2 and 1.3 above were used then). In the 1990s, computer power made it possible to use simulation techniques as a research tool in the development of the model itself, the idea being to construct in this way an independent check on complex analytical results, or even to replace mathematical analysis in situations that might be impossible to model in analytical terms.
The Java code sections present the routines that are crucial to the particular module's operations. For the generator module these are the random number generator used, the simulation of the binomial process, and the numerical evaluation of the theoretical binomial distribution function.
The Java code fragments will be presented in the format of screen shots of the program code. Therefore, some housekeeping chores in this code will not be explained in the text. Java itself will not be explained either; the interested reader can consult Palmer (2003) or the textbook authored by the developers of BlueJ, among many other sources.
The claim that the generator module uses a random number generator to produce simulated test score distributions should not be taken at face value. The code presented here makes it clear which random number generator is used, where it has been documented, and how the random numbers are used to produce item and test scores.
The random number generator is rather sophisticated, see the Sun documentation (java.sun.com/j2se/1.4.2/docs/api/java/util/Random.html) and Knuth (1981) for the fine details. It is possible to call the routine billions and billions of times and obtain numbers that for all practical purposes in the modelling of achievement testing behave as truly random numbers.
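In outline the simulation step looks as follows. This is a reconstruction for illustration (names are mine, it is not the screenshotted applet code): one call of nextDouble() per item, n item scores per test score, and the test scores collected in a frequency distribution.

    // A sketch, not the applet's code: how the random numbers are used to
    // produce item scores, test scores, and a simulated score distribution.
    import java.util.Random;

    public class BinomialSimulation {
        public static void main(String[] args) {
            Random random = new Random();   // the java.util.Random generator documented by Sun
            int n = 40;                     // items in the test
            double mastery = 0.7;           // probability of answering one item correctly
            int observations = 10000;       // number of simulated test scores

            int[] frequency = new int[n + 1];
            for (int obs = 0; obs < observations; obs++) {
                int score = 0;
                for (int item = 0; item < n; item++) {
                    if (random.nextDouble() < mastery) score++;   // one random number per item
                }
                frequency[score]++;
            }
            for (int score = 0; score <= n; score++) {
                System.out.println(score + "\t" + frequency[score]);
            }
        }
    }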
In time the generator instrument will offer more options. For certain purposes it should be possible to use the same seed for two or more simulation runs, for example. Another option would be a less sophisticated but much faster random number generator, maybe the one available as Math.random() in java.lang.Math (see here), or a custom-made one.
The code for routines such as plotting and evaluating statistics will not be presented here. The user will always be able to check the correctness of those hidden routines by running the program and using appropriate and sometimes extreme parameter values.
The analytical plot is the binomial distribution. The mathematical function is simple and treated in general statistical textbooks: f(x; n, p) = [n! / (x! (n - x)!)] p^x (1 - p)^(n - x). The mean of the binomial distribution is the number of items n times mastery p, that is np; the variance is np(1 - p).
Because the binomial function contains factorials, the numerical evaluation of the function is rather complex. It is possible to use routines derived from algorithms in Press, Flannery, Teukolsky and Vetterling (1989), in particular betacf (the continued fraction for the incomplete beta function), betai (the incomplete beta function) (both p. 153-155), and gammln (the natural logarithm of the gamma function) (p. 177). Instead, a recursive routine is used, as shown. The recursive routine follows directly from the formula for the binomial distribution function.
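The recursion rests on the ratio of successive probabilities: f(x+1)/f(x) = ((n - x)/(x + 1)) (p/(1 - p)), so the whole distribution follows from f(0) = (1 - p)^n by repeated multiplication. A minimal sketch of such a routine, my reconstruction rather than a copy of the applet's screenshotted code:

    // A sketch, not the applet's code: recursive evaluation of the binomial
    // distribution, starting from f(0) = (1 - p)^n; valid for 0 < p < 1.
    public class BinomialDistribution {

        // returns f(0), f(1), ..., f(n) for X ~ binomial(n, p)
        static double[] probabilities(int n, double p) {
            double[] f = new double[n + 1];
            f[0] = Math.pow(1.0 - p, n);
            for (int x = 0; x < n; x++) {
                // f(x + 1) = f(x) * (n - x)/(x + 1) * p/(1 - p)
                f[x + 1] = f[x] * (n - x) / (x + 1.0) * p / (1.0 - p);
            }
            return f;
        }

        public static void main(String[] args) {
            double[] f = probabilities(10, 0.5);
            for (int x = 0; x < f.length; x++) {
                System.out.printf("f(%d) = %.4f%n", x, f[x]);
            }
        }
    }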
A fine introduction to Java is:
Grant Palmer (2003). Technical Java. Developing scientific and engineering applications. Upper Saddle River, NJ: Prentice Hall.
My programming environment is the free BlueJ interactive Java environment. "The BlueJ environment was developed as part of a university research project about teaching object-orientation to beginners. The system is being developed and maintained by a joint research group at Deakin University, Melbourne, Australia, and the University of Kent in Canterbury, UK. The project is supported by Sun Microsystems." It is easier to use and faster than Apple's own interactive programming environment Xcode.
proof of correctness
Edsger W. Dijkstra, if I remember well, remarked that program testing can show the presence of errors, but never their absence; a strict proof of correctness is another matter entirely. Intuitively it is evident that trying to prove the correctness of a program results in a proof that is itself a kind of computer program, and its correctness would need yet another proof ... Kurt Gödel is famous for his treatment of this kind of logical problem.
Therefore, the correctness of the programs or applets will be under permanent scrutiny. In the 'Testing the applet' paragraphs I will indicate some methods to obtain indications of the absence of logical and other errors in the applet.
random numbers
The random number generator used is a critical part of the simulation of the binomial process. The point here is not whether the simulation results approach the analytical results better as the number of observations grows; a faulty random number generator might be perfectly capable of producing such results. The critical test of a random number generator is whether the generated numbers may be regarded as random. Such a test is available, but has not (yet) been implemented in the applet. In the older Pascal version of the SPA program it was implemented, but that is of no use here; there is no alternative test you might do on the results of the applet. Of course there are theoretical results on the qualities of the random number generator used here; the code paragraph above has the reference.
simulation results approaching the analytical result
A test that applies to most of the other applets as well is whether simulation and analytical results approach each other as the number of observations is chosen larger. If there is a systematic and unexplained difference, then something is amiss in the program. The test is rather powerful, because the two methods are largely independent of each other. Some housekeeping chores, of course, will be shared: the plotting procedure, for example, might be faulty in a way that equally distorts the results from the simulation and the analysis.
The Generator generates binomial distributions. Therefore the means and variances should approach the theoretical values as the number of observations is chosen larger. Remember that the statistics reported in the analytical case are the results of evaluating the distribution as generated, NOT of evaluating the formulas for the mean and variance of the binomial distribution. This test would be better if the third and fourth moments were output as well, which is not the case and will not be implemented in the future: too much detail in the output would divert attention from the main points of the method implemented, illustrating the stochastic character of test scores and enabling the user to investigate the situation by changing parameters, etcetera.
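As an illustration of the check, a sketch of my own (not the applet's code) comparing the simulated mean and variance against the theoretical np and np(1 - p):

    // A sketch, not the applet's code: the simulated mean and variance should
    // approach np and np(1 - p) as the number of observations grows.
    import java.util.Random;

    public class MomentsCheck {
        public static void main(String[] args) {
            Random random = new Random();
            int n = 20;
            double p = 0.6;
            int observations = 200000;

            double sum = 0.0, sumOfSquares = 0.0;
            for (int obs = 0; obs < observations; obs++) {
                int score = 0;
                for (int item = 0; item < n; item++) {
                    if (random.nextDouble() < p) score++;
                }
                sum += score;
                sumOfSquares += (double) score * score;
            }
            double mean = sum / observations;
            double variance = sumOfSquares / observations - mean * mean;
            System.out.printf("mean:     simulated %.3f, theoretical %.3f%n", mean, n * p);
            System.out.printf("variance: simulated %.3f, theoretical %.3f%n",
                    variance, n * p * (1.0 - p));
        }
    }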
July 23, 2005. The idea to give the theoretical distribution the same coarseness as the simulated one has proved to be disastrous for the exactness of the results in the more complex modules to follow (for reliable analytical results the number of 'pseudo' observations should be set very high, in fact so high that simulating that number would take too much time for comfort). Therefore the program will have to be changed, and several of the illustrations in these chapters will have to be replaced.
Another major change in the programming will be to combine all modules in the strategy applet, making it possible to do all kinds of exercises that now still have to be done in their own dedicated applet.
For small numbers of observations odd results might be produced due to rounding. Remember that the theoretical distribution is given the same 'coarseness' as the simulated distribution, by multiplying it by the number of observations and making the resulting frequencies integer valued. The latter procedure is simple rounding, and results in rounding errors that generally will not cancel each other. A simple count, as in the illustrated case, can then result in a 'missing' value in the theoretical distribution. Of course it is possible to design an algorithm that always results in the theoretical distribution having the same number of 'observations' as the simulated distribution; leaving the rounding out would solve the problem also. But that is not the point. The point is that comparing a small number of (real or simulated) observations against its theoretical counterpart is a tricky business. To the student, however, that is what life is: a small number of tests is all she will have to think about.
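The effect is easily reproduced, in a small sketch of my own (not the applet's code): give the theoretical binomial(10, 0.5) distribution the coarseness of 25 observations, and the rounded frequencies sum to 24, one 'observation' going missing.

    // A sketch, not the applet's code: rounding the theoretical frequencies to
    // integers need not preserve the number of observations.
    public class RoundingLoss {
        public static void main(String[] args) {
            int n = 10, observations = 25;
            double p = 0.5;
            double f = Math.pow(1.0 - p, n);   // f(0)
            int total = 0;
            for (int x = 0; x <= n; x++) {
                int frequency = (int) Math.round(f * observations);  // integer-valued frequency
                total += frequency;
                if (x < n) f *= (n - x) / (x + 1.0) * p / (1.0 - p);
            }
            // the rounding errors do not cancel: the total comes out at 24
            System.out.println("sum of rounded frequencies: " + total + " of " + observations);
        }
    }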
H. Becker, B. Geer and E. C. Hughes (1968). Making the grade: the academic side of college life. New York: Wiley. (first 68 pages on books.google.com)
Denny Borsboom and Gideon J. Mellenbergh (2004). Why Psychometrics is not Pathological. A Comment on Michell. Theory & Psychology, 14. pdf
Denny Borsboom, Gideon J. Mellenbergh and Jaap van Heerden (2002). Functional thought experiments. Synthese, 130, 379-387. pdf
Denny Borsboom, Gideon J. Mellenbergh and Jaap van Heerden (2003). The theoretical status of latent variables. Psychological Review, 110, 203-219. pdf
Denny Borsboom, Gideon J. Mellenbergh and Jaap van Heerden (2004). The concept of validity. Psychological Review, 111, 1061-1071. pdf
John B. Carroll (1987). New perspectives in the analysis of abilities. In Royce R. Ronning, Jane C. Conoley, John A. Glover, and Joseph C. Witt (Eds.) (1987). The influence of cognitive psychology on testing. Buros-Nebraska Symposium on Measurement and Testing. Volume 3 (pp. 267-84). Hillsdale, New Jersey: Erlbaum.
Makes one think. Jack is looking for an opening for the development of a technology of designing the difficulty of test items. I think he found a good one. This does not touch directly on the SPA model, though. If all items in the item set were to be upgraded in difficulty, the mastery as defined on the upgraded set simply would be somewhat lower. The point of the SPA model is that it models the world as perceived by the student, not the world of the designer of test items. See my work on test item design on this website here; in chapter one of the Dutch text I will go into the possibilities opened up by this article by Jack Carroll.
For further annotations on this chapter see my carroll1987.htm
Lee J. Cronbach and Goldine C. Gleser (1957/1965). Psychological tests and personnel decisions. Urbana, Illinois: University of Illinois Press.
Philip Hartog and E. C. Rhodes (1935). An examination of examinations. London: Macmillan.
Philip Hartog and E. C. Rhodes (1936). The marks of examiners. London: Macmillan.
D. Hestenes, M. Wells and G. Swackhamer (1992). Force Concept Inventory. The Physics Teacher, 30, 141-158. http://www.modeling.asu.edu/R&E/FCI.PDF [the link is dead (2-2008), search the Arizona State University site itself]
Marian Hickendorff, Willem Heiser, Cornelis van Putten, Norman Verhelst (2008). Solution Strategies and Achievement in Dutch Complex Arithmetic: Latent Variable Modeling of Change. Psychometrika online.
Christiaan Huygens (1657). De ratiociniis in ludo aleae. In Frans van Schooten, Exercitationum mathematicarum. Leiden: Johannes Elsevirius. (1660). Van Rekeningh in Spelen van Geluck. In Frans van Schooten, Mathematische Oeffeningen. Amsterdam: Gerrit van Goedesbergh. Recently reprinted (in Dutch): (1998). Van rekeningh in spelen van geluck. Utrecht: Epsilon. Translated and introduced by Wim Kleijne. Also in the complete works, volume 14, available as pdf at http://gallica.bnf.fr/
Max Jammer (1999). Concepts of Mass in Contemporary Physics and Philosophy. Princeton University Press.
"The book begins with an analysis of the persistent difficulties of defining inertial mass in a noncircular manner and discusses the related question of whether mass is an observational or a theoretical concept. (...)Destined to become a much-consulted reference for philosophers and physicists, this book is also written for the nonprofessional general reader interested in the foundations of physics."
Available (in 2005) as e-book $9.95 (Adobe Reader format) | ISBN: 1-4008-0406-X
His earlier (1961) book on the subject is Concepts of Mass in Classical and Modern Physics. Cambridge, MA: Harvard University Press, reissued by Dover, as have been his Concepts of force and Concepts of space, the history of theories of space in physics.
Donald E. Knuth (1969/1981). The art of computer programming. Volume 2, Seminumerical algorithms. Amsterdam: Addison-Wesley.
Derek J. Koehler and Nigel Harvey (Eds.) (2004). Blackwell handbook of judgement and decision making. Blackwell.
Ellen Condliffe Lagemann (2000). An elusive science: The troubling history of education research. University of Chicago Press.
J. M. Linacre (2001). Percentages with Continuous Rasch Models. Rasch Measurement Transactions, 14, 771. html
Robert L. Linn (Ed.) (1989). Educational Measurement. Third edition. New York: American Council on Education / Macmillan.
Frederic M. Lord and Melvin R. Novick (1968). Statistical theories of mental test scores. London: Addison-Wesley. (Chapter 23)
Frederic M. Lord and Martha L. Stocking (1976). An interval estimate for making statistical inferences about true scores. Psychometrika, 41, 79-87. preview
G. J. Mellenbergh (1971). Een onderzoek naar het beoordelen van open vragen. Nederlands Tijdschrift voor de Psychologie, 26, 102-120.
E. Matthew Schulz, Won-Chan Lee and Ken Mullen (2005). A domain-level approach to describing growth in achievement. Journal of Educational Measurement, 42, 1-26. html.
R. F. van Naerssen (1970). Over optimaal studeren en tentamens combineren. Inaugural address. html
Paul E. Newton (2005). The public understanding of measurement inaccuracy. British Educational Research Journal, 31, 419-442.
Grant Palmer (2003). Technical Java. Developing scientific and engineering applications. Upper Saddle River, NJ: Prentice Hall.
W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling (1989). Numerical recipes in Pascal: the art of scientific computing. Cambridge: Cambridge University Press.
W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling (1992, 2nd edition). Numerical recipes in C: the art of scientific computing. Cambridge: Cambridge University Press.
Dean Keith Simonton (2003). Scientific creativity as constrained stochastic behavior: The integration of product, person, and process perspectives. Psychological Bulletin, 129, 475-494.
Amos Tversky and Daniel Kahneman (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105-110. html
Marilena Vassiloglou and Simon French (1982). Arrow's theorem and examination assessment. British Journal of Mathematical and Statistical Psychology, 35, 183-192.
Ben Wilbrink (1977). Het verborgen vooroordeel tegen andere dan meerkeuze vraagvormen. In Stichting Onderwijsresearch: Congresboek Onderwijs Research Dagen (p. 219-222). html
(1978). Studiestrategieën. Examenregeling deel A. Amsterdam: COWO (docentenkursusboek 9). 800k pdf
Revised version 2004, already available for chapters 1 through 4. html
Ben Wilbrink (1997). Assessment in historical perspective. Studies in Educational Evaluation, 23, 31-48. html
Robert Wood and Douglas T. Wilson (1980). Determining a rank order when not all individuals are assessed on the same basis. In L. J. Th. van der Kamp, W. F. Langerak and D. N. M. de Gruijter (Eds.): Psychometrics for educational debates (p. 207-230). Wiley.
Afra Zomorodian (1998). The Monty Hall problem. (unpublished?) pdf
Miller, C. M. L., & Parlett, M. (1974). Up to the mark: A study of the examination game. London: Society for Research into Higher Education. [apparently not in the Leiden University Library, nor in the KB]
John R. Bergan and Clement A. Stone (1985). Latent class models for knowledge domains. Psychological Bulletin, 98, 166-184.
From the abstract: "This article reviews the use of latent class models in testing hypotheses of importance in validating the structure of knowledge domains."
Beth Davey, George B. Macready (1990). Applications of Latent Class Modeling to Investigate the Structure Underlying Reading Comprehension Items. Applied Measurement in Education, 3, 209-229. "This article demonstrates the usefulness of latent class modeling in addressing several measurement issues, particularly those related to hierarchical structures among skills." [I have not seen this one]
Bereby-Meyer, Y., J. Meyer, and O.M. Flascher (2002). Prospect theory analysis of guessing in multiple choice tests. Journal of Behavioral Decision Making, 15, 313-327.
Denny Borsboom and Gideon J. Mellenbergh (2002). True scores, latent variables, and constructs: A comment on Schmidt and Hunter. Intelligence, 30, 505-514. pdf
Denny Borsboom (2005). Measuring the Mind. Conceptual Issues in Contemporary Psychometrics. Cambridge University Press.
R. L. Brennan and M. T. Kane (1977). Signal/noise ratios for domain-referenced tests. Psychometrika, 42, 609-625.
p. 609: "The word domain is particularly appropriate here, since our development is largely based upon principles of random sampling from a specified domain (or universe) of items and the nature of scores referenced to such a domain." p. 611: "It will be useful in the following discussion to make a distinction between the degree of precision or the noise in measurement, and the relative precision or dependability of measurement. The noise reflects the magnitude of the errors that actually arise in a particular measurement procedure. Usually, however, an estimate of noise alone is not very useful. It is also necessary to know the degree of precision required for the kind of decision under consideration."
Brink, Wulfert P. van den (1977). Het verken-effect. Tijdschrift voor Onderwijsresearch, 2, 253-261. [in Dutch]
Brink, Wulfert P. van den (1979). Het optimale aantal antwoorden per item. Tijdschrift voor Onderwijsresearch, 4, 151-158. [in Dutch]
Brink, Wulfert P. van den (1982). Binomial models in the theory of mental test scores. Evaluation in Education: An International Review Series, 5, 165-176.
Brink, Wulfert P. van den (1982). Binomiale modellen. Dissertation. University of Amsterdam. [in Dutch]
Brink, W. P. van den, and P. Koele (1980). Item sampling, guessing and decision making in achievement testing. British Journal of Mathematical and Statistical Psychology, 33, 104-108. https://doi.org/10.1111/j.2044-8317.1980.tb00781.x ("It is concluded that multiple choice tests of usual length are highly fallible decision aids.") pdf
Groot, A.D. de (1970). Some badly needed non-statistical concepts in applied psychometrics. Nederlands Tijdschrift voor de Psychologie, 25, 360-376. Partly available in html
Gruijter, Dato N. M. de (1987). Wilcox’ closed sequential testing procedure in stratified item domains. Methodika, 1, 3-12.
Hofstee, Willem K. B. (1979). Drogredenen met betrekking tot individuele kansuitspraken. Kennis en Methode, 433-445. [in Dutch]
Linden, Wim J. van der (1979). Binomial test models and item difficulty. Applied Psychological Measurement, 3, 401-411.
Samuel A. Livingston and Charles Lewis (1995). Estimating the consistency and accuracy of classifications based on test scores. Journal of Educational Measurement, 32, 179-197.
Mellenbergh, Gideon J. (1996). Measurement precision in test score and item response models. Psychological Methods, 1, 293-299.
Joel Michell (2000). Normal Science, Pathological Science and Psychometrics. Theory & Psychology, 10, 639-667. pdf
Molenaar, W. (1973). Simple approximations to the Poisson, binomial and hypergeometric distributions. Biometrics, 29, 403-407. Also: Amsterdam: Stichting Mathematisch Centrum.
Molenaar, W. (1977). On Bayesian formula scores for random guessing in multiple choice tests. British Journal of Mathematical and Statistical Psychology, 30, 79-89.
Nozick, Robert (1993). The nature of rationality. Princeton: Princeton University Press.
Niels H. Veldhuijzen (1980). Difficulties with difficulties. On the betabinomial model. Tijdschrift voor Onderwijsresearch, 5, 145- (shows that equal item difficulties are not required for the beta-binomial model to hold).
Ad H. G. S. van der Ven (1969). The binomial error model applied to time-limit tests. Dissertation, University of Nijmegen. Gives a nice treatment of the difficulty of questions, and of the standard error of measurement under the binomial model (ch. 7: Some implications of the theory with respect to the evaluation of individual precision scores).
Abraham Wald (1947/1973). Sequential analysis. Dover.
Ch. 5: Testing the mean of a binomial distribution (acceptance inspection of a lot where each unit is classified into one of two categories).
Ch. 6: Testing the difference between the means of two binomial distributions (double dichotomies).
Alfred North Whitehead. Process and reality.
Ben Wilbrink (1995). A consumer theory of assessment in higher education; modelling student choice in test preparation. 6th European Conference for Research on Learning and Instruction, Nijmegen. Paper available from the author. html
F. W. Wilmink and K. Nevels (1982). A Bayesian approach for estimating the knowledge of the item pool in multiple choice tests. British Journal of Mathematical and Statistical Psychology, 35, 90-101.
This quincunx, https://www.mathsisfun.com/data/quincunx.html, shows beautifully how scores on tests of 9 questions come about at different adjustable levels of mastery (e.g., left/right 10% corresponds to mastery 90%). A brilliant little model!
simulator: https://www.geogebra.org/m/qkucpmmn
Wiki binomial
http://memory.psych.purdue.edu/models/lottery [link broken? 2-2008] Lottery Simulator
QuickCalcs: Random number generator
QuickCalcs: Binomial, Poisson and Gaussian distributions
Binomial Probabilities
http://www.fon.hum.uva.nl/Service/Statistics/Binomial_distribution.html?p=0.5&x=12&N=15 Cumulative frequency for the Binomial distribution
SISA: Gamma & Beta
Statiscope by Mikael Bonnier.
http://www.stat.uiuc.edu/courses/stat100/java/DataApplet.html [link broken? 2-2008] The Data Applet
This will be the first and the last time an applet is presented directly in the chapter text. All SPA applets have been collected in the special spa_applets.htm page. If the applet refuses to appear here, see the top of that page for possible reasons as well as actions.
Mail your opinion, suggestions, critique, experience on/with the SPA
http://www.benwilbrink.nl/projecten/spa_generator.htm