Strategic preparation for achievement tests
A model

Ben Wilbrink

model as well as text under construction — most applets are already available

Warning. The Java applets have been compiled under a Java version that has since been declared obsolete because of security vulnerabilities. I have not yet been able to update them to current Java or to construct applets based on JavaScript. For simple analyses it is of course possible to use Wolfram's generators for the binomial distribution (score distribution given mastery), the beta density (likelihood for mastery), and the betabinomial distribution (predictive score distribution); I will present the necessary links and documentation. Any questions: contact me. It is my experience that this innovative work attracts zero attention (nada, niente); if you are interested, then you are the exception, so do not hesitate to contact me.


The model to be developed here makes the most of what Cronbach and Gleser (1957) called the individual point of view in testing, as contrasted to the institutional one. I need not explain that mainstream psychometrics, as applied in educational assessment, is especially concerned with testing as instrumental in institutional decision-making (such as admissions and examinations) or even political decision-making (such as that following from the U.S. No Child Left Behind Act). The suggestion that students might be decision-makers also, in the sense of students strategically preparing for their tests, is seldom heard. Bob van Naerssen (1970) probably was the first to articulate a model for this kind of strategic behavior. The most significant application of this kind of model is in the quantitative evaluation of the qualities of assessments and examinations, making it possible to develop variants of assessment procedures that are less wasteful of the time of students and of staff. The analogy here is to evaluating the effects of personnel selection using the decision-theoretic approach advocated by Cronbach and Gleser (for example, see my 1990 simulation of complex selection procedures for techniques as well as for the literature).

The presentation in these project pages follows the construction of the model itself, and the actual buildup of the model from module one until at least module nine. This is not an ideal kind of presentation for all purposes. Therefore, another text will be prepared here in the article format, emphasizing the main points in the model itself and therefore taking the reverse order by starting at the highest module, nine, while leaving out many of the details from the following project pages. The article will emphasize the kind of educational problems to be solved, and the characteristics of the SPA model that make it an apt instrument to tackle these problems. One of the proofs of the feasibility of the model is its construction itself, as accomplished in this document.

The model can be applied to almost all kinds of assessments where the assessed have been able to prepare themselves for the questions set. Because the generation of item scores itself is at the heart of the model, it can be used to research many kinds of causal relations in the assessment process. Therefore the model allows designing assessments to be highly efficient with respect to the goals of the designer. Are you one? Students will be able to find optimal strategies in the assessment situations they find themselves in. In fact, an important principle of design is that students should be able to find their optimal strategies in a natural way. Another such principle is to design the assessments in such a way that what is optimal to students (in the short term) also is optimal from the institutional viewpoint and to the long-term interests of students.

Take the student, individually or collectively, to be the main actor in educational assessment. Surprising? Not at all, because so much depends on how, and how well, students prepare for their exams. Put yourself in her (or his) place for just a moment, sitting this test. Whether a question is one she can answer or not is, from her individual point of view, a chance process that has been known for centuries as the binomial process, conditional on her mastery of the material the questions are based on or sampled from. This special state of affairs allows the building of mathematical models for optimal preparation strategies, given the rules on the combination of test scores in the examination, course, or curriculum.

That mastery scores typically are rather far removed from the extremes of perfect mastery or none at all is a characteristic feature of most educational assessment, making every test item very much a chance event. This state of affairs puts a high premium on the availability of a mathematical model, especially for teachers and institutions to be able to design assessment procedures so as to further the efficiency of the educational enterprise itself.

Finding or developing optimal strategies is a decision-theoretic problem. Here again educational assessment constitutes a highly specific decision situation. Because of its clear combination rules for test scores, it is possible to translate these rules into objective utility functions representing the teacher's (or institution's) valuing of test outcomes. And because of its clear statistical process it is possible to easily evaluate expected utility curves. The objective utility structure can be exploited to find minimal cost strategies, cost here being the time that is to be invested in preparation for the tests of the course or examination. Van Naerssen (1970) was the first to follow this approach. Having evaluated all possible minimal cost strategies, it is possible to replace the formal utility functions with real utility functions. As far as I know, this is the first time (as of January 2006) in educational research that realistic utility functions have been constructed. The formal type of utility function, of course, has been known for longer, threshold utility being just such a type of function. More involved utility functions have been presented in the literature, just assuming some mathematical function to be adequate to the task, or even using wild guesses of teachers and students (the Münchhausen approach to decision-theoretic test theory).

The theory of decision-making, or expected utility theory, is a highly idealized model of human decision-making. Almost never will the ordinary human being use a train of reasoning that even remotely resembles what is involved in expected utility theory; see, for example, Gigerenzer and Selten (2001) on the strategies people do use, or Hogarth (2001) on strategies people can learn to use. The contribution by Gary Klein (2001) is especially sobering; in tone the article amazingly resembles that of Patrick Suppes on statistical decision-making. He comes up with promising alternatives such as progressive deepening (studied by De Groot, 1946 and 1965, in the strategies of grandmasters of chess), a technique that demands expert knowledge in its execution. In developing the SPA model in the past decades I have experienced a progressive deepening of insight in educational assessment. In presenting the SPA model my hope is that it may trigger progressive deepening of insight in the reader looking for principles of design in educational assessment.

The SPA model is however closely related to expected utility theory, and in no way resembles the bounded rationality and satisficing approaches students must use in real school life. The SPA model, then, should be viewed as an instrument that will clarify the situation the students find themselves in. As such it can pinpoint serious discrepancies between ideal decision-theoretic and satisficing strategies of students, should they exist (they probably do in many situations). More importantly, it allows the teacher or institution to look into changes in assessment procedures that will allow students to use their time much more efficiently, even if using only satisficing strategies.

The received view on achievement testing (a somewhat narrower concept than assessment) has been documented in the standard work of Lord and Novick (1968) Statistical theories of mental test scores and in numerous textbooks on educational measurement (a slightly broader concept than achievement testing) based on the psychometric canon. The received view is a theory of the world according to authority: teachers, admissions officers, politicians. The position of individual students as autonomous individuals and responsible actors is missing here. A decision-theoretic approach as advocated in Cronbach and Gleser's (1965) Psychological tests and personnel decisions offers the possibility to choose the individual student as the actor of interest. Robert van Naerssen, who in the 1965 edition of Cronbach and Gleser was represented with a summary of his decision-theoretic study on the selection of truck drivers in the Dutch army, grabbed the decision-theoretic opportunity and presented in the early seventies the outline of a theory called the tentamen model.

The general theory of achievement testing, Strategic Preparation for Achievement Tests (SPA), is a further development of Van Naerssen's tentamen model. As such it has been presented in English outline in my (1995) A consumer theory of assessment in higher education; modelling student choice in test preparation. It is a theory about optimal strategies available to students preparing for exams. Because of that it is also a theory on examinations and individual achievement tests, on how to construct and regulate them so as to achieve the best possible educational results. Methods of assessment have always been regarded as an important aspect of the quality of education; see my (1997) Assessment in historical perspective.

The general outline of the model and many of its interesting theoretical and practical points have been presented in many publications and papers since the mid-seventies of the last century. A definitive and comprehensive publication, however, is not yet available. The techniques for evaluating the optimal strategies available to the student have not yet been rounded out or presented to the public.

The SPA models strategic student behavior. This does not mean, however, that it is a decision-making instrument for students, though it might be used as one. The strength of the model is that it shows how strategic behavior changes with changes in the parameters of testing and examination situations. The SPA, in other words, is an instrument that enables researchers, advisors, faculty or student representatives to design educational assessment in rational ways, optimizing the achievement of important educational goals. In this way the SPA is a model that assumes human behavior to be rational. In actual practice human beings do their decision-making in evidently non-rational ways, most of the time. There is a vast literature on psychological decision-making that surely is relevant to the question whether the SPA is a model that might be used as an approximation to realistic student behavior. I will short-circuit the question by sticking to the 'rational man' assumption and using results of empirical research to test predictions or hypotheses following from the SPA. The SPA should correctly predict the kind and direction of change in student strategies following changes in the examination system or rules. Strategic student behavior is the kind of behavior described by Becker, Geer and Hughes in their Making the grade: the academic side of college life. Student strategic behavior has been shaped by many years of individual experience. On a short-term basis this strategic behavior may not seem to be dictated by rational considerations; in the long term it surely is not irrational.

Analogous models?

The model builder runs the risk of trying to invent the kind of wheel that in some neighboring field has been known already for decades. I am not aware of any such models. Nevertheless, over the past decades I have tried to keep an open mind in this respect. It might just be possible, to mention one candidate, that in behavioral ecology models have been developed (for predators) that are analogous to the modeling of the strategic position of students having to take tests and exams. Mangel and Clark (1988) discuss (par. 9.1) the situation of an individual predator evaluating the quality of his hunting site, using the rate of prey arriving at this site (in the spider's web). This situation looks perfectly analogous to that of the student preparing herself for an exit exam, and having to evaluate her own mastery. Every known item is prey caught; is the rate of catching prey high enough to go and sit the exam?

What you may expect to find in the project pages to follow is

This presentation of the project is very much a 'work in progress,' in this format begun in November 2004. The bulk of the work is the re-programming of many hundreds of pages of Pascal code in, hopefully, a hundred pages of Java. The first run will give you the rudimentary modular programs that in a second and third run will be refined and extended. The same approach will be used regarding the text that will present the model, its scientific grounding, and its potential uses in solving problems in educational practice.

You are invited to contact me if you would like to use materials that have not yet been presented in these project pages. Keep me informed if you would like to use the applications in real-world situations. Of course I can't take responsibility for the way the instruments will be used, or for the correctness of the instruments (always check the latest version that has been made available).

The ethical position that I have chosen in developing this theory is the following. A theory that is to be used by educational practitioners must be thoroughly understandable to them, and it must be possible to explain to students the actions taken on the basis of the theory. This is one strong reason to use simulation techniques instead of mathematics to develop the theory as well as the instruments. Another strong motivation is to develop instruments that empower teachers instead of making them dependent on good faith in experts.

Modules in the SPA

Module 1 The Generator: Insecurities quantified

the generator

+ applet

The model of strategic preparation for achievement tests (SPA) is implemented both as a computer simulation and, independently, as an analytical model. The generator that powers the computer simulation is the simulation of test scores assuming the level of mastery to be known. The generator makes it possible to choose a number of possible levels of mastery and see what might happen. Of course, the distribution generated is the binomial distribution, the oldest distribution in the history of statistics. The greater the number of runs chosen, the closer the simulated distribution will approximate the theoretical distribution. The important thing here is that the user of the computer application needs to understand only the chance mechanism of flipping coins to make sense of the obtained distribution; there is no need to study the intricacies of combinations and permutations involved in the binomial formula for the number of successes in a series of throws.

See the chapter on the Generator for its scientific underpinning and its applications.

The bar chart pictured here illustrates one such primitive simulation. The picture is a screenshot taken from the interactive instrument, which you will find on the applets page; there you may use the applet to run your own analyses and simulations. The menu will allow you to change the parameters of the simulation. There is also an advanced edition of the applet, offering the opportunity to simultaneously produce plots on two different sets of parameter values.

The simulation technique is a convenient device for obtaining results while sidestepping the mathematics involved in the statistical model fitting the particular situation. In this case the statistical model is the binomial distribution that will be plotted after choosing 'analysis' and 'start.' Simulation and analysis are completely independent of each other (using the same plotting procedure, though), yet the higher the number of runs or observations chosen, the more closely the plot of the simulation will approach that of the analysis.
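The relation between simulation and analysis can be tried out in a few lines. The following is a minimal sketch in Python, not the code of the applet; the function names `simulate_scores` and `binomial_pmf`, the seed and the number of runs are assumptions chosen for illustration, and the parameter values are those of the WolframAlpha example (20 items, mastery 0.6).

```python
import random
from math import comb

def simulate_scores(n_items, mastery, runs, seed=0):
    """Simulate `runs` test scores by flipping one biased coin per item."""
    rng = random.Random(seed)
    counts = [0] * (n_items + 1)
    for _ in range(runs):
        score = sum(rng.random() < mastery for _ in range(n_items))
        counts[score] += 1
    # Relative frequency of every possible score 0..n_items
    return [c / runs for c in counts]

def binomial_pmf(n_items, mastery):
    """Analysis: binomial probabilities for the scores 0..n_items."""
    return [comb(n_items, k) * mastery**k * (1 - mastery)**(n_items - k)
            for k in range(n_items + 1)]

simulated = simulate_scores(n_items=20, mastery=0.6, runs=20000)
exact = binomial_pmf(n_items=20, mastery=0.6)

# Largest difference between a simulated and an exact bar of the chart
largest_gap = max(abs(s - e) for s, e in zip(simulated, exact))
```

Raising `runs` should shrink `largest_gap`, which is the numerical counterpart of the simulated plot approaching the analytical one.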

Of course, the next test will result in one and only one score for this student. The binomial distribution represents her chances of scoring within a particular range or exactly this or that result. Other interpretations are possible, such as a simulation of 10 tests representing a possible result on all the tests to be taken in a particular period by this student, assuming mastery always to be known to have this particular value, or a simulation of 20 tests to represent what the scores might look like of twenty students all having the same supposedly known mastery.

It might come as a surprise to you to see the scores wandering off so far from the expected result, that being the proportion correct corresponding to the supposedly known mastery. But this is the point of the attempt to model assessment, isn't it: because we humans are not very good at estimating probabilities, let alone combinations of them, we need apt instruments to assist us in doing so.


A recent development is the availability of WolframAlpha and its possibilities to evaluate statistical distributions, such as the binomial distribution with number of items 20 and mastery 0.6. Try some other values for the parameters.


Module 2 The Mastery Envelope: the Likelihood of it all


+ applet

No model without data to feed it. The data consist of the information the student has about her mastery of the domain of questions that might be asked. That information might be anything relevant, but for definiteness available information will be operationalized as the score obtained on a trial test of a certain number of items. The more items, the better the information.

The plot shows the 'area' that envelops where mastery is likely to be located. Technically it is called a likelihood. The higher the curve, the more likely the corresponding value of mastery. The most likely value is the value corresponding to the highest point of the curve. See the module two text for further explanation.

The analytical plot may be obtained by evaluating the beta function, if that applies, or by exact analysis using the binomial distribution repeatedly (the advanced applet offers a choice between the two).
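As an illustration of the analytical route, here is a minimal sketch in Python that evaluates the beta density for a trial test with 32 items correct and 8 wrong, the values of the WolframAlpha example (parameters 33 and 9). The function name `beta_density` and the grid of 999 mastery values are assumptions made for this sketch; it is not the routine used in the applet.

```python
from math import lgamma, log, exp

def beta_density(m, correct, wrong):
    """Beta(correct + 1, wrong + 1) density at mastery m, for 0 < m < 1."""
    a, b = correct + 1, wrong + 1
    # Work with logarithms to avoid overflow for longer trial tests
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(m) + (b - 1) * log(1 - m))

# Mastery values 0.001, 0.002, ..., 0.999
grid = [i / 1000 for i in range(1, 1000)]
envelope = [beta_density(m, correct=32, wrong=8) for m in grid]

# Highest point of the curve: the most likely value of mastery
most_likely = grid[envelope.index(max(envelope))]

# Crude check that the curve encloses an area of about one
area = sum(envelope) / 1000
```

The highest point of the curve falls at the proportion correct on the trial test, here 32 out of 40.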


Information on the beta density:

WolframAlpha and the beta distribution, parameters number correct + 1 = 33 and number false + 1 = 9: Try some other values for the parameters.


Module 3 The Predictor: Place your bets


+ applet

Given some information on mastery it is now possible, using the Generator and the Mastery Envelope, to predict the range within which the result on the upcoming test will fall with a certain probability. The prediction again takes the form of a distribution function.

The spread of the distribution pictured might be a surprise: there is a sizeable chance for the student to fail the test, i.e. to score below the reference of 40. The trouble with measurement on the basis of small samples is that its outcome is not as predictable as it should be in an ideal educational world.

There are two givens that determine the precision—or lack of it—of the prediction: the preliminary information available to the student, and the number of items in the summative test. Click the figure to see the givens—parameter values—used in this case.

The technique to obtain the prediction by simulation is straightforward: randomly sample mastery values from the Mastery Envelope and use the Generator to simulate a testscore on the basis of every mastery value sampled.

Analytically the predictive distribution in many cases (stratified sampling from subdomains excluded) is a betabinomial distribution whose parameters are the number of items in the summative test, and the number correct and the number wrong on the preliminary test. The betabinomial model no longer applies as soon as learning is added to the model; therefore a general routine to exactly evaluate likelihoods as well as predictions will be used in the model.
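Both routes to the prediction can be sketched in Python. This is a minimal sketch under stated assumptions, not the general routine of the model: it covers only the case without learning, the function names are hypothetical, and the parameter values are those of the WolframAlpha example (12 correct and 4 wrong on the preliminary test, 60 items in the summative test). The score of 40 used at the end is for illustration only; the parameter values behind the figure are not restated here.

```python
import random
from math import comb, lgamma, exp

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabinomial_pmf(n_items, correct, wrong):
    """Analysis: betabinomial probabilities for the scores 0..n_items."""
    a, b = correct + 1, wrong + 1
    return [comb(n_items, x) * exp(log_beta(x + a, n_items - x + b) - log_beta(a, b))
            for x in range(n_items + 1)]

def simulate_prediction(n_items, correct, wrong, runs, seed=0):
    """Simulation: sample mastery from the envelope, then generate a score."""
    rng = random.Random(seed)
    counts = [0] * (n_items + 1)
    for _ in range(runs):
        mastery = rng.betavariate(correct + 1, wrong + 1)
        score = sum(rng.random() < mastery for _ in range(n_items))
        counts[score] += 1
    return [c / runs for c in counts]

predicted = betabinomial_pmf(n_items=60, correct=12, wrong=4)
simulated = simulate_prediction(n_items=60, correct=12, wrong=4, runs=20000)

# Chance of a score below 40 on the summative test
chance_below_40 = sum(predicted[:40])
```

Comparing `predicted` with the binomial distribution for mastery 12/16 shows how much extra spread the uncertainty about mastery adds to the prediction.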


Information on the betabinomial distribution:

WolframAlpha and the betabinomial distribution, parameters number correct + 1 = 13 and number false + 1 = 5, number of items = 60: Try some other values for the parameters.


Module 4 The Ruling: How the result will count (his master's voice)


+ applet

Another kind of information needed for the SPA model is that about the testing or examination situation. How will the results on the test be graded, and what do these grades contribute to pass-fail decisions on the examination or end-of-year results? In the terminology of decision theory this is about utility functions.

The lucky situation in educational assessment is that the examination rules almost always will have been spelled out meticulously. This ruling will easily and quite remarkably translate in an objective way into utility or utility functions. The objectivity of these utility functions is remarkable because in the literature only subjective utility functions, or subjectively chosen utility functions, have been used. The applet illustrates the technique that will be applicable to almost any educational assessment situation.

Quite generally almost every testing situation in education is a case of threshold utility in combination with a certain range about the threshold (neutrally called the reference point in the SPA model) where higher results on one test may compensate lower ones on another. The extreme cases of course are pass-fail testing (no compensation at all) and the grade point average system (allowing almost perfect compensation).

The applet allows test scores to be grouped for grading purposes, and compensation between grades to be asymmetrical around the cutting score. Be aware of the factual character of utility: it is not a free-floating abstraction; utility always is taken to be valid for a particular student in a particular situation. For example, the negative or positive compensation allowed on a particular test is the freedom that is left to the student in question after taking account of any positive or negative compensation points she has left over from earlier tests, or has the opportunity to obtain in subsequent tests.

In the pure threshold utility case the function allows the evaluation of the chances to pass the test; the expected utility equals the probability of a pass. The generalization for the compensatory case is straightforward. In the full compensation case the expected utility equals the expected score (or grade, if that applies).
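The two extreme cases can be checked numerically. The sketch below, in Python, is only one possible parametrization and not the one used in the applet: the function `threshold_utility`, its arguments `neg`, `pos` and `point_value` (the value of one compensation point), and the reference score of 11 on a 20-item test with known mastery 0.6 are all assumptions made for illustration.

```python
from math import comb

def binomial_pmf(n_items, mastery):
    """Score distribution for known mastery."""
    return [comb(n_items, k) * mastery**k * (1 - mastery)**(n_items - k)
            for k in range(n_items + 1)]

def threshold_utility(score, reference, neg=0, pos=0, point_value=0.1):
    """Utility 0 for a fail, 1 at the reference score.

    Up to `neg` points below and `pos` points above the reference count as
    compensation points, each worth `point_value` (hypothetical scaling).
    """
    if score < reference - neg:
        return 0.0
    capped = min(score, reference + pos)
    return 1.0 + point_value * (capped - reference)

def expected_utility(pmf, utility):
    """Probability-weighted sum of the utilities of all possible scores."""
    return sum(p * utility(s) for s, p in enumerate(pmf))

pmf = binomial_pmf(n_items=20, mastery=0.6)

# Pure threshold utility: expected utility equals the probability of a pass
pass_fail = expected_utility(pmf, lambda s: threshold_utility(s, reference=11))
chance_of_pass = sum(pmf[11:])

# Full compensation: utility is the score itself, expected utility is the expected score
full_compensation = expected_utility(pmf, lambda s: float(s))

# In between: two compensation points allowed on either side of the reference
with_compensation = expected_utility(
    pmf, lambda s: threshold_utility(s, reference=11, neg=2, pos=2))
```

Here `pass_fail` coincides with `chance_of_pass`, and `full_compensation` with the expected score of 20 times 0.6.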

The utility structure of an examination program or curriculum is an important determinant of the effectiveness of the curriculum. The chapter will present an example of this special application of utility functions.

The problem in this approach is that the student has to deal with utility as determined by the ruling on the way test scores will be combined to determine the course or examination outcome. Therefore these utilities in an important sense represent the valuations of the teacher or the institution, not the true values scores would have for the individual student. Call these utilities therefore first generation utilities. For the moment they are the best we have. Once it is possible to evaluate optimal strategies, the 'true' utilities scores have for the student can be made explicit. These will be called second generation utilities, and will be treated in chapter 9. First and second generation utilities will differ from each other. It might be an important institutional goal to prevent discrepancies being so big as to result in efficiency problems.

Module 5 Learning: Curves of Insight


+ applet

In a way the Predictor is all the student needs to decide whether she still has to invest more time in preparation for the test. It would be nice, though, to have some insight in the amount of time needed to reach a more satisfactory level of expected outcomes. It is clear that the static Predictor should be the kick-off for any attempt to bring in some dynamics by assuming a learning model. This module presents the learning model options in isolation from the Predictor. The next module will combine the two to model the path of expectations.

The learning model assumes mastery to be known, and learning to be deterministic according to the particular model chosen. The prototype presents a choice between two kinds of learning model, called the accumulation model and the replacement model (Mazur and Hastie, 1978). The advanced applet offers the opportunity to plot a second curve or second set of curves according to the second specification for the parameters.

The interpretation in this modelling is as follows. Take the number correct score on a preliminary test as your value for mastery; after all that is your best bet. Preparation time comes in next: it is all the time spent in preparation for the preliminary testing. It is arbitrarily lumped together and assigned the value of 'one episode.' The excess of episodes specified is the possible future trajectory. The number of bars specifies into how many parts every episode has to be chopped up to produce a reasonably smooth learning curve.

The problem with the accumulation as well as the replacement model is that they do not fit the complex learning in most curricula: they produce learning curves steepest at the beginning point and levelling off ever after. The SPA model tries to solve the problem elegantly by assuming that test items can be answered correctly only by knowing a certain number of basic facts or events, called the items' complexity. Mastery will still be defined on the complex items in the test and in the item bank, but learning is defined on knowledge of the underlying basic facts or events. Making test items more complex will produce curves that are level to begin with, sloping upward and only thereafter beginning to level off. Again, see the chapter's text for the fine points and scientific underpinning.
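The shapes of these curves can be sketched in Python. The sketch rests on assumptions that are mine and are not spelled out in this overview: the replacement model is read as an exponential curve and the accumulation model as a hyperbolic one, both calibrated to pass through the current mastery after one episode, and a complex item is taken to be answered correctly only if all of its basic facts are known independently of each other. The function names are hypothetical.

```python
def replacement(mastery_now, episodes):
    """Exponential curve: every episode removes the same fraction of what is unknown."""
    return 1.0 - (1.0 - mastery_now) ** episodes

def accumulation(mastery_now, episodes):
    """Hyperbolic curve t / (t + a), calibrated on mastery_now at one episode.

    Requires 0 < mastery_now < 1.
    """
    a = (1.0 - mastery_now) / mastery_now
    return episodes / (episodes + a)

def complex_item_mastery(mastery_now, episodes, complexity, curve=replacement):
    """Learning acts on basic facts; an item needs `complexity` facts to be known."""
    # Mastery of the basic facts implied by the observed mastery of complex items
    basic_now = mastery_now ** (1.0 / complexity)
    return curve(basic_now, episodes) ** complexity

bars = 5       # parts per episode
episodes = 3   # one past episode plus two possible future ones
times = [i / bars for i in range(episodes * bars + 1)]

simple_curve = [replacement(0.6, t) for t in times]
hyperbolic_curve = [accumulation(0.6, t) for t in times]
complex_curve = [complex_item_mastery(0.6, t, complexity=3) for t in times]
```

With `complexity=3` the first steps of `complex_curve` gain far less than the later ones, whereas `simple_curve` gains most in its very first step.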

Module 6 Expectation: Expected Utility along the Learning Path


+ applet

Figure 6.1 Expected utility functions over three episodes for the replacement (blue) and accumulation (red) models [applet]

The learning curve is used to project expected utilities further into the future than the moment of having taken a preliminary test. The applet does just that: plot the expected utility for every episode and for every point of the grid specified for every episode. In order to be able to plot the curve of expected utilities, the predictive test score distribution has to be generated for every one of these points (either by simulating or by analysing it), so that its expected utility can be evaluated.

The interpretation of the expected utility curves is that the point of inflection of the curve, if it does have one, indicates a good strategic point to aim for. The verdict on what are good or better strategies cannot be final yet, however, because somehow the cost of time yet to be invested has to be reckoned with. Time already spent economists would call sunk investment because it should not influence judgment on whether to go on investing. The case where the expected utility curve does not have a point of inflection illustrates the problem: if investing more time always results in still better expected utility, then when should one stop to invest time? The next two modules will try to solve the problem.
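A curve of expected utilities might be computed along the following lines. This Python sketch is a simplification under assumptions of my own, not the exact routine of the model: the Mastery Envelope is discretised on a grid, every mastery value on the grid is moved along an exponential (replacement) learning curve, and utility is pure threshold utility. The function names and the values used (12 correct and 4 wrong on the preliminary test, 60 items, reference score 40) are for illustration only.

```python
from math import comb, log, exp

def envelope_weights(correct, wrong, steps=400):
    """Discretised Mastery Envelope: normalised likelihood on a grid of mastery values."""
    grid = [(i + 0.5) / steps for i in range(steps)]
    logs = [correct * log(m) + wrong * log(1 - m) for m in grid]
    top = max(logs)
    raw = [exp(v - top) for v in logs]
    total = sum(raw)
    return grid, [r / total for r in raw]

def chance_of_pass(n_items, mastery, reference):
    """Binomial probability of scoring at or above the reference score."""
    return sum(comb(n_items, k) * mastery**k * (1 - mastery)**(n_items - k)
               for k in range(reference, n_items + 1))

def expected_utility(episodes, correct, wrong, n_items, reference):
    """Expected threshold utility after `episodes` of preparation in total."""
    grid, weights = envelope_weights(correct, wrong)
    total = 0.0
    for mastery, weight in zip(grid, weights):
        # Replacement learning curve (assumed), equal to `mastery` at one episode
        learned = 1.0 - (1.0 - mastery) ** episodes
        total += weight * chance_of_pass(n_items, learned, reference)
    return total

bars = 4
# From the moment of the preliminary test (one episode) up to three episodes
path = [(i / bars, expected_utility(i / bars, 12, 4, 60, 40))
        for i in range(bars, 3 * bars + 1)]
```

Each pair in `path` holds the total preparation time in episodes and the expected utility at that point; plotting the pairs gives a curve of the kind shown in figure 6.1.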


Figure 6.2 Screenshot of the advanced applet of module 6 (click it for the full figure)

For the applet itself click spa_applets.htm#6,
for the text on this module spa_expectation.htm.

Module 7 Optimal stopping on the last test


+ applet

Figure 7.1 Curves of investments expected to be needed to succeed for the last test in a series of tests. For replacement (blue) and accumulation (red) learning model. Vertical scaling equals the horizontal one.

To get higher expected utility the student has to pay the price of more time to spend. Investment costs and utility are quite different things, they should not be lumped together to immediately jump at subjective outcome utility functions, as happens in some places in the literature.

Optimization then poses the problem of how much extra investment of time still is profitable. How long should the student go on investing extra time, assuming it is still possible to invest extra time? One elegant solution to the problem would be the following: extra investment of time is profitable as long as it results in a greater reduction in time needed in future situations to effect the same expected extra utility. In situations of strict pass-fail scoring this will reduce to the extra time involved in repeating (the course and) the test, or better: in the extra time involved in doing extra tests or tasks. The last test in a curriculum in a way always is scored pass or fail, for the aggregate result of the tests in that curriculum may or may not be sufficient to obtain a pass for that curriculum, or admission to the next curriculum, year, school or whatever. There are important cases where the last test formally will NOT have a simple threshold function, however. Such is the case where the last test is the last opportunity to make up for negative results that can be compensated for. Consider the compensating score on the last test the reference score or threshold. For all practical purposes then the last test in fact will have the new reference as its threshold or cutting score; to compensate for earlier negative points is the dominating strategy. The applet finds optimal strategies in preparing for the last test, as figure 7.1 shows.

The applet plots an expected cost function whose minimum value indicates the optimum preparation time, given the information on mastery available now in the operational form of number correct on a preliminary test of known length. For the Last Test in the examination etc. cost includes the possible cost following failure on the test; this is the kind of situation that very much resembles the situation Van Naerssen modelled in his tentamen model.
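The idea of an expected cost function with a minimum can be illustrated with a deliberately simple Python sketch. It is not the cost function of the applet. The assumptions are mine: mastery is taken to be known and equal to the proportion correct on the preliminary test (10 out of 16), learning follows an exponential (replacement) curve, and failing the test costs a fixed `failure_cost` of three episodes, a hypothetical figure standing in for the time needed to repeat the course and the test.

```python
from math import comb

def chance_of_fail(n_items, mastery, reference):
    """Binomial probability of scoring below the reference score."""
    return sum(comb(n_items, k) * mastery**k * (1 - mastery)**(n_items - k)
               for k in range(reference))

def expected_cost(extra, mastery_now, n_items, reference, failure_cost):
    """Extra time invested plus the expected time lost through failing."""
    # Replacement learning curve (assumed); one episode has been spent already
    learned = 1.0 - (1.0 - mastery_now) ** (1.0 + extra)
    return extra + chance_of_fail(n_items, learned, reference) * failure_cost

mastery_now = 10 / 16                     # best bet from the preliminary test
grid = [i / 20 for i in range(41)]        # 0 to 2 extra episodes, in steps of 0.05
costs = [expected_cost(t, mastery_now, n_items=60, reference=40, failure_cost=3.0)
         for t in grid]

best_cost = min(costs)
optimum = grid[costs.index(best_cost)]    # optimum extra preparation time
```

Under these assumptions the expected cost first falls, because the chance of failing drops quickly, and then rises again once extra time costs more than the risk it removes; `optimum` marks the turning point.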

Module 8 The Strategist: Optimal stopping

+ applet


Figure 8.1 Curves of expected investments needed to succeed for the Next To Last Test. For replacement (blue) and accumulation (red) learning model. The second test allows positive compensation (cyan and magenta, respectively). Vertical scaling equals the horizontal one.

Work on optimal strategies for the next to last test (NTLT) is reported. The NTLT situation lends itself to being used as an approximation for the general test situation. In this sense the NTLT situation is not a specific one, unlike that of the Last Test, which is quite specific. Eventually the applet should cover most of the testing situations in education. The problem in developing this applet is that not only the variants in 'pure' utility functions must be accommodated, but also the changes that already obtained scores or grades effect in the utility functions of the remaining tests (the effects of compensating points earned or lost). Hopefully it will prove possible to simulate and analyze an entire series of tests comprising the curriculum or examination, but that will be taken up in yet another applet.

The applet plots an expected cost function whose minimum value indicates the optimum preparation time, given the information on mastery available now in the operational form of number correct on a preliminary test of known length. For the Last Test (LT) in the examination etc. cost includes the possible cost following failure on the test; this is the kind of situation that very much resembles the situation Van Naerssen modelled in his tentamen model. For the Next To Last Test (NTLT) the cost includes profit or loss on the optimal LT strategy as a consequence of the result obtained on the NTLT. The NTLT analysis probably is a good approximation for the analysis of any test in the examination or curriculum, except the Last Test. The details are given in the chapter on The Strategist module; they will prove to be somewhat more involved because of the asymmetry between negative and positive compensation allowed. The beautiful effect of allowing positive compensation, as depicted above, is not matched by a similar result for allowing negative compensation points to be compensated for on the LT. The spectacular effect of positive compensation is a forceful motivator in strategic preparation for educational testing. Allowing negative points on the NTLT to be compensated on the LT might spell disaster for the student, or be only marginally effective, depending on the kind of ruling on these matters.

Module 9 True Utility: What the result is worth (the student's calculation)


+ applet

Figure 9.1 Second generation utility curves depending on the kind of learning: replacement (blue) and accumulation (red) learning model. The first generation utility function is the green one. In all cases five compensation points are allowed, positive as well as negative ones. Of course both kinds of function here are scaled to be one at the reference score.

Module 4 develops the first generation utility function; module 9 now develops the second generation utility function, which in a certain sense supersedes the first one, at least as far as the individual student is concerned. First generation utilities represent exchange rates in the market of compensation points (or grades) as defined by the teacher or the institution. For the student these exchanges will in many cases be 'unfair' because the time invested in them conforms to this exchange rate only imperfectly, or not at all. The second generation utilities therefore represent outcomes valued in terms of time invested or yet to be invested.

The explanation of its construction is basically quite simple: earning compensation points allows optimal strategies on the Last Test 'costing' more (in the negative cases) or less (in the positive ones) than would be the case otherwise. These differences are differences of utility; they can be evaluated by finding the optimal strategies themselves. The reference score is scaled to have utility one. Failing scores have the lowest utility, which is scaled to be zero.
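The scaling step can be sketched as follows, assuming a simple linear rescaling of the optimal expected cost on the Last Test; whether the module uses exactly this rescaling is an assumption. The cost figures are hypothetical illustration values, not results from the model, and the names `optimal_lt_cost` and `failing_cost` are invented for this sketch.

```python
def second_generation_utility(lt_cost, reference_cost, failing_cost):
    """Rescale cost so that the reference outcome is 1 and failing is 0."""
    return (failing_cost - lt_cost) / (failing_cost - reference_cost)

# Hypothetical optimal expected cost (hours) of the Last Test strategy,
# by number of compensation points carried over from the NTLT.
optimal_lt_cost = {-2: 70.0, -1: 52.0, 0: 40.0, 1: 33.0, 2: 29.0}
failing_cost = 120.0  # hypothetical cost attached to a failing score

utilities = {
    points: second_generation_utility(cost, optimal_lt_cost[0], failing_cost)
    for points, cost in optimal_lt_cost.items()
}
for points in sorted(utilities):
    print(points, round(utilities[points], 4))
```

With these illustrative numbers one negative point lowers utility more than one positive point raises it, which is the kind of asymmetry a utility function that is linear in points cannot show.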

Because second generation utilities depend on the learning model, they will not be identical to the first generation utilities—the green function—that in traditional situations will be ruled to be linear in scores or grades. As the figure shows, the discrepancies between what the formal rules on compensation suggest to be the case, and what evaluation of real costs reveals, can be enormous, especially regarding (proactive) negative compensation points.

It is possible to change these discrepancies by changing the rules for the way certain results may compensate for other results. One possible technique is to define compensation on grades earned, and then to scale grades in such a way that they are approximately linear in the time expected to be needed to earn them. There will probably be a small research literature on the last topic, possibly related to equating models, but I am not aware of concrete contributions as of this moment. What is needed, however, is a set of very simple rules that can do the job in most cases.

The next step in the research program then could be to search for relations between certain kinds of discrepancies between first and second generation utilities, and certain kinds of problems students have in that type of study course; think of delay, flunking grades, dropping out, procrastination.

Module 11 The Optimizer: Indifference curves

Another possibility to solve the optimization problem is to restrict the problem space to two tests competing for the scarce time of the student. The idea is to compare the extra expected utility that an extra investment of time in either test A or test B would realize. The technique economists use to visualize this situation is that of indifference curves. This solution to the optimization problem was presented at the 1995 EARLI conference [164k html].
The method is attractive because it is exact in the solution of the problem of the optimal spread of a given budget of study time. It has little to say, however, on what would be an optimal budget of preparation time for the two tests together, although the optimal 'path' through a number of different values for that budget is constructed also in this module/applet.
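The two-test problem can be sketched numerically. The sketch below is not the exact method of the module: it swaps in a plain grid search, and the concave curve `utility_curve` with its `ceiling` and `rate` parameters is a hypothetical stand-in for the expected utility functions the model itself would supply.

```python
import math

def utility_curve(hours, ceiling, rate):
    """Hypothetical concave expected utility of time spent on one test."""
    return ceiling * (1 - math.exp(-rate * hours))

def best_split(budget, test_a, test_b, step=0.5):
    """Hours for test A (the rest goes to B) that maximize total expected utility."""
    candidates = [i * step for i in range(int(budget / step) + 1)]
    return max(
        candidates,
        key=lambda a: utility_curve(a, *test_a) + utility_curve(budget - a, *test_b),
    )

test_a = (2.0, 0.1)  # (ceiling, rate): test A is worth more to the student
test_b = (1.0, 0.1)

# Optimal 'path': the best split for a range of total time budgets.
path = [(budget, best_split(budget, test_a, test_b)) for budget in (10, 20, 30, 40)]
print(path)
```

At an interior optimum the extra expected utility of one more hour is the same for both tests, which is the tangency condition the indifference curves visualize.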

Module 12 The Data: Testing the Model

How does the model compare with the strategies students seem to follow in reality? An extensive data set was assembled in the '80s, consisting of data from first year law students. They had to take six tests during that year, together constituting the first year examination. For every test the students filled out a small questionnaire immediately before receiving the test form. They were asked for their preparation time and the grade they expected to get on that particular test. The obtained grade was added to the data later.
The data might shed some light on questions such as: what strategies do students follow, and are they 'optimal' in some sense? Where and how do these spontaneous strategies differ from the model's prescriptive ones? Is it true, as might be expected, that students do not adequately handle the stochastics of the situations they find themselves in?

Mitchell, T. R., and D. M. Nebeker (1973). Expectancy theory predictions of academic effort and performance. Journal of Applied Psychology, 57, 61-67.

Vrugt, Anneke, en Johan Hoogstraten (1998). Doeloriëntaties, waargenomen Eigen Competentie en Studieresultaten. Tijdschrift voor Onderwijsresearch, 23, 210-223.

Module 13 The System: not everybody can follow optimal strategies even if they would want to

The data in module 12 were used in two earlier papers on an application of Coleman's social system theory to the social system students and faculty together form in preparing for tests and in assessing test results html and html.

How general is 'general'?

The model of achievement tests is presented as a 'general' model. The limits to its generalness are twofold. First, the model does not directly address the content of achievement test items. Second, the model does not address systemic issues arising from strategic behavior of groups of students.

Theory about the content of achievement test items is virtually non-existent. Yet quality of content evidently is the primary factor determining the quality of achievement tests. It is therefore very disappointing to find textbooks on educational measurement uncritically repeating the old adage, see for example Wesman in Thorndike (1970) Educational measurement, that the writing of achievement test items is an art one can only master through years and years of practice. Even books on test item writing, for example Roid and Haladyna's, play down issues of content, concentrating instead on issues of form. Even the magnificent work of Bloom, Madaus and others in the seventies, on educational objectives and how to test them, does not really concern content issues either. Because of these lacunae I wrote a book on test item writing in the early eighties that explicitly addresses quality of content issues, and presents an educational technology for the writing of achievement test items. Regrettably the book is in Dutch, but you may nevertheless want to take a look at its revision (underway): Toetsvragen ontwerpen [html].
The SPA does not directly concern content, except the abstract quality of content complexity. Complexity is a measure of the number of things that correctly and simultaneously must be known in order to be able to correctly answer the question.
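One simple way to make this notion of complexity concrete is sketched below, under the assumption that each of the things to be known is mastered independently with the same probability. Whether the SPA operationalizes complexity exactly this way is not stated here; the function name is hypothetical.

```python
def item_success_probability(mastery, complexity):
    """Chance of knowing all `complexity` things at once, assuming independence."""
    return mastery ** complexity

# With mastery 0.8, an item that needs three things known simultaneously
# is answered correctly far less often than an item that needs one.
for complexity in (1, 2, 3):
    print(complexity, round(item_success_probability(0.8, complexity), 3))
```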

Systemic issues in achievement testing arise because of its many norm-referenced practices. In those circumstances it is not possible for every student in the group to follow objectively optimal study strategies and reap their benefits. This, of course, is one of the really big tragedies in education. It is the prime motivating factor for developing a theory that enables actors to 'see' the inefficiencies of traditional norm-referenced achievement testing. The SPA stops short of directly handling systemic issues. Systemic issues demand a wholly different methodology, such as presented by Coleman in his (1990) Foundations of social theory. I have applied his theory to a relevant data set, and I have presented the analysis in my (1992) Modelling the connection between individual behaviour and macro-level outputs. Understanding grade retention, drop-out and study-delays as system rigidities html.

Moving from norm-referenced to criterion-referenced testing makes it possible to use the SPA for groups of students, simply by adding individual results or analyses.

Having explained what is not included in the 'generalness' of the SPA, a few words on what is included will elucidate the range of applicability of the SPA.

To begin with, simple parameters of mastery, item complexity, and test length extend the model's generality to almost all possible achievement tests. The SPA is not limited to tests in the literal sense, however. All kinds of assessment must be considered as included in the concept of 'achievement test.'

To be of any strategic value at all, the SPA must include time. Learning models, therefore, are part of the SPA, and specific learning models may be added to the model. Time itself is an asset to the student in the sense that still available time might be used in further preparation for the upcoming achievement test, or not.

Strategic value itself, of course, equals what is more technically termed expected utility. The SPA is a decision-theoretic model that uses utility functions representing valued outcomes of achievement testing, valued, that is, by the individual students involved, not by authoritarian actors. Using utility functions it is possible to capture quantitatively the value that a particular achievement test score has in the context of recently obtained scores, or scores yet to be obtained, on other tests belonging to the same examination or group of tests defining the grade point average.
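Expected utility in this sense can be sketched as the utility of every possible score weighted by its predictive probability. The sketch uses the betabinomial distribution as the predictive score distribution; the threshold utility function and the parameter values are hypothetical choices for illustration only.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def betabinomial_pmf(x, n, a, b):
    """Predictive probability of x correct out of n, with mastery ~ Beta(a, b)."""
    log_comb = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return math.exp(log_comb + log_beta(x + a, n - x + b) - log_beta(a, b))

def expected_utility(n, a, b, utility):
    """Sum over all scores of predictive probability times utility."""
    return sum(betabinomial_pmf(x, n, a, b) * utility(x) for x in range(n + 1))

def threshold_utility(cutoff):
    """Hypothetical pass/fail utility: 1 at or above the cutoff, else 0."""
    return lambda score: 1.0 if score >= cutoff else 0.0

# 6 of 10 correct on a preliminary test gives Beta(7, 5) for mastery;
# the value below is then the predictive probability of passing.
print(expected_utility(20, 7.0, 5.0, threshold_utility(12)))
```

Replacing `threshold_utility` by a function that also values compensation points turns the same sum into the expected utility of a test within a larger examination.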

An important quality of 'generalness' is that the model allows big summative achievement tests to be broken down into a series of smaller formative tests, and the possible consequences of this course of action to be evaluated in terms of the educational achievement of the students. An example of this is presented in my (1995, in Dutch) Studiestrategieën die voor studenten en docenten optimaal zijn: het sturen van investeringen in de studie [Optimal study strategies from the perspective of students as well as teachers: getting students to optimally invest study time.] [44k html + 18 gif]


The model in itself is a rich theory on achievement testing. The fact that it has been embodied as a computer application makes this theory especially relevant to actors in the field, researchers as well as practitioners. Changing one or more parameter values in the model enables one to instantly evaluate the possible consequences of certain changes in tests and testing situations. Many examples will be given. The instruments themselves will be made available as Java-applets, so it will be possible to apply the model to typical situations you yourself have met or are responsible for.

Project history

As indicated above, Van Naerssen's 'tentamen model' was the first serious attempt to model strategic student behavior in preparing for tests and examinations. As such his work has been a source of inspiration for the development of the General Model of Achievement Tests. Van Naerssen himself did not succeed in developing a general model. Many of the assumptions he had to make in order to develop the mathematics of the model proved to be obstacles to the generalization of his model. In the decades that followed it has been possible to remove the restrictive assumptions one by one, resulting in a model that is truly general in the sense indicated in the paragraph above.

Another early influence has been the notion of transparency promoted as an essential quality of achievement tests and testing situations by Adriaan de Groot (1970). The gist of this notion is that students should be able to follow a profitable strategy in preparation for achievement tests. They should not be left in the dark about what might be asked, in what ways, and how their answers might be rated. There is a remarkable identity of spirit in the Van Naerssen model and the transparency concept of De Groot. They were faculty members in the same department and research group in the University of Amsterdam. The idea of the tentamen model as a possible operationalization of the concept of transparency never seems to have occurred to them, however.

Also in the seventies a group of Dutch psychologists gathered together by Don Mellenbergh and Wim van der Linden studied the possibilities of criterion-referenced measurement. This group provided a stimulating climate for me to attempt to generalize the tentamen model of Van Naerssen. The kind of decision theory used by most members of this group, however, has been statistical decision theory. There is nothing wrong with statistical decision theory, of course, in decision making over scientific hypotheses. However, it is not the best approach to human or economic decision making. Cronbach and Gleser, and Van Naerssen, therefore used economic decision theory. The problem now is that scientists using statistical decision theory do not readily understand scientists using economic decision theory, and vice versa. And yet many of the results obtained by the former group can be shown to be false using the transparent methods of economic decision theory. This will be done in the pages to follow.

In my publication list the slow development from the seventies until the millennium can be followed. The first significant step forward was the strategic model developed in 1978, in a publication in Dutch. At the time it did not prove possible, however, to generalize the model to include strategies. Two publications in 1980 on criterion-referenced decision-making established the correct use of utility functions and expected utility respectively. The eighties were a lost decade for the further development of the model. In the nineties it proved possible to establish one by one a series of essential building blocks of the model, presented in conference papers. It proved possible to structure these building blocks in a hierarchical way, now to be presented as a series of modules. Each module moreover has its own significance for theoretical or practical issues in achievement testing. The hierarchy in the modular structure indicates that the higher a module is placed, the more difficult it will be to understand the issues in the general model. That certainly has been an issue in the development of the computer program that should bring the model to life. The ultimate building block is the one that shows what the student's optimal strategy should be, given the information available and the time to go. This module has not yet been published, but is now available in a preliminary form (not yet implemented as a Java applet, however).


Becker, H., B. Geer, and E. C. Hughes (1968). Making the grade: the academic side of college life. New York: Wiley. site

Coleman, James S. (1990). Foundations of social theory. Cambridge, Massachusetts: The Belknap Press of Harvard University Press.

Cronbach, Lee J., and Goldine C. Gleser (1957/1965). Psychological tests and personnel decisions. Urbana, Illinois: University of Illinois Press.

Gigerenzer, Gerd, and Reinhard Selten (Eds) (2001). Bounded rationality. The adaptive toolbox. MIT Press.

Groot, A. D. de (1946). Het denken van den schaker. Een experimenteel psychologische studie. Amsterdam: Noord-Hollandsche Uitgevers maatschappij.

Groot, Adriaan D. de (1965). Thought and choice in chess. The Hague: Mouton.

Groot, Adriaan D. de (1970). Some badly needed nonstatistical concepts in applied psychometrics. Nederlands Tijdschrift voor de Psychologie, 25, 360-376. [in English]

Hogarth, Robin M. (2001). Educating intuition. Chicago: The University of Chicago Press.

Klein, Gary (2001). The fiction of optimization. In Gigerenzer and Selten, p. 103-121.

Lord, Frederic M., & Novick, Melvin R. (1968). Statistical theories of mental test scores. London: Addison-Wesley.

Mangel, Marc, and Colin W. Clark (1988). Dynamic modeling in behavioral ecology. Princeton, NJ: Princeton University Press.

Mazur, J. E., & Hastie, R. (1978). Learning as accumulation: A reexamination of the learning curve. Psychological Bulletin, 85, 1256-1274.

Naerssen, Robert F. van (1965). Application of the decision-theoretical approach to the selection of drivers. In Cronbach and Gleser (1965).

Naerssen, Robert F. van (1970). Over optimaal studeren en tentamens combineren. Openbare les. Amsterdam: Swets en Zeitlinger, 1970. [The first publication on the tentamen model, in Dutch] html [here also a list of other relevant publications from his hand, plus abstracts]

Naerssen, Robert F. van (1974). A mathematical model for the optimal use of criterion referenced tests. Nederlands Tijdschrift voor de Psychologie, 29, 431-446. [in English] pdf

Naerssen, Robert F. van (1978). A systems approach to examinations. Annals of Systems Research, 6, 63-72. scan

Suppes, Patrick (1976). Testing theories and the foundations of statistics. In William L. Harper and C. A. Hooker (Eds) (1976). Foundations of probability theory, statistical inference, and statistical theories of science. Proceedings of an international research colloquium held at the University of Western Ontario, London, Canada, 10-13 May 1973. Volume II: Foundations and philosophy of statistical inference (p. 437-455, including discussion). pdf

Wesman, A. G. (1970). Writing the test item. In R. L. Thorndike (Ed.): Educational measurement. Washington, DC.: American Council on Education.

Wilbrink, Ben (1990). Complexe selectieprocedures simuleren op de computer. Amsterdam: SCO. (rapport 246) pdf   bijlagen pdf

Wilbrink, Ben (1992). Modelling the connection between individual behaviour and macro-level outputs. Understanding grade retention, drop-out and study-delays as system rigidities. In Tj. Plomp, J. M. Pieters & A. Feteris (Eds.), European Conference on Educational Research (pp. 701-704). Enschede: University of Twente. Paper: auteur. html

Wilbrink, Ben (1992). The first year examination as negotiation; an application of Coleman's social system theory to law education data. In Tj. Plomp, J. M. Pieters & A. Feteris (Eds.), European Conference on Educational Research (pp. 1149-1152). Enschede: University of Twente. Paper: auteur. html

Wilbrink, Ben (1995). What its historical roots tell us about assessment in higher education today. 6th European Conference for Research on Learning and Instruction, Nijmegen. Paper: auteur. [html 120k]

Wilbrink, Ben (1995). A consumer theory of assessment in higher education; modelling student choice in test preparation. 6th European Conference for Research on Learning and Instruction, Nijmegen. Paper: auteur. [164k html]

Wilbrink, Ben (1995). Studiestrategieën die voor studenten en docenten optimaal zijn: het sturen van investeringen in de studie. Korte versie in Bert Creemers e.a. (Red.), Onderwijsonderzoek in Nederland en Vlaanderen 1995. Proceedings van de Onderwijs Research Dagen 1995 te Groningen (218-220). Groningen: GION. Paper: auteur. [44k html + 18 gif]

Wilbrink, Ben (1997). Assessment in historical perspective. Studies in Educational Evaluation, 23, 31-48. [concept available 56k html]

More Literature

Bernardo, José M. (1997). A Decision Analysis Approach to Multiple-Choice Examinations. pdf

Giere, Ronald N. (1999). Using models to represent reality. In L. Magnani, N. J. Nersessian, and P. Thagard (Eds) Model-Based Reasoning in Scientific Discovery (41-57). New York: Kluwer/Plenum. pdf on Giere's site at

Hartmann, Stephan (2005) The World as a Process: Simulations in the Natural and Social Sciences. pdf

Kyburg, Henry (1991). Normative and descriptive ideals. In Robert Cummins and John Pollock: Philosophy and AI (p. 129-139). MIT.

Sloof, Randolph, and Mirjam van Praag (2005). Performance measurement, expectancy and agency theory: an experimental study. Working paper. Research Institute Scholar, University of Amsterdam (Roetersstraat 11, 1018 WB), and Tinbergen Institute. SCHOLAR project. pdf [Agency theory, a sport not directly connected to decision-making theory; its promise is to clarify some issues that otherwise might go unnoticed. The paper is on my to-read list; I am interested in its theoretical parts.]

Carroll, John B. (1990). Estimating Item and Ability Parameters in Homogeneous Tests With the Person Characteristic Function. Applied Psychological Measurement, 14, 109-125. abstract



Mail your opinion, suggestions, critique, experience on/with the SPA.

Feb 26, 2016 \ contact ben at at at
