This chapter presents the third module of the SPA model (SPA stands for Strategic Preparation for Achievement tests). The SPA model consists of a series of modules, partially a cumulative series, each dedicated to a particular function in the SPA model: generating binomial score distributions given the mastery level; generating the likelihood of mastery given a score on a preliminary test; given the likelihood of mastery, generating the predictive score distribution for the test one has to sit; specifying objective (first-generation) utility functions on test scores; specifying learning curves; given the learning curve, evaluating expected utility along the learning path; given the expected utility function, evaluating the optimal investment of study time in preparation for the next achievement test; and, using the last results, specifying the second-generation utility function on test scores.
Information on the beta-binomial distribution: http://www.wolframalpha.com/input/?i=betabinomial distribution.
WolframAlpha plot of the beta-binomial distribution, with parameters number correct + 1 = 13 and number false + 1 = 5, number of items = 60:
Given some information on mastery it is now possible, using the Generator and the Mastery Envelope, to predict the range within which the result on the upcoming test will, with a certain probability, lie. The format of the prediction again is a distribution function. The technique for obtaining it is straightforward: randomly sample mastery values from the Mastery Envelope and use the Generator to simulate a test score on the basis of every mastery value sampled.
Analytically the predictive distribution is a beta-binomial distribution whose parameters are the number of items in the summative test, and the number correct and its complement on the preliminary test.
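The two-step technique just described can be sketched in a few lines of Python and checked against the analytic beta-binomial distribution. The numbers follow the WolframAlpha example above (number correct + 1 = 13, number false + 1 = 5, 60 items on the summative test); the function and variable names are illustrative, not those of the SPA applets.

```python
import math
import random
from collections import Counter

a, b = 12 + 1, 4 + 1   # beta parameters: number correct + 1, number false + 1
n = 60                 # number of items in the summative test

# Simulation: sample a mastery from the Mastery Envelope (here a beta density),
# then generate a binomial test score given that mastery.
random.seed(1)
draws = 20_000
scores = Counter()
for _ in range(draws):
    mastery = random.betavariate(a, b)
    score = sum(random.random() < mastery for _ in range(n))
    scores[score] += 1

# Analysis: the beta-binomial probability of k correct out of n,
# pmf(k) = C(n, k) * B(k + a, n - k + b) / B(a, b), evaluated in log space.
def beta_binomial_pmf(k, n, a, b):
    log_p = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
             + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))
    return math.exp(log_p)

# Compare simulated relative frequencies with the analytic probabilities.
for k in (35, 43, 50):
    print(k, scores[k] / draws, round(beta_binomial_pmf(k, n, a, b), 4))
```

With 20,000 draws the simulated frequencies track the analytic curve closely; this is exactly the fit the figure further down illustrates.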
Aitchison and Dunsmore (1975), on predictive distributions.
[In straightforward cases the theoretical model simply is the beta-binomial model. It has been described in the literature, for example in Novick and Jackson (1974).]
"Fifteen scripts were selected which had been awarded exactly the same "middling" mark by the School Certificate authority concerned, and these scripts were marked in turn and independently by 15 examiners, who were asked to assign to them both marks and awards of Failure, Pass and Credit. After an interval which varied with the different examiners, but was not less than 12 nor more than 19 months in any instance, the same scripts, after being renumbered, were marked again by 14 out of the 15 original examiners (...). The 14 examiners assured us that they had kept no record of their previous work and this was indeed obvious from the results.
Perhaps the most striking feature in the investigation is this: (...) On each occasion the 14 examiners awarded a total of 210 verdicts [Fail, Pass, Credit] to the 15 candidates. It was found that in 92 cases out of the 210 the individual examiners gave a different verdict on the second occasion from the verdict awarded on the first. In nine cases candidates were moved two classes up or down.
One examiner changed his verdict in regard to eight candidates out of the fifteen. Yet he only varied his average by a unit [scale of 100] and he awarded the same number of Failure marks, one less Pass, and one more Credit. Such irregularity of judgment is not only formidable, but it is one which would not be detected by any ordinary analysis."
Ph. Hartog and E. C. Rhodes (1936). An examination of examinations (p. 14-15). Second edition. International Institute Examinations Enquiry. London: MacMillan.
The case cited is highly problematic, for students preparing for this kind of examination as well as for their examiners. The enormous grading error inherent in individual assessor behavior detracts from whatever predictability of grades could otherwise be achieved. Is it possible to build a model for this case, or is it beyond hope? The English committee was tempted to condemn this kind of examination altogether, but did not do so. Members of the English committee included Burt and Spearman; from the US, Thorndike and Monroe were involved in the international project on examinations.
Studies like this one established that traditional methods of examination fell far short of even very modest criteria of reliability and validity. "It is perfectly true that, as Professor Spearman has pointed out, validity and 'reliability' or concurrence of marking are by no means equivalent terms, but no process of measurement can be valid when it yields such discrepant results in the hands of the same examiners on two different occasions." (p. 15-16 o.c.). This is the very bottom line of quality in assessment. Do not be mistaken, however, into thinking that this kind of total lack of quality in assessment is impossible nowadays: it is still common in all kinds of education, from kindergarten up to and including the PhD. Its human and economic cost is immense.
Be aware that it is the grading that is highly problematic here. The assessors should have been able to give students good feedback and instruction on their weak and strong points. That feedback might still be good even if, as is to be expected, it is not in good agreement with feedback independently given by other assessors of the same work. Regrettably, this aspect of assessment was not investigated in the Hartog and Rhodes study. [I have yet to check this verdict.]
The student as decision maker is not the usual starting point in test analysis. To get a sharp picture of the contrasting traditional, institutional, decision-making approach, see the Rudner applet (html), which allows you to see the probabilities that a student with given scores on a three-item test belongs to either the mastery or the non-mastery group (click the thumbnail for a screenshot, not the applet itself). It is a beautiful applet, allowing experiments with different values; the only restriction is the fixed number of three test items. For this kind of application of decision theory see the webpage of the applet's author, Lawrence M. Rudner, on Measurement Decision Theory.
The guessing option is not yet implemented. I am in no hurry to do so, because guessing on multiple-choice achievement test items is a nuisance anyway, and should not be invited. The proper thing to do is to instruct students to leave questions unanswered when they do not know the answer. The two possibilities then are either to subtract a point for every wrong answer, or to give a bonus for every question left unanswered.
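As an aside, the two scoring rules just mentioned can be written out in a couple of lines. This is a minimal sketch only: the size of the penalty and of the bonus (a quarter point here) are arbitrary illustrative choices, not values taken from the SPA model.

```python
def penalty_score(right, wrong, blank):
    """First rule: subtract a point for every wrong answer; blanks score zero."""
    return right - wrong

def bonus_score(right, wrong, blank, bonus=0.25):
    """Second rule: give a bonus (here a quarter point) per question left blank."""
    return right + bonus * blank

# A hypothetical student on a 60-item test: 40 right, 10 wrong, 10 left blank.
print(penalty_score(40, 10, 10))   # 30
print(bonus_score(40, 10, 10))     # 42.5
```

Either rule removes the incentive to answer items one knows nothing about, which is the point of the instruction to leave such questions blank.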
[Given the strong theory underlying the predictive distribution model, empirical support is not really the issue as far as the mathematical/simulation model is concerned.
The real issue might very well be: are students realistic and experienced enough to adequately estimate their chances of passing their tests? Empirical data on this question were assembled in the 1980s, in the departments of dentistry and law at the University of Amsterdam. Analyses of these data have appeared in conference papers and project reports (see the sitemap). Especially interesting is the analysis using James Coleman's social-system methodology, presented at a 1992 conference in Twente. I will have to extract the parts of these analyses bearing directly on students' power to predict their own test results.
In the literature at large, there is some confusion on the question whether students are able to predict their scores. The important issue here is: who is doing the predicting, and for whom? In the SPA model it is the student predicting his or her own result, irrespective of what other students do or predict. In educational research the analysis is generally based on correlations computed over groups of students, a fundamentally different situation. The Coleman social-system approach is a beautiful attempt to get the best of both worlds: groups as well as individuals.]
[Application is closely related to the situations from which the empirical data in the preceding paragraph were collected.
For example, in preparation for the very first test that newly arrived students have to sit, they do not yet know their relative position in their peer group. Therefore they are not yet in a good position to predict their grade on this first test. Empirical data clearly show their predictions of results on the second and later tests to be much better.
This kind of result should not come as a surprise to anyone. Being able to model it, however, is an important step forward. In the first months of the course, students must decide whether to stay or leave. Discrepancies between expectations and results can trigger the decision to change courses or even universities. ]
The beta-binomial model is a well-known model, not least because of its use in Bayesian statistics. In Novick and Jackson (1974) all the ingredients are given for formulating a simple prediction model for test scores, given a result on a preliminary test sampled from the same domain of achievement test items. Combined with the insights of Van Naerssen as presented in his tentamen model, this must have been the beginning of the SPA model.
The beta-binomial model lends itself rather easily to the evaluation of complex selection situations such as restricted admissions to higher-education studies (numerus fixus studies) in the Netherlands. In 1980 I evaluated the proposal by the then minister of education in the Netherlands to change the weighted lottery scheme for admissions to numerus fixus studies, showing how it resulted in negligible chances of admission for certain subgroups (html). The minister did not bring his proposal to parliament, probably for political reasons that never seem to have been revealed publicly.
The analytic routine evaluates the beta-binomial distribution using a recursion formula. The simulation routine consists of first simulating the likelihood of mastery given the preliminary test result (the Mastery Envelope) and then simulating the predictive score distribution by repeatedly sampling a mastery from the likelihood function and simulating a binomial test score given that mastery. The routines used to simulate the predictive score distribution have therefore been treated in the preceding chapters.
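The text does not show the particular recursion the applet uses, but one standard recursion for the beta-binomial probabilities follows directly from the pmf: starting from P(0), each next term is obtained by multiplying with the ratio P(k+1)/P(k) = (n-k)(k+a) / ((k+1)(n-k-1+b)). A sketch under that assumption:

```python
def beta_binomial_recursive(n, a, b):
    """Beta-binomial pmf over scores 0..n, built by recursion.

    Start value: P(0) = prod_{j=0}^{n-1} (b + j) / (a + b + j),
    i.e. B(a, n + b) / B(a, b) written as a product of simple ratios.
    """
    p = 1.0
    for j in range(n):
        p *= (b + j) / (a + b + j)
    pmf = [p]
    # Recursion: P(k+1) = P(k) * (n - k)(k + a) / ((k + 1)(n - k - 1 + b))
    for k in range(n):
        p *= (n - k) * (k + a) / ((k + 1) * (n - k - 1 + b))
        pmf.append(p)
    return pmf

# Parameters from the WolframAlpha example above: a = 13, b = 5, 60 items.
pmf = beta_binomial_recursive(60, 13, 5)
print(round(sum(pmf), 6))
```

The recursion avoids large binomial coefficients and beta-function evaluations altogether, which is presumably why a recursion formula was chosen for the applet.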
'Repeatedly sampling a mastery using the likelihood function.' That likelihood function itself is already a simulation. Random sampling from under a function that is itself the result of a simulation is a ritual dance; it does not add anything meaningful. The straightforward thing to do is to use the simulated likelihood function itself to construct the predictive score distribution: no random sampling. The savings in time will be small, however. In due time I will implement the straightforward routine, replacing the random-sampling routine (keeping the latter available as an option for control or research purposes).
Important savings in time are possible by choosing an appropriately small grid (the number of bars used to construct the likelihood function) for the simulation of the likelihood function of mastery. For many purposes a grid of 10 might be fine enough. The option to choose a grid other than the standard 100 has therefore also been added to the standard applet (January 2, 2008).
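The straightforward routine proposed above, using the gridded likelihood function directly rather than resampling from it, amounts to a likelihood-weighted mixture of binomial distributions, one per grid bar. A self-contained sketch, using the analytic beta density in place of the applet's simulated likelihood function and an illustrative grid of 10:

```python
import math

def binom_pmf(k, n, p):
    # Binomial probability of k successes out of n items, success chance p.
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def predictive_from_grid(n, a, b, grid=10):
    """Predictive score distribution from a gridded mastery likelihood."""
    # Midpoints of `grid` equal-width bars on the mastery scale (0, 1).
    masteries = [(i + 0.5) / grid for i in range(grid)]
    # Likelihood of each mastery given the preliminary result, here taken
    # proportional to a beta density with parameters a and b, then normalized.
    weights = [m ** (a - 1) * (1 - m) ** (b - 1) for m in masteries]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Predictive distribution: likelihood-weighted mixture of binomials.
    return [sum(w * binom_pmf(k, n, m) for w, m in zip(weights, masteries))
            for k in range(n + 1)]

pmf = predictive_from_grid(60, 13, 5, grid=10)
print(round(sum(pmf), 6))
```

No random sampling is involved, so the result is exactly reproducible; the grid size alone controls the trade-off between speed and fidelity to the standard 100-bar likelihood function.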
The picture illustrates the fit between simulation and analysis. Because Java's fillPolygon technique is used, the individual bars that can be seen are two pixels thick instead of the one pixel that would be appropriate here. The picture therefore seems to show a somewhat larger surface under the simulated curve than under the analytic curve.
Incidentally, the plot shows how a student with little information on her mastery cannot predict her score very well in this 500-item summative test case.
J. Aitchison and I. R. Dunsmore (1975). Statistical prediction analysis. Cambridge: Cambridge University Press.
Ph. Hartog and E. C. Rhodes (1936). An examination of examinations. Second edition. International Institute Examinations Enquiry. London: MacMillan.
Melvin R. Novick and Paul H. Jackson (1974). Statistical methods for educational and psychological research. London: McGraw-Hill.
S. T. Garren, R. L. Smith and W. W. Piegorsch (1994). Bootstrap goodness-of-fit tests for the beta-binomial model. Technical Report, Mimeo Series #2314, Department of Statistics, University of North Carolina, Chapel Hill, NC.
Gregory R. Warnes (2002). Hydra: A Java library for Markov Chain Monte Carlo.
D. A. Griffiths (1973). Maximum likelihood estimation for the beta-binomial distribution and an application to the household distribution of the total number of cases of a disease. Biometrics, 29, 637-648.
Frederic M. Lord and Melvin R. Novick (1968). Statistical theories of mental test scores. London: Addison-Wesley. (Chapter 23)
Ruth H. Maki (1998). Metacomprehension of text: Influence of absolute confidence level on bias and accuracy. In Douglas L. Medin (Ed.), The psychology of learning and motivation. Advances in research and theory (p. 223-248). Academic Press.
N. H. Veldhuijzen (1980). Difficulties with difficulties. On the beta-binomial model. Tijdschrift voor Onderwijsresearch, 5, 145.
E. R. Clarke (1940). Predictable accuracy in examinations. The British Journal of Psychology Monograph Supplements XXIV. 48 p. [Look this one up.]
Jerold E. Barnett and Jon E. Hixon (1997). Effects of grade level and subject on student test score predictions. Journal of Educational Research, 90, 170-174.