The Ruling: How the result will count (his master's voice)


Module four of the SPA model: Utility functions (first generation)


Ben Wilbrink

warning. The JAVA applets have been compiled under a JAVA version that has since been declared obsolete because of security leaks. I have not yet been able to port them to new-style JAVA or to applets based on Javascript. Any questions: contact me. It is my experience that this innovative work attracts zero attention, nada, niente; if you are interested, then you are the exception, so do not hesitate to contact me.


some highlights of this module

Figure 4.1 illustrates the main points of the objective utility function, the function that depicts the way scores on this test will count in the overall score on the course or examination. It is the value the teacher or institution assigns to scores; therefore it is 'the voice of the student's master.'

The reference score normally is the cutoff score. Because formal and factual cutoff scores may differ, it is useful to have the concept of a reference score without the surplus meaning of being a cutoff.
The function is objectively fixed, given the rules on the combination of test scores and given the previous results obtained by the student. This is an important observation: the objective utility function generally will be different from student to student, given the same ruling on the combination of test scores.


Figure 4.1 Typical form of (first generation) utility function in rulings where some compensation between low and high scores is allowed. In the depicted case the test has 40 items, the reference score is 30, compensation is allowed for three scoring groups below and above the reference score.


Compensation allowed might be defined on the grade obtained, not on the score - number correct - itself. A simple grading rule will be assumed throughout the SPA model: a fixed number of score points - the 'group' - will be the 'span' of one grade point. In the example the group is set at two.
Extreme cases are pass-fail grading - no compensation allowed - and full grade point average - full compensation allowed. The figure illustrates the fact that the partial compensation utility function cannot be approximated using the ogive-like functions figuring in the mainstream literature.
Clicking the figure will show the full picture of this case, including the parameter values chosen in the menu. To use the applet itself, go to its page.





How test scores are combined


Test outcomes are somewhat uncertain indicators of the mastery of students. Yet they somehow must be combined in order to decide whether the individual student's results for the course are sufficient for certification, admission, or whatever. The combination rules themselves are most of the time plain and unambiguous. How best to combine uncertain results, however, is a difficult problem, the solution of which will surprise many educators, because their mental models on this question are not consistent with current statistical theory.


An early, twelfth-century example of a combination rule recognizing uncertainty is known as the Trial of the Pyx (Stigler, 1986, p. 3). The London Mint was allowed a tolerance of five grains. Sampled coins were reserved in a box, the Pyx, for a later trial. Allowing a tolerance of five grains on one coin, what tolerance should be allowed on 100 coins? The contract between king and mint stated that linear extrapolation should be used, i.e. the tolerance allowed is a hundred times five = 500 grains; modern theory has the square root of 100, times five = 50 grains. The difference is an order of magnitude! The issue revolves around what happens to errors or chance fluctuations of measurement when adding observations. In the eighteenth century astronomers still "feared that errors in one observation would contaminate others, that errors would multiply, not compensate." (Stigler, o.c. p. 4)
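In modern notation - a worked line, assuming independent errors per coin, which is the standard argument - the tolerance for a sum of n coins grows with the square root of n, not with n itself:

$$T_n = \sqrt{n}\,T_1 = \sqrt{100}\times 5 = 50\ \text{grains}, \qquad \text{not}\quad n\,T_1 = 100\times 5 = 500\ \text{grains}.$$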


This example also illustrates the kind of mental models actors in education might have about what happens to the accuracy of assessments if they are combined one way or another, allowing total, some, or no compensation between them - 'allowing a tolerance of how many points'.

The above might seem a somewhat curious introduction to the topic of utility functions, but in educational assessment utility is a function of the assessment outcomes; outcome combination rules therefore are the business of this chapter, for they translate or map into utility functions of definite forms.


Another kind of information needed for the SPA model is that about the testing or examination situation: how will the results on the test be graded, and how do these grades contribute to pass-fail decisions on the examination or end-of-year results? In the terminology of decision theory this is about utility functions.
The lucky situation in educational assessment is that the examination rules almost always will have been spelled out meticulously. This ruling translates easily, and quite remarkably in an objective way, into utilities or utility functions. The objectivity of these utility functions is remarkable because in the literature mainly subjective utility functions, or subjectively chosen utility functions, have been used. The applet illustrates a technique that is applicable to almost any educational assessment situation.


Other approaches may be possible also. In the case of multiple judges assessing expert achievement in music, sports, or applications for research grants, a betting model might be appropriate. Hofstee (2009): "Using psychometric notions, I model judgment as a recursive bet aimed at maximizing the common or intersubjective component in the individual judgments." Proper scoring rules make it in the strategic interest of the judge to always score truthfully.


Quite generally, almost every testing situation in education is a case of threshold utility in combination with a certain range about the threshold - neutrally called the reference point in the SPA model - where higher results on one test may compensate for lower ones on another. The extreme cases of course are pass-fail testing - no compensation at all - and the grade point average system - allowing almost perfect compensation.

The applet allows test scores to be grouped for grading purposes, and compensation between grades to be asymmetrical around the reference point. Even if the compensation formally allowed is symmetrical, asymmetrical cases will obtain as soon as compensation points have been won or lost on previous tests, restricting the freedom to win or lose still more points. The general observation is that utility functions depend on the testing history of the student, given the rules for combining test results. In particular the rules might specify a certain amount of compensation allowed symmetrically around the reference points, the last test in the series not excepted. And yet the student, once the first test is taken, may never again be in that symmetrical situation. On top of that, the last test will have a pure threshold utility function, because then the student has no degrees of freedom left. If one or more negative compensation points must be compensated for by the result on the last test, its reference point is raised by that amount.

There is one special case, however, where the last test might seem to have a compensating utility function: if some negative points have to be compensated for on the last test, the student might pass the last test itself, without compensating for the negative points, and therefore still be obliged to retest one of the previous tests. This case will be treated fully in chapters eight and nine, because it plays a (minor) part in the construction of the second generation utility curves.

The construction of compensating utility functions is not self-evident. My first attempt to do so, presented in a 1995 paper, failed miserably. The idea then was that it should be a staircase function in a shape not unlike that of an ogive, because the ogive was the kind of function available from the literature. Only later did it occur to me that the case of partial compensation should be constructed by departing from the fully compensating curve and deleting its lowest and highest steps. Because utility is scaled by fiat, choosing whatever maximum value is convenient, it is still possible to assign the utility of obtaining the reference score the value of one: use the multiplication factor that produces a utility of one for the reference score to produce the utilities for all other scores on this new scale. Play around a bit with the applet above to see how this works, also in cases where compensation is in terms of grades, every grade point then representing a group of x score points (the parameter 'group' in the menu).
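A minimal sketch of this construction, assuming the staircase rule just described; the class and parameter names are hypothetical, chosen to mirror the applet's menu, and are not the applet's actual code:

```java
/**
 * Sketch of a first-generation (partial-compensation) utility function.
 * Scores are grouped into grade points of 'group' score points each;
 * compensation is truncated at 'groupsBelow'/'groupsAbove' grade points
 * around the reference score. Hypothetical code, not the applet's own.
 */
public final class FirstGenerationUtility {

    /** Utility of a test score, scaled so that utility(reference) == 1. */
    static double utility(int score, int reference, int group,
                          int groupsBelow, int groupsAbove) {
        // number of whole grade points this score lies from the reference
        int steps = Math.floorDiv(score - reference, group);
        // delete the lowest and highest steps of the fully compensating staircase
        if (steps < -groupsBelow) steps = -groupsBelow;
        if (steps >  groupsAbove) steps =  groupsAbove;
        // utility is proportional to the (truncated) grade equivalent;
        // dividing by the reference score is the scaling by fiat
        return (reference + steps * group) / (double) reference;
    }

    public static void main(String[] args) {
        // the case of Figure 4.1: 40 items, reference score 30, group 2,
        // three groups of compensation below and above the reference
        for (int x = 0; x <= 40; x++) {
            System.out.printf("score %2d  utility %.3f%n",
                    x, utility(x, 30, 2, 3, 3));
        }
    }
}
```

In the Figure 4.1 case this yields a floor of utility .8 for scores up to 25, steps of 1/15 per grade point in between, utility 1 at the reference score 30, and a ceiling of 1.2 for scores of 36 and up; the truncation is exactly the deleting of the lowest and highest steps.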

Utility in the SPA model will be scaled by assigning a utility of one to obtaining the reference score


The scaling method chosen is arbitrary. Other possibilities are to always scale the maximum utility to be one, or to leave the choice of scale to the user. The reason to fix the utility scale in the way indicated is to make it possible to be sure at all times which utility scale has been used. The user may rescale results if he or she wishes to do so, multiplying all values by a constant factor. The advanced applet 4a (at the end of this page) allows the maximum of the utility scale to be fixed at the value of one.


It is quite remarkable that the function in the case of compensation cannot be approximated by the ogive-like continuous functions proposed in the literature on criterion-referenced decision-making (for example Davis, Hickman and Novick, 1973); see the scientific position paragraph below on points like this.

The point of this exercise in constructing utility functions is that the utility structure of an examination program or curriculum is an important determinant of the effectiveness of the curriculum.


the utility scale metric


Utilities are determined by the ruling for the examination or curriculum: how many 'points' should be scored, and how they may be distributed among the tests. Take, for example, the case of compensation of one point allowed, in a series of ten tests. A total of sixty points must be scored on the ten tests, so reference scores count for six. Put the utility of these six points at one; what then is the utility of an extra point on a test? Make it the proportion of the reference, in this case one sixth. A somewhat formalized account of the utility metric and its immediate consequences (corollaries) is the following.


ASSUMPTION

Individual test results (scores, points or grades) will be summed for the total score on the examination or course, meaning that individual test results - including compensation points - will be treated as values on an interval scale.


COROLLARY

The utility of one compensating point below or above the reference result (score or grade) is proportional to the utility of the reference result


EXAMPLE

Tests are graded on a ten-point scale, the reference is a grade of six, and two compensation points are allowed below or above the reference grade.
The utility of grade '8' is 8/6 times that of grade '6': if the utility of grade '6' is put at one, the utility of grade '8' is 8/6.
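In a compact formula (my notation, not the author's): with r the reference grade and c⁻/c⁺ the compensation allowed below and above it, the metric reads

$$u(g) \;=\; \frac{\min\bigl(\max(g,\; r - c^-),\; r + c^+\bigr)}{r}, \qquad\text{e.g.}\quad u(8) = \frac{8}{6}, \quad u(3) = \frac{6-2}{6} = \frac{4}{6}.$$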


WARNING

This utility metric is derived directly from the assumption, and will be valid in an objective sense in many if not all assessment situations in education. The metric in no way implies or assumes that the effort to obtain a result that is one point 'better' than another should be the same whether the new point is a '5' or an '8,' for example. It is just a fictitious metric that happens to have been formalized in the rules for the combination of outcomes on all tests in the course or examination. The second generation utility function treated in chapter nine, however, will be 'true to effort.'


COROLLARY

Allowing compensation will result in different strategic situations for students having different testing histories.
Formal rules may superficially seem equal for all tests in the examination, but allowing compensation makes them unequal for students with unequal testing histories.



An alternative approach


A more fundamental approach to the construction of utility of test scores is possible, using the optimal strategies themselves to construct more 'realistic' utility functions, truly second generation utility functions, to be treated in module/chapter nine.

Why then does this chapter call the compensation functions utility functions? The reason is that students might use - will use - the compensation function as a utility function. Whether it is wise for them to do so will be the subject of chapters nine and ten, but for the time being it is clear that the ruling on compensation is a proposition students might accept as a rather good approximation to whatever their utility functions would be if they were given the opportunity to construct them - for example using the technique in chapter nine.

Utility functions, of course, are man-made constructions; some constructions will be better approximations than others. The first generation utility functions in this chapter are a first approximation. The second generation utility functions in chapter nine will be a better one, still being objectively determined (given the learning model). The final approximation will be the use of a subjective parameter - risk attitude - to moderate the objective second generation utility function. The first generation utility function is a utility function; to call it otherwise, in order to be able to call the second generation simply 'the' utility function, would only result in confusion.




Scientific position


There are two different issues here: the first is how scores, marks or grades will be combined to result in total scores, end marks, a grade point average, or the examination outcome; the second is how then to assign a utility function on the scale of the scores, marks, or grades thus taken together.


In the preliminary text of the module the first issue has been assumed to be rather simply a question of adding scores, marks or grades, one way or another, with or without differential weighting, etcetera. Of course the one fundamental complication here is that the reference point may be taken to be a passing score, introducing a conjunctive decision into what otherwise would be a compensatory ruling. However, it is possible to take issue with the simplification of adding scores, adding being after all but a traditional way of solving the combination problem. In a fundamental way, almost all grading boils down to comparing achievements, or ranking students. In theories of justice as well as in educational research the ranking character of grading is something no serious researcher quarrels with; for a historical account and an example of a fast 'evolution' from ranking examinees to grading them, see my (1997 html). From the thus rank-ordered students the head of the school might pick the 'best' student and reward this student with a prize; the admissions officer might choose the best ten students to consider for admission to this university college; etcetera. The rank order, then, equals the order of preference of the head and the officer, thus forming the basis of their decision-making. The point of this detour is that there is a vast literature on the possibility for preferences to be consistent - in almost all daily decision-making preferences will not be consistent - as well as on the nature of preferences themselves: preferences get constructed on the spot (Lichtenstein and Slovic, 2006). It will not come as a surprise that this rich literature has important suggestions to make about possible ways of combining ranks, possibly quite different from what is usual practice in education. Another important result might be the discovery that in a deep sense it is not possible to grade students in a way that is fair in all relevant aspects. Grading outcomes are highly contingent, especially on the institutional rules that somewhat arbitrarily have come to be established in the past. This exercise will not end in a call to stop all grading; hopefully it will teach us to be humble in assessing the achievements of students. Humility in this respect is not an outstanding quality of most teachers, so there is a world to win. I will gather the relevant literature in this page: meritranking.htm.


Fundamentally, how to combine scores, marks or grades is a deep measurement problem (Krantz, Luce, Suppes and Tversky, 1971). In fact, Vassiloglou and French (1982) tackle the kind of problem mentioned above, the first issue, using the Krantz et al. theory. I do not expect the Krantz et al. measurement theory to be very useful for generating or even evaluating specific rules in examinations. What I do expect of this theory is to make available some concepts that will help to exactly describe particular examination rules on the combination of marks or, as Krantz et al. would say, the concatenation of rankings.


The way the psychometric literature constructs and uses utility functions should be highly controversial, with few exceptions, but sadly it isn't. As things stand, a publication that strongly departs from the received psychometric view on the use of utility functions will itself be regarded as controversial, even if the methods used follow directly from the decision theoretic literature.

Therefore this chapter on objective utility functions is in a way the most important part of the SPA model, even without its extension into second generation utility functions in chapter nine.


characteristics of the approach chosen


It will be helpful to list the characteristics of the approach chosen here to the construction of utility functions.

Keep in mind that the attempt to construct utility functions in using the SPA model does not serve scientific purposes. The user of the model has to make a satisficing choice out of many possibilities, it being absolutely out of the question to follow the scientific path of constructing one's own utility function using pairwise comparisons, lottery devices, or whatever. The interested reader might consult, for example, Krantz, Luce, Suppes and Tversky (1971).

The basic idea in the SPA model is that examination rules on the way marks or grades will be combined into total grades, passing scores, or whatever, already determine what the utility function for every student should be, disregarding personal capabilities etcetera. Learning capabilities do matter, of course, because the time needed to reach certain achievement goals depends on them. It is then the application of the SPA model itself that will make it possible to construct second generation utility functions incorporating the cost - study time - involved in reaching the preferred levels of achievement on the tests comprising the examination.



loss functions


The SPA model does not employ the concept of loss functions, because the decision making that will concern us typically is of a gradual character: at what moment is it 'best' to stop investing yet more time in test preparation? Functions of expected utilities and/or expected investments will decide that. The typical situation where a loss function might come in handy is where the difference in utility between two specific possible outcomes is what counts; opportunity loss then is the difference between the two outcome utilities: it is the loss incurred when the outcome having the smaller outcome utility obtains. An example is going out for a walk and having to decide whether or not to leave the umbrella at home. It would be nice to have dry weather and no umbrella to carry; one would be sorry, however, if it started to rain and the umbrella were at home. Another example is the testing of scientific hypotheses.

Needless to say, the concept of (opportunity) loss has nothing to do with that of investments, other than that future investments might turn out to be lost, or that earlier investments possibly should be regarded as sunk costs.


Cronbach & Gleser (1957)


The highly influential text of Cronbach and Gleser introduced decision-making theory and techniques in the fields of personnel selection and counseling.

In personnel selection the stakeholder is assumed to be the institution; the utility question then reduces to the question how to evaluate the contribution of employees to the firm or institution. That question is a tough one. Van Naerssen (1962) was one of the first researchers to tackle it, Schmidt, Hunter, McKenzie and Muldrow (1979) presented a technique to quantify utilities, Wilbrink (1989) used that technique in a program able to simulate the results of (changes in) complex selection procedures, and Van der Maesen de Sombreff (1992) researched some applications of what he calls the Brogden-Cronbach-Gleser utility model. It is however not evident that assessment in education is an instrument for institutional decision making in the way psychological tests in personnel selection - or medical diagnostics - are.

In counseling the decision making is somewhat complicated, the primary decision maker being the counselee, the secondary decision maker the counselor, who should try to assist the counselee in finding his or her preferences and utilities. Cronbach and Gleser deemed the contrast between institutional and individual decision making worthy of a last contemplative chapter in their first (1957) edition. The reader surely will recognize the resemblance between the counselor-counselee relationship and that between teacher and student. With hindsight it is a pity that Cronbach and Gleser did not explicitly treat the case of assessment in education; it surely would have heightened the chances of developing a model true to the real teacher-student relationship. The rather abstract study of Cronbach and Gleser thus did not result in suggestions for specific techniques or functions capturing the utilities of counselees or students.


Robert F. van Naerssen (1970)


Van Naerssen assumes pass/fail scoring of tests. His utility functions therefore would be threshold utility functions: 1 for a pass, 0 for a fail. Expected utilities then equal chances to pass the test. The primary decision maker in the Van Naerssen model is the student; he or she is the owner of the utility function. In a way, however, the teacher's attitude should be consistent with this utility function, otherwise the teacher as secondary decision maker should change the cut-off score on the test. Van Naerssen was very careful in stipulating the different roles of students and teachers, and their consequences for the decision theoretic modelling of teacher choices in the design of assessment.

In the Van Naerssen model the consequence of failing a test is that the student should sit the test again. Ultimately the student will secure a pass on all tests of the curriculum. Whatever the utility of this result, it will always be the same no matter which strategies the student chose to follow. It is therefore not utility that gets maximized by the rational student; there simply is no room for differences in outcome utilities. The strategic situation that confronts the student is the simple one of minimizing the investment of resources, the most important of which is preparation time. The conceptual difference between utilities and investments remains regrettably somewhat covert in Van Naerssen's own exposition of the tentamenmodel; nevertheless it is definitely present.


paradigm lock-in: Hambleton and Novick (1972)


In the early seventies the educational measurement field got locked into a paradigm not fitting the educational situation, certainly not the situation as analysed by Van Naerssen (1970). This premature closure of the field can be traced back to two influential publications, the first promotional and the second highly technical, from The American College Testing Program, by Hambleton and Novick (1972) and Davis, Hickman and Novick (1973). [I will have to check this statement; Hambleton and Novick refer extensively to previous literature, most of it on mastery learning and criterion-referenced measurement - one of them by Cronbach, a 1971 report by Hambleton and Gorth - but only one of them on the use of decision theory, as far as the titles reveal. The only explicit reference to the decision theoretic literature is to Cronbach and Gleser. For the time being the thesis therefore stands. The earlier work by Van Naerssen must have been unknown to Hambleton and Novick.] The point these authors missed is that the 'real' or primary decision maker is the student, the position of the teacher being secondary only, in the sense that teacher strategies of necessity are about student strategies. The new paradigm of criterion-referenced measurement therefore is based on the implicit assumption that it is possible for the teacher to find optimal strategies without the students reacting to them. I do not know of any places in the literature using this paradigm where it is mentioned that students might follow strategies of their own, possibly reactive to those of the teacher (the exceptions are Van Naerssen's work and my own publications in 1978 and 1980, both well known in the Dutch group of researchers around Mellenbergh and Van der Linden). Remember Van Naerssen showing in 1970 that changing pass-fail scores will change student strategies, in turn necessitating further changes in the cut-off score in the same direction, etcetera, unless the teacher attains the insight that this is a game she cannot win in this way.

A short characterization of the premature closure is that these authors (1) take it for granted that the teacher is the (primary) decision maker, (2) assume that it is pass/fail decisions that have to be optimized, and (3) treat empirical data about the effects of remediation (the fail decisions) as irrelevant. It is especially the last point that will concern us here. Already in 1972 a shortcut on the utility analysis was chosen that has stunted the field ever since. This is not a problem imported from Cronbach and Gleser (1965) or from the decision theoretic field in economics (Raiffa and Schlaifer, 1961, was used by these authors). It is possible that not reflecting on whether the teacher is a primary or secondary decision maker was somewhat unavoidable, given the heavy involvement of Melvin Novick in the application of Bayesian statistical methods in educational measurement (for example Novick and Jackson, 1974, and especially Novick's Computer Assisted Data Analysis CADA), typically exercises in what Cronbach and Gleser called institutional decision making.
In the 1972 publication the institutional statistical bias is particularly evident:

The primary problem in the new instructional models, such as individually prescribed instruction, is one of determining if πj, the student's mastery level, is greater than a specified standard π0. Here, πj is the "true" score for an individual j in some particularly well-specified content domain. It may represent the proportion of items in the domain he could answer successfully. (..) The value of π0 is the somewhat arbitrary threshold score used to divide individuals into the two categories described earlier, i.e., Masters and Nonmasters.

Hambleton and Novick, 1972 p. 4.


Observe that the mastery threshold is assumed to be given. This ghost will haunt several authors in the years to follow, but nowhere in the literature does the question seem to be posed whether the teacher's decision problem might not be to choose this "true" threshold optimally, instead of the threshold on the observed score dimension (Van der Linden, 1980, comes close to questioning this bias, but eventually misses the point).

threshold utilities Hambleton and Novick


A picture of what Hambleton and Novick are trying to do will make clear what probably is not clear at all in their modelling. The applet - use option 410 - obligingly will plot two threshold utility curves that represent the example of Hambleton and Novick, where the opportunity loss of passing a nonmaster is 8 times worse than that of retaining a master. The red function has the two values zero and one, the green one the two values .8 and .9, respectively to the left and to the right of the mastery cutoff score in the domain of items. The opportunity losses then are .8 - .0 = .8 and 1.0 - .9 = .1, the first being 8 times the second. Here a number is put on the domain: it is taken to consist of 400 items, and the threshold lies at 194; more on this value for the threshold below.
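Written out, with θ the domain score and 194 the mastery threshold, the two functions and their opportunity losses are:

$$u_{\mathrm{red}}(\theta)=\begin{cases}0, & \theta<194\\ 1, & \theta\ge 194\end{cases}\qquad u_{\mathrm{green}}(\theta)=\begin{cases}.8, & \theta<194\\ .9, & \theta\ge 194\end{cases}$$

$$\text{losses: } .8-.0=.8 \quad\text{and}\quad 1.0-.9=.1, \quad\text{a ratio of } 8.$$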

The applet does not 'know,' nor does it need to, what the functions represent, or whose utilities they represent. The fact that every domain score here somehow has two different utilities is in itself not remarkable. The user of the applet may have good reasons to study the differences between two different utility functions or structures, especially their possible effects in real educational situations. However, what Hambleton and Novick are doing is something else altogether: they intend to use both functions together; both are meant to be true at the same time. Hambleton and Novick somehow incorporate in the two functions the assumed costs or investments that follow from either of the two possible decisions: passing or failing students having score x on the test. In this formulation the reader should recognize that Hambleton and Novick assume these crucial data to be known, in order to be able to construct a simple solution of the optimization problem.

likelihood Hambleton and Novick


The next step in picturing the Hambleton and Novick paradigm is to apply the likelihood of true mastery for a subgroup of testees having an observed score that might or might not be chosen as the cut-off score on the test. Hambleton and Novick use the likelihood because they prefer to talk about true mastery instead of observed mastery. In order to evaluate expected utilities (see module 6 on that topic) predictive distributions are needed. They would not protest against using all the items in the domain as a test; therefore the picture shows the predictive distribution for a test having 400 items.

Hambleton and Novick do not supply an explicit example; here a short test of 20 items is chosen, having cutoff score 13 correct, because Davis, Hickman and Novick (1973) use this as their example. Hambleton and Novick do not in fact construct likelihoods, but use other techniques to estimate the distribution of true mastery scores given an observed test score; this difference in approach does not, however, affect the paradigmatic discussion here.

Hambleton and Novick assume for their example that the proportion of subgroup students having true mastery above the true mastery cutoff is .85 (this equals the expected utility using the threshold utility function with values .0 and 1.0; on expected utility see chapter/applet 6). Given a domain of 400 items and a subgroup scoring 12 correct out of 20 items, this proportion means that the true mastery cutoff score in the domain is 194 items. This number has been used already in the threshold utilities plot above. The expected utility following from the second threshold utility function, with values .8 and .9, is .885.
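The .885 is simply the .8/.9 function weighted with the .85 proportion of masters:

$$EU = (1-.85)\times .8 + .85\times .9 = .12 + .765 = .885.$$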

In the Hambleton and Novick paradigm the interpretation then is (1) that the second threshold utility function is related to the decision to retain this subgroup of students for remediation, the first threshold utility function to the decision to pass them, and (2) that therefore the best decision is the one having the highest expected utility. In the example the expected utility of retaining this subgroup is the higher one; therefore the cutoff score on the test should remain 13.


Hambleton and Novick's second threshold utility function, related to the decision to retain the subgroup scoring 12 out of 20 for remediation, assumes costs or investments to be part of utility also. The result of this move has been that even in empirical research authors have been tempted to ask their subjects to subjectively estimate these 'loss functions,' instead of the underlying utility structure, incurred costs or investments, and remediation effects (e.g. Van der Gaag, 1990). In this way the authors reduce the utility structure to 'correct' decisions evaluated neutrally, and 'wrongly' passing a nonmaster or failing a master, decisions that are evaluated negatively, the one possibly more so than the other. In the next paragraph this paradigm will be tested on the critical point: how can the teacher subjectively yet consistently estimate the seriousness of one kind of classification error against the other (classifying a master as nonmaster, and the reverse)? The authors have next to nothing to say on how to solve the teacher's predicament.


Testing the Hambleton and Novick (1972) threshold utility paradigm


In the following series of steps the Hambleton and Novick 1972 main example is contrasted with what a proper experimental approach probably would show. The point here is not what 'real data' would look like - the only way to know is to collect them - but what kind of data a proper interpretation of the decision-theoretic approach would demand, other than what Hambleton and Novick suggest would suffice. This exercise goes to the heart of the confusion about utility functions in the Hambleton and Novick paradigm - the received view ever since 1972.


likelihood plot


Suppose 'Course One' is used state wide, 100,000 students participating in it this year. On the summative test of unit 5, its length being 20 items, 10,000 students score 12 points, the highest number of points not resulting in a pass. What we do know about these 10,000 is the likelihood of their 'true' mastery scores; module 2 will produce the plot.




learning model plot


Hambleton and Novick assume the teacher to know something about the learning process. If she knew nothing, there would be no basis at all for whatever decision on test scores. Assume the applicable learning model is the replacement model, the complexity of the questions being at level 2. The learning applet will plot this learning model for two mastery levels only - 0.5 and 0.6 - but the same learning model applies to all possible mastery levels represented in the likelihood.





prediction plot


Retaining the group of 10,000 for another half of one learning episode will result in a predictable test score distribution for the group after one and a half episodes. Applet 6b (Expectations b) will produce the prediction. Let the test to be predicted be a very long one, say 500 items, in order to get results closely approximating the true mastery scores of these students. The passing score on the 500-item test is taken to be 313 points, the equivalent of the demarcation on the 20-item test being 12.5 points. By mistake this and the next plot have been based on a cut-off of 316 points instead; because the chosen cut-off is anyhow somewhat arbitrary, I have kept it at 316. You should be aware that choosing the 'real' mastery cut-off somewhat higher will effectively lower the expected utility corresponding to the retained group's treatment effect a bit less than that corresponding to the passed - no-treatment - group.
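The 313 is nothing but the 12.5-point demarcation rescaled to the 500-item test:

$$500 \times \frac{12.5}{20} = 312.5 \approx 313.$$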





prediction plot


Passing the group will also allow a prediction of test results after an extra half period, this one being used to prepare for the summative test of unit 6. A conservative approach is to assume the mastery of passed students to be still the same after another period that is used on other activities. Some forgetting surely will be more than compensated for by new learning; after all, the course matter of unit 5 is supposed to be used or to be necessary in unit 6. In this way two predictive distributions will become available, allowing comparison of the effects of the two treatments: allowing students scoring 12 out of 20 a pass, or not. Remember though, backwash (or feedforward) effects of instituting a cut-off of 12 instead of 13 points have to be reckoned with also; the SPA model obligingly will allow us to do so.





The decision maker now finds herself in the situation of having to choose between actions A and B: action A is to pass students scoring 12 out of 20, incurring no extra costs as far as unit 5 is concerned, with an expected utility uA - in the example .38; action B is to retain them, incurring the extra investment of one half extra episode of teaching and learning activities, ultimately resulting in expected utility uB - in the example .92. The only reasonable way out for the decision maker now seems to be to fall back on the fundamental restrictions on the resources for Course One - how much time is available for it, what amount of course material should be covered, what percentage of the students must reach the Course One mastery level - and to ask, given the restrictions, which course design realizes them best. Hambleton and Novick, however, are of the opinion that some good thinking could do the trick as well. Let us try to follow their line of reasoning, on page 5 of the 1972 report.





To begin with, given the data, i.e., given a score of 12 out of 20 items, in the example of Hambleton and Novick the probability that a student is a master is assumed to be .85. Operationalized in the form of our test of 500 items, the threshold should be set at 243. The expected utility of a pass for this group of students then also is .85, threshold utility meaning that a master has utility one, a nonmaster utility zero. Now assume retaining these students means they have to invest .2 episodes extra; using the replacement learning model, complexity two, the result is .963 masters, i.e., the expected utility is .963. In this case, therefore, the extra investment of .2 episodes results in a .113 growth in expected utility.





Hambleton and Novick case


This, however, is not the kind of analysis Hambleton and Novick present (see the last paragraph above); their concepts are false negative and false positive decisions, and how much worse the latter are compared to the former - 8 times was chosen as a fitting quantity in their example. Well, without endorsing their talk of false positives and false negatives, which is way too suggestive, it is known from the plot at the right how the numbers run: failing the group results in .85 'false negatives,' passing them results in .15 'false positives.' Hambleton and Novick do not know, and do not ask, how many students will be masters after the extra investment of .2 episodes. We do know, however, and will use the data. About 1,100 students reach mastery after the extra time, 370 do not. Choosing threshold utility means that the dominating result is whether or not the student's mastery is at least 243/500. How much mastery is 'better' than that is not important (Davis, Hickman and Novick, see below, will add other utility functions to the paradigm). Now then, the extra investment of .2 episodes will be lost on .85 + .037 = .887 of the students. If the teacher does not prefer one action (A) above the other (B), given the data, the breakeven point between utility and investment is that 1 utility point is just equal in worth to the investment of .887/.113 x .2 = 1.57 extra episodes. How does this compare to Hambleton and Novick's loss ratio a/b of 5.67 in the breakeven case? I cannot compare the two results, other than by using the complete model, i.e. stipulating a likelihood and a learning model and a definite investment of extra time. The conclusion, then, is that Hambleton and Novick assume that the teacher not only is able to identify her breakeven point, difficult enough as it stands, but also is prepared and able to intuitively estimate the two predictive distributions and the learning model involved, given a certain amount of extra time to be invested. Is this a reasonable assumption?
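For the record, the breakeven arithmetic of this paragraph runs as follows:

$$\underbrace{.85}_{\text{masters anyway}} + \underbrace{.037}_{\text{nonmasters even after remediation}} = .887, \qquad \frac{.887}{.113}\times .2 \approx 1.57 \text{ episodes per utility point.}$$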

Another way to formulate the difference between (1) the Hambleton and Novick paradigm as presented in the last paragraph, and (2) the empirical modelling in this paragraph, is that Hambleton and Novick do not present a solution at all, because of their assumption that the teacher already knows what the opportunity losses are, while the empirical modelling of the remediation effects promises a realistic solution, should it be possible to find actions either equal in investments or equal in outcome utilities. A model solution to the equal outcomes case is offered by Van Naerssen's (1970) tentamenmodel (above).


The Hambleton and Novick (1972) paradigm contains a hidden invitation to the user to neglect the collection of relevant empirical data on the effects of differentially treating students within classes or cohorts. As such it should only be used as possibly an interesting exercise in mathematical modelling, not as a serious contribution to the solution of pressing problems in educational assessment.

The subject that concerns us here is another one, though; the Hambleton and Novick paper of 1972 is the origin of a paradigm that mixes the concept of utility up with investments and unknown results of remediation activities.




Here is the place to present a table listing the literature using the Hambleton and Novick paradigm in its pristine form, i.e., under threshold utility.

[table here, I am preparing it as of May 2005]




paradigm lock-in 2: Davis, Hickman and Novick (1973)


Davis, Hickman and Novick (1973) start from the same simplified pass-fail decision case, presenting it in a more formal way befitting the statistical hypothesis testing paradigm. It is the utility structure that exclusively concerns us here, however. They narrowly escape recognizing that the real decision for the teacher also is where to place the reference point as to what is an acceptable level of 'mastery':

In selecting ϑo, careful consideration of the objectives of the training program, and previous experience with the training and evaluation materials must prevail. If, for example, ϑo is intended merely to give at least an even chance of completing the next lesson, ϑo might be set to that level of functioning which has historically had a 50% success rate on the next unit. If, on the other hand, the decision maker is very concerned about the ill-effects of the frustration of a poorly prepared student reading advanced material, perhaps ϑo ought to be somewhat higher. In any case, once ϑo is specified for the test, prior and collateral information about the student will be combined with the test result (x) for the purpose of estimating ϑ.

Davis, Hickman and Novick, 1973 p. 21.


If only the authors had seen that the teacher does not know the 'true' masteries of her students, only observed test results, they might have had the insight to substitute the cut-off on the test score dimension for the platonic cut-off on the mastery dimension, and to continue to solve the optimization problem thus described. Alas, this was not to be.

Now having mastery dimensions was not deemed inconsistent with nominal losses (threshold loss), and indeed, the mathematics is straightforward. This is exactly the kind of paradigm criticized by Cronbach and Gleser (p. 1): "It is therefore desirable that a theory of test construction and use consider how tests can best serve in making decisions. Little of present test theory, however, takes this view. Instead the test is conceived as a measuring instrument, and test theory is directed primarily toward the study of accuracy of measurement on a continuous scale." Almost two decades later, authors claiming to extend Cronbach and Gleser's work nevertheless stick to the old paradigm.

Davis, Hickman and Novick, halfway through their report, substitute a threshold utility structure for Hambleton and Novick's original threshold losses. This step in the right direction in no way solves the fundamental problem of disregarding empirical evidence on treatment effects. It is quite instructive to see how these authors, as well as their many followers, struggle with the concept of utility in the simple threshold utility case, trying to apply a technique that would work well with bets (see e.g. Schlaifer, 1959).

"What we need to determine is that value v such that our decision maker would be willing to flip a fair coin to decide which gamble he will take:

Admittedly, specifying v is not an easy task, but it can be done. In order to accomplish this, our decision maker might be aided by considering 'how much better' or more desirable correctly classifying a student is than misclassifying a nonmaster and compare this with how much better correctly classifying a student is than misclassifying a master. If, for example, correctly classifying a student gives you 10 'utiles' more than misclassifying a nonmaster and only 5 'utiles' more than misclassifying a master, then v for C12 would be 50 percent. This says that misclassifying a master is half-way between misclassifying a nonmaster and correctly classifying a student on a 'utiles' or desirability scale. Assuming that, in fact, v = 50 then u(C12) = v/100 = .5. (..)

Davis, Hickman and Novick, 1973 p. 51.
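Unpacking the bookkeeping in this passage (my reconstruction, not the authors' own notation): scale the utility of misclassifying a nonmaster at 0 'utiles' and that of a correct classification at 10; the stated differences of 10 and 5 utiles then place misclassifying a master exactly halfway:

$$v = 100\times\frac{10-5}{10} = 50, \qquad u(C_{12}) = \frac{v}{100} = .5 .$$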


The citation above must be one of the most obscure passages in the psychometric literature of the seventies, not because of the lottery technique itself - the basic idea was used in commerce as early as the sixteenth century - but because this particular application does not have a useful interpretation. It surely is highly disturbing to see how many uncritical followers these authors have had ever since. Look carefully, and note how inconsistent the idea of a true mastery cutoff score is with the idea that the utility of mastery is linear; here at least the original Hambleton and Novick position was still consistent, the teacher's utility and the utility of the testing situation confronting the students both being of the threshold type. If the Davis c.s. inconsistency in itself does not convince you, look at the text again and answer for yourself the question what the chances are that the highly subjective 'technique' offered by Davis, Hickman and Novick makes the teacher decision maker worse off than she would be without this 'solution' to her problem. The philosophically minded reader surely has spotted the curious attempt, not unusual in the literature on optimizing pass-fail scoring, to replace one highly subjective procedure with another one that is just as difficult to apply reasonably.

Davis, Hickman and Novick linear utility case


However, I will try to put Davis, Hickman and Novick to the test, in the linear utility case. The illustration, just as in the Hambleton and Novick case (now use option = 420), shows two linear utility functions that are used together in their paradigm. On top of that, Davis c.s. assume the two utility functions to cross each other exactly at the true mastery cutoff point (in the plot that is score 194 in the domain of 400 items). To obtain expected utilities the functions are weighted with the same predictive test score distribution, likelihood, or other estimated probability distribution function. (Davis c.s. give much attention to techniques using collateral information to obtain sharper estimates. This topic, however, belongs to chapter/applet 2.)

A special characteristic of the Davis c.s. linear utility case is that the maximum likelihood estimate of true mastery, or the expected value of the predictive distribution using all items in the domain, decides what is optimal: to the right of the true mastery cutoff score it will choose the decision to pass students, to the left of it the decision to retain them. Therefore it is only necessary to know which function lies above the other to the right of the true mastery cutoff score, not what the exact slopes of the functions are.
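In symbols (my notation): if utility under decision d is linear, $u_d(\vartheta) = a_d + b_d\,\vartheta$, then by linearity of expectation

$$E[u_d(\vartheta)] = a_d + b_d\,E[\vartheta],$$

so the comparison of the two decisions depends on the predictive distribution only through its mean; given that the two functions cross at the true mastery cutoff $\vartheta_0$, the sign of $E[\vartheta]-\vartheta_0$ decides, whatever the exact slopes.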


Testing the Davis, Hickman and Novick (1973) linear utility paradigm


One of the main contributions of Davis, Hickman and Novick is showing how the rather primitive threshold utility functions might be replaced with more flexible ones. It will suffice here to treat the linear utility function case, because the other functions use the same problematic paradigm.

Ignoring the need to collect empirical evidence on the effects of retaining a particular subgroup of students or not retaining them, Davis, Hickman and Novick construct two linear utility functions on presumably the same goal variable dimension, which in their case is true mastery. Using the two linear functions and one conditional probability distribution of mastery, given an observed score x, they proceed to solve the decision maker's problem whether to retain or pass students scoring x. It is possible to see immediately that a realistic decision theoretic approach to the problem would use only one utility function, on mastery scores or on observed scores, and two probability functions, corresponding to the outcomes of two alternative treatments. If you understand this, you can skip the following reconstruction of the Davis et aliis case.



utility Davis et al case


One of Davis et aliis' linear utility functions is a positive function of true mastery. We are free to scale this one from zero to one. The concept of true mastery is a highly problematic one, at least for a decision maker assigning utilities to entities or happenings she will never be able to see. If the tests used were very long tests, Davis et aliis surely would not object to defining the utility function on the dimension of scores obtained instead. What, then, about much smaller tests: could it be reasonable to define linear utility on their observed scores also? Let us just do it, and see how the argument runs. We then have a linear utility function that is the fully compensating one. The illustration is the 20-item test again.




likelihoods Davis et al case


The SPA model allows the construction of the likelihood given that the student's score is twelve out of twenty; the likelihood is not shown here. Then the predictive distribution based on that likelihood can be evaluated; it is plotted (in red) together with the predictive distribution for the experimental condition, see the next point. Using a criterion test of 20 items instead of one of 500 makes the figures more realistic; it is however not crucial to the test of the paradigm.




There is no possibility to coherently define another linear utility function, having already defined the linear utility of every possible observed mastery. However, it is possible to model the likelihood in the experimental condition, i.e., retaining the students another .2 episode, letting them grow in mastery under the replacement learning model with complexity parameter 2.

The figure to the right presents the two predictive distributions and the corresponding expected utilities. The expected utilities should be divided by two (in the applet utility has been scaled from 0 to 2): 1.192/2 and 1.357/2, or .596 and .678 respectively. The gain in expected utility - or mean observed mastery, in cases of full compensation - therefore is .082, a value that has to be balanced against the extra investment of .2 episode for every student in the subgroup. The balancing act has therefore not been decided by this decision-theoretic exercise. An external criterion could be used to decide which way the balance should tip: the total amount of time available for the course, the minimal amount of course material that should be mastered, etcetera. Better still: first investigate the possible consequences the choice of the cutoff score might have on the strategic behavior of students, and the predictive score distributions resulting from it.
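In numbers:

$$\Delta EU = \frac{1.357 - 1.192}{2} = .678 - .596 = .082, \qquad\text{to be weighed against the extra .2 episode per student.}$$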





We now find ourselves in the same predicament as earlier, when testing the Hambleton and Novick threshold loss model. The difficulty now is to find a balance between a difference in expected utilities on the one side, and the extra investment of preparation time on the other. They are incommensurable; even the indifference curve technique will not solve the problem, because our decision maker is not the 'owner' of the preparation time involved. So much is admitted by Davis c.s. (p. 93 ff), in the comparable case where the teacher first has to decide whether to use the test or not, weighing the costs of testing against the increment in expected utility. A solution (not the one Davis c.s. present) is within reach, however, if we are prepared to remodel the case, assigning the student(s) the role of primary decision maker(s). As the case stands, it is up to the subjective procedure in the Davis, Hickman and Novick approach, cited above, to come up with the parameters of a linear function, to be weighted with the same likelihood belonging to the pass treatment, and in the process to identify a unique point on the observed score scale beyond which it is definitely better to pass the students. That sounds like magic, to me at least.





An interesting detail in the Davis, Hickman and Novick paradigm is the thesis that their two utility functions should intersect at the true mastery cutoff point. It is a misconception, but placed amidst so many other misconceptions it will not be easy to demonstrate what the misconception is. The curious thing about the paradigm as a whole is that these authors stubbornly believe that their decision maker has no problem in locating the true mastery cutoff point - she knows pretty well what it is - only a problem in using that knowledge to find what the best cutoff should be on the fallible test of a handful and some items.


The reader should note that what we have done here is define utility as a separate linear function for each possible decision, di. Thus, if decision d1 is chosen, the payoff or utility is to be a linear function of the state parameter ϑ with slope f and intercept e. For decision d2, the slope is h and the intercept is g.
The existence of a breakeven or indifference value ϑo of the state parameter ϑ imposes the condition that

e + f ϑo = g + h ϑo, or ϑo = (e - g)/(h - f).

Davis, Hickman and Novick, 1973 p. 59.


The conclusion now can be no other than that the decision maker has to subjectively estimate the effects that repeating the course will have on the true mastery of students, combining that estimate with an estimate of the costs in scarce resources of students as well as teachers (time, risk of losses in motivation, the logistics of the course). This proves the decision maker's task in the Davis et al. paradigm to be quite impossible. Of course, subjects may nevertheless carry out an impossible task to the satisfaction of the researcher, see for example Van der Gaag (1990). That will however not prove a faulty paradigm correct.

The Davis et al. paradigm thus invites researchers (and teachers) to let not empirical data decide the merits of this or that contemplated cutting score, but the unchecked fantasies of the decision maker about his or her values and opinions concerning the cut-off on the 'true' mastery dimension.



Davis, Hickman and Novick (1973) quadratic utility


quadratic utilities Davis et al case


After having presented linear utility, Davis, Hickman and Novick continue with the treatment of several other mathematical functions that might be used as utility functions. This is a problematic approach to the decision problem; instead of analysing the decision problem and developing the proper utility functions from it, the reverse order is chosen: presenting mathematical functions that might be used as utility functions. Moreover, the flexibility of the mathematical functions, of which the quadratic is only the first form, quite mistakenly suggests that any decision problem probably should have a utility function from among the set of functions presented by Davis c.s. The general case in education, that of some compensation being allowed, is a function having a highly specific form that is not matched by any of the Davis c.s. utility functions.

The figure presents the 'raw' quadratic utility case, a reconstruction of their figure 4.2a; curves 1 and 2 have been interchanged, otherwise the applet plot would be out of range (the applet's option 430 plots quadratic utilities, using the first function only to determine the maximum and minimum values). Davis c.s. use a transformation of scale and location to reduce the number of parameters. The transformation can be inspected using option = 431 in the applet; the result does not conform to the Davis c.s. Figure 4.2b (the d1 function lies between the values -1.431 and .268, Davis c.s. have a three times greater range).



Davis, Hickman and Novick (1973) exponential utility


exponential utilities Davis et al case


The picture shows exponential utility curves; it is a reconstruction of Davis, Hickman and Novick's Figure 4.4a (use option 433 in the applet, and reverse the sign of the third parameter of the second curve; option 434 will produce the reparametrized Figure 4.4b, using the same parameter values).

The green curve plots out of bounds here (bottom left), because only the first function is used to determine the lowest and highest values.



Davis, Hickman and Novick (1973) squared exponential utility


[Figure: reconstruction of the Davis c.s. squared exponential utility functions]


The last family of functions that might be used as utility functions is that of squared exponential utility. The figure shows the reconstruction of Davis c.s. Figure 4.6 (applet option 435). It was not possible to get the right result using the Davis c.s. parameter values a = .05 and b = .01; the reconstruction uses the parameter values a = 1000 and b = 500. In the applet the exact formula 4.13 from Davis c.s. has been used. In the Davis et al. figure the first curve (the red one here) is set to zero to the right of the reference score, the other function (the green one here) is set to zero to the left of the reference point.



Discussion of the Davis, Hickman and Novick (1973) utility functions


Because exact references to the literature are absent in Davis c.s., it is not easy to see why these utility functions could be realistic in the educational assessment case, or whether function types not treated by Davis c.s. might not be useful. It is perfectly clear, however, that the mathematical approach of Davis c.s. is connected to other work in Bayesian statistics, especially on posterior functions. It was important then - in 1973, without fast computers at hand - to have formulas that could be evaluated analytically or at least numerically. It was not yet possible to use Monte Carlo methods (simulating processes) or extensive numerical methods for the evaluation of realistic problems. The SPA-model is very computer-intensive, and has no problem whatsoever in generating likelihoods and, from these, the predictive distributions necessary to solve decision problems in educational assessment. The options to reconstruct the Davis c.s. models have been implemented in only a few statements in the Java language, using of course the already available machinery for plotting the resulting curves. Yet the optional models are fully integrated in the SPA-model, allowing expected utility curves to be evaluated on the basis of the Davis c.s. utilities. Part of the discrepancies between the Davis c.s. approach and the SPA-model approach therefore follows from developments in computer power since the sixties and seventies. Nevertheless, the mathematical bias of Davis c.s. has led them to give far too little attention to the question of what could validly serve as utility functions in educational assessment.

[Important issues that will have to be discussed here: is it possible to construct - or find in the literature - cases where both approaches lead to clearly different predictions and results?]



Mellenbergh and van der Linden (starting in the 70s)






Special points



[June 2003: NB: see the note at Figure 8. For a personal variant of the objective function, the shape of the function differs rather drastically from the one depicted here. The example given here of adjusting for attitude toward risk is also less fortunate in another respect. My current position is that students may well differ in their degree of risk proneness, or the same student may differ between situations, but that the objective situation is so dominant that taking risk attitudes into account would for now be too luxurious a refinement of the model.]


In order to construct expected utility functions, utility functions must be made available first. The construction of expected utility functions is, however, not necessary for the evaluation of optimal strategies in preparing for achievement tests. It may come as a surprise that the construction of utility functions is not an essential condition in the quest for optimal strategies either.
The reason for this state of affairs is that the evaluation of strategies involves the evaluation of expected costs, using an algorithm that effectively reconstructs the utility function instead of using the expected utility function. All of this is valid assuming utility functions of the type handled in this chapter; that type of utility function is specific to achievement testing in educational settings.


Generality



Empirical support



Application



Project history


The crucial parts of the theory and construction of utility functions were developed late in the seventies and published in my 1980 articles in the Tijdschrift voor Onderwijsresearch. This work was discussed at the time in psychometric working groups in the Netherlands, and in that sense it was thoroughly known, if not well understood. The idea that examination regulations allow the objective construction of utilities on assessment results dates from 1995, especially the work on the medicine first year examinations at the University of Groningen; the construction itself was however faulty at that time. Only recently was it discovered how to construct the general utility curve correctly, i.e., in cases where some compensation of points is allowed. The corollary of the constructive principle followed immediately: the well known dichotomy of pass-fail testing versus grade point average testing, alternatively called conjunctive versus compensatory testing, really is not a dichotomy at all; pass-fail scoring - pure threshold utility - and full compensation are the two extremes of one dimension.

Also of recent date is the explicit recognition of the difference between the concepts of outcome utility and investment of resources. Without it, optimal student strategies would be difficult if not impossible to find, and the misconceptions in the received view would have been more difficult to expose.

August 26, 2005. It is possible to come full circle and use the full model's optimal strategies to construct utility curves based on future savings in resources. In this way outcome utilities can be equated with future investments of resources. This seems to contradict the statement above on the difference between outcome utility and investment of resources, but there is no conflict as long as the savings in resources lie in the future. In cases of pure threshold utility, such as the Last Test in the course or examination, this new construction of utility is not useful. Instead, shifting the cutoff score on the Last Test produces a series of different optimal strategies; it is just this series that allows the new, second generation utility function construction in the case of the Next To Last Test, and therefore also in the case of any test except the Last Test.

Because the full SPA-model's results are needed to construct this second generation utility function, the explanation of its characteristics makes use of concepts treated only in the chapters yet to follow; the full treatment of 'true utility' is given in chapter nine.

December 18, 2005. In developing the secondary utility function it became clear that there is a special situation in the Last Test case where negative compensation points have to be made up for. The conceptualization of the Last Test situation has always been a nuisance, ever since the first - faulty - conceptualization of its utility function in 1995. Because it also proves to be a crucial building block in the SPA model, developing the secondary utility function offered another opportunity to clear up the situation.

Here positive compensation points simply result in a lowered reference point, a lowered cutting score for passing the Last Test; its utility function is a threshold function on the new reference point. Negative compensation points, however, pose a serious difficulty, because now there are three possible outcomes instead of only two: failing the LT, passing the LT at the new - higher - reference score, and passing the LT while not compensating for the negative points standing.

What, then, is the formal, primary utility function for this case? Assume there is but one negative compensation point. Passing the LT but not compensating for this negative point implies that the student has to resit an earlier test, probably the test that earned her the negative compensation point. Assuming the LT and this earlier test to be equivalent in all respects, what the student 'wins' in this situation is that she has to resit a test at its original reference, instead of having to resit the LT at the higher reference corresponding to the negative compensation point. The utility is known from the utility function of the earlier test: it is the difference in utility between passing that test at its reference score and passing it at a lower score that later must be compensated for.

The surprising result is that the utility function for the LT in this case is a two-step threshold function, the first step being a low one, the second step being one unit higher than the first, because a score at that level passes the LT itself as well as compensates the negative compensation point.

Therefore, the utility function takes a special form in the LT case. In the applet the option to choose is 'LT'; the resulting plot looks fundamentally different from that for a test that is not the last in the program. Also, it is now not allowed to declare positive as well as negative compensation: on the LT it must be either positive (negative points standing must be compensated for) or negative (positive points standing may be compensated with). In other words: the compensation in the LT is backwards only.



Java code


As the form of the utility functions is fairly simple, there is no particularly interesting Java code to show. The mathematical utility functions from the literature (Hambleton and Novick, Davis c.s., et cetera) map directly into Java code, and therefore need not be treated here either.



Testing the applet 4


The correctness of the method can be tested by plotting some extreme functions; the results can be inspected visually.

A problematic point, which concerns many other methods also, is that the use of the utility method in other applets is not transparent. One has to trust that the utility functions used to evaluate expected utilities correspond to the parameter values given in the menu. It helps to know that utility functions are constructed in one special method, which also accommodates the special option functions according to the option number passed to it. So there is a solid coupling between the option number in the menu and the specific utility function used to evaluate expected utilities and expected utility functions. The advanced applets in chapters 7 and later will produce thumbnail plots of the utility functions for visual inspection.



Literature on exams, marks and grading


Edexcel examzone site. http://international.examzone.co.uk/home/ [On marks and grading in connection with the secondary education examinations in the UK. Also re-marking, re-sitting, qualifications.]

Simon French (1985). The weighting of examination components. The Statistician, 34, 265-280. Jstor

Willem K. B. Hofstee (2009). Promoting intersubjectivity: a recursive-betting model of evaluative judgments. Netherlands Journal of Psychology, 65.

D. H. Krantz, R. D. Luce, P. Suppes, and A. Tversky (1971/2007). Foundations of Measurement Volume I: Additive and Polynomial Representations. Dover (reprint appearing January 30, 2007).

Marilena Vassiloglou and Simon French (1982). Arrow's theorem and examination assessment. British Journal of Mathematical and Statistical Psychology, 35, 183-192.

Robert Wood and Douglas T. Wilson (1980). Determining a rank order when not all individuals are assessed on the same basis. In L. J. Th. van der Kamp, W. F. Langerak and D. N. M. de Gruijter: Psychometrics for education debates (p. 207-230). Wiley.

David J. Woodruff, Robert L. Ziomek (2004). Differential Grading Standards Among High Schools. ACT Research Reports 2004-2 pdf

Pui-Wa Lei, Dina Bassiri and E. Matthew Schulz (2001). Alternatives to the Grade Point Average as Measures of Academic Achievement in College. ACT Research Reports 2001-4 pdf

Ben Wilbrink (1997). Assessment in historical perspective. Studies in Educational Evaluation, 23, 31-48. 56k html




Literature on decision-making


Albert Burgos (2004). Guessing and gambling. Economics Bulletin, 4, No. 4 pp. 1-10. pdf


Ronald A. Berk (1980). A consumers' guide to criterion-referenced test reliability. Journal of Educational Measurement, 17, 323-349.


Sarah Lichtenstein and Paul Slovic (Eds) (2006). The construction of preference. Cambridge University Press contents.




Literature on utility


Ballestero, E., and C. Romero (1994). Utility optimization when the utility function is virtually unknown. Theory and Decision, 37, 233-243.

Chen, J. J., and M. R. Novick (1982). On the use of a cumulative distribution as a utility function in educational or employment selection. Journal of Educational Statistics, 7, 19-35.

Davis, Charles E., James Hickman and Melvin R. Novick (1973). A primer on decision analysis for individually prescribed instruction. Iowa City: The Research and Development Division, The American College Testing Program. ACT Technical Bulletin no. 17. [Not available on the ACT website (as of Jan. 2008)]

Gaag, N. van de (1990). Empirische utiliteiten voor psychometrische beslissingen. Dissertation, November 22, 1990. For annotations see toetsen.htm#Gaag_1990.

Hambleton, Ronald K., and Melvin R. Novick (1972, 1973). Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10, 159-170. Earlier the same text: ACT Research Report no. 53, 1972.

Keeney, R. L., and H. Raiffa (1976). Decisions with multiple objectives: preferences and value tradeoffs. New York: Wiley.

Kelley, A. C. (1975). The student as a utility maximizer. Journal of Economic Education, 6, 82-92.

Krzysztofowicz, R. (1983a). Risk attitude hypotheses of utility theory. In B. P. Stigum and F. Wenstop (Eds.), Foundations of utility and risk theory with applications (p. 201-216). Dordrecht: Reidel.

Krzysztofowicz, R. (1983b). Strength of preference and risk attitude in utility measurement. Organizational Behavior and Human Performance, 30, 88-113.

Libby, D. L., and M. R. Novick (1982). Multivariate generalized beta distributions with applications to utility assessment. Journal of Educational Statistics, 7, 271-294.

Linden, W. J. van der (1980). Decision models for use with criterion-referenced tests. Applied Psychological Measurement, 5, 469-492.

Linden, W. J. van der (1987). The use of test scores for classification decisions with threshold utility. Journal of Educational Statistics, 12, 62-75.

Lindley, D. V. (1976). A class of utility functions. The Annals of Statistics, 4, 1-10.

Maesen de Sombreff, P. E. A. M. van der (1992). Het rendement van personeelsselectie. Dissertation, Rijksuniversiteit Groningen.

Mellenbergh, Gideon J., & van der Linden, Wim J. (1981). The linear utility model for optimal selection. Psychometrika, 46, 283-293. pdf

Naerssen, Robert F. van (1970). Over optimaal studeren en tentamens combineren. Openbare les (public lecture). Amsterdam: Swets en Zeitlinger, 1970. [The first publication on the tentamen model, in Dutch] html

Melvin R. Novick & Dennis V. Lindley (1978). The use of more realistic utility functions in educational applications. Journal of Educational Measurement, 15, 181-191. JSTOR preview


Realistic? Some grains of salt . . .

Melvin R. Novick & Dennis V. Lindley (1979). Fixed-state assessment of utility functions. Journal of the American Statistical Association, 74, 306-311. JSTOR preview


Formula crunching.

Pratt, J. W., H. Raiffa and R. Schlaifer (1964). The foundations of decision under uncertainty: an elementary exposition. Journal of the American Statistical Association, 59, 353-375. Reprinted in Tummala and Henshaw 1976, 35-57.

Raiffa, H., and R. Schlaifer (1961). Applied statistical decision theory. London: The M.I.T. Press.

Schlaifer, R. (1959). Probability and statistics for business decisions. New York: McGraw-Hill.

Schmidt, F. L., J. E. Hunter, R. C. McKenzie and T. W. Muldrow (1979). Impact of valid selection procedures on work-force productivity. Journal of Applied Psychology, 64, 609-626.

Cor Sluijter (1998). Toetsen en beslissen. Toetsing bij doorstroombeslissingen in het voortgezet onderwijs. Dissertation, Universiteit van Amsterdam. pdf [5.1 Beslissen met tests, p. 107 ff.]

Stigler, Stephen M. (1986). The history of statistics. The measurement of uncertainty before 1900. Cambridge, Mass.: The Belknap Press of Harvard University Press.

Titioura, Andrei (2002). Some functional equations connected with the utility of gains and losses. A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in Pure Mathematics. pdf 81 pp.

Tummala, V. M. R, and R. C. Henshaw (Eds.) (1976). Concepts and applications of modern decision models. Division of Research, Graduate School of Business Administration, Michigan State University, East Lansing, Michigan.

Verheij, J. G. C. (1992). Measuring utility, a psychometric approach. Dissertation, Universiteit van Amsterdam.

Vos, H. J. (1990). Simultaneous optimization of decisions using a linear utility function. Journal of Educational Statistics, 15, 309-340.

Vrijhof, B. J., G. J. Mellenbergh, and W. P. van den Brink (1983). Assessing and studying utility functions in psychometric decision theory. Applied Psychological Measurement, 7, 341-357.

Wilbrink, Ben (1980). Enkele radicale oplossingen voor kriterium gerefereerde grensskores. Tijdschrift voor Onderwijsresearch, 5, 112-125. [44k html + 80k gif]

Wilbrink, Ben (1980). Optimale kriterium gerefereerde grensskores zijn eenvoudig te vinden. Tijdschrift voor Onderwijsresearch, 5, 49-62. [56k html + 3 gifs]

Wilbrink, Ben (1980). Passing scores on domain referenced tests: an improved decision-theoretic methodology for optimization. COWO. pdf

Wilbrink, Ben (1980). Passing scores on domain referenced tests: an improved decision-theoretic methodology for optimization. COWO. Partly revised, not published except here. pdf

Wilbrink, Ben (1990). Complexe selectieprocedures simuleren op de computer. Amsterdam: SCO. (rapport 246) [240k pdf]   [bijlagen 304k pdf]

Wilbrink, Ben (1995). Studiestrategieën die voor studenten en docenten optimaal zijn: het sturen van investeringen in de studie. Short version in Bert Creemers et al., Onderwijsonderzoek in Nederland en Vlaanderen 1995. Proceedings van de Onderwijs Research Dagen 1995 te Groningen (218-220). Groningen: GION. Paper: author. [44k html + 18 gif]




Literature on motivation to study


John H. Bishop (2005). High School Exit Examinations: When Do Learning Effects Generalize? Cornell, Center for Advanced Human Resource Studies Working paper 05-04. http://www.ilr.cornell.edu/depts/cahrs/downloads/PDFs/WorkingPapers/WP05-04.pdf [Dead link? May 2, 2009]

John H. Bishop (2004). Drinking from the Fountain of Knowledge: Student Incentive to Study and Learn - Externalities, Information Problems and Peer Pressure. Cornell, Center for Advanced Human Resource Studies Working paper 04-15 pdf

Wilbrink, Ben (1992). Modelling the connection between individual behaviour and macro-level outputs. Understanding grade retention, drop-out and study-delays as system rigidities. In Tj. Plomp, J. M. Pieters & A. Feteris (Eds.), European Conference on Educational Research (pp. 701-704). Enschede: University of Twente. Paper: author. html

Wilbrink, Ben (1992). The first year examination as negotiation; an application of Coleman's social system theory to law education data. In Tj. Plomp, J. M. Pieters & A. Feteris (Eds.), European Conference on Educational Research (pp. 1149-1152). Enschede: University of Twente. Paper: author. html


Advanced applet 4a


For the advanced applet see its page.



Mail your opinion, suggestions, critique, experience on/with the SPA



February 2022 / contact ben at benwilbrink.nl

http://www.benwilbrink.nl/projecten/spa_ruling.htm