# The mastery envelope: The likelihood of it all

## Module two of the SPA model: Mastery

#### Ben Wilbrink

Warning. The Java applets were compiled under a Java version that has since been declared obsolete because of security leaks. I have not yet been able to update them to current Java or to rebuild them in JavaScript. For simple analyses it is of course possible to use WolframAlpha's generators for the binomial distribution (score distribution given mastery), the beta density (likelihood for mastery), and the betabinomial distribution (predictive score distribution); I will present the necessary links and documentation. Any questions: contact me. It is my experience that this innovative work attracts no attention whatsoever (zero, nada, niente); if you are interested, you are the exception, so do not hesitate to contact me.

This chapter presents the second module of the SPA model, SPA standing for Strategic Preparation for Achievement tests. The SPA model consists of a partially cumulative series of modules, each dedicated to a particular function: generating binomial distributions given the mastery level; generating the likelihood of mastery given a score on a preliminary test; generating, given the likelihood of mastery, the predictive score distribution for the test one has to sit; specifying objective (first generation) utility functions on test scores; specifying learning curves; evaluating, given the learning curve, expected utility along the learning path; evaluating, given the expected utility function, the optimal investment of study time in preparation for the next achievement test; and, using these last results, specifying the second generation utility function on test scores.

### Scheme of the mastery module

Given a preliminary test result, the likelihood of the mastery defined on the test item domain is based on the beta density β(a,b) having as parameters a = the number correct + 1 and b = the number not correct + 1. Simulation of the likelihood is rather straightforward: for every mastery value in the grid on the interval from zero to one, a binomial distribution is simulated, and the number of scores equal to the result on the preliminary test is counted. These counts allow the simulated likelihood to be constructed.

#### some highlights of this module

Figure 1 illustrates the main points of the search for what mastery might be, given the observation of a score on a preliminary test sampled from the domain of test questions. There certainly will be a mastery value that is more likely than any of the other possible values; it is the highest point in the curves plotted. But many other possible values will be seen to have substantial likelihood, and therefore the plot of the entire likelihood function, likelihood for short, will be of interest.

##### Figure 2.1.1. Given the (preliminary) test score, mastery can be anything, but some possibilities are more likely than others. The mastery having maximum likelihood is particularly interesting, but so is the spread of other possible values having substantial likelihood. The solid plot is the result of simulation, the red line histogram is the result of the mathematical construction.

It is of some importance to understand how the likelihood is constructed. It looks very much like a trick, for the construction is based on make-believe. Assume a particular mastery to be the one applying in the case of this student (her 'true' mastery, some would rather say) and use the generator of module one to construct a distribution of test scores resulting from this mastery. The likelihood of the mastery assumed then simply is the proportion of scores equal to the score on the (preliminary) test; figure two below illustrates this. Repeat the process for as many different values of mastery as you want to look into. The applet uses a grid of such values, equally spaced between (but not including) masteries zero and one, or zero and one hundred percent.
In multiple-choice testing guessing will be a nuisance. The student sitting a preliminary test need not guess on any item; therefore her preliminary test score may be taken to be free of guessing. To predict the score on the summative test, however, items not known will have to be guessed if blanks are scored as wrong answers. For the prediction (the next module) there will be an opportunity to make use of the guessing parameter.
Guessing, of course, might occur on other formats of test items as well. Guessing need not be a very conscious process. Conversely, students might not always be sure about the answers they give, even on items they know the correct answers to. It might be the case, therefore, that the student is not even able to refrain from guessing should she want to do so. Applet 2 therefore offers the opportunity to assume guessing to be part of the (preliminary) test score obtained. The likelihood constructed in this case still is that on mastery; the guessing probability is assumed known. No, it is not possible to derive a likelihood on the guessing parameter as well (for that, mastery would have to be assumed known).

By the way: the simulated likelihood in figure 1 does not look particularly 'random.' That is because by default a very fast algorithm is used; it works fine where the number of observations is very much larger than 100 (the number of observations in figure 1), otherwise results look somewhat clustered. There is an option available to use full simulation instead; use this option 210 with a small number of runs only, because it is a very slow algorithm.
To use the applet itself, go to applet 2. The new menu items will be explained there too.

### How to get information from an observed test score

No model should live without data to feed it. The data consist of information the student has about her mastery of the domain. That information might be anything relevant, but for definiteness the available information will be operationalized as the score obtained on a trial test or preliminary test of a certain number of items. The more items, the better the information. The plot shows the 'area' that envelops where mastery is likely to be located. Technically it is called a likelihood. The higher the curve, the more likely the corresponding mastery. The most likely mastery value is the value corresponding to the highest point of the curve. Is the most likely mastery the 'true' mastery? No; at best it is not very useful to talk about 'true' mastery at all. At worst the notion of 'true' mastery is confusing and tempts one to invent obscure statistical methods to catch this ephemeral event or thing, or whatever it is. Mastery will always be a somewhat floating concept. In the educational context that need not be problematic at all, because ultimately it finds its operationalization in test scores and grades obtained. In the third module, The Ruling, this will be made exact in the form of (first generation) utility functions on educational results.

##### Figure 2. For a chosen level of mastery, The Generator simulates a distribution, the blue bar of which indicates the scores equal to the given number correct on a preliminary test. This bar's proportional surface is the likelihood of the given level of mastery.

How is a likelihood constructed? For a series of possible values of mastery, construct their test score distributions. Determine in every case the proportion of the scores that are equal to the given preliminary test score. Specifically, divide the scale of mastery into a grid of sections or bars; it is convenient to make them of equal width. Then apply The Generator to the mastery corresponding to the midpoint of every section, count the number of hits, and plot these as a frequency distribution. Scale the distribution so that its height is one. To apply the generating binomial process, specify the number of test items to be equal to the number of items in the preliminary test. The illustration makes clear how the algorithm works in the simulation case. The analytical 'exact' method, by the way, works in the same manner, using analytical binomial distributions instead of simulated ones.
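The construction just described can be sketched in a few lines of Java. This is a minimal illustration, not the applet's own code: the class name `SimulatedLikelihood`, the method `simulate`, and the explicit `seed` parameter are assumptions made here, and the grid values are taken to be bar/grid for bar = 1 .. grid-1, in line with the description of a grid that excludes masteries zero and one.

```java
import java.util.Random;

/** Sketch of the straightforward simulation of a likelihood (hypothetical names). */
class SimulatedLikelihood {

    /**
     * For every grid mastery bar/grid (bar = 1 .. grid-1), simulate `runs` tests of
     * n items and count how often the simulated score equals the observed number
     * correct c. The counts are scaled so that the highest bar has height one.
     */
    static double[] simulate(int n, int c, int grid, int runs, long seed) {
        Random rng = new Random(seed);
        double[] hits = new double[grid - 1];
        for (int bar = 1; bar < grid; bar++) {
            double m = (double) bar / grid;          // mastery assumed for this bar
            for (int run = 0; run < runs; run++) {
                int score = 0;
                for (int item = 0; item < n; item++) {
                    if (rng.nextDouble() < m) {
                        score++;                     // item answered correctly
                    }
                }
                if (score == c) {
                    hits[bar - 1]++;                 // simulated score equals observed score
                }
            }
        }
        double max = 0.0;
        for (double h : hits) {
            max = Math.max(max, h);
        }
        if (max > 0.0) {
            for (int i = 0; i < hits.length; i++) {
                hits[i] /= max;                      // scale to height one
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical preliminary test: 15 correct out of 20 items, grid of 20 bars.
        double[] lik = simulate(20, 15, 20, 2000, 42L);
        for (int i = 0; i < lik.length; i++) {
            System.out.printf("m = %.2f  relative likelihood = %.3f%n", (i + 1) / 20.0, lik[i]);
        }
    }
}
```

Every bar needs its own full set of simulated tests, so the running time grows with grid × runs × n; this is the reason a faster variant is worth having.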

• No student ever knows her (real) mastery level, neither do faculty; yet it is possible to derive predictions from the fallible information that is available. The score on the test gives an indication, that indication being better the longer the test, and so does the score on a preliminary test. The question then is precisely how the last score enables one to assert something about the probability of this or that mastery level being the 'true' one. Using the preliminary score it is possible for every contemplated level of mastery to determine the relative likelihood of this particular score; the technique used is the same as the simulation treated in the last section, making it possible to determine the proportion of simulated scores that are equal to the score on the preliminary. Doing this for a series of possible mastery levels results in a function of this likelihood, the maximum of this function is then arbitrarily fixed at one because this function is not a statistical density function.

citation from my 1995 text (EARLI, ../publicaties/95AssessmentTheoryEARLI.htm)

### Scientific position

Assuming a binomial process and information that is available in the form of the number-correct score on a test of a particular length, the likelihood of the binomial parameter is the operationalization of the knowledge one has about the mastery of the student involved. This definition is adequate if the student herself is the one interested in the likelihood. It is still adequate if the teacher or a third person is interested in the likelihood, for the definition implies that the only information available to the teacher is the number-correct score on that particular test, randomly sampled from the domain of test items. In particular, the teacher has not picked this student because of her extreme score compared to that of her fellow test takers.

likelihood

Likelihood functions have been studied and described by - among many others - Edwards (1972). Of course he could use previous work by statisticians as well as philosophers. Stegmüller (1973) also described the likelihood concept comprehensively in a simultaneously published book. Recently there has been an upsurge in publications about likelihoods, see the literature paragraph.

The fact that a likelihood function is not some kind of probability function, and vice versa, is important. Knowing the level of mastery, the score on a test is the result of a binomial process. Knowing the score obtained on a test, the information about the underlying mastery is rendered by the likelihood function. The difference is accentuated by norming likelihood functions to have height one, while probability functions by definition must have an area under the curve of one. Likelihoods do not have the absolute significance that probabilities do have, likelihoods are relative to each other.

• Whereas operations such as forming products, sums, and differences are defined for probabilities, this is not the case here. The product, the sum, and the difference of likelihood values of different theta have no epistemological or statistical meaning whatsoever. [Stegmüller p. 115]

What one can do, however, is to form the relative likelihood (function) by dividing all likelihoods by the maximum likelihood. The resulting likelihood function has the standardized height one.

In contrast to the stipulation in the citation, the area under the (relative) likelihood (function) definitely plays a significant role in the SPA model, as it does in Bayesian statistics.

• If one wanted to reduce the merits of so-called likelihood inference to a formula, one could, with Diehl and Sprott, cite the following features: (1) compared with other procedures it is very simple to carry out; (2) it yields an exact result for every sample size; (3) it uses all the information that can be extracted from an observation.

The third point is, however, as already indicated, contestable, so that this feature of the likelihood function must be regarded as controversial to this day. It is accepted without reservation by today's subjectivists (the "subjectivist likelihood principle"), but also by some of the non-subjectivists.

the confidence interval

In a way the likelihood is just another method to estimate the range within which 'true' mastery lies with a certain probability. The orthodox statistical procedure is to evaluate the standard error of measurement, and use it to construct confidence intervals. Scores obtained on a group of students are needed for the standard error of measurement; therefore it is unsuitable for models of individual decision making.

The orthodox method of constructing confidence intervals is not an appropriate starting point for the building of the model. For one thing, a multitude of different intervals and probabilities can be chosen; the likelihood function in a way is equivalent to the result of trying to specify all possible confidence intervals simultaneously. For another thing, it is not clear how to construct predictive distributions starting with one, two, or a multitude of confidence intervals. The scientific position I have chosen has been clearly formulated and exemplified by E. T. Jaynes (1976). One of his examples, by the way, is a method for adaptive testing (an intelligent method for acceptance testing).

the beta density as likelihood function

It is known that the likelihood for a binomial parameter, given number correct = a - 1 and incorrect = b - 1, is the beta density (for example, Novick and Jackson, 1974, p. 109).

$$f(m) = \frac{1}{B(a, b)}\, m^{a-1} (1 - m)^{b-1}.$$

The factor B(a, b) is the beta function

$$B(a, b) = \int_0^1 m^{a-1} (1 - m)^{b-1}\, dm,$$

which is related to the factorial and gamma functions:

$$B(a, b) = \frac{(a-1)!\,(b-1)!}{(a+b-1)!}$$

for integral values of a and b, and

$$B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}$$

for real values of a and b.

Whether to use a, b or a+1, b+1 is a source of confusion; in my 1978 text (page 77), for example, I erred. The usage has been fixed historically.

It is a nice theoretical result, belonging to what in the literature is known as the 'betabinomial model.' However, as soon as there are complications introduced in the model, such as guessing, this theoretical result does not apply any longer.

Nevertheless, the beta density makes it possible to check the correctness of results obtained otherwise, and therefore is available as option 205 in the likelihood module. Option 206 evaluates the mean and standard deviation of the beta density, allowing a direct comparison with the values obtained otherwise.
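For a quick check outside the applet, the mean and standard deviation of a beta density follow from its two parameters: the mean is a / (a + b) and the variance is ab / ((a + b)² (a + b + 1)). The sketch below assumes nothing beyond these standard formulas; the class and method names are hypothetical and do not reproduce how option 206 is coded.

```java
/** Sketch: mean and standard deviation of the beta density with parameters a and b. */
class BetaMoments {

    /** Mean of the beta density: a / (a + b). */
    static double mean(double a, double b) {
        return a / (a + b);
    }

    /** Standard deviation: square root of a*b / ((a + b)^2 * (a + b + 1)). */
    static double standardDeviation(double a, double b) {
        double s = a + b;
        return Math.sqrt(a * b / (s * s * (s + 1.0)));
    }

    public static void main(String[] args) {
        // Hypothetical result: 12 correct and 4 incorrect, so a = 13 and b = 5.
        System.out.println("mean = " + mean(13, 5));
        System.out.println("sd   = " + standardDeviation(13, 5));
    }
}
```

Note that the mean a / (a + b) differs from the most likely mastery (a - 1) / (a + b - 2), so a simulated likelihood should be compared on the same statistic.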

the exact method to construct the likelihood function

The alternative to using the theoretical result is to use a constructive approach: for every mastery in the grid, directly evaluate the binomial probability of obtaining a score identical to the information given (a and b). Given the chosen level of mastery m:

$$\text{likelihood}(m) = \frac{(a + b - 2)!}{(a-1)!\,(b-1)!}\, m^{a-1} (1 - m)^{b-1}$$

This exact method results in a likelihood that, once scaled to height one, equals the scaled beta density, at least in the 'pure' binomial process without a guessing parameter. The exact method generalizes easily to the case of guessing as well, although the resulting likelihood function then will not be a beta density. The exact method therefore is the more general one; the betabinomial model applies in special cases only.
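A minimal sketch of the exact method follows, written in terms of n = a + b - 2 items and c = a - 1 correct. It is an illustration under stated assumptions rather than the applet's code: the names are hypothetical, ln(k!) is obtained by plain summation, and a guessing probability r is assumed to act on unknown items only, giving the binomial parameter m + (1 - m) r (set r = 0 for the pure case).

```java
/** Sketch of the exact likelihood on a grid of mastery values (hypothetical names). */
class ExactLikelihood {

    /** ln(k!) by direct summation; adequate for k of the order of a test length. */
    static double lnFactorial(int k) {
        double s = 0.0;
        for (int i = 2; i <= k; i++) {
            s += Math.log(i);
        }
        return s;
    }

    /** Binomial probability of c correct out of n items with success probability p. */
    static double binomial(int n, int c, double p) {
        if (p <= 0.0) return c == 0 ? 1.0 : 0.0;
        if (p >= 1.0) return c == n ? 1.0 : 0.0;
        double lnCoef = lnFactorial(n) - lnFactorial(c) - lnFactorial(n - c);
        return Math.exp(lnCoef + c * Math.log(p) + (n - c) * Math.log(1.0 - p));
    }

    /**
     * Relative likelihood for masteries bar/grid, bar = 1 .. grid-1.
     * Assumption of this sketch: r is the chance of guessing an unknown item
     * correctly, so the binomial parameter is m + (1 - m) * r.
     */
    static double[] likelihood(int n, int c, int grid, double r) {
        double[] lik = new double[grid - 1];
        double max = 0.0;
        for (int bar = 1; bar < grid; bar++) {
            double m = (double) bar / grid;
            lik[bar - 1] = binomial(n, c, m + (1.0 - m) * r);
            max = Math.max(max, lik[bar - 1]);
        }
        if (max > 0.0) {
            for (int i = 0; i < lik.length; i++) {
                lik[i] /= max;                       // scale to height one
            }
        }
        return lik;
    }

    public static void main(String[] args) {
        double[] pure = likelihood(20, 15, 100, 0.0);
        double[] guess = likelihood(20, 15, 100, 0.25);
        System.out.println("m = 0.75: pure " + pure[74] + ", with guessing " + guess[74]);
    }
}
```

With guessing included, the same observed score points to lower mastery values, and the curve no longer has the shape of a beta density in m.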

method of simulation

While the exact method allows direct evaluation of the likelihood of a given mastery, its simulation requires simulating the full binomial distribution given that mastery. For a grid of 100, therefore, 99 binomial distributions have to be simulated; even on fast computers this rapidly becomes time consuming. A shortcut is possible, however, that will not corrupt the randomness of the likelihood to be generated. The trick is to use one random number to generate item scores for all levels of mastery in the grid at the same time. For example, if the random number is 0.167, the test score for all levels of mastery above 0.167 is raised by one point, so that an item is scored correct with probability equal to the mastery. If a low number of simulation runs is declared, the result will show some clustering of neighboring scores. The full (but slow) simulation, however, is still available to the user as option 210.

The fast simulation is truly random in itself, and it does not introduce a particular bias by using the one random draw to determine one item score in the series of adjacent tests. The visible clustering where the number of runs is low is itself unbiased.

the group case

In purely individual cases the likelihood allows estimation of the area within which mastery will lie with a certain probability. I have yet to work out the exact relation to the group case. For now the following may be observed. The point is that being a member of a group adds information about mastery. That extra information will not help in reducing the 'standard deviation' of the likelihood function. Extra information may be accounted for, however, as a certain score on a preliminary test of a certain strength or length, and added to the purely individual information, which also has the form of a score on a (preliminary) test. Adding is straightforward: add the numbers correct as well as the numbers of items. Take as the number correct for the group, for example, the mean test score that groups like this one obtained in the last few administrations of the (preliminary) test. Use the predictor of module 3 to find the number of preliminary test items that results in a predictive test distribution having a standard deviation equal to that obtained over earlier administrations of the test. Analytically the problem may be solved using the betabinomial model.

The assumption underlying it all is that the individual student chosen truly is a member of the group, and has not been chosen because of something special about her. The problem for the student herself is how she can be sure to be a true member of the group. In the paragraph on empirical support I will return to the last point, using data on expected test results for the very first test law students in the University of Amsterdam had to take. Empirical data on student expectations make it plausible that students intuitively do something like what the method here prescribes: combining information about one's own mastery of the course material with knowledge about one's standing in the group of students. Before the very first test they do not yet know their own standing in this new group of fellow students very well; as a result their test score predictions are further off the mark than they are for later tests.

Note that there is no objection to doing the analysis ex post, after the summative test has been administered. The SPA model is a general model, remember?

### Special points

guessing

Students guessing on test items pose a serious problem to the interpreter of test scores. The problem, of course, is that it simply is not known on which particular items this student might have guessed. Incorrect answers may indicate that the student guessed incorrectly, but even so there will always be other ways for the student to answer incorrectly than by guessing. Because guessing is an extra parameter in the model, and therefore will not make things simpler, it would be best to get rid of guessing, and indeed there are possibilities (indicated below) to do so. Another reason to do so, of course, is that guessing introduces noise in the testing process. Ultimately, however, the SPA model should be applicable in cases where guessing is occurring, possibly without students being sure about when and where they guess or not. The best reason to downplay guessing might be that it can never be the purpose of education to teach students it is OK to guess if

The applet originally offered no opportunity to specify a guessing factor. One reason was that in the strategic model the student need not and should not guess on preliminary tests. Another reason is that specifying a guessing factor would complicate the model without adding substantial value. In chapter one on The Generator it was demonstrated that knowing and guessing both are probabilistic, and that adding both processes still results in a binomial process. Therefore it is possible to assume mastery to include whatever guessing the student might do. In either case the model specification of a guessing factor may be left out.

no guessing assumption

One approach to the question of guessing on questions is to assume there will be no guessing: mastery then is assumed to be without guessing. Because the model is assumed to serve the student making strategic choices, she should have no problem with sitting preliminary tests and not guessing on any item, whether items are multiple-choice or not. This position becomes a little awkward, however, because predictions of test scores then will be without guessing also. To get realistic predictions in situations where guessing really is important, the model should eventually offer the opportunity to specify guessing factors, to correct cutoff scores for the guessing factor, or to leave it entirely to the user to handle the guessing problem.

guessing assumed absorbed

The other possible approach to guessing is to assume it is absorbed in the definition of mastery. Everything in the SPA model then should be interpreted as inclusive of whatever guessing the student has done and might do. In the case of multiple-choice testing, predictions, utilities, as well as learning will as a consequence concern this newly defined mastery.

guessing probability assumed known

At the end of the day guessing simply should be another parameter in the model. To begin with, the probability of guessing the correct answer on an item that is not known will be assumed to be a known constant r, for example the reciprocal of the number of alternatives used in multiple-choice items. The binomial then still applies, its parameter now being m + (1 - m) r instead of m only: the mastery m plus the guessing probability r applied to the fraction 1 - m of items not known.

There is an extensive literature on guessing and correction for guessing, which suggests that models incorporating guessing will be complex. Such need not be the case. However, even using the binomial process to design a model of assessment, the designer might be tempted to use the binomial distribution on the parameter m, and then construct a conditional binomial distribution for the guessing occurring on the items not known. The resulting statistical formulas are quite intimidating, and do not suggest the possibility that they might be reduced to the binomial on the single parameter m + (1 - m) r. Christopher Zarowski (personal communication) proved the reduction possible. In the paragraph on Java Code the details will be given, because the complex model has been implemented as well, and may be run using option 209 (in the advanced applet 2a). Needless to say, the results under option 209 equal the results under the first model, so much so that it proved desirable to mark the 209-option results by using a different color (blue instead of red).
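The reduction can be sketched as follows. This is one standard argument, not necessarily the proof Zarowski gave. It assumes that each of the n items is known independently with probability m and that each item not known is guessed correctly with probability r; the symbols x (total number correct) and k (number of items known) are introduced here for illustration only.

$$
\begin{aligned}
P(X = x) &= \sum_{k=0}^{x} \binom{n}{k} m^{k} (1-m)^{n-k} \binom{n-k}{x-k} r^{x-k} (1-r)^{n-x} \\
&= \binom{n}{x} \big[(1-m)(1-r)\big]^{n-x} \sum_{k=0}^{x} \binom{x}{k} m^{k} \big[(1-m)\,r\big]^{x-k} \\
&= \binom{n}{x} \big[m + (1-m)\,r\big]^{x} \big[1 - m - (1-m)\,r\big]^{n-x}.
\end{aligned}
$$

The second line uses the identity $\binom{n}{k}\binom{n-k}{x-k} = \binom{n}{x}\binom{x}{k}$ together with $(1-m)^{n-k} = (1-m)^{n-x}(1-m)^{x-k}$; the third line is the binomial theorem, with $(1-m)(1-r) = 1 - m - (1-m)\,r$. Under these assumptions the compound process is a single binomial whose parameter is the mastery plus the guessing probability applied to the unknown fraction.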

other models of guessing

Another model, implemented as option 207, is the probability of guessing the correct answer being dependent on the level of mastery, in formula:

binomial parameter = m + ( 1 - m ) * ( r + r * m )

In this model the parameter r is the probability that a student who knows nothing guesses an answer correctly. As she grows in knowledge, her guessing probability rises by an amount r*m. This is a rough model; endless refinements may be thought of, but none of them has been implemented in the applet.
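The formula above is easily tried out for a few values of m and r. The following sketch simply transcribes it; the class and method names are hypothetical and say nothing about how option 207 is actually coded.

```java
/** Sketch: binomial parameter when the guessing probability grows with mastery. */
class MasteryDependentGuessing {

    /** Transcription of: binomial parameter = m + (1 - m) * (r + r * m). */
    static double binomialParameter(double m, double r) {
        return m + (1.0 - m) * (r + r * m);
    }

    public static void main(String[] args) {
        double r = 0.25;   // hypothetical: four-alternative multiple-choice items
        for (int i = 0; i <= 4; i++) {
            double m = i / 4.0;
            System.out.println("m = " + m + "  parameter = " + binomialParameter(m, r));
        }
    }
}
```

The result is a proper probability whenever r(1 + m) does not exceed one, which holds for every mastery if r is at most 0.5; for larger r the formula would need to be capped.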

Another approach might be to specify how many items on the preliminary test were guessed correctly, and use this to construct a likelihood on the guessing probability as well. This possibility has not been implemented (yet?). A useful variant might be to sample guessing behavior before learning starts, express that information in the usual way as the number guessed correctly in the sample of items, and then use the option 207 model on the growth of the guessing 'ability.'

Some models of guessing do not need special programming. In the case of multiple-choice (in contrast to true-false) items, for example, one may take the guessing probability to be the reciprocal of the number of alternatives minus one; the user simply chooses this as the value of the guessing parameter.

scientific position on guessing

Guessing is a nuisance in educational assessment. Under all circumstances random influences like guessing on items not known or partially known are harmful, and where possible and feasible should be avoided. Because the core business of education is to educate, it definitely is harmful to teach students that it is perfectly OK to guess on questions one does not know or is not sure of. One approach then, without abandoning multiple-choice questions altogether, would be to give the student a constant credit on questions left unanswered. The constant should be chosen so as to give ample credit to partial knowledge.

stratified sampling from subdomains

Stratified random sampling from subdomains offers no special points as far as the likelihoods for the subdomains are concerned. There simply are two likelihoods, therefore the advanced applet does not offer the option of subdomains analysis other than in the form of a second set of parameters for the second subdomain. The subdomains option will, however, be available in further applets on prediction, expectation or strategy. All that will be needed then are the two separate likelihoods.

Another way to put the reason for the omission here is the following. A likelihood is a function of levels of mastery; it is not possible to specify a dimension that is a combination of the mastery dimensions of the subdomains. Specifying a combined likelihood is possible however in the form of the bivariate plot of the two independent likelihoods; this plot on the two-dimensional mastery plane cannot be reduced to a plot on one dimension, and it therefore offers no advantage to construct it.

### Application

Given a particular test result, and nothing else known about the student, the likelihood function represents how likely each of the mastery values in the grid is relative to the others, and which mastery is the most likely of them all, that is, has 'maximum likelihood.' If you must bet on any of the mastery levels being true, bet on the one having maximum likelihood. If it is known that the student belongs to a group, the given test result should first be 'regressed to the mean' of the group, before constructing the likelihood. If it is known that everybody in the group is guessing only, the regression should be to the mean itself. If it is known that everybody in the group has perfect knowledge, there is no regression problem at all, because everybody will have all items correct. If the test is very long, the regression to the mean is negligible; if it is very short, the regression should be to the mean itself. Cases of practical import lie somewhere in between these extremes. In the betabinomial model (see the prediction, module 3) an exact solution is available; the betabinomial fit to the group scores supplies the parameters c (correct) and d (incorrect) that allow the construction of the likelihood of mastery for the group. See figure 1. It is now possible to specify that the likelihood for the student chosen randomly from the group, having a items correct and b incorrect (itself plotted in figures 1 and 2, the right one), is the likelihood on the parameters a + c and b + d, plotted in figure 2. Or: a + c correct out of a test of a + b + c + d items. Note that a + b always will be larger than c + d, so do not be tempted to combine high values of the sum c + d with lower ones of a + b.
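The pooling of individual and group information amounts to adding counts, which a small sketch makes concrete. The numbers used are hypothetical, as are the class and method names; the group record (c, d) is simply assumed to be given, whereas in practice it would come from the betabinomial fit mentioned above.

```java
/** Sketch: pooling an individual test record with a group record (hypothetical names). */
class GroupInformation {

    /** Pools (a correct, b incorrect) with (c correct, d incorrect) by adding counts. */
    static int[] pool(int a, int b, int c, int d) {
        return new int[] { a + c, b + d };
    }

    /** Mastery having maximum likelihood for a record of correct and incorrect answers. */
    static double maximumLikelihood(int correct, int incorrect) {
        return (double) correct / (correct + incorrect);
    }

    public static void main(String[] args) {
        // Hypothetical student: 15 correct, 5 incorrect; hypothetical group record: 6 and 4.
        int[] pooled = pool(15, 5, 6, 4);
        System.out.println("individual: " + maximumLikelihood(15, 5));
        System.out.println("group:      " + maximumLikelihood(6, 4));
        System.out.println("pooled:     " + maximumLikelihood(pooled[0], pooled[1]));
    }
}
```

The pooled value lies between the individual value and the group value, which is the regression to the mean described in the text; the shorter the individual test relative to c + d, the stronger the pull toward the group.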

### WolframAlpha

Information on the betabinomial distribution: http://www.wolframalpha.com/input/?i=betabinomial distribution.
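For readers who prefer to compute the betabinomial distribution locally, a minimal sketch follows. It assumes integer parameters a = number correct + 1 and b = number false + 1 and a test of n items, and it uses the standard form of the betabinomial probability, C(n, x) B(x + a, n - x + b) / B(a, b). The class and method names are hypothetical, and ln(k!) is computed by plain summation.

```java
/** Sketch: betabinomial probabilities for integer parameters (hypothetical names). */
class BetaBinomial {

    /** ln(k!) by direct summation. */
    static double lnFactorial(int k) {
        double s = 0.0;
        for (int i = 2; i <= k; i++) {
            s += Math.log(i);
        }
        return s;
    }

    /** ln B(a, b) for positive integers, using Gamma(k) = (k - 1)!. */
    static double lnBeta(int a, int b) {
        return lnFactorial(a - 1) + lnFactorial(b - 1) - lnFactorial(a + b - 1);
    }

    /** Probability of x correct out of n items, given beta parameters a and b. */
    static double pmf(int x, int n, int a, int b) {
        double lnCoef = lnFactorial(n) - lnFactorial(x) - lnFactorial(n - x);
        return Math.exp(lnCoef + lnBeta(x + a, n - x + b) - lnBeta(a, b));
    }

    public static void main(String[] args) {
        int n = 60, a = 13, b = 5;   // example values: 12 correct, 4 false, 60 items
        for (int x = 0; x <= n; x++) {
            System.out.printf("%2d  %.5f%n", x, pmf(x, n, a, b));
        }
    }
}
```

The probabilities sum to one over x = 0 .. n, and their mean is n a / (a + b), which gives two simple checks on any table produced this way.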

WolframAlpha and the betabinomial distribution, parameters number correct + 1 = 13 and number false + 1 = 5, number of items = 60:

http://www.wolframalpha.com/input/?i=betabinomial+distribution+13+5+60. Try some other values for the parameters.

reliability

• Xitao Fan and Ping Yin (2003). Examinee Characteristics and Score Reliability: An Empirical Investigation. Educational and Psychological Measurement, 63, 357-368.
• Abstract: The literature on measurement reliability shows the general consensus that examinee group heterogeneity with regard to the trait being measured affects the score reliability. Potentially, the performance level of the examinee group may also affect score reliability because of its potential effect on the relative magnitude of error variance. This article empirically examines the effects of these two examinee sample characteristics on score reliability of optimal-performance measurement. Two large extant data sets (criterion- and norm-referenced, respectively) were used in the investigation. The results suggest that both performance variability and group performance level affect score reliability, and measurement error tends to be smaller for high-performance groups than for low-performance groups.
• I am just curious whether the likelihood module is able to 'replicate' the Fan-Yin examination. I have not yet seen the article itself.

### Project history

In the 1970s there was much discussion of Bayesian methods in psychometrics, see for example Novick and Jackson (1974). The idea to use the beta function to put numbers on prior information therefore came naturally. In the statistical literature on beta-binomial models the mathematically convenient relation between binomial and beta-binomial distributions and the beta density was present all over the place.

The idea to simulate the likelihood was of much later date, about 1994. There has, however, always been some uneasiness in the direct comparison of simulation results and the analytical beta density. The problem is that continuous functions such as the beta density have some difficulty to live in the real world where functions like these have to be approximated by cutting up the real line in a grid of a convenient number of equal chunks, in the applets called 'bars.' Doing so, and choosing the middle points of these bars as the 'mastery' points to be evaluated, will result in particular biases, that might or might not disturb results further on in the model calculations. It has proven to be very difficult to say good-bye to the beta density as the best model. Only in 2004 the light was seen, and an exact method for the analytical likelihood has been developed now, in the program (see the paragraph on the java code below) called the 'exact' method. It is based on the direct use of the binomial probability of a score equal to the given number correct on the (preliminary) test, given a chosen mastery value. The availability of the exact method obviates the need to explain to users of the SPA model why the likelihood should be based on this beta density. Another advantage of the exact method is that it does not generate the expectation that every likelihood will be based on a beta density. In particular, in further modules we will meet projected likelihoods that will have no resemblance at all to beta densities.

A very recent improvement is the development of a simulation technique that is very much faster than the one used from the beginning in 1994 until April 2005. The fast simulation method will ease work on evaluating strategic positions that demand very much computer time to simulate.

### Java code

cutting up the mastery dimension

Basically the mastery dimension has to be cut up in a grid of equal intervals called bars in order to be able to construct functions on it. The midpoint of every bar as plotted [the number of the bar as treated in the program, divided by the grid number] determines the value of the mastery concerned. The method to plot the results implies that at the extreme ends the bars have half the width of the others. The applet offers the opportunity to choose the number of bars.

The method of rendering the data is that of the bar diagram. The line diagram would have been a better choice to emphasize the fact that only a few points on the mastery dimension determine the likelihood. The bar diagram, though, is the favorite rendering, because it depicts the area that in fact will be used to draw randomly from the likelihood (see the prediction in module 3).

Vector bi contains the binomial; vector e contains values of the gamma function that have already been evaluated in method lnG.

The value of 'bar' is passed to the method that carries out the simulation; it is used to renew the seed for the simulator, just in case the cycling through the bars proves faster than the time update that normally is used by the pseudo-random number generator in that method.

kernel of the code

Figure 1 shows the kernel of the Java code to determine the likelihood of a given mastery m. The other parameters are n for the number of items in the (preliminary) test and c for the given number correct on that test. Together n and c summarize the information available on the mastery of this student.

The simulation of a likelihood can be done in a straightforward way, simulating a binomial for every point to be discerned on the mastery dimension. This method is offered as option = 210 in the advanced applet, as it is extremely time consuming.

analysis

The analysis consists of the direct evaluation of a binomial probability. The method to evaluate the binomial coefficient is given in figure 2. It uses a method to get the natural logarithm of the gamma function that has been inspired by the method developed by Press et al. (1989), function gammln.
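The direct evaluation can be sketched as follows. This is not the applet's code: instead of a log-gamma routine it sums logarithms term by term to obtain the binomial coefficient, which is adequate for the test lengths considered here. All names are hypothetical.

```java
// Sketch only: a stand-in for the 'exact' analysis, with a log-sum replacing lnG.
final class ExactLikelihood {
    /** Natural logarithm of the binomial coefficient C(n, c). */
    static double lnBinomialCoefficient(int n, int c) {
        double ln = 0.0;
        for (int i = 1; i <= c; ++i) {
            ln += Math.log((double) (n - c + i) / i);
        }
        return ln;
    }

    /** Binomial probability of exactly c correct out of n items at mastery m. */
    static double likelihood(int n, int c, double m) {
        return Math.exp(lnBinomialCoefficient(n, c))
                * Math.pow(m, c) * Math.pow(1.0 - m, n - c);
    }

    public static void main(String[] args) {
        int n = 10, c = 7, grid = 10;
        for (int bar = 1; bar <= grid - 1; ++bar) {
            double m = (double) bar / grid;
            System.out.printf("%.1f  %.6f%n", m, likelihood(n, c, m));
        }
    }
}
```

Working with logarithms keeps the coefficient from overflowing for longer tests, which is presumably also why the original goes through the gamma function.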

beta density

Option 205 replaces the standard analysis with the beta analysis, directly evaluating the likelihood of the given mastery, using the formula for the beta density given already above. Again, the lnG method evaluates the natural logarithm of the gamma function.
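For whole-number scores the beta density and the binomial probability differ only by a constant factor, which may help when checking option 205 against the standard analysis. A short derivation, using only a = c + 1 and b = n - c + 1 (the notation β(m; a, b) for the density at mastery m is introduced here for convenience):

$$
\beta(m;\,a,b) \;=\; \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\,m^{a-1}(1-m)^{b-1}
\;=\; \frac{(n+1)!}{c!\,(n-c)!}\,m^{c}(1-m)^{n-c}
\;=\; (n+1)\binom{n}{c}\,m^{c}(1-m)^{n-c}.
$$

At any single mastery value the beta density is therefore n + 1 times the binomial probability of c correct out of n, so once both are scaled to the same height they should coincide.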

fast simulation

The fast simulation of the complete likelihood uses the vector containing all mastery values of the grid. In every run a pseudo-random value is generated for every item, and its value is checked against each mastery value in turn; if it is bigger, then the temporary vector b registers an item score for that particular mastery.
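A sketch of how such a shared-draw simulation might look. It assumes that an item counts as correct for a given mastery when that mastery exceeds the pseudo-random draw, and that a run scores a hit for a mastery when its item count equals c; the seed handling and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch only: one pseudo-random draw per item is shared by all mastery values.
final class FastSimulation {
    static double[] simulate(int n, int c, double[] mastery, int runs, long seed) {
        Random random = new Random(seed);
        int[] hits = new int[mastery.length];
        int[] b = new int[mastery.length]; // temporary item scores per mastery
        for (int run = 0; run < runs; ++run) {
            Arrays.fill(b, 0);
            for (int item = 0; item < n; ++item) {
                double u = random.nextDouble();
                for (int k = 0; k < mastery.length; ++k) {
                    if (mastery[k] > u) {
                        ++b[k]; // assumption: correct when mastery exceeds the draw
                    }
                }
            }
            for (int k = 0; k < mastery.length; ++k) {
                if (b[k] == c) {
                    ++hits[k]; // this run reproduces the observed number correct
                }
            }
        }
        double[] likelihood = new double[mastery.length];
        for (int k = 0; k < mastery.length; ++k) {
            likelihood[k] = (double) hits[k] / runs;
        }
        return likelihood;
    }

    public static void main(String[] args) {
        double[] mastery = {0.3, 0.5, 0.7};
        double[] sim = simulate(10, 7, mastery, 100000, 42L);
        for (int k = 0; k < mastery.length; ++k) {
            System.out.printf("%.1f  %.4f%n", mastery[k], sim[k]);
        }
    }
}
```

The saving comes from generating n draws per run for the whole grid, instead of n draws per run for every bar separately.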

full simulation

The full simulation option 210 is seen to do just what its name implies: for every bar in the grid minus the last one it simulates a binomial distribution. The only frequency of that distribution that is used in the simulation of the likelihood is the number of scores equal to the number correct c.

guessing

Guessing is handled in a straightforward way: before any other action is undertaken, the chosen level of mastery m is transformed

m = m + ( 1.0 - m ) * r.

Under option 207, where the guessing probability increases with the level of mastery, the transformation is

m = m + ( 1.0 - m ) * ( r + m * r ).

Yet, to be honest, there is one action that is undertaken first: if the complexity of questions (treated fully in chapter 4 on learning) is two or larger, mastery is first transformed to mastery at complexity level one, taking the root:

m = Math.pow ( m, 1.0 / complexity ).

In both dedicated likelihood applets complexity has been fixed at one. Would it be otherwise, unexpected things would be seen to happen.
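The order of the two transformations can be summarized in a small sketch. The method name and the boolean flag standing in for option 207 are hypothetical; the formulas are the ones given above.

```java
// Sketch only: complexity root first, guessing transformation second.
final class GuessingTransform {
    static double effectiveMastery(double m, double r, int complexity, boolean increasingGuessing) {
        if (complexity >= 2) {
            m = Math.pow(m, 1.0 / complexity); // back to mastery at complexity level one
        }
        if (increasingGuessing) {
            return m + (1.0 - m) * (r + m * r); // option 207
        }
        return m + (1.0 - m) * r; // standard guessing
    }

    public static void main(String[] args) {
        System.out.println(effectiveMastery(0.5, 0.25, 1, false)); // 0.625
        System.out.println(effectiveMastery(0.5, 0.25, 1, true));  // 0.6875
    }
}
```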

conditional model on guessing (option 209)

As explained earlier, guessing may be conceptualized as happening at the test level instead of at the item level. The ensuing model is rather complex. Yet it allows construction of the likelihood in a fairly simple way, making use of the fact that, given the test score (which should not be greater than the given number correct on the (preliminary) test), there is only one number of the remaining items guessed correctly that will produce a 'hit,' and thus contribute to the likelihood to be evaluated. Therefore an auxiliary array g is used to store the binomial probability for every such possible number of items guessed correctly. This, however, applies to the analysis only. The simulation is again more involved, because complete binomial distributions have to be simulated first, before probabilities of hits can be gathered and stored in vector g. What is worse, the only method available is full simulation: the fast simulation technique does not apply here. The Java code is as follows.

        if ( pl[ OPTION ] == 209 && b[ DOLIKELIHOOD ] ) {
            // prepare binomial probabilities on 'conditional' guessing in vector g
            if ( procedure == ANALPROCEDURE ) {
                if ( r != 0.0 && r != 1.0 ) {
                    for ( int gc = c; gc >= 0; --gc ) {
                        // number known: c - gc; number unknown: n - known = n - c + gc
                        // (the same number of trials as in the simulation branch below)
                        g[ gc ] = binomialCoefficient( n - c + gc, (double) gc, e )
                                * Math.pow( r, gc ) * Math.pow( 1.0 - r, n - c );
                    }
                }
            }
            else if ( procedure == SIMPROCEDURE ) {
                double [] sim = new double [ (int) n + 1 ];
                for ( int gc = c; gc >= 0; --gc ) {
                    // simulate gc + n - c guessed items, keep the share with gc correct
                    getSimulatedBinomial( gc + n - c, pl, r, sim, count);
                    g[ gc ] = sim[ gc ] / obs;
                }
            }
        }
        for ( int bar = 1; bar <= grid - 1; ++bar ) {
            // [guessing handled here]
            if ( pl[ OPTION ] != 209 || !b[ DOLIKELIHOOD ] ) {
                // [normal procedure here, code has been presented above]
            }
            else if ( pl[ OPTION ] == 209 && b[ DOLIKELIHOOD ] ) { // in likelihood module only
                if ( procedure == ANALPROCEDURE ) {
                    if ( r != 0.0 && r != 1.0 ) {
                        Binomial( n, m, bi);
                        // i items known, so c - i must be guessed correctly for a hit
                        for ( int i = 0; i <= c; ++i ) {
                            likelihood[ bar ] += bi[ i ] * g [ c - i ];
                        }
                    }
                    else if ( r == 0.0 ) {
                        // no guessing: plain binomial probability of c correct
                        likelihood[ bar ] = binomialCoefficient( (int) n, c, e )
                                * Math.pow ( m, c ) * Math.pow ( 1 - m, n - c );
                    }
                }
                else if ( procedure == SIMPROCEDURE ) {
                    getSimulatedBinomial( n, pl, m, bi, count); // Binomial
                    for ( int i = 0; i <= c; ++i ) {
                        likelihood[ bar ] += bi[ i ] * g[ c - i ] / obs;
                    }
                }
            }
        }
    } // closes the enclosing block (not shown)
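For readers who want to run the analytical branch on its own, here is a self-contained sketch. It assumes that with i items known, the remaining n - i items are guessed at with probability r each, and that a hit requires exactly c - i of them to be guessed correctly. The class and method names are hypothetical and the helper replaces the applet's Binomial and binomialCoefficient methods.

```java
// Sketch only: analytical likelihood under 'conditional' guessing at the test level.
final class ConditionalGuessing {
    /** Binomial probability of k successes in n trials with success probability p. */
    static double binomial(int n, int k, double p) {
        double coefficient = 1.0;
        for (int i = 1; i <= k; ++i) {
            coefficient *= (double) (n - k + i) / i;
        }
        return coefficient * Math.pow(p, k) * Math.pow(1.0 - p, n - k);
    }

    /** Likelihood of mastery m given c correct out of n and guessing probability r. */
    static double likelihood(int n, int c, double m, double r) {
        double sum = 0.0;
        for (int known = 0; known <= c; ++known) {
            int guessed = c - known; // the one number of lucky guesses giving a hit
            int unknown = n - known; // items left to guess at
            sum += binomial(n, known, m) * binomial(unknown, guessed, r);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(likelihood(10, 7, 0.5, 0.25));
        System.out.println(binomial(10, 7, 0.5 + 0.5 * 0.25)); // item-level guessing
    }
}
```

Under these assumptions the two printed values should agree up to rounding, since knowing an item with probability m and otherwise guessing it with probability r makes each item correct with probability m + (1 - m) r. That offers a quick check on the conditional analysis.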

### Testing the applet 2

A likelihood, by definition, has height = 1. Anything that is called a likelihood, and does not have height = 1, indicates that something is very wrong. Remember that a likelihood function is not regarded as a statistical density; the area under the curve will surely not be equal to one.

The three analytical methods (including that of option 209) should give identical results. Means and standard deviations reported can be used for a fast check on identities. Option 206 evaluates directly the formulas for the mean and standard deviation of the beta density. Otherwise means and variances are evaluated using the distributions themselves. Choosing the grid to have very few points might produce differences; I have not yet looked into this.
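Such a check can be sketched as follows: scale a likelihood on the grid to height 1, compute its mean and standard deviation from the bars, and compare them with the standard formulas for the beta density with a = c + 1 and b = n - c + 1. The names and the grid convention (bar number divided by grid number) are assumptions made for this sketch.

```java
// Sketch only: compare grid-based moments with the beta formulas.
final class LikelihoodCheck {
    /** Rescale so that the highest bar has height 1. */
    static double[] scaleToHeightOne(double[] values) {
        double max = 0.0;
        for (double v : values) max = Math.max(max, v);
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; ++i) scaled[i] = values[i] / max;
        return scaled;
    }

    static double betaMean(int n, int c) {
        double a = c + 1.0, b = n - c + 1.0;
        return a / (a + b);
    }

    static double betaSd(int n, int c) {
        double a = c + 1.0, b = n - c + 1.0;
        return Math.sqrt(a * b / ((a + b) * (a + b) * (a + b + 1.0)));
    }

    static double gridMean(double[] mastery, double[] weight) {
        double sum = 0.0, total = 0.0;
        for (int i = 0; i < mastery.length; ++i) {
            sum += mastery[i] * weight[i];
            total += weight[i];
        }
        return sum / total;
    }

    static double gridSd(double[] mastery, double[] weight) {
        double mean = gridMean(mastery, weight);
        double sum = 0.0, total = 0.0;
        for (int i = 0; i < mastery.length; ++i) {
            sum += weight[i] * (mastery[i] - mean) * (mastery[i] - mean);
            total += weight[i];
        }
        return Math.sqrt(sum / total);
    }

    /** Likelihood kernel m^c (1 - m)^(n - c) on bars 1 .. grid - 1, scaled to height 1. */
    static double[][] gridLikelihood(int n, int c, int grid) {
        double[] mastery = new double[grid - 1];
        double[] likelihood = new double[grid - 1];
        for (int bar = 1; bar <= grid - 1; ++bar) {
            mastery[bar - 1] = (double) bar / grid;
            likelihood[bar - 1] = Math.pow(mastery[bar - 1], c) * Math.pow(1.0 - mastery[bar - 1], n - c);
        }
        return new double[][] {mastery, scaleToHeightOne(likelihood)};
    }

    public static void main(String[] args) {
        double[][] g = gridLikelihood(10, 7, 100);
        System.out.printf("grid: mean %.4f sd %.4f%n", gridMean(g[0], g[1]), gridSd(g[0], g[1]));
        System.out.printf("beta: mean %.4f sd %.4f%n", betaMean(10, 7), betaSd(10, 7));
    }
}
```

Rerunning the sketch with a very small grid number is one way to look into the differences that a coarse grid might produce.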

Choosing the number of runs sufficiently large should produce simulated likelihoods that almost exactly eclipse the analytical plot. The differences between simulation and analysis will look somewhat exaggerated because one particular value in the simulated distribution is used to determine the height of the plot; this effect is clearly present where the number of runs is low.

Option 204 prints the function values themselves, enabling comparison with whatever quantitative data you otherwise have about a particular function.

### Literature

J. S. Cramer (1986). Econometric applications of maximum likelihood methods. Cambridge: Cambridge University Press.

Edwards, A. W. F. (1972). Likelihood: an account of the statistical concept of likelihood and its application to scientific inference. Cambridge: Cambridge University Press.

Ian Hacking (1965). Logic of statistical inference. Cambridge University Press.

• Long run frequencies - The chance set-up - Support - The long run - The law of likelihood - Statistical tests - Theories of testing - Random sampling - The fiducial argument - Estimation - Point estimation - Bayes' theory - The subjective theory
• citations in CiteSeer

E. T. Jaynes (1976). Confidence Intervals vs Bayesian Intervals + Discussion: Interfaces Between Statistics and Content by Margaret W. Maxfield; Jaynes' Reply to Margaret Maxfield; Comments by Oscar Kempthorne; Jaynes' Reply to Kempthorne's Comments. In William L. Harper and C. A. Hooker (Eds) (1976). Foundations of probability theory, statistical inference, and statistical theories of science. Proceedings of an international research colloquium held at the University of Western Ontario, London, Canada, 10-13 May 1973. Volume II: Foundations and philosophy of statistical inference (p. 175-257).

Lord, Frederic M., & Novick, Melvin R. (1968). Statistical theories of mental test scores. London: Addison-Wesley. (Chapter 23)

Novick, Melvin R., and Paul H. Jackson (1974). Statistical methods for educational and psychological research. McGraw-Hill.

Stegmüller, Wolfgang (1973). Probleme und resultate der Wissenschaftstheorie und Analytischen Philosophie. IV Personelle und statistische Wahrscheinlichkeit. Berlin: Springer.

Ben Wilbrink (1978). Studiestrategieën. Examenregeling deel A. Amsterdam: COWO (docentenkursusboek 9). pdf Revised version 2004, already available for chapters 1 through 4. [370k html + gif files]

Ben Wilbrink (1995). A consumer theory of assessment in higher education; modelling student choice in test preparation. 6th European Conference for Research on Learning and Instruction, Nijmegen. Paper: author. html

Ben Wilbrink (1998). Inzicht doorzichtig toetsen. In Theo H. Joostens and Gerard W. H. Heijnen (Eds.). Beoordelen, toetsen en studeergedrag. Groningen: Rijksuniversiteit, GION - Afdeling COWOG Centrum voor Onderzoek en Ontwikkeling van Hoger Onderwijs, 13-29. html

Christopher Jonathan Zarowski and Rodney Lynn Kirlin (preprint). A probability model for multiple-choice tests based on the binomial distribution.

### more literature

David M. Williamson, Russell G. Almond, Robert J. Mislevy, Roy Levy: An application of Bayesian networks in automated scoring of computerized simulation tasks. In David M. Williamson, Robert J. Mislevy and Isaac J. Bejar (Eds) (2006). Automated scoring of complex tasks in computer-based testing (pp. 201-257). Erlbaum.

• The topic here is updating a student model, giving information on how the student is solving complex problems. This kind of problem is similar to that of the student updating her own model of how much mastery has been achieved, given the latest problems that have been tackled.
• p. 204: "Bayes nets support Bayesian modeling of their parameters. This means that models can be defined with parameters supplied initially by experts, and later refined based on experiential data (Williamson, Almond, and Mislevy, 2000). Contrast this with neural network models that cannot incorporate the expert opinion, and formula score and rule-based systems which do not allow easy refinement on the basis of data"
• for Kevin Murphy's list of Bayes net software packages see here

### statistical literature

I probably will never have the resources to study the likelihood literature. Indeed, the Edwards book already contains so much material that does not directly bear on the needs of the SPA model. The following titles are here to satisfy my curiosity only.

Russell G. Almond (1995). Graphical Belief Modeling. CRC Press. info

• "As one of the first volumes to apply the Dempster-Shafer belief functions to a practical model, a substantial portion of the book is devoted to a single example--calculating the reliability of a complex system."
• I have not seen this one, not in the Leiden library.

Owen, Art B. (2001). Empirical likelihood. Chapman & Hall / CRC Press.

• [see print.google.com. Synopsis: One of the first books published on the subject, Empirical Likelihood offers an in-depth treatment of this method for constructing confidence regions and testing hypotheses. The author applies the method to a range of problems, from those as simple as setting a confidence region for a univariate mean under IID sampling, to problems defined through smooth functions of means, regression models, generalized linear models, estimating equations, or kernel smooths, and to sampling with non-identically distributed data. Numerous examples from a variety of disciplines and detailed descriptions of algorithms-also posted on a companion Web site-illustrate the methods in Practice. http://www-stat.stanford.edu/~owen/empirical/. I mention the title here for those interested in the likelihood technique. The SPA model does not use any of this more involved stuff.]

Richard Royall. Statistical evidence: A likelihood paradigm. Chapman & Hall.

• Synopsis: Interpreting statistical data as evidence, Statistical Evidence: A Likelihood Paradigm focuses on the law of likelihood, fundamental to solving many of the problems associated with interpreting data in this way. Statistics has long neglected this principle, resulting in a seriously defective methodology. This book redresses the balance, explaining why science has clung to a defective methodology despite its well-known defects. After examining the strengths and weaknesses of the work of Neyman and Pearson and the Fisher paradigm, the author proposes an alternative paradigm which provides, in the law of likelihood, the explicit concept of evidence missing from the other paradigms. At the same time, this new paradigm retains the elements of objective measurement and control of the frequency of misleading results, features which made the old paradigms so important to science. The likelihood paradigm leads to statistical methods that have a compelling rationale and an elegant simplicity, no longer forcing the reader to choose between frequentist and Bayesian statistics.

Azzalini, Adelchi. Statistical inference based on the likelihood. Chapman & Hall.

• Synopsis: The Likelihood plays a key role in both introducing general notions of statistical theory, and in developing specific methods. This book introduces likelihood-based statistical theory and related methods from a classical viewpoint, and demonstrates how the main body of currently used statistical techniques can be generated from a few key concepts, in particular the likelihood.
Focusing on those methods, which have both a solid theoretical background and practical relevance, the author gives formal justification of the methods used and provides numerical examples with real data.

Changbao Wu (submitted to World Scientific March 17, 2002). Empirical likelihood method for finite populations.

Changbao Wu (2004). Weighted empirical likelihood inference. Statistics & Probability Letters, 66, 67-79.

Bing-Yi Jing, Junqing Yuan & Wang Zhou (). Empirical likelihood for non-degenerate U-statistics. ????

Paul H. Garthwaite, Joseph B. Kadane and Anthony O'Hagan (2005). Statistical methods for eliciting probability distributions. Journal of the American Statistical Association, 100, 680-701.

• Abstract see http://www.shef.ac.uk/~st1ao/abs/elichand.html, you can try whether the full article is available for free yet at http://www.sheffield.ac.uk/beep/publications.html.

### links

QuickCalcs: Confidence interval of a proportion or count

Is your answer correct? [Jaynes, E. T., 1976, "Confidence Intervals vs. Bayesian Intervals", in Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science, W. L. Harper and C. A. Hooker (eds.), D. Reidel, Dordrecht, pg. 175. [8Mb ! pdf download of this chapter]]

QuickCalcs: Confidence interval of a SD

http://home.ubalt.edu/ntsbarsh/Business-stat/otherapplets/ConfIntPro.htm [dead link? 2-2008] Proportion Estimation with Confidence [This site is a part of the JavaScript E-labs learning objects for decision making.]

Susan Vineberg (1998). Coherence and Epistemic Rationality. html

• ABSTRACT: This paper addresses the question of whether probabilistic coherence is a requirement of rationality. The concept of probabilistic coherence is examined and compared with the familiar notion of consistency for simple beliefs. Several reasons are given for thinking rationality does not require coherence. Finally, it is argued that incoherence does not necessarily involve fallacious reasoning.

Mail your opinion, suggestions, critique, experience on/with the SPA

July 20, 2015 \ contact ben at at at benwilbrink.nl http://www.benwilbrink.nl/projecten/spa_likelihood.htm