This is a transcript of the original article. Short URL: http://goo.gl/7ZgrN
The article contains a number of controversial statements. De Groot was perfectly aware that his statements would be controversial; deliberately using controversial statements was part of his debating style.
In due time I will annotate each of the transcribed paragraphs here, showing how certain ideas relate to later work by De Groot or by others. Some of the statements by De Groot evidently are not (quite) true; it is a mystery how he nevertheless came to make them. I will comment on those statements also.
De Groot's ideas on acceptability are important in themselves. They were never elaborated further by him, nor by others, with two exceptions. In a non-trivial way the tentamen model articulated by Van Naerssen (1970) may be regarded as an application of the idea of acceptability in a pragmatic—decision-theoretic—way. The dissertation by Job Cohen—later to become mayor of Amsterdam—on 'study rights' shows another way to make acceptability a reality in university life. This book therefore should be available in the public domain (it is not, as of this writing in 2013).
Robert F. van Naerssen (1970). Over optimaal studeren en tentamens combineren. Openbare les (public lecture).
M. J. Cohen (1981). Studierechten in het wetenschappelijk onderwijs. Tjeenk Willink. (reviewed by Tom Bos and Marc Groenhuijsen)
Adriaan de Groot died in 2006. He made important contributions to cognitive psychology as well as to methodology in psychology, and almost single-handedly introduced multiple-choice testing in education in the Netherlands.
[360]
1. The social science aspect of psychometrics 360
2. Acceptability, a fourth dimension 362
3. Aspects of acceptability 365
4. Procedures in acceptability analysis 372
note 1. A preliminary note on acceptability—the subject has haunted me over a number of years—was briefly discussed with Lee J. Cronbach in the summer of 1966. From 1967 on, acceptability was a theme in a number of minor publications, memos and discussions in the Netherlands. It was also one of the research themes to be studied, as part of Lewis R. Goldberg's project MH 12972, during my 1968 stay as a visiting scholar at the Oregon Research Institute, director Paul Hoffman. Discussions with Dr. Goldberg and some explorations carried out by, and with, Dr. Frank Payne were instrumental in ordering my thinking. Thanks are due to all persons concerned—along with my apology for the fact that this first product in English of what was then called project S(abbatical) Y(ear) 2 still is a theoretical essay rather than a report on empirical research.
1.1. Inevitably, psychometrics is a social science. For it deals with techniques of measurement, prediction, and decision making to be applied to human beings—who may have pertinent opinions, attitudes, rights and commitments of their own. In particular, the application of psychometric techniques presupposes certain relationships between testers, testees, and society, which are directly relevant and sufficiently interesting to be studied from a social science point of view. Curiously, this self-evident fact is hardly reflected in any of the current models of thought. Neither measurement nor test theories, nor for that matter the individualizing tradition of clinical psychology, provide the conceptualizations needed for an implementation of this social science aspect. In the
[361]
present article, some concepts are introduced, relationships discussed, and consequences indicated to help fill the gap.
1.2. The ideology underlying psychological measurement theories and psychometric test theories is that of a natural science. Psychologists have introduced a number of specific techniques and conceptualizations, it is true, but the main methodological postulates have been borrowed, explicitly or implicitly, from such sciences as physics. Some of these 'postulates' are: (1) prediction is the main, and final, goal of the scientific enterprise; (2) prediction hinges on measurement, the stronger the better, up to that ideal of quantification, the ratio scale. In the classical conception to which many behavioral scientists still adhere, the real thing scientific begins with 'strong measurement' and ends with 'strong prediction.'
1.3. Advocates of the application of decision theory to testing, in particular Cronbach and Gleser (1957) 1965, have shown that the above approach is incomplete. If instruments are designed, constructed, validated, and used primarily as tools for measurement and prediction, the information produced in the process is insufficient, and in part irrelevant for purposes of decision making. Decision optimalization requires that outcome values are determined, probabilities computed, and costs taken into account. The new model adds an important economic viewpoint to the measurement-prediction paradigm; instruments and procedures of using them are now judged for their profitability. 2)
note 2. This term is not used in Cronbach and Gleser's classic on the subject. It is introduced here, and preferred over 'utility,' 'benefits' and other near-equivalents, in order to emphasize the distinction between this point of view and that of acceptability.
1.4. However important the decision-theoretical profitability approach may be, a puzzling question remains: profitability to whom? In the use of tests and related decision procedures the interests of many different groups are simultaneously at stake: sponsors, testers (psychologists), testees, and the general public (the community, society at large). If it were possible to exactly determine outcome values for all individual persons concerned and to substitute those values into a formula with adequate, objectively determined weights, the profitability approach might become the final word. Obviously, however, this is impossible. While still instructive as a way of thinking, and attractive as a schematic
[362]
procedure for securing 'the greatest possible happiness to the greatest possible number'—the underlying philosophical principle—in most practical situations the profitability approach just does not work.
1.5. The present situation with regard to profitability is highly analogous to that of the validity concept before Cronbach and Meehl introduced 'construct validity' (1955). At that time there was (and at present there is) nothing fundamentally wrong with the idea of an 'ultimate' or 'essential' criterion variable by means of which the validity of a predictor can be defined—except that such a criterion variable is hardly ever available, or producible. The criterion-oriented conception of validity has remained beautifully consistent, on paper; but Cronbach and Meehl have shown that adherence to it often leads our thinking to a dead end. The same appears now to be true of the profitability approach. Another way of thinking is badly needed.
1.6. The empirical profitability approach breaks down on the problem of how to determine and how to combine the pertinent value systems of various individuals and groups. In a practical testing or other decision situation value systems are likely to boil down to interests and rights. What type of model, then, is needed for determining and combining the interests and rights of all persons concerned in a certain situation?
Curiously, a very old discipline exists which has specialized over the ages in doing just that, namely the discipline of law. Thus far, this discipline has not very often inspired psychologists 3) in modeling their thought. For the development of psychometrics as a social science, however, a certain amount of borrowing of concepts and ways of thinking from the discipline of law appears to be unavoidable, and fruitful.
note 3. With the possible exception of Otto Selz (cf. De Groot, 1966, p. 54).
2.1. In this article, the notion of acceptability, or equitability, refers to decision procedures in which psychometric devices or tools are used. Acceptability can best be explained as a fourth 'dimension,' or point of view, in the evaluation of those procedures, in addition to reliability, validity, and profitability.
[363]
The main underlying idea in this presentation is that, in the near future, evaluative handbooks of tests (like those of Buros) ought to devote some attention to the acceptability point of view as well as to the other three. Given the present trend of increasing emphasis on the rights of subjects, applicants, students, and employees, the question of whether the use of a psychometric device in a certain situation is or is not acceptable might well become one of the main determinants of its actual usefulness. Psychologists can no longer back out of the problem area of acceptability. Regardless of their purely scientific uses, psychometric procedures of various kinds have always been developed for and sold on their usefulness as tools for judging individual persons, and for corresponding decisions of social consequence to those individuals. If this usefulness is systematically restricted, or destroyed, by social factors that can be studied, the influence of those factors must be studied, by the psychologists and psychometrists themselves. 4)
The reader is invited to relate the following statements to psychometric devices in the broadest sense: tests, questionnaires, examinations, judgmental variables, sum scores of other empirical data, etc.
2.2. The crucial reliability question with regard to an instrument or empirical variable is: To what extent can it be shown to measure at all?
Empirical answers to this question can be given if, minimally, some small sample of scores on the variable, itself, is available. In the case of stability assessment, scores of two administrations are needed, with a time interval in between. In the case of equivalence, one administration may suffice, provided the variable is a sum score with built-in (quasi-) replications.
Reliability outcomes do not in any way determine the meaning of the variable (2.3), let alone its usefulness in specific situations (2.4 and 2.5) — except that zero reliability generally precludes both meaning and usefulness.
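To make the two empirical routes of 2.2 concrete, here is a minimal sketch, not from the article, assuming the standard formulas: the Pearson correlation between two administrations for stability, and Cronbach's alpha for equivalence of a sum score with built-in (quasi-)replications. All data names are invented.

```python
# A minimal sketch (not from the article) of the two empirical routes of 2.2.
import numpy as np

def stability(scores_t1, scores_t2):
    """Stability: Pearson r between two administrations, time interval in between."""
    return np.corrcoef(scores_t1, scores_t2)[0, 1]

def equivalence_alpha(item_scores):
    """Equivalence: Cronbach's alpha from a single administration.
    Rows are testees, columns are the built-in (quasi-)replications (items)."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]
    sum_item_var = item_scores.var(axis=0, ddof=1).sum()
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - sum_item_var / total_var)
```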
note 4. In spite of the amount of serious attention specific ethical problems have drawn recently—intelligence testing (e.g., Brim, 1965), personality measurement (e.g., Messick, 1965), invasion of privacy (e.g., Ruebhausen and Brim, 1966), ethics in the research setting (e.g., Schultz, 1969), etc.—psychologists have often tended to treat the general issue of acceptability as a political borderline problem outside the realm of a scientist's concern. Apart from some important exceptions, notably Lovell (1967), the prevailing reactions thus far to the 'protests against testing' (Whyte, 1956; Gross, 1962; Hoffman, 1962, a.o.) have been reactions of defense rather than rethinking of the basic principles of psychometric practice.
2.3. The crucial validity question with regard to an instrument or empirical variable is — in a somewhat unorthodox formulation: To what extent can it be shown to have a specific meaning? Empirical answers to this question can be given if, minimally, some reference data along with the scores on the variable itself are available. In the case of criterion validity 5), a sample of corresponding scores on a criterion variable — which supposedly has meaning — is needed to determine the (predictive) meaning of the variable in question. In the case of construct validity 5), an empirically supported, explicit theoretical argument is needed to justify the (measurement) meaning read into the variable in question.
Validity outcomes do not determine the usefulness of the variable in specific situations (2.4 and 2.5) — except that zero validity, i.e., no support for the supposed specific meaning, generally precludes corresponding specific usefulness.
2.4. The crucial profitability question with regard to the use of an instrument or empirical variable to a specific purpose is: How far can its optimalized application be shown to be profitable, in particular to the sponsor who pays for the program?
Empirical answers to this question can only be given if, apart from the specification of the purpose and the situational constraints, the available data can be interpreted in economic terms, i.e., in terms of, empirically estimated, costs and possible gains. In the case of institutional decisions, the value system of the (institutional) sponsor determines for all testees the way in which the benefits from decisions are evaluated. In the case of individual decisions, where the testee is the (individual) sponsor, his personal value system is decisive (cf. Cronbach and Gleser, 1965, pp. 15-17).
Profitability outcomes do not in any way determine the acceptability of the application in question — except that zero profitability to the sponsor, generally, precludes acceptability.
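The profitability logic of 2.4 can be illustrated with a small computation in the spirit of Cronbach and Gleser; all payoff figures below are invented, and the sponsor's value system enters solely through the payoff table.

```python
# A hedged illustration (numbers invented) of the profitability logic of 2.4:
# the sponsor's value system assigns a payoff to each decision-outcome pair,
# and the procedure is judged by its expected payoff per testee.
def expected_payoff(p_outcome_given_decision, payoff, p_decision):
    """Sum payoff[d][o] weighted by P(d) * P(o | d)."""
    return sum(
        p_decision[d] * p_outcome_given_decision[d][o] * payoff[d][o]
        for d in payoff
        for o in payoff[d]
    )

# Invented example: accept/reject decisions, success/failure outcomes.
payoff = {"accept": {"success": 10.0, "failure": -5.0},
          "reject": {"success": -2.0, "failure": 0.0}}
p_out = {"accept": {"success": 0.7, "failure": 0.3},
         "reject": {"success": 0.2, "failure": 0.8}}
p_dec = {"accept": 0.6, "reject": 0.4}
print(expected_payoff(p_out, payoff, p_dec))  # 3.14 units per testee
```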
2.5 The crucial acceptability question with regard to the use of an instrument or empirical variable to a specific purpose is: How far will its use, given the details of the procedure by which testees are assigned to treatments—underlying assumptions and empirical data included—be considered equitable to all persons concerned, and to the testees in particular?
note 5. Cf. De Groot (1969), pp. 248-261, in particular the footnote on p. 261.
[365]
Empirical answers to this question must be based on data beyond profitability. The only form in which the answers can be fruitfully systematized is that of a testing—or examination—contract.
2.6. The present introduction of acceptability as a 'fourth dimension' is not meant to imply that it should be attended to after the other three have been satisfactorily taken care of. Quite the reverse. One of the future benefits from the study of acceptability should be precisely that it may prevent undue attention to socially still-born psychometric-statistical refinements in the realm of reliability, validity, and profitability analysis.
3.1. Testing or examination contracts require adequate constructs by means of which demands and compromises can be expressed. In this section, three possibly important constructs will be discussed briefly: objectivity, transparency, and justifiability of differential decisions.
3.2. Apart from its function as a scientific principle in its own right—connected with the requirement of repeatability—objectivity in psychological measurement 6) has often been advocated because decisions based on objective procedures are free from the subjective bias of judgmental decisions. According to this argument, objectivity of procedures is a prerequisite to equitable decision making.
From the point of view of acceptability analysis, however, the objectivity condition is neither sufficient nor generally necessary. Its insufficiency in many decision situations will become clear in the discussion of transparency and justifiability (3.4 thru 3.6). The question of whether or not objectivity is necessary—or possibly, the degree to which it is required—clearly depends on the degree to which ensuing decisions are of social consequence to the testees. This, in turn, depends on the type of differential treatment envisaged and on the nature of the rights and interests at stake.
note 6. A testing and decision procedure is 'completely objective,' by definition, if the processing of all the pertinent behavioral data from a subject up to the final decision is carried out or could be taken over by a machine program (cf. De Groot, 1969, p. 171).
[366]
3.3. For example, in educational situations where study 'rights' are at stake—e.g., in state universities or colleges—definitive rejection of a student is a highly consequential treatment socially. Generally, such a decision will be acceptable only on the basis of the student's failure to meet pre-set objective standards—such as below-critical achievement test scores. On the other hand, grading a successful student's total achievements in terms of a final qualification is of much less social consequence. In such a situation a judgmental element is not objectionable—and unavoidable, for that matter. The rejection of the thesis of a student who has been accepted, and remains, in a doctoral program is another interesting case in point. Again, a judgmental element is unavoidable—but it does no harm from the acceptability point of view if the student can rely on the cooperation of the judges, themselves, in his endeavor to produce a better next draft. The social consequence consists of some loss of time, it is true, but this loss is generally compensated for by the learning effects of the prolonged training. 7)
From the standpoint of acceptability, then, the objectivity requirement cannot and must not be treated as an absolute principle. Contract requirements regarding objectivity will vary according to the treatments that are envisaged and to the social consequences thereof in the given social situation.
note 7. It should be noted that the consequences of educational decisions, and those of rejection and acceptance in particular, depend strongly on the educational system. In the Dutch system of uniformity—state-controlled exams, especially in secondary education, leading to diplomas, some of which provide study rights at any university in a whole set of faculties—rejection decisions are likely to be highly consequential. Within this system, objectification of standards is an important acceptability requirement. In the Anglo-Saxon system of diversity, on the other hand, non-admission to one (prestige) college or Ph. D. program in no way precludes admission to the next. Within this system, objectification of admission standards is a much less critical acceptability issue.
3.4 'Complete transparency' of a testing and decision situation obtains, by definition, if the subject has available to him all the information he needs for developing his personal, best possible test preparatory and test taking strategies.
As was the case with objectivity, the transparency condition is neither generally sufficient (see 3.6) nor generally necessary. Again, contract requirements depend on the social situation. Consider for example
[367]
an applicant for a job in some small private firm. In such a case it might be argued that he has no right whatever to be appointed, regardless of how qualified he might be. The firm has the right to determine its own selection policy and to keep it secret—even if it wanted to select (objectively) young people with red hair, green eyes and low extroversion [sic] scores. To a certain extent, the same argument applies to private institutions of education: since applicants have no official rights they cannot claim transparency of the admission policy.
We should realize, however, that in modern democracies such cases tend to become exceptional rather than the rule. Business and industry tend increasingly to fuse into big corporations which control large sections of the labor market. It is 'of social consequence' to the modern applicant if he is rejected by 'the corporation'; and, if his rejection is based on a testing program and a selection policy he cannot fathom, he is likely to feel hurt in his individual rights. Public pressure towards explicit and transparent standards of personnel selection cannot but increase.
In student selection, even for private institutions, the trend towards openness may become even stronger; understandably so, since public opinion has accepted the right of every youth to have 'the best available' education—whatever that expression may mean exactly. A fortiori, the acceptability point of view will demand transparency of those examinations—as in the Dutch university admission system—which decide on explicit rights to enter an academic career.
3.5. Rather than going any further into situational particulars it seems useful to spell out in some detail the implications of complete transparency:
(1) Non-objective procedures—requiring a judge, with an unpredictable personal program—are non-acceptable.
(2) Questionnaires—hinging on secret keys—are non-acceptable.
(3) If the program consists of aptitude and achievement tests, information about content areas, item types, the principles of scoring and decision making, etc. must be available in advance to prospective testees—preferably in printed form, as in a good manual with many item examples.
(4) A specific requirement is the transparency of the test scoring keys. First, the key must be explained in detail in the test instructions; second, the information about the key, as given in the manual and the instruction, must be simple enough for the testee to develop his own
[368]
optimal test taking strategy. (This condition virtually excludes most of the more refined methods of error correction and of complex item or subscore weighting.)
(5) Total score formulas, too, must be transparent, i.e., known in advance and simple enough to be translated into adequate personal strategies. First, the testee must be in a position during the testing session to distribute his time and effort accordingly. Second, advance knowledge about the weighting of various subtest scores is essential for developing an adequate test preparatory, or study, strategy; e.g., how best to compensate personal weak points by strong points. (This condition, again, restricts differential weighting to a minimum. Maximal transparency appears to be guaranteed when raw item scores and raw subtest scores are summed without weighting, i.e., when content areas are actually weighted by the (pre-published) numbers of pertinent items—supposedly of equal average difficulty; a minimal sketch of such a scheme follows this list. For the transparency of more complex procedures, see (7) and 4.2.)
(6) If a cutting score is used, the manner in which it has been or will be obtained must be made known in advance. The main transparency requirement is that prospective testees must have a clear idea of the level of achievement required for passing the test, since this is directly relevant to their test preparatory (or study) strategy. (The cutting score, itself, need not always be announced; e.g., a straightforward selection situation with a quota and a known selection ratio may be sufficiently transparent.)
(7) Complex scoring and selection procedures—cf. (5)—can be made transparent by previous coaching (or previous experience); e.g., certainty scoring (Van Naerssen, 1966; Sandbergen, 1968). The transparency requirement then boils down to the demand that such coaching be carried out and be accepted as sufficiently effective.
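As announced under (5), here is a minimal sketch of the maximally transparent scoring scheme: raw dichotomous item scores summed without weighting, so that content areas are weighted only by their pre-published item counts. All numbers are invented.

```python
# A minimal sketch, assuming the maximally transparent scheme of point (5):
# raw item scores are summed without weighting, so content areas are weighted
# only by their pre-published item counts. Item data are invented.
def total_score(item_scores):
    """Unweighted sum of dichotomous (0/1) item scores."""
    return sum(item_scores)

# Pre-published test plan: content area -> number of items (the only 'weights').
test_plan = {"algebra": 20, "geometry": 15, "arithmetic": 25}

# One testee's item scores, grouped by content area (invented).
scores = {"algebra": [1] * 14 + [0] * 6,
          "geometry": [1] * 10 + [0] * 5,
          "arithmetic": [1] * 20 + [0] * 5}
print(total_score(s for area in scores for s in scores[area]))  # 44 of 60
```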
3.6. The requirement of justifiability of differential decisions is fully met if, for any two testees who were assigned to different treatments, the argument to support such differential decisions is agreed upon as acceptable. For instance, if a cutting score X_T for treatment T is used, in a distribution of (integer) total scores X, the differential decision that subject A (score X_T) is assigned to T whereas subject B (score X_T - 1) is not, must be justifiable. If T means 'pass' and non-T 'fail,' the availability of a supporting argument and its endorsement by all persons concerned is particularly critical.
[369]
This is a complex requirement. First, the contents of the test series must be considered representative, equitable in relation to its aim, valid in its composition. Testees, in particular, must accept the content or predictive validity of the series—a condition which amounts to: face validity taken seriously and documented (somewhat like construct validity is faith validity taken seriously and documented). Second, the actual weights applied in scoring must be accepted, i.e., considered equitable as well as transparent (3.5). Third, it must be accepted that a minimal difference—one point—may be decisive.
The latter condition, in particular, is a highly important stipulation in any test or examination contract. It requires some further analysis.
3.7. The point is that psychometric arguments are never sufficient for a justification of differential decisions near the borderline. Regardless of how reliable and valid the total score may be, the probability that, for instance, the true score, or true utility, of subject A (score X_T) is actually lower than that of B (score X_T - 1) can never be negligible. Subject A may have had 'good luck,' or subject B 'bad luck'; the psychometric analysis cannot reliably exclude such possibilities. It must be concluded that the subjects (A and B, themselves) will have to accept explicitly that such things may happen and that they take that risk.
The question remains: Under which conditions will testees be willing to accept their risk of good and bad luck? This question leads to a number of puzzling problems, some of which will be briefly taken up in the following paragraphs.
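The size of the risk discussed in 3.7 can be put in numbers under the classical test model X = T + E with normal, independent errors; the sketch below (reliability and score sd invented) shows that even a highly reliable test easily produces a one-point observed difference between two testees whose true scores are equal.

```python
# A hedged numerical illustration (not from the article) of 3.7, under the
# classical test model X = T + E with normal errors. Even when two testees
# have EQUAL true scores, measurement error alone readily produces a one-point
# observed difference, so the borderline ordering of A and B proves little.
from math import erf, sqrt

def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

sd_x, r_xx = 10.0, 0.90            # invented score sd and reliability
sem = sd_x * sqrt(1 - r_xx)        # standard error of measurement
sd_diff = sem * sqrt(2)            # sd of the difference of two error terms

# P(A outscores B by at least 1 point) when their true scores are equal:
p = 1 - normal_cdf(1 / sd_diff)
print(f"SEM = {sem:.2f}; P(observed difference >= 1 | equal true scores) = {p:.2f}")
```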
3.8. First, there is a large and fundamental difference between predictive and measurement-based arguments and, correspondingly, between aptitude and achievement testing. If differential decisions near a borderline are to be supported by the predictive validity of the battery, such support—by a validity coefficient of, say, .50—is notoriously weak. Testees can hardly be expected to accept this type of justification if they have any rights at stake at all. In all educational situations, in particular where pass/fail decisions are to be made—e.g., in admission testing for state schools—the predictive justification, with its unavoidable proportion of false negatives, is unlikely to convince testees. Given the present trend (cf. 2.1), intelligence and aptitude tests can be expected to become non-acceptable in an increasing number of selection situations, including industrial selection.
[370]
Achievement tests, by contrast, have another leg to stand on—apart from their predictive power. In particular if a required level is defined (possibly in the vein of 'testing for mastery,' Bloom, 1968), supporting arguments of the following types are possible, and likely to be acceptable: 'This test, itself, is representative of what you ought to know before entering,' or '... of what you ought to have learned during the year,' or '... of what you must do well here.' In addition, achievement tests are generally felt to be more equitable than aptitude tests since the prospective testee can straightforwardly prepare for them. 8) Alternatively, their transparency could be said to be superior (cf. 3.4 and 3.5). From here on, only achievement test batteries will be discussed.
note 8 This applies to older rather than younger students, it is true. With young school children whose educational attainments may depend more on the quality of their schools and teachers than on their own capacities and study efforts, achievement tests are certain to discriminate against the educationally handicapped. That is, achievement tests remain equitable from a measurement point of view but they are biased in their predictive function (Drenth, 1967, p. 23): the 'two legs' are at variance with each other. This is the typical case for 'special programs' to be instituted by educational agencies. Such programs in turn raise difficult special questions of acceptability—including the possibility of 'reverse discrimination'—that cannot be discussed in the present context.
3.9. A second problem, in addition to that of face content validity (cf. 3.6), is that of face reliability. Since no reliability coefficient can ever be high enough to justify differential decisions near a borderline, the question remains: How much unreliability is the testee willing to accept? Or: How large a risk of bad luck is he willing to take? Or, possibly: How much extra effort is a student prepared to devote to a subject to make sure that his true score will be beyond the danger zone of unreliability? Face reliability, as represented in those questions, is likely to be a function of statistical reliability only to some extent. In an achievement test, face reliability would seem to be ensured if a reasonable number of items of good quality and adequate difficulty, requiring a reasonable amount of testing time, are reasonably well distributed over the pertinent study area. These conditions may as well be fulfilled with an r_xx = .60 as with one of .90. 9)
note 9 Generally, the reliability issue in achievement testing is rather obscure. Reliability depends on homogeneity, i.e. on item intercorrelations; but item intercorrelations depend on: (1) similarity of content (by no means an educational nor an acceptability
[371]
In any case, face reliability is something different from statistical reliability, and it must be attended to in its own right.
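The 'danger zone of unreliability' of 3.9 can likewise be given a rough width under classical assumptions: the sketch below (score sd invented) computes the margin by which a true score must exceed the cutting score for the testee to run at most a 5 percent risk of an observed failure, at r_xx = .60 versus .90.

```python
# A minimal sketch (assumptions invented) of the 'danger zone' of 3.9: the
# margin by which a true score must exceed the cutting score so that, with
# probability .95, measurement error will not push the observed score below it.
from math import sqrt

Z_95 = 1.645                       # one-sided 95% normal quantile
sd_x = 10.0                        # invented score sd

for r_xx in (0.60, 0.90):
    sem = sd_x * sqrt(1 - r_xx)    # standard error of measurement
    margin = Z_95 * sem            # required safety margin above the cutoff
    print(f"r_xx = {r_xx:.2f}: SEM = {sem:.2f}, safety margin ~ {margin:.1f} points")
```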
3.10. Third, if the test or the battery's face (content) validity and face reliability are acceptable, i.e., if testees agree to accept the remaining risk of bad luck, still another problem remains. If subject A scores X and subject B scores X - 1, can one be sure that A's achievement is really superior to that of B, regardless of the respective item compositions of the two scores?
Again, no psychometric proof is possible; testees will have to accept this as part of the game. The question is of particular importance for speeded tests, where skipping apparently difficult items may be, but is never certain to be, a good strategy. In some situations, tests with a speed element may be non-acceptable, whether for this reason or for other reasons. But the problem remains, for power tests as well: Does a score X represent a better achievement than X - 1?
Clearly, the fulfilment of this condition depends on the accepted near-equality in (face) 'importance' of the items (or, score points) in the set from which all possible total scores are made up. Suppose that this set consists of N dichotomous items (0, 1) with 'accepted item importance values' a_i (i = 1, 2, ..., N), measured on a ratio scale, with a_i > 0 for every i. Let a positive factor ƒ be determined so that:

(1)  Mdn(ƒa_i) = 1,

and let δ_i be defined by:

(2)  ƒa_i = 1 + δ_i.

Then any obtained item sum score X (X = 0, 1, ..., N) can be shown to correspond to a larger a_i-sum than any obtained score X - 1, if and only if:

(3)  Σ_{i=1}^{N} |δ_i| < 1.
requirement), (2) relative amount of attention given to various subjects in teaching, in interaction with: (3) relative amount of study effort devoted by students to those same subjects. Consequently, statistical reliability depends on content and educational-situational factors which lead a life of their own. This is to say that reliability data, according to the psychometric stereotype that a test 'measures' a 'trait,' are often overrated as sources of information.
[372]
Obviously a degree of near-equality of the a_i is required which, for all practical purposes, amounts to equal 'importance.' It appears that accepted equal item weighting presupposes near-equal 'accepted importance' for all items. The conclusion is hardly surprising. The point of this little argument is, however, to illustrate the kind of reasoning which 'acceptability analysis' is likely to produce.
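Condition (3) is easy to check mechanically. A minimal sketch follows, with invented importance values: rescale so that the median of ƒa_i equals 1, test whether Σ|δ_i| < 1, and confirm directly that the worst-case X-item importance sum exceeds the best-case (X - 1)-item sum for every X.

```python
# A hedged sketch (importance values invented) of the check behind condition
# (3): rescale accepted importances a_i so that Mdn(f*a_i) = 1, form
# delta_i = f*a_i - 1, and test sum(|delta_i|) < 1.
from statistics import median

a = [1.0, 1.1, 0.95, 1.05, 0.9, 1.0, 1.02, 0.98]   # invented importance values
f = 1 / median(a)                                   # so that median(f*a_i) = 1
delta = [f * ai - 1 for ai in a]
print("condition (3) holds:", sum(abs(d) for d in delta) < 1)

# Direct worst-case confirmation for each score X: the X smallest rescaled
# importances must outweigh the X-1 largest ones.
fa = sorted(f * ai for ai in a)
N = len(fa)
ok = all(sum(fa[:X]) > sum(fa[-(X - 1):]) for X in range(2, N + 1))
print("every score X outweighs every score X-1:", ok)
```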
4.1. Acceptability is of course a basically empirical notion. Opinion surveys among testees — students, applicants — and/or among samples of the general public will be needed to implement the idea. Evidence about the requirements of objectivity, transparency, face — content or construct — validity, face reliability, equitable weighting, etc. must be collected.
The goal of such 'acceptability analysis,' however, would not be to map a new field, in the measurement tradition (cf. 1.2) — let alone to design and test theories, to compute statistics, or to extract factors — but to systematically gather categorical information relevant to the framing of tester-testee contracts. As in the case of Cronbach and Gleser's profitability analysis, a 'taxonomy of decision problems' will be needed (op. cit., p. 15) — from the acceptability point of view, this time — for each cell of which a specific standard contract is to be specified. Clearly, this goal requires primarily a logical-systematic effort — with empirical 'evidence' filled in, as much in the juridical as in the scientific tradition.
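Purely by way of illustration, the 'standard contract per cell' might take a categorical form like the following; every field and cell below is invented, and serves only to show that the information is of a non-metric, contract-like kind.

```python
# Purely illustrative (all fields and cells invented): the categorical form a
# 'standard contract per taxonomy cell' of 4.1 might take, keyed by decision
# situation rather than by any metric test property.
from dataclasses import dataclass

@dataclass(frozen=True)
class TestingContract:
    objectivity_required: bool        # cf. 3.2-3.3
    transparency_required: bool       # cf. 3.4-3.5
    cutting_score_published: bool     # cf. 3.5 (6)
    judgmental_elements_allowed: bool # cf. 3.3

STANDARD_CONTRACTS = {
    # (institution type, treatment consequence) -> contract terms
    ("state university", "definitive rejection"): TestingContract(True, True, True, False),
    ("state university", "final grading"):        TestingContract(False, True, False, True),
    ("small private firm", "hiring"):             TestingContract(False, False, False, True),
}
```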
4.2. As the reader will have noticed, statistical ways of reasoning are of little, if any, value in acceptability analysis. Transparency requirements may prevent many a statistical optimization procedure from being used (cf. 3.5); the whole predictive argument will often be non-acceptable (cf. 3.8); face validity or face reliability may be at variance with statistical validity and reliability (cf. 3.6 thru 3.9). Acceptability analysis requires a different, non-statistical sort of inference (cf. 3.10).
This is not to say that all statistical evidence is irrelevant in the acceptability context, let alone that psychometric test analysis would become superfluous. Along with the subject matter specialists, expert psychometrists still have an important part to play in the construction and development of good instruments as well as in pertinent research projects (e.g. validity research). However, these experts must be prepared for a situation where their arguments are no longer decisive. Some of their favorite inferential methods do not apply at all, while others lead to arguments of relative value only, which may be overruled by acceptability considerations of a non-statistical nature.
4.3 The problem of judgmental weighting may serve as a typical example. Consider a complex sum score which does not consist of equally weighted (one point) items but of a number of (interval scale) subscores for which equal raw score intervals cannot be considered judgmentally equal. How are the subscores to be weighted?
If the total score is supposed to measure some complex attribute, statistical solutions would call for the calculation of variances and covariances. Since the concept of 'equal distribution to the total score' [nb: probably 'contribution' meant here, bw] can be variously interpreted (see, e.g. Hazewinkel, 1964), the determination of weights would require, in addition, the choice of a model. Now, clearly, from the point of view of justifiability, the multiplicity of possible models makes them all unconvincing; and, from the point of view of transparency, they are all too complex to 'see throughs' [sic] their consequences. It may be that weighting on the basis of variances only is an exception, i.e., is acceptable in some situations 10), but other models are certainly out.
In any case, the typical procedure for judgmental equating in acceptability analysis would have to be a different one. It would disregard both variances and covariances and run somewhat as follows. The decision to equate the intervals d_x, on the accepted interval scale of subscore X, and d_y, on the accepted interval scale of subscore Y, i.e., to weight them equally, can be taken if and only if respondents consider them 'equally important.' This cannot but mean that they are interchangeable. The intervals d_x and d_y are interchangeable if respondents
[374]
judge the following two cases, A and B (subjects, jobs, or other) 'equally good':
A: X_0, Y [sic] - d_y
B: X_0 - d_x, Y_0
on the assumption that A and B have identical (sub)scores on all other variables.
This method, which can be called ceteris paribus analysis, appears to be the adequate procedure of judgmental equating. It is non-, if not anti-statistical. It fits better into the traditions of inference of law and of negotiating than into those of natural-scientific psychometrics.
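A minimal sketch of the ceteris paribus procedure follows, with a programmed stand-in for the respondents (the judgment function is invented): bisect on d_y until profiles A and B are judged equally good, all other subscores held identical.

```python
# A minimal sketch (the judge is a stand-in for real respondents) of the
# ceteris paribus equating of 4.3: find the interval d_y such that profile
# A = (x0, y0 - d_y) and profile B = (x0 - d_x, y0) are judged 'equally good'.
def equate_interval(judge, x0, y0, d_x, lo=0.0, hi=100.0, tol=1e-6):
    """Bisect on d_y until judge() is indifferent between profiles A and B.
    Assumes judgments are monotone in d_y; judge > 0 means A is preferred."""
    while hi - lo > tol:
        d_y = (lo + hi) / 2
        a, b = (x0, y0 - d_y), (x0 - d_x, y0)
        if judge(a, b) > 0:   # A still judged better: enlarge A's sacrifice
            lo = d_y
        else:
            hi = d_y
    return (lo + hi) / 2

# Stand-in respondent who, unknown to us, values one X-point at two Y-points:
judge = lambda a, b: (a[0] * 2 + a[1]) - (b[0] * 2 + b[1])
print(equate_interval(judge, x0=50, y0=50, d_x=1.0))  # ~2.0: d_y worth one d_x
```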
4.4. In this article I shall not enter upon a discussion of philosophies which differ from the one implicitly presented here. My own position is very near that of Lovell who, in his 'dissenting view' in the American Psychologist (1967), specifically rejects the use of 'tests of character'—as opposed to 'tests of capacity'—in any situation where a 'personnel contract' obtains, whether 'strong' or 'weak.' The use of personality tests is acceptable only under a 'client contract,' i.e., in a situation providing services to the respondent on a fully voluntary basis. For the details of Lovell's three main types of contract, for his supporting argumentation and for his outline of 'a new ethical standard', the reader must be referred to the original article.
Finally, it will be clear that the subject matter of this article is related to a number of other issues which have been high on the agenda recently, in particular in the United States; such as invasion of privacy, generally (e.g., Ruebhausen and Brim, 1966), and deceptive experiments on laboratory subjects, psychology students in particular (e.g., Schultz, 1969). Again, specific situations may require the framing of specific contracts. I shall not go into those specifics, however, so as not to impair the readability of this essay by too much detail.
In the analysis of procedures for assigning testees to treatments, acceptability is introduced as a 'fourth dimension' next to reliability, validity and profitability. Even more than profitability (to the sponsor), acceptability (to all persons concerned, the testees in particular) depends on the details of the social decision problem as well as on the properties of the instruments per se. In particular, acceptability depends on the degree to which testee rights are at stake (e.g., study rights). Apart from objectivity, 'transparency' of the testing and decision situation and 'justifiability of differential decisions' are crucial points of view.
Acceptability analysis requires a non-statistical way of reasoning, to be borrowed from the discipline of law rather than from traditional scientific methodology. Accordingly, the typical end result of acceptability analysis should be an empirically based testing or examination contract.
note 10 (p. 373). Though certainly not always. An applied area where this type of problem is highly relevant is that of job evaluation, whenever a point system is used. The universe of jobs is not a nature-given set: different variances of job aspects or factors, and certainly different covariances, may have a real meaning which may better be retained in its raw score form than statistically equalized. Generally, job evaluation provides a good example area since the problem is obviously one of justifiability (of wage differences) rather than measurement. The following exposition is of immediate relevance to job evaluation practices.
Bloom, Benjamin S., Learning for mastery. UCLA/CSEIP Evaluation, 1, 2, 1968.
Brim, O. G., American attitudes toward intelligence tests. Amer. Psychologist, 20, 125-130, 1965.
Buros, O. K. ed., The sixth mental measurements yearbook. Highland Park, N. J., Gryphon Press, 1965.
Cronbach, L. J., and Goldine C. Gleser, Psychological tests and personnel decisions. Urbana, Ill., Univ. of Illinois Press, (1957) 1965.
Cronbach, L. J., and P. E. Meehl, Construct validity in psychological tests. Psychol. Bull. 52, 281-301, 1955. In: H. Feigl and M. Scriven, eds., Minnesota studies in the philosophy of science I. Minneapolis, Minn., Univ. of Minnesota Press, 1956, pp. 174-204.
De Groot, A. D., Statistische versus sociale rechtvaardiging bij weging van beoordelingscomponenten (statistical versus social justification of weights for components of a sum score). In: Verslag bijeenkomst Vereniging voor Statistiek, 19 november 1964 (summary), sociaal-wetenschappelijke sectie.
De Groot, A. D., Thought and choice in chess. Den Haag, Mouton, 1965; New York, Basic Books, 1966.
De Groot, A. D., Psychometrie en democratie—Some (rather chaotic) remarks on test theory. Memorandum AET-148, 1966.
De Groot, A. D., Methodology. Foundations of inference and research in the behavioral sciences. Den Haag, Mouton, 1969.
De Groot, A. D., Aanvaardbaar instrumentgebruik bij selectie en advies. De Psycholoog, 5, 1-4, 1970.
Drenth, P. J. D., Protesten contra testen. Amsterdam, Swets en Zeitlinger, 1967 (oratie).
Gross, M. L., The brain watchers. New York, Signet Books, 1962.
Hazewinkel, A., Enkele beschouwingen betreffende de bijdrage van een component tot een samengestelde variabele. Statistica Neerlandica, 18, 325-339, 1964.
Hoffman, B., The tyranny of testing. London, Collier-Macmillan, 1962.
Holtzman, W. H., Some problems of defining ethical behavior. Amer. Psychologist, 20, 247-250, 1965.
Lovell, W. R., The human use of personality [tests: a dissenting view. bw]. Amer. Psychologist, 22, 383-393, 1967.
Messick, S., Personality measurement and the ethics of assessment. Amer. Psychologist, 20, 136-142, 1965.
Ruebhausen, O. M. and O. G. Brim, Privacy and behavioral research. Amer. Psychologist, 21, 423-437, 1966.
Sandbergen, S., Teststrategie; een onderzoek naar veranderingen in testgedrag en coaching op geprecodeerde studietoetsen. N. T. Psychol. 23, 16-38, 1968.
Schultz, D. P., The human subject in psychological research. Psychol. Bull. 72, 214-228, 1969.
Van Naerssen, R. F., Itemscoring 'zeker' of 'niet zeker'. In: T. H. Eindhoven, Nationaal Congres Onderzoek van het Wetenschappelijk Onderwijs, Deel I. Eindhoven, 1966.
Whyte, W. H., The organization man. London: Penguin Books, 1956.
For extensive annotations see my page groot70.htm, Ben Wilbrink.
http://www.benwilbrink.nl/publicaties/70degroot.htm short: http://goo.gl/7ZgrN