HUMAN FACTORS, 1978, 20(2), 189-200

Miniature Job Training and Evaluation as a Selection/Classification Device

ARTHUR I. SIEGEL, Applied Psychological Services, Inc., Wayne, Pennsylvania

A novel approach to selection/classification is described. The approach, which combines the learning, performance test, and assessment center contexts, has been verified in a variety of situations. There is reason to believe that the approach: (1) yields selection/classification devices which possess reliability, validity, and appropriate power, and (2) meets fairness criteria.

The decision of the U.S. Supreme Court (Griggs v. Duke Power) and subsequent interpretations by the Equal Employment Opportunity Commission (EEOC) have rendered illegal preemployment tests which are not clearly and directly job related. While all industrial psychologists might not agree with the EEOC stand, most intelligence tests and conceptual (e.g., arithmetic reasoning, spatial relations, analogies, vocabulary) tests are now, accordingly, illegal unless direct empirical evidence is available that the test employed is valid, and is not differentially valid, in the specific industry and application involved. To cope with the problems which arise from such rulings, a number of approaches have been taken. One approach is to develop the data required to show that a preemployment test is valid and not differentially valid. Such an approach is costly and time consuming because such studies must be completed for each and every job and situation in which the test is used.



Requests for reprints should be sent to Dr. Arthur I. Siegel, Applied Psychological Services, Inc., Wayne, Pennsylvania 19087, USA.

A second approach to meeting the legal requirements of a test program is to build tests which are content relevant and fair to all on the face of the tests themselves. While this approach does not avoid validation problems, it minimizes them to a degree. It possesses the additional advantage of avoiding the criticism, often made by test takers, that they can learn and do a job although they cannot perform well on the usual pencil-and-paper tests. The present approach, developed at Applied Psychological Services, is based on the conjecture that a person who can demonstrate the ability to learn and perform a job sample will be able to learn and perform the total job, given appropriate on-the-job training. This approach implies both a training and a measurement aspect to the testing situation. Specifically, the job seeker is trained to perform a sample of tasks involved in the job and, immediately following the training, his ability to perform these tasks is measured. Within the training aspect, full attention is given to individual differences, "hands on" training, minimization of literacy requirements, and the like. The test, similarly, is a performance test.


The approach is termed the miniature (job) training and evaluation approach in subsequent parts of this paper. For example, in one training situation, the examinee is trained, through demonstration and practice, to turn on and shut down a motor. The test situation involves a procedural examination of a performance nature. The examinee is scored on how well he performs what he is taught. This approach was mirrored by O'Leary (1973) who wrote:

From the validity standpoint, the way out of the dilemma lies in the content validity of the test, or the similarity between test and job. The more nearly the test duplicates the specific tasks to be performed on the job, the greater the chances of developing devices that are fair. Where possible, then, job simulation tests (e.g., job sampling) should be part of the selection procedure. Obviously, this approach has two disadvantages: (a) it is initially more costly than less ideal approaches since a relatively large amount of money must be spent to develop and implement it; and (b) many jobs are so structured that the job sample method is difficult to apply (e.g., testing a 22-year-old college graduate for a sales position in a stock brokerage firm). (p. 148)

O'Leary further stated that since the task simulation method relates the test behaviors directly to the job for which an applicant is being considered, it provides specific information about the applicant's suitability for the job. It results in an increase in validity while it minimizes the acceptance of applicants who would be risky hires and the rejection of applicants who could perform well on the job. However, the pure job sample approach in general and the O'Leary article in particular have been sharply criticized (Gael, 1974). O'Leary's arguments were said to be based on: (1) irrelevant and faulty data (Gael, 1974), (2) improper statistical arguments and lack of data pertinent to predictive validity (Blood, 1974), and (3) poor evaluation vis-a-vis legal and social judgmental issues and side effects.

Purpose of Present Paper

The purposes of this paper are to: (1) describe and summarize recently completed


research relative to the miniature training and evaluative approach to selection/classification, and (2) assess the merit of the approach relative to psychometric properties and to EEOC issues.

MINIATURE TRAINING AND EVALUATION SITUATION

Miniature training and evaluation situations have been reported by Siegel and Bergman (1972), Siegel and Wiesen (1977), and by Applied Psychological Services (1975). The first two of these developments involved a wide variety of military trade oriented jobs while the last involved a variety of industrial refinery jobs. In the Siegel and Wiesen study, the miniature training and evaluation approach was combined with the assessment center approach. This assessment center extension is discussed more fully in a subsequent section.

A typical miniature training and evaluation is exemplified by the assembly test developed as part of a Navy machinist mate battery (Siegel and Bergman, 1975). In the assembly training and evaluation situation, the persons being evaluated are taught and tested on assembly of a gate valve from its component parts. First, a demonstration of the correct assembly procedure is presented. This demonstration is followed by a practice session in which the students are allowed to assemble the valve themselves. The instructors observe the students and assist them, as required. After the practice session, each person under evaluation is individually tested on his ability to assemble the valve. Scoring is through the checklist procedure.

Quite obviously, no one test can be employed to assess an individual's potential for any job. A total battery, derived from job analytic procedures, is employed. All situations in the battery follow the same logic: a "show and tell" learning situation and a nonwritten performance examination. For


example, the Navy machinist mate battery, referred to above, contains six training and evaluation situations.

An earlier related approach was used by Lawshe and Tiffin in the development of the Purdue Mechanical Adaptability Test. This test, described by McCormick and Tiffin (1975), was developed on the basis that:

... there was reason to believe from a previous study that, other things being equal [emphasis added], those persons who have most profited in knowledge from previous mechanical experiences may do better on mechanical jobs than those persons who have not so profited (pp. 143-144).

In concordance with this, Siegel and Bergman (1975) suggested that the miniature task training part of the miniature training and evaluative approach gives all the individuals tested a fair chance to do well on the test in a way that a performance test without a learning phase does not. The effects of differences in exposure to the job sample considered are thought to be controlled by the equalizing effect of the training phase.

Miniature training and evaluative situations, so developed, minimize emphasis on the ability to read and write. Ash and Kroeker (1975) cautioned against the use of tests which have a higher reading level than that required by the job. Such tests may be inherently biased against certain classes of people who, for reasons associated with their culture or socioeconomic class, have had less formal education, or less successful formal education, than other classes. For a motivated person, formal education, which yields a facility with written English, may have little to do with job success in many jobs.

Types of Tests Developed

To date, the miniature training and evaluation approach has been applied to a fairly broad sample of situations. Table 1 lists these situations along with the sources in which full descriptions of each situation may be found.

RELIABILITY AND VALIDITY

The interexaminer reliability for the miniature training and evaluation approach was investigated by Siegel and Bergman (1972; 1975) and by Siegel and Wiesen (1977). Empirical validity results have been reported by Siegel, Bergman, and Lambert (1973) and by Siegel and Leahy (1974). Siegel and Wiesen (1977) investigated validity from the "policy capturing" point of view.

Interexaminer Reliability

In the Siegel and Bergman (1972; 1975) studies, a Navy student sample consisting of both black and white persons was involved. A white test administrator and a black test administrator, who were both trained in administering and scoring the tests, were compared in interexaminer reliability. For three situations, persons who had passed through the training aspects of the training and evaluative situation were scored independently by the black examiner and by his white counterpart. The separate scores, so determined, were compared. The results of this interexaminer reliability analysis are shown in Table 2. Examination of Table 2 shows that the interexaminer reliability coefficients were acceptably high. In addition, the means and standard deviations across examiners were almost identical.

The Siegel and Wiesen (1977) interexaminer reliability work employed five situations. Four trained examiners were involved. Each separately scored the work of from 5 to 19 students. The intraclass correlation coefficient was employed to index the amount of agreement, by test, between the scores reported by the separate examiners. The results are presented in Table 3 and again suggest acceptable interexaminer reliability.

Summarization of the interexaminer reliability data presented in Tables 2 and 3 indicates: (1) a range from 0.72 to 0.99, (2) a median of 0.93, and (3) an interquartile range of 0.92 to 0.96. These reliability coefficients appear quite acceptable and suggest that, with proper examiner training and associated procedures, adequate interexaminer reliability can be attained in the miniature training and evaluative context.
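The paper does not specify which intraclass correlation variant Siegel and Wiesen computed. The Python sketch below assumes the common two-way random-effects, absolute-agreement form, ICC(2,1); the function name and the scores are illustrative placeholders, not the study's data.

```python
import numpy as np

def icc_2_1(ratings):
    """Two-way random-effects, absolute-agreement ICC(2,1).

    ratings: array of shape (n_subjects, k_raters); each column holds
    one examiner's scores for the same set of examinees.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    # ANOVA decomposition: subjects (rows), raters (columns), residual.
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical scores: 5 examinees each scored by 4 trained examiners.
scores = [[14, 15, 14, 16],
          [22, 21, 23, 22],
          [ 9, 10,  9, 11],
          [18, 17, 18, 18],
          [12, 13, 12, 12]]
print(round(icc_2_1(scores), 2))
```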


TABLE 1
Miniature Training and Evaluation Situations and Source of Full Description

Name of Situation                                  Source
Conceptual Integration/Application                 Siegel and Wiesen (1977)
Coordinative Speed and Accuracy
Dual Task Performance
Inspection/Sort
Level of Aspiration
Social Interaction
Tool and Object Naming, Use, and Recognition
Gasket Cutting and Meter Reading                   Siegel and Bergman (1972)
Trouble Shooting
Equipment Operation
Assembly
Process Record Keeping                             Applied Psychological Services (1975)
Process Task Performance
Measuring and Scaling
Tool Application
Monitoring Operations
Diagnosis of Machine Defects
Information Processing
Relating Diagrams with Objects

TABLE 2
Mean, Standard Deviation (S.D.), and Interrater Reliability Coefficient Reported by Siegel and Bergman for a Black Test Administrator and a White Test Administrator Scoring Three Evaluative Exercises

                 Exercise 1          Exercise 2          Exercise 3
               White    Black      White    Black      White    Black
n                   32                  39                  39
Mean           14.41    14.72      56.03    55.69      22.05    22.26
S.D.            2.23     3.10       7.49     7.46       3.21     3.33
r                   0.75                0.97                0.96

TABLE 3
Interrater Reliability (Intraclass Correlation) between Four Administrators as Reported by Siegel and Wiesen for Five Evaluative Exercises

Exercise    Number of Subjects    Reliability Coefficient
1                   19                     0.95
2                    6                     0.99
3                    5                     0.72
4                   15                     0.93
5                   15                     0.92

Empirical Validity

Siegel, Bergman, and Lambert (1973) investigated the ability of the miniature training and evaluation procedures to predict a performance criterion after the Siegel and Bergman subjects had been on their Navy jobs for a period of nine months. The same subjects were followed by Siegel and Leahy (1974) after the subjects had been on the job for 18 months.


In both followups, the criterion was the same set of job performance tests. For the first followup, 54 of the original sample of 99 students were available. For the second followup, 45 of the original 99 were available. Most of those not available for followup were assigned to ships outside of the continental United States. Since the original assignment to ships was random, there is no reason to believe that those available for followup testing were different from the remainder. Other reasons for nonfollowup were: on leave at time of followup, not locatable, and no longer in the Navy.

This method of validation against a performance criterion represents a particularly stringent test when a military situation is involved. A host of intervening situations can serve to hinder a journeyman's development. For example, a new journeyman may not be assigned to tasks which enhance his development of job related skills. The new man may be assigned to "housekeeping" duties. Or, for one reason or another, the supervisor may not provide adequate on-the-job training.

The performance criterion in both followups involved a battery of seven individually administered job performance tests. The basis for the criterion instrument set was supervisory opinion relative to an adequately diversified set of tasks which would represent the range of tasks performed by journeymen on the job. The performance criteria included standing messenger watch (SMW), breaking and making a flange (BMF), packing a valve (PV), demonstrating procedures in common malfunction and in emergency situations (SEQ), knowledge of use and names of common equipment and tools (TKU), manifesting general alertness and common sense in the work situation (WW), and adequacy of technical job knowledge (OJK).

Siegel and Bergman calculated the zero order correlation between each predictor and

a composite criterion score. The three predictors with the highest zero order correlations were then employed to determine the multiple correlation with each of the seven criterion tests for the first followup. The same predictors and criteria were employed, with minor exception, by Siegel and Leahy in the second followup. The resulting multiple correlations are presented in Table 4.

Table 4 indicates that the miniature training and evaluative situations produced statistically significant multiple correlations for 3 of 5 first followup criteria. None of the multiple correlation coefficients was statistically significant in the second followup. Comparison of the validity coefficients yielded by the miniature training and evaluative approach with those yielded by the Navy paper-and-pencil tests in use at the time indicated superiority for the miniature training and evaluative situations for the first followup but not for the second followup. While attenuation of correlation coefficients is customarily expected on cross validation, the question of the durability of predictions based on training/evaluative situations, as opposed to predictions based on an ability/conceptual approach, seems to be opened by these findings.
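As a rough illustration of the procedure just described, the following sketch selects the three predictors with the highest zero order correlations and computes the multiple correlation of their least-squares composite with a criterion. All data here are simulated placeholders, not the study's scores.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 54 examinees, 6 miniature-situation predictor
# scores, and one composite criterion score (all values simulated).
X = rng.normal(size=(54, 6))
criterion = X @ [0.5, 0.3, 0.2, 0.0, 0.0, 0.1] + rng.normal(size=54)

# Zero order correlation of each predictor with the composite criterion.
zero_order = np.array([np.corrcoef(X[:, j], criterion)[0, 1]
                       for j in range(X.shape[1])])
top3 = np.argsort(-np.abs(zero_order))[:3]

# Multiple correlation R of the three best predictors with the criterion:
# R equals the correlation between the least-squares fit and the criterion.
Xt = np.column_stack([np.ones(len(criterion)), X[:, top3]])
beta, *_ = np.linalg.lstsq(Xt, criterion, rcond=None)
R = np.corrcoef(Xt @ beta, criterion)[0, 1]
print(top3, round(R, 2))
```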

TABLE 4
Multiple Correlation of Miniature Training and Evaluative Predictors with Performance Criteria

                    Multiple Correlation
Criterion    First Followup    Second Followup
BMF               0.30               0.15
PV                0.22               0.30
TK                0.42*              0.29
WW                0.43*              0.20
SEQ               0.35               0.23
OJK               0.46*              0.21
SMW                --                 --

* statistically significant


It is quite possible that miniature job training and evaluative situations are adequate for predicting success on initial job entry but that continued success and development hinge on verbal and conceptual factors. We note here that two of the three miniature job training and evaluation predictive situations evidencing the highest zero order correlations, gasket making and trouble shooting, involved factors other than pure performance. The gasket cutting test was administered in a time stressed, attention sharing situation. In this situation, the examinee was asked to fabricate a gasket while he simultaneously monitored a randomly fluctuating meter. Similarly, the trouble shooting test involved such higher mental factors as evaluation, data synthesis, and relational thinking. The possibility also remains open that the performance criteria did not reflect the job requirements for a person with 18 months experience as well as the requirements for a person closer to job entry.

Finally, it is noted that the program was not designed as a test developmental venture. Rather, it was designed to demonstrate a concept or approach. If a pure test development program had been involved, a different set of steps would have been completed (e.g., item analysis, test construction, initial validation and test reconstruction, cross validation).

Validity-Policy Capturing

As stated above, the Siegel and Wiesen (1977) work extended the miniature training and evaluation concept to the assessment center context. Prior use of assessment centers is limited almost entirely to managerial level personnel. The extension of the assessment center approach to technical jobs represents an elaboration of the assessment center concept as originally developed. However, the extension of the assessment center approach to nonmanagerial jobs was previously suggested as a possible area of investigation by Bray and Moses (1972).


Several reviews of current assessment center practices are available (e.g., Bray and Grant, 1966; Howard, 1974; Huck, 1973; MacKinnon, 1975). The assessment center treats each individual as a whole person rather than as a sum of specific abilities and aptitudes. The typical assessment center runs for one to three days and involves two to five administrators (assessors) and six assessees. The goal is to predict future success. The prediction is made as the result of a joint decision by the assessors relative to each individual as a whole person (in terms of whether or not or how well the person will succeed) rather than a rating on a statistically weighted sum of various scores (Bray and Campbell, 1968).

Siegel and Wiesen's (1977) work employed a set of miniature training and evaluative situations, and four assessors. Each student in the sample (N = 140) was evaluated by the assessment team in terms of how well each student would succeed in a selected occupation. This allowed comparison of the actual miniature training and evaluative situation scores with the combined overall opinion of the assessors. To this end, a stepwise multiple linear regression analysis procedure was used. This procedure allows evaluation of miniature training and evaluative scores relative to the overall decision of the assessors. In other contexts, the approach has been termed "policy capturing." Madden (1964), Stephenson and Ward (1971), Bottenberg and Christal (1968), and Christal (1968a; 1968b; 1963) described employment of the policy capturing approach for Air Force officer advancement evaluative purposes, and Siegel and Federman (in press) employed the approach for deriving emphasis areas in Air Force technical training. The general approach has also been applied in a wide variety of other areas including, but not limited to, judgments of personality characteristics (Hammond, Hursch, and Todd,


1964), attraction of common stocks (Slovic, 1969), mental illness diagnosis (Goldberg, 1970), and judgments of admissibility to graduate school (Dawes, 1970; 1971). While there is some theoretical controversy relative to the use of the additive model for such work, Slovic and Lichtenstein (1970), after a comprehensive review of studies employing the linear approach, concluded:

In all of these situations the linear model has done a fairly good job of predicting the judgments, as indicated by r values in the .80s and .90s for the artificial tasks and the .70s for the more complex real-world situations (p. 36).

The Siegel and Wiesen data allowed computation of multiple regression equations for men who were to be assigned to two career fields. The stepwise multiple correlation (six predictors) between the miniature training and evaluation situations and the composite estimate of the assessors was 0.81 for one career field (N = 38) and was 0.68 (N = 64) for the other career field.
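The exact stepwise algorithm used by Siegel and Wiesen is not specified here. A minimal sketch of one common variant, greedy forward selection on R squared, follows; the data, names, and stopping rule are assumptions for illustration, not the authors' procedure.

```python
import numpy as np

def forward_stepwise_r2(X, y, n_steps):
    """Greedy forward selection: at each step add the predictor that
    most increases R-squared of the least-squares fit. A simple stand-in
    for stepwise regression; entry/exit F-tests are omitted."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    chosen, r2_path = [], []
    for _ in range(n_steps):
        best_j, best_r2 = None, -np.inf
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            Z = np.column_stack([np.ones(len(y))] +
                                [X[:, c] for c in chosen + [j]])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            resid = y - Z @ beta
            r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
            if r2 > best_r2:
                best_j, best_r2 = j, r2
        chosen.append(best_j)
        r2_path.append(best_r2)
    return chosen, r2_path

# Hypothetical policy capturing: regress the assessors' composite
# judgment on six miniature-situation scores (simulated values);
# the square root of the final R^2 is the stepwise multiple R.
rng = np.random.default_rng(1)
scores = rng.normal(size=(38, 6))
judgment = scores @ [0.6, 0.4, 0.3, 0.1, 0.0, 0.0] \
    + rng.normal(scale=0.5, size=38)
order, r2s = forward_stepwise_r2(scores, judgment, 6)
print(order, round(np.sqrt(r2s[-1]), 2))
```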

Validity-Discriminant Analysis

The miniature training and evaluative validity question has also been approached from the discriminant analytic point of view. Here the question of interest relative to validity is: What is the absolute predictive power of the miniature training and evaluative situation relative to an absolute criterion? Siegel and Bergman (1973) dichotomized their fleet performance criterion scores on the basis of a Delphi technique derived cut score to yield a pass-fail performance criterion to be predicted. Table 5 presents the number of persons predicted by the equations to fall into each criterion group (pass or fail) and the number who actually fell into each criterion group. There was 74% and 62% correct classification, respectively, for the 9 and the 18 month followups. The Mahalanobis D square was statistically significant (p < 0.005) only for the 9 month followup.

TABLE 5
Discriminant Function Prediction and Actual Performance for Nine and Eighteen Month Followups

                                    Actual
                                 Pass    Fail
9 Months
  Predicted Pass                  20      10
  Predicted Fail                   4      20

18 Months
  Predicted Pass                   9       6
  Predicted Fail                   6      13

Siegel and Wiesen (1977) employed the discriminant analytic technique to investigate the ability of the miniature training and evaluative situation to predict whether or not the assessment team would assign people to one or the other of two jobs. They achieved a 79% hit rate, and the Mahalanobis D square statistic was statistically significant (p < 0.05).
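For readers unfamiliar with the technique, the sketch below illustrates two-group linear discriminant classification with a pooled covariance matrix, a hit rate, and the Mahalanobis D square between group centroids. It is a generic illustration with simulated data, not a reproduction of the authors' analysis.

```python
import numpy as np

def two_group_lda(X, labels):
    """Two-group linear discriminant analysis; returns the hit rate and
    the Mahalanobis D-squared between group centroids."""
    X = np.asarray(X, float)
    g0, g1 = X[labels == 0], X[labels == 1]
    m0, m1 = g0.mean(axis=0), g1.mean(axis=0)
    # Pooled within-group covariance matrix.
    S = ((len(g0) - 1) * np.cov(g0.T) +
         (len(g1) - 1) * np.cov(g1.T)) / (len(X) - 2)
    S_inv = np.linalg.inv(S)
    d2 = (m1 - m0) @ S_inv @ (m1 - m0)   # Mahalanobis D^2
    w = S_inv @ (m1 - m0)                # Fisher discriminant weights
    cut = w @ (m0 + m1) / 2              # midpoint cutoff, equal priors
    predicted = (X @ w > cut).astype(int)
    return (predicted == labels).mean(), d2

# Hypothetical pass/fail data: 54 examinees, 3 predictor scores.
rng = np.random.default_rng(2)
fail = rng.normal(0.0, 1.0, size=(24, 3))
passed = rng.normal(0.8, 1.0, size=(30, 3))
X = np.vstack([fail, passed])
labels = np.array([0] * 24 + [1] * 30)
hit_rate, d2 = two_group_lda(X, labels)
print(round(hit_rate, 2), round(d2, 2))
```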

Validity-Differential

In the Siegel and Bergman sample, the miniature training and evaluative scores of the 29 black and 25 white criterion sample subjects were compared through t tests in order to determine whether or not statistically significant differences existed across the races. No such differences were found.

To provide additional insight into the differential validity question, the relationships among the composite miniature training and evaluative scores and the composite field criterion for both blacks and whites were examined. The miniature training and evaluative predictor-criterion correlation coefficients failed to achieve desirable predictive levels for whites in terms of the composite field criterion. However, the miniature training and evaluative situations predicted the performance of the blacks with statistical significance. Differential validity is said to obtain when the correlation coefficients for two groups differ significantly from zero and from each other. The correlation coefficients were converted to z scores, and the difference between the correlation coefficients for the black and the white groups was tested for statistical significance. The differences were not statistically significant. Accordingly, the miniature training and evaluative situations fail to meet Boehm's (1972) criteria for differential validity and cannot be held to be differentially valid by this criterion.
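The z-score comparison referred to is presumably the standard Fisher r-to-z test for the difference between two independent correlations. A minimal sketch follows; the group sizes match the study (29 and 25), but the correlation values are placeholders, since the paper reports only that the difference was nonsignificant.

```python
import math

def fisher_z_difference(r1, n1, r2, n2):
    """Test the difference between two independent correlations using
    Fisher's r-to-z transformation; returns the z statistic."""
    z1 = math.atanh(r1)   # 0.5 * ln((1 + r) / (1 - r))
    z2 = math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se

# Group sizes from the study; the correlations here are placeholders.
z = fisher_z_difference(0.45, 29, 0.15, 25)
print(round(z, 2), abs(z) > 1.96)  # two-tailed test at alpha = 0.05
```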

196-April, 1978

OPINIONS OF TEST TAKERS

While opinions of test takers add little psychometric insight, the advent of consumerism dictates that user opinion be considered in any novel procedure, and especially in the testing field. In all studies completed to date, posttest interviews have been completed to determine the reactions of the test takers to the miniature job training and evaluation situation(s). The response has been uniformly positive across studies.

In the Siegel and Wiesen (1977) study, a questionnaire which contained both multiple choice and open-ended questions was administered at the conclusion of each assessment day. The multiple choice data concerning perceived fairness indicated unanimous agreement that the miniature training and evaluative exercises were fair in an absolute sense. In a choice between "yes" and "no," indicating agreement or disagreement with a statement that the miniature training and evaluative tests seem fair, 100% of the sample responded "yes." This topic was further probed by asking the individuals who participated in the assessment process to compare the exercises with other tests taken in the past. Based on a five category Likert type scale, 31% of the respondents considered the miniature training and evaluative exercises "very much more fair," 42% found the miniature


training and evaluative exercises "more fair," and 26% said the exercises were "equally fair." Only one individual (less than 1% of the sample) chose the "less fair" category. No one chose the "very much less fair" category. There was also unanimous agreement that the miniature training and evaluative exercises were enjoyable. On a four category scale, 76% of the individuals assessed chose the superlative "enjoyed very much" response. No one chose the "nonenjoyable" category.

The reasons for perceiving the exercises to be fair were inquired into through a completion type question. Reasons frequently given for thinking the miniature training and evaluative exercises were fair fell into two major areas: (1) the training completeness and hands on aspects of the learning phase of the exercises, and (2) the exercises emphasized performance, not reading and writing. Typical replies to the question are given below:

"Gave me a chance to prove that I could do some things with my hands, not just my head."

"Very fair. It was explained very well every time, to make sure I understood what was going on, and I did."

"Yes. Because these weren't tricky like others."

"It gives a person an opportunity to use his hands in a certain situation, instead of learning it from a book. Like some tests you have to read questions and a certain amount of time is alloted (sic) for the test. Some people can't read fast enough and that isn't fair. But here you can use your hands."

"Because it was a lot less