Constructing Licensure Exams: A Reliability Study of Case-Based Questions on the National Board Dental Hygiene Examination

Tsung-Hsun Tsai, Ph.D.; Barbara Leatherman Dixon, R.D.H., B.S., M.Ed.; John H. Littlefield, Ph.D.

Abstract: Patient cases with associated questions are a method for increasing the clinical relevance of licensure exams. This study used generalizability theory to assess changes in score reliability when the number of questions per case varied in the National Board Dental Hygiene Examination (NBDHE). The experimental design maintained the same total number of case-based items while varying the number of cases and the number of items within cases. Using generalizability theory, the amounts of error variance within cases and between cases on the NBDHE were assessed, and the impact on score reliability (generalizability) was computed. The data were the responses of 4,528 candidates who took the paper-pencil version of the NBDHE in spring 2009. Results showed that the minimum value of generalizability occurred when fourteen cases with ten items each were used in the examination and the maximum value occurred when fifty cases with three items each were used. The research findings support the development of more cases with fewer items per case on the NBDHE in order to enhance test score reliability and validity. Practical constraints should be considered if more cases with fewer items per case are developed for future examinations.

Dr. Tsai is Consultant in Educational Measurement and Statistics; Ms. Dixon is a past member of the Joint Commission on National Dental Examinations, former Chair of the Utah Board of Dentists and Dental Hygienists, and current member of the Commission on Dental Accreditation Dental Hygiene Review Committee and the American Dental Hygienists' Association Council on Regulation and Practice; and Dr. Littlefield is Adjunct Assistant Professor, Department of Comprehensive Dentistry, University of Texas Health Science Center at San Antonio Dental School. Direct correspondence and requests for reprints to Dr. Tsung-Hsun Tsai, 2286 University Drive, Naperville, IL 60565; [email protected].

Keywords: dental hygiene education, licensure examination, National Board Dental Hygiene Examination

Submitted for publication 9/6/12; accepted 12/11/12

The Joint Commission on National Dental Examinations (JCNDE), which is responsible for developing and administering the National Board Dental Hygiene Examination (NBDHE) program, routinely validates the NBDHE.1-4 In 2010, the JCNDE confirmed that the case-based items enhanced the overall validity of the NBDHE.5 Validity refers to the degree to which logic and evidence support the interpretation and use of the scores achieved. NBDHE scores are used by state boards of dentistry for making pass/fail licensing decisions. The aim of our study was to assess the reliability of individual candidate scores on the case-based component of the NBDHE. Our goal was to better understand how much various sources of error affect test scores and ultimately to enhance test score reliability by reducing the sources of error variance. Score reliability6 estimates the amount of measurement error embedded in a candidate's test score. As the reliability of test scores increases, the measurement error decreases, i.e., becomes relatively smaller. Technically, reliability in classical test theory6 quantifies the internal consistency of performance
on a test; in other words, it estimates the "true" test performance after taking the "overall" error of measurement into account. Reliability in classical test theory does not provide information about how much measurement error is associated with particular sources (e.g., variability of item difficulty). Identifying the major sources of measurement error individually and assessing their relative impact are therefore critically important: armed with this information, more efficient measurement procedures can be constructed to reduce the error and enhance test score reliability, whereas without comprehensive information on the sources of error that contribute to test performance, it is not possible to develop effective measurement procedures. To explain this concept, suppose that an investigator wants to evaluate the reliability of scores on writing prompts. The investigator must, first, identify the sources or components of error that contribute to scores on writing prompts; second, determine how much error results from each source; third, compare the amounts of error to identify the largest source; and fourth, enhance the reliability by reducing the error resulting from that largest source. Generalizability theory is used to accomplish this goal.7 In generalizability theory, the overall error is disentangled into components or sources of error, their error variances are estimated, and reliability is enhanced by reducing the main source of error.

Several studies have applied the concepts and framework of generalizability theory in assessing the score reliability of licensure and certification testing programs. Research on the examinations of the Medical Council of Canada found that the main source of error was the number of items within patient cases and determined that optimal reliability occurred when the examinations used patient cases with two or three items per case.8 Research on the Canadian Chiropractic Examining Board's June 2005 Clinical Skills Examination concluded that generalizability theory could be used to understand sources of error by identifying and selecting facets of measurement (e.g., rigorous training of the raters).9 Research on the National Board Dental Examination Part II (NBDE Part II) found that the main source of error was the number of items nested within patient cases and that generalizability could be further enhanced by including at least ten cases for a total of 100 case-based items on the exam.10 Research on the NBDE Part II conducted by researchers outside the JCNDE also found that the main source of error was the number of items nested within cases and that the reliability of the NBDE Part II could be further enhanced as the number of cases (not the number of items per case) increased.11 This independent research supports the use of generalizability theory in test development to optimize the reliability of the NBDE Part II and strengthen test score validity evidence.

Based on these previous studies, the JCNDE was concerned with whether similar results would be produced using data from the NBDHE's complex test structure, i.e., an examination composed of a series of clustered items (testlets/patient cases) with varying numbers of associated items per case. In our study, the terms "testlets" and "patient cases" are interchangeable. This study also compared score reliability estimated with classical test theory (the traditional method) and with generalizability theory (which separates sources of measurement error) for the NBDHE when different numbers of items per patient case were used. Based on previous studies of testlet-based assessments, the generalizability estimate is more accurate than the traditional reliability estimate because reliability in classical test theory tends to overestimate the "true" test performance.10-12

This study followed Brennan's concepts and notation for generalizability theory.7 His concepts and notation, as well as their application to this research on the NBDHE, are described below. In generalizability theory, a facet is a set of similar measurement conditions characterizing sources of error that contribute to test performance, i.e., test scores. In our research, the facets of measurement are the number of patient cases and the items nested within cases. The design used in this study is a two-facet, unbalanced, items-nested-within-cases p×(i:c) design. The unbalanced design was used because the patient cases in the NBDHE vary in length, i.e., the number of items per case is not equal.13 In this design, three main indices are present: number of candidates (p), number of items (i), and number of patient cases (c). Every item (i) is associated with one and only one patient case (c). There are three main effects: candidates (p), patient cases (c), and items nested within patient cases (i:c). Two interaction effects exist: candidates-by-patient-cases (pc) and candidates-by-items-nested-within-patient-cases (pi:c).

In generalizability theory, generalizability studies (G-studies) generalize specific results from a sample to a universe of measurement; a universe of measurement is a hypothetical population defined by the facets. In our research, the G-study generalizes results from a small set of patient cases with numerous items per case to a larger set of patient cases with fewer items per case while maintaining the same total number of case-based items. Decision studies (D studies) are based on the results of a G-study. In D studies, the number of levels for each facet is manipulated (e.g., the number of items per case), and the resulting changes in reliability coefficients (generalizability) and error variances are examined. In our research, the D studies were designed to assess whether the reliability coefficient of the exam improved when more patient cases with fewer items per case were used.

Reliability in generalizability theory (generalizability, or the G-coefficient) is an indicator of the extent to which individual candidate scores achieved on the current NBDHE would remain essentially the same if numerous additional exams on the same topics were taken. The G-coefficient, Eρ², is the ratio of universe score variance, σ²(p) (analogous to true score variance in classical test theory), to the sum of universe score variance and relative error variance. Relative error variance, σ²(δ), is the error variance associated with using a candidate's observable deviation score as an estimate of that candidate's universe deviation score; σ²(δ) is the sum of all variance components that have interactions of
objects of measurement with facets in the universe of generalization. Like reliability in classical test theory, as generalizability increases, the error becomes relatively smaller, and thus a more precise measurement of “true” test performance can be obtained.
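
To fix notation, the quantities defined above can be restated compactly. The decomposition and error formula below are the standard random-effects results for this design, following Brennan,7 rather than equations quoted from this article:

```latex
% Total observed-score variance decomposes into the five components
% estimated in the G-study:
\sigma^2(X_{pic}) = \sigma^2(p) + \sigma^2(c) + \sigma^2(i{:}c)
                  + \sigma^2(pc) + \sigma^2(pi{:}c)

% Relative error variance for a D study with n'_c cases and
% n'_i items per case:
\sigma^2(\delta) = \frac{\sigma^2(pc)}{n'_c} + \frac{\sigma^2(pi{:}c)}{n'_c \, n'_i}

% G-coefficient (generalizability):
E\rho^2 = \frac{\sigma^2(p)}{\sigma^2(p) + \sigma^2(\delta)}
```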

Method

The paper-pencil version of the NBDHE administered in spring 2009 was used in this research. The NBDHE is designed to fulfill a didactic requirement in the dental hygiene licensure process. The examination assesses candidates' theoretical knowledge of basic biomedical, dental, and dental hygiene sciences and their ability to apply such information in a problem-solving context. This comprehensive examination consists of 350 multiple-choice items. Items in the various disciplines are intermingled throughout the examination; each item is presented as a stem consisting of a question or statement, followed by a list of four or five possible responses. The examination has two components: a discipline-based component and a case-based component. The 200 discipline-based items are independent of each other and of the patient cases. The 150 case-based items are organized into fourteen patient cases. A case consists of a synopsis of a patient's health and social history, plus dental charting, radiographs, and photographs when relevant. A series of items is associated with each case. To select the correct response to each item, candidates must possess the requisite knowledge, interpret the case materials correctly, and then identify the most appropriate alternative among the four or five provided. Responses from the 4,528 candidates, drawn from all accredited dental hygiene programs, who took the paper-pencil version of the NBDHE in spring 2009 were the data source for this study.

Random-effects variance components for the G-study unbalanced p×(i:c) design, with np=4,528 candidates and nc=14 cases containing ni=9 to 14 items per case, were estimated with urGENOVA.14 The estimated G-study variance components from urGENOVA were then used as inputs to GENOVA15 to obtain results for the D studies; urGENOVA and GENOVA are computer programs developed primarily for analyses in generalizability theory. In the D studies for the p×(I:C) design, a candidate responds to each of n'c cases, with n'i items nested within each case. In the NBDHE, the total number of case-based items for each candidate is 150. The D studies therefore included five combinations of n'c and n'i such that n'c ≤ 50 and n'i × n'c ≤ 150, as shown in Table 1 and in the enumeration sketch that follows the table.

Table 1. D studies for the p×(I:C) design

D Study      I     II    III   IV    V
n'c          14    15    25    30    50
n'i          10    10    6     5     3
n'i × n'c    140   150   150   150   150
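
To make the design constraint concrete, the following sketch enumerates the admissible D-study combinations. It is purely illustrative: the study itself used urGENOVA and GENOVA, and the function here is a hypothetical helper, not part of either program:

```python
# Enumerate admissible D-study designs for the p x (I:C) layout under the
# constraints described above: n'_c <= 50 and n'_i * n'_c <= 150.
def d_study_designs(max_cases=50, max_total_items=150):
    """Yield (n_cases, items_per_case) pairs satisfying both constraints."""
    for n_cases in range(1, max_cases + 1):
        for items_per_case in range(1, max_total_items // n_cases + 1):
            yield n_cases, items_per_case

# The five designs selected for Table 1 are all admissible.
admissible = set(d_study_designs())
for n_c, n_i in [(14, 10), (15, 10), (25, 6), (30, 5), (50, 3)]:
    assert (n_c, n_i) in admissible
    print(f"n'_c={n_c:2d}, n'_i={n_i:2d}, total items={n_c * n_i}")
```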

Results

Descriptive statistics (e.g., mean raw score percentage correct) and classical test theory reliability coefficients (Cronbach's alpha) for all 350 items (200 discipline-based and 150 case-based) are shown in Table 2. The 200 discipline-based items were less difficult for candidates than the 150 case-based items. The reliability (Cronbach's alpha) of the case-based component is lower than that of the discipline-based component because it contains fewer items (150 vs. 200); a computational sketch of the alpha formula follows Table 2.

Table 2. Descriptive statistics

Examination        Number of Items   Raw Score Mean   Raw Score SD   Mean Raw Score %   Cronbach's Alpha
Total              350               240.37           22.68          68.70              0.890
Discipline-based   200               139.22           14.97          69.60              0.857
Case-based         150               101.15            9.45          67.40              0.726
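
The Cronbach's alpha values in Table 2 follow the standard formula α = [k/(k-1)]·(1 - Σσᵢ²/σ²_X), where k is the number of items, σᵢ² is the variance of item i, and σ²_X is the variance of total scores. As a point of reference, a minimal sketch of that computation is shown below; the candidate response matrix is not public, so the data here are a simulated stand-in and the printed value will not match Table 2:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (candidates x items) matrix of 0/1 scores."""
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # per-item score variances
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated stand-in: 4,528 "candidates" answering 150 case-based items,
# with correctness driven by a single latent ability.
rng = np.random.default_rng(0)
ability = rng.normal(size=(4528, 1))
scores = (rng.normal(size=(4528, 150)) < 0.3 + 0.5 * ability).astype(float)
print(f"alpha = {cronbach_alpha(scores):.3f}")
```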

G-study results for the unbalanced p×(i:c) design are shown in Table 3. The error variance components of the main effects for candidates (p) and cases (c), and of the interaction effect (pc), are relatively small. Two error variances are much larger: items nested within cases (i:c) and the interaction of candidates with items nested within cases (pi:c). These two facets are the major sources of error to consider when interpreting test scores.

Table 3. G-study results for unbalanced p×(i:c) design

Effect   df       Variance Component
p        4527     0.00285
c        13       0.00123
i:c      136      0.05313
pc       58851    0.00046
pi:c     615672   0.16238

The objective of the D studies was to determine whether the generalizability of the total score achieved on the NBDHE could be improved by changing the test structure (e.g., increasing the number of cases while reducing the number of items per case). Results for the D studies using a p×(I:C) design with five measurement conditions are reported in Table 4, which shows the relative error variance, σ²(δ), and the G-coefficient, Eρ², for each condition; a sketch verifying these values follows the table. Generalizability is maximized, and error variance minimized, by using fifty cases with three items per case. Overall, more cases with fewer items per case reduce error variance and therefore enhance generalizability. However, the error variances for fourteen, fifteen, twenty-five, and thirty cases are only negligibly larger than the error variance for fifty cases, and the G-coefficients for fifteen, twenty-five, and thirty cases are only negligibly smaller than the G-coefficient for fifty cases.

Table 4. Results from D studies for p×(I:C) design

D Study    I         II        III       IV        V
n'c        14        15        25        30        50
n'i        10        10        6         5         3
σ²(δ)      0.00119   0.00111   0.00110   0.00110   0.00109
Eρ²        0.70497   0.71912   0.72135   0.72191   0.72303
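
Because Table 4 is a deterministic function of the variance components in Table 3, the D-study results can be checked directly. The sketch below applies the standard random-effects D-study formulas for the p×(I:C) design, σ²(δ) = σ²(pc)/n'c + σ²(pi:c)/(n'c·n'i) and Eρ² = σ²(p)/(σ²(p) + σ²(δ)); these are textbook generalizability theory formulas, not code from the study itself:

```python
# Reproduce Table 4 from the G-study variance components in Table 3,
# using standard random-effects D-study formulas for a p x (I:C) design.
VAR_P, VAR_PC, VAR_PIC = 0.00285, 0.00046, 0.16238  # sigma^2(p), sigma^2(pc), sigma^2(pi:c)

def d_study(n_cases: int, items_per_case: int) -> tuple[float, float]:
    """Return (relative error variance, G-coefficient) for one design."""
    rel_err = VAR_PC / n_cases + VAR_PIC / (n_cases * items_per_case)
    g_coef = VAR_P / (VAR_P + rel_err)
    return rel_err, g_coef

for n_c, n_i in [(14, 10), (15, 10), (25, 6), (30, 5), (50, 3)]:
    rel_err, g = d_study(n_c, n_i)
    print(f"n'_c={n_c:2d}  n'_i={n_i:2d}  sigma2(delta)={rel_err:.5f}  Erho2={g:.5f}")
# The printed values match Table 4 for all five designs, e.g.,
# 0.00119 / 0.70497 for fourteen cases of ten items and
# 0.00109 / 0.72303 for fifty cases of three items.
```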

Discussion

Previous research with testlets by the Medical Council of Canada8 reached conclusions similar to this study's: score reliability is highest with a large number of cases and few items per case (e.g., three per case), and the largest source of error variance is the variability of candidate performance on individual items within a given case. The findings are also consistent with results from internal and external research on the NBDE Part II.10,11

What might be causing these findings? The patient problems presented in the cases occur frequently (they are not unusual); however, there is no standard approach for writing the questions associated with a case. For example, a standard approach might be three questions for each case: What is the most likely problem that is causing this patient's signs and symptoms? What additional information would help you confirm your diagnosis of this patient? What are the most important next steps in treating this patient? Such a standard approach to writing items might reduce the large variability of performance on items within a case (e.g., whether the candidate recognizes the most likely underlying problem and how to manage it).

Is the use of multiple-choice questions (MCQs) contributing to the large variability of individual candidate performance within cases? Reasoning through problems presented by patients in actual practice is more similar to answering fill-in-the-blank questions than MCQs (i.e., clues are provided by MCQ options but not by fill-in-the-blank questions). A study that compared student performance on fill-in-the-blank questions vs. MCQs concluded that students study differently when preparing for a fill-in-the-blank exam.16 Perhaps the clues provided by reading the MCQ options influence candidate performance as much as or more than knowledge about the specific patient case. If case-based testing research used fill-in-the-blank questions, the large variability of individual candidate performance on questions within cases found by this study and a predecessor8 might be reduced. Practical considerations and constraints related to using other item types on the NBDHE have been discussed and are being investigated by the JCNDE. At this time, the JCNDE supports the use of MCQs on the NBDHE. We recommend more research on the use of fill-in-the-blank questions and other item types to reduce the large variability of individual candidate performance within cases.

What are the implications of these findings for constructing future exams? First, if the number of cases is increased and the number of items per case is reduced, the total number of items on the exam will decrease (compared to the current exam) because of the additional time required to read the cases. Second, good cases are difficult to construct (e.g., they require good-quality radiographs), so the effort required to construct an exam may increase. Third, computer-administered exams can display visual data (e.g., radiographs) more easily than the paper-based exams used in this study, so displaying patient data is becoming less difficult.

Practical constraints should be considered when applying the results of this reliability research on the NBDHE to other aspects of the licensing examinations. These constraints include testing time, the cost of item and case development, and content validity. Regarding testing time, although generalizability would be enhanced by more cases with fewer items per case, an increased overall testing time might be problematic for the vendor responsible for administering the examination in a finite amount of time at a computer testing center, and we expect that fees would increase to accommodate the longer testing time. From a psychometric perspective, a longer testing time might also introduce other sources of error (such as fatigue or decreased motivation) that could affect test performance. Development of more case materials, including pretesting and validating items, would be time-consuming, challenging, and expensive, so the costs associated with these developmental activities must be considered. With regard to content validity, however, a broader range of content (more cases) is anticipated to improve the measurement characteristics of the examination compared to an increased number of items within cases. Ensuring evidence of validity is a key consideration in building an examination.

REFERENCES

1. Kramer GA, DeMarais DR. Construct validity of the national board dental hygiene examination. J Dent Educ 1997;61(8):709-16.
2. Kramer GA, Neumann LM. Validation of the national board dental hygiene examination. J Dent Hyg 2007;81(3):1-17.
3. Kramer GA, Neumann LM. Domain specification and validity: national board dental hygiene examination practice analysis. J Dent Educ 2004;68(9):920-44.
4. Yang CL, Neumann LM, Kramer GA. Examining the stability of candidate performance and item difficulty for the dental hygiene examinations. Poster presentation at American Dental Education Association Annual Session & Exhibition, March 2009, Phoenix, AZ.
5. Neumann LM, Kramer GA, Tsai TH. Update on the national board dental hygiene examination. Presentation at American Dental Education Association, Washington, DC, February 27-March 3, 2010.
6. Allen MJ, Yen WM. Introduction to measurement theory. Monterey, CA: Brooks/Cole, 1979.
7. Brennan RL. Generalizability theory. New York: Springer-Verlag, 2001.
8. Norman G, Bordage G, Page G, Keane D. How specific is case specificity? Med Educ 2006;40:618-23.
9. Lawson D. Applying generalizability theory to high-stakes objective structured clinical examinations in a naturalistic environment. J Manipulative Physiol Ther 2006;29:463-7.
10. Tsai TH, Shin CD, Neumann LM, Grau BW. Generalizability analyses of NBDE Part II. Eval Health Prof 2012;35(2):169-81.
11. Downing SM, Bordage G, Koerber A, et al. Maximizing measurement efficiency and reliability: optimum number of options for multiple-choice items and optimum number of questions for case-based testlets, 2008. Unpublished paper based on research project funded by the Joint Commission on National Dental Examinations' Innovative Dental Assessment Grant.
12. Lee G, Frisbie DA. Estimating reliability under a generalizability theory model for test scores composed of testlets. Appl Meas Educ 1999;12(3):237-55.
13. Joint Commission on National Dental Examinations. National board dental hygiene examination: 2009 guide. Chicago: Joint Commission on National Dental Examinations, 2009.
14. Brennan RL. Manual for urGENOVA version 2.1. Iowa Testing Programs Occasional Papers no. 49. Iowa City: University of Iowa, 2001.
15. Crick JE, Brennan RL. Manual for GENOVA: a generalized analysis of variance system. American College Testing Technical Bulletin no. 43. Iowa City, IA: ACT, Inc., 1983.
16. Pinckard NA, McMahan CA, Prihoda TJ, et al. Short-answer examinations improve student performance in an oral and maxillofacial pathology course. J Dent Educ 2009;73(8):950-61.
