Designing a Useful Likert Scale to Measure Average Group Response


Running head: DESIGNING USEFUL LIKERT SCALES

Designing a Useful Likert Scale to Measure Average Group Response
Paper Presented to the American Evaluation Association, November 2005
by

Randall S. Davies

[email protected]
Indiana University South Bend
South Bend, Indiana, U.S.A.

Special acknowledgement to Lee Smith, Colleen New, and Jenny Daken, Indiana University South Bend, South Bend, Indiana, U.S.A.

ABSTRACT

Developing a Likert scale to gather data is easy. Designing an instrument and scale to accurately gather average group response data that can also be interpreted in a meaningful way is more challenging. This paper reports the findings of a project to combine a fully anchored Likert scale with a numerical rating scale. The resulting scale for a follow-up survey utilizes fixed anchor points along with visual cues to provide a layer of direction for the respondents; it also uses numerical continuums between major anchors to elicit a second, more discriminating layer of information. The results are easily interpreted on the original scale and are also more appropriate for longitudinal comparisons.

Designing a Useful Likert Scale to Measure Average Group Response

In 1932, Rensis Likert first reported getting highly satisfactory and reliable data from summated rating scales he developed to measure attitude. His Likert scale, as it is now known, has grown in popularity and is used extensively in evaluations to measure not only attitudes and opinions about various personal phenomena, but also for many other purposes, including rating human performance and ability. In an evaluation of a teacher preparation program, for example, an important criterion for accreditation is often the requirement that the school of education gather evidence from external stakeholders regarding their satisfaction with, and opinions about, teacher candidates being prepared at that particular institution. Follow-up surveys using Likert scales are typically employed to provide this information. Aggregated scores from individual items and average scores from groups of items are commonly used to identify successes or determine areas for needed improvement. The challenge is to ensure that the design of the instrument will provide accurate data that can be interpreted in a meaningful way.

There are two important aspects relevant to the collection of survey data: rater criterion (i.e., how individual raters determine their response) and quantification (i.e., the method used to assign numerical values intended to characterize the construct of interest). The issue of how raters determine a response is important; however, this study deals only with the second aspect of data collection, quantification: more specifically, how well a scale facilitates the collection of reliable data that can be easily interpreted and is useful for decision making. The purpose of this study was to design such a scale and determine the benefits and limitations associated with its use.

Background Information

Valid and Reliable Data
Too often, individuals seem to forget that quality evaluation results depend on the evaluator's ability to collect valid and reliable data. Data is valid when it provides a good estimate of the construct, characteristic, or attribute being measured; in other words, the instrument measures what it was supposed to measure. Instruments are considered reliable when the data they provide is a consistent estimate of what the instrument actually measures, intended or otherwise (Linn & Miller, 2005; Nunnally, 1978). Data scales help ensure that multiple raters provide more consistent (i.e., reliable) measurements; they also provide a context for understanding the result.

The scale used will affect the reliability and usefulness of the data collected. Consider, for example, a situation where you want to determine participant satisfaction with a program or product you are evaluating. A data collection scale that asks respondents to indicate simply whether they are "satisfied" or "not satisfied" is incapable of providing evidence about the degree of satisfaction in the group response; more response options are needed. On the other hand, data provided by a numerical rating scale continuum utilizing dichotomous anchors at each endpoint to broadly delimit the response (e.g., "extremely unsatisfied" to "extremely satisfied") can be equally problematic. The first scale is very reliable but has limited interpretation potential. The second is much less reliable (i.e., it is unable to ensure that different raters indicate their degree of satisfaction at the same point) because it provides inadequate guidance for individuals to consistently quantify their response; still, this type of scale has the potential to provide greater interpretive insights. The strength of a fully anchored scale is inter-rater consistency, while the strength of a numeric scale continuum is interpretation and differentiation potential. The two scales used in this example are, of course, two extreme cases; various scales exist that fall somewhere between these two designs.

Trait vs. State
One additional issue worth mentioning is that an instrument's reliability often depends on the stability of the construct characteristic being measured. For example, if each individual respondent's satisfaction is mutually exclusive of that of other raters and tends to change dramatically and erratically each day (i.e., a fluctuating momentary state), then the reliability of a survey instrument may be of little importance and impossible to determine accurately. If, however, the attribute being measured is more stable (i.e., an unwavering personality trait), then the reliability of the data collected and the scale used to collect this information can be very important. Sometimes it is impossible to determine the degree to which the construct being measured is a trait or a state (Snow, Corno, & Jackson, 1996). What is important to remember is that a survey may best be described as a point-in-time measurement estimate of a specific construct.

Ordinal and Interval Data
Typically, data obtained from fully anchored Likert scales are considered to be ordinal by nature, as it is unlikely respondents will uniformly perceive the specific anchors to form equal intervals (Goldstein & Hersen, 1984). This would make the mathematical averaging of such data suspect. In other words, while the assessment of each individual's opinion may be reliable and valid for that person, averaging the opinions of all the respondents may not be particularly useful and may in fact be inappropriate. For example, if about half the respondents think the teachers trained at a particular school are exceptional, and the other raters think the performance of students from that institution is unacceptable, does this make the school adequate? The distribution of the results needs to be considered in addition to the central tendency. On the other hand, data provided by a numerical rating scale may be of limited usefulness if the raters are not trained to properly quantify their responses on the scale being used. The data provided by such a scale is more likely to be considered interval data, and thus the mathematical average has meaning; but the mean produced by such instruments often has little practical application for the evaluator because there is no way of knowing what a specific average response represents (e.g., on a scale of one to ten, is a 4.51 good or bad?) or whether the value truly (i.e., reliably) represents the overall group response in a meaningful way.
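To make this concern concrete, the following minimal sketch (in Python, with hypothetical response vectors rather than data from this study) shows how two very different groups rated on an ordinal four-point scale can produce the same average; the distribution, not just the mean, is what distinguishes them.

# Hypothetical example: a polarized group and a uniformly middling group
# can have identical means on a 4-point ordinal scale (1 = unacceptable,
# 4 = exceptional), so the mean alone hides the disagreement.
from collections import Counter
from statistics import mean

polarized = [4] * 10 + [1] * 10       # half rate exceptional, half unacceptable
middling  = [2] * 10 + [3] * 10       # everyone rates near the middle

for name, ratings in [("polarized", polarized), ("middling", middling)]:
    print(name, "mean =", mean(ratings), "counts =", dict(Counter(ratings)))
# Both means are 2.5, yet the two groups could hardly be more different.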

Combining Scale Designs

An anchored rating scale's strength is inter-rater consistency (i.e., raters know exactly what each option represents); its weakness is that it produces ordinal data with limited interpretation potential. The strength of a numerical rating scale is that it is considered to produce interval level data; but without the guidance of anchor labels, inter-rater reliability (i.e., the degree to which raters consistently indicate similar levels of satisfaction at the same point) is diminished. The challenge is to properly combine the essential elements of each to maximize the overall benefit in a hybrid design. This was accomplished by using a partially anchored numeric rating scale with fixed anchor points at key positions on the scale. In addition, visual cues were added to provide a layer of direction for the respondents. This was done to facilitate consistency in response measurement and the potential for more meaningful interpretation. Numerical continuums between major anchors were used to provide a second layer of information, allowing respondents (i.e., raters) to differentiate the degree of their response within a delimited area. This also makes the resulting data more likely to be interval level data and thus more appropriate for longitudinal or disaggregated comparisons of average group response (i.e., comparison of mean response).

This scale was originally designed for an evaluation of a teacher preparation program to gather evidence from external stakeholders regarding their satisfaction with teacher candidates being prepared in the school of education. The original instrument used a four-point Likert scale. A beta version of the revised scale was used initially (see Figure 1). Based on initial results, it was modified to improve the accuracy (i.e., avoid confusion) of response placement in the central area of the scale (see Figure 2).

Figure 1: Beta Version of Revised Scale
[A 1-8 numeric continuum with the anchors Unsatisfactory, Developing, Satisfactory, Proficient, and Exceptional.]

Figure 2: Final Version of Revised Scale
[A 1-8 numeric continuum with Unsatisfactory and Developing labeling the 1-4 range and Satisfactory, Proficient, and Exceptional labeling the 5-8 range.]
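The two-layer idea behind the revised scale can be sketched in a few lines of code. This is an illustration only: the break between options 4 and 5 (not satisfied vs. satisfied) is taken from the paper, but the helper function and its wording are assumptions, not part of the instrument.

# Illustrative sketch of the hybrid design: fixed anchors delimit broad regions
# of a 1-8 numeric continuum (here, only the satisfied / not-satisfied break
# between 4 and 5 described in the paper), while the number itself records the
# finer degree of the response within that region.
def interpret(score: int) -> str:
    if not 1 <= score <= 8:
        raise ValueError("score must be between 1 and 8")
    region = "not satisfied" if score <= 4 else "satisfied"   # break between 4 and 5
    return f"{region}, degree {score} of 8"

print(interpret(4))   # -> not satisfied, degree 4 of 8
print(interpret(5))   # -> satisfied, degree 5 of 8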

Method

To explore the degree to which rater responses vary based on the scale being used, participants in the study were asked to respond to a three-part survey. Each part of the survey asked the same series of five questions (see Table 1), each time using a different scale. A time delay of approximately 30 to 40 minutes between parts was implemented, and participants were not allowed to refer back to or change their answers from previous parts. In the first part of the survey, respondents were asked to simply write a number between 1 and 8 that best represented their response. The second part of the survey used a numeric continuum: participants selected the number representing their response from a horizontal display of numbers from 1 to 8 and were given the verbal instruction to rate each statement from "very unsatisfactory" to "very satisfactory" using the scale provided. The last part of the survey used the scale presented in Figure 1 or Figure 2, with no additional verbal instruction provided. The beta version was used with 32 respondents. The scale was then tested with three respondents using a talk-aloud interview to identify potential revisions for the final version of the scale. The final version of the scale was used with an additional 53 respondents.

Table 1: Survey questions asked
Respondents were asked to indicate the number that best represents a rating of:
Not quite satisfactory
Just barely satisfactory
Very good
Very unsatisfactory
Exceptional

Results

The results of the study are presented in this section. The survey scale development is discussed first, and results for each of the questions are then presented with analysis.

Beta Test Revisions
Based on the results of the first 32 respondents, it was noted that the instrument scale did not seem to be functioning as intended for the central portion of the scale. Raters did not seem to know where a response of "not quite satisfied" or "just barely satisfied" should be placed. Talk-aloud interviews were conducted with three raters as they completed the survey in an attempt to identify the problem. Based on their feedback, it was clear that the visual cues provided were not working as intended. The arrow signifying a point between options 4 and 5 was meant to identify the point between being satisfied and not satisfied; the intent was for respondents to make a choice at that point as to their overall satisfaction. Respondents, however, seemed to be using a range rather than a point. Based on this feedback, the graphical cues were modified and implemented in a final version of the scale.

Not Quite Satisfactory
Based on the survey results, it seems likely that the revised versions of the scale did assist respondents in consistently assigning this rating. Table 2 presents the distribution of responses by scale used, and Table 3 presents descriptive statistics for the responses by scale used.

Table 2: Response distribution for question "Not Quite Satisfactory"

Scale               N    2    3    4    5    6
1-8 rating          90   14   26   36   11   3
Numeric continuum   73   8    17   35   8    5
Beta version        36   8    8    19   1    -
Final version       53   -    3    49   1    -
Note: the expected response was 4.
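As a check on the quantification, the descriptive statistics reported in Table 3 below can be recomputed directly from the Table 2 counts. A short sketch, shown for the 1-8 rating scale row (the other rows follow the same arithmetic):

# Recompute the group mean and standard deviation for the 1-8 rating scale
# from its Table 2 response counts; the result matches the Table 3 row
# (N = 90, mean 3.59, SD 1.004).
import math

counts = {2: 14, 3: 26, 4: 36, 5: 11, 6: 3}            # Table 2, 1-8 rating row
n = sum(counts.values())
mean = sum(score * k for score, k in counts.items()) / n
var = sum(k * (score - mean) ** 2 for score, k in counts.items()) / (n - 1)
print(n, round(mean, 2), round(math.sqrt(var), 3))     # 90 3.59 1.004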

Table 3: Statistics for question "Not Quite Satisfactory"

Scale               N    Mean   SD      Range   F       Sig.    ES
1-8 rating          90   3.59   1.004   2-6
Numeric continuum   73   3.79   1.013   2-6
Beta version        36   3.36   .867    2-5     22.30
Final version       53   3.96   .275    3-5
Note: significant difference at alpha = .05; Effect Size (ES) = partial eta-squared.

The final version of the scale seemed to help raters pinpoint the intended location of this rating better than the other scales. The results indicate a significant difference in the ratings obtained from the different versions of the scale being used. While the response distribution obtained from the 1 to 8 rating scale was quite different [...]

[...] The mean rating for each scale was between 6.5 and 7.0.

Very Unsatisfactory
Based on the survey results, it seems likely that the revised versions of the scale did assist respondents in consistently assigning this rating. Table 8 presents the distribution of responses by scale used, and Table 9 presents descriptive statistics for the responses by scale used. The response difference between the beta and final versions of the scale was statistically different [F(1, 87) = 4.49, p < .05]; however, the partial eta-squared effect size reported in Table 9 was calculated to be .049, which indicates that less than 5% of the variance in the ratings is likely to have been the result of the different scales being used in the survey. This is quite small and likely of little practical significance. The beta version seemed to do a better job of helping raters pinpoint the location intended for "very unsatisfactory." The mean rating for each scale was between 1.0 and 1.25.

Exceptional
Based on the survey results, it seems likely that, regardless of the scale used, raters were able to identify the spot intended to represent a rating of exceptional. Table 10 presents the distribution of responses by scale used, and Table 11 presents descriptive statistics for the responses by scale used. Both the beta and final versions of the scale seemed to help raters pinpoint the intended location of this rating better than the other scales; a statistical comparison between the beta and final versions could not be calculated because all respondents indicated the same rating. The mean rating for each scale was between 7.9 and 8.0, and the distribution (i.e., spread and range) of the responses was only slightly different for each of the scales used.


Table 8: Response distribution for question "Very Unsatisfactory"

Scale               N    1    2    3
1-8 rating          90   72   14   4
Numeric continuum   73   63   8    2
Beta version        36   36   -    -
Final version       53   47   6    -
Note: the expected response was 1.

Table 9: Statistics for question "Very Unsatisfactory"

Scale               N    Mean   SD      Range   F      Sig.    ES
1-8 rating          90   1.24   .526    1-3
Numeric continuum   73   1.16   .441    1-3
Beta version        36   1.00   .000    1-1     4.49   .037    .049
Final version       53   1.11   .044    1-2
Note: alpha = 0.05.
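The F statistic and partial eta-squared in Table 9 can be reproduced from the Table 8 counts for the beta and final versions (36 ratings of 1 for the beta version; 47 ratings of 1 plus the 6 ratings of 2 implied by the final version's reported N, mean, and range). A sketch of the calculation, which also shows why an effect size of .049 is read as the scale accounting for roughly 5% of the rating variance:

# One-way ANOVA by hand for the beta vs. final comparison on the
# "Very Unsatisfactory" question: partial eta-squared is the share of total
# variance attributable to the scale version, SS_between / (SS_between + SS_within).
import numpy as np

beta = np.repeat(1.0, 36)                                   # all beta raters chose 1
final = np.concatenate([np.repeat(1.0, 47), np.repeat(2.0, 6)])

grand = np.concatenate([beta, final]).mean()
ss_between = beta.size * (beta.mean() - grand) ** 2 + final.size * (final.mean() - grand) ** 2
ss_within = ((beta - beta.mean()) ** 2).sum() + ((final - final.mean()) ** 2).sum()

f_stat = ss_between / (ss_within / (beta.size + final.size - 2))   # df = 1, 87
eta_sq = ss_between / (ss_between + ss_within)
print(round(f_stat, 2), round(eta_sq, 3))                          # 4.49 0.049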

Table 10: Response distribution for question "Exceptional"

Scale               N    4    5    6    7    8
1-8 rating          90   -    1    -    1    88
Numeric continuum   73   -    1    -    3    69
Beta version        36   -    -    -    -    36
Final version       53   -    -    -    -    53
Note: the expected response was 8.

Table 11: Statistics for question "Exceptional"

Scale               N    Mean   SD      Range   F    Sig.   ES
1-8 rating          90   7.96   .332    5-8
Numeric continuum   73   7.92   .400    5-8
Beta version        36   8.00   .000    8-8     *    *      ---
Final version       53   8.00   .000    8-8
Note: significance comparison cannot be calculated.
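The note on Table 11 follows directly from the data: with every respondent in both groups choosing 8, there is no within-group (or between-group) variance, so the F ratio is 0/0 and cannot be computed. A one-line check:

# With identical ratings everywhere, both sums of squares are zero, so the
# F statistic for comparing the beta and final versions is undefined.
import numpy as np

beta, final = np.repeat(8.0, 36), np.repeat(8.0, 53)
ss_within = ((beta - beta.mean()) ** 2).sum() + ((final - final.mean()) ** 2).sum()
print(ss_within)   # 0.0 -> the F ratio's denominator (and numerator) vanish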

Discussion and Conclusions

Developing a Likert scale to gather data is easy. Designing an instrument and scale to accurately gather average group response data that can also be interpreted in a meaningful way is more challenging. Using a fully anchored scale can limit the potential interpretive value of the result. Using a numerical continuum with dichotomous anchors at each endpoint to broadly delimit the response has an increased potential for producing unreliable data. Using a partially anchored scale along with visual cues to provide a layer of direction for the respondents helps respondents consistently apply their ratings; combining this with numerical continuums between major anchors provides a second layer of information, making the data more discriminating. The results of this study seem to indicate that this type of scale does increase the reliability and the potential usefulness of the resulting data set.

The scale one chooses to use in a survey instrument does seem to make a difference in the result. The average rating may not be affected drastically if the sample size is large enough, but the potential for obtaining inconsistent data (i.e., in terms of variance and range) increases considerably if the scale fails to function properly. Raters clearly use different criteria for determining their responses, but if the scale can assist raters in quantifying their rating (i.e., indicating their response on the scale), the resulting range of responses for similar ratings decreases dramatically. This is especially true for sections of the scale where anchors are not used. Unreliable results will affect any comparisons and interpretations the evaluator may make.

Along with the need for data-driven decision making, there is a need to make sure the data we use is as accurate and useful as possible. The use of clear anchors increases inter-rater reliability when quantifying the aggregate response and helps when interpreting the result, but this must be carefully balanced with scale options that can produce interval level data so the results are more easily compared. Many instruments that utilize various scales gather data, but the usefulness of the aggregate result is suspect, or not as useful as it could be, if the scale used fails to function properly. Measurement errors occur for various reasons, and often their cause is out of the evaluator's control. Error introduced as a result of scale design and function deficiencies, however, is one aspect of an evaluation that can be controlled by the investigator.

Developing a useful scale takes time. The anchor labels and graphic cues used in the scale need to be tested to make sure they function as intended. Using a scale like the one developed in this study has the potential to produce reliable and useful data, but the scale used must also match the purpose. The resulting data set obtained from such a scale will be more easily interpreted on the original scale, and the data are more appropriate for longitudinal comparisons because they are more reliable and more likely to be considered interval level data.

REFERENCES

Goldstein, G., & Hersen, M. (1984). Handbook of Psychological Assessment. New York: Pergamon Press.

Likert, R. (1932). A Technique for the Measurement of Attitudes. New York: Archives of Psychology.

Linn, R. L., & Miller, D. M. (2005). Measurement & Assessment in Teaching (9th ed.). Saddle River, NJ: Prentice Hall.

Nunnally, J. (1978). Psychometric Theory. New York: McGraw-Hill.

Snow, R., Corno, L., & Jackson, D. (1996). Individual differences in affective and conative functions. In D. C. Berliner & R. C. Calfee (Eds.), Handbook of educational psychology (pp. 243-310). New York: Macmillan.
