8 Reliability and validity

Author: Julian Rich

Objectives

At the end of this chapter you will be able to:

• explain what reliability is;
• define validity;
• distinguish reliability from validity;
• describe internal consistency and stability reliability;
• compare the types of validity: construct, content, and criterion-related validity;
• describe how to measure reliability;
• describe how to measure validity;
• explain how to increase reliability; and
• explain how to increase validity.

CONTENTS

Improving the quality of the study: Reliability and validity of measures
Types of reliability
Types of validity
Conclusion
References
Chapter review questions

Part 4 Measurement

Improving the quality of the study: Reliability and validity of measures

Constructs and measures

This chapter begins by defining some key terms. Following Edwards and Bagozzi (2000), a construct is a conceptual term for a phenomenon of theoretical interest. Constructs are thus concepts that exist as part of a theoretical language. Examples of constructs used in management research are ‘total quality management’, ‘transformational leadership’, and ‘emotional intelligence’. Most constructs of interest to researchers are conceptualised as variables; that is, they can take on different values or states, whether qualitative or quantitative in nature. For this reason, constructs are often called latent or unobserved variables.

Because constructs are abstractions, researchers need to be able to operationalise or measure them in an empirical study. A measure is defined here as a score or observed value that is taken to empirically represent a construct (Edwards & Bagozzi, 2000). Measures may be gathered through methods of data collection such as questionnaires, documentation, and observation. We can thus speak of measured or observed variables as indicators of their respective latent constructs. However, no measure is a perfect representation of the underlying construct. An important part of empirical research is to maximise the reliability and validity of measures.

Reliability and validity of measures

‘Reliability’ refers to the extent to which a measure is free of random measurement error (Smithson, 2005). A perfectly reliable measure has no random measurement error. Because each observed (i.e., measured) score is composed of a ‘true’ score and measurement error, reliability can be defined as the ratio of the true score variance to the observed score variance. (The variance is the mean of the squared deviations from the mean, and the standard deviation is the square root of the variance.) If there is random measurement error, the measure has less-than-perfect reliability. Of course, most measures used in research are imperfect. However, if a measure’s reliability is too low, it cannot be used in research. Note that reliability is a property of the scores (i.e., the measures) and not of the instrument or procedure used to gather the data. Therefore, reliability must be tested each time an instrument is used to generate scores for a sample.

Validity is whether the researcher is measuring the construct he or she purports to be measuring. In other words, it is the extent to which a measure measures what it is supposed to measure. For example, if a researcher examines a measure of self-esteem, he or she needs to ask whether it really measures self-esteem, or whether in fact it measures self-confidence (a similar self-evaluation), or lack of depression, or lack of anxiety, or life satisfaction (other measures of positive affect/attitudes that are highly related to self-esteem). Validity is the degree of confidence that a researcher can have in inferences drawn from scores, and the confidence that a researcher can have in the meaning attached to scores.

It is important to understand that a measure cannot be valid unless it is reliable, but a measure can be reliable without being valid. Reliability is thus a necessary but not sufficient condition for validity. Reliability and validity apply to both qualitative and quantitative data. Often, it is easier to assess reliability and validity with quantitative data; however, in our opinion, they are equally important with qualitative data.
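
The definition of reliability as the ratio of true score variance to observed score variance can be illustrated with a small simulation. The sketch below is not from the chapter: the function name, the normal distributions, and the chosen standard deviations are all assumptions made for illustration only.

```python
import random
import statistics


def simulate_reliability(true_sd, error_sd, n=20000, seed=1):
    """Return var(true scores) / var(observed scores) for simulated data.

    Hypothetical helper: scores are drawn from normal distributions,
    which is an assumption of this sketch, not a claim of the chapter.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n)]
    # Each observed score is a true score plus random measurement error.
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)


# No random error: observed scores equal true scores, so reliability is 1.
no_error = simulate_reliability(true_sd=10.0, error_sd=0.0)

# Error variance of 25 on top of true variance of 100: the expected
# ratio is 100 / 125, so the estimate should be close to 0.8.
some_error = simulate_reliability(true_sd=10.0, error_sd=5.0)

print(no_error, round(some_error, 3))
```

In real data the true scores are never observed directly, which is why reliability has to be estimated indirectly with the coefficients discussed later in this chapter.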

The necessity for reliability and validity

Studies that use measures with poor reliability and validity produce data, both quantitative and qualitative, that lack rigour. Consequently, the researcher cannot justify the use of these measures because other interpretations could be drawn from the data. For example, statistics such as correlation coefficients are attenuated (reduced in size) due to the presence of measurement error. Researchers often measure relationships between variables (e.g., between intentions to leave a job and actual labour turnover). If a researcher has measures with low reliability, he or she is less likely to detect associations between variables that are in fact related. The reason for this is that low reliability weakens the observed effect size and thereby limits the statistical power to detect relationships with another variable.
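
The attenuation described above can be made concrete with the classical attenuation formula from test theory, in which the observed correlation equals the correlation between true scores multiplied by the square root of the product of the two reliabilities. The chapter does not state this formula, so the sketch below brings it in as an assumption; the function name and example values are hypothetical.

```python
import math


def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Expected observed correlation under the classical attenuation formula.

    true_r is the correlation between the error-free (true) scores;
    the reliabilities are those of the two measures being correlated.
    """
    return true_r * math.sqrt(reliability_x * reliability_y)


# Hypothetical case: intentions to leave and turnover truly correlate .50.
# With perfectly reliable measures the observed correlation is unchanged.
print(attenuated_correlation(0.50, 1.0, 1.0))

# If the intentions measure has a reliability of only .64, the expected
# observed correlation shrinks.
print(attenuated_correlation(0.50, 0.64, 1.0))
```

The smaller observed correlation is harder to distinguish from zero in a sample of a given size, which is the loss of statistical power referred to above.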

If a researcher were to compose a measure of a construct and it appeared as if it measured that construct (i.e., it has face validity), this would not constitute sufficient evidence that it really does. For example, measures of intelligence may actually capture how well an individual is able to answer written tests (which he or she may be well practised in as a result of being highly educated), rather than his or her innate intelligence, as reflected in genetic inheritance. In this example, instead of measuring only intelligence (innate ability), the researcher is also measuring years of schooling and grade point average. Those individuals with higher educational levels thus score higher on this measure than those with lower educational levels. As a consequence, the scores are not a valid measure of intelligence, as they actually reflect education.

Before initiating a research project, researchers need to establish that they are using reliable and valid measures. In some areas, researchers will find that measures of the variable have already been developed, and they are advised to use these. In other areas, there may not be established measures, and researchers therefore have to develop their own measures and establish the reliability and validity of these.

Types of reliability

We have defined reliability as the extent to which a measure is free of random measurement error. The measures can be single-item or multi-item (i.e., summed or averaged across several items) scores. The development and validation of multi-item scales is discussed in more detail in Chapter 9. The following is a discussion of various ways of estimating the reliability of scores.

Internal consistency reliability

Internal consistency reliability is used for multi-item measures. If a multi-item measure has little random measurement error, the researcher would expect the items to be consistent with each other. Internal consistency reliability is typically measured by a statistic called Cronbach’s alpha coefficient (see Cortina, 1993). An alpha coefficient reflects how strongly the items in the scale are correlated with one another. It is a measure of consistency because, if the items in the scale are related to each other, the scale is an internally consistent measure. Alpha coefficients can be calculated from the number of items and the average correlation among the items, so at least two items are required in order to calculate an alpha coefficient.

An alpha coefficient normally ranges from 0 to 1. A negative alpha coefficient can occur, but it usually signals that the researcher has made a computational error (e.g., failed to reverse score negatively worded items) or that the scale is extremely unreliable. In general, measures that are highly reliable have alpha coefficients of .90 or greater, while scales that have alphas below .70 can be said to have less than fair reliability (although alphas of .60 or higher are acceptable for newly developed scales) (Nunnally, 1978).

It is important to understand that Cronbach’s alpha does not indicate that the scale is unidimensional or valid. It also needs to be remembered that, as a researcher increases the number of items, Cronbach’s alpha coefficient will also increase, provided the average intercorrelation among the items does not fall. Unless the items have a high average intercorrelation, it may be difficult to get acceptable internal consistency reliability for scales with a small number of items (e.g., two or three items).
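
As a minimal sketch of how an alpha coefficient might be computed, the code below uses two standard formulas that the chapter does not spell out: the variance-based formula for alpha from raw item scores, and the standardised formula based on the number of items and the average inter-item correlation. The function names and the five-respondent, three-item data set are invented for illustration.

```python
import statistics


def cronbach_alpha(rows):
    """Alpha from raw scores; rows holds one list of item scores per respondent.

    Assumes negatively worded items have already been reverse scored.
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_vars = [statistics.variance([row[i] for row in rows]) for i in range(k)]
    total_var = statistics.variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def standardized_alpha(k, mean_r):
    """Alpha implied by k items with average inter-item correlation mean_r."""
    return (k * mean_r) / (1 + (k - 1) * mean_r)


# Hypothetical responses of five people to a three-item scale.
scores = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 1], [3, 3, 3]]
alpha = cronbach_alpha(scores)
print(round(alpha, 2))

# Same average correlation, more items: the coefficient rises.
print(standardized_alpha(3, 0.30), standardized_alpha(10, 0.30))
```

The second pair of values shows why short scales struggle to reach the conventional thresholds unless their items are highly intercorrelated.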

Test–retest reliability

Test–retest reliability is the extent to which a measure gives the same result on two (or more) repeated administrations. If a measure is perfectly reliable, it should provide the same score on repeated administrations. For example, if a researcher measures an individual’s intelligence one week, he or she could obtain an estimate of test–retest reliability by re-measuring intelligence two weeks later using the same test. If the measure is reliable, the test scores should be similar. Similarly, if a researcher measures an employee’s job satisfaction or intentions to leave on one day, the employee’s satisfaction and intentions to leave should be approximately the same two weeks later. This type of reliability is referred to as stability. The error associated with test–retest reliability is anything that yields different scores on repeated administrations. The length of time between measures is an important consideration; a shorter interval will typically yield a higher correlation. Test–retest reliability is often used to measure reliability in single-item measures, provided the underlying construct is not expected to change substantively over time.

Test–retest reliability is measured via a correlation coefficient (e.g., Pearson’s correlation coefficient). To obtain this coefficient, the researcher correlates scores on the first administration of the measure with their matched scores on the second administration. This means that researchers require longitudinal data and need to match each respondent’s score from the first administration to his or her score from the second administration. The correlation coefficient should be positive and as high as possible. Test–retest (i.e., stability) coefficients are usually lower than estimates of internal consistency reliabilities. According to Corcoran and Fischer’s (1987) criteria, a test–retest coefficient above .80 indicates strong stability; a coefficient above .71 implies good stability; and a coefficient above .51 denotes fair stability.
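
The matching and correlating steps can be sketched as follows. The respondent identifiers, the scores, and the function names are hypothetical, and the label "below fair" is an assumption added here for coefficients that do not reach the lowest of the Corcoran and Fischer (1987) cut-offs quoted above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def stability_label(r):
    """Label a test-retest coefficient using the cut-offs quoted in the text."""
    if r > 0.80:
        return "strong"
    if r > 0.71:
        return "good"
    if r > 0.51:
        return "fair"
    return "below fair"  # assumed label, not part of the quoted criteria


# Hypothetical job satisfaction scores keyed by respondent identifier.
time1 = {"e01": 4.0, "e02": 2.5, "e03": 3.5, "e04": 5.0, "e05": 1.5}
time2 = {"e01": 4.5, "e02": 2.0, "e03": 3.0, "e04": 4.5, "e05": 2.0, "e06": 3.0}

# Keep only respondents who took part in both administrations.
matched = sorted(set(time1) & set(time2))
r = pearson_r([time1[i] for i in matched], [time2[i] for i in matched])
print(round(r, 2), stability_label(r))
```

Respondent e06 is dropped because there is no first-administration score to pair with the second one.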

Inter-rater reliability

Data are often gathered through observation. With observational data, one researcher’s observations might differ from another researcher’s observations. A similar issue arises with the analysis of qualitative (textual) data. Usually with qualitative data, the researcher wants to determine whether there are identifiable themes in the text. Again, one researcher’s interpretation might differ from another researcher’s interpretation. In these types of situations, inter-rater (or inter-observer) reliability statistics can be calculated. In order to assess inter-rater reliability, two (or more) researchers should provide ratings or scores for each of the variables in the data. There are many statistics for calculating inter-rater reliability, including per cent agreement and coefficients such as Kappa. In general, inter-rater reliability should be .80 or greater in order for the researchers to conclude that they are rating consistently.
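
Per cent agreement and a kappa coefficient can be sketched for the two-rater case as below. The code assumes that Cohen's kappa is the variant intended (the chapter only says "Kappa"), and the theme codes assigned by the two raters are invented for illustration.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of cases to which both raters assigned the same code."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    # Chance agreement: product of the raters' marginal proportions per code.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical themes coded by two researchers for ten interview extracts.
rater_a = ["pay", "pay", "workload", "workload", "pay",
           "culture", "culture", "workload", "pay", "culture"]
rater_b = ["pay", "pay", "workload", "pay", "pay",
           "culture", "culture", "workload", "pay", "workload"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(agreement, round(kappa, 2))
```

Kappa comes out lower than per cent agreement here because some of the matching codes would be expected by chance alone, so the two statistics should not be judged against the same benchmark without care.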

Other measures of reliability

There are several other measures of reliability. Instead of alpha coefficients, researchers can apply split-half reliabilities to measure the internal consistency of a multi-item scale. In order to do this, the researcher can split the items of a measure into the odd (e.g., first, third, fifth) and even (e.g., second, fourth, sixth) items, and then estimate a coefficient that indicates how related the odd scores are to the even scores. If a measure is internally consistent, the two halves should be strongly related. For each respondent, the total score for the odd items is calculated, as well as the total score for the even items. Then, for the whole sample, the correlation of odd with even scores is estimated.

Other forms of reliability are estimated by developing parallel forms of the measure. The parallel forms measure the same construct or phenomenon, with very similar, but not identical, items. The correlation coefficient is calculated by administering the two forms to the same sample. This procedure is referred to as parallel forms reliability.
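
The odd-even split-half procedure can be sketched as follows. The six-item data set and the function names are hypothetical. The sketch also reports the Spearman-Brown adjustment, a commonly used correction for the fact that each half is only half as long as the full scale; the chapter does not mention this step, so it is included here as an assumption.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_half(rows):
    """Return (odd-even correlation, Spearman-Brown adjusted estimate)."""
    odd_totals = [sum(row[0::2]) for row in rows]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in rows]  # items 2, 4, 6, ...
    r = pearson_r(odd_totals, even_totals)
    return r, (2 * r) / (1 + r)


# Hypothetical responses of five people to a six-item scale.
responses = [
    [4, 5, 4, 4, 5, 4],
    [2, 3, 2, 2, 3, 3],
    [5, 5, 4, 5, 4, 5],
    [1, 2, 1, 2, 1, 1],
    [3, 3, 3, 4, 3, 3],
]
half_r, adjusted = split_half(responses)
print(round(half_r, 2), round(adjusted, 2))
```

A different way of splitting the items (e.g., first half versus second half) would generally give a somewhat different coefficient.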

Types of validity

There are several types of validity, and researchers should be familiar with them all when searching for published and/or established measures in order to make an informed decision about whether the measure assesses what it purports to measure. It is difficult to establish validity for ‘home-grown’ measures (those developed by the researcher for the study), as large sample sizes and multiple measures are required. More information on validating multi-item measures or scales can be found in Chapter 12.

Construct validity

Essentially, construct validity refers to whether a measure relates to other measures in ways predicted by an underlying theory of the construct. Construct validity comprises two subtypes: convergent validity and discriminant (also called divergent) validity. If a measure captures what it really is supposed to measure, scores on that measure should be more related to scores on other similar constructs (convergent validity) and not, or less, related to scores on dissimilar constructs (discriminant validity). For example, if a measure of managerial level actually assesses managerial level, it should be more related to constructs closely associated with managerial level (e.g., salary, the number of managerial promotions, and the number of subordinates responsible to that person) than to other constructs that may be spuriously related to advancement. The latter could be age, the number of years employed in full-time work, the number of levels in the organisation, and the organisation’s size. Thus, if the managerial level item was valid, it would be more highly correlated with the former constructs (convergent validity), and not related or less highly related to the latter constructs (discriminant validity).

In other words, the convergent and discriminant validity of a measure is assessed by determining whether the pattern of relationships in the empirical data matches that in the nomological network (i.e., the expected theoretical relationships between the construct the measure is capturing and other constructs) (Schwab, 2005). Another approach to examining construct validity is to use both exploratory and confirmatory factor analysis to look for evidence of convergent and discriminant validity.

Criterion-related validity

If a measure is valid, it should predict something that the researcher is interested in. For example, if a selection interview or a selection test is a valid measure for choosing future staff, it should predict their performance on the job. Criterion-related validity means that the measure predicts a relevant criterion. In other words, it attempts to answer the question, ‘Does it matter?’ Criterion-related validity is practical and pragmatic. However, the choice of the criterion variable is critical. Smithson (2005) notes that the criterion measure should be known to be reliable and valid already.

Criterion-related validity may be predictive or concurrent, depending on how it is measured. Predictive validity is the extent to which a measure predicts subsequent performance or behaviour. For example, scores may be obtained in a selection interview (e.g., on supervisory ability), people are subsequently hired (for research purposes, it would be best to hire everyone to avoid range restriction problems), and their job performance is measured a year later. Predictive validity is determined by the strength of the correlation (called a validity coefficient) between supervisory ability, measured at selection, and job performance, measured a year later. Alternatively, the researcher could measure the current staff on supervisory ability, using the interview, and then take their job performance scores and correlate the two. This is referred to as concurrent validity, as the measure (supervisory ability measured via interview) is correlated with a criterion (job performance) that is measured at the same point in time.

For a measure to demonstrate criterion-related validity, the validity coefficient should be as high as possible. One rule of thumb is that a relationship may be considered weak if the validity coefficient is .10, medium if .30, and strong if .50 (Cohen, 1988).
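
A validity coefficient of the kind described in this section is simply a correlation between the measure and the criterion, so it can be sketched in a few lines. The interview ratings, the performance scores, and the function names below are invented, and the label "negligible" for coefficients under .10 is an assumption added here rather than part of the quoted rule of thumb.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def effect_label(r):
    """Label a validity coefficient using the benchmarks quoted in the text."""
    size = abs(r)
    if size >= 0.50:
        return "strong"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "weak"
    return "negligible"  # assumed label, not part of the quoted benchmarks


# Hypothetical data for eight hires: supervisory ability rated at the
# selection interview, and job performance measured a year later.
interview = [3, 5, 2, 4, 1, 4, 2, 5]
performance = [60, 72, 65, 70, 50, 58, 55, 80]

validity = pearson_r(interview, performance)
print(round(validity, 2), effect_label(validity))
```

With so few cases the coefficient is very imprecise; the example only shows the mechanics of pairing each person's predictor score with his or her later criterion score.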

Content validity

Content validity refers to whether the items designed for the measure adequately cover the domain of interest. For example, an exam with content validity would have questions covering all of the content that had been covered in the course. Thus, content validity is focused on the extent to which the content of a measure is representative of a wider body of material that it is trying to assess. Content validity is often estimated by a thorough review of the relevant literature and consultation with subject matter experts, to determine whether the items in the measure have adequately sampled the domain.

Face validity

Measures that have face validity appear, at face value, as if they measure what they say they measure. Face validity is subjective. Nevertheless, all measures must have face validity. However, the fact that a measure appears to measure what it claims to measure is no guarantee that it does. Such a measure has face validity, but not empirically demonstrated validity.

Conclusion

Measures used in research need to be reliable and valid. If they are not valid and reliable, the researcher cannot be confident about the conclusions drawn from the study. This applies to interpreting both qualitative and quantitative data. Irrespective of the type of data collected, a measure needs to be reliable. The measure also needs to be valid. In other words, it needs to measure what it is supposed to measure, predict relevant criteria, cover the content underlying the construct, be related to measures of similar constructs and unrelated (or less related) to measures of dissimilar constructs, and not be contaminated by method factors such as social desirability. This may require the use of published measures, which have been through rigorous reliability and validity checks. Alternatively, researchers may use hard data (e.g., number of sales for measuring performance), the validity of which can be more easily demonstrated. Often it is best for researchers to use multiple measures, which allows them to determine whether a number of the measures converge, providing evidence of construct validity.

References

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Corcoran, K. & Fischer, J. (1987). Measures for clinical practice: A sourcebook. New York: Free Press.
Cortina, J.M. (1993). What is coefficient alpha? Journal of Applied Psychology, 78, 98–104.
Edwards, J.R. & Bagozzi, R.P. (2000). On the nature and direction of relationships between constructs and measures. Psychological Methods, 5, 155–174.
Nunnally, J.C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Schwab, D.P. (2005). Research methods for organizational studies. Hillsdale, NJ: Lawrence Erlbaum Associates.
Smithson, M. (2005). Statistics with confidence. Thousand Oaks, CA: Sage Publications.

Chapter review questions

1 What is reliability?
2 What is validity?
3 Are there any differences in the need for reliability and validity in qualitative and quantitative data?
4 What is internal consistency reliability?
5 What is test–retest reliability?
6 How do internal consistency reliability and test–retest reliability differ?
7 What is inter-rater reliability?
8 How is inter-rater reliability calculated?
9 What are construct, content, and criterion-related validity?
10 How do the various measures of validity differ?
