THE RELIABILITY OF THE ANES FEELING THERMOMETERS: AN OPTIMISTIC ASSESSMENT


Robert N. Lupton Michigan State University [email protected] William G. Jacoby Michigan State University [email protected]

Prepared for presentation at the 2016 Annual Meetings of the Southern Political Science Association. San Juan, Puerto Rico, January 7, 2016.

ABSTRACT

The feeling thermometer is a rating-scale survey question intended to measure affective responses toward a stimulus object, such as a political candidate or party. Items using this format have been included on the American National Election Studies’ interview schedule since 1964. Recently, however, they have been criticized for being “notoriously unreliable.” To our knowledge, the reliability of the feeling thermometer items has never been tested directly. We employ methods for three-wave panel data to test the reliability of the thermometer items included in the 1980 NES Major Panel Study and the ANES 2000-2002-2004 Panel Study. We show that most thermometer items are extremely reliable, both in absolute terms and relative to other commonly-used items from the ANES surveys. To the extent that there are thermometer ratings with lower reliability levels, they are due more to the stimuli than to a problematic item format. For these reasons, we believe the feeling thermometers can continue to be used as measures of affective responses to candidates, parties, groups, and other political phenomena.

Feeling thermometers are ubiquitous in the study of American political behavior. But, despite their wide usage, the feeling thermometer survey items have been criticized for possessing questionable measurement characteristics. While this criticism has taken several forms over the years, a recent and particularly salient charge is that the feeling thermometers are “notoriously unreliable” (Broockman, Kalla, Aronow 2015). Interestingly, however, the reliability of the individual feeling thermometer items apparently has never been tested directly. This seems like a major omission for survey questions that are used so widely. The current study attempts to fill this void. Using a methodology for measuring the reliability of a single item with three-wave panel data, we find that the feeling thermometers are quite reliable. And variations in reliability seem to be explained by the nature of the stimulus objects being evaluated, rather than the format of the thermometer items. In the interest of brevity, the analysis reported below only uses data from the American National Election Studies’ (ANES) 1980 Major Panel Study and the ANES 2000-2002-2004 Panel Study. But it is important to emphasize that the results generalize: Virtually identical findings are produced using the other three-wave panel studies that the ANES deployed over the years. While there may be other measurement concerns about the feeling thermometers, item reliability is not a hindrance to using these survey questions to elicit affective responses to stimulus objects.

BACKGROUND

The term “feeling thermometer” refers to a question format used to obtain rating scale responses from survey respondents toward specific stimulus objects. These items were first used in the 1964 ANES Pre-Post Survey to obtain evaluations of various sociodemographic and political groups.
Starting in 1968, feeling thermometers have been used for presidential candidates, other political figures (e.g., vice-presidential candidates), and governmental institutions (e.g., the U. S. Supreme Court). While the exact wording has varied somewhat, depending upon the year and the interview format (face-to-face or over the telephone), a typical version of the question is as follows:

Please look at page 2 of the booklet. I’d like to get your feelings toward some of our political leaders and other people who are in the news these days. I’ll read the name of a person and I’d like you to rate that person using something we call the feeling thermometer. Ratings between 50 degrees and 100 degrees mean that you feel favorable and warm toward the person. Ratings between 0 degrees and 50 degrees mean that you don’t feel favorable toward the person and that you don’t care too much for that person. You would rate the person at the 50 degree mark if you don’t feel particularly warm or cold toward the person. If we come to a person whose name you don’t recognize, you don’t need to rate that person. Just tell me and we’ll move on to the next one.

The booklet page mentioned in the question contains a stylized rendering of a thermometer, with nine labelled points ranging from 100° (labelled “Very warm or favorable feeling”) through 50° (labelled “No feeling at all”) to 0° (labelled “Very cold or unfavorable feeling”). In principle, respondents can answer a thermometer question with any number between zero and one hundred; thus, the format could be regarded as a 101-point scale. In reality, most respondents give integer responses that end in zero (e.g., 50 or 60), leading many researchers to treat the thermometers as nine- or eleven-point scales.

The feeling thermometers have proven to be extremely popular, and they (or close variants) have been deployed by several different survey research organizations beyond the ANES (e.g., Alwin 1997; Delavande and Manski 2010; Adida, Davenport, McClendon 2015). Feeling thermometers have been used for many different purposes, including measuring attitudes toward single candidates and political figures (e.g., Reynolds 1974; Feldman 1988; Adida et al.
2015); modeling vote intentions for single candidates (Delavande and Manski 2010) or between a pair of candidates (Markus and Converse 1979; Page and Jones 1979; Jacoby 1999); developing multidimensional representations of ideological orientations (Conover and Feldman 1981) and party identification (Weisberg 1980); operationalizing spatial theories of voting (e.g., Rabinowitz 1978; Rabinowitz and Macdonald 1989); modeling public perceptions of presidential candidates (Weisberg and Rusk 1970; Jacoby 1986; Jacoby and Armstrong 2014); and modeling perceptions of, and affect toward, sociodemographic groups (Zinni, Mattei, Rhodebeck 1997; Kalkan, Layman, Uslaner 2009). Thus, analyses of feeling thermometer responses have played a major role in studies of the mass public’s orientations toward politics and the social world more generally.

Potential Measurement Problems

Given their widespread usage, it may be somewhat surprising to note that the measurement characteristics of feeling thermometer ratings remain open to some serious questions. For example, most analysts treat the thermometers as if they provide interval-level measurement, although few say so explicitly. Nevertheless, several researchers have suggested that the function relating “true”
feelings about stimulus objects to thermometer ratings of those objects is monotonic rather than linear in form (e.g., Rabinowitz 1976; Winter and Berinsky 1999; Jacoby 1999), thereby implying that the proper level of measurement is ordinal.

Another potential problem is that different people may use the thermometer scale in different ways. For example, some may use the full range of possible values from zero to 100 to express their feelings toward a battery of stimuli, while others confine their ratings to a narrower interval along the 101-point scale. Or individuals may make finer distinctions about their feelings toward objects they like than they do toward objects they dislike. In either case, this is the problem of interpersonal incomparability of scores, or differential item functioning. Several studies have used indirect evidence to suggest that this may be occurring among thermometer ratings (e.g., Brady 1985; Wilcox, Sigelman, Cook 1989; Winter and Berinsky 1999).

Still another potential problem involves the effects of random measurement error on the thermometer scores. To the extent that this occurs, it lowers the reliability of the measurement. And recent events suggest that this might be the case: Broockman, Kalla, and Aronow (2015) state that “Feeling thermometers are notoriously unreliable survey items. That is, in a technical sense, subject’s responses to feeling thermometers typically contain a fairly large amount of random measurement error” (page 14). Now, the paper in which this statement appears is unpublished and currently only available online. But it still has had a major impact on the social science research community: It was precisely this supposed unreliability that led Broockman and his colleagues to question the integrity of the data analysis in the article, “When Contact Changes Minds: An Experiment on Transmission of Support for Gay Equality,” by Michael J. LaCour and Donald P. Green (2014). Broockman et al.
show that there are implausibly high test-retest correlations between feeling thermometers for “gay men and lesbians” on different waves of a panel study used in the LaCour and Green article. The average of the six pairwise coefficients (across the four waves of the panel study) is 0.952. We agree completely that such correlations are far too large to be believable. But we are more hesitant about the motivation that led to this conclusion. Broockman et al. provide no evidence about the poor measurement properties of the feeling thermometers beyond the conventional wisdom or common knowledge that seems to be implied by the term “notoriously unreliable” (emphasis added). We believe that more systematic investigation is necessary before tarring a seemingly useful battery of survey items with such a broad brush.

We start with the general assumption that no measurement is perfect; that is, random errors always occur, regardless of the exact context of the measurement. Hence, all measures are less than perfectly reliable. In the context of survey research, feeling thermometers are intended to reduce measurement error by giving respondents a “longer” continuum on which to place themselves than is the case with three-, five-, or seven-point bipolar response scales (often called “Likert-type items”). If respondents really do represent their true feelings about the stimulus object with greater precision, then the feeling thermometer format should reduce random measurement error. But many researchers have noted that, despite the potential for 101 different scores, the vast majority of respondents place themselves at one of the nine labelled points on the feeling thermometer card or at one of the eleven values that end in zero (e.g., Rabinowitz 1976; Alwin 1992; 1997; Broockman et al. 2015). The resultant “lumpy” distributions of scores may suggest that people are not capable of making the fine distinctions implied by the relatively continuous thermometer format (e.g., Winter and Berinsky 1999). If that is the case, then error is introduced, again reducing the reliability of the resultant measure. Thus, there probably is reason for concern about the reliability of the feeling thermometers.

Prior Research on Measurement Error and Reliability

Previous research on this topic is surprisingly limited. Green (1988) used a structural equation model with feeling thermometers as manifest indicators of latent variables to test multidimensional representations of party identification and ideology.
He shows that conclusions of multiple partisan and ideological dimensions probably were incorrect and due to the combination of random and nonrandom measurement error in the feeling thermometers. Nevertheless, Green’s results also show that the feeling thermometers perform at least as well as other indicators of latent partisan and ideological affinities. Furthermore, they show robust correlations with relative candidate preferences, which are measured by taking differences between feeling thermometer ratings of the Democratic and Republican presidential candidates. So, while there certainly is measurement error present, the variables work in a manner consistent with theoretical expectations. Green concludes that “(t)he evidence . . . speaks well not only for the validity of the measures used in our model but also for the substantive importance of group evaluations as a topic of study” (1988, page 776).

Krosnick (1991) examined the temporal stability of symbolic versus non-symbolic attitudes. The former category included party identification and ideology. Krosnick included attitudes toward social groups in the latter category, and six of his seven measures were based on thermometer ratings. He reported that “the party (and) ideology . . . reliabilities are higher than the . . . social groups reliabilities . . . which in turn are significantly larger than the efficacy and trust reliability” (page 561). But, any comparison for present purposes is inconclusive, because Krosnick also used feeling thermometers to measure party identification and ideological orientations. Thus, the reliabilities of the thermometer-based measures range from high to moderate levels; it is difficult to draw more precise conclusions from the results presented in Krosnick’s article. Probably most relevant to the present study, Duane Alwin carried out several analyses comparing the reliabilities of feeling thermometer items to the reliabilities of survey items with fewer response categories. One study used three-wave panels from the ANES (Alwin 1992) and the other employed a multitrait-multimethod approach to examine items from the University of Michigan’s 1978 Quality of Life Survey (Alwin 1997). The results in both studies showed that feeling thermometers produce more reliable measures than do response formats with fewer categories. Nevertheless, there are limitations to the utility of these findings for present purposes. Specifically, the NES results are only presented as average reliabilities across a set of items; therefore, it is impossible to discern from Alwin’s article the reliability of specific thermometer items. And, the Quality of Life Survey had respondents use the thermometers to rate their satisfaction with various aspects of their own lives. 
This is a very different substantive context than the evaluations of political stimulus objects for which the feeling thermometers are used by most political scientists.

A recent major study of measurement error and reliability in survey questions focuses on the methodological and substantive benefits of combining responses to individual items into multiple-item measures (Ansolabehere, Rodden, Snyder 2008). Several of the examples used in this analysis combined feeling thermometers with other items to form relatively reliable measures of issue attitudes. We certainly agree that the benefits of multiple-item measures are enormous. But the use of such measures implies that the scores on the constituent items are not of intrinsic interest because they are merely error-laden replications of the attitude object being tapped by the multi-item measure. Thus, Ansolabehere et al. really do not address the reliability of the feeling thermometer items per se, other than asserting that each one is relatively unreliable when taken on its own.

To summarize, the overall number of relevant studies is very small. And several of those had different objectives than evaluating the reliability of the feeling thermometers. Furthermore, the results often are not presented in great detail, making it impossible to evaluate the reliability of thermometers toward specific objects. In several of the studies, the feeling thermometers themselves are regarded as limited manifest indicators of more general concepts. We believe the latter assumption may be problematic because each feeling thermometer is intended to measure a person’s attitude toward a specific stimulus object. The stimuli represent important elements of the political world, such as presidential candidates, prominent public figures, political parties, governmental institutions, and sociodemographic groups. Therefore, we believe that more attention needs to be given to the reliabilities of the feeling thermometer items.

METHODOLOGICAL STRATEGY AND DATA

Following standard principles of measurement theory, we assume that a feeling thermometer score, shown as X, can be broken down into two components as follows:

$$X_i = T_i + E_i \qquad (1)$$

In equation (1), $X_i$ is respondent $i$’s thermometer rating of some stimulus, $T_i$ is $i$’s “true” attitude toward the stimulus, and $E_i$ is the error or discrepancy between the observed thermometer score and the actual attitude. Taking the variance of $X$ produces:

$$\sigma_X^2 = \sigma_T^2 + \sigma_E^2 + 2\sigma_{T,E} \qquad (2)$$

The last term on the right-hand side of equation (2) is twice the covariance of the true attitudes and the errors. But, let us assume (following fairly standard practice) that the two are uncorrelated. In that case, the last term in equation (2) drops out, and the total observed variance ($\sigma_X^2$) breaks down neatly into two components, the variance in the true attitudes ($\sigma_T^2$) and the variance of the errors ($\sigma_E^2$). Dividing both sides by $\sigma_X^2$ shows that the total variance in the observed thermometer scores can be divided into the proportion that is due to variance in the true attitudes, and the proportion that is due to error. The former is defined as the reliability of the feeling thermometer, $X$, when it is used as a measure of the attitude toward the object:

$$\mathrm{Rel}_X = \frac{\sigma_T^2}{\sigma_X^2} \qquad (3)$$
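The variance decomposition behind this definition can be illustrated with a small simulation. The sketch below is ours, not part of the analysis reported in this paper: the function names, the normal distributions, and the standard deviations (20 for true attitudes, 10 for errors) are all hypothetical choices made only to show that, when true scores and errors are uncorrelated, the ratio of true variance to observed variance is approximately what the definition implies.

```python
import random
import statistics


def reliability(true_var, error_var):
    """Share of observed variance due to true attitudes, assuming true
    scores and errors are uncorrelated (so the two variances add)."""
    return true_var / (true_var + error_var)


def simulate_reliability(n=20000, true_sd=20.0, error_sd=10.0, seed=1):
    """Simulate X = T + E with independent normal T and E (a stylized
    assumption; real thermometer scores are bounded and lumpy) and
    return the sample ratio var(T) / var(X)."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n)]
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)


if __name__ == "__main__":
    # Theoretical value for these hypothetical settings: 400 / (400 + 100).
    print(reliability(20.0 ** 2, 10.0 ** 2))
    # The simulated ratio should be close to the theoretical value.
    print(simulate_reliability())
```

With these hypothetical settings the theoretical reliability is 400/500 = 0.8. In real data the numerator is unobserved, which is why the ratio has to be estimated indirectly.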

Of course, equation (3) cannot be evaluated directly, because the true attitude, T, is unobserved. Therefore, it must be estimated indirectly. If X were a scale, defined as a linear combination of multiple indicators of T, then the reliability of X could be estimated by evaluating the internal consistency of the indicators, defined as the intercorrelations among the constituent indicators. This is precisely the basis for Cronbach’s coefficient α, by far the most widely-used estimate of scale reliability (e.g., Nunnally and Bernstein 1994). Unfortunately, we cannot use this approach because we do not have multiple indicators. Again, we are regarding each thermometer as a single measure of attitude toward the stimulus object.

With single measures, the usual approach is to use test-retest correlations to obtain estimates of reliability. The logic is that, if a question is administered to respondents at two time points, and the underlying attitude does not change during the interval between the two administrations, then the only source of empirical change is error. Hence, if the error variance is the same at both time points, the correlation between the thermometer scores at the two time points gives the proportion of the observed variance that is due to “true” variance, or the reliability. And, since the variance breaks down into two additive components, one minus that correlation is the proportion due to error.

Again, the preceding logic is based on the assumption that the true attitude is perfectly stable. In fact, it might change over time. And, with only two administrations of the feeling thermometers, there is no way to differentiate substantive change in the true attitude from error fluctuations in the scores. On the other hand, if the thermometers are administered at three time points, then there are several ways to differentiate temporal stability in the true attitude from reliability in the thermometer scores. Let j be an index for the time point, so that j = 1, 2, 3.
Now, if we are willing to assume that the $T_j$’s and $X_j$’s are both standardized to unit variance at each time point, that $T_j$ is related to $T_{j-1}$ by a single-lag autoregressive process, and that the reliability is constant (i.e., it does not vary across the time points), then Heise (1969) showed that the reliability of X can be estimated from the correlation matrix for the $X_j$’s. Wiley and Wiley (1970) relax some of Heise’s assumptions. They continue to assume the single-lag autoregressive process across the true scores and also that the error variance, $\sigma_E^2$, is constant over time. But they argue that it is more realistic to assume that the variances of the true attitudes do change over time. This implies that there can be different reliabilities for $X_j$ at each time point, rather than a single global estimate for X. Wiley and Wiley (1970) show that maximum likelihood estimates of the three reliabilities can be obtained from the covariance matrix for the $X_j$’s. We will use the Wiley and Wiley approach to estimate the reliabilities of feeling thermometer items and, for comparison, several other items from some of the ANES panel studies.

Over the years, the ANES has conducted five three-wave panel studies that include feeling thermometers for the same stimulus objects over all three waves.1 Here, we will analyze data from two of these panel studies. First, the 1980 ANES Major Panel Study interviewed survey respondents in January, June, and September 1980. Thus it measures individual attitudes across the course of a presidential campaign. The 1980 Major Panel Study includes eleven feeling thermometers toward individual political figures (most of whom were presidential candidates), six more for the Democratic Party, the Republican Party, Liberals, Conservatives, Independents, and Political Parties (generally). The Study also includes the standard seven-point party identification index and the seven-point ideological self-placement variable, along with four additional issue attitudes measured on seven-point bipolar scales.

The second data source is the 2000-2002-2004 ANES Panel Study, which includes 29 feeling thermometers that were administered on all three waves. Seven of these are for individual people, two are for Liberals and Conservatives, four tap attitudes toward governmental institutions, and 16 measure evaluations of social groups. This Panel Study also includes the party identification index on all three waves. And there are questions asking respondents whether federal spending should be increased, decreased, or kept at current levels in each of ten policy areas.
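To make the estimation strategy described earlier in this section concrete, the following Python sketch computes a Heise-type reliability from three test-retest correlations and Wiley-Wiley-type wave-specific reliabilities from a 3x3 covariance matrix. It is an illustration under stated assumptions, not the code used for the analyses in this paper: the function names are hypothetical, the covariance matrix is made up rather than taken from the ANES, and the closed-form expressions are the usual moment solutions for the just-identified three-wave, single-indicator model.

```python
import math


def correlations(cov):
    """Return (r12, r23, r13) from a 3x3 covariance matrix."""
    sd = [math.sqrt(cov[k][k]) for k in range(3)]
    r12 = cov[0][1] / (sd[0] * sd[1])
    r23 = cov[1][2] / (sd[1] * sd[2])
    r13 = cov[0][2] / (sd[0] * sd[2])
    return r12, r23, r13


def heise_reliability(r12, r23, r13):
    """Single reliability from three test-retest correlations, assuming
    constant reliability and a lag-1 process in the true scores."""
    return (r12 * r23) / r13


def wiley_wiley(cov):
    """Error variance and wave-specific reliabilities from a 3x3
    covariance matrix, assuming a lag-1 process in the true scores
    and an error variance that is constant across waves."""
    c11, c12, c13 = cov[0][0], cov[0][1], cov[0][2]
    c22, c23, c33 = cov[1][1], cov[1][2], cov[2][2]
    true_var_wave2 = c12 * c23 / c13      # variance of the wave-2 true score
    error_var = c22 - true_var_wave2      # shared error variance
    reliabilities = [(c - error_var) / c for c in (c11, c22, c33)]
    return error_var, reliabilities


if __name__ == "__main__":
    # Hypothetical covariance matrix (not ANES data), built from a true-score
    # variance of 300 at wave 1 and an error variance of 100.
    cov = [
        [400.0, 270.0, 216.0],
        [270.0, 393.0, 234.4],
        [216.0, 234.4, 347.52],
    ]
    error_var, rel = wiley_wiley(cov)
    print("error variance:", error_var)
    print("reliabilities by wave:", rel)
    print("Heise reliability:", heise_reliability(*correlations(cov)))
```

In this setup the Heise estimate coincides algebraically with the wave-2 Wiley-Wiley reliability, since both reduce to C12·C23 / (C13·C22). The first and third waves differ only because the observed variances differ while the error variance is held fixed.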
Currently, we examine only the 1980 and 2000-2002-2004 panel studies in order to keep the paper relatively brief. But future iterations will incorporate data from all five ANES panel studies that contain suitable data. For now, we want to emphasize that the results we present below definitely are generalizable. Reliabilities for the feeling thermometers from the other three panel studies are very similar to those we analyze here.

1 Feeling thermometers were included in all three waves of the 1972-1974-1976 ANES Panel Study, the 1980 Major Panel component of the 1980 ANES, the 1990-1991-1992 ANES Panel Study, the 1992-1994-1996 ANES Panel Study, and the 2000-2002-2004 ANES Panel Study.

EMPIRICAL RESULTS

The Wiley-Wiley estimates of the reliabilities for the feeling thermometer items in the 1980 Major Panel Study are given in Table 1, and they are depicted graphically in Figure 1. The reliabilities range over a fairly wide interval, from a minimum of 0.395 (for Political Parties in general in the June 1980 panel wave) to a maximum of 0.940 (for George H. W. Bush in the January 1980 wave). But the central tendency for the 1980 feeling thermometers leans toward the upper range, with a mean reliability of 0.732. On average, nearly three-fourths of the variance in the thermometer ratings is shared with the underlying attitudes toward the respective stimulus objects. This certainly appears to be an acceptable level of reliability for survey data, where previous research showed much lower average reliabilities for a variety of different kinds of survey items (Krosnick 1991; Alwin 1992).

Admittedly, the feeling thermometers are not perfect measures. Here, the mean reliability still leaves 26.8% of the observed variance in the feeling thermometers as error. Of course, we want the error variance to be as small as possible. So, is this too much? Like judgments about the size of the R² in a regression equation, it is difficult to evaluate particular reliability values on their own. Therefore, it is useful to present some comparative evidence. Table 2 and Figure 2 present the reliabilities for the other items from the 1980 Major Panel Study. As expected, the reliabilities for the party identification index are consistently high, ranging from 0.862 to 0.870. Reliability is lower for ideological self-placement, with coefficients ranging from 0.635 to 0.657.
The reliabilities for the issue questions range from 0.498 (for programs to help Russia in September 1980) to 0.724 (for defense spending in June 1980), with a mean value of 0.616. So, on survey questions about policy issues, just over 60% of the observed variance is shared with the true attitude, on average. Of course, this is quite a bit lower than the mean for the feeling thermometers.

Let us next consider the results for the 2000-2002-2004 Panel Study. The reliabilities for the feeling thermometer items are presented in Table 3 and Figure 3. Here, the coefficients vary over a wider range, from a minimum of 0.338 (for evaluations of Blacks in 2004) to a maximum of 0.888 (for evaluations of Dick Cheney in 2004). The mean reliability for these thermometers is 0.597. In the twenty-first century data, the average feeling thermometer variable only shares about 60% of its variance with the corresponding underlying attitude.

What accounts for the apparent difference between the results obtained for the two datasets? One obvious possibility is the differing sets of stimulus objects to which the thermometers were applied. In the 1980 data, 11 of the 15 stimuli were individual figures, and the remaining four were related to the political parties. In the 2000-2002-2004 data, only seven out of 29 stimuli were individual political figures; the rest were either related to the federal government or were sociodemographic groups. Even a quick glance at Figure 3 reveals that the objects in the latter two subsets tend to have lower reliabilities than do those in the first subset. Specifically, the reliabilities for individual figures range from 0.706 (for Jesse Jackson in 2002) to 0.888 (again, for Cheney in 2004), with a mean value of 0.787. The coefficients for the remaining stimulus objects range from 0.338 (again, for Blacks in 2004) to 0.745 (for fundamentalists in 2004), with a mean of 0.537. Thus, the critics may have a point when it comes to evaluations of institutions and groups. For those stimuli, the feeling thermometers share just over half of their variance (i.e., 53.7%) with the underlying attitudes. But the much higher reliabilities for the thermometer ratings of individual figures strongly suggest that the problem lies with the nature of the stimulus objects rather than with the feeling thermometer format itself.

Once again, some comparative evidence should prove useful. Table 4 and Figure 4 provide the 2000 through 2004 reliability estimates for the party identification index and for the 10 questions about federal spending on specific policies. Party identification again shows extremely high reliability, with coefficients from 0.898 in the first panel wave to 0.912 in the third wave. In contrast, the spending questions have much lower and more variable reliabilities. The coefficients for the latter items range from a very low 0.263 (for spending on programs to reduce crime in 2004) to a more acceptable 0.699 (for spending on programs to help Blacks in 2000). The mean reliability for the spending items is 0.509. We believe these results are quite telling. The reliabilities for another battery of widely-used survey questions from the ANES (i.e., the spending items) tend to be quite a bit lower than the reliabilities for the feeling thermometers, even taking into account the fact that a number of the latter are also quite low. Overall, the feeling thermometers come off very well in comparisons to other survey questions.

While the evidence from the two ANES datasets provides very few examples of unacceptably low reliabilities among the feeling thermometers, the fact remains that the reliability does vary markedly from one stimulus object to the next. And this raises the question of why the variability exists. As we saw earlier, with the 2000-2002-2004 data, the reliabilities for institutions and groups are almost uniformly lower than those for individual people. This suggests that the reliability for a given thermometer is affected by the nature of the stimulus object being evaluated. But rather than focusing on substantively-defined categories of the objects (i.e., groups versus individuals, etc.), we will consider three characteristics that are common to all of the objects being evaluated with the feeling thermometers.

First, there is the degree of polarization with respect to the object. That is, how divided are the ANES respondents in their opinions about the “target” of the feeling thermometer? This characteristic is measured very easily by simply taking the variance of the thermometer ratings for each object. Polarization is hypothesized to have a positive effect on item reliability, since more extreme opinions typically are more strongly held and resistant to change.

Second, there is the degree of opinionation with respect to each object. Stated differently, this is the extent to which people possess nonneutral feelings about each stimulus object. Opinionation also is expected to have a positive effect on reliability, since a neutral response implies that a person really has no affective reaction one way or the other with respect to the object. Opinionation is measured by taking the proportion of ANES respondents who give feeling thermometer responses that are either greater or less than 50 (i.e., the neutral value on the thermometer scale).

Third, there is the degree of consensus in feelings about the stimulus object. Here, consensus is defined as a situation in which most respondents report the same general evaluation of an object. For example, there appears to be a consensus of very positive feelings toward the elderly. This stimulus almost uniformly receives very high thermometer scores, with mean values of 80.815, 76.316, and 78.514 in 2000, 2002, and 2004, respectively. Of course, there is variability around each of these central tendencies; the corresponding standard deviations are 17.500, 17.697, and 17.282, respectively. So even the lower evaluations of the elderly will still tend to fall above the neutral point of 50. We hypothesize that consensus will be negatively related to reliability. Our reasoning is that, if everyone feels the same way about a given stimulus object, then variability in ratings will tend to stem from random fluctuations rather than meaningful differences in the thermometer scores. For each stimulus, consensus is measured by summing the thermometer scores greater than fifty, summing the scores less than fifty, and then taking the absolute value of the difference between the two sums.2 To the extent that thermometer scores are either consistently positive or consistently negative, the absolute difference between the two sums should be well above zero.

Polarization, opinionation, and consensus are used as independent variables in a regression equation to predict the reliability of the thermometers. The OLS estimates of the model parameters are obtained separately for the two ANES datasets. Note that the three “observations” for each stimulus object (corresponding to the three waves of each panel study) are not independent of each other. Therefore, robust standard errors are reported, clustered on the 15 stimuli in the 1980 data, and on the 29 stimuli in the 2000-2002-2004 data. Preliminary results for the 1980 data revealed that the three observations for George H. W. Bush are outliers that are highly influential on the estimates. Therefore, a dummy variable is included in the 1980 model coded one for these three observations and zero otherwise. Finally, inspection of diagnostic plots revealed a strongly nonlinear relationship between polarization and reliability. This pattern appeared in both datasets. Accordingly, the regressor for that variable in each model is actually -1 times the inverse of the polarization value for each stimulus object. This nonlinearity makes it a bit more difficult to describe the relationship between polarization and reliability. But it definitely provides a more accurate depiction of the structure in the data than does a linear functional form.

The OLS estimates for the regression models are shown in Table 5. Note that the model fit is very good in both datasets, with R² = 0.739 in the 1980 data, and R² = 0.834 in the 2000-2002-2004 data.
The standard deviation of the residuals is 0.060 for the 1980 model and 0.059 for the 2000-2002-2004 model, showing that the average accuracy with which reliability can be predicted from the regressors is basically identical for both equations. The signs on the coefficients for the three main independent variables are all consistent with prior expectations. And, again, the results are very similar for the two equations. Polarization and opinionation both show significant positive effects on reliability, while consensus has a significant negative impact. The outlying observations for Bush show up clearly in the highly significant positive coefficient for that variable.

² This variable’s values are divided by 10,000 in order to keep the coefficients in the regression model on a more reasonable scale.
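The consensus measure described above, together with the rescaling in footnote 2, can be sketched in a few lines of Python. This is only an illustration under a literal reading of the verbal definition: the function name, the treatment of missing responses (skipped), and the example ratings are all assumptions, not details taken from the paper.

```python
def consensus_score(ratings, neutral=50, scale=10_000):
    """Literal reading of the consensus measure for one stimulus in one wave.

    Sums the raw scores above the neutral point, sums the raw scores below it,
    takes the absolute difference, and divides by `scale` (10,000 per footnote 2).
    Skipping missing responses (None) is an assumption made for this sketch.
    """
    valid = [r for r in ratings if r is not None]
    sum_above = sum(r for r in valid if r > neutral)
    sum_below = sum(r for r in valid if r < neutral)
    return abs(sum_above - sum_below) / scale


# Hypothetical ratings, invented purely for illustration.
mostly_warm = [85, 100, 70, 60, 85, 50, 40, None]
print(consensus_score(mostly_warm))
```

Scores exactly at the neutral point of 50 contribute to neither sum, which follows from the "greater than fifty" and "less than fifty" wording.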


Given the transformations carried out on two of the independent variables, the coefficients are somewhat difficult to interpret on their own. Therefore, effects plots for the two models are shown in Figures 5 and 6. Each panel in the two figures shows the predicted reliability across the range of one independent variable, with the other independent variables held at their mean values. The first panel in each figure shows that reliability increases with polarization, with the rate of increase getting smaller across the range of the polarization variable. Although discernible in both panels, the nonlinearity is especially pronounced in the 2000-2002-2004 data. Similarly, the overall effect of polarization is greater in the latter dataset. In 1980, reliability increases by approximately 0.15 units, from about 0.65 to about 0.80, across the range of polarization values. In the later dataset, the total increase is about 0.4 units, from about 0.36 to about 0.76. The linear effects of the other two independent variables are clearly evident in the remaining two panels within each figure. Note also that the effects for each of these variables are quite similar across the datasets.

The results from these regressions show that the reliabilities of the feeling thermometer variables definitely are affected by the nature of the stimulus object being evaluated. Divisions of opinion (captured here by the polarization variable and, inversely, by the consensus measure) lead to greater systematic structure in the thermometer scores and higher reliability. Similarly, the attitudinal salience of the stimulus object, as manifested in the opinionation variable, promotes reliability. Reasonably enough, sizable segments of the mass public must have non-neutral feelings about an object in order to provide meaningful thermometer evaluations of it. In summary, the thermometer reliabilities vary in fairly obvious ways.
There is nothing in the regression results to suggest that the thermometer scores are particularly suspect representations of affective reactions toward the stimulus objects.

CONCLUSION

The results from this study should provide strong reassurance to the many social scientists who use feeling thermometers as empirical variables in their research. The charge that the feeling thermometers are extremely unreliable is simply wrong. The reliabilities for most of the feeling thermometer variables examined in this study are very high, both in absolute terms and in comparison to the reliabilities for other well-known and commonly-used variables from the ANES. And in the relatively few cases where reliabilities are low, the problem clearly lies with the nature of the stimulus object rather than the characteristics of the measurement instrument. When used to evaluate a salient stimulus for which there are sizable differences of opinion within the public, the feeling thermometers provide very reliable gauges of sentiment. But, when they are applied to ambiguous objects toward which fewer crystallized opinions actually exist, it should be no surprise that the resultant thermometer scores contain a higher amount of “noise” variation. This is precisely how any measurement procedure or tool should work.

While the results presented here are clear-cut, they definitely are not the final word on the measurement characteristics of the feeling thermometers. Other issues, such as the measurement level of the scores and the potential for differential item functioning, still require attention. But there would be little reason to worry about these latter measurement properties if the thermometer scores proved to be extremely unreliable, that is, essentially randomly fluctuating numbers rather than meaningful reflections of individual attitudes. The evidence presented here indicates that this definitely is not the case.

In fact, the findings about reliability should provide even stronger motivation for investigating the other measurement characteristics. The large amount of prior research that has been carried out using the feeling thermometers demonstrates clearly that the thermometer scores are related to a variety of other interesting phenomena. This makes it all the more important to understand more precisely how the specific thermometer scores correspond to individual attitudes (i.e., the question of measurement level) and whether one person’s scores can be meaningfully compared to the scores provided by another person (i.e., the issue of differential item functioning). Scientific progress in any field depends heavily on the quality with which its central concepts are measured, and the current study provides important evidence on exactly that point.
The general conclusion is that, if the feeling thermometers are “notoriously unreliable,” then that notoriety is badly misplaced. The actual data show that the ANES feeling thermometers definitely constitute a reliable measurement instrument.


REFERENCES

Adida, Claire L.; Lauren D. Davenport; Gwyneth McClendon. (2015) “Ethnic Cueing Across Minorities: A Survey Experiment on Candidate Evaluation in the U.S.” Public Opinion Quarterly. Forthcoming.

Alwin, Duane F. (1992) “Information Transmission in the Survey Interview: Number of Response Categories and the Reliability of Attitude Measurement.” Sociological Methodology 22: 83-118.

Alwin, Duane F. (1997) “Feeling Thermometers Versus 7-Point Scales: Which Are Better?” Sociological Methods and Research 25: 318-340.

Ansolabehere, Stephen; Jonathan Rodden; James M. Snyder, Jr. (2008) “The Strength of Issues: Using Multiple Measures to Gauge Preference Stability, Ideological Constraint, and Issue Voting.” American Political Science Review 102: 215-232.

Brady, Henry. (1985) “The Perils of Survey Research: Inter-Personally Incomparable Responses.” Political Methodology 11: 269-290.

Broockman, David; Joshua Kalla; Peter Aronow. (2015) “Irregularities in LaCour (2014).” Unpublished manuscript. Available online at https://people.stanford.edu/dbroock/sites/default/files/broockman_kalla_aronow_lg_irregularities.pdf

Conover, Pamela Johnston, and Stanley Feldman. (1981) “The Origins and Meaning of Liberal/Conservative Self-Identifications.” American Journal of Political Science 25: 617-45.

Delavande, Adeline and Charles F. Manski. (2010) “Probabilistic Polling and Voting in the 2008 Presidential Election: Evidence from the American Life Panel.” Public Opinion Quarterly 74: 433-459.

Feldman, Stanley. (1988) “Structure and Consistency in Public Opinion: The Role of Core Beliefs and Values.” American Journal of Political Science 32: 416-40.

Green, Donald Philip. (1988) “On the Dimensionality of Public Sentiment toward Partisan and Ideological Groups.” American Journal of Political Science 32: 758-780.

Heise, David R. (1969) “Separating Reliability and Stability in Test-Retest Correlation.” American Sociological Review 34: 93-101.

Jacoby, William G. (1986) “Levels of Conceptualization and Reliance on the Liberal-Conservative Continuum.” Journal of Politics 48: 423-432.

Jacoby, William G. (1999) “Levels of Measurement and Political Research: An Optimistic View.” American Journal of Political Science 43: 271-301.

Jacoby, William G. and David A. Armstrong II. (2014) “Bootstrap Confidence Regions for Multidimensional Scaling Solutions.” American Journal of Political Science 58: 264-278.

Kalkan, Kerem Ozan; Geoffrey C. Layman; Eric M. Uslaner. (2009) “‘Bands of Others’? Attitudes Toward Muslims in Contemporary American Society.” Journal of Politics 71: 847-862.

Krosnick, Jon A. (1991) “The Stability of Political Preferences: Comparisons of Symbolic and Nonsymbolic Attitudes.” American Journal of Political Science 35: 547-576.

LaCour, Michael J. and Donald P. Green. (2014) “When Contact Changes Minds: An Experiment On Transmission Of Support For Gay Equality.” Science 346: 1366-1369.

Markus, Gregory B. and Philip E. Converse. (1979) “A Dynamic Simultaneous Equation Model of Electoral Choice.” American Political Science Review 73: 1055-1070.

Nunnally, Jum C. and Ira H. Bernstein. (1994) Psychometric Theory (Third Edition). New York, NY: McGraw-Hill.

Page, Benjamin I. and Calvin Jones. (1979) “Reciprocal Effects of Policy Preferences, Party Loyalties, and the Vote.” American Political Science Review 73: 1071-1089.

Rabinowitz, George B. (1976) “A Procedure for Ordering Object Pairs Consistent with the Multidimensional Unfolding Model.” Psychometrika 41: 349-373.

Rabinowitz, George B. (1978) “On the Nature of Political Issues: Insights from a Spatial Analysis.” American Journal of Political Science 22: 793-817.

Rabinowitz, George and Stuart E. Macdonald. (1989) “A Directional Theory of Issue Voting.” American Political Science Review 83: 93-121.

Reynolds, H. T. (1974) “Rationality and Attitudes Toward Political Parties and Candidates.” Journal of Politics 36: 983-1005.

Weisberg, Herbert F. (1980) “A Multidimensional Conceptualization of Party Identification.” Political Behavior 2: 33-60.

Weisberg, Herbert F. and Jerrold G. Rusk. (1970) “Dimensions of Candidate Evaluation.” American Political Science Review 64: 1167-1185.

Wilcox, Clyde; Lee Sigelman; Elizabeth Cook. (1989) “Some Like It Hot: Individual Differences in Responses to Group Feeling Thermometers.” Public Opinion Quarterly 53: 246-257.

Wiley, David E. and James A. Wiley. (1970) “The Estimation of Measurement Error in Panel Data.” American Sociological Review 35: 112-117.

Winter, Nicholas and Adam Berinsky. (1999) “What’s Your Temperature? Thermometer Ratings and Political Analysis.” Paper presented at the 1999 Annual Meetings of the American Political Science Association.

Zinni, Frank P., Jr.; Franco Mattei; Laurie A. Rhodebeck. (1997) “The Structure of Attitudes toward Groups: A Comparison of Experts and Novices.” Political Research Quarterly 50: 595-626.


Table 1: Wiley-Wiley reliability estimates for feeling thermometers from the 1980 ANES Major Panel Study.

                        Panel wave
Stimulus                January    June       September
Jimmy Carter            0.802      0.834      0.838
Edward Kennedy          0.856      0.829      0.822
George McGovern         0.781      0.702      0.702
Walter Mondale          0.709      0.632      0.690
Gerald Ford             0.717      0.697      0.674
George Bush             0.940      0.927      0.940
Jerry Brown             0.800      0.745      0.694
Howard Baker            0.693      0.643      0.645
John Connally           0.823      0.783      0.780
Ronald Reagan           0.806      0.790      0.812
George Wallace          0.779      0.740      0.736
Democratic party        0.769      0.776      0.729
Republican party        0.718      0.719      0.675
Independents            0.660      0.546      0.520
Parties in general      0.550      0.395      0.506
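For readers who want to see how estimates of the kind reported in these tables can be obtained, the following is a minimal sketch of the moment-based calculation usually associated with the Wiley and Wiley (1970) three-wave model, which assumes a constant error variance across waves. It is not taken from this paper's estimation code; the function name and the input variances and covariances are hypothetical.

```python
def wiley_wiley_reliabilities(var1, var2, var3, cov12, cov13, cov23):
    """Wave-specific reliabilities from the moments of one item in three waves.

    Under the three-wave model with constant error variance, the true-score
    variance at wave 2 is cov12 * cov23 / cov13. The error variance is what
    remains of the observed wave 2 variance, and each wave's reliability is
    one minus the ratio of error variance to that wave's observed variance.
    """
    true_var2 = cov12 * cov23 / cov13
    error_var = var2 - true_var2
    return tuple(1.0 - error_var / v for v in (var1, var2, var3))


# Hypothetical moments for one thermometer item (not ANES values).
rel_wave1, rel_wave2, rel_wave3 = wiley_wiley_reliabilities(
    var1=80.0, var2=100.0, var3=110.0, cov12=60.0, cov13=60.0, cov23=80.0
)
print(round(rel_wave1, 3), round(rel_wave2, 3), round(rel_wave3, 3))
```

Because the error variance is held constant, the reliability differs across waves only through changes in the observed variance of the item.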

Table 2: Wiley-Wiley reliability estimates for party identification index, ideological self-placement variable, and seven-point issue scales from the 1980 ANES Major Panel Study.

                               Panel wave
Variable                       January    June       September
Party identification           0.862      0.869      0.870
Ideology self-placement        0.635      0.654      0.657
Defense spending               0.701      0.724      0.689
Govt. spending vs. services    0.634      0.637      0.602
Aid to Russia                  0.594      0.527      0.502
Inflation vs. Unemployment     0.619      0.612      0.498

Table 3: Wiley-Wiley reliability estimates for feeling thermometers from the 2000-2002-2004 ANES Panel Survey.

                        Panel wave
Stimulus                2000       2002       2004
Bill Clinton            0.820      0.812      0.808
Hillary Clinton         0.849      0.832      0.827
Al Gore                 0.789      0.726      0.756
Ralph Nader             0.770      0.736      0.736
Jesse Jackson           0.773      0.707      0.708
George W. Bush          0.739      0.767      0.836
Dick Cheney             0.798      0.850      0.889
Liberals                0.635      0.571      0.617
Conservatives           0.646      0.651      0.694
Federal Government      0.573      0.557      0.657
Supreme Court           0.466      0.493      0.489
Congress                0.511      0.463      0.563
Military                0.608      0.529      0.550
Whites                  0.405      0.411      0.397
Blacks                  0.431      0.376      0.338
Hispanics               0.418      0.376      0.368
Asian Americans         0.426      0.419      0.413
Catholics               0.463      0.491      0.545
Jews                    0.536      0.511      0.512
Protestants             0.492      0.502      0.508
Labor Unions            0.692      0.627      0.678
Big Business            0.530      0.489      0.567
Poor People             0.486      0.473      0.497
People on Welfare       0.568      0.523      0.512
Fundamentalists         0.715      0.700      0.745
The Elderly             0.410      0.470      0.432
Environmentalists       0.629      0.632      0.630
Gays and Lesbians       0.742      0.705      0.740
Feminists               0.561      0.510      0.571

Table 4: Wiley-Wiley reliability estimates for party identification index and questions about increasing or decreasing federal spending from the 2000-2002-2004 ANES Panel Survey.

                                        Panel wave
Variable                                2000       2002       2004
Party identification                    0.898      0.900      0.912
Fed spending, welfare                   0.460      0.476      0.397
Fed spending, fighting AIDS             0.593      0.529      0.476
Fed spending, foreign aid               0.487      0.522      0.514
Fed spending, helping the poor          0.526      0.405      0.263
Fed spending, Social Security           0.647      0.597      0.587
Fed spending, protecting environment    0.583      0.558      0.446
Fed spending, public schools            0.542      0.469      0.500
Fed spending, reducing crime            0.342      0.400      0.305
Fed spending, helping children          0.624      0.546      0.503
Fed spending, assistance to Blacks      0.699      0.689      0.597

Table 5: Predicting the reliability of the feeling thermometer variables from the characteristics of the objects being evaluated with each thermometer. Table entries are OLS regression coefficients with standard errors in parentheses. Standard errors are clustered by stimulus objects. All coefficients are statistically different from zero at the 0.05 level, with one-sided tests.

                            Data source
                            1980 ANES Major       2000-2002-2004 ANES
                            Panel Study           Panel Study
-(1/Polarization)           79.578 (28.326)       164.297 (14.224)
Opinionation                0.581 (0.204)         0.530 (0.117)
Consensus                   -0.032 (0.016)        -0.016 (0.004)
George H. W. Bush           0.246 (0.024)         (not included)
Intercept                   0.511 (0.178)         0.638 (0.099)

R²                          0.739                 0.834
Residual std. dev. (Se)     0.060                 0.059
N of observations           45                    87
N of clusters               15                    29
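Because polarization enters the models as -(1/Polarization), its coefficient is easiest to read by computing predicted reliabilities directly. The sketch below does this with the point estimates from Table 5. The function names are hypothetical, and the polarization values (400 and 800) and the held-constant opinionation and consensus values are illustrative assumptions, not the sample ranges or means used for the effects plots.

```python
# Point estimates copied from Table 5; the Bush dummy appears only in the 1980 model.
COEFS = {
    "1980": {"intercept": 0.511, "neg_inv_polarization": 79.578,
             "opinionation": 0.581, "consensus": -0.032, "bush": 0.246},
    "2000-2002-2004": {"intercept": 0.638, "neg_inv_polarization": 164.297,
                       "opinionation": 0.530, "consensus": -0.016, "bush": 0.0},
}


def predicted_reliability(model, polarization, opinionation, consensus, bush=0):
    """Predicted reliability from the fitted equation for one dataset."""
    b = COEFS[model]
    return (b["intercept"]
            + b["neg_inv_polarization"] * (-1.0 / polarization)
            + b["opinionation"] * opinionation
            + b["consensus"] * consensus
            + b["bush"] * bush)


def polarization_effect(model, low, high):
    """Change in predicted reliability when polarization moves from low to high."""
    return COEFS[model]["neg_inv_polarization"] * (1.0 / low - 1.0 / high)


# Assumed values for illustration only (not the reported sample means or ranges).
for model in COEFS:
    first_half = polarization_effect(model, 400, 600)
    second_half = polarization_effect(model, 600, 800)
    print(model, round(first_half, 3), round(second_half, 3))
```

Under these assumed values, the first 200-point increase in polarization raises predicted reliability by more than the second, which is the diminishing-returns pattern the reciprocal transformation is meant to capture.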

Figure 1: Wiley-Wiley reliability estimates for feeling thermometers from the 1980 ANES Major Panel Study.

[Figure not reproduced. Dot plot with separate symbols for January, June, and September 1980. Horizontal axis: Wiley-Wiley Reliability Estimates (0.0 to 1.0). Stimuli, top to bottom: George Bush, Edward Kennedy, Jimmy Carter, John Connally, Ronald Reagan, Jerry Brown, George McGovern, George Wallace, Democratic Party, Republican Party, Gerald Ford, Walter Mondale, Howard Baker, Independents, Parties in General.]

Figure 2: Wiley-Wiley reliability estimates for party identification index, ideological self-placement variable, and seven-point issue scales from the 1980 ANES Major Panel Study.

[Figure not reproduced. Dot plot with separate symbols for January, June, and September 1980. Horizontal axis: Wiley-Wiley Reliability Estimates (0.0 to 1.0). Variables, top to bottom: Party ID, Def spending, lib-con ID, Gov spending, Inflation, Russia.]

Figure 3: Wiley-Wiley reliability estimates for feeling thermometers from the 2000-2002-2004 ANES Panel Survey.

[Figure not reproduced. Dot plot with separate symbols for 2000, 2002, and 2004. Horizontal axis: Wiley-Wiley Reliability Estimates (0.0 to 1.0). Stimuli, top to bottom: Dick Cheney, Hillary Clinton, George W. Bush, Bill Clinton, Al Gore, Jesse Jackson, Ralph Nader, Fundamentalists, Gays and Lesbians, Conservatives, Labor Unions, Federal Government, Liberals, Environmentalists, Military, Feminists, People on Welfare, Big Business, Congress, Catholics, Jews, Protestants, Poor People, Supreme Court, The Elderly, Blacks, Asian Americans, Hispanics, Whites.]

Figure 4: Wiley-Wiley reliability estimates for party identification index and questions about increasing or decreasing federal spending from the 2000-2002-2004 ANES Panel Survey.

[Figure not reproduced. Dot plot with separate symbols for 2000, 2002, and 2004. Horizontal axis: Wiley-Wiley Reliability Estimates (0.0 to 1.0). Variables, top to bottom: Party ID; Fed spending, Blacks; Fed spending, Soc Secur; Fed spending, children; Fed spending, AIDS; Fed spending, environ; Fed spending, schools; Fed spending, help poor; Fed spending, for aid; Fed spending, welfare; Fed spending, crime.]

Figure 5: Effect plots for the independent variables predicting feeling thermometer reliabilities in the 1980 ANES Major Panel Study.

[Figure not reproduced. Three panels, each with vertical axis "Wiley-Wiley reliability for feeling thermometer" (tick marks from 0.4 to 0.9). Horizontal axes: Polarization (tick marks at 400, 600, 800), Opinionation (tick marks from 0.6 to 0.9), and Consensus (tick marks from 0 to 5).]

Figure 6: Effect plots for the independent variables predicting feeling thermometer reliabilities in the 2000-2002-2004 ANES Panel Study.

[Figure not reproduced. Three panels, each with vertical axis "Wiley-Wiley reliability for feeling thermometer" (tick marks from 0.4 to 0.9). Horizontal axes: Polarization (tick marks at 400, 600, 800, 1000), Opinionation (tick marks from 0.6 to 0.9), and Consensus (tick marks from 0 to 10).]
