Dummy Coding and Structural Sets

Author: Patrick Dennis

Copyright © 2013, J. Toby Mordkoff

One important aspect of all forms of regression is that the values of the predictors must be quantitative (i.e., numbers that can be taken seriously as numbers). This isn’t a problem when a categorical variable has only two levels, since they can be coded as 1 and 0, with 1 meaning that the subject has some attribute and 0 meaning that he or she doesn’t. (This is why such variables should be named for the attribute, instead of the variable; e.g., when 1 = female and 0 = male, the name for the variable should be “female[ness],” not “sex” or “gender.”)

But when categorical or nominal variables have more than two levels, they become more difficult. You can’t, for example, just code them as 0, 1, and 2 -- at least not in any case where the variable is truly categorical -- since that would imply a continuum with the levels coded as 0 and 2 being more different from each other than either is from the level coded as 1; even worse, it would also imply that the level coded as 2 has twice as much of something as the level coded as 1, and that the level coded as 0 has none of this attribute. So we need to do something else.

Whenever you have a nominal predictor with more than two values, you have to create a set of new variables to code for its specific value. Again, you cannot just use, e.g., the numbers 1 through 3 to code for dark-colored hair, light-colored hair, and bald, because SPSS will take the numbers seriously as numbers and assume that blondes have twice as much of something as brunettes and that bald people have three times as much. You also can’t use strings (e.g., “dark” / “light” / “bald”) because SPSS doesn’t accept strings as the predictors (or as the predicted) in MRC.

To understand what you have to do, first note that a nominal variable that takes on three values is said to have two “aspects.” (Aspects are like degrees of freedom; there is one less aspect than the number of different values or settings.)
It is easiest to think about this when there is a control condition, so let’s think of bald as the control. The first aspect might be the difference between dark-haired people and bald people; the second aspect would be the difference between light-haired people and bald people. The difference between dark- and light-haired people is implied by the first two aspects, which is why we don’t need a third aspect. For example, if dark-haired folks score 5 points higher than those who are bald, and light-haired folks score 3 points lower than those who are bald, then clearly dark-haired folks will score 8 points higher than light-haired folks.

So, before you conduct an analysis involving a nominal variable with more than two levels, you must create a coding scheme and implement it. (This will be part of the preparation step; i.e., part of Step 0.) There are a variety of types of coding schemes available; I’ll start with the most popular and then add the other two later. In fact, because the type of scheme you choose has no effect at all on any of the statistics related to the variable as a whole, if all you care about is the proportion of variance that can be explained by, e.g., hair-color, then you do not need to know the other two options.

Before getting into any details, however, let me make some suggestions about the spreadsheet. I’m assuming that you’ll be starting with a spreadsheet that includes the categorical variable as a string. Thus, for our running example of three values of hair-color, each row has either “dark” or “light” or “bald” in it. My first suggestion is to not delete the column, even after you have recoded it for the analysis. Not only might someone else want to use a different coding scheme, but you’ll want the original values there as both a reminder and a way to double-check that the coding scheme was implemented correctly.

Second, the new variables that you create should be immediately to the right of the original column. And the names of the new variables should have a clear and consistent relationship with the original. If the original is “Hair_Clr” (to obey the eight-character limit of some versions of SPSS), then the new columns should be something like “HC1” and “HC2” or “dc_HC1” and “dc_HC2” (to include the fact that dummy coding was used).

Dummy Coding

The simplest and most-popular way to code a nominal variable is called “dummy coding.” Using 1s and 0s to code a two-level nominal is an example of this. The key concept behind dummy coding is that one value is (treated as) the “none” or “control” or “baseline” condition. Each other condition (is treated as if it) differs from the “baseline” condition in a unique and specific way. The upshot of this is that one value of the nominal is coded as 0s on every aspect. (This causes SPSS to use the mean of this condition as the constant or “intercept” in the GLM equation, instead of using the overall mean.) Each of the other conditions “takes over” one aspect by being coded as a 1 on that aspect (while all other values of the nominal get 0s) and 0s on all other aspects.

Sticking with light-haired, dark-haired, and bald subjects, if we decide to treat bald as the baseline condition, and decide to use the first aspect (dc_HC1) to code for light hair (instead of bald) and the second aspect (dc_HC2) to code for dark hair (instead of bald), the coding scheme would be what is shown below.

          dc_HC1   dc_HC2
light        1        0
dark         0        1
bald         0        0
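As a rough illustration of implementing the scheme, here is a minimal Python sketch (not SPSS syntax, and not the author's procedure) that recodes a string column into the two dummy-coded columns. The example data and the helper name `dummy_code` are made up for this sketch; only the scheme itself comes from the table above.

```python
# Coding scheme copied from the table above: value -> (dc_HC1, dc_HC2).
DUMMY_SCHEME = {
    "light": (1, 0),
    "dark": (0, 1),
    "bald": (0, 0),  # baseline: 0 on every aspect
}


def dummy_code(values, scheme):
    """Return one tuple of codes per row, in the original row order."""
    unknown = set(values) - set(scheme)
    if unknown:
        # Refuse to guess a code for a value that the scheme does not cover.
        raise ValueError(f"values missing from the coding scheme: {sorted(unknown)}")
    return [scheme[v] for v in values]


# Made-up example column (hypothetical data, for illustration only).
hair_clr = ["dark", "bald", "light", "dark", "light", "bald"]

coded = dummy_code(hair_clr, DUMMY_SCHEME)
dc_hc1 = [row[0] for row in coded]
dc_hc2 = [row[1] for row in coded]

# Keep the original strings next to the new columns so the recoding
# can be double-checked by eye.
for original, c1, c2 in zip(hair_clr, dc_hc1, dc_hc2):
    print(f"{original:<6} {c1} {c2}")
```

Printing the original string beside the new codes mirrors the earlier suggestion to keep the original column as a way to double-check the recoding.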

The way to “read” this table is as follows: all subjects who have “light” in the original (string) column should get a 1 in the new dc_HC1 column (which should be marked as a scale in the variable-view window in SPSS) and a 0 in the new dc_HC2 column; all subjects who have “dark” in the original (string) column should get a 0 in the new dc_HC1 column and a 1 in the new dc_HC2 column; bald subjects get 0s in both new columns.

Why does this matter? As mentioned above, if all you care about is the proportion of variance that can be explained by the variable as a whole, then it actually does not matter how you re-code the variable. But if you approach this a bit more like one-way ANOVA where, whenever the main effect is significant, you want to know which conditions differ from which other conditions, then, if hair-color as a variable turns out to be a good predictor, you might want to know which particular hair colors have what kinds of effect. This is when the coding scheme matters.

When you use dummy coding, each aspect is used to compare one of the non-baseline values to the baseline value. In the example above, the first aspect (i.e., dc_HC1) codes for the difference between light-haired people and bald people and the second aspect codes for the difference between dark-haired people and bald people. Those are the only two comparisons that you get. Unlike ANOVA, where you can run all possible planned comparisons and correct the p-values later, here you only get the comparisons that you built into the model when you designed and implemented your coding scheme. (Well, more accurately, there are ways to do all possible comparisons and then apply a correction, but that’s rarely done and almost never done correctly; if you want to learn about it, ask me outside of class.)

In any event, because the coding scheme determines what follow-up-like questions are answered by the analysis, we will now discuss the other two methods of coding a nominal variable. Why? Because you will often have nominal variables that do not have a “none” or “control” or “baseline” value.

Effects Coding

This one is also pretty straightforward. Recall that dummy coding uses the aspects to compare each non-baseline value to the baseline value, one at a time. The second type of coding uses the aspects to compare all but one of the values to the overall mean, one at a time. It’s very similar to dummy coding in that each of the special values is compared to something, one at a time. But where dummy coding compares each non-baseline value to the baseline, effects coding compares each value to the overall mean. (Strictly speaking, the comparison is to the unweighted mean of the condition means, which is the same as the overall mean when every condition has the same number of subjects.)

This coding scheme is particularly useful when you have a “none-of-the-above” option to a measure. Have the none-of-the-above value be the one value that doesn’t get compared to the overall mean. Continuing with hair-color, what if, instead of bald, we had other as the third option (to include not only bald folks but also the occasional person with red or blue or whatever hair)? We will still give the first aspect to light-haired, so it is coded as 1 on ec_HC1 and 0 on ec_HC2, and we’ll still give the second aspect to dark-haired, so it gets 0 and 1, but the catch-all/junk-pile/other value gets a -1 on both aspects.

          ec_HC1   ec_HC2
light        1        0
dark         0        1
other       -1       -1
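To make the difference between the two schemes concrete, the following Python sketch fits the same made-up scores twice by ordinary least squares, once with effects coding and once with dummy coding (with other as the baseline, which is an assumption of this sketch). The data, the helper names `solve` and `fit`, and the use of exact fractions are all illustrative choices, not part of the original handout. The groups are deliberately the same size, so the overall mean and the unweighted mean of the group means coincide; with unequal group sizes the effects-coded intercept is the unweighted mean of the group means.

```python
from fractions import Fraction

EFFECTS = {"light": (1, 0), "dark": (0, 1), "other": (-1, -1)}
DUMMY = {"light": (1, 0), "dark": (0, 1), "other": (0, 0)}  # "other" as baseline


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination (exact)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


def fit(groups, scores, scheme):
    """Least-squares coefficients [intercept, b1, b2] via the normal equations."""
    x = [[Fraction(1), *map(Fraction, scheme[g])] for g in groups]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, scores)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data: two subjects per group.
# Group means: light = 5, dark = 10, other = 3; overall mean = 6.
groups = ["light", "light", "dark", "dark", "other", "other"]
scores = [4, 6, 9, 11, 2, 4]

effects_fit = fit(groups, scores, EFFECTS)
dummy_fit = fit(groups, scores, DUMMY)

print([str(c) for c in effects_fit])  # ['6', '-1', '4']
print([str(c) for c in dummy_fit])    # ['3', '2', '7']
```

Under effects coding the intercept is the overall mean (6) and each coefficient is that group's mean minus the overall mean (5 - 6 and 10 - 6). Under dummy coding the intercept is the baseline group's mean (3) and each coefficient is that group's mean minus the baseline mean (5 - 3 and 10 - 3).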

☞ This is a good time to mention something. There must be at least one subject with each value in order for the value to get a row and/or column in the coding scheme. A lot of people prefer effects coding to dummy coding (because they prefer to compare individual values to the overall mean) and, therefore, these researchers always include a “none-of-the-above” option in all of their questionnaire items. That’s fine as long as at least one subject (and preferably five or more) actually chooses this option. If no subject reports a particular value, then that value must be omitted from the entire coding scheme because it won’t exist in the recoded columns.

Contrast Coding

This is the most-powerful and most-complicated option. Because of this, I will include more than one example. But we’ll start with hair-color with three values: light, dark, and bald.

In general, contrast coding allows you to set up a bunch of specific comparisons that can each involve more than two values. Each comparison will be coded by an aspect … and we always have one fewer aspect than values … so you get to set up one fewer comparison than you have values. That’s the easy part, as well as what makes this method attractive: you get to set up your aspects to answer specific questions. The hard part is that the comparisons that you set up should be orthogonal, which means that they should be independent questions about the values. You can’t, for example, set up one aspect to compare value A to value B and then set up another aspect to compare values A and C (grouped together) to value B, because this aspect would overlap (a lot) with the first.

The way that you verify that your aspects are orthogonal is by making sure that two rules are obeyed. These rules concern the values in the coding-scheme table (i.e., the 1s, 0s, -1s, etc.), not the actual data.

First, the sum of the coding-scheme values for each column must always be zero. If you go back and look again at both dummy coding and effects coding you’ll see that dummy coding does not obey this rule because each column in the coding-scheme table sums to one, not zero. Conversely, effects coding does obey this rule because each column has one 1 and one -1, plus one or more 0s, which will always sum to zero.

The second rule for contrast coding is that all possible cross-products of columns also sum to zero. To check this, you create new (temporary) columns that are the products of pairs of coding-scheme columns. When there are only two aspects, as there are for the example tables above, this is easy: for each data-value (i.e., for each row), multiply the two coding-scheme values. Thus, for the dummy coding example, this cross-products column would be 0, 0, and 0, because (from top [light-haired] to bottom [bald]) 1×0=0, 0×1=0, and 0×0=0. The sum of three 0s is zero, so dummy coding obeys this rule. Conversely, effects coding does not, because its cross-products column is 0, 0, and 1 (1×0=0, 0×1=0, and -1×-1=1), which does not sum to zero.

With regard to the two rules of contrast coding, dummy coding obeys one (but not the other), while effects coding obeys the other (but not the one). This has a few implications that will be briefly covered in lecture.

If, at this point, you’re saying to yourself: “self, I hereby declare that I will not be using contrast coding for any of my work because this is too complicated to be worth it,” please relax. Most people -- at least at first -- just memorize a set of known-to-work schemes that will cover 99% of all actual situations. Plus, once you get the hang of it, creating a new coding scheme that obeys both rules isn’t very difficult.

Three-value Contrast Coding Scheme

Although I will not attempt to prove it formally, please trust me when I say that there is only one possible scheme for contrast coding three values. The scheme boils down to this: one aspect compares two of the values to the third, the other aspect compares the two values that were together (in the first case) to each other. So, all you have to do is ask yourself which of the three values is the “odd” one. That one is compared to the other two (combined) on the first aspect; the two non-odd values are compared to each other on the second aspect.

Let’s treat bald as the “odd” value. The first aspect will compare the two values with hair to bald. So, both light- and dark-haired get 1s and bald gets -2. You can switch all the signs, but you must be sure that the column still sums to zero to obey the first rule. The second aspect compares the people with hair, so light-haired gets 1 and dark-haired gets -1, while bald stays out of it with a 0.

          cc_HC1   cc_HC2
light        1        1
dark         1       -1
bald        -2        0
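Both rules can also be checked mechanically. The sketch below, in Python, computes the column sums and the cross-product sums for the three three-value schemes shown so far. The function names `column_sums` and `cross_product_sums` are hypothetical helpers invented for this illustration; the code values are copied from the tables.

```python
from itertools import combinations

SCHEMES = {
    "dummy": {"light": (1, 0), "dark": (0, 1), "bald": (0, 0)},
    "effects": {"light": (1, 0), "dark": (0, 1), "other": (-1, -1)},
    "contrast": {"light": (1, 1), "dark": (1, -1), "bald": (-2, 0)},
}


def column_sums(scheme):
    """Rule 1: the sum of the codes in each column of the scheme."""
    columns = list(zip(*scheme.values()))
    return [sum(col) for col in columns]


def cross_product_sums(scheme):
    """Rule 2: for every pair of columns, the sum of the row-wise products."""
    columns = list(zip(*scheme.values()))
    return [
        sum(a * b for a, b in zip(col1, col2))
        for col1, col2 in combinations(columns, 2)
    ]


for name, scheme in SCHEMES.items():
    sums = column_sums(scheme)
    crosses = cross_product_sums(scheme)
    obeys_both = all(s == 0 for s in sums) and all(c == 0 for c in crosses)
    print(f"{name:<9} column sums {sums}  cross-products {crosses}  both rules: {obeys_both}")
```

Only the contrast-coded scheme passes both checks: dummy coding fails the column-sum rule and effects coding fails the cross-product rule.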

Note that the cross-product rule is also obeyed by this scheme because, from top (light) to bottom (bald), the products 1, -1, and 0 sum to zero. While the choice of which value will be treated as “odd” is completely up to you, all three-value contrast coding schemes will be some version of this one scheme. There is no other three-value scheme that will obey both rules.

Four-value Contrast Coding Schemes

When you have four values, there are two known-to-work general coding schemes. The first option is a slight extension of the three-value scheme. You just add another aspect that compares the fourth group to all three of the groups that will be involved in the three-group scheme. Let me illustrate by simply adding red-haired to the same three options we’ve been using. As before, bald will be the “odd” value for the first aspect, so it gets -3 and the three hair-values get 1s. The second aspect starts a replication of the three-value scheme with red-haired as the new odd value, so red-heads get -2 and the other two hair-colors get 1s. Finally, we compare light to dark on the final aspect, with light getting 1 and dark getting -1, while both bald and red-heads stay out of it with 0s.

          cc_HC1   cc_HC2   cc_HC3
light        1        1        1
dark         1        1       -1
red          1       -2        0
bald        -3        0        0

I leave it to you to verify that the three cross-product columns that you can create from the three columns in the table will all sum to zero.

The second known-to-work coding scheme for four values approaches the four values as if they were -- in effect -- two pairs. The first aspect will compare one pair to the other pair. The second aspect will compare the two members of the first pair (while ignoring both members of the second pair). The third aspect will compare the two members of the second pair (while ignoring both members of the first pair). To illustrate this one, I’m going to pretend that no one is bald, but that a few people with blue hair were sampled, instead. I will then come at this question as being a two-pairs case by saying (even if this might offend some people) that light- and dark-haired go together as being more frequent or typical, while red- and blue-haired go together as being infrequent or not typical.

The first aspect is used to compare the two pairs, so light- and dark-haired both get 1s and red- and blue-haired both get -1s. The second aspect is used to “look inside” the first pair, with light getting 1 and dark getting -1, while the members of the other pair get 0s. The last aspect is used to “look inside” the second pair, with red getting 1 and blue getting -1, while the members of the first pair are excluded with 0s.

          cc_HC1   cc_HC2   cc_HC3
light        1        1        0
dark         1       -1        0
red         -1        0        1
blue        -1        0       -1

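These two are the known-to-work schemes for four values that you will need most often. (Other sets of columns that obey both rules do exist; for example, three columns that each make a different 2 vs 2 split: 1, 1, -1, -1; then 1, -1, 1, -1; then 1, -1, -1, 1.) Again, who gets paired with whom is up to you, but nearly every scheme you are likely to need will boil down to 1 vs 3 (first option) or 2 vs 2 (second option) while obeying the two rules of contrast coding.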
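The first option follows a regular pattern (peel off one “odd” value at a time), so it can be generated rather than memorized. The Python sketch below is one possible way to do that; the helper names `one_vs_rest_scheme` and `obeys_both_rules` are invented for this illustration, and applying the pattern to more than four values is an extrapolation that the text above does not cover. The two-pairs scheme is simply typed in from its table.

```python
from itertools import combinations


def one_vs_rest_scheme(values):
    """Build the first-option scheme; the LAST value listed is the first 'odd' one."""
    k = len(values)
    rows = {v: [] for v in values}
    # Work backwards: values[odd] is the odd value for this aspect. It gets
    # minus the number of values it is compared against, those values get 1,
    # and values that were already peeled off get 0.
    for odd in range(k - 1, 0, -1):
        for i, v in enumerate(values):
            if i < odd:
                rows[v].append(1)
            elif i == odd:
                rows[v].append(-odd)
            else:
                rows[v].append(0)
    return {v: tuple(codes) for v, codes in rows.items()}


def obeys_both_rules(scheme):
    """True if every column sums to zero and every cross-product column does too."""
    columns = list(zip(*scheme.values()))
    sums_ok = all(sum(col) == 0 for col in columns)
    cross_ok = all(
        sum(a * b for a, b in zip(c1, c2)) == 0
        for c1, c2 in combinations(columns, 2)
    )
    return sums_ok and cross_ok


four_value = one_vs_rest_scheme(["light", "dark", "red", "bald"])
two_pairs = {
    "light": (1, 1, 0),
    "dark": (1, -1, 0),
    "red": (-1, 0, 1),
    "blue": (-1, 0, -1),
}

for value, codes in four_value.items():
    print(f"{value:<6}", *codes)
print(obeys_both_rules(four_value), obeys_both_rules(two_pairs))
```

Listing the values in a different order changes which value is treated as odd on each aspect, which matches the point that the choice of odd value is up to you.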