STAT503 Lecture Notes
Author: Ralf Webster

Chapter 10: Analysis of Categorical Data
November 9, 2009

Our observations fall into categories instead of being continuous variables. We count the number of observations falling into each category. As usual we assume that the sample points are independent.

If there are only two categories,
• the number of observations in one category has a binomial distribution.

If there are more than two categories,
• we can focus on one category and group the others together (still binomial),
• or we can define probabilities for all categories (p1, p2, . . .).

We will use a new distribution called the χ2-distribution (chi-squared).
• It is another cousin to the Normal(0, 1) distribution.

Definition: If Z1, Z2, . . . , Zk are independent Normal(0, 1) random variables, then Z1^2 + Z2^2 + · · · + Zk^2 has a χ2_k distribution (a chi-squared distribution with k degrees of freedom).
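This definition can be checked by simulation: summing k squared standard normals gives draws whose average should be close to k, the mean of a χ2_k distribution. A minimal sketch in Python (the sample size and seed are arbitrary choices of mine):

```python
import random

random.seed(0)
k = 3            # degrees of freedom
n_draws = 10_000

# Each draw is the sum of k squared independent Normal(0, 1) variables.
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(k)) for _ in range(n_draws)]

mean = sum(draws) / n_draws
print(mean)  # should be close to k = 3, the mean of a chi-square(3) distribution
```

Note that every draw is nonnegative, which is why the χ2 curve lives entirely on the positive axis.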

There are several different tests which use the χ2 distribution to determine critical values. They differ in their setup just like there are many different kinds of t-tests. [Draw χ2 curve.]

10.1 The χ2 Goodness of Fit Test

In this section we consider testing whether the observed frequencies for a categorical variable are compatible with a null hypothesis that specifies the probabilities of the categories. Thus, we study whether the data seem to fit the hypothesized distribution. For example, the question "Is this a fair coin?" may be answered by this method.

Description of such a test for categorical data, based on a random sample of size n:
• Need hypothesized values for the population proportions pi for each category. These are specified in or implied by the given problem.


• We calculate the expected number of observations in each category under H0, using the formula Ei = n·pi (sample size times the hypothesized population proportion).
• The test is only approximate and works when the sample size is large. The expected number in each category should be at least 5.

The Test Statistic

The test statistic is computed as follows:

    Xs^2 = Σ (from i = 1 to k) (Observed_i − Expected_i)^2 / Expected_i = Σ (O − E)^2 / E.

• Under the null hypothesis Xs^2 has approximately a χ2_{k−1} distribution.
• Table 9 gives critical values for the χ2_k distribution.
• Rejection rule: large values of Xs^2 lead to the rejection of H0.

Example

In the sweet pea, the allele for purple flower color (P) is dominant to the allele for red flowers (p), and the allele for long pollen grains (L) is dominant to the allele for round pollen grains (l). The first group (of grandparents) is homozygous for the dominant alleles (PPLL) and the second group (of grandparents) is homozygous for the recessive alleles (ppll). We obtain the F1 generation (of parents) by crossing these. They are all PpLl and will have purple flowers and long pollen grains. The F1's are crossed to give an F2 generation, which is of interest to us. It is thought that the genes controlling these two traits are 25.5 cM apart. If that were true then the F2 offspring proportions should be the following:

• 66% purple/long: PPLL or PpLL or PPLl or PpLl, [(q^2 − 2q + 3)/4]
• 9% purple/round: PPll or Ppll, [(2q − q^2)/4]
• 9% red/long: ppLL or ppLl, [(2q − q^2)/4]
• 16% red/round: ppll, [(1 − q)^2/4]

Here q = 0.2 (the probability of recombination between the genes).

Question: Are these genes 25.5 cM apart? 381 F2 offspring are collected, and we observe

• 284 purple/long
• 21 purple/round
• 21 red/long
• 55 red/round

Note that all of these counts are at least 5 (the formal condition is on the expected counts, checked below). Let p1, p2, p3, p4 be the probabilities of purple/long, purple/round, red/long, red/round offspring, respectively, resulting from this F2.

H0: p1 = 0.66, p2 = 0.09, p3 = 0.09, p4 = 0.16; the category probabilities are those predicted by a 25.5 cM genetic distance.
HA: the category probabilities are different from those predicted by a 25.5 cM genetic distance.

[Of course: hypotheses must be about probabilities or population proportions, NOT about observations in the sample.]

Use a χ2 goodness-of-fit test with df = #categories − 1. Here df = 4 − 1 = 3.
Xs^2 = Σ (O − E)^2 / E has approximately the χ2_3 distribution under H0.
Test at level α = 0.05; the critical value for χ2_3 is 7.81. We will reject H0 if Xs^2 > 7.81.

Category numbers expected under H0 (n·pi):

• E1 = 381(0.66) = 251.5
• E2 = 381(0.09) = 34.3
• E3 = 381(0.09) = 34.3
• E4 = 381(0.16) = 61.0

These are all ≥ 5.

Xs^2 = (284 − 251.5)^2/251.5 + (21 − 34.3)^2/34.3 + (21 − 34.3)^2/34.3 + (55 − 61.0)^2/61.0 = 15.10.

15.10 > 7.81, so reject H0. This study provides evidence at the .05 significance level that the offspring proportions are different from those predicted by a 25.5 cM genetic distance. [This could mean the map distance is different, or some difference in viability, etc.]
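The calculation above can be reproduced in a few lines of Python. This is only a sketch of the arithmetic (the function name is my own); it uses the exact expected counts rather than the rounded ones, which is why the statistic differs slightly from 15.10:

```python
def chi2_gof_stat(observed, probs):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    n = sum(observed)
    expected = [n * p for p in probs]
    assert all(e >= 5 for e in expected), "approximation needs all E >= 5"
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [284, 21, 21, 55]        # purple/long, purple/round, red/long, red/round
probs = [0.66, 0.09, 0.09, 0.16]    # proportions predicted by 25.5 cM

stat = chi2_gof_stat(observed, probs)
print(round(stat, 2))  # about 15.1; exceeds the chi2_3 critical value 7.81, so reject H0
```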

Two Categories

Many times we have only two categories:
• e.g. male/female, left/right, yes/no, success/failure, improve/get worse, etc.

When there are only two categories, the test may be directional or non-directional.

Example: We wish to study the genetic model of a certain trait. There are two homozygous lines of Drosophila, one with red eyes and one with purple eyes. It has been suggested that there is a single gene responsible for this phenotype, with the red eye trait dominant over the purple eye trait. If that is true, we expect these two lines to produce F2 progeny in the ratio 3 red : 1 purple. We want to test the hypothesis that red is (autosomal) dominant. To do this we cross red-eyed and purple-eyed flies, using several parents from the two lines, and obtain 43 flies in the F2 generation: 29 red-eyed and 14 purple-eyed.

Categories:

• Red eyes; hypothesized proportion p = 3/4 = 0.75; expected number E1 = (43)(0.75) = 32.25.
• Purple eyes; hypothesized proportion 1 − p = 0.25; expected number E2 = (43)(0.25) = 10.75.

Is the red-eye trait dominant over purple? Let p be the probability that an F2 fly has red eyes.

H0: p = 0.75; the F2 progeny are in a 3:1 ratio of red- to purple-eyed flies.
HA: p ≠ 0.75; the F2 progeny are not in a 3:1 ratio.

Use a chi-square goodness-of-fit test. Xs^2 = Σ (O − E)^2 / E has a χ2_1 distribution under H0.
Test at level α = 0.05; critical value (Table 9) = 3.84.

Calculate the chi-square statistic, summing over categories:

    Xs^2 = Σ (O − E)^2 / E = 0.3275 + 0.9826 = 1.310.

Since 1.310 < 3.84, we do not reject H0. This study does not provide evidence at the 0.05 significance level that the F2 progeny depart from the 3:1 ratio. Of course, we have NOT PROVED that the offspring do follow the 3:1 ratio! However, we would reject an H0 of a 1:3 ratio (i.e. purple is dominant); the test statistic would be Xs^2 = 41.3 for the same data.

We could have had a directional HA, for example p < 0.75. In that case we would reject if BOTH of the following conditions hold:

1. the test statistic Xs^2 > χ2_1(2α), i.e. 2.71;


2. p̂ < 0.75 (i.e. the estimate deviates from the hypothesized value in the same direction as HA).

If HA: p > 0.75, then we would reject if both of the following conditions hold:

1. the test statistic Xs^2 > χ2_1(2α), i.e. 2.71;
2. p̂ > 0.75 (i.e. the estimate deviates from the hypothesized value in the same direction as HA).

Details for a chi-square goodness-of-fit test:

1. Define pi's for each category (they must add to 1) and state the hypotheses. The problem statement will give the pi's; they may be explicit or implied (e.g. a 9:3:3:1 ratio).
2. If there are only two categories, state HA with symbols as well as words; HA can also be directional.
3. If there are more than two categories, state HA in words only; HA cannot be directional.
4. Calculate Ei = n·pi for each category. Verify that all the Ei are at least 5. If not, stop; this test cannot be used.
5. Calculate Xs^2 = Σ (O − E)^2 / E, summing over all categories.
6. Compare with the χ2_df critical value for df = #categories − 1; reject H0 if the test statistic exceeds the critical value.
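The goodness-of-fit procedure can be sketched as a small Python function (the names are my own; the critical value would normally come from Table 9 or a library), shown here on the Drosophila data:

```python
def chi2_gof(observed, probs, critical_value):
    """Chi-square goodness-of-fit test.

    Returns (statistic, reject) where reject is True when the statistic
    exceeds the supplied critical value for df = #categories - 1.
    """
    assert abs(sum(probs) - 1.0) < 1e-9, "category probabilities must add to 1"
    n = sum(observed)
    expected = [n * p for p in probs]
    if any(e < 5 for e in expected):
        raise ValueError("an expected count is below 5; cannot use this test")
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, stat > critical_value

# Drosophila example: 29 red, 14 purple; H0 is a 3:1 ratio.
# 3.84 is the chi2_1 critical value at alpha = 0.05 (Table 9).
stat, reject = chi2_gof([29, 14], [0.75, 0.25], critical_value=3.84)
print(round(stat, 2), reject)  # 1.31 False -> do not reject H0
```

For a directional HA one would pass the 2α critical value (2.71 here) and additionally check that p̂ deviates in the direction of HA, as described above.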

10.2–10.3 χ2 Test for 2 × 2 Contingency Tables

Contingency Tables

2 × 2 (two by two) means 2 rows and 2 columns in a table: categorical data with 4 categories which are related in pairs. There are two main contexts (sometimes blurred):

1. One sample; observe two different binomial variables on each unit.
2. Two independent random samples; one binomial variable observed in each.

Examples of context 1:
• observe eye color (red/purple) and wing shape (normal/vestigial)
• observe whether people smoke (yes/no) and exercise (yes/no)

Examples of context 2:
• samples are "drug" and "placebo" (any two treatments); observed variable is "improve" or "don't improve"
• samples are "male" and "female" (any two groups we set up to compare); observed variable is eye color, red or purple

Context 1: 4 categories; observations in a 2 × 2 table:


                      eye color
wing shape        red    purple
normal             39        11
vestigial          18        32

Test for independence of the row and column variables. (This would be a test for lack of genetic linkage.)

Example of context 2 (informal):

                         Treatment
Observed            Drug    Placebo    Total
Improve               15          4       19
Don't Improve         11         17       28
Total                 26         21       47

p1 = probability that a patient will improve if they take the drug [i.e. Pr(Improve | Drug)]
p2 = probability that a patient will improve if they take the placebo [i.e. Pr(Improve | Placebo)]

H0: p1 = p2
HA: p1 ≠ p2 (or p1 > p2)

Unlike the goodness-of-fit test, this one doesn't hypothesize specific values for the p's. Instead, H0 is that the two probabilities are the same (p1 = p2), which can be phrased in terms of independence. HA is that the probabilities are different, which can be phrased as a lack of independence, called "association". Vocabulary: not independent = associated.

p̂1 = Rel.Frequency(Improve | Drug) = (# who improve with drug)/(# who take drug) = 15/26 = 0.58
p̂2 = Rel.Frequency(Improve | Placebo) = (# who improve with placebo)/(# who take placebo) = 4/21 = 0.19

• These p̂i's differ a lot, don't they? This will be the basis for the statistical reasoning.

What values would we expect under H0? The total who improve is 19, so under H0 we estimate the overall proportion who improve by 19/47 = 40.4%. 26 patients took the drug; under H0 we would expect 40.4% of them to improve: (26)(0.404) ≈ 10.5 = (26 × 19)/47.


Similarly, under H0 we expect the number who improve with the placebo to be (21 × 19)/47 = 8.5. For the next row, we expect (26 × 28)/47 = 15.5 to not improve with the drug and (21 × 28)/47 = 12.5 to not improve with the placebo. We could place these expected counts in a separate table, but it is more common to combine them with the original table:

                              Treatment
Observed (Expected)      Drug         Placebo     Total
Improve                  15 (10.5)     4 (8.5)       19
Don't Improve            11 (15.5)    17 (12.5)      28
Total                    26           21             47

Note that the row and column totals are the same as for the observed table. In general, E = (row total)(column total)/(grand total) for each of the four cells. Before we proceed, note that the E for each cell must be at least 5 for this method to work.

Example (formal): Are patients who take the drug more likely to improve than those who take the placebo?

p1 = probability that a patient will improve if they take the real drug
p2 = probability that a patient will improve if they take the placebo

H0: p1 = p2; the probability of improving is the same whether drug or placebo is taken; OR, outcome and treatment are independent.
HA: p1 > p2; the probability of improving is greater if the drug is taken than if the placebo is taken.

[This example is directional; if non-directional, HA would be that improvement and treatment are not independent.]

Use a χ2 test of independence [or χ2 contingency table test].
Xs^2 = Σ (O − E)^2 / E has a χ2_1 distribution under H0 (df = 1 here; explanation later on).
Test at level α = 0.01; reject H0 if Xs^2 > χ2_1(2α) = 5.41 [use the 0.02 column since the test is directional] and p̂1 > p̂2. [Since HA is directional, the p̂ step is required; but it's a good idea whether directional or not.]

p̂1 = (# who improve with drug)/(# who take drug) = 0.58.
p̂2 = (# who improve with placebo)/(# who take placebo) = 0.19.


Check: 0.58 > 0.19, so p̂1 > p̂2 is in the same direction as HA.

    Xs^2 = (15 − 10.5)^2/10.5 + (4 − 8.5)^2/8.5 + (11 − 15.5)^2/15.5 + (17 − 12.5)^2/12.5 = 7.23.

Observe 7.23 > 5.41, so reject H0. This study provides evidence at the 0.01 significance level that the probability of improving is greater if the drug is taken than if the placebo is taken.

Degrees of Freedom
• df = 1 for the 2 × 2 contingency table.
• In general, df = (#rows − 1)(#columns − 1).
• Xs^2 has a χ2_df distribution under H0.
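The expected-count and statistic calculations for a 2 × 2 table can be sketched as follows (the function name is my own; expected counts come from E = row total × column total / grand total):

```python
def chi2_2x2(table):
    """Chi-square test statistic for a 2x2 contingency table.

    table is [[a, b], [c, d]] of observed counts; df = 1.
    """
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    grand = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / grand   # (row total)(col total)/(grand total)
            assert e >= 5, "method requires all expected counts >= 5"
            stat += (table[i][j] - e) ** 2 / e
    return stat

# Drug/placebo example (rows: improve, don't improve; columns: drug, placebo)
stat = chi2_2x2([[15, 4], [11, 17]])
print(round(stat, 2))  # about 7.2 with exact expected counts; the notes get 7.23 from rounded E's
```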

Critical values
• If HA is non-directional, look in the α column for the critical value.
• If HA is directional (2 × 2 only), look in the 2α column.
• For α = 0.05 the χ2_1 critical value is 3.84, but if the test is directional we use the 0.10 column: the critical value is 2.71. Only the first row of Table 9 is involved in directional hypotheses.

What does rejecting H0 mean?
• Sometimes we have to be careful about conclusions in this test.
• If you reject H0 with a chi-squared test, that indicates the two variables are associated, i.e. not independent; that does not always imply a causal relationship.
• This study provides evidence that patients who take the drug are more likely to improve than patients who take the placebo.
• Here we controlled drug vs. placebo and observed the improvement, so it was reasonable to infer causation. But if we had done the test on red/purple eyes and normal/vestigial wings, we could not say that eye color causes wing shape or vice versa. We could only say that these two phenotypes are associated.

2 × 2 Contingency Table χ2 Test of Independence

Back to the fly example (back-cross RrWw × rrww). Fill in the E's in the table: (57)(50)/100 = 28.5; (43)(50)/100 = 21.5.


                              eye color
Observed (Expected)      red            purple       Total
normal wing shape        39 (28.5)      11 (21.5)       50
vestigial wing shape     18 (28.5)      32 (21.5)       50
Total                    57             43             100

Are eye color and wing shape independent in this back-cross population?

Let p1 be the probability that a fly has red eyes if it has normal wings.
Let p2 be the probability that a fly has red eyes if it has vestigial wings.

H0: p1 = p2; eye color and wing shape are independent.
HA: p1 ≠ p2; eye color and wing shape are associated.

p̂1 = 39/50 = 0.78. p̂2 = 18/50 = 0.36.

Use a chi-squared test of independence (or chi-squared contingency table test).
Xs^2 = Σ (O − E)^2 / E has a χ2_1 distribution under H0.
Test at level α = 0.05; reject if Xs^2 > 3.84 = χ2_1(α).

    Xs^2 = (10.5)^2 (1/28.5 + 1/28.5 + 1/21.5 + 1/21.5) = 18.0.

Reject H0. This study provides evidence at the 0.05 significance level that eye color and wing shape are associated. [NOTE: the numerators (O − E)^2 are all the same here. This is true for 2 × 2 tables, but not for general r × k tables.]

Interpretation

You cannot say that normal wings cause red eyes or vice versa. An appropriate conclusion for this cross would be that wing shape and eye color are associated, or that flies with normal wings are more likely to have red eyes than flies with vestigial wings are. Since we only observed these variables and did not control them, we cannot infer any causal relationship. [In this case there is association because the two genes are linked and there is linkage disequilibrium.]

Comments

In this example we defined

• p1 = Pr(red eyes | normal wings), and
• p2 = Pr(red eyes | vestigial wings).

We could have defined
• p1 = Pr(normal wings | red eyes), and
• p2 = Pr(normal wings | purple eyes).

Or even
• p1 = Pr(purple eyes | normal wings), and
• p2 = Pr(purple eyes | vestigial wings).

While our estimates for p1 and p2 would be different in each case, the test statistic is the same. In the drug example, it did not make much sense to consider Pr(Drug | Improve) and Pr(Placebo | Improve).
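This invariance is easy to check numerically: transposing the table (swapping which variable defines the rows) or reordering the columns leaves the 2 × 2 statistic unchanged. A sketch (the helper name is my own):

```python
def chi2_2x2(table):
    """Chi-square statistic for a 2x2 table of observed counts."""
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    grand = sum(row)
    return sum(
        (table[i][j] - row[i] * col[j] / grand) ** 2 / (row[i] * col[j] / grand)
        for i in range(2)
        for j in range(2)
    )

fly = [[39, 11], [18, 32]]           # rows: wing shape; cols: eye color
transposed = [[39, 18], [11, 32]]    # rows: eye color; cols: wing shape
cols_swapped = [[11, 39], [32, 18]]  # purple column listed first

print(round(chi2_2x2(fly), 1))                              # about 18.0, matching the notes
print(abs(chi2_2x2(fly) - chi2_2x2(transposed)) < 1e-9)     # True
print(abs(chi2_2x2(fly) - chi2_2x2(cols_swapped)) < 1e-9)   # True
```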

10.4 Fisher's Exact Test
• Omit.

10.5 r × k Contingency Tables

The 2 × 2 contingency table test extends easily to r × k tables (r rows, k columns).

Example: 4 × 3 (#rows = 4, #columns = 3)

                                 Eye Color
Hair Color     Brown           Grey/Green        Blue             Total
Brown           438 ( 331.7)    1387 (1212.3)     807 (1088.0)     2632
Black           288 ( 154.1)     746 ( 563.3)     189 ( 505.6)     1223
Fair            115 ( 356.5)     946 (1303.0)    1768 (1169.5)     2829
Red              16 (  14.6)      53 (  53.4)      47 (  48.0)      116
Total           857             3132             2811             6800

Are hair color and eye color associated?

H0: Hair color and eye color are independent.
HA: Hair color and eye color are associated. [Must be non-directional.]

Do a χ2 test for independence. df = (r − 1)(k − 1) = (3)(2) = 6.
Xs^2 = Σ (O − E)^2 / E has a χ2_6 distribution under H0.
Test at the α = 0.0001 level. Critical value: χ2_6(0.0001) = 27.86.
Xs^2 = 1073.5 > 27.86. Reject H0.


This study provides evidence at the 0.0001 significance level that eye color and hair color are associated (not independent). The test of independence is testing whether, across all hair colors, the true probabilities of blue eyes are all the same as each other, whether the probabilities of brown eyes are all the same as each other, and so on for each eye color. For large tables we usually state H0 and HA in words only.
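The r × k calculation generalizes the 2 × 2 one directly. Here is a sketch for the hair/eye table (names are my own; any small difference from 1073.5 comes only from the rounding of the expected counts shown in the table above):

```python
def chi2_rxk(table):
    """Chi-square test of independence for an r x k table of observed counts.

    Returns (statistic, df) with df = (r - 1)(k - 1).
    """
    r, k = len(table), len(table[0])
    row = [sum(t) for t in table]
    col = [sum(t[j] for t in table) for j in range(k)]
    grand = sum(row)
    stat = sum(
        (table[i][j] - row[i] * col[j] / grand) ** 2 / (row[i] * col[j] / grand)
        for i in range(r)
        for j in range(k)
    )
    return stat, (r - 1) * (k - 1)

hair_eye = [
    [438, 1387, 807],   # Brown hair
    [288, 746, 189],    # Black hair
    [115, 946, 1768],   # Fair hair
    [16, 53, 47],       # Red hair
]
stat, df = chi2_rxk(hair_eye)
print(round(stat, 1), df)  # roughly 1073.5 with df = 6; far beyond the 27.86 critical value
```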

10.6 Applicability of Methods
• Read.

10.7 Confidence Interval for the Difference between Probabilities

The 2 × 2 contingency table test answers the question of whether the estimated probabilities p̂1 and p̂2 differ enough to conclude that the true probabilities p1 and p2 are not equal. By constructing a confidence interval for p1 − p2 we can get an idea of the magnitude of the difference.

Define (for i = 1, 2)

    p̃i = (yi + 1)/(ni + 2).

Note we are not using the Z^2_{α/2} quantities for this procedure — we just add one count to each category/cell of the table.

The standard error of p̃1 − p̃2 is defined by

    SE(p̃1 − p̃2) = sqrt(SE1^2 + SE2^2) = sqrt( p̃1(1 − p̃1)/(n1 + 2) + p̃2(1 − p̃2)/(n2 + 2) ).

A (1 − α)100% confidence interval for p1 − p2 is given by

    p̃1 − p̃2 ± Zα/2 · SE(p̃1 − p̃2).

Recall our drug example:

                         Treatment
Observed            Drug    Placebo    Total
Improve               15          4       19
Don't Improve         11         17       28
Total                 26         21       47


p̃1 = (15 + 1)/(26 + 2) = 0.571,    p̃2 = (4 + 1)/(21 + 2) = 0.217.

SE(p̃1 − p̃2) = sqrt(SE1^2 + SE2^2)
             = sqrt( p̃1(1 − p̃1)/(n1 + 2) + p̃2(1 − p̃2)/(n2 + 2) )
             = sqrt( 0.008749 + 0.007387 )
             = 0.1271.

The 95% C.I. for p1 − p2 is 0.571 − 0.217 ± 1.96(0.1271), i.e. 0.354 ± 0.249 = [0.105, 0.603].

Interpretation??

Omit Sections 10.8 and 10.9. Summary on page 454.
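The interval above can be reproduced as follows. This is only a sketch (the function name is my own, and 1.96 is hard-coded as the two-sided 95% Z value):

```python
import math

def plus_one_ci(y1, n1, y2, n2, z=1.96):
    """Confidence interval for p1 - p2 using the adjusted estimates
    p~i = (yi + 1)/(ni + 2) ("add one count to each cell")."""
    p1 = (y1 + 1) / (n1 + 2)
    p2 = (y2 + 1) / (n2 + 2)
    se = math.sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Drug example: 15 of 26 improve on the drug, 4 of 21 on the placebo.
lo, hi = plus_one_ci(15, 26, 4, 21)
print(round(lo, 3), round(hi, 3))  # about (0.105, 0.603), matching the notes
```

Since the interval excludes 0, it agrees with the earlier test's conclusion that the drug and placebo improvement probabilities differ.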