Business Statistics COMPARISONS

Author: Victor Harrell
Business Statistics

TWO $\mu$s OR MEDIANS: COMPARISONS

CONTENTS
• Comparing two samples
• Comparing two unrelated samples
• Comparing the means of two unrelated samples
• Comparing the medians of two unrelated samples
• Old exam question

COMPARING TWO SAMPLES
It often happens that we want to compare two situations:
• do I sell more when there is music in my shop?
• is the expensive machine more precise than the cheap one?
• are advertisements on TV and on the internet equally profitable?
• do people buy more on Tuesdays than on Wednesdays?
• in couples, who drinks more: the man or the woman?
• etc.

COMPARING TWO SAMPLES
In all these questions we compare two populations.
• Situation 1: two populations (or sub-populations) with a similar variable
  • sales in 105 days without music
  • sales in 96 days with music
• Data matrix: two options

[Figure: the two data-matrix layouts; a callout marks the presentation that SPSS requires]

COMPARING TWO SAMPLES
• Situation 2: one sample with paired observations
  • drinks of the man in 78 couples
  • drinks of the woman in the same 78 couples
• Data matrix: one option only
• Will be discussed in a later lecture

COMPARING TWO UNRELATED SAMPLES
Situation 1
• independent samples/unrelated samples
• introduce symbols for the two random variables
  • e.g., using $X_1$ and $X_2$: $X_1$ with sample $X_{1,1}, X_{1,2}, \ldots, X_{1,n_1}$ and $X_2$ with sample $X_{2,1}, X_{2,2}, \ldots, X_{2,n_2}$
  • or using $X$ and $Y$: $X$ with sample $X_1, X_2, \ldots, X_{n_X}$ and $Y$ with sample $Y_1, Y_2, \ldots, Y_{n_Y}$
• sample sizes can be different

Or, of course, using "meaningful" indices: $X_B$ and $X_G$ for Belgium and Germany. Not $B$ and $G$, because we need to stress that it is "about" a variable $X$ (like sales).

COMPARING TWO UNRELATED SAMPLES
We want to test hypotheses such as
• are the means equal? $H_0: \mu_X = \mu_Y$ or $H_0: \mu_1 = \mu_2$ or $H_0: \mu_{X_1} = \mu_{X_2}$ or ...
• are the variances equal? $H_0: \sigma_X^2 = \sigma_Y^2$ or etc.
• are the proportions equal? $H_0: \pi_X = \pi_Y$ or etc.

Also:
• inequalities, like $H_0: \mu_X \geq \mu_Y$
• and non-zero differences, like $H_0: \mu_X = \mu_Y + 85$

COMPARING TWO UNRELATED SAMPLES
Context:
• sample $X_1$: sales in $n_1 = 105$ days without music
• sample $X_2$: sales in $n_2 = 96$ days with music

General idea:
• $X_1 \sim \text{distribution}(\theta_1)$ and $X_2 \sim \text{distribution}(\theta_2)$
• question: is $\theta_1 = \theta_2$?

COMPARING THE MEANS OF TWO UNRELATED SAMPLES
Assumption (for now!):
• $X \sim N(\mu_X; \sigma_X^2)$
• $Y \sim N(\mu_Y; \sigma_Y^2)$
• in words: both samples come from normally distributed populations with known variances

Question
• are $\mu_X$ and $\mu_Y$ different?
• can we test this, on the basis of the (limited) evidence concerning $\bar{x}$ and $\bar{y}$?
• so, can we reject $H_0: \mu_X = \mu_Y$?

To decide
• use $\bar{X} - \bar{Y} \sim N(\mu_{\bar{X}-\bar{Y}}, \sigma^2_{\bar{X}-\bar{Y}})$

COMPARING THE MEANS OF TWO UNRELATED SAMPLES
For one sample, we had

$$\frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} \sim N(0,1)$$

As it turns out, for two samples, we have

$$\frac{(\bar{X} - \bar{Y}) - (\mu_{\bar{X}} - \mu_{\bar{Y}})}{\sigma_{\bar{X}-\bar{Y}}} \sim N(0,1)$$

• $\mu_{\bar{X}} - \mu_{\bar{Y}} = \mu_X - \mu_Y$ follows from the null hypothesis, for instance $H_0: \mu_X = \mu_Y$ or $H_0: \mu_X - \mu_Y = 85$
• $\bar{x}$ and $\bar{y}$ are obtained from the data
• but what is $\sigma_{\bar{X}-\bar{Y}}$?

COMPARING THE MEANS OF TWO UNRELATED SAMPLES
For one sample, we had

$$\sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{n}$$

As it turns out, for two independent samples, we have $\sigma^2_{\bar{X}-\bar{Y}} = \sigma^2_{\bar{X}} + \sigma^2_{\bar{Y}}$, so

$$\sigma_{\bar{X}-\bar{Y}} = \sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}$$

• recall that variances add up when $X$ and $Y$ are independent
• e.g., $\sigma^2_{X+Y} = \sigma_X^2 + \sigma_Y^2$ but also $\sigma^2_{X-Y} = \sigma_X^2 + \sigma_Y^2$
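A quick simulation can make the last point concrete. The sketch below draws two independent normal samples and checks that the variance of the sum and the variance of the difference are both close to $2^2 + 3^2 = 13$. The means (50 and 30), the standard deviations (2 and 3) and the number of draws are hypothetical choices, not values from the lecture.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

N = 100_000  # number of simulated draws (arbitrary choice)

# Hypothetical populations: X ~ N(50, 2^2) and Y ~ N(30, 3^2), drawn independently.
xs = [random.gauss(50, 2) for _ in range(N)]
ys = [random.gauss(30, 3) for _ in range(N)]

# Sample variances of the sum and of the difference.
var_sum = statistics.variance([x + y for x, y in zip(xs, ys)])
var_diff = statistics.variance([x - y for x, y in zip(xs, ys)])

# Both values should be close to 2**2 + 3**2 = 13.
print(round(var_sum, 2), round(var_diff, 2))
```

The minus sign in $X - Y$ does not reduce the spread: the uncertainty in $Y$ adds to the uncertainty in $X$ whether $Y$ is added or subtracted.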

COMPARING THE MEANS OF TWO UNRELATED SAMPLES
Example

Context:
• do I sell more when there is music in my shop?

Experiment
• on some days the music is turned on, on other days the music is turned off
• you keep track of the sales during each day

Data:
• sample of sales on days without music ($x_1, x_2, \ldots, x_{105}$)
• sample of sales on days with music ($y_1, y_2, \ldots, y_{96}$)

Five step procedure

COMPARING THE MEANS OF TWO UNRELATED SAMPLES
• Step 1:
  • $H_0: \mu_X = \mu_Y$; $H_1: \mu_X \neq \mu_Y$; $\alpha = 0.05$
• Step 2:
  • sample statistic: $\bar{X} - \bar{Y}$
  • reject for "too large" and "too small" values
• Step 3:
  • null distribution: $\dfrac{(\bar{X}-\bar{Y}) - (\mu_X - \mu_Y)}{\sigma_{\bar{X}-\bar{Y}}} = \dfrac{\bar{X}-\bar{Y}}{\sigma_{\bar{X}-\bar{Y}}} \sim N(0,1)$
  • valid because ...
• Step 4:
  • $z_{\text{calc}} = \ldots$
  • $z_{\text{crit}} = \ldots$
• Step 5:
  • reject or not reject because ...

In a minute we will supply full details and a worked example ...
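The five steps can be sketched in Python for the known-variance case. Only the sample sizes 105 and 96 come from the music example; the sample means and the population standard deviations below are hypothetical placeholders, and the helper name `two_sample_z` is an illustrative assumption.

```python
import math
from statistics import NormalDist

def two_sample_z(xbar, ybar, var_x, var_y, n_x, n_y, delta0=0.0, alpha=0.05):
    """Two-sided z-test of H0: mu_X - mu_Y = delta0 with known population variances."""
    # Standard error of X-bar minus Y-bar for independent samples.
    se = math.sqrt(var_x / n_x + var_y / n_y)
    # Step 4: calculated and critical values.
    z_calc = (xbar - ybar - delta0) / se
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z_calc)))
    # Step 5: reject when z_calc falls in either tail.
    reject = abs(z_calc) > z_crit
    return z_calc, z_crit, p_value, reject

# Hypothetical summary data: X = days without music, Y = days with music.
z_calc, z_crit, p_value, reject = two_sample_z(
    xbar=1000.0, ybar=1050.0,          # assumed sample means
    var_x=150.0**2, var_y=160.0**2,    # assumed known population variances
    n_x=105, n_y=96,
)
print(f"z_calc={z_calc:.3f}, z_crit={z_crit:.3f}, p={p_value:.4f}, reject={reject}")
```

Setting `delta0=85` would test a non-zero difference such as $H_0: \mu_X - \mu_Y = 85$ with the same code.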

COMPARING THE MEANS OF TWO UNRELATED SAMPLES
• But, wait ...
• ... isn't it weird to assume that $\sigma_X^2$ and $\sigma_Y^2$ are known, while $\mu_X$ and $\mu_Y$ are not known?
• In reality the population variances will often be unknown as well!
  • remember we had the same problem in the one-sample case?
  • there we decided to estimate the value of $\sigma^2$ with the value of $s^2$, and paid the price of using the wider $t$-distribution
  • here we will do the same: estimate the two $\sigma^2$-values with two $s^2$-values
  • and pay the same price: use the $t$-distribution instead of the $z$-distribution

COMPARING THE MEANS OF TWO UNRELATED SAMPLES
For one sample, we had

$$\frac{\bar{X} - \mu_{\bar{X}}}{S_{\bar{X}}} \sim t_{\text{df}}$$

As it turns out, for two samples, we have

$$\frac{(\bar{X} - \bar{Y}) - (\mu_{\bar{X}} - \mu_{\bar{Y}})}{S_{\bar{X}-\bar{Y}}} \sim t_{\text{df}}$$

• $\mu_{\bar{X}} - \mu_{\bar{Y}} = \mu_X - \mu_Y$ follows from the null hypothesis
• $\bar{x}$ and $\bar{y}$ are obtained from the data
• but what is $s_{\bar{X}-\bar{Y}}$?
• and how to choose df?

COMPARING THE MEANS OF TWO UNRELATED SAMPLES
Two options for $s_{\bar{X}-\bar{Y}}$:
• 1: estimating $\sigma_X^2$ and $\sigma_Y^2$ from $s_X^2$ and $s_Y^2$ respectively
• 2: assuming $\sigma_X^2 = \sigma_Y^2 = \sigma^2$ and estimating $\sigma^2$ as the weighted average of both sample variances

The two options lead to different values of df.

COMPARING THE MEANS OF TWO UNRELATED SAMPLES
Option 1:
• estimating $\sigma_X^2$ and $\sigma_Y^2$ from $s_X^2$ and $s_Y^2$ respectively

$$s_{\bar{X}-\bar{Y}} = \sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}$$

Compare to $\sigma_{\bar{X}-\bar{Y}} = \sqrt{\dfrac{\sigma_X^2}{n_X} + \dfrac{\sigma_Y^2}{n_Y}}$

• testing with $t$-distribution with

$$\text{df} = \frac{\left(\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}\right)^2}{\dfrac{\left(s_X^2/n_X\right)^2}{n_X - 1} + \dfrac{\left(s_Y^2/n_Y\right)^2}{n_Y - 1}}$$

• quick rule, but bad approximation: $\text{df} \approx \min(n_X - 1, n_Y - 1)$
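Option 1 is easy to compute by machine. The sketch below evaluates the standard error and the degrees of freedom from the two sample variances; this df formula is commonly known as the Welch-Satterthwaite approximation. The function name `welch_se_df` and the input numbers are hypothetical, chosen only for illustration.

```python
import math

def welch_se_df(s2_x, n_x, s2_y, n_y):
    """Standard error and approximate df when the two variances are estimated separately."""
    a = s2_x / n_x  # estimated variance of X-bar
    b = s2_y / n_y  # estimated variance of Y-bar
    se = math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n_x - 1) + b ** 2 / (n_y - 1))
    return se, df

# Hypothetical sample variances; sample sizes as in the music example.
se, df = welch_se_df(s2_x=150.0**2, n_x=105, s2_y=160.0**2, n_y=96)
quick_df = min(105 - 1, 96 - 1)  # the quick rule from the slide
print(f"se={se:.3f}, df={df:.1f}, quick rule df={quick_df}")
```

The computed df is usually not an integer, and it always lies between the quick-rule value $\min(n_X - 1, n_Y - 1)$ and $n_X + n_Y - 2$, which is why the quick rule is a conservative approximation.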

COMPARING THE MEANS OF TWO UNRELATED SAMPLES
Option 2:
• estimating the common $\sigma^2$ from both samples
• a "weighted mean" of $s_X^2$ and $s_Y^2$, the pooled variance $s_P^2$

$$s_P^2 = \frac{(n_X - 1)\, s_X^2 + (n_Y - 1)\, s_Y^2}{(n_X - 1) + (n_Y - 1)}$$

• and

$$s_{\bar{X}-\bar{Y}} = \sqrt{\frac{s_P^2}{n_X} + \frac{s_P^2}{n_Y}}$$

Compare to $s_{\bar{X}-\bar{Y}} = \sqrt{\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}}$

• testing with $t$-distribution with $\text{df} = (n_X - 1) + (n_Y - 1) = n_X + n_Y - 2$


COMPARING THE MEANS OF TWO UNRELATED SAMPLES
Use of SPSS
• a data set on Computer Anxiety Rating split by gender

[SPSS output: results split by gender, and results of the $t$-test]

COMPARING THE MEANS OF TWO UNRELATED SAMPLES
Zoom in

[SPSS output of the $t$-test, with callouts:]
• $t$-test with pooled estimate of $\sigma_X^2 = \sigma_Y^2$
• $t$-test with separate estimates of $\sigma_X^2$ and $\sigma_Y^2$
• value of the $t$-statistic ($t_{\text{calc}}$)
• degrees of freedom
• $p$-value (2-sided)

COMPARING THE MEANS OF TWO UNRELATED SAMPLES
And one more thing ...

[SPSS output, with callouts:]
• test of the assumption of equal variances: $H_0: \sigma_X^2 = \sigma_Y^2$ versus $H_1: \sigma_X^2 \neq \sigma_Y^2$
• $p$-value for this test

COMPARING THE MEANS OF TWO UNRELATED SAMPLES
For these two tests, we need both $\bar{X}$ and $\bar{Y}$ to be normally distributed.
• This holds in any one of the following three cases:
  • $X$ and $Y$ have normally distributed populations
  • $X$ has a symmetric distribution and $n_X \geq 15$, and the same holds for $Y$
  • $n_X \geq 30$ and $n_Y \geq 30$
• Very similar to the one-sample case!

COMPARING THE MEDIANS OF TWO UNRELATED SAMPLES
• Recall the non-parametric one-sample test for the median
  • the Wilcoxon signed ranks test
  • replacing the values by ranks and testing the sum of the positive ranks
• Can we also develop a non-parametric (rank-order) test for two unrelated samples?
• Yes we can: the Wilcoxon-Mann-Whitney test
  • named after Frank Wilcoxon, Henry Mann, and Donald Whitney
  • also named Wilcoxon (Mann-Whitney) test, Mann-Whitney test, etc.

COMPARING THE MEDIANS OF TWO UNRELATED SAMPLES
• Computational steps of the Wilcoxon-Mann-Whitney test
  • combine both samples ($X$ and $Y$)
  • assign ranks to the combined sample
  • ties get an average rank
  • sum the ranks of both samples separately ($T_X$ and $T_Y$)
  • compare the test statistic $T_X$ (or $T_Y$) to a critical value from the table

COMPARING THE MEDIANS OF TWO UNRELATED SAMPLES
Example (same as before)
• Sample data are collected on the capacity rates (in %) for two factories
  • factory A: the rates are 71, 82, 77, 94, 88
  • factory B: the rates are 85, 82, 92, 97
• Are the median operating rates for the two factories the same (at a significance level $\alpha = 0.05$)?

COMPARING THE MEDIANS OF TWO UNRELATED SAMPLES
Example
• data A: $x_i$ ($n_X = 5$)
• data B: $y_i$ ($n_Y = 4$)
• one case of ties (82): observations 3 and 4 are both 82, so assign rank 3.5 to each
• $T_Y = 24.5$
• to facilitate the discussion, we focus on the sample with the smallest sample size

        Capacity (%)                Rank
  Factory A   Factory B     Factory A   Factory B
     71                        1
     77                        2
     82                        3.5
                 82                        3.5
                 85                        5
     88                        6
                 92                        7
     94                        8
                 97                        9
  Rank sums:                   20.5        24.5
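The rank sums in this example can be reproduced with a few lines of Python. This is a minimal sketch of the computational steps (combine, rank with average ranks for ties, sum per sample); the helper names `average_ranks` and `rank_sums` are illustrative assumptions, not part of the lecture.

```python
def average_ranks(values):
    """Return 1-based ranks for values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the end j of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = ((i + 1) + (j + 1)) / 2  # average of rank positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sums(x, y):
    """Rank the combined sample and return the rank sums (T_X, T_Y)."""
    ranks = average_ranks(list(x) + list(y))
    return sum(ranks[:len(x)]), sum(ranks[len(x):])

factory_a = [71, 82, 77, 94, 88]
factory_b = [85, 82, 92, 97]
t_x, t_y = rank_sums(factory_a, factory_b)
print(t_x, t_y)  # 20.5 24.5
```

As a check, the two rank sums always add up to $n(n+1)/2$ with $n = n_X + n_Y$; here $20.5 + 24.5 = 45 = 9 \cdot 10 / 2$.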

COMPARING THE MEDIANS OF TWO UNRELATED SAMPLES
Testing the Wilcoxon-Mann-Whitney $T$ statistic
• using a table of critical values
  • included in tables at exam
• using a normal approximation
  • valid for large samples, when the Wilcoxon-Mann-Whitney table of critical values is not sufficient

COMPARING THE MEDIANS OF TWO UNRELATED SAMPLES
Table of critical values of Wilcoxon statistic
• for $n_1 = n_Y = 4$ (the smaller sample) and $n_2 = n_X = 5$ at $\alpha = 0.05$:
  • $T_{\text{lower}} = 11$, $T_{\text{upper}} = 29$
  • $R_{\text{crit}} = [0, 11] \cup [29, 45]$

COMPARING THE MEDIANS OF TWO UNRELATED SAMPLES
Conclusion from small sample Wilcoxon-Mann-Whitney test
• $T_Y = 24.5$ is between $T_{\text{lower}} = 11$ and $T_{\text{upper}} = 29$
• Therefore, do not reject the null hypothesis ($H_0: M_X = M_Y$) at the 5% level
• There is not enough evidence to conclude that the medians are different

COMPARING THE MEDIANS OF TWO UNRELATED SAMPLES
Large sample approximation
• Under $H_0$, it can be shown that

$$E(T_Y) = \frac{n_Y (n_X + n_Y + 1)}{2}$$

$$\operatorname{var}(T_Y) = \frac{n_X n_Y (n_X + n_Y + 1)}{12}$$

• Further, when $n_X \geq 10$ or $n_Y \geq 10$, we use a normal approximation:

$$T_Y \sim N\left(\frac{n_Y (n_X + n_Y + 1)}{2},\ \frac{n_X n_Y (n_X + n_Y + 1)}{12}\right)$$

$$Z = \frac{T_Y - \dfrac{n_Y (n_X + n_Y + 1)}{2}}{\sqrt{\dfrac{n_X n_Y (n_X + n_Y + 1)}{12}}} \sim N(0,1)$$

COMPARING THE MEDIANS OF TWO UNRELATED SAMPLES
Large sample approximation (continued)
• so you can compute

$$z_{\text{calc}} = \frac{T_{Y,\text{calc}} - \dfrac{n_Y (n_X + n_Y + 1)}{2}}{\sqrt{\dfrac{n_X n_Y (n_X + n_Y + 1)}{12}}}$$

• and compare it to $z_{\text{crit}}$ (e.g., $\pm 1.96$)
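The large-sample computation can be sketched as follows. The sample sizes and the rank sum in the usage lines are hypothetical, the function name `wmw_z` is an illustrative assumption, and the sketch applies the formula exactly as stated above, without any correction for ties.

```python
import math
from statistics import NormalDist

def wmw_z(t_y, n_x, n_y):
    """z-score and two-sided p-value for the rank sum T_Y under the normal approximation."""
    mean = n_y * (n_x + n_y + 1) / 2          # E(T_Y) under H0
    var = n_x * n_y * (n_x + n_y + 1) / 12    # var(T_Y) under H0
    z = (t_y - mean) / math.sqrt(var)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical example: n_X = 12, n_Y = 10 and an observed rank sum T_Y = 140.
z_calc, p_value = wmw_z(t_y=140, n_x=12, n_y=10)
z_crit = NormalDist().inv_cdf(0.975)  # two-sided test at alpha = 0.05
print(f"z_calc={z_calc:.3f}, z_crit=±{z_crit:.2f}, p={p_value:.4f}")
```

With these made-up numbers $E(T_Y) = 115$ and $\operatorname{var}(T_Y) = 230$, so $z_{\text{calc}}$ stays inside $\pm 1.96$ and $H_0$ would not be rejected.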

COMPARING THE MEDIANS OF TWO UNRELATED SAMPLES
Use of SPSS

[SPSS output of the Wilcoxon-Mann-Whitney test, with callouts:]
• $T = 345$
• $z$-score with normal approximation
• $p$-value (2-sided)

OLD EXAM QUESTION 21 May 2015, Q2a