Statistics 100 Summarizing bivariate data

Author: Carmel Quinn
Bivariate data:
• Each observation is measured on two variables.
• Typically we want to summarize the relationship between the two variables.

Examples: Obtain information on 50 subjects in a study.
• Measure weight and height for each subject.
• Record race and LDL cholesterol level.
• Record political party affiliation and marital status.

Three types of data combinations:
• Two quantitative variables
• One quantitative and one categorical variable
• Two categorical variables

Examples:
• Weight and height
• Cholesterol level and race
• Party affiliation and marital status

Motivating example: Consumer Reports 1993 Car data (each observation is a car)
This data set contains 93 observations on 27 variables, of which we’ll examine:
• Engine size (liters)
• Horsepower
• Weight (pounds)
• City MPG (miles per gallon)
• Highway MPG (miles per gallon)
• Type of vehicle (compact, large, midsize, small, sporty, van)

A graphical method for summarizing bivariate data: Scatter plot

What to look for in a scatter plot:
• Type of association (linear, curved, etc.)
• Strength of association (amount of scatter)
• Direction of association (positive, negative)
  – A positive relationship is a scatter that goes up and to the right.
  – A negative relationship is a scatter that goes down and to the right.
• Special features (outliers, other unusual aspects)

[Figure: Consumer Reports 1993 car data. Scatter plot of Highway MPG (vertical axis) against City MPG (horizontal axis).]

Example scatter plots: City MPG versus Highway MPG
• strong positive linear relationship
• no outliers – the points to the right are consistent with the linear pattern

[Figure: Consumer Reports 1993 car data. Scatter plot of Horsepower (vertical axis) against Engine size in liters (horizontal axis).]

Example scatter plots: (continued) Engine size versus Horsepower
• positive linear relationship
• more variability at higher engine sizes than at low engine sizes
• an outlier with low engine size and high horsepower (Mazda RX-7)

[Figure: Consumer Reports 1993 car data. Scatter plot of City MPG (vertical axis) against Weight in lbs (horizontal axis).]

Example scatter plots: (continued) Vehicle Weight and City MPG
• moderately strong negative relationship
• no unusual observations
• relationship is curved

Can add a scatter plot “smoother” to see the curved relationship better.
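The slides do not say which smoother is used. As a rough sketch of the idea only, the Python below uses a running mean: each point's smoothed value is the average of the y-values of its nearest neighbours in x. The function name `running_mean_smoother`, the window parameter `k`, and the data are all hypothetical, made up for illustration, and are not taken from the car data set.

```python
def running_mean_smoother(xs, ys, k=1):
    """Smooth y against x with a running mean.

    Points are sorted by x; each smoothed value is the average of the
    y-values of the point itself and up to k neighbours on each side.
    """
    pairs = sorted(zip(xs, ys))
    smoothed = []
    for i, (x, _) in enumerate(pairs):
        lo = max(0, i - k)                # window is truncated at the edges
        hi = min(len(pairs), i + k + 1)
        window = [y for _, y in pairs[lo:hi]]
        smoothed.append((x, sum(window) / len(window)))
    return smoothed


# Hypothetical (weight in lbs, city MPG) values, for illustration only.
weights = [1800, 2200, 2500, 2800, 3100, 3400, 3700, 4000]
city_mpg = [42, 33, 29, 25, 22, 20, 19, 18]

curve = running_mean_smoother(weights, city_mpg, k=1)
for weight, mpg in curve:
    print(f"{weight:5d} lbs -> smoothed city MPG {mpg:.1f}")
```

Plotting the smoothed points as a connected line on top of the scatter plot makes the flattening of the curve at higher weights easier to see. A larger `k` gives a smoother but less responsive curve.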

[Figure: Consumer Reports 1993 car data. Scatter plot of City MPG (vertical axis) against Weight in lbs (horizontal axis), with a smoother added.]

Graphically representing the relationship between a quantitative variable and a categorical variable: Side-by-side boxplots

Idea:
• Divide sample into groups according to the categorical variable.
• Construct boxplots for the quantitative variable within each group.

Example: Weight by car type

[Figure: Consumer Reports 1993 car data. Side-by-side boxplots of Weight in lbs (vertical axis) by Car type (Compact, Large, Midsize, Small, Sporty, Van).]
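The numbers behind side-by-side boxplots can be sketched as follows: split the records by group, then compute a five-number summary within each group. This is a minimal Python sketch under stated assumptions. The weights are hypothetical values made up for illustration (not the Consumer Reports data), the quartiles use the "inclusive" method of `statistics.quantiles` (textbooks differ slightly in how they define quartiles), and the outlier rule is assumed to be the common 1.5 × IQR rule.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def outliers(values, k=1.5):
    """Values more than k * IQR outside the quartiles (assumed rule, k = 1.5)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


def group_by(records):
    """Collect (label, value) records into {label: [values]}."""
    groups = {}
    for label, value in records:
        groups.setdefault(label, []).append(value)
    return groups


# Hypothetical (car type, weight in lbs) records, for illustration only.
records = [
    ("Small", 1700), ("Small", 2300), ("Small", 2400), ("Small", 2500), ("Small", 2600),
    ("Van", 3500), ("Van", 3700), ("Van", 3800), ("Van", 3900), ("Van", 4100),
]

groups = group_by(records)
summaries = {label: five_number_summary(vals) for label, vals in groups.items()}
flagged = {label: outliers(vals) for label, vals in groups.items()}

for label in groups:
    print(label, summaries[label], "outliers:", flagged[label])
```

Each tuple in `summaries` corresponds to one box: the box spans Q1 to Q3, the line inside it is the median, and values in `flagged` would be drawn as separate points.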

Comments on boxplot:
• Vans and large cars tend to be heavier, on average.
• Small cars are the lightest type of cars.
• Sporty cars have the largest variability in weights, both in the IQR and range.
• There are a couple of small cars that are outliers; these happen to be a Ford Festiva (1845 lbs) and a Geo Metro (1695 lbs).

Numerical summaries of bivariate data:

Measuring the linear association between two quantitative variables: correlation

Suppose we collect a sample of n observations on two variables. The values of the first variable are $x_1, x_2, \ldots, x_n$ and the values of the second variable are $y_1, y_2, \ldots, y_n$.


Let
• $\bar{x}$ = the sample mean of $x_1, \ldots, x_n$,
• $\bar{y}$ = the sample mean of $y_1, \ldots, y_n$,
• $s_x$ = the standard deviation of $x_1, \ldots, x_n$, and
• $s_y$ = the standard deviation of $y_1, \ldots, y_n$.

Then the correlation coefficient is defined as

\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)\left(\sum_{i=1}^{n} (y_i - \bar{y})^2\right)}}
  = \frac{1}{n-1} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}
\]
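The two forms of the correlation formula can be checked against each other numerically. The Python sketch below computes r both ways; the function names and the five data points are hypothetical, chosen only for illustration, and `statistics.stdev` is used as the sample standard deviation (divisor n − 1).

```python
from math import sqrt
from statistics import mean, stdev


def correlation_from_sums(xs, ys):
    """r as the sum of cross products over the root of the product of sums of squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_from_sds(xs, ys):
    """r as 1/(n-1) times the sum of cross products, divided by s_x * s_y."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / ((n - 1) * stdev(xs) * stdev(ys))


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_sums = correlation_from_sums(xs, ys)
r_sds = correlation_from_sds(xs, ys)
print(round(r_sums, 4), round(r_sds, 4))
```

For these points the sum of cross products is 6 and the sums of squares are 10 and 6, so both functions give r = 6/√60 ≈ 0.7746. The forms agree because $s_x s_y$ equals the square root of the product of the two sums of squares divided by n − 1.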

Properties of correlation:
• −1 ≤ r ≤ 1 always
• r = 1 when all the points $(x_i, y_i)$ lie on a line with positive slope
• r = −1 when all the points $(x_i, y_i)$ lie on a line with negative slope
• When r = 0, there is no positive or negative linear association between the two variables (though the two variables may have a non-linear relationship).

See following scatter plots for examples of different correlations.
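As a small worked check of the last property, consider the five points (−2, 4), (−1, 1), (0, 0), (1, 1), (2, 4). These points are chosen here for illustration and are not from the car data. They lie exactly on the curve $y = x^2$, so the two variables are perfectly related, yet $\bar{x} = 0$, $\bar{y} = 2$, and the numerator of r is

\[
\sum_{i=1}^{5} (x_i - \bar{x})(y_i - \bar{y})
  = (-2)(2) + (-1)(-1) + (0)(-2) + (1)(-1) + (2)(2)
  = -4 + 1 + 0 - 1 + 4 = 0,
\]

so r = 0. The positive and negative contributions cancel because the relationship is symmetric about $\bar{x}$; r measures only the linear part of an association.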

[Figures: six example scatter plots of Y against X, with r = 1, r = 0.73, r = −0.73, r = 0.11, r = 0, and r = 0.]

Correlations for car data:
• City MPG and Highway MPG: r = 0.944
• Engine size and Horsepower: r = 0.732
• Vehicle Weight and City MPG: r = −0.843

Summarizing bivariate categorical data: Contingency tables

Example: Berkeley graduate admissions data (1973)

                Male   Female    Total
Admitted        3738     1494     5232
Not admitted    4704     2827     7531
Total           8442     4321    12763

This represents a sample of 12763 admission candidates cross-classified into four categories.

Frequencies of certain categories:
• Frequency of candidates who were admitted males:
  (# admitted males) / (# total candidates) = 3738 / 12763 = 0.293
• (Marginal) Frequency of candidates who were admitted:
  (# admitted) / (# total candidates) = 5232 / 12763 = 0.410
• (Conditional) Frequency of males that were admitted:
  (# admitted males) / (# total males) = 3738 / 8442 = 0.443
• (Conditional) Frequency of females that were admitted:
  (# admitted females) / (# total females) = 1494 / 4321 = 0.346
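The same frequencies can be computed directly from the four cell counts of the contingency table. The Python sketch below uses the counts given above; the function names `joint`, `marginal`, and `conditional` are hypothetical names chosen for this illustration.

```python
# Cell counts from the Berkeley admissions table: (sex, outcome) -> count.
counts = {
    ("Male", "Admitted"): 3738,
    ("Male", "Not admitted"): 4704,
    ("Female", "Admitted"): 1494,
    ("Female", "Not admitted"): 2827,
}
total = sum(counts.values())


def joint(sex, outcome):
    """Frequency of one cell relative to all candidates."""
    return counts[(sex, outcome)] / total


def marginal(outcome):
    """Frequency of an outcome, summed over both sexes."""
    return sum(c for (_, o), c in counts.items() if o == outcome) / total


def conditional(outcome, sex):
    """Frequency of an outcome among candidates of the given sex."""
    sex_total = sum(c for (s, _), c in counts.items() if s == sex)
    return counts[(sex, outcome)] / sex_total


print(total)                                       # 12763
print(round(joint("Male", "Admitted"), 3))         # 0.293
print(round(marginal("Admitted"), 3))              # 0.41
print(round(conditional("Admitted", "Male"), 3))   # 0.443
print(round(conditional("Admitted", "Female"), 3)) # 0.346
```

The only difference between the three kinds of frequency is the denominator: the grand total for joint and marginal frequencies, and a row or column total for conditional frequencies.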


Notice that males are admitted more frequently than females!

A closer look at Berkeley admissions data (by major):

          Male                  Female
Major     Number    Percent     Number    Percent
          applied   admitted    applied   admitted
A         825       62          108       82
B         560       63          25        68
C         325       37          593       34
D         417       33          375       35
E         191       28          393       24
F         373       6           341       7
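The by-major table can be compared with the pooled admission rates in a few lines of Python. This is a sketch under two stated assumptions: the admitted counts are approximated as applicants × percent / 100, because the table gives only rounded percentages, and the pooling covers only the six majors listed, which is a subset of the 12763 candidates in the overall table. The variable and function names are hypothetical.

```python
# (number applied, percent admitted) by major and sex, from the table above.
by_major = {
    "A": {"Male": (825, 62), "Female": (108, 82)},
    "B": {"Male": (560, 63), "Female": (25, 68)},
    "C": {"Male": (325, 37), "Female": (593, 34)},
    "D": {"Male": (417, 33), "Female": (375, 35)},
    "E": {"Male": (191, 28), "Female": (393, 24)},
    "F": {"Male": (373, 6), "Female": (341, 7)},
}


def pooled_rate(sex):
    """Approximate admission rate for one sex, pooled over the six majors."""
    applied = sum(row[sex][0] for row in by_major.values())
    # Approximate admitted counts from the rounded percentages.
    admitted = sum(row[sex][0] * row[sex][1] / 100 for row in by_major.values())
    return admitted / applied


female_higher = [
    major for major, row in by_major.items()
    if row["Female"][1] > row["Male"][1]
]

print("Majors where females have the higher rate:", female_higher)
print("Pooled male rate:  ", round(pooled_rate("Male"), 2))
print("Pooled female rate:", round(pooled_rate("Female"), 2))
```

Females have the higher admission rate in majors A, B, D, and F, yet the pooled male rate (about 0.45) is well above the pooled female rate (about 0.30). The pooled rates are weighted by where each group applied, and female applicants are concentrated in the majors with low admission rates.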

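This is an example of Simpson’s paradox. Overall, males are admitted at a higher rate than females, but when a new variable (Major) is introduced, the relationship reverses: in four of the six majors (A, B, D, and F) females are admitted at a higher rate than males. The reason is that about half of the male applicants to these six majors applied to A and B, which admit most of their applicants, whereas over 90% of the female applicants applied to C through F, which admit far fewer. This phenomenon happens surprisingly often!

Some graphical displays for multivariate data:
• Scatter plot matrix
  – Examine relationship of variables in pairs
  – Allows seeing relationship between all pairs of variables simultaneously
• Chernoff faces – a truly multivariate technique
  – Each observation is a face
  – Each variable is a facial feature
  – Invented mostly as a joke!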

[Figure: Scatterplot matrix of car data, showing scatter plots for all pairs of Weight, MPG.city, MPG.highway, EngineSize, and Horsepower.]

Example Chernoff faces: Car data
• Width of face – vehicle weight
• Height of face split – City MPG
• Length of face – Highway MPG
• Width of top half of face – Horsepower
• Width of bottom half of face – Engine size

Other facial features: Length of nose, curvature of mouth, size of eyes, angle of eyebrows, etc.

[Figure: Chernoff faces for the car data, one face per car: Acura Integra, Acura Legend, Audi 90, Audi 100, Cadillac Seville, Chevrolet Cavalier, Chevrolet Corsica, Chevrolet Camaro, Chrysler LeBaron, Chrysler Imperial, Dodge Colt, Dodge Shadow, Dodge Spirit, Dodge Caravan, Ford Festiva, Ford Escort, Ford Tempo, Ford Mustang, Ford Probe, Ford Aerostar, Buick Century, BMW 535i, Buick LeSabre, Buick Roadmaster, Buick Riviera, Cadillac DeVille, Chevrolet Lumina, Chevrolet Lumina APV, Chevrolet Astro, Chevrolet Caprice, Chevrolet Corvette, Chrysler Concorde, Dodge Dynasty, Dodge Stealth, Eagle Summit, Eagle Vision, Ford Taurus, Ford Crown Victoria, Geo Metro, Geo Storm.]

Recap:

Univariate graphical summaries:
• Stem and leaf display
• Histogram
• Boxplot
• Barplot

Univariate numerical summaries:
• Sample mean ($\bar{x}$)
• Sample median (M)
• Sample standard deviation (s) and sample variance ($s^2$)
• Percentiles
• First and third quartiles (1Q and 3Q)
• Interquartile range (IQR)
• Range
• 5-number summary
• Rule for outliers

Bivariate graphical summaries:
• Scatter plot
• Side-by-side boxplots

Bivariate numerical summary:
• Correlation

Bivariate categorical data summaries: Contingency tables
• Frequencies
• Marginal frequencies
• Conditional frequencies