Statistics 100 – Summarizing bivariate data

Bivariate data:
• Each observation is measured on two variables.
• Typically we want to summarize the relationship between the two variables.

Examples: Obtain information on 50 subjects in a study.
• Measure weight and height for each subject.
• Record race and LDL cholesterol level.
• Record political party affiliation and marital status.

Three types of data combinations:
• Two quantitative variables
• One quantitative and one categorical variable
• Two categorical variables

Examples:
• Weight and height
• Cholesterol level and race
• Party affiliation and marital status

Motivating example: Consumer Reports 1993 Car data (each observation is a car)
This data set contains 93 observations on 27 variables, of which we’ll examine:
• Engine size (liters)
• Horsepower
• Weight (pounds)
• City MPG (miles per gallon)
• Highway MPG (miles per gallon)
• Type of vehicle (compact, large, midsize, small, sporty, van)

A graphical method for summarizing bivariate data: Scatter plot

What to look for in a scatter plot:
• Type of association (linear, curved, etc.)
• Strength of association (amount of scatter)
• Direction of association (positive, negative)
  – A positive relationship is a scatter that goes up and to the right.
  – A negative relationship is a scatter that goes down and to the right.
• Special features (outliers, other unusual aspects)
[Figure: scatter plot of Highway MPG against City MPG, Consumer Reports 1993 car data]
Example scatter plots: City MPG versus Highway MPG
• strong positive linear relationship
• no outliers; the points to the right are consistent with the linear pattern
[Figure: scatter plot of Horsepower against Engine size (liters), Consumer Reports 1993 car data]
Example scatter plots (continued): Engine size versus Horsepower
• positive linear relationship
• more variability at higher engine sizes than at low engine sizes
• an outlier with low engine size and high horsepower (Mazda RX-7)
[Figure: scatter plot of City MPG against Weight (lbs), Consumer Reports 1993 car data]
Example scatter plots (continued): Vehicle Weight and City MPG
• moderately strong negative relationship
• no unusual observations
• relationship is curved

We can add a scatter plot “smoother” to see the curved relationship better.
[Figure: scatter plot of City MPG against Weight (lbs), Consumer Reports 1993 car data]
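One simple way to see what a smoother does is a running mean: sort the points by x and average the y values over a sliding window of neighbours. The sketch below is only an illustration and is not necessarily the smoother used for the plot. The function name `running_mean_smoother`, the window size, and the (weight, MPG) pairs are all invented for this example.

```python
def running_mean_smoother(points, window=3):
    """Return (x, smoothed y) pairs using a centered running mean.

    points: list of (x, y) pairs; window: odd number of neighbours to average.
    """
    pts = sorted(points)                  # order the points by x
    half = window // 2
    smoothed = []
    for i, (x, _) in enumerate(pts):
        lo = max(0, i - half)             # the window shrinks at the edges
        hi = min(len(pts), i + half + 1)
        ys = [y for _, y in pts[lo:hi]]
        smoothed.append((x, sum(ys) / len(ys)))
    return smoothed


# Hypothetical (weight in lbs, city MPG) pairs, invented for illustration only.
cars = [(1700, 46), (2000, 33), (2300, 29), (2600, 25),
        (2900, 22), (3200, 20), (3500, 19), (3800, 18), (4100, 17)]

for weight, mpg in running_mean_smoother(cars, window=3):
    print(f"{weight:5d} lbs  smoothed city MPG = {mpg:.1f}")
```

A wider window gives a smoother curve but flattens real bends in the relationship, so the window size is a judgment call.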
Graphically representing the relationship between a quantitative variable and a categorical variable: Side-by-side boxplots

Idea:
• Divide the sample into groups according to the categorical variable.
• Construct boxplots for the quantitative variable within each group.

Example: Weight by car type
[Figure: side-by-side boxplots of Weight (lbs) by Car type (Compact, Large, Midsize, Small, Sporty, Van), Consumer Reports 1993 car data]
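The two steps behind a side-by-side boxplot (split the sample by category, then summarize each group) can be sketched in a few lines. This is a minimal illustration, not the code used for the figure. The helper name `five_number_by_group` and the weights are made up, and quartile conventions differ between textbooks and software, so the quartiles may not match a hand calculation exactly.

```python
from collections import defaultdict
from statistics import quantiles


def five_number_by_group(records):
    """Map each category to (min, Q1, median, Q3, max) of its values."""
    groups = defaultdict(list)
    for category, value in records:
        groups[category].append(value)    # step 1: split the sample by category
    summary = {}
    for category, values in groups.items():
        # step 2: summarize each group (needs at least two values per group)
        q1, med, q3 = quantiles(values, n=4, method="inclusive")
        summary[category] = (min(values), q1, med, q3, max(values))
    return summary


# Hypothetical (car type, weight in lbs) records, invented for illustration only.
records = [("Small", 1700), ("Small", 2100), ("Small", 2300),
           ("Small", 2400), ("Small", 2500),
           ("Van", 3500), ("Van", 3700), ("Van", 3800),
           ("Van", 3900), ("Van", 4100)]

for car_type, summary in five_number_by_group(records).items():
    print(car_type, summary)
```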
Comments on boxplot:
• Vans and large cars tend to be heavier, on average.
• Small cars are the lightest type of cars.
• Sporty cars have the largest variability in weights, both in the IQR and range.
• There are a couple of small cars that are outliers; these happen to be a Ford Festiva (1845 lbs) and a Geo Metro (1695 lbs).

Numerical summaries of bivariate data:

Measuring the linear association between two quantitative variables: correlation

Suppose we collect a sample of n observations on two variables. The values of the first variable are $x_1, x_2, \ldots, x_n$ and the values of the second variable are $y_1, y_2, \ldots, y_n$.
Let
• $\bar{x}$ = the sample mean of $x_1, \ldots, x_n$,
• $\bar{y}$ = the sample mean of $y_1, \ldots, y_n$,
• $s_x$ = the standard deviation of $x_1, \ldots, x_n$, and
• $s_y$ = the standard deviation of $y_1, \ldots, y_n$.

Then the correlation coefficient is defined as

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)\left(\sum_{i=1}^{n} (y_i - \bar{y})^2\right)}}
  = \frac{1}{n-1} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}
$$
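As a check on the definition, here is a minimal sketch that computes r in both forms and confirms they agree. The function names `correlation` and `correlation_via_sd` and the data are invented for illustration, and the second form assumes the standard deviations use the n − 1 divisor.

```python
from math import sqrt


def correlation(xs, ys):
    """Sample correlation r computed directly from the sums in the definition."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlation_via_sd(xs, ys):
    """Same r, written as (1 / (n - 1)) * sum(...) / (s_x * s_y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (assumed here to use the n - 1 divisor).
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / ((n - 1) * s_x * s_y)


# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(correlation(x, y), 3), round(correlation_via_sd(x, y), 3))
```

The two functions differ only in where the factor n − 1 is placed, so any difference between their results is floating-point rounding.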
Properties of correlation:
• $-1 \le r \le 1$ always
• $r = 1$ when all the points $(x_i, y_i)$ lie on a line with positive slope
• $r = -1$ when all the points $(x_i, y_i)$ lie on a line with negative slope
• When $r = 0$, there is no positive or negative linear association between the two variables (though the two variables may have a non-linear relationship).

See the following scatter plots for examples of different correlations.
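A small worked example shows how r can equal 0 even when y is completely determined by x. The values are chosen here for illustration and are not taken from the plots. Take the five points with x = −2, −1, 0, 1, 2 and y = x², so that y = 4, 1, 0, 1, 4:

```latex
\bar{x} = 0, \qquad \bar{y} = \frac{4 + 1 + 0 + 1 + 4}{5} = 2,
\\[4pt]
\sum_{i=1}^{5} (x_i - \bar{x})(y_i - \bar{y})
  = (-2)(2) + (-1)(-1) + (0)(-2) + (1)(-1) + (2)(2) = 0
\quad\Longrightarrow\quad r = 0 .
```

The numerator of r vanishes because the positive and negative products cancel by symmetry, although the points lie exactly on a parabola.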
[Figures: six example scatter plots of Y against X, with r = 1, r = 0.73, r = −0.73, r = 0.11, r = 0, and r = 0]
Correlations for car data:
• City MPG and Highway MPG: r = 0.944
• Engine size and Horsepower: r = 0.732
• Vehicle Weight and City MPG: r = −0.843

Summarizing bivariate categorical data: Contingency tables

Example: Berkeley graduate admissions data (1973)
               Male   Female   Total
Admitted       3738     1494    5232
Not admitted   4704     2827    7531
Total          8442     4321   12763
This represents a sample of 12763 admission candidates cross-classified into four categories.

Frequencies of certain categories:
• Frequency of candidates who were admitted males:
  (# admitted males) / (# total candidates) = 3738 / 12763 = 0.293
• (Marginal) Frequency of candidates who were admitted:
  (# admitted) / (# total candidates) = 5232 / 12763 = 0.410
• (Conditional) Frequency of males that were admitted:
  (# admitted males) / (# total males) = 3738 / 8442 = 0.443
• (Conditional) Frequency of females that were admitted:
  (# admitted females) / (# total females) = 1494 / 4321 = 0.346
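The three kinds of frequency differ only in which total goes in the denominator. A minimal sketch using the counts from the table (the function names are invented for illustration):

```python
# Counts from the Berkeley admissions table above.
counts = {
    ("Admitted", "Male"): 3738,
    ("Admitted", "Female"): 1494,
    ("Not admitted", "Male"): 4704,
    ("Not admitted", "Female"): 2827,
}

total = sum(counts.values())  # 12763 candidates


def joint(status, sex):
    """Frequency of one cell out of all candidates."""
    return counts[(status, sex)] / total


def marginal_status(status):
    """Frequency of an admission status, ignoring sex."""
    return sum(c for (s, _), c in counts.items() if s == status) / total


def conditional_status_given_sex(status, sex):
    """Frequency of an admission status among candidates of one sex."""
    sex_total = sum(c for (_, g), c in counts.items() if g == sex)
    return counts[(status, sex)] / sex_total


print(f"{joint('Admitted', 'Male'):.3f}")                           # 0.293
print(f"{marginal_status('Admitted'):.3f}")                         # 0.410
print(f"{conditional_status_given_sex('Admitted', 'Male'):.3f}")    # 0.443
print(f"{conditional_status_given_sex('Admitted', 'Female'):.3f}")  # 0.346
```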
Notice that males are admitted more frequently than females (0.443 versus 0.346)!

A closer look at Berkeley admissions data (by major):
               Male                  Female
Major    Number    Percent     Number    Percent
         applied   admitted    applied   admitted
A          825       62          108       82
B          560       63           25       68
C          325       37          593       34
D          417       33          375       35
E          191       28          393       24
F          373        6          341        7
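To compare the by-major rates with the rates pooled over these six majors, here is a minimal sketch. The helper name `pooled_rate` is invented for illustration, and admitted counts are reconstructed from the rounded percentages, so the pooled rates are approximate. They cover only the six listed majors, so they need not match the figures from the full table.

```python
# (applicants, percent admitted) by major, copied from the table above.
male = {"A": (825, 62), "B": (560, 63), "C": (325, 37),
        "D": (417, 33), "E": (191, 28), "F": (373, 6)}
female = {"A": (108, 82), "B": (25, 68), "C": (593, 34),
          "D": (375, 35), "E": (393, 24), "F": (341, 7)}


def pooled_rate(table):
    """Approximate admission rate (in percent) pooled over the listed majors.

    Admitted counts are rebuilt from rounded percentages, so this is approximate.
    """
    applicants = sum(n for n, _ in table.values())
    admitted = sum(n * pct / 100 for n, pct in table.values())
    return 100 * admitted / applicants


# Majors in which the female admission rate exceeds the male rate.
female_higher = [m for m in male if female[m][1] > male[m][1]]

print(female_higher)                  # ['A', 'B', 'D', 'F']
print(round(pooled_rate(male), 1))    # 44.5
print(round(pooled_rate(female), 1))  # 30.3
```

Females have the higher rate in four of the six majors, yet the pooled male rate is higher, because most female applicants applied to the majors with low admission rates (C through F).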
This is an example of Simpson’s paradox. When we introduce a new variable (Major), the relationship reverses: in four of the six majors (A, B, D, and F), females are admitted at a higher rate than males. This phenomenon happens surprisingly often!

Some graphical displays for multivariate data:
• Scatter plot matrix
  – Examine relationship of variables in pairs
  – Allows seeing relationship between all pairs of variables simultaneously
• Chernoff faces – a truly multivariate technique
  – Each observation is a face
  – Each variable is a facial feature
  – Invented mostly as a joke!
[Figure: scatterplot matrix of car data, with panels for Weight, MPG.city, MPG.highway, EngineSize, and Horsepower]
Example Chernoff faces: Car data
• Width of face – vehicle weight
• Height of face split – City MPG
• Length of face – Highway MPG
• Width of top half of face – Horsepower
• Width of bottom half of face – Engine size

Other facial features: Length of nose, curvature of mouth, size of eyes, angle of eyebrows, etc.
[Figure: Chernoff faces for the car data, one face per car. Cars shown: Acura Integra, Acura Legend, Audi 90, Audi 100, Cadillac Seville, Chevrolet Cavalier, Chevrolet Corsica, Chevrolet Camaro, Chrysler LeBaron, Chrysler Imperial, Dodge Colt, Dodge Shadow, Dodge Spirit, Dodge Caravan, Ford Festiva, Ford Escort, Ford Tempo, Ford Mustang, Ford Probe, Ford Aerostar, Buick Century, BMW 535i, Buick LeSabre, Buick Roadmaster, Buick Riviera, Cadillac DeVille, Chevrolet Lumina, Chevrolet Lumina APV, Chevrolet Astro, Chevrolet Caprice, Chevrolet Corvette, Chrysler Concorde, Dodge Dynasty, Dodge Stealth, Eagle Summit, Eagle Vision, Ford Taurus, Ford Crown Victoria, Geo Metro, Geo Storm]
Recap:

Univariate graphical summaries:
• Stem and leaf display
• Histogram
• Boxplot
• Barplot

Univariate numerical summaries:
• Sample mean ($\bar{x}$)
• Sample median ($M$)
• Sample standard deviation ($s$) and sample variance ($s^2$)
• Percentiles
• First and third quartiles (1Q and 3Q)
• Interquartile range (IQR)
• Range
• 5-number summary
• Rule for outliers

Bivariate graphical summaries:
• Scatter plot
• Side-by-side boxplots

Bivariate numerical summary:
• Correlation

Bivariate categorical data summaries: Contingency tables
• Frequencies
• Marginal frequencies
• Conditional frequencies