Chapter 3

Summarizing Quantitative Data: Statistics

• statistic Any quantity computed from the data values in the sample; used to quantify some distributional feature.
• parameter A quantity used to describe some distributional feature of the larger population.

There are two chief systems of statistics in common use: ordered statistics, based on an ordering of the data values from lowest to highest; and weighted statistics, which locate features relative to how they balance against the rest of the data set on the measuring scale.

Feature              Ordered statistic            Weighted statistic
Center               median                       mean ($\bar{x}$)
Spread               interquartile range (IQR)    standard deviation (s)
Relative standing    percentile                   z-score (z)


• mode the data value (or values) with the highest frequency (this also applies to qualitative data); a data set may have one mode, no mode, or multiple modes
• median the middle observation in a sorted list of the data values (for an even number of values, average the two middle observations) [Excel:=MEDIAN]
• percentile the pth percentile is a number with about p% of the data values at or below it and about (100 − p)% at or above it (the median is the 50th percentile)


There are two common methods for computing percentiles, and they typically produce slightly different values:
– the exclusive method (used by the authors of our textbook) locates the position of the pth percentile in the sorted list using the formula
    $L_p = \frac{p}{100}(n + 1)$
  So, for example, if $L_{30} = 13.42$, the 30th percentile is the number 42% of the way from the 13th to the 14th value in the list that sorts the data from lowest to highest. However, this method has the disadvantage that it returns no value for very small or very large values of p (whenever $L_p < 1$ or $L_p > n$).
– the inclusive method locates the position of the pth percentile using the formula
    $L_p = \frac{p}{100}(n - 1) + 1$
  This method always produces a definite value for the pth percentile, since $1 \le L_p \le n$ for every p from 0 to 100.
[Excel:=PERCENTILE.EXC or =PERCENTILE.INC]
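The two position formulas can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `percentile`, its `method` argument, and the sample `scores` are made up for this example, and positions are interpolated linearly between neighboring sorted values, as in the 13.42 example above.

```python
def percentile(data, p, method="exclusive"):
    """Return the p-th percentile of data, or None when the position is undefined."""
    values = sorted(data)
    n = len(values)
    if method == "exclusive":
        position = p * (n + 1) / 100        # L_p = (p/100)(n + 1)
    elif method == "inclusive":
        position = p * (n - 1) / 100 + 1    # L_p = (p/100)(n - 1) + 1
    else:
        raise ValueError("method must be 'exclusive' or 'inclusive'")
    if position < 1 or position > n:
        return None                         # can only happen with the exclusive method
    below = int(position)                   # 1-based rank of the value at or below L_p
    if below == n:
        return values[-1]                   # position is exactly the last value
    fraction = position - below             # how far to move toward the next value
    return values[below - 1] + fraction * (values[below] - values[below - 1])


scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]        # hypothetical sample, n = 9
print(percentile(scores, 30, "exclusive"))  # position 3.0
print(percentile(scores, 30, "inclusive"))  # position 3.4
print(percentile(scores, 5, "exclusive"))   # position 0.5, so no value is returned
```

With only nine values, the exclusive method has no 5th percentile, while the inclusive method still returns a value for every p from 0 to 100.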


• lower/upper quartiles (Q1 and Q3) the observations which are one quarter (Q1) and three quarters (Q3) of the way up the sorted list, that is, the 25th and 75th percentiles; they can also be found as the medians of the halves of the data located below/above the median, which may give slightly different values. (The 0th quartile is the minimum value: Q0 = min; the second quartile is the median: Q2 = median; and the fourth quartile is the maximum value: Q4 = max.) [Excel:=QUARTILE.EXC or =QUARTILE.INC; =MAX; =MIN]
• interquartile range (IQR) the difference IQR = Q3 − Q1 between the upper and lower quartiles
• five-number summary of a data set is the list of its five quartiles:
– minimum (min)
– lower quartile (Q1)
– median
– upper quartile (Q3)
– maximum (max)
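As a rough illustration, the five-number summary and the IQR can be computed with Python's standard `statistics` module. The helper name `five_number_summary` and the data are invented for this sketch, and the inclusive quartile method is an arbitrary choice here; the exclusive method would generally give slightly different quartiles.

```python
import statistics


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the inclusive quartile method."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)


values = [7, 1, 9, 3, 5, 2, 8, 4, 6]   # hypothetical data, in no particular order
summary = five_number_summary(values)
iqr = summary[3] - summary[1]           # IQR = Q3 - Q1
print(summary, iqr)
```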


• boxplot display of the five-number summary formed by
– drawing a box over a number line so that the sides of the box are located at the two quartiles (the width of the box equals the IQR),
– drawing the wall (a line across the box) at the location of the median, and
– drawing whiskers (lines parallel to the number line) extended from the sides of the box to the min and max.
[Excel: Compute Q1, Q2 − Q1, Q3 − Q2, then ... Insert > Charts > Bar > Stacked Bar; Select Data > Switch Row/Column; ChartTools > Layout > Analysis > Error Bars > Options; Display > Plus/Minus; Error Amount > Custom]


• modified boxplot same as above, except:
– whiskers extend from the sides of the box to the fences, points positioned 1.5 × IQR beyond each end of the box; and
– outliers (any values lying outside the fences) are individually marked with symbols; far outliers, which lie more than 3 × IQR beyond the ends of the box, are often marked with different symbols for emphasis.
• resistance to outliers moving the extreme values of a data set either further away from or closer to the center of the distribution does not change the median value; hence, the median (and other ordered statistics) are often preferred when describing skewed data sets (income data, housing prices, etc.).
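The fence rule can be sketched as follows. The function name `classify_outliers`, the sample data, and the use of the inclusive quartile method are assumptions made for this illustration only.

```python
import statistics


def classify_outliers(data):
    """Return the fences, the ordinary outliers, and the far outliers."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)    # 1.5 IQR beyond each end of the box
    far_limits = (q1 - 3 * iqr, q3 + 3 * iqr)    # 3 IQR beyond each end of the box
    far = [x for x in data if x < far_limits[0] or x > far_limits[1]]
    outliers = [x for x in data
                if (x < fences[0] or x > fences[1]) and x not in far]
    return fences, outliers, far


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 30]     # hypothetical data
fences, outliers, far = classify_outliers(sample)
print(fences, outliers, far)
```

Here Q1 = 3.5 and Q3 = 8.5, so the fences sit at −4 and 16; the value 17 falls outside the fences, and 30 lies more than 3 IQR beyond the box.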


• (arithmetic) mean ($\bar{x}$, $\mu$) a data set that includes repeated measures of some variable quantity x (sample size = n; population size = N) has mean value equal to its arithmetical average, the sum of the values divided by the number of values:
    $\bar{x} = \frac{\sum x_i}{n}$ (sample mean)
    $\mu = \frac{\sum x_i}{N}$ (population mean)
  It represents the point on the number scale where the distribution “balances” (as if the histogram were made of some massive substance). [Excel:=AVERAGE]
• sensitivity to outliers moving the extreme values of a data set either further away from or closer to the center of the distribution can substantially alter the mean value; hence, the mean (and other weighted statistics) is best used to describe only symmetric data sets or those without much skew. In skewed distributions, the mean is pulled in the direction of the skewness (the longer tail).
• geometric mean mean value appropriate for multiplicative data, those that are combined by multiplication (e.g., rates of growth)


If repeated percentage growth rates $g_i$ over various time intervals are collected, their corresponding growth factors have the form $(1 + g_i)$, whence the geometric mean of these factors satisfies
    $1 + G_g = [(1 + g_1)(1 + g_2) \cdots (1 + g_n)]^{1/n}$
Therefore, the average growth rate equals
    $G_g = \sqrt[n]{(1 + g_1)(1 + g_2) \cdots (1 + g_n)} - 1$
If various financial rates of return $R_i$ are given, their geometric mean return is
    $G_R = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$
If instead of growth rates the periodic amounts $x_1, x_2, \ldots, x_n$ are given, then they determine $n - 1$ growth rates through $x_{i+1} = (1 + g_i) x_i$, and the average growth rate simplifies to
    $G_g = \sqrt[n-1]{\frac{x_2}{x_1} \cdot \frac{x_3}{x_2} \cdots \frac{x_n}{x_{n-1}}} - 1 = \sqrt[n-1]{\frac{x_n}{x_1}} - 1$
The geometric mean is generally somewhat less sensitive to outliers than the arithmetic mean for the same data. [Excel:=GEOMEAN]
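A short sketch of both calculations, using made-up figures. The function names are hypothetical, and the second function assumes that a list of k periodic amounts spans k − 1 growth periods, so it takes the (k − 1)th root of the overall ratio.

```python
import math


def average_growth_rate(rates):
    """Geometric mean growth rate from per-period growth rates g_i."""
    factors = [1 + g for g in rates]                 # growth factors (1 + g_i)
    return math.prod(factors) ** (1 / len(factors)) - 1


def average_growth_from_amounts(amounts):
    """Geometric mean growth rate from the periodic amounts themselves."""
    periods = len(amounts) - 1                       # k amounts span k - 1 periods
    return (amounts[-1] / amounts[0]) ** (1 / periods) - 1


rates = [0.10, -0.10]                                # up 10%, then down 10%
amounts = [100, 110, 99]                             # the same history as amounts
print(average_growth_rate(rates))
print(average_growth_from_amounts(amounts))
```

Both calls give an average growth rate of about −0.5% per period, whereas the arithmetic mean of the two rates is 0%, which would wrongly suggest that the amount ended where it started.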


• range difference between the max and min values of the data; coarsest measure of spread, and highly sensitive to outliers

• deviation from the mean ($x - \bar{x}$) the difference between a single data value x and the mean $\bar{x}$ of the data set; values greater than the mean have positive deviations, while those below the mean have negative deviations. Each number in the data set has its own deviation from the mean.
• mean absolute deviation (MAD) average of the absolute values of the deviations from the mean:
    $\bar{d} = \frac{\sum |x_i - \bar{x}|}{n}$ (sample MAD)
    $\delta = \frac{\sum |x_i - \mu|}{N}$ (population MAD)
  Statistical theory that includes the MAD as the measure of spread is surprisingly difficult, so it is not commonly used in practice.
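A minimal sketch of the MAD calculation; the function name and the data values are chosen only for illustration.

```python
def mean_absolute_deviation(data):
    """Average of the absolute deviations from the mean."""
    n = len(data)
    mean = sum(data) / n
    deviations = [x - mean for x in data]        # one deviation per data value
    return sum(abs(d) for d in deviations) / n


values = [2, 4, 4, 4, 5, 5, 7, 9]                # hypothetical data with mean 5
print(mean_absolute_deviation(values))           # (3+1+1+1+0+0+2+4) / 8
```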


• variance ($s^2$, $\sigma^2$) a measure of the average squared deviation from the mean:
    $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ (sample variance)
    $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$ (population variance)
  [Excel:=VAR.S or =VAR.P] Variance is in squared units, which are hard to interpret, so the more important measure of spread is ...
• standard deviation (s, σ) square root of the variance, a measure of spread that estimates the size of a typical deviation from the mean:
    $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ (sample standard deviation)
    $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$ (pop. standard deviation)
  [Excel:=STDEV.S or =STDEV.P] The greater the standard deviation, the further from the mean most values will be found. Also, it weights large deviations from the mean more heavily than does the MAD.
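The difference between the sample and population versions is only the divisor, as this sketch tries to show. The function names and data are illustrative; Python's `statistics` module offers `variance`, `pvariance`, `stdev`, and `pstdev` for the same purpose.

```python
import math


def sum_squared_deviations(data):
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)


def sample_variance(data):
    return sum_squared_deviations(data) / (len(data) - 1)   # divide by n - 1


def population_variance(data):
    return sum_squared_deviations(data) / len(data)         # divide by N


values = [2, 4, 4, 4, 5, 5, 7, 9]                 # hypothetical data with mean 5
s = math.sqrt(sample_variance(values))            # sample standard deviation
sigma = math.sqrt(population_variance(values))    # population standard deviation
print(sample_variance(values), s)
print(population_variance(values), sigma)
```

For these eight values the squared deviations sum to 32, so the population variance is 4 (σ = 2) while the sample variance is 32/7, slightly larger.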


• coefficient of variation (CV) a measure of relative spread that determines how large the standard deviation is relative to the mean value; used exclusively for values of x measured on a ratio scale:
    $CV = \frac{s}{\bar{x}}$ (sample coefficient of variation)
    $CV = \frac{\sigma}{\mu}$ (pop. coefficient of variation)
  Note that the notation CV is used for both sample and population values; only context will distinguish which measure is being used.

• z-score, or standardized value a measure of relative standing that determines how far each data value is from the mean, measured in units of standard deviations:
    $z = \frac{x - \bar{x}}{s}$
  Positive z-scores correspond to values larger than the mean; negative z-scores to values below the mean. [Excel:=STANDARDIZE]
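Both relative measures follow directly from the sample mean and sample standard deviation. The sketch below uses a tiny invented data set and hypothetical helper names.

```python
import statistics


def coefficient_of_variation(data):
    """Sample CV: standard deviation relative to the mean."""
    return statistics.stdev(data) / statistics.mean(data)


def z_scores(data):
    """Standardize every value: (x - mean) / s."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return [(x - mean) / s for x in data]


weights_kg = [10, 20, 30]                 # hypothetical ratio-scale measurements
print(coefficient_of_variation(weights_kg))
print(z_scores(weights_kg))
```

Here the mean is 20 and s = 10, so the values standardize to −1, 0, and 1, and the standard deviation is half the size of the mean.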


• mean-variance analysis In finance, an investment I that fluctuates in value over time will produce variability in its rates of return $x_i$. This variability is summarized by the mean rate of return $\bar{x}_I$ and the standard deviation of the rates of return $s_I$ for the investment. The mean measures the investment’s reward, and the standard deviation its risk. Further, the degree to which the return on the investment compensates for the risk that the investor takes can be measured by comparing the difference between its reward $\bar{x}_I$ and that $R_f$ of a risk-free investment (like a treasury bill), but in units of risk $s_I$; this measure is called the Sharpe ratio:
    $\text{Sharpe ratio} = \frac{\bar{x}_I - R_f}{s_I}$
  The higher the Sharpe ratio, the better the investment will compensate its investors for the risk they are taking by investing.
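A small sketch of the Sharpe ratio calculation. The return figures and the risk-free rate are invented for illustration (expressed in percent per period), and the function name is hypothetical.

```python
import statistics


def sharpe_ratio(returns, risk_free_rate):
    """(mean return - risk-free rate) / sample standard deviation of returns."""
    reward = statistics.mean(returns)
    risk = statistics.stdev(returns)
    return (reward - risk_free_rate) / risk


fund_returns = [4.0, 8.0, 12.0]     # hypothetical periodic returns, in percent
treasury_rate = 2.0                 # hypothetical risk-free return, in percent
print(sharpe_ratio(fund_returns, treasury_rate))
```

With a mean return of 8%, a standard deviation of 4%, and a risk-free rate of 2%, the investment earns 1.5 units of excess return per unit of risk.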


General properties of distributions

• Chebyshev’s Theorem states that for any data set, the proportion of values that lie within k standard deviations of the mean is at least $1 - \frac{1}{k^2}$ whenever k > 1.
• Empirical Rule states that for data sets that are nearly normally distributed, i.e., symmetric and bell-shaped (with a prominent peak describing the central cluster of data and tails of more extreme values that thin out as one moves away from the center),
– approximately 68% of the data will lie within one standard deviation of the mean, within the interval $\bar{x} \pm s$;
– approximately 95% of the data will lie within two standard deviations of the mean, within the interval $\bar{x} \pm 2s$; and
– nearly all of the data will lie within three standard deviations of the mean, within the interval $\bar{x} \pm 3s$.
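Chebyshev's bound can be checked numerically on any data set. The sketch below uses a deliberately skewed, made-up sample; the function names are hypothetical, and the sample standard deviation s is used for the intervals.

```python
import statistics


def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations (k > 1)."""
    return 1 - 1 / k**2


def proportion_within(data, k):
    """Observed proportion of values inside mean ± k·s."""
    center = statistics.mean(data)
    s = statistics.stdev(data)
    inside = [x for x in data if abs(x - center) <= k * s]
    return len(inside) / len(data)


skewed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]     # hypothetical, strongly skewed data
for k in (1.5, 2, 3):
    print(k, proportion_within(skewed, k), chebyshev_bound(k))
```

Even for this far-from-normal sample the observed proportions stay at or above the Chebyshev bound, although they do not match the 68%/95% figures of the Empirical Rule, which applies only to nearly normal data.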


Working with grouped data

When raw data is processed by being aggregated into intervals along its scale of measure, the mean and standard deviation of the set can be approximated by assuming that all values in an interval are concentrated at the midpoint m of the interval. If the n data points of the sample (or the N data points of the population) are partitioned into k intervals with midpoints $m_1, m_2, \ldots, m_k$, and the intervals contain frequencies of $f_1, f_2, \ldots, f_k$, respectively, then

    mean:         $\bar{x} = \frac{\sum m_i f_i}{n}$,   $\mu = \frac{\sum m_i f_i}{N}$
    variance:     $s^2 = \frac{\sum (m_i - \bar{x})^2 f_i}{n - 1}$,   $\sigma^2 = \frac{\sum (m_i - \mu)^2 f_i}{N}$
    stand. dev.:  $s = \sqrt{s^2}$,   $\sigma = \sqrt{\sigma^2}$
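The grouped-data approximation for a sample might be sketched like this. The midpoints and frequencies are invented, the function names are hypothetical, and the variance uses the sample divisor n − 1.

```python
import math


def grouped_mean(midpoints, frequencies):
    n = sum(frequencies)                              # total number of data points
    return sum(m * f for m, f in zip(midpoints, frequencies)) / n


def grouped_sample_variance(midpoints, frequencies):
    n = sum(frequencies)
    mean = grouped_mean(midpoints, frequencies)
    squared = sum((m - mean) ** 2 * f for m, f in zip(midpoints, frequencies))
    return squared / (n - 1)


# Hypothetical intervals 0-10, 10-20, 20-30 holding 2, 5, and 3 observations.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]
mean = grouped_mean(midpoints, frequencies)
variance = grouped_sample_variance(midpoints, frequencies)
print(mean, variance, math.sqrt(variance))
```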


Weighted mean

We can reinterpret the mean value formula above by considering the midpoints $m_i$ as a data set to which we assign weights given by the corresponding relative frequencies $f_i/n$; a greater weight gives the midpoint value a larger contribution to the value of the mean. More generally, if data values $x_i$ are assigned corresponding weights $w_i$, where $\sum w_i = 1$, then we find the weighted mean by the similar formulas

    $\bar{x} = \sum w_i x_i$,   $\mu = \sum w_i x_i$
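A brief sketch of the weighted mean, using relative frequencies as weights. The function name, the check that the weights sum to 1, and the data are assumptions made for this example.

```python
import math


def weighted_mean(values, weights):
    """Sum of w_i * x_i, for weights that add up to 1."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, values))


midpoints = [5, 15, 25]                       # hypothetical interval midpoints
frequencies = [2, 5, 3]
n = sum(frequencies)
weights = [f / n for f in frequencies]        # relative frequencies f_i / n
print(weighted_mean(midpoints, weights))
```

Because the weights are the relative frequencies, the result agrees with the grouped-data mean computed from the same midpoints and frequencies.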


Statistics for paired data

• covariance ($s_{xy}$, $\sigma_{xy}$) a measure of the direction and strength of the linear association between paired data variables, where x is the explanatory and y the response variable:
    $s_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$ (sample covariance)
    $\sigma_{xy} = \frac{\sum (x - \mu_x)(y - \mu_y)}{N}$ (population covariance)
  Note that covariance is measured in the product of the units of x and y (square units when both share the same unit). [Excel:=COVARIANCE.S or =COVARIANCE.P]
• correlation ($r_{xy}$, $\rho_{xy}$) a standardized version of covariance:
    $r_{xy} = \frac{s_{xy}}{s_x s_y}$ (sample correlation)
    $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$ (population correlation)
  [Excel:=CORREL (for samples only)]
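The sample formulas can be sketched directly, noting that the covariance of a variable with itself is its sample variance. The function names and the paired data are made up for illustration.

```python
import math


def sample_covariance(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


def sample_correlation(xs, ys):
    s_x = math.sqrt(sample_covariance(xs, xs))    # s_xx is the sample variance of x
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


hours = [1, 2, 3, 4, 5]                           # hypothetical explanatory variable
scores = [2, 4, 6, 8, 10]                         # hypothetical response, exactly 2 * hours
print(sample_covariance(hours, scores))
print(sample_correlation(hours, scores))
print(sample_correlation(hours, scores[::-1]))    # reversed response
```

Since the response is an exact linear function of the explanatory variable, the correlation is 1 up to rounding; reversing the response flips it to −1.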


Analyzing Data: Paired Quantitative Data

• positive values of $s_{xy}$, $r_{xy}$ indicate a positive association between x and y; negative values of $s_{xy}$, $r_{xy}$ indicate a negative association between x and y
• $r_{xy}$ always lies between −1 and 1, with values close to 0 indicating weak association, values close to 1 a strong positive association, and values close to −1 a strong negative association
• while it is possible to compute the covariance and correlation for any pair of quantitative variables, only linear associations are evaluated by these statistics
• the statistics $s_{xy}$, $r_{xy}$ are highly sensitive to outliers, so the presence of an outlier can dramatically alter their values
• there may be a strong association between variables without there being any cause/effect relation between them: association does not signify causation. Sometimes, there is a third lurking variable (one that is not measured by the investigator) standing behind the other two variables as a common and hidden determining factor.
