STAT 248: Exploratory Data Analysis Handout 2
GSI: Gido van de Ven
September 10th, 2010

1 Introduction

Today’s section will focus on Exploratory Data Analysis (EDA), a set of techniques developed to make a dataset easier for human minds to handle effectively. This handout first discusses some fundamental issues in EDA, and then focuses on some basic data visualization techniques: the histogram, the stem-and-leaf plot, the boxplot, and probability plots (QQ-plots). During section we won’t go over all the concepts mentioned in this handout, as it is meant more as a reference. We will start the section with a discussion of the goal of Statistics and where EDA fits in. After that we will go over some examples in R.

2 Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach to data analysis originally developed by Tukey. Four major ingredients of EDA stand out:

1. Displays visually reveal the behaviour of the data and the structure of the analyses;
2. Residuals focus attention on what remains of the data after some analysis;
3. Re-expressions, by means of simple mathematical functions such as the logarithm and the square root, help to simplify behaviour and clarify analyses; and
4. Resistance ensures that a few extraordinary data values do not unduly influence the results of an analysis.

The seminal work in EDA is "Exploratory Data Analysis", Tukey (1977).

3 Basic Concepts

Depths: The depth of a data value describes how far the value lies from the nearer end (low or high) of an ordered batch. We define the depth of each data value as its position in an enumeration of the values that starts at the nearer end of the batch. Each extreme value is the first value in such an enumeration and therefore has depth 1; the second largest and second smallest values each have depth 2; and so on. In general, in a batch of size n, two data values have depth i: the ith smallest and the ith largest, i.e. the (n+1−i)th smallest.

Median: If n is odd, there is a deepest data value, one equally far from either end of the ordered batch. This is the median (M), and it marks the middle of the batch: exactly half of the remaining n−1 numbers in the batch are less than or equal to it, and exactly half are greater than or equal to it. The depth of the median, denoted by d(M), is calculated as:

d(M) = (n+1)/2

If n is even, d(M) has a fractional part equal to a half. In that case, the usual convention is to take the average of the two observations with depths just above and below d(M) as the median M.

Hinges: The median splits an ordered batch in half. We might naturally ask next about the middle of each of the halves. The hinges (H) are the summary values in the middle of each half of the data. They are denoted by LH (lower hinge) and UH (upper hinge), and they lie about a quarter of the way in from each end of the ordered batch. Each hinge is at depth d(H):

d(H) = ([d(M)] + 1)/2

where [x] denotes the largest integer less than or equal to x (i.e. if d(M) contains a 1/2, you drop it).

Quartiles: The hinges are very similar to the quartiles, which are defined so that one quarter of the data lies below the lower quartile and one quarter of the data lies above the upper quartile. The main difference is that the depths of the hinges are calculated from the depth of the median, with the result that the hinges often lie closer to the median than do the quartiles. Note that the hinges equal the quartiles for odd n, and differ slightly for even n.

5-number summary: Introduced by Tukey, this summary consists of the median, the two hinges and the two extreme values.

Spread: The spread of a dataset describes the variability of the data. One value that can tell us about the spread is the interquartile range (IQR, also called the H-spread), which is the distance between the two hinges (i.e. IQR = UH − LH). Another measure of the variability in the data is the range, which is simply the largest value minus the smallest.

Outliers: Let’s first define the inner fences (IF) and the outer fences (OF) of a batch of data:

Inner fences: LH − 1.5 * (H-spread)  and  UH + 1.5 * (H-spread)
Outer fences: LH − 3 * (H-spread)  and  UH + 3 * (H-spread)

Any data value beyond either inner fence we term outside (an outlier), and any value beyond either outer fence we call far outside (an extreme outlier).

Boxplot: A boxplot is a box, marked off with solid lines, that runs from hinge to hinge, with the median shown as a solid line in the middle. A dashed whisker runs from the box on each end to the most extreme value that is still not beyond the corresponding inner fence (these two values are known as the adjacent values). Outliers are shown individually and, if possible, labelled.

Histogram: A histogram can be used to illustrate the shape, or distribution, of data. A histogram consists of side-by-side bars (also called bins), where each data value is represented by an equal amount of area in its bar. The height of the bins represents either counts or proportions; in the latter case the histogram is said to be a "true" histogram, or to be in "density scale". Histograms can be used to judge at a glance whether the data are approximately symmetric or skewed. We can also see whether the histogram has a single peak, in which case the data are said to be unimodal, or multiple peaks, a multimodal distribution.

Stem-and-leaf plot: A variation on the histogram is the stem-and-leaf plot. In this plot a certain number of digits at the beginning of each data value serve as the basis for sorting (the stem), and the next digit appears in the display (the leaf). Note that in a stem-and-leaf plot essentially no information is lost, except perhaps the original order of the data values.
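To make these definitions concrete, here is a minimal R sketch that computes the median, the hinges and the fences directly from the formulas above (a hand-rolled illustration only; the built-in functions are listed in the next subsection):

> hinges.and.fences = function(x) {
+   x = sort(x); n = length(x)
+   d.M = (n + 1) / 2                                # depth of the median
+   M = (x[floor(d.M)] + x[ceiling(d.M)]) / 2        # averages the two middle values if n is even
+   d.H = (floor(d.M) + 1) / 2                       # depth of the hinges
+   LH = (x[floor(d.H)] + x[ceiling(d.H)]) / 2       # lower hinge
+   UH = (x[n + 1 - floor(d.H)] + x[n + 1 - ceiling(d.H)]) / 2   # upper hinge
+   H = UH - LH                                      # H-spread (IQR)
+   list(median = M, LH = LH, UH = UH,
+        inner = c(LH - 1.5 * H, UH + 1.5 * H),
+        outer = c(LH - 3 * H, UH + 3 * H))
+ }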

3.1 R functions

median(), fivenum(), quantile(), IQR(), range(), boxplot(), hist(), stem()
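For example, applied to a simulated batch (any numeric vector works; note that fivenum() returns Tukey’s five-number summary, i.e. it uses hinges rather than quartiles):

> x = rnorm(50)
> fivenum(x)    # minimum, lower hinge, median, upper hinge, maximum
> quantile(x)   # quartiles (default definition); can differ slightly from the hinges
> IQR(x)        # interquartile range
> range(x)      # smallest and largest value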

3.2 Example

The following example, which illustrates the above concepts, is taken from Velleman and Hoaglin’s book. New Jersey has 21 counties; sorted into increasing order, their areas in square miles are: 47, 103, 130, 192, 221, 228, 234, 267, 307, 312, 329, 362, 365, 423, 468, 476, 500, 527, 569, 642, 819. Applying the definitions laid out above gives:

n = 21

d(M) = (n+1)/2 = 11, so median = 329

d(H) = ([d(M)] + 1)/2 = 6, so LH = 228 and UH = 476

IQR = H-spread = UH − LH = 248

Lower inner fence = LH − 1.5 * IQR = 228 − 1.5 * 248 = −144
Upper inner fence = UH + 1.5 * IQR = 476 + 1.5 * 248 = 848
Lower outer fence = LH − 3 * IQR = 228 − 3 * 248 = −516
Upper outer fence = UH + 3 * IQR = 476 + 3 * 248 = 1220

Adjacent values = {47, 819}
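To reproduce these numbers in R (the vector name New.Jersey is also used in the plotting code below):

> New.Jersey = c(47, 103, 130, 192, 221, 228, 234, 267, 307, 312, 329, 362,
+                365, 423, 468, 476, 500, 527, 569, 642, 819)
> fivenum(New.Jersey)   # 47 228 329 476 819: extremes, hinges and median as above
> IQR(New.Jersey)       # 248 for this batch, equal to the H-spread here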

Figure 1: New Jersey counties. Top: boxplot. Bottom: histogram and kernel density plot.

> par(mfrow = c(2,1))
> boxplot(New.Jersey, horizontal = TRUE, main = "Boxplot: area of the New Jersey counties", xlab = "Area (in square miles)")
> hist(New.Jersey, prob = TRUE, col = "red", main = "Histogram: area of the New Jersey counties", xlab = "Area (in square miles)")
> lines(density(New.Jersey))


4 Probability Plots

Probability plots are an extremely useful graphical tool for qualitatively assessing the fit of data to a theoretical distribution. Consider a sample of size n from a uniform distribution U[0,1]. Denote the ordered sample values by X(1) < X(2) < ... < X(n). These values are called order statistics. It can be shown that:

E(X(j)) = j/(n+1)

This suggests plotting the ordered observations X(1), ..., X(n) against their expected values 1/(n+1), ..., n/(n+1). If the underlying distribution is uniform, the plot should look roughly linear. Now suppose you have the hypothesis that the observations follow a certain distribution F different from the uniform distribution.

Proposition: Let X be a continuous random variable with a strictly increasing cumulative distribution function FX. If Y = FX(X), then Y has a uniform distribution on [0,1]. Given a sample X1, ..., Xn we can therefore plot the ordered observations (which may be viewed as the observed or empirical quantiles) against the quantiles of the theoretical distribution:

X(k)  vs  F^-1(k/(n+1))

where F^-1(k/(n+1)) is the k/(n+1) quantile of the distribution F; that is, it is the point such that the probability that a random variable with distribution F is less than that point is k/(n+1). If the underlying distribution is the one that we hypothesized, the plot should look roughly linear.
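The same recipe works for any hypothesized continuous distribution F. As a sketch, here is a QQ-plot against an exponential distribution (the exponential and its rate are illustrative choices, not part of the handout’s example):

> x = rexp(200, rate = 2)                       # in practice: the data to be assessed
> n = length(x)
> plot(qexp((1:n)/(n + 1), rate = 2), sort(x),
+      xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
> abline(0, 1)                                  # points near this line support the hypothesis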

Figure 2: Left: QQ-plot of the areas of the New Jersey counties. Right: QQ-plot of simulated Uniform[0,1] data.

> qqnorm(New.Jersey)
> qqline(New.Jersey)
> U = runif(1000)
> theoretical = seq(from = 1/1001, to = 1000/1001, length.out = 1000)
> qqplot(theoretical, U, ylab = "Sample Quantiles", xlab = "Theoretical Quantiles", main = "QQ-plot against Uniform-distr")
> abline(lm(U[order(U)] ~ theoretical))


4.1 Backup

Suppose that F is the cdf of a continuous random variable and is strictly increasing on some interval I, that F = 0 to the left of I and F = 1 to the right of I; I may be unbounded. Under this assumption the inverse function F^-1 is well defined: x = F^-1(y) if y = F(x). The pth quantile of the distribution F is defined to be the value xp such that F(xp) = p, or P(X ≤ xp) = p. Under the preceding assumptions, xp is uniquely defined as xp = F^-1(p). Special cases are p = 1/2, which corresponds to the median of F, and p = 1/4 and p = 3/4, which correspond to the lower and upper quartiles of F.

Proposition: Let X be a random variable with a cdf F that satisfies the above assumptions, and let Z = F(X). Then Z has a uniform distribution on [0,1].
Proof: P(Z ≤ z) = P(F(X) ≤ z) = P(X ≤ F^-1(z)) = F(F^-1(z)) = z

Proposition: Let U be uniform on [0,1] and let X = F^-1(U). Then the cdf of X is F.
Proof: P(X ≤ x) = P(F^-1(U) ≤ x) = P(U ≤ F(x)) = F(x)
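Both propositions are easy to check by simulation. A minimal sketch using the exponential distribution (an illustrative choice; pexp and qexp are its cdf and quantile function in R):

> X = rexp(1000, rate = 1)
> Z = pexp(X, rate = 1)                       # first proposition: Z = F(X) should be Uniform[0,1]
> hist(Z, prob = TRUE)                        # approximately flat
> U = runif(1000)
> Y = qexp(U, rate = 1)                       # second proposition: X = F^-1(U) should have cdf F
> qqplot(qexp(ppoints(1000), rate = 1), Y)    # compare against exponential quantiles
> abline(0, 1)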

5 Re-expression of the Data

Usually, it is preferable if the distribution of your data is symmetric. For example, if you have heavily skewed data, a histogram or a boxplot won’t be very informative. And it is not only a matter of visualization: many models and parametric tests assume that the data are (at least approximately) normally distributed, and if your data are heavily skewed you can be sure this assumption is violated. Hence, it is sometimes better to transform your data prior to analyzing it. Note that linear transformations (of the form x → a + bx) do not change the shape of the distribution, so for this purpose only non-linear transformations are useful.

An important class of such transformations is Tukey’s Ladder of Powers, given by y → y^p, for some power p. This class of transformations changes the distribution of the data, but does not affect the ordering of the data points. If you use p < 1, the upper tail of the distribution is pulled in and the lower tail stretched out; a value p > 1 has the reverse effect. (Note that for p = 1 the transformed data equal the original data.) The farther the value of p is from one, the stronger this effect. A special case is p = 0. Usually y^0 is defined to be 1, but it would be useless to transform all data values to 1. However, it turns out that when the powers p are ordered according to the strength of their reshaping effect, the logarithm falls naturally in the place of the zero power (i.e. y → log(y)). For p < 0, the re-expression in the Ladder of Powers is given by y → −y^p, where the extra minus sign is added in order to preserve the original ordering of the data. A small R sketch of this ladder is given after the list below.

The three main objectives for transforming data are:

• Normalize the distribution: many statistical models and parametric tests assume that the data are (approximately) normal. Common transformations to bring data closer to normality are the logarithm transformation and the inverse probability transformation. (The latter is based on the fact that if X is distributed according to cdf F, with F^-1(.) well defined, then Y = Φ^-1(F(X)) is standard normally distributed, where Φ(.) is the cdf of the standard normal distribution.)

• Stabilize the variances: besides normality, many parametric tests also require equal variances of the data points. A typical example of a variance-stabilizing transformation is the square root transformation: y → √y.

• Linearize the trend: regression analysis techniques require the assumption of linearity. For non-linear data there exist non-linear regression approaches, but in many cases it is easier to apply a linearizing transformation. A typical example of such a re-expression is the logarithmic transformation.
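The promised sketch of the Ladder of Powers (the helper name ladder is hypothetical; it follows the three cases above directly):

> ladder = function(y, p) {
+   # Tukey's Ladder of Powers: y^p for p > 0, log(y) for p = 0,
+   # and -y^p for p < 0 (the minus sign preserves the ordering)
+   if (p > 0) y^p else if (p == 0) log(y) else -y^p
+ }
> hist(ladder(rexp(500), 1/2))   # e.g. a square-root re-expression of right-skewed data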


6 Example: Exploratory Data Analysis

Last week’s handout contained an exercise about the dataset "jj.dat", which contains the quarterly earnings per share of the company Johnson & Johnson for the period 1960 to 1980. This time series, which is taken from the website of Shumway and Stoffer’s book, can be downloaded from the section website. Below, this time series is used to illustrate some of today’s concepts in R.
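How to read the file depends on its exact format. Assuming jj.dat is a plain column of numbers (an assumption; check the file), a minimal sketch to obtain the time series object jj.ts used below:

> jj = scan("jj.dat")                           # read a plain numeric column (assumed format)
> jj.ts = ts(jj, start = 1960, frequency = 4)   # quarterly series starting in 1960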

The decimal point is at the |

   0 | 466667778888899900022333445566899
   2 | 123334784667
   4 | 03349003688
   6 | 0146899778
   8 | 379955
  10 | 0369
  12 | 120
  14 | 078
  16 | 02

Figure 3: Stem-and-leaf plot of quarterly earnings per Johnson & Johnson share.

> stem(jj.ts)

Figure 4: Quarterly earnings per Johnson & Johnson share. Left: boxplot. Right: histogram.

> par(mfrow = c(1,2))
> boxplot(jj.ts, ylab = "Earnings per Share", main = "Boxplot")
> hist(jj.ts, xlab = "Earnings per Share", main = "Histogram", prob = TRUE)

As these three plots illustrate, the dataset is skewed to the right. We want to apply a transformation to make the distribution of this data more symmetric. As the data have a large right tail, we can use a transformation from Tukey’s Ladder of Powers with p < 1. Let’s choose p = 0, i.e. the logarithmic transformation.

Figure 5: Boxplot and histogram of the log-transformed quarterly earnings per Johnson & Johnson share.

> par(mfrow = c(1,2))
> boxplot(log(jj.ts), ylab = "log(Earnings per Share)", main = "Boxplot")
> hist(log(jj.ts), xlab = "log(Earnings per Share)", main = "Histogram", prob = TRUE)

Figure 6: Normal QQ-plots. Left: original data. Right: after log-transformation.

> par(mfrow = c(1,2))
> qqnorm(jj.ts, main = "Normal qq-plot, earnings per J&J-share")
> qqline(jj.ts)
> qqnorm(log(jj.ts), main = "Normal qq-plot, log(earnings per J&J-share)")
> qqline(log(jj.ts))




Figure 7: Time series plots. Left: original data. Right: after log-transformation.

> par(mfrow = c(1,2))
> plot(jj.ts, ylab = "Earnings per Share", main = "Johnson & Johnson")
> plot(log(jj.ts), ylab = "log(Earnings per Share)", main = "Johnson & Johnson")

Note that the logarithmic transformation not only symmetrizes the data, it also makes the initial exponential trend in the time series more or less linear.

7 Bibliography

This handout is based on handouts prepared by Irma Hernandez-Magallanes, a previous GSI for this course. Additional sources used, which may also be useful to you:

• "Exploratory Data Analysis" by John W. Tukey
• "Applications, Basics and Computing of Exploratory Data Analysis" by Paul F. Velleman and David C. Hoaglin
• "Modern Applied Statistics with S" by W.N. Venables and B.D. Ripley
• "Time Series Analysis and Its Applications: With R Examples" by Robert Shumway and David Stoffer
