Graphics. Chapter Examining data with graphics


Author: Colin Wood


Chapter 6

Graphics

Modern computers with high-resolution displays and graphic printers have revolutionized the visual display of information in fields ranging from computer-aided design, through flow dynamics, to the spatiotemporal attributes of infectious diseases. The impact on statistics is just being felt. Whole books have been written on statistical graphics and their contents are quite heterogeneous: simple how-to-do-it texts (e.g., ?; ?), reference works (e.g., Murrell, 2006), and generic treatments of the principles of graphic presentation (e.g., ?). There is even a web site devoted to learning statistics through visualization, http://www.seeingstatistics.com/. Hence, it is not possible to be comprehensive in this chapter. Instead, I focus on the types of graphics used most often in neuroscience (e.g., plots of means) and avoid those seldom used in the field (e.g., pie charts, geographical maps).

There are four major purposes for statistical graphics. First, they are used to examine and screen data to check for abnormalities and to assess the distributions of the variables. Second, graphics are a very useful aid to exploratory data analysis (??). Exploratory data analysis, however, is used for mining large data sets, mostly for the purpose of hypothesis generation and modification, so that use of graphics will not be discussed here. Third, graphics can be used to assess both the assumptions and the validity of a statistical model applied to data. Finally, graphics are used to present data to others. The third of these purposes will be discussed in the appropriate sections on the statistics. This chapter deals with the first and last purposes: examining data and presenting results.

6.1

Examining data with graphics

The main purpose here is to view the data to detect outliers and to make decisions about transforming variables. If the design has groups (even ordered groups), then your plots should be constructed separately for each group. It is possible for an outlier in one group to be hidden by the data points for other groups.

Figure 6.1: Examples of histograms: IQ scores from patients in a pediatric neurology clinic. (Two panels; horizontal axis: IQ; vertical axis: Frequency.)

6.1.1

Histograms

In the past, teachers of statistics tortured undergraduate students by giving them a data set and requiring them to draw a histogram on graph paper. With access to graphing software, those days are over (mostly). The histogram groups a numeric variable into “bins” and then plots the midpoint of the bin on the horizontal axis and the frequency of scores within that bin on the vertical axis. The frequency can be expressed as either the raw number of scores or the percent of scores in the bin.

In general, histograms are best used for relatively large sample sizes. The most significant decision for the user is the number of bins. When the number is too small, information is effectively “hidden.” When the number is too large, the histogram may have only one or two observations per bar and provides little information about the distribution. Also, bin size is a function of sample size: smaller bins can be used with larger samples. Good statistical and graphing software usually gives a reasonable default size for the bins based on sample size. All software will give the user the option of specifying either the bin size (i.e., width of the bar) or the number of bins.

Figure 6.1 gives histograms of the IQ scores of 63 children seen at a pediatric neurology clinic. The left hand plot in Figure 6.1 was generated using a default for the number of bins. The right hand plot specified 25 bins.

Table 6.1: R code for producing a dot plot (strip chart).

    stripchart(pkcgamma$open_arm ~ pkcgamma$genotype, method="jitter",
               vertical=TRUE, ylab="Percent Time in Open Arm",
               xlab="Genotype", col="blue", pch=1, cex=1.5)

Note that histograms result in a “wide” plot for a variable. Also, the range of the horizontal axis is often determined by the software. Hence, comparing groups using histograms can require extra work and the resulting graphic may not be easy to interpret. There are fancy routines that will plot histograms for two (sometimes three) groups using “transparent” color schemes, but the simplest way to examine groups is through dot plots and/or box plots.
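The effect of the number of bins on a histogram can be seen by plotting one variable twice. The sketch below is only an illustration: the scores are simulated (63 values with an assumed mean of 75 and standard deviation of 12), not the clinic data, and the object names are made up for the example.

```r
# Simulated stand-in for the IQ data; these are NOT the clinic scores.
set.seed(1)
iq <- round(rnorm(63, mean = 75, sd = 12))

op <- par(mfrow = c(1, 2))           # two panels side by side
h_default <- hist(iq, xlab = "IQ", main = "Default bins")
# breaks = 25 is only a suggestion; R chooses "pretty" break points near it
h_many <- hist(iq, breaks = 25, xlab = "IQ", main = "About 25 bins")
par(op)

# Number of bins actually drawn in each panel
length(h_default$counts)
length(h_many$counts)
```

Because R treats a single number passed to `breaks` as a suggestion, the second panel may not contain exactly 25 bars; supplying a vector of break points gives full control.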

6.1.2

Dot plots (strip charts)

A “dot plot” means different things to different statistical packages. Here, the term is used in its traditional sense (?) and encompasses what R calls a “strip chart.” Dot plots can be one of the most useful ways of displaying and perusing data for neuroscience because sample sizes are usually small to moderate. The type of dot plot most useful in neuroscience has the groups on the horizontal axis and the values for the variable on the vertical axis. Each observation then becomes a point in the graph, plotted by its value.

Figure 6.2 gives two examples for the PKC-γ data. The left hand panel gives the traditional dot plot. In the right hand panel, the points are “jittered,” or moved slightly to the left or right, in order to avoid overlapping points. Table 6.1 presents the R code for these plots. This code produced the right-hand or “jittered” figure. To produce the left-hand figure, omit the argument method="jitter".

The advantages of a dot plot are: (1) every value in the data can be visually inspected; (2) the range of the data for each group is apparent, which helps when there may be significant differences in variance; and (3) outliers can be readily identified. The major disadvantage comes when data sets are so large that overlapping points make it difficult to appreciate the distribution of scores (although some graphical software can overcome this limitation with moderately sized data sets). A second disadvantage is that the dot plot does not give information about the statistics of a distribution.
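Since the PKC-γ data set is not reproduced here, the difference between the two panels can be tried on made-up data. The following is a sketch under that assumption: the data frame `dat`, its group means, and its spreads are hypothetical, and the scores are rounded so that tied values occur.

```r
# Hypothetical data: three genotype groups with rounded scores so ties occur.
set.seed(2)
dat <- data.frame(
  genotype = factor(rep(c("++", "+-", "--"), each = 15),
                    levels = c("++", "+-", "--")),
  open_arm = pmax(0, round(c(rnorm(15, 12, 5), rnorm(15, 15, 3),
                             rnorm(15, 20, 6))))
)

op <- par(mfrow = c(1, 2))
# Traditional dot plot: tied values are drawn on top of one another
stripchart(open_arm ~ genotype, data = dat, vertical = TRUE,
           xlab = "Genotype", ylab = "Percent Time in Open Arm", pch = 1)
# Jittered dot plot: points are nudged sideways so ties become visible
stripchart(open_arm ~ genotype, data = dat, vertical = TRUE,
           method = "jitter", jitter = 0.15,
           xlab = "Genotype", ylab = "Percent Time in Open Arm", pch = 1)
par(op)

# Number of tied values per group that the unjittered panel would hide
tapply(dat$open_arm, dat$genotype, function(v) sum(duplicated(v)))
```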

Figure 6.2: Example of a dot plot. (Left panel: traditional dot plot; right panel: jittered dot plot. Horizontal axis: Genotype, with groups ++, +−, and −−; vertical axis: Percent Time in Open Arm.)

6.1.3

Box (and whisker) plot

Figure 6.3 provides a box plot (aka box-and-whisker plot) of the same data. In this plot, individual data points are not identified. Instead, the shape of the distribution is visually portrayed through the shape of a box and arms (or whiskers). The bottom arm begins with the lowest connected data point in the series (we postpone the definition of a connected data point for the moment). The lower part of the box starts with the score at the first quartile and the upper part ends with the score at the third quartile. The horizontal line close to the middle of the box is the median. Finally, the upper “whisker” starts at the third quartile and ends with the uppermost connected data point. Hence, the size of the box gives the interquartile range of the data, i.e., the 25th through the 75th percentiles.

Box plots are more useful than dot plots when there is a moderate to large number of observations per group. Their symmetry, or lack of it, permits one to assess skewness, and they are very helpful in detecting outliers. In symmetrical distributions, the median splits the box into equal halves and the two whiskers are of equal length. A distribution with a positive skew (see the right hand box plot in Figure 6.4) has a short whisker at the bottom, a median that is located below the halfway point in the box, a long whisker at the top, and a number of unconnected data points at the high end. A variable with a negative skew has the opposite shape: a short whisker at the top, a median above the halfway point in the box, a long whisker at the bottom, and a number of unconnected data points at the low end (see Figure 6.4).
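The quantities that a box plot draws can also be inspected as numbers. The sketch below uses `boxplot.stats()` from base R on hypothetical, positively skewed scores (simulated from an exponential distribution, which is an assumption for the example). Note that R builds the box from Tukey's hinges, which can differ slightly from the quartiles other software reports.

```r
set.seed(3)
score <- rexp(40, rate = 0.5)        # hypothetical positively skewed scores

b <- boxplot.stats(score)
names(b$stats) <- c("lower whisker", "box bottom", "median",
                    "box top", "upper whisker")
b$stats      # the five values that define the whiskers and the box
b$out        # values that would be plotted as unconnected points

# Checks for the signs of positive skew described in the text
box_mid   <- (b$stats["box bottom"] + b$stats["box top"]) / 2
upper_len <- b$stats["upper whisker"] - b$stats["box top"]
lower_len <- b$stats["box bottom"] - b$stats["lower whisker"]
c(median_below_mid = unname(b$stats["median"] < box_mid),
  upper_longer     = unname(upper_len > lower_len))

boxplot(score, ylab = "Score")
```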

Figure 6.3: Box plots for the PKC-γ data by genotype. (Horizontal axis: Genotype, with groups ++, +−, and −−; vertical axis: Percent Time in Open Arm.)

6.1.3.1

Unconnected data points

Most graphing software offers two options for dealing with very high and very low values. The first option is to plot them as unconnected data points above and below the whiskers. This was the option used to generate Figures 6.3 and 6.4. The second option is to extend the whiskers to the lowest and to the highest values.

There are no formal criteria defining an unconnected data point. Hence, it is always necessary to consult the manual for the software. Many programs define an unconnected data point at the low end as a value that lies more than 1.5 times the interquartile range below the first quartile. For example, if the first quartile is at 31.4 and the third quartile is at 42.7, then the interquartile range is 42.7 - 31.4 = 11.3. Hence the cutoff for a lower unconnected data point would be 31.4 - 1.5*11.3 = 14.45. A similar criterion is used for an unconnected data point at the higher end of the distribution.

One cautionary note is in order: never consider an “unconnected” data point an outlier without further evidence. An unconnected data point may achieve its status simply because sample size is not large and the interquartile range for a group is small. This is precisely the case for the heterozygote’s box plot in Figure 6.3. Examination of the highest value for genotype +- in the dot plot of Figure 6.2 reveals that it belongs to the rest of the distribution. Comparison of the size of the three boxes in Figure 6.3 demonstrates that genotype +- has a smaller interquartile range than the other two.

Figure 6.4: Box plots of skewed variables. (Horizontal axis: Skewness, with groups Negative and Positive; vertical axis: Score.)
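The cutoff arithmetic for unconnected data points can be written out in a few lines. This sketch assumes the common 1.5 times interquartile range rule and applies the mirror-image cutoff at the upper end; the vector `x` holds made-up scores used only to show the flagging step.

```r
q1 <- 31.4                       # first quartile from the example
q3 <- 42.7                       # third quartile from the example
iqr <- q3 - q1                   # 11.3
lower_fence <- q1 - 1.5 * iqr    # 14.45
upper_fence <- q3 + 1.5 * iqr    # 59.65
c(iqr = iqr, lower = lower_fence, upper = upper_fence)

# Hypothetical scores: keep only those falling outside the two cutoffs
x <- c(12.0, 20.1, 33.5, 38.2, 41.0, 47.9, 61.3)
x[x < lower_fence | x > upper_fence]   # 12.0 and 61.3
```

In R's `boxplot()`, the multiplier is set by the `range` argument (1.5 by default), and `range = 0` extends the whiskers to the data extremes, which corresponds to the second option described above.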

In a box plot, never consider an unconnected data point as an outlier without further evidence.

A further issue with unconnected data points is that you will (not may) observe them in very large samples. Recall the property of the range (see Section X.X) that as the sample size grows larger, the chance of observing extreme values increases, and hence the range increases. Very large sample sizes will stabilize the size of the box and length of the whiskers, but increase the likelihood that extreme scores will be sampled.

6.1.3.2

Box plots and small samples

Upon learning to screen data using graphics, students often have trouble grasping the extent to which box plots can vary when sample size is small. Figure 6.5 provides an example. Here, ten different samples with an N of 8 in each sample were generated using random numbers from a normal distribution. Thus, there are no outliers in any of the samples and there are no significant differences in variability in any of the samples. One can certainly use graphical methods for screening, but it is imperative to use objective statistics to assess whether or not the distributions of scores differ across groups.

Figure 6.5: Box plots for 10 random samples of size 8.
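A simulation of this kind is easy to repeat. The sketch below draws ten samples of eight scores from a single normal distribution; the mean of 50, the standard deviation of 10, and the seed are arbitrary assumptions, so the picture will not match Figure 6.5. Bartlett's test appears only as one example of an objective check on variability; it is not named in the text.

```r
set.seed(4)
n_samples <- 10
n_per <- 8
# Every sample comes from the same normal distribution, so any differences
# between the boxes are sampling noise.
scores <- rnorm(n_samples * n_per, mean = 50, sd = 10)
sample_id <- factor(rep(seq_len(n_samples), each = n_per))

boxplot(scores ~ sample_id, xlab = "Sample", ylab = "Score")

# Interquartile range of each sample: same population, different boxes
round(tapply(scores, sample_id, IQR), 1)

# One possible objective check of equal variances (an illustrative choice)
bartlett.test(scores ~ sample_id)
```

Changing the seed and rerunning the code gives a feel for how much the boxes move from one set of small samples to the next.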

6.1.4

Box and dot plots

The flexibility of modern graphics software permits hybrids of plots. The downside is that you must do the work to create the hybrids. One useful hybrid is a combination of a box plot and a dot plot. This can be performed by creating a box plot and then superimposing the dot plot over it. An example using the simulated data from Figure 6.5 is given in Figure 6.6.
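In base R, one way to build this hybrid is to draw the box plot and then call `stripchart()` with `add = TRUE`. The sketch below uses freshly simulated scores (an assumption; it does not reproduce the data behind Figure 6.6).

```r
set.seed(5)
scores <- rnorm(80, mean = 50, sd = 10)      # simulated scores
sample_id <- factor(rep(1:10, each = 8))

# Draw the boxes first; outline = FALSE suppresses the unconnected points
# so they are not drawn twice once the raw data are added.
bp <- boxplot(scores ~ sample_id, outline = FALSE,
              ylim = range(scores), xlab = "Sample", ylab = "Score")
# add = TRUE overlays the dot plot on the existing box plot
stripchart(scores ~ sample_id, vertical = TRUE, method = "jitter",
           pch = 16, col = "blue", add = TRUE)
```

Setting `ylim` to the full range of the data keeps the most extreme points inside the plotting region after the outliers have been suppressed in the box plot layer.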

6.1.5

Violin Plots

A violin plot combines information from a box plot with smoothed information about the frequency of scores at a value. To understand a violin plot, it is easier to look first and explain later, so examine Figure 6.7, which gives a violin plot for the same data used to construct Figures 6.5 and 6.6.

In place of a rectangle, the figure in a violin plot is scaled so that the width represents the density of scores at a certain value. If the distribution of scores were normal, then a violin plot would resemble that of Sample 5 but the shape would be completely symmetric. Sample 3 is almost normal but has a slight negative skew. Samples 1 and 8 illustrate positive skewness. Finally, Samples 4 and 10 depict a uniform distribution (i.e., one with a histogram that resembles a rectangle).

Violin plots may also include a symbol for the median (the white dot in Figure 6.7) and lines denoting the interquartile range (denser vertical line in the figure) and connected data points (finer vertical line). As in a box plot, there are no universal standards for these symbols or for the definition of a “connected” data point, so always consult the software’s documentation.

Figure 6.6: A dot plot superimposed over a box plot.

Figure 6.7: Examples of violin plots. (Horizontal axis: Sample, numbered 1 through 10.)
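Base R has no built-in violin plot function, and add-on packages are the usual route. To show what the plot is made of, the following sketch assembles a rough version by hand from `density()` and `polygon()`. The data are simulated, and the scaling factor of 0.4, the colors, and the symbols are arbitrary choices rather than any standard.

```r
set.seed(6)
scores <- rnorm(80, mean = 50, sd = 10)      # simulated scores
sample_id <- factor(rep(1:10, each = 8))
groups <- split(scores, sample_id)

# Kernel density for each sample, restricted to the observed range
dens <- lapply(groups, function(v) density(v, from = min(v), to = max(v)))
# One common scale so that widths are comparable across samples
max_d <- max(sapply(dens, function(d) max(d$y)))

plot(NA, xlim = c(0.5, length(groups) + 0.5), ylim = range(scores),
     xaxt = "n", xlab = "Sample", ylab = "Score")
axis(1, at = seq_along(groups), labels = names(groups))

for (i in seq_along(groups)) {
  d <- dens[[i]]
  half <- 0.4 * d$y / max_d                 # half-width at each score value
  # Mirror the density around the group's position to form the violin
  polygon(c(i - half, rev(i + half)), c(d$x, rev(d$x)), col = "grey85")
  q <- quantile(groups[[i]], c(0.25, 0.5, 0.75))
  segments(i, q[1], i, q[3], lwd = 3)       # interquartile range
  points(i, q[2], pch = 21, bg = "white")   # median
}
```

With only eight scores per sample the density estimates are very rough, which is one more reason not to read too much into the shape of any single violin.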

6.1.6

Assessing distributions

Dot, box, and violin plots are very useful in examining data and can give hints about the underlying distributions. For many statistical analyses, these graphics should be sufficient to detect potential problems. In some cases, however, it is necessary to use stricter criteria to see if the scores fit a specific distribution. Here, three types of graphics are often used: (1) a histogram with superimposed plots of a theoretical distribution and/or kernel density; (2) a plot of observed and theoretical cumulative distribution; and (3) a quantile-quantile or QQ plot. These are illustrated in Figure 6.8 for 200 scores randomly sampled from a normal distribution with a mean of 10 and standard deviation of 2.

6.1.6.1

Histogram with theoretical and kernel densities

If a distribution is normal with mean µ and standard deviation σ, then drawing a normal curve based on those statistics over an observed histogram should reveal a close fit. The only trick here is to make certain that the scale of the vertical axis for the histogram is the proportion of data and not a raw count.

To further assess the fit, one should also plot a kernel density. A kernel density may be viewed as an agnostic (technically, nonparametric) method of estimating the shape of the distribution. It uses one of several functions (the kernel function) and applies that function over a small section of the distribution to arrive at an estimate of the density for a value within that small section. Think of it as a smoothed histogram. Hence, if the kernel density agrees well with the theoretical density, then there is good evidence that the observed data follow the theoretical distribution. Table 6.2 gives the R code for producing the plot in the upper panel of Figure 6.8. It is clear that the kernel estimate agrees well with the theoretical normal.

6.1.6.2

Observed and theoretical cumulative distribution

The observed cumulative distribution function for X plots the value of X on the horizontal axis and the proportion of all observed scores less than or equal to X on the vertical axis. If the distribution is normal with a mean equal to the observed mean and a standard deviation equal to the observed one, then one can construct a plot of the area under this normal curve from negative infinity to X

Figure 6.8: Three graphical means for assessing the fit between an observed and theoretical normal distribution: a histogram with kernel density and normal density plot (upper panel); observed and theoretical cumulative density plot (middle panel); and quantile-quantile or QQ plot (lower panel). (Upper panel: Density against x, with legend entries Histogram, Normal, and Kernel. Middle panel, “Cumulative Density Function”: Frequency against X, with legend entries Empirical cdf and Theoretical cdf. Lower panel, “QQ Plot”: Sample Quantiles against Theoretical Quantiles.)

Table 6.2: R code for plotting a histogram, theoretical normal, and kernel density.

    hist(x, col="bisque", freq=FALSE)
    meanx