2. Introduction to Statistics and Sampling

Author: Colin Stone
2.1 Definitions

2.1.1 Population
2.1.2 Sample
2.1.3 Probability
2.1.4 Continuous vs. discrete variables → we will concentrate on continuous variables

2.2 Graphical Representation of a Finite Sample

Example 2-1: Twenty ball bearings are taken off an assembly line and their diameters are measured, with the resulting values given below.

Table 2.1. Ball bearing diameters (in.).

0.98  1.26  0.96  0.78
1.07  1.04  1.08  0.86
1.16  1.02  1.02  0.94
1.34  0.99  1.11  0.96
1.21  1.06  0.86  0.68

2.2.1 Frequency table

• $F_j$ = number of occurrences within a given range (the measurement's frequency)
• $n$ = number of data points in the sample
• $\Delta x_j$ = interval width

• Number of intervals (or bins) to use: usually between 5 and 20. The right number depends on your situation. As a starting point, several equations exist to make a first guess. Figliola and Beasley suggest

$$k \approx 1.87\,(n-1)^{0.40} + 1, \qquad (2.1)$$

which seems to work well for small samples, say, below 50. Montgomery and Runger suggest

$$k \approx \sqrt{n}, \qquad (2.2)$$

which seems to work well with larger samples. Any method you use is just a rule of thumb, and you will likely need to adjust k to make the data presentation as meaningful as possible. In this example, n = 20, so we use Eq. (2.1) to calculate k to be 7.072 (or 7, choosing the closest integer).
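The two rules of thumb can be checked numerically with a short sketch. The function names `bins_small_sample` and `bins_large_sample` are illustrative and not from the notes; the formulas are Eqs. (2.1) and (2.2).

```python
import math


def bins_small_sample(n):
    """First guess for the number of bins from Eq. (2.1)."""
    return 1.87 * (n - 1) ** 0.40 + 1


def bins_large_sample(n):
    """First guess for the number of bins from Eq. (2.2)."""
    return math.sqrt(n)


n = 20  # sample size of Example 2-1
k_small = bins_small_sample(n)
k_large = bins_large_sample(n)

print(f"Eq. (2.1): k = {k_small:.3f}, rounded to {round(k_small)}")
print(f"Eq. (2.2): k = {k_large:.3f}, rounded to {round(k_large)}")
```

Either result is only a starting point; the bin count is then adjusted by eye, as the text notes.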

• How do we choose the interval width, $\Delta x_j$? Use

$$\Delta x_j = \frac{x_{\max} - x_{\min}}{k}. \qquad (2.3)$$

• In this example,

$$\Delta x_j = \frac{1.34 - 0.68}{7.072} = 0.0933.$$

• We want to choose a simple bin width, one that could be read easily from a plot. We therefore choose 0.10 as the bin width.
• Knowing the bin width, we can now assign measurements to bins in a table.

2.2.2 Histogram

• A histogram is a frequency distribution in graphical form.
• Call $F_j$ (defined above) the frequency.
• We can also plot a histogram of the relative frequency, which is the frequency divided by the total number of data points in the sample:

$$f_j = \frac{F_j}{n}. \qquad (2.3)$$

• Note that the relative frequency is the sample's estimate of the probability that a measurement falls in interval j.
• Note also that

$$\sum_{j=1}^{k} f_j = 1. \qquad (2.4)$$
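A minimal sketch of the binning step for the Example 2-1 data follows. It assumes bin edges at 0.55, 0.65, ..., 1.45 in. (read from the axis of Figure 2.1) and intervals of the form lower < d ≤ upper (as in the figure's annotation). The helper name `frequency_table` is illustrative. Values are converted to integer hundredths of an inch so that comparisons against bin edges are exact.

```python
diameters = [0.98, 1.26, 0.96, 0.78, 1.07, 1.04, 1.08, 0.86, 1.16, 1.02,
             1.02, 0.94, 1.34, 0.99, 1.11, 0.96, 1.21, 1.06, 0.86, 0.68]

# Work in integer hundredths of an inch to avoid floating-point edge effects.
values = [round(d * 100) for d in diameters]

# Assumed binning: first edge 0.55 in., width 0.10 in., nine bins up to 1.45 in.
first_edge, width, n_bins = 55, 10, 9


def frequency_table(values, first_edge, width, n_bins):
    """Return a list of (lower, upper, F_j) with bins lower < x <= upper."""
    table = []
    for j in range(n_bins):
        lower = first_edge + j * width
        upper = lower + width
        count = sum(1 for v in values if lower < v <= upper)
        table.append((lower, upper, count))
    return table


table = frequency_table(values, first_edge, width, n_bins)
n = len(values)
rel_freq = [count / n for _, _, count in table]  # f_j = F_j / n

for (lower, upper, count), f in zip(table, rel_freq):
    print(f"{lower / 100:.2f} < d <= {upper / 100:.2f}: F_j = {count}, f_j = {f:.2f}")
print("sum of f_j:", round(sum(rel_freq), 10))
```

Seven of the nine bins are occupied, which agrees with the estimate k ≈ 7; the two empty end bins only pad the plot.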

[Figure 2.1: histogram. Horizontal axis: Bearing diameter, in., with bin edges from 0.55 to 1.45 in steps of 0.10. Vertical axes: Frequency (0 to 8) and Relative Frequency (0.00 to 0.40). Annotation: "This bar means that 4 ball bearings have diameters within 1.05 < d ≤ 1.15 in."]

Figure 2.1. Histogram of ball bearing diameters from Example 2-1.

2.2.3 Cumulative Frequency or Relative Frequency Plots

• A plot in which each bar represents the total observation count to that point.
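As a sketch of how the cumulative counts arise, the snippet below counts the Example 2-1 diameters at or below each upper bin edge. The edges 0.65, 0.75, ..., 1.45 in. are an assumption based on the axis of the histogram, and the variable names are illustrative.

```python
diameters = [0.98, 1.26, 0.96, 0.78, 1.07, 1.04, 1.08, 0.86, 1.16, 1.02,
             1.02, 0.94, 1.34, 0.99, 1.11, 0.96, 1.21, 1.06, 0.86, 0.68]

# Integer hundredths of an inch keep the comparisons with bin edges exact.
values = [round(d * 100) for d in diameters]
n = len(values)

# Assumed upper bin edges: 0.65, 0.75, ..., 1.45 in.
upper_edges = range(65, 146, 10)

# Cumulative frequency: number of observations at or below each upper edge.
cumulative = [(edge, sum(1 for v in values if v <= edge)) for edge in upper_edges]

for edge, count in cumulative:
    print(f"d <= {edge / 100:.2f} in.: {count} bearings, "
          f"cumulative relative frequency {count / n:.2f}")
```

The cumulative relative frequency is simply the cumulative count divided by n, so the last bar always reaches 1.00.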

[Figure 2.2: cumulative frequency plot. Horizontal axis: Bearing diameter, in., from 0.55 to 1.45. Vertical axes: Cumulative Frequency (0 to 20) and Cumulative Rel. Frequency (0.00 to 1.00). Annotation: "This bar means that 16 ball bearings (80 percent) have diameters ≤ 1.15 in."]

Figure 2.2. A cumulative distribution plot of the ball bearing diameters of Example 2-1.

2.2.4 Some definitions

• Percentile: the point below which a certain percentage of the data lie. (For example, in Figure 2.2, a value of 1.15 inches is the 80th percentile, since 80% of the measurements are below 1.15 in.)
• Fractile (or quantile): the point below which a stated fraction of the measurements lie (1.15 inches is the 0.80-fractile).
• Quartile: the values that divide the data into four equal sets. The median (middle value) of the entire set is the second quartile; below this value 50% of the data lie (same as the 50th percentile). The median of the lower half of the data set is the first or lower quartile (25th percentile), and the median of the upper half of the data set is the third or upper quartile (75th percentile).
• Interquartile range (IQR): the difference between the upper quartile and the lower quartile; that is, the range covering the middle 50% of the data. See Figure 2.3 below.
• Notice that the above definitions describe the spread of the data.

[Figure 2.3: illustration marking the 25%, 50%, and 75% points of a data set, labeled lower quartile, median, and upper quartile.]

Figure 2.3. Illustration of quartiles.

2.2.5 Box Plots

• A box plot is another graphical depiction of a data set, which shows the following five-number summary of the data:
  1. the minimum value
  2. the lower quartile (1st quartile) value
  3. the median (middle) value
  4. the upper quartile (3rd quartile) value
  5. the maximum value

• These values are determined by first arranging the data in ascending order. The minimum and maximum are at the top and bottom of the list, respectively.

• To estimate the quartiles, we will use the following approximate technique: We first estimate the numbered position of the quartile. We calculate this by (n+1)·p, where n is the total number of observations and p is the percentile expressed as a fraction. Thus the position of the first quartile (25th percentile) is 0.25(n+1), the position of the median is 0.5(n+1), and the third quartile is located at position 0.75(n+1).
  o If the position is an integer, the observation corresponding to that position is chosen as the quartile value.
  o If the position is halfway between two integers, the average of the two corresponding observations is chosen.
  o If the position is neither an integer nor halfway between two integers, the observation closest to the position is chosen.

• There are a number of more sophisticated ways of estimating quartiles (Excel has a QUARTILE function, for example), but the method described here will suffice for most applications.

Example 2-2: Construct a five-number summary and box plot for the data of Example 2-1.

Solution: First, we arrange the data in ascending order and number them, as shown in Table 2.2. We can then identify the maximum and minimum values immediately; the quartile positions and the resulting five summary numbers are listed after the table. Note that since there is an even number of data, there are two middle values, data numbers 10 and 11, and the median is the average of these two values (here they happen to be equal). The lower and upper quartile positions, 5.25 and 15.75, are neither integers nor halfway between two integers, so the closest observations (numbers 5 and 16) are used.

Table 2.2. Data of Example 2-1, arranged in ascending order.

number  diameter
1       0.68
2       0.78
3       0.86
4       0.86
5       0.94
6       0.96
7       0.96
8       0.98
9       0.99
10      1.02
11      1.02
12      1.04
13      1.06
14      1.07
15      1.08
16      1.11
17      1.16
18      1.21
19      1.26
20      1.34

Quartile positions:
• lower quartile: 0.25(20+1) = 5.25 ≈ 5
• median: 0.5(20+1) = 10.5
• upper quartile: 0.75(20+1) = 15.75 ≈ 16

Five-number summary:
• minimum = 0.68
• lower quartile = 0.94
• median = (1.02+1.02)/2 = 1.02
• upper quartile = 1.11
• maximum = 1.34
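The position rule can be written out as a short sketch and applied to the ball bearing data. The function names `value_at_position` and `five_number_summary` are illustrative, and the code assumes at least two observations. It implements only the approximate (n+1)·p technique described in these notes, not Excel's QUARTILE function.

```python
def value_at_position(sorted_data, position):
    """Pick a value for a 1-based position such as 0.25 * (n + 1)."""
    lower_index = int(position)        # integer part of the position
    fraction = position - lower_index  # fractional part
    if fraction == 0:
        # Integer position: take that observation directly.
        return sorted_data[lower_index - 1]
    if fraction == 0.5:
        # Halfway between two positions: average the two observations.
        return (sorted_data[lower_index - 1] + sorted_data[lower_index]) / 2
    # Otherwise take the observation closest to the position.
    nearest = lower_index if fraction < 0.5 else lower_index + 1
    return sorted_data[nearest - 1]


def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    ordered = sorted(data)
    n = len(ordered)
    return (ordered[0],
            value_at_position(ordered, 0.25 * (n + 1)),
            value_at_position(ordered, 0.50 * (n + 1)),
            value_at_position(ordered, 0.75 * (n + 1)),
            ordered[-1])


diameters = [0.98, 1.26, 0.96, 0.78, 1.07, 1.04, 1.08, 0.86, 1.16, 1.02,
             1.02, 0.94, 1.34, 0.99, 1.11, 0.96, 1.21, 1.06, 0.86, 0.68]

print(five_number_summary(diameters))
```

Other software may interpolate between observations instead of picking the nearest one, so its quartiles can differ slightly from this rule.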

Now, we can create a box plot of the data set as shown in Figure 2.4. Here, the interquartile range (3rd quartile minus 1st quartile) is represented by the box. The line inside the box is the median, and the vertical lines on either end of the horizontal line represent the extreme points.

[Figure 2.4: box plot. Horizontal axis: diameter, from 0.4 to 1.6. Labels: minimum, interquartile range (middle 50% of data), median, maximum.]

Figure 2.4. Box plot of the data of Example 2-1.

• As you can see from Figure 2.4, a box plot gives a quick visual “feel” for the data set, much like a histogram, but with additional information.

2.2.6 A note about histograms and box plots

Histograms and box plots only depict overall sample behavior. You can (and probably should) also plot the individual measurements in sequence, as shown below. The data in fact show a general downward trend, which in this case may indicate some kind of drift in the manufacturing process.

[Figure 2.5: sequential plot. Horizontal axis: Observation Number (0 to 20). Vertical axis: Ball bearing diameter, in. (0.4 to 1.6).]

Figure 2.5. Sequential plot of measured ball bearing diameters from Example 2-1.

2.3 Statistical Measures for a Sample

The five-number summary described in Section 2.2 gives some information about the behavior of the sample. A more detailed mathematical analysis of the data yields additional information.

2.3.1 Measurement of central tendency

• Sample mean: for n data points in a sample,

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (2.4)$$

• Median: the middle value (or average of the two middle values).
• Mode: the most frequently recurring measurement.

2.3.2 Measurement of variability

• Standard deviation of a sample:

$$s = \left[ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right]^{1/2} \qquad (2.5)$$

• Variance = $s^2$

2.3.3 Other descriptive parameters

• Skewness (lack of symmetry):

$$\text{skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n-1)\, s^3} \qquad (2.6)$$

• Kurtosis (“peakedness”):

$$\text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{(n-1)\, s^4} \qquad (2.7)$$
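The sample measures of Eqs. (2.4) through (2.7) can be evaluated with a few lines of Python. This is a sketch that follows the definitions exactly as written in these notes, with (n − 1) in the denominators. The function name `sample_statistics` is illustrative, and statistics libraries often define skewness and kurtosis with different normalizations, so their results may not match.

```python
import math


def sample_statistics(data):
    """Return (mean, standard deviation, skewness, kurtosis) per Eqs. (2.4)-(2.7)."""
    n = len(data)
    mean = sum(data) / n                                     # Eq. (2.4)
    deviations = [x - mean for x in data]
    s = math.sqrt(sum(d ** 2 for d in deviations) / (n - 1))  # Eq. (2.5)
    skewness = sum(d ** 3 for d in deviations) / ((n - 1) * s ** 3)  # Eq. (2.6)
    kurtosis = sum(d ** 4 for d in deviations) / ((n - 1) * s ** 4)  # Eq. (2.7)
    return mean, s, skewness, kurtosis


# Ball bearing diameters of Example 2-1 (in.)
diameters = [0.98, 1.26, 0.96, 0.78, 1.07, 1.04, 1.08, 0.86, 1.16, 1.02,
             1.02, 0.94, 1.34, 0.99, 1.11, 0.96, 1.21, 1.06, 0.86, 0.68]

mean, s, skewness, kurtosis = sample_statistics(diameters)
print(f"mean = {mean:.4f} in., s = {s:.4f} in., variance = {s ** 2:.5f} in.^2")
print(f"skewness = {skewness:.4f}, kurtosis = {kurtosis:.4f}")
```

A data set that is symmetric about its mean, such as (1, 2, 3, 4, 5), gives a skewness of zero under Eq. (2.6).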

2.4 Statistical Measures for a Population

2.4.1 Finite populations

• Define N as the number of data in the population.
• Population mean:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad (2.8)$$

• Population standard deviation:

$$\sigma = \left[ \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu)^2 \right]^{1/2} \qquad (2.9)$$

• Variance = $\sigma^2$

2.4.2 Infinite populations

• Population mean:

$$\mu = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} x_i \qquad (2.10)$$

• Population standard deviation:

$$\sigma = \lim_{N \to \infty} \left[ \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \right]^{1/2} \qquad (2.11)$$

References

1. Levine, D.M., Ramsey, P.P., and Smidt, R.K., Applied Statistics for Engineers and Scientists, Prentice-Hall, New Jersey, 2001.
2. Montgomery, D.C. and Runger, G.C., Applied Statistics and Probability for Engineers, 2nd Edition, John Wiley and Sons, 1999.

Homework

2-1 Consider the following table of data.

d [mm]  d [mm]  d [mm]
4.90    4.32    4.60
5.02    4.98    4.71
4.90    4.89    4.78
5.10    5.01    4.32
5.02    4.18    4.77
4.74    5.00    4.85
4.78    4.66    5.12
4.94    4.74    5.09
4.98    4.92    5.20
4.50    5.01    5.20

a. Develop a frequency table for the data.
b. Draw a frequency distribution for the data. You may use Excel to do this.
c. Create a sequential plot of the data.

2-2 For the data of Problem 2-1, and the intervals determined there, construct a cumulative frequency plot.

2-3 Estimate the mean value of the data represented in the probability (relative frequency) distribution below.

[Figure for Problem 2-3: probability distribution. Vertical axis: Probability $f_j$ (0.00 to 0.40). Horizontal axis: Measurement (0.1 to 0.7). N = 60.]

2-4 The probability distribution below is missing one interval value, between 11.0 and 11.5. From the plot, determine the probability of a value occurring between 11.0 and 11.5.

[Figure for Problem 2-4: relative frequency distribution. Vertical axis: relative frequency (0.00 to 0.30). Horizontal axis: Measurement x (8 to 13 in steps of 0.5).]

2-5 For the cumulative frequency distribution shown below,
a. Estimate the probability of measurements being less than or equal to 40.
b. How many measurements occurred in the range 36 < x ≤ 41?
c. What value approximates the 2nd quartile?
d. What values approximate the following percentiles: 10th, 70th, 90th?

[Figure for Problem 2-5: cumulative frequency distribution. Vertical axis: Cumulative Frequency (0 to 50). Horizontal axis: Measurement x (30 to 50).]

2-6 Calculate the mean and standard deviation for the following data sets. Calculate these by hand.
a. (-10.0, 5.0, 15.0)
b. (-23.5, 26.5)
c. (1, 2)
d. (2001, 2002)

2-7 Calculate the mean and standard deviation of the data set of Problem 2-1,
a. Using Excel, but not using the embedded statistical functions AVERAGE() and STDEV().
b. Using Excel, and the embedded functions listed in part a.
Do the calculations match?
