Statistical Methods in Microbiology

CLINICAL MICROBIOLOGY REVIEWS, July 1990, p. 219-226, Vol. 3, No. 3
Copyright © 1990, American Society for Microbiology

DUANE M. ILSTRUP
Section of Biostatistics, Mayo Medical School, Mayo Clinic and Mayo Foundation, Rochester, Minnesota 55905

CONTENTS

INTRODUCTION
THE NATURE OF VARIABLES
DESCRIPTIVE ANALYSES
INDEPENDENT VERSUS DEPENDENT OBSERVATIONS
HISTORICAL VERSUS PROSPECTIVE STUDIES
ESTIMATION VERSUS TESTING, ERROR RATES, AND STATISTICAL POWER
STATISTICAL HYPOTHESIS TESTS
  Two Samples with Dependent or Paired Observations
    Nominal data
    Continuous Gaussian data
    Continuous non-Gaussian data
    Ordinal data
  Two Samples with Independent Observations
    Nominal data
    Continuous Gaussian data
    Continuous non-Gaussian data
    Ordinal data
  Three or More Samples with Dependent Observations
  Three or More Samples with Independent Observations
OTHER STATISTICAL METHODS
EVALUATING NEW DIAGNOSTIC TESTS
  True Patient Status Is Known
    Negative or positive diagnostic test
    Ordinal diagnostic test
  Unknown True Patient Status
SUMMARY
LITERATURE CITED


INTRODUCTION

With as few formulas and as little theory as possible, this article describes statistical methods that are usually appropriate in specific experimental situations. These methods are not described in detail, but references are given that will allow the interested reader to study them more thoroughly. Before specific methods can be described, however, the reader must be able to recognize the nature of the variables being studied and whether the study observations are dependent or independent.

THE NATURE OF VARIABLES

Study variables may be classified into three general types: nominal, ordinal, and continuous. Nominal variables are those that take on only a finite (usually small) number of categories, where the categories have no logical ordering. Examples of nominal variables are death (no or yes) of an experimental animal in an antibiotic study and growth or no growth of an organism in a culture medium investigation.

Ordinal variables are those that also take on a finite number of categories, but the categories have a logical ordering to them. An example of an ordinal variable is a level of intensity, growth, or cytopathic effect that is negative, +1, +2, etc.

Continuous variables are measured variables that usually are limited in number only by the precision of the instrument measuring the variable. A commonly assumed distribution of a continuous variable is the Gaussian or normal distribution, with its familiar bell-shaped form. Unfortunately, very few variables in medicine and in microbiology have mathematically precise Gaussian underlying distributions (14). Radioactivity counts, counts of CFU or fluorescence-forming units, and time in hours or days to growth of an organism in culture all tend to have skewed distributions (distributions in which most results are concentrated on one end of the distribution with a long tail of values extending in the opposite direction). When continuous variables are approximately Gaussian, parametric statistical methods may be used, but when the distributions are markedly non-Gaussian and cannot be made Gaussian by a mathematical transformation such as the logarithm, nonparametric methods should be used. Both parametric and nonparametric methods are discussed later, but nonparametric methods are emphasized because of the non-Gaussian nature of many microbiological measurements.

DESCRIPTIVE ANALYSES

It is beyond the scope of this article to give any more than a minimal introduction to descriptive statistical methods. Descriptive analyses may be grouped into three general categories: (i) measures of central tendency, (ii) measures of variability, and (iii) graphic displays.


Measures of central tendency are statistics that describe the center of a sample of data. These statistics are supposed to estimate what a "typical" data value should be. The mean (or average value) is calculated by adding all of the data points in the sample and then dividing this sum by the number of data points. The mean can be thought of as the center of gravity of the sample or the balance point of the distribution. The sample mean has many highly desirable mathematical properties that make it very useful for comparing one sample with another, but it has the undesirable property that it may be highly influenced by outliers in the data or by data that have an asymmetric distribution. One or two very high or very low values will strongly pull the mean away from the center of the distribution. The median is the halfway point of the sample, the point below which half of the data lie and above which the other half lie. The median is not strongly influenced by asymmetry or outliers in the data, and in many cases it is a much better estimate of a typical data value than is the mean.

There are several statistics that attempt to describe the variability of the sample data. They all try to measure how concentrated or, conversely, how dispersed the data are. The simplest of these statistics are the minimum, the maximum, and the range (maximum minus minimum). These statistics have the virtue that they are simple to understand, but their values are a function of the sample size: for continuous data, as the sample size increases, the minimum decreases, and both the maximum and the range increase. Another way to describe the variability of the data is to estimate percentiles of the distribution, such as the 25th and 75th percentiles. These have the property that 25% of the data fall below the 25th percentile and 25% fall above the 75th percentile. Often the interquartile range (the 75th percentile minus the 25th percentile) is quoted. The most common and most abused statistic that describes variability is the sample standard deviation. The standard deviation is the square root of the sample variance, and the sample variance is the sum of the squared differences between the sample data points and the sample mean, all divided by the sample size minus 1. Many investigators believe that 95% of the sample data lie within 2 standard deviations of the mean. This is true when the underlying distribution is Gaussian, but, as mentioned before, this is rarely the case in most biomedical settings.

Finally, when attempting to describe data, there is no better way than to use the "interocular test," that is, to display the data visually in graphical form. Yogi Berra once said, "You can see a lot just by lookin'," and this is particularly true with experimental data. O'Brien and Shampo (15, 16) give excellent examples of how to display data with histograms, frequency polygons, cumulative distribution polygons, and scatter diagrams.
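To make these definitions concrete, here is a minimal sketch (the time-to-growth values are invented for illustration; only the Python standard library is used) that computes each summary statistic just described:

```python
import statistics

# Hypothetical times (h) to growth of an organism in culture;
# skewed to the right, as is common for microbiological measurements.
times = [12, 14, 15, 15, 16, 17, 18, 20, 24, 48]

mean = statistics.mean(times)      # center of gravity; pulled up by the outlier (48)
median = statistics.median(times)  # halfway point; resistant to the outlier

minimum, maximum = min(times), max(times)
data_range = maximum - minimum     # grows with sample size

# statistics.quantiles with n=4 returns the 25th, 50th, and 75th percentiles
q1, q2, q3 = statistics.quantiles(times, n=4)
iqr = q3 - q1                      # interquartile range

stdev = statistics.stdev(times)    # sample SD (divisor n - 1, as in the text)

print(f"mean={mean:.1f} median={median:.1f} range={data_range} "
      f"IQR={iqr:.2f} SD={stdev:.1f}")
```

For these values the mean (19.9 h) is pulled well above the median (16.5 h) by the single extreme observation, which is exactly why the median is often the better summary for skewed microbiological data.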

INDEPENDENT VERSUS DEPENDENT OBSERVATIONS

In addition to the nature of the variable being studied, the appropriate choice of statistical methodology is a function of whether the comparisons are made between independent or dependent experimental units. If two or more samples of experimental units are to be compared and the experimental units in one sample are not used again in the other sample, the observations in each sample are independent of one another, unless the units in one sample have been matched on a one-to-one basis with the units in the other sample. An example of two independent samples is a study of the effectiveness of gentamicin versus ciprofloxacin in the treatment of infected mice in which the mice have been randomly assigned to the two treatment groups, as in the sketch below.
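A minimal sketch of such a randomized allocation follows (the animal identifiers, group sizes, and seed are invented for illustration):

```python
import random

# Hypothetical animal identifiers; in practice these come from the study roster.
mice = [f"mouse_{i:02d}" for i in range(1, 21)]

rng = random.Random(42)  # fixed seed so the allocation is reproducible
rng.shuffle(mice)

half = len(mice) // 2
groups = {
    "gentamicin": sorted(mice[:half]),     # first half of the shuffled list
    "ciprofloxacin": sorted(mice[half:]),  # second half
}
for drug, animals in groups.items():
    print(drug, animals)
```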


When the same experimental unit is tested repeatedly under two or more experimental conditions, the design is a dependent design. The results in one sample are correlated with the results from another sample because the same experimental units are tested in both groups. Another type of dependent design is one in which the experimental units in one sample have been chosen to match, on a one-to-one basis, the experimental units in another sample. Typically, this matching is done on factors that are related to the response variable. These designs are known as paired or matched studies. When more than two measurements are made on the same experimental unit, the study is called a repeated-measures design. An example of this would be the same patients evaluated before treatment with an antibiotic, 1 month after treatment, and 2 months after treatment.

HISTORICAL VERSUS PROSPECTIVE STUDIES

Comparisons of various treatments on, for example, their effectiveness in eradicating bacterial infections are made in two general ways: (i) by analyzing historical data not collected with a rigid protocol and (ii) by planning a prospective study. The former type of study is known as a historical or retrospective study. Retrospective studies suffer from many weaknesses, the chief one being selection bias. If, for example, there are only two antibiotics to choose from, the attending physicians will choose the antibiotic that they think is best for the patient on the basis of the characteristics of the patient. Typically, then, the patients who receive one antibiotic are qualitatively different from the patients who receive the other antibiotic. Therefore, one does not know whether an observed difference, or lack of difference, in outcome is due to the effectiveness of the antibiotics or to differences in the nature of the two patient groups. Some historical studies, such as case-control studies, can be well designed and conducted by using a strict written protocol, but biases can still exist, and interpretation of the results of such studies may be difficult. The problems of the historical study can be alleviated when the study is designed in advance and carried out prospectively. With appropriate treatment randomization and, ideally, with both the patient and the evaluating clinician unaware of which treatment has been received, unbiased estimates of treatment response may be made (5).

ESTIMATION VERSUS TESTING, ERROR RATES, AND STATISTICAL POWER

It is important for an investigator to understand the difference between estimation and hypothesis testing. In almost every experiment, one of the goals is to determine the value of a true underlying population characteristic, for example, the median time to culture positivity of specimens from patients infected with a given organism. This process is called estimation (17). In addition to the estimate of the true population parameter, it is common practice to also give 95% confidence intervals for the parameter (4). These intervals have the property that, if the experiment were repeated many times, 95% of the calculated confidence intervals would include the true unknown value of the population parameter. The 95% confidence limits usually are calculated in the form

estimate ± 1.96 (standard error of the estimate)

where 1.96 is the 97.5th percentile of the Gaussian distribution with mean = 0 and variance = 1.


For the case in which one is interested in estimating the proportion, P, the number of times in which an event will occur out of n independent trials, one computes the confidence interval in the following way:

P ± 1.96 sqrt[P(1 - P)/n]

where the square root term is the standard error of P. For example, if a new screening test for the detection of Chlamydia trachomatis correctly classifies 80 of 100 infected patients, the estimate of sensitivity is P = 80/100 = 0.8, and the 95% confidence interval for the true unknown sensitivity is

0.8 ± 1.96 sqrt[0.8(1 - 0.8)/100] = 0.8 ± 0.08, or (0.72 to 0.88)

For the case in which, from a sample of n observations of a continuous variable, one estimates the mean, x̄, and the standard deviation, S, of the sample, the 95% confidence interval for the true mean is

x̄ ± t(0.975; n - 1) S/sqrt(n)

where t(0.975; n - 1) is the 97.5th percentile of the t distribution with n - 1 degrees of freedom. When n is greater than about 40, this percentile may be approximated by 2. For example, if the mean time of detection of the early antigen of cytomegalovirus in 60 patients is 16 h and the standard deviation is 8 h, then a 95% confidence interval for the true mean is

16 ± 2 sqrt(8^2/60) = 16 ± 2, or (14 to 18)

If the goals of an experiment include comparison of the parameter estimate with a hypothetical value, or comparisons of estimates obtained under two or more experimental conditions, one may wish to perform statistical tests of hypotheses. Hypothesis tests, or significance tests, are usually formulated in terms of null and alternative hypotheses. The null hypothesis typically states that the population parameter is equal to some hypothesized value or, in the case of two samples, that the two population values are equal, for example, the median times to positivity in two different culture media. One rejects the null hypothesis when the evidence from the sample(s) suggests that the observed results in repeated experiments would have been very unlikely if the null hypothesis were true. "Very unlikely" is conventionally taken to mean a probability of less than 1 in 20 (P < 0.05).
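Both worked confidence intervals above can be checked in a few lines. This is an illustrative sketch that simply plugs in the numbers quoted in the text, using the large-sample multipliers 1.96 and 2:

```python
from math import sqrt

# 95% CI for a proportion: P ± 1.96 * sqrt(P(1 - P)/n)
p, n = 80 / 100, 100                    # 80 of 100 infected patients test positive
se_p = sqrt(p * (1 - p) / n)            # standard error of P
print(f"sensitivity: {p:.2f} ± {1.96 * se_p:.2f}")  # 0.80 ± 0.08 -> (0.72, 0.88)

# 95% CI for a mean when n > 40: x-bar ± 2 * S/sqrt(n)
xbar, s, n = 16, 8, 60                  # mean 16 h, SD 8 h, 60 patients
se_mean = s / sqrt(n)                   # standard error of the mean
print(f"mean time: {xbar} ± {2 * se_mean:.1f} h")   # 16 ± 2.1 h -> about (14, 18)
```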

STATISTICAL HYPOTHESIS TESTS

Two Samples with Independent Observations

Nominal data. When two independent samples each yield a nominal (e.g., negative or positive) outcome, the results form a two-way frequency table that may be tested with the chi-square statistic. In the shell vial example, the computed value is greater than 3.84, which is the 95th percentile of the chi-square distribution with 1 df, and we conclude that cytomegalovirus grew proportionately more often when the shell vials were centrifuged than when they were not centrifuged (P < 0.05).

Continuous Gaussian data. If the underlying distribution of the study variable is approximately Gaussian and if the standard deviations in the two study groups are of similar magnitude, the two-sample t test (4) is the most powerful statistical test for detecting a shift in the centers of the distributions (the means, in the case of the t test). Almost every personal computer and many hand calculators have programmed versions of the t test, but before trusting these programs the user should try an example such as that given in reference 4.

Continuous non-Gaussian data. Many distributions in microbiology, such as colony counts and radioactivity counts, are markedly skewed, thereby violating the assumptions of the t test. Occasionally, a mathematical transformation of the underlying distribution, such as the logarithmic transformation for positive distributions, will allow the investigator to use the t test on the transformed scale and then make inferences back to the original scale. Many times, however, no simple transformation will yield a distribution that is sufficiently Gaussian for the t test to be used. In these cases the Wilcoxon rank sum test (4) is generally applicable. The rank sum test can be more powerful than the t test when the distributions are non-Gaussian, and little power is lost compared with the t test even when the distributions are Gaussian.

Ordinal data. One of the most frequent errors in medical studies is using a chi-square test on a two-way frequency table when one of the variables is ordinal in nature. Moses et al. (13) point out that the correct analysis would use the Wilcoxon rank sum test. Suppose that, instead of the dependent ordinal table given in Table 4, we now have two independent groups of cultures randomly treated with and without a growth-enhancing drug such as cycloheximide, and the response (growth of Chlamydia trachomatis) can be categorized only subjectively (in a blinded fashion) as 1 = no growth, 2 = minimal growth, 3 = moderate growth, and 4 = confluent growth. The results of such an experiment might be as found in Table 6.

TABLE 6. Growth of Chlamydia trachomatis in cultures treated and not treated with a growth-enhancing drug

Treated     1 = None   2 = Minimal   3 = Moderate   4 = Confluent   Total
No              20          15             10               5          50
Yes             10          15             15              10          50
Total           30          30             25              15         100
Avg rank      15.5        45.5             73              93

Rank sum for no treatment = (15.5)(20) + (45.5)(15) + (73)(10) + (93)(5) = 2,187.5

Rank sum for treated = (15.5)(10) + (45.5)(15) + (73)(15) + (93)(10) = 2,862.5

A formula without correction for ties for a normal relative deviate used to compare the two rank sums is given in Dixon and Massey (4):

Z = (|T - N1(N1 + N2 + 1)/2| + 0.5) / sqrt[N1N2(N1 + N2 + 1)/12]

where T is the rank sum for the sample with the smaller sample size, N1 is the smaller sample size, and N2 is the larger sample size. In this example, this becomes

Z = (|2,862.5 - 50(101)/2| + 0.5) / sqrt[(50)(50)(101)/12] = 338.0/145.057 = 2.33

This value is greater than the 97.5th percentile of the normal distribution, and we conclude that there was greater growth in the treated group of cultures (P < 0.05). Once again, in practice, a formula with correction for tied ranks should be used, and most computer statistical packages include appropriate rank sum tests.
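The arithmetic can be verified directly from the Table 6 counts. The sketch below recomputes the average ranks, the rank sum for the treated group, and Z, without the tie correction (matching the formula above); in practice a statistical package with a tie-corrected rank sum test should be preferred:

```python
from math import sqrt

# Table 6 counts per ordered growth category: None, Minimal, Moderate, Confluent
no_treatment = [20, 15, 10, 5]
treated = [10, 15, 15, 10]
totals = [a + b for a, b in zip(no_treatment, treated)]  # 30, 30, 25, 15

# Average rank for each category: midpoint of the block of tied ranks it occupies
avg_ranks, start = [], 1
for t in totals:
    avg_ranks.append(start + (t - 1) / 2)  # 15.5, 45.5, 73.0, 93.0
    start += t

rank_sum_treated = sum(r * c for r, c in zip(avg_ranks, treated))  # 2862.5

n1 = n2 = 50
expected = n1 * (n1 + n2 + 1) / 2  # 2525.0, the rank sum expected under the null
z = (abs(rank_sum_treated - expected) + 0.5) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
print(round(z, 2))  # 2.33
```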

Three or More Samples with Dependent Observations

Detailed descriptions of methods for three or more samples with dependent observations are beyond the scope of this article; the reader is encouraged to refer to the examples given in the references. When the same experimental unit is studied on more than two occasions under different experimental conditions, the appropriate statistical methods become much more complex. If the response variable is a no/yes or positive/negative variable, Cochran's Q test (20) may be used. If the response variable is Gaussian, repeated-measures analysis of variance (12, 21) should be used. Finally, if the response variable is continuous but non-Gaussian, or if it is ordinal, Friedman's procedure (20) should be used.

Three or More Samples with Independent Observations

When different experimental units are studied under three or more experimental conditions, the following statistical methods should be considered, depending on the nature of the response variable. If the response variable is nominal, chi-square methods (20) should be used. If the response variable is Gaussian, one-way analysis of variance (4) is appropriate. Finally, if the response variable is non-Gaussian or if it is ordinal, the Kruskal-Wallis one-way analysis of variance using ranks (20) should be used; a sketch of the independent-samples case follows.
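This is a minimal sketch, assuming SciPy is available and using invented colony counts for three hypothetical media:

```python
from scipy import stats

# Hypothetical colony counts for three culture media (skewed, as is typical)
medium_a = [12, 15, 14, 30, 11]
medium_b = [22, 25, 19, 48, 21]
medium_c = [14, 13, 16, 12, 35]

# Gaussian response: one-way analysis of variance
f_stat, p_anova = stats.f_oneway(medium_a, medium_b, medium_c)

# Non-Gaussian or ordinal response: Kruskal-Wallis one-way ANOVA by ranks
h_stat, p_kw = stats.kruskal(medium_a, medium_b, medium_c)

print(f"ANOVA: F = {f_stat:.2f}, P = {p_anova:.3f}")
print(f"Kruskal-Wallis: H = {h_stat:.2f}, P = {p_kw:.3f}")
```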

OTHER STATISTICAL METHODS

When the relationships of two or more continuous variables to one another are of interest, correlation and regression methods are usually used. These methods are not addressed here; the interested reader should refer to an excellent book on this subject (6).

If the response variable of interest in a study is a no/yes variable that is a function of time from some starting point, then survival or actuarial methods should be used. For example, if survival after treatment with two or more experimental drugs is being studied, the endpoint is a function of how long each patient is under follow-up. Methods for estimating and comparing such endpoints are found in a book by Lee (8).

EVALUATING NEW DIAGNOSTIC TESTS

Many articles have appeared in the medical literature on evaluation of new tests. Two articles written for physicians are highly recommended as introductory references, one by McNeil and Hanley in the New England Journal of Medicine (10) and the other by Metz in Seminars in Nuclear Medicine (11). Another article by McNeil et al. (9) is recommended for a more detailed analysis of receiver operating characteristic (ROC) curves.

This section is divided into two parts: one for an experiment in which the true status of a patient is known (for example, diseased or not diseased) and the second for the situation in which the truth is not necessarily known but the results of another diagnostic test, considered to be a gold standard, are known.

True Patient Status Is Known

Suppose we are interested in whether a new diagnostic test accurately predicts whether a patient is infected with a particular organism, and suppose that we always know the true status of the patient, presumably from another test or examination that never makes an error but perhaps is excessively costly or time-consuming compared with the proposed new diagnostic test. In this example, there are two separate situations to consider: (i) the new test is either positive or negative, and (ii) the new test is ordinal (perhaps 0, +1, +2, etc.) or continuous in nature.

Negative or positive diagnostic test. When the new test is either negative or positive, the results from a series of patient evaluations can be displayed in a simple two-way frequency table (Table 7).

TABLE 7. Format for a negative-positive diagnostic test

                        True state
New test result    Positive (D+)   Negative (D-)        Total
Positive (T+)            a               b               a + b
Negative (T-)            c               d               c + d
Total                  a + c           b + d      N = (a + b + c + d)

In Table 7, a represents the number of patients truly infected who were correctly called positive by the new test, b represents the number of patients truly noninfected who were incorrectly called positive by the new test, etc. Several indices of new test accuracy have been proposed for tables such as Table 7, but only three will be presented here (a computational sketch appears at the end of this section).

(i) Sensitivity = Pr(T+|D+) = a/(a + c). Sensitivity is the probability (Pr) that the new test will be positive (T+) when the patient truly is infected (D+) and is estimated by the ratio a/(a + c).

(ii) Specificity = Pr(T-|D-) = d/(b + d). Specificity is the probability that the new test will be negative when, in fact, the patient is not infected.

(iii) Positive predictive value = Pr(D+|T+) = a/(a + b). The positive predictive value, sometimes called the "diagnosability" of the test, is the proportion of times that the patient will, in fact, be infected when the new test is positive.

For a new diagnostic test to be a "good" test, it is desirable that the sensitivity and specificity be as high as possible, preferably 90% or greater. Sensitivity and specificity are independent of the prevalence of infection in the population being studied. This is not true for the positive predictive value, which is highly related to the population prevalence of infection, estimated by (a + c)/N. Tables 8 and 9 highlight this problem. Both sensitivity and specificity are the same in Tables 8 and 9, but the positive predictive value is much lower in Table 9. This is because the prevalence of disease is much lower in Table 9 (100/10,100 = 0.99% versus 100/200 = 50%). This problem is particularly apparent when the disease being

TABLE 8. Example of a negative-positive diagnostic test with high positive predictive value (sensitivity = 95/100 = 95%; prevalence of disease = 100/200 = 50%)