Statistics for Diagnostic Procedures

II. The Significance of "No Significance": What a Negative Statistical Test Really Means

Warren C. Phillips,1,2 James A. Scott,1 and George Blasczcynski3

Statistical tests are often used in the medical literature to compare two or more groups. When the tests show "no significant difference," many authors and readers conclude that this means the groups are identical. Often this negative outcome is merely the result of inadequate sample size. The adequacy of sample size depends on well established mathematical principles, not on the investigator's intuition. The meaning of a negative statistical test is discussed.

AJR 141:203-206, July 1983. 0361-803X/83/1411-0203. © American Roentgen Ray Society

Editor's note: This is the second of a three-part series of papers on basic statistical concepts that are often applied to radiologic diagnostic procedures.
1 Department of Radiology, Harvard Medical School, Massachusetts General Hospital, Boston, MA 02114. Address reprint requests to Radiology Research Office.
2 Present address: Department of Radiology, Milford Memorial Hospital, Milford, DE 19963.
3 Elscint, Inc., Boston, MA 02215.

Statistical tests are often used to compare two or more groups. Usually the results of these tests show that a statistically significant difference is indeed present. However, when the tests show "no significant difference," many investigators erroneously conclude that this means that there is no difference between the groups. This is substantiated by several recent papers that have documented frequent basic errors in the interpretation of negative statistical tests in the nonradiologic literature [1, 2]. Often a negative statistical test is the result of insufficient sample size [1-3]. When sample size considerations were included retrospectively in the analysis of these papers, the conclusions of the authors often were weakly substantiated by their data.

Recently, one of us (W. C. P.) reviewed the major articles in AJR and Radiology from January to December of 1980. This review consisted of reading the abstracts of each article; if the abstract indicated a lack of statistical correlation between two groups, then the contents were examined to find what substantiating evidence was given to support this conclusion. There were 11 such articles. In each instance, the pertinent mathematical parameters that would have been needed to justify this conclusion were not specified. If these parameters are indeed required to support a claim of no difference between the groups, then the readers of most of this material would have been unintentionally misinformed. This article explains the meaning of "significant" and "not significant" and the reliability of these conclusions; it also illustrates what parameters influence statistical conclusions.

Principles

When two or more groups are being compared, statistical evaluation usually begins by defining two hypotheses. The first, the null hypothesis, states that there is no difference between the groups under investigation. The second, the alternative hypothesis, states that the groups are indeed different. One then proceeds to calculate the probability of obtaining a difference equal to or greater than that observed, assuming that the null hypothesis is true. This is the familiar "p value" (p stands for probability). If this is found unlikely (say p < 0.05), then the null hypothesis is rejected and the alternative hypothesis is accepted. Thus the results are interpreted as "statistically significant."

For example, a recent publication concluded that patients with primary lymphoma of bone who eventually relapsed had more positive radiographic signs than those who did not [4]. To support this statement, the investigators divided the patients into two groups: those who were disease-free at 5 years and those who relapsed within 5 years despite apparently curative therapy. Then two hypotheses were erected:
1. Null hypothesis: The observed difference between the mean number of positive radiographic signs in the two groups could be explained by random variation.
2. Alternative hypothesis: The observed difference could not be explained reasonably by random variation.
Assuming the null hypothesis to be true, the probability of observing results as skewed as the foregoing was calculated and found to be less than 2% (i.e., p < 0.02). This is low, so the null hypothesis was rejected and the alternative hypothesis was accepted. The results were declared to be "statistically significant."

To those unfamiliar with classic statistical reasoning, this may seem confusing. First, an investigator usually wishes to present evidence that the alternative hypothesis is true, that is, that the groups are different. This is why the alternative hypothesis is sometimes referred to as the "researcher's hypothesis." However, in order to perform a statistical test, the opposite must be assumed, that is, that the null hypothesis is true. Then, if this assumption is found to be unlikely, the results are declared to be statistically significant. Two negatives are involved for a statistically significant result to be present: the "no difference hypothesis" is not likely to be true. This type of mental approach needed by statistics may seem like "reverse thinking" to many.
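This calculation can be sketched in a few lines of Python. The group summaries below are hypothetical, since the underlying data of reference [4] are not reproduced in this article, and the sketch uses a large-sample normal approximation rather than the exact test the original investigators may have applied; it is intended only to make concrete the idea of "the probability of a difference at least as large as that observed, assuming the null hypothesis is true."

from statistics import NormalDist

# Hypothetical summaries (the raw data of reference [4] are not given here):
# mean number of positive radiographic signs, standard deviation, and number
# of patients in each group.
mean_relapsed, sd_relapsed, n_relapsed = 4.1, 1.3, 18
mean_cured, sd_cured, n_cured = 2.9, 1.2, 21

# Large-sample test of the null hypothesis that the true difference is zero.
se_diff = (sd_relapsed**2 / n_relapsed + sd_cured**2 / n_cured) ** 0.5
z = (mean_relapsed - mean_cured) / se_diff

# Two-sided p value: the probability, under the null hypothesis, of observing
# a difference at least as extreme as the one actually observed.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p = {p_value:.4f}")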

Types of Statistical Errors

Clearly there are two types of errors an investigator could commit (table 1). The alternative hypothesis could be accepted erroneously when it is false, a "false-positive error"; or it could be rejected falsely when it is true, a "false-negative error." The technical name given a false-positive error is a "type I error," and a false-negative error is called a "type II error." The probability of a false-positive error is given by the familiar p value; for example, in the case mentioned previously it was 2%. What if the results of the test had been "no significant difference," as was stated by the 11 articles cited? What is the probability that these decisions were wrong? What factors influenced the likelihood of an incorrect negative conclusion?

TABLE 1

Decision      Ha is really: True       Ha is really: False
Ha true       No error                 False-positive error
Ha false      False-negative error     No error

Note: Ha = alternative hypothesis (there is a difference between the groups).

The following paragraphs discuss the factors that could lead to a false-positive or a false-negative error. Most of the discussion is related to the false-negative error, because its implications often are misinterpreted because of its complexity.

False-Positive Errors

False-positive errors can be caused by three factors: (1) modeling error; (2) chance; and (3) an apparent reluctance of journals to publish negative results.

The first cause of error clearly could lead to erroneous conclusions. For example, suppose that the patients with lymphoma of bone who remained disease-free for 5 years also had received a larger radiation dose than those who relapsed. It would not be possible to conclude that the number of radiographic signs was a significant predictor of prognosis, because the varying amounts of radiation therapy also could have accounted for the difference. If a significant difference is found, what could it be related to? If there are several differences between the groups, any or all of them may be responsible for the observed difference. These items (i.e., number of radiographic signs and radiation dose) have been "confounded." No clear conclusions are possible in these circumstances.

The second possibility is chance, or "bad luck." The possibility of making an error due solely to chance is arbitrarily set by the investigator, usually at the 5% or 1% level. If the 5% level is selected, the investigator is saying, in effect, "I am willing to accept a 5% chance of stating that the alternative hypothesis is true when it really is not." If the p value, which has been calculated from the experimental data, is less than 5%, the chance of an error is below acceptable levels; thus, the investigator states the results are statistically significant. This calculation is based on the assumption that the difference between the groups under investigation is zero (i.e., the null hypothesis). What the p value really indicates is the probability of obtaining a difference equal to or greater than that observed when there are really no differences between the groups under investigation.

The third and final factor is an apparent reluctance of authors/journals to publish negative results. Briefly, if an investigator takes a 5% (1 in 20) risk of a false-positive error, then of every 20 times an experiment is performed, one would be expected to be positive on the basis of chance alone. This result is probably more likely to be published than one of the 19 negative studies, causing the false-positive error rate in the published literature to be greater than 5%. When the previously mentioned AJR-Radiology literature was reviewed, the proportion of positive to negative statistical results appeared to be at least 5 to 1, perhaps even 10 to 1. This suggests that negative results are indeed less likely to be submitted and/or published. An alternative explanation for this high positive-to-negative ratio is that excellent pretrial intuition on the part of investigators leads to experimentation that has a high probability of a statistically significant result.

False-Negative Errors

False-negative errors, that is, declaring that a statistically significant relationship was not present when it was, also may be caused by several factors. These include: (1) modeling error; (2) inadequate sample size; and (3) chance.

Modeling errors were briefly discussed previously and need not be discussed further here. The selection of sample size is the major item that is neglected by medical researchers. This selection process should be part of the experimental modeling. In this new context, then, experimental modeling, together with sample-size considerations, warrants further discussion. In addition, the role of chance or "bad luck" is an integral part of sample-size calculations.

The central question to be addressed is, "How many subjects are needed to be reasonably sure of finding a real and clinically important difference if it exists?" The precise answer to this question in any given situation depends on a number of items, such as the type of sample and the form of the data. The details of these various situations would be of concern only in a more thorough discussion; in an introductory discussion such as this, only the general principles are presented. The factors that are of relevance in a general situation, and which must be specified by the investigator, are: (1) the acceptable chance of a false-positive error; (2) the acceptable chance of a false-negative error; and (3) what constitutes a clinically important difference. In some instances, an estimate of the variability of the data, the "standard deviation," also must be supplied.

For example, suppose an investigator is planning a study comparing the efficacy of computed tomography (CT) versus nuclear magnetic resonance (NMR) for detection of liver tumors. Two different sets of patients will be examined, one group with CT and one with NMR. Experience shows that the proportion of liver tumors detected with CT is about 80%. Because NMR machines are quite expensive, it is believed that NMR should detect at least 10% more (i.e., 90% of the lesions) or its purchase would not be justified clinically. The investigator is willing to take a 5% chance of a false-positive error (i.e., stating NMR can detect 10% more lesions when it really cannot), but also wishes a 90% chance of avoiding a false-negative error (i.e., stating NMR cannot detect 10% more lesions when it really can). How many patients are needed? The answers to the three questions are: (1) chance of false-positive error = 5%; (2) chance of false-negative error = 10% (90% chance of avoiding this error); and (3) clinically important difference = 10%. Armed with these data, a sample-size table may be consulted [5]. We would need 305 patients in each group; since we are examining two groups, twice as many, or a total of 610 patients, will be needed.
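The tabulated figure of 305 per group comes from the tables of Fleiss [5]. A minimal sketch of the kind of calculation that underlies such tables is shown below; it uses the common normal-approximation formula for comparing two proportions with a continuity correction. Because the published tables embody their own approximations and rounding conventions, the numbers this sketch prints are close to, but not necessarily identical with, those quoted in the text and in table 2.

from math import ceil, sqrt
from statistics import NormalDist

def subjects_per_group(p1, p2, alpha=0.05, power=0.90):
    """Approximate subjects per group needed to detect a change from p1 to p2
    with a two-sided test at level alpha and the requested power (normal
    approximation with a continuity correction)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    delta = abs(p2 - p1)
    n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta ** 2
    n_corrected = (n / 4) * (1 + sqrt(1 + 4 / (n * delta))) ** 2
    return ceil(n_corrected)

# CT detects about 80% of liver tumors; NMR must detect 90% to be worthwhile.
print(subjects_per_group(0.80, 0.90))               # reference problem
print(subjects_per_group(0.80, 0.90, alpha=0.02))   # stricter false-positive level
print(subjects_per_group(0.80, 0.90, power=0.95))   # stricter false-negative level
print(subjects_per_group(0.80, 0.95))               # CID increased to 15%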

Suppose the three parameters are changed: What would happen to the required sample size? Using common sense, is it possible to predict whether the required sample size will increase or decrease? If the acceptable levels of false-positive/false-negative errors are decreased, it seems logical to expect the sample size to increase. Why? Because we are demanding greater precision, a more thorough investigation will be required. As expected, the required sample size is larger than 305 if the acceptable chances of errors are decreased (table 2). What if the "clinically important difference" is increased to 15%, that is, the NMR machine would be justified economically only if it could detect 95% of liver tumors? Here the NMR machine would have to be much better than CT, so this should be easier to detect. Indeed, the required sample size drops dramatically to 125, less than half of what it was previously (table 2).

TABLE 2: Effect of Various Factors on Sample Size

Situation                       FP (%)   FN (%)   CID (%)   SS
Reference problem (see text)       5       10       10      305
Decrease FP to 2%                  2       10       10      369
Decrease FN to 5%                  5        5       10      367
Increase CID to 15%                5       10       15      125

Note: FP = acceptable chance of false-positive error; FN = acceptable chance of false-negative error; CID = clinically important difference; SS = sample size (number of subjects per group).

Let us approach the same problem differently. From experience it is possible to estimate how many patients would be available for this study. What are the chances of detecting a clinically important difference if it exists? In other words, the sample size has been specified; in addition, let the chance of a false-positive error be set at its traditional level of 5% (i.e., if p < 0.05 the result is statistically significant). The situation is depicted in figure 1. As the number of subjects available increases, the chance of detecting a clinically important difference increases. In addition, as the preset level of the "clinically important difference" increases from 5% to 10% to 15%, the required sample size decreases. Note also the wide variability in the sample sizes. For example, if the investigator specifies only a 50% chance of detecting a clinically important difference of 15%, only 61 subjects per group are needed. Alternatively, if a 95% chance of detecting a 5% difference is needed, 1,577 subjects per group are needed.

[Fig. 1. Chance of detecting a "clinically important difference" (CID) of 5%, 10%, or 15% as a function of the number of subjects available for each group. This graph refers only to the situation described in the text; details of other problems would be different, but the principles would be the same.]
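The reverse calculation behind figure 1, the chance of detecting a given difference with a fixed number of subjects, can be sketched the same way. This uses the same normal approximation as the previous sketch (without the continuity correction), so its values will track, but not exactly reproduce, the curves in figure 1.

from math import sqrt
from statistics import NormalDist

def chance_of_detection(p1, p2, n, alpha=0.05):
    """Approximate probability of obtaining p < alpha (two-sided) when the true
    rates are p1 and p2 and n subjects are studied in each group."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n)           # SE of the difference under the null
    se_alt = sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)  # SE under the alternative
    return NormalDist().cdf((abs(p2 - p1) - z_alpha * se_null) / se_alt)

# Chance of detecting CIDs of 5%, 10%, and 15% over CT's 80% baseline for
# various numbers of subjects available per group (compare fig. 1).
for n in (100, 305, 500, 1000, 2500):
    chances = [chance_of_detection(0.80, 0.80 + cid, n) for cid in (0.05, 0.10, 0.15)]
    print(n, [f"{c:.2f}" for c in chances])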

The large sample sizes required under certain circumstances are unfortunate. Few investigators have access to the large number of subjects required to address certain clinical problems. However, the laws of probability cannot be altered. If pretrial sample-size considerations show that the required number of subjects is not available, some choices have to be made. Among them are the following:
1. Increase the acceptable false-positive and/or false-negative rate.
2. Increase the value of the "clinically important difference."
3. Do both 1 and 2.
4. Combine with other investigators and/or increase the length of the study.
5. Do the study anyway. If the data analysis shows a statistically significant difference, all is well. If the analysis is not statistically significant, calculate what chance there was of detecting a clinically important difference. If this chance is unacceptably small (i.e., a high chance of false-negative error), then the conclusion must be "the answer is uncertain." (See the discussion of confidence intervals which follows.)
6. Do not do the study.
In general, if an investigator will accept large false-positive and/or false-negative errors, or if only a large difference between groups is of clinical interest, the required sample size will be small. Conversely, decreasing the acceptable levels of false-positive and/or false-negative errors or the "clinically important difference" forces the required sample size to increase, sometimes dramatically.

Discussion

At the beginning of this article, we stated that "pertinent mathematical parameters" needed to justify a "no significant difference" conclusion were not included in the papers cited. What were they? The answer should now be clear: the acceptable level of a false-negative error and the value of the clinically important difference were not specified.

Here are two statements that are seen often in the literature and/or are heard at conferences. The first: "We investigated two groups; the difference between them was not statistically significant." What does this mean? It should be clear that it is ambiguous. The value of the clinically important difference and the acceptable level of false-negative error were not specified, so what constitutes an adequate sample size cannot be answered. Calculations based on acceptable false-positive/false-negative errors and a clinically important difference might well show that more subjects are needed for a valid "no difference" conclusion than were actually studied. A valid conclusion would be "the relationship between the two groups is uncertain." A second statement would be, "The two groups are statistically identical." Here the clinically important difference is specified; it is zero. This alone is enough to determine the required sample size: an infinite number of subjects in each group would be required!
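The same normal-approximation formula used in the earlier sketch (again an approximation, not the Fleiss tabulation itself) makes this point numerically: as the clinically important difference shrinks toward zero, the required sample size grows without bound.

from math import ceil, sqrt
from statistics import NormalDist

def subjects_per_group(p1, p2, alpha=0.05, power=0.90):
    """Normal-approximation subjects per group, without continuity correction."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil((numerator / (p2 - p1)) ** 2)

# Required sample size over an 80% baseline as the clinically important
# difference is made smaller and smaller.
for cid in (0.10, 0.05, 0.01, 0.001):
    print(f"CID = {cid:.3f}: {subjects_per_group(0.80, 0.80 + cid)} subjects per group")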

Finally, the concept of "confidence intervals" should be mentioned [1]. Suppose a study has been completed and the results are not significant. The pretrial considerations we have discussed are not available. Now what? Surely the data should tell us something. Do they suggest that the alternative hypothesis is true, or false? These problems can be addressed using calculations of confidence intervals. This concept is very similar to the standard deviation and allows us to estimate where the true value lies. For example, suppose calculations of a confidence interval allowed us to say, "There is a 95% chance that the true detection rate of NMR for liver tumors lies between 78% and 94%." We know that CT detects 80%. Since the majority of this confidence interval is above 80%, we could state, "Our data suggest that NMR is indeed better than CT for detection of liver tumors." Here is the catch: the confidence interval (78%-94%) includes 80%; therefore we cannot be 95% certain that the true detection rate of NMR is greater than 80%. In other words, NMR is winning the race, but the finish line has not been crossed.
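A minimal sketch of such a confidence interval calculation follows. The counts are hypothetical (chosen so that the resulting interval is similar to the 78%-94% example quoted above), and the simple large-sample (Wald) interval is used; other interval methods would give slightly different limits.

from math import sqrt
from statistics import NormalDist

# Hypothetical result: suppose NMR detected 69 of 80 proven liver tumors.
detected, total = 69, 80
p_hat = detected / total

# Large-sample (Wald) 95% confidence interval for the true detection rate.
z = NormalDist().inv_cdf(0.975)
half_width = z * sqrt(p_hat * (1 - p_hat) / total)
low, high = p_hat - half_width, p_hat + half_width
print(f"Observed rate {p_hat:.0%}; 95% CI {low:.0%} to {high:.0%}")

# CT's historical rate is 80%. The interval includes 80%, so the data favor
# NMR but do not establish with 95% confidence that NMR is better.
print("Interval lies entirely above 80%:", low > 0.80)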

In the medical literature, "no significant difference" often is misinterpreted. Authors and readers of this material may feel that a particular scientific question has been answered when in fact it has not. When the actions of radiologists are influenced by misinformation, optimal patient care cannot be delivered; this constitutes an ethical as well as a scientific problem. Greater attention to the significance of "no significant difference" is necessary to alleviate this situation. Should further information be desired by the reader, a short list of references has been provided [1-3, 5-7].

ACKNOWLEDGMENTS

We thank Joyce DePrizio and Susan Phillips for help in manuscript preparation.

REFERENCES

1. Freiman JA, Chalmers TC, Smith H, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. Survey of 71 "negative" trials. N Engl J Med 1978;299:690-694
2. Reed JF, Slaichert W. Statistical proof in inconclusive "negative" trials. Arch Intern Med 1981;141:1307-1310
3. Feinstein A. Clinical biostatistics-XXXIV. The other side of "statistical significance": alpha, beta, delta, and the calculation of sample size. Clin Pharmacol Ther 1975;18:491-505
4. Phillips WC, Kattapuram SV, Doseretz DE, et al. Primary lymphoma of bone: relationship of radiographic appearance and prognosis. Radiology 1982;144:285-290
5. Fleiss J. Statistical methods for rates and proportions (Wiley Series in Probability and Mathematical Statistics). New York: Wiley, 1973:178-194
6. Aleong J, Bartlett DE. Improved graphs for calculating sample sizes when comparing two independent binomial distributions. Biometrics 1979;35:875-881
7. Altman DG. Statistics and ethics in medical research. III. How large a sample? Br Med J 1980;281:1336-1338
