Downloaded from www.ajronline.org by 37.44.207.175 on 01/19/17 from IP address 37.44.207.175. Copyright ARRS. For personal use only; all rights reserved
Statistics for Diagnostic Procedures
II. The Significance of "No Significance": What a Negative Statistical Test Really Means

Warren C. Phillips,1,2 James A. Scott,1 and George Blasczcynski3

Statistical tests often are used in the medical literature to compare two or more groups. When such tests show "no significant difference," this outcome often is interpreted to mean that the groups are identical. Frequently, however, a negative result is merely the consequence of inadequate sample size. The adequacy of a sample size depends on well established mathematical principles, not on the investigator's intuition. The meaning of "no significant difference," and the evidence needed to substantiate a negative conclusion, is discussed.

Editor's note.-This is the second of a three-part series of papers on basic statistical concepts that are often applied to radiologic diagnostic procedures.
1 Department of Radiology, Harvard Medical School, Massachusetts General Hospital, Boston, MA 02114. Address reprint requests to Radiology Research Office.
2 Present address: Department of Radiology, Milford Memorial Hospital, Milford, DE 19963.
3 Elscint, Inc., Boston, MA 02215.
AJR 141:203-206, July 1983 0361-803X/83/1411-0203 © American Roentgen Ray Society

Statistical tests are often used in the medical literature to compare two or more groups. Usually the results of these tests show that a statistically significant difference is indeed present. However, when the tests show "no significant difference," many investigators erroneously conclude that this means that there is no difference between the groups. This is substantiated by several recent papers that have documented frequent basic errors in the interpretation of negative statistical tests in the nonradiologic literature [1, 2]. Often a negative statistical test is the result of insufficient sample size [1-3]. When sample size considerations were included retrospectively in the analysis of these papers, the conclusions of the authors often were weakly substantiated by their data.

Recently, one of us (W. C. P.) reviewed the major articles in AJR and Radiology from January to December of 1980. The review consisted of reading the abstracts of each article; if an abstract indicated a lack of statistical correlation between two or more groups, then the contents were examined to find what was used to support this conclusion. There were 11 such articles. In each instance, the pertinent mathematical parameters that would have been needed to justify the conclusion were not specified. If these conclusions were interpreted to mean that there is no difference between the groups, then the readers of most of this material would have been unintentionally misinformed. This article explains the meaning of "significant" and "not significant" and illustrates what parameters influence the reliability of negative conclusions.

Principles

Statistical evaluation usually begins by defining two hypotheses. The first, the null hypothesis, states that there is no difference between the groups under investigation. The second, the alternative hypothesis, asserts that the groups are indeed different. One then proceeds to calculate the probability of obtaining a difference equal to or greater than that observed, assuming that the null hypothesis is true. This is the familiar "p value" (p stands for probability). If this probability is found to be small (say p < 0.05), then the null hypothesis is rejected and the alternative hypothesis is accepted. Thus the results are declared "statistically significant."
For example, a recent publication concluded that patients with primary lymphoma of bone who eventually relapsed had more positive radiographic signs than those who did not [4]. To support this statement, the investigators divided the patients into two groups: those who were disease-free at 5 years and those who relapsed within 5 years despite apparent curative therapy. Then two hypotheses were erected:

1. Null hypothesis: The observed difference between the two groups could be explained by random variation.
2. Alternative hypothesis: The observed difference could not be explained reasonably by random variation.

Assuming the null hypothesis to be true, the probability of results as skewed as those observed was calculated and found to be less than 2% (i.e., p < 0.02). This is low, so the null hypothesis was rejected and the alternative hypothesis was accepted. The results were declared to be "statistically significant."
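The mechanics of such a calculation can be sketched with an exact permutation test. The counts below are invented for illustration (the study's actual data are not reproduced here); the logic, counting how often an arbitrary relabeling of the pooled data produces a difference at least as large as the one observed, is exactly what a p value measures.

```python
import itertools
from statistics import mean

# Hypothetical counts of positive radiographic signs per patient
# (illustrative only; not the data of the cited study).
relapsed = [3, 4, 2, 4, 3, 5]      # relapsed within 5 years
disease_free = [1, 2, 1, 3, 2, 1]  # disease-free at 5 years

observed = mean(relapsed) - mean(disease_free)

# Exact permutation test: under the null hypothesis the group labels are
# arbitrary, so enumerate every split of the pooled data into two groups
# of the original sizes and see how often the difference in means is at
# least as large as the observed difference.
pooled = relapsed + disease_free
k = len(relapsed)
count = total = 0
for idx in itertools.combinations(range(len(pooled)), k):
    group_a = [pooled[i] for i in idx]
    group_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    total += 1
    if mean(group_a) - mean(group_b) >= observed:
        count += 1

p_value = count / total  # one-sided p value
print(f"observed difference = {observed:.2f}, p = {p_value:.3f}")
```

With these invented counts the splits favoring the relapsed group as strongly as the actual labeling are rare, so the p value is small and the null hypothesis would be rejected, mirroring the p < 0.02 result above.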
Types of Errors

To those unfamiliar with classic statistical reasoning, the foregoing may seem confusing. First, an investigator usually wishes to present evidence that the alternative hypothesis is true, that is, that the groups are different. This is why the alternative hypothesis is sometimes referred to as the "researcher's hypothesis." However, in order to perform a statistical test, the opposite must be assumed, that is, that the null hypothesis is true. Then, if this turns out to be unlikely, the results are declared to be statistically significant. Two negatives thus are involved for a statistically significant result to be present: the "no difference hypothesis" is not likely to be true. This type of mental approach needed by statistics may seem like "reverse thinking" to many.

Clearly there are two types of errors an investigator could commit. The alternative hypothesis could be accepted erroneously when it is false, a "false-positive error"; or it could be rejected falsely when it is true, a "false-negative error" (table 1). The technical name given a false-positive error is a "type I error," and a false-negative error is a "type II error." The probability of a false-positive error is given by the familiar p value. For example, in the case mentioned previously it was less than 2%. What if the results of the test had been "no significant difference," as was stated by the 11 articles cited? What is the probability that these decisions were wrong? What factors influenced the likelihood of an incorrect negative conclusion? The following paragraphs discuss the factors that could lead to a false-positive or false-negative error. Most of the discussion is related to the false-negative error, because its implications often are misinterpreted and because of its complexity.

TABLE 1: Types of Statistical Errors

                  Ha is really:
Decision          True                   False
Accept Ha . . .   No error               False-positive error
Reject Ha . . .   False-negative error   No error

Note.-Ha = alternative hypothesis.

False-Positive Errors

False-positive errors can be caused by any or all of three factors: (1) modeling error; (2) chance; and (3) reluctance of journals to publish negative results. The first cause of error clearly could lead to erroneous conclusions. For example, suppose that the patients with lymphoma of bone who remained disease-free for 5 years also had received a larger radiation dose than those who relapsed. It would not be possible to conclude that the number of radiographic signs was a significant predictor of prognosis, because the varying amounts of radiation therapy also could have accounted for the observed difference. The two items (i.e., number of positive radiographic signs and radiation dose) have been "confounded." No clear conclusions are possible in these circumstances.

The second possibility is chance or "bad luck." The possibility of making an error due solely to chance is arbitrarily set by the investigator, usually at the 5% or 1% level. If the 5% level is selected, the investigator is saying, in effect, "I am willing to accept a 5% chance of stating that the alternative hypothesis is true when it really is not." If the p value, which has been calculated from the experimental data, is less than 5%, the investigator states that the chance of an error is below acceptable levels. Thus, the results are statistically significant. This calculation is based on the assumption that the difference between the groups under investigation is zero (i.e., the null hypothesis). What the p value really indicates is the probability of obtaining a difference equal to or greater than that observed when there are really no differences between the groups under investigation.

The third and final factor is an apparent reluctance of authors/journals to publish negative results. Briefly, if an investigator takes a 5% (1 in 20) risk of a false-positive error, then of every 20 times an experiment is performed, one would be expected to be positive on the basis of chance alone. This result is probably more likely to be published than one of the 19 negative studies, causing the false-positive error rate in the published literature to be greater than 5%. When the previously mentioned AJR-Radiology literature was reviewed, the proportion of positive to negative statistical results appeared to be at least 5 to 1, perhaps even 10 to 1. This suggests that negative results are indeed less likely to be submitted and/or published. An alternative explanation for this high positive-to-negative ratio is that excellent pretrial intuition on the part of investigators leads to experimentation that has a high probability of a statistically significant result.

False-Negative Errors

False-negative errors, that is, declaring that a statistically significant relationship was not present when it was, also may be caused by several factors. These include: (1) modeling error; (2) inadequate sample size; and (3) chance.
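The "1 in 20" arithmetic behind the publication-bias argument can be sketched with a small simulation. This is an illustration under invented conditions (two groups drawn from the same distribution, a simple z test with known unit variance), not a reanalysis of any study cited here.

```python
import random
from statistics import NormalDist

random.seed(7)  # reproducible illustration

# Simulate many experiments in which the null hypothesis is TRUE: both
# "groups" are drawn from the same normal distribution, so every
# "significant" result is a false positive (type I error).
z_crit = NormalDist().inv_cdf(0.975)  # two-sided 5% level, about 1.96
n, trials = 50, 2000
false_positives = 0
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    # z test for a difference in means with known unit variance
    z = (sum(a) / n - sum(b) / n) / (2 / n) ** 0.5
    if abs(z) > z_crit:
        false_positives += 1

rate = false_positives / trials
print(f"false-positive rate at the 5% level: {rate:.3f}")  # close to 0.05
```

About 1 experiment in 20 comes out "significant" by chance alone; if those are the experiments that reach print, the published false-positive rate exceeds the nominal 5%.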
Modeling errors were briefly discussed previously and need not be discussed further here. The selection of sample size is the major item that is neglected by medical researchers. This selection process should be part of the experimental modeling. In addition, the role of chance or "bad luck" is an integral part of sample-size calculations. In this context, then, experimental modeling together with sample-size considerations warrants further discussion.

Factors Bearing on Sample Size

The central question to be addressed is, "How many subjects are needed to be reasonably sure of finding a real and important difference if it exists?" The precise answer to this question in any given situation depends on a number of items, such as the type of sample and the form of the data. Details of these various situations need not be of concern; only the important principles are relevant in an introductory discussion such as this. The factors that are of relevance in a general situation, and which must be specified by the investigator, are: (1) acceptable chance of false-positive error; (2) acceptable chance of false-negative error; and (3) what constitutes a clinically important difference. In some instances, an estimate of the variability of the data, the "standard deviation," also must be supplied.

For example, suppose an investigator is planning a study comparing the efficacy of computed tomography (CT) versus nuclear magnetic resonance (NMR) for detection of liver tumors. Two different sets of patients will be examined, one group with CT and one with NMR. Experience with CT shows that the proportion of liver tumors detected is about 80%. Because NMR machines are quite expensive, it is believed that NMR should detect at least 10% more lesions (i.e., 90% of the lesions) or its purchase would not be justified clinically. The investigator is willing to take a 5% chance of a false-positive error (i.e., stating NMR can detect 10% more lesions when it really cannot), but also wishes a 90% chance of avoiding a false-negative error (i.e., stating NMR cannot detect 10% more lesions when it really can). How many patients are needed? The answers to the three questions are: (1) chance of false-positive error = 5%; (2) chance of false-negative error = 10% (90% chance of avoiding this error); and (3) clinically important difference = 10%. Armed with these data, a sample-size table may be consulted [5]. We would need 305 patients in each group, or a total of 610 patients.

Suppose the three parameters are changed: What would happen to the required sample size? Using common sense, is it possible to predict if the required sample size will increase or decrease? If the acceptable levels of false-positive/false-negative errors are decreased, it seems logical to expect the sample size to increase. Why? Because we are demanding greater precision, a more thorough investigation will be required. As expected, the required sample size is larger than 305 if the acceptable chances of errors are decreased (table 2). What if the "clinically important difference" is increased to 15%; that is, the NMR machine would be justified economically only if it could detect 95% of liver tumors? Here the NMR machine would have to be much better than CT, so this difference should be easier to detect. Indeed, the required sample size drops dramatically to 125, less than half of what it was previously (table 2).

TABLE 2: Effect of Various Parameters on Required Sample Size

Situation                        FP (%)   FN (%)   CID (%)    S
Reference problem (see text)        5       10       10      305
Decrease FP to 2%                   2       10       10      369
Decrease FN to 5%                   5        5       10      367
Increase CID to 15%                 5       10       15      125

Note.-FP = acceptable false-positive error; FN = acceptable false-negative error; CID = "clinically important difference"; S = sample of subjects per group.
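The figures above come from tabulated values [5]. As a rough cross-check, the standard normal-approximation formula for comparing two proportions (with the Fleiss continuity correction) can be coded in a few lines. This is a sketch, not the table the authors consulted, so it lands near, rather than exactly on, the quoted values; tabulated and approximate sample sizes commonly differ by a modest margin.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """Approximate subjects per group to compare two proportions
    (two-sided test, normal approximation with continuity correction)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    delta = abs(p2 - p1)
    n = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
         + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / delta ** 2
    # Fleiss continuity correction
    n_cc = n / 4 * (1 + math.sqrt(1 + 4 / (n * delta))) ** 2
    return math.ceil(n_cc)

# CT detects about 80% of liver tumors; NMR must detect 90% to be worthwhile.
n = sample_size_two_proportions(0.80, 0.90, alpha=0.05, power=0.90)
print(n, "subjects per group")  # on the order of the ~305 quoted in the text

# Raising the clinically important difference to 15% (NMR must reach 95%)
n15 = sample_size_two_proportions(0.80, 0.95, alpha=0.05, power=0.90)
print(n15, "subjects per group at a 15% CID")  # falls sharply, as in table 2
```

The qualitative behavior matches table 2: tightening the error rates inflates the required sample, while widening the clinically important difference deflates it.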
Let us approach the same problem differently. From experience it is possible to estimate how many patients would be available for this study. What are the chances of detecting a clinically important difference if it exists? In other words, the sample size has been specified; in addition, let the chance of a false-positive error be set at its traditional level of 5% (i.e., if p < 0.05 the result is statistically significant). The situation is depicted in figure 1. As the number of subjects available increases, the chance of detecting a clinically important difference increases. In addition, as the preset level of the "clinically important difference" increases from 5% to 10% to 15%, the required sample size decreases. Note also the wide variability in the sample sizes. For example, if the investigator specifies only a 50% chance of detecting a clinically important difference of 15%, only 61 subjects per group are needed. Alternatively, if a 95% chance of detecting a 5% difference is needed, 1,577 subjects per group would be needed.

[Figure 1 plots the chance of detecting the difference against the number of subjects available per group (up to 2,500), with one curve each for CID = 5%, 10%, and 15%.]
Fig. 1.-Chance of detecting a "clinically important difference" (CID) with known sample size. This graph refers only to the situation described in the text. Details of other problems would be different, but the principles would be the same.

The large values of sample sizes required under certain circumstances are unfortunate. Few investigators have access to the large number of subjects required to address certain clinical problems. However, the laws of probability cannot be altered.
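The shape of figure 1 can be sketched with the corresponding power formula (normal approximation, no continuity correction). This is an illustrative approximation of the graph's behavior, not a reproduction of the exact curves or the 61 and 1,577 figures quoted above, which come from tabulated methods.

```python
import math
from statistics import NormalDist

def power_two_proportions(p1, p2, n, alpha=0.05):
    """Approximate chance of detecting a true difference between two
    proportions with n subjects per group (two-sided test at level alpha,
    normal approximation)."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    delta = abs(p2 - p1)
    num = delta * math.sqrt(n) - z_a * math.sqrt(2 * p_bar * (1 - p_bar))
    den = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return nd.cdf(num / den)

# Chance of detecting a 10% improvement over CT's 80% detection rate
# rises steadily with the number of subjects available per group:
powers = [power_two_proportions(0.80, 0.90, n) for n in (100, 305, 1000)]
for n, p in zip((100, 305, 1000), powers):
    print(n, round(p, 2))
```

As in figure 1, the detection chance climbs with the available sample, and at roughly 305 subjects per group it sits near the 90% power specified in the reference problem.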
If pretrial sample-size considerations show that the required number of subjects is not available, some choices have to be made. Among them are the following:

1. Increase the acceptable false-positive and/or false-negative error rate.
2. Increase the value of the "clinically important difference."
3. Do both 1 and 2.
4. Combine with other investigators and/or increase the length of the study.
5. Do the study anyway. If the data analysis shows a statistically significant difference, all is well. If the analysis is not statistically significant, calculate what chance there was of detecting a clinically important difference. If this is unacceptably small (i.e., high chance of false-negative error), then the conclusion must be "the answer is uncertain." (See the discussion of confidence intervals which follows.)
6. Do not do the study.

In general, if an investigator will accept large false-positive and/or false-negative errors, or if only a large difference between groups is of clinical interest, the required sample size will be small. Conversely, decreasing the acceptable levels of false-positive and/or false-negative errors or the "clinically important difference" forces the required sample size to increase, sometimes dramatically.

Discussion

At the beginning of this article, we stated that the "pertinent mathematical parameters" needed to justify a "no significant difference" conclusion were not included in the papers cited. What were they? The answer should now be clear. The acceptable level of false-negative error and the value of the "clinically important difference" were not specified.

Here are two statements that are seen often in the literature and/or are heard at conferences. The first: "We investigated two groups; the difference between them was not statistically significant." What does this mean? It should be clear that it is ambiguous. The value of the clinically important difference and the acceptable level of false-negative error were not specified, so what constitutes an adequate sample size cannot be answered. A valid conclusion would be "the relationship between the two groups is uncertain." A second statement would be, "The two groups are statistically identical." Here the clinically important difference is specified; it is zero. This alone is enough to determine the required sample size: an infinite number of subjects in each group would be required!

Finally, the concept of "confidence intervals" should be mentioned [1]. Suppose a study has been completed and the results are not statistically significant. Now what? Surely the data should tell us something. Do they suggest that the alternative hypothesis is true, or false? These questions can be addressed using calculations of confidence intervals. This concept is very similar to the standard deviation and allows us to estimate where the true difference lies. For example, suppose a confidence interval allowed us to say, "There is a 95% chance that the true detection rate of NMR for liver tumors lies between 78% and 94%." We know that CT detects 80%. Since the majority of this confidence interval is above 80%, we could state, "Our data suggest that NMR is indeed better than CT for detection of liver tumors." But here is the catch: The confidence interval (78%-94%) includes 80%; therefore we cannot be 95% certain that NMR has a detection rate better than 80%. In other words, NMR is winning the race, but the finish line has not been crossed.
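An interval of roughly this width arises from the simple normal-approximation (Wald) formula for a proportion. The counts below are invented to match the hypothetical 78%-94% example; more refined interval methods exist, but this sketch shows where such numbers come from.

```python
import math
from statistics import NormalDist

def wald_ci(successes, n, conf=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion;
    a simple illustration, not the only (or best) interval method."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Hypothetical result: NMR found 86 of 100 liver tumors.
lo, hi = wald_ci(86, 100)
print(f"95% CI for NMR detection rate: {lo:.2f} to {hi:.2f}")
# The interval straddles CT's 80%: NMR looks better, but the case is
# not proved at the 95% level.
```

Because the lower limit falls below 0.80, the data favor NMR without establishing its superiority, precisely the "winning the race, but the finish line has not been crossed" situation described above.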
In the medical literature, "no significant difference" is often misinterpreted. Authors and readers of this material may feel that a particular scientific question has been answered when in fact it has not. When the actions of radiologists are influenced by misinformation, optimal patient care cannot be delivered. Greater attention to the significance of "no significant difference" is necessary to alleviate this problem. Should further information be desired by the reader, a short list of references has been provided [1-3, 5-7].

ACKNOWLEDGMENTS

We thank Joyce DePrizio and Susan Phillips for help in manuscript preparation.

REFERENCES

1. Freiman JA, Chalmers TC, Smith H, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: survey of 71 "negative" trials. N Engl J Med 1978;299:690-694
2. Reed JF, Slaichert W. Statistical proof in inconclusive "negative" trials. Arch Intern Med 1981;141:1307-1310
3. Feinstein AR. Clinical biostatistics. XXXIV. The other side of "statistical significance": alpha, beta, delta, and the calculation of sample size. Clin Pharmacol Ther 1975;18:491-505
4. Phillips WC, Kattapuram SV, Dosoretz DE, et al. Primary lymphoma of bone: relationship of radiographic appearance and prognosis. Radiology 1982;144:285-290
5. Fleiss JL. Statistical methods for rates and proportions (Wiley Series in Probability and Mathematical Statistics). New York: Wiley, 1973:178-194
6. Aleong J, Bartlett DE. Improved graphs for calculating sample sizes when comparing two independent binomial distributions. Biometrics 1979;35:875-881
7. Altman DG. Statistics and ethics in medical research. III. How large a sample? Br Med J 1980;281:1336-1338