BIOPHARMACEUTICS & DRUG DISPOSITION, VOL. 10, 331-351 (1989)

REVIEW ARTICLE

THE LOGICAL STRUCTURE AND VALIDITY OF EXPERIMENTAL DESIGNS IN PHARMACOKINETICS AND CLINICAL PHARMACOLOGY

EMANUEL J. MASON*†

Department of Educational & Counseling Psychology, University of Kentucky, Lexington, Kentucky, USA

AND

MEIR BIALER

Department of Pharmacy, School of Pharmacy, Hebrew University, Jerusalem, Israel

ABSTRACT

Much of the literature on research design in clinical pharmacology and pharmacokinetics emphasizes statistical concerns, thus suggesting that a primary ingredient of a valid research design is an appropriate plan for statistical analysis of data. However, statistical validity is only one of several ways to evaluate an experimental study. The present paper reviews the underlying logic and sources of invalidity of experimental drug research, identifying influences and factors which may lure an experimenter into erroneous conclusions.

KEY WORDS: Experimental design; Experimental validity; Research design

INTRODUCTION

Much laboratory and clinical research in pharmacokinetics and clinical pharmacology can be classified as experimental. For the purpose of the present discussion, experimental research is defined as scientific investigations involving the comparison of one or more treatment(s) and at least one control condition. The comparisons are done to determine the degree and nature of relationships

*Currently Lady Davis Visiting Professor in the School of Pharmacy, Hebrew University, Jerusalem, Israel.

† Addressee for correspondence.

0142-2782/89/040331-22$11.00
© 1989 by John Wiley & Sons, Ltd.

Received 16 May 1988

existing between independent variables (treatment and control conditions) and dependent variables (criteria). For example, a treatment may involve use of a new drug or treatment plan, while the control condition might be the commonly used treatment or no treatment at all.1 Much has been written about the design of experiments in the pharmaceutical sciences.2-8 The emphasis of these writings has been on methods of statistical analysis and interpretation of data.9 The implication of this emphasis is that appropriate statistical design is the primary ingredient of a valid experiment. However, conditions may exist in a research setting that would influence the conclusions of the experimenter, and to which a statistical design would not be sensitive. In the present paper the validity of experiments in pharmacokinetics and drug research is investigated from a perspective that includes but extends beyond statistical design and analysis.

AN EXPERIMENT AS A LOGICAL ARGUMENT

In logic an argument consists of premises leading to a conclusion. For example, if it is true that All flowers are plants, and it is also true that A daisy is a flower, then the conclusion that A daisy is a plant is also true. This particular argument is based on a familiar syllogism, modus ponens (or affirmation of the antecedent). It is usually represented in the form:

P ⊃ Q     (Major premise)
P         (Minor premise)
∴ Q       (Conclusion)

where ⊃ represents implication (i.e., P implies Q) and ∴ represents ‘therefore’. This argument will always produce true conclusions when the premises are true, and for this reason the structure of the argument is considered valid. However, even in a valid argument, when the premises are false, the conclusion can be problematical. For example, a romantic fellow who is in love with a girl named Daisy might say ‘Daisy is a flower’. If he did, however, the above conclusion that Daisy is a plant would not make sense even though the form of the syllogism has not changed. To put it another way, the conclusion is valid because the argument form is valid, but the conclusion may not be true if the premises leading to it are false (e.g., a girl is never a plant). A valid argument leads to true deductions only when the premises are true.10

Logical thinking of this kind can be applied to experimental research. One can set up an experiment to test whether a null or alternative hypothesis is supported by experimental observation. When an experiment is designed well, its results will lead logically to a conclusion about the experimental hypothesis. For example, a researcher might propose a hypothesis based on previously established knowledge about how a particular drug distributes, is eliminated, and acts in the body, particularly as it affects hunger. Specifically, one behavioural
hypothesis under investigation might state that subjects who use preparation A of the drug will leave less on their plates at the end of each meal than those who use preparation B. This hypothesis might serve in a manner analogous to the first (or major) premise of the above syllogism. (Issues regarding the deductive and inductive nature of this reasoning process will be left for another paper.) Then the researcher designs an experiment in which the effects of each preparation on the eating behavior of a select group of subjects are compared over a period of 2 weeks. When the data are analyzed (with a repeated measures ANOVA or other appropriate procedure), they show that subjects taking preparation A clearly had less remaining on their plates after each meal, suggesting that they ate more than subjects under the preparation B condition. This finding can be considered analogous to the second premise in the modus ponens argument. On the basis of this finding, the researcher concludes that treatment with preparation A should generally be expected to lead to better appetite in a clinical setting than preparation B.

To further illustrate the relationship between a logical argument and a scientific experiment, consider the premises displayed in Table 1. This simple experiment is based on very questionable premises which render the conclusion highly suspect. There are many ways that a drug can affect diet behavior to make subjects appear to become more or less hungry when the effect is really on something other than appetite. For example, a change in taste sensation might lead to less eating, or a slowing of metabolism and disposition might result in less need for large amounts of food. Further, although the clinical trials seemed to support the hypothesis, this conclusion was based on the assumption that the groups were comparable at the start of the study. This would not have been the case if those treated with preparation A were members

Table 1.
The relationship between a scientific experiment and a logical argument

Premise in           Statement                                            Premise in
logical argument                                                          experiment

Major                Drug A will reduce appetite less than drug B         Hypothesis
                     (if subjects eat more under drug A treatment)

Minor                Drug A subjects were observed to have eaten          Direct observations
                     more than drug B subjects

Conclusion           Drug A reduces appetite less than drug B             Conclusion
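The distinction drawn here between a valid argument form and true premises can be sketched in a few lines of code. This is purely illustrative (the function name and structure are ours, not the authors'): a valid form guarantees a true conclusion only when the premises are in fact true.

```python
# Modus ponens: from "P implies Q" and "P", conclude "Q".
# A valid FORM yields a true conclusion only when the premises are TRUE.

def modus_ponens(p_implies_q: bool, p: bool) -> bool:
    """Return the truth of the conclusion Q, given established premises.

    If either premise is not established, the argument form licenses
    no conclusion at all, so we refuse to draw one.
    """
    if p_implies_q and p:
        return True  # valid form + true premises -> true conclusion
    raise ValueError("premises not established; no conclusion follows")

# Sound argument: "All flowers are plants" and "a daisy is a flower".
print(modus_ponens(True, True))   # True

# Unsound argument: 'Daisy' is a girl, so the minor premise is false.
try:
    modus_ponens(True, False)
except ValueError as exc:
    print(exc)
```

The same failure mode applies to the experiment in Table 1: the inference machinery is sound, but a false premise (e.g., non-comparable groups) blocks the conclusion.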

of a professional football team and those receiving preparation B were volunteers from the local chamber music society. In addition, the amount, appearance, and type of food served to subjects during the course of the study should have been identical in both treatment groups. Thus, the experimental design, a simple two-group comparison, might have appeared valid as a test of the hypothesis, but the conclusion could be faulty because the truth of many of the assumptions underlying the premises cannot be verified. These problems would persist even when the statistical techniques are appropriate for the data. The remainder of this paper will address the validity of experiments by examining aspects of the underlying premises.
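As a concrete sketch of the two-group plate-waste comparison described above, per-subject mean plate waste might be compared between groups as follows. The data are invented, and a simple Welch t statistic stands in for the repeated measures ANOVA the text mentions; this is an illustration of the comparison's logic, not of the authors' analysis.

```python
import statistics
from math import sqrt

# Hypothetical plate waste (grams left per meal), one value per subject,
# averaged over the meals of the two-week study.
prep_a = [12.0, 15.5, 9.8, 14.2, 11.1, 13.6, 10.4, 12.9]   # preparation A
prep_b = [18.3, 21.0, 17.5, 19.9, 22.4, 16.8, 20.1, 18.7]  # preparation B

def welch_t(x, y):
    """Welch's t statistic for two independent samples."""
    mx, my = statistics.mean(x), statistics.mean(y)
    vx, vy = statistics.variance(x), statistics.variance(y)  # sample variances
    return (mx - my) / sqrt(vx / len(x) + vy / len(y))

t = welch_t(prep_a, prep_b)
print(f"mean A = {statistics.mean(prep_a):.1f} g, "
      f"mean B = {statistics.mean(prep_b):.1f} g, t = {t:.2f}")
```

A large negative t (less waste, i.e. more eaten, under A) supports the minor premise, but, as the text stresses, says nothing about whether the groups were comparable to begin with.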

VALIDITY OF EXPERIMENTS

The validity of experiments can be examined from several approaches. Each provides a unique set of concerns about how to validly interpret observed differences between treatment and control conditions in an experiment. Four aspects of validity that have been suggested11-14 are internal validity, external validity, construct validity, and statistical conclusion validity. These are explained below.

Internal validity

This refers to whether the differences between treatment and control conditions can (or cannot) be directly attributed to the treatment. Thus, in the above example, if the observed effects of preparation A and preparation B can be attributed only to the drugs themselves rather than to such factors as differences in characteristics of the subjects in the two groups, the experiment’s validity could not be questioned on these grounds. Internal validity has also been referred to as linking power.15 This label originates in the concept that internal validity deals with the degree to which variables in the study can be connected or linked to produce the results. In other words, linking power is determined by the reasonableness of the connection between the independent and dependent variables in the study. A study must have internal validity if it is to be of any value.

External validity

This involves the ability to extend the observed results of an experiment to other settings, persons, contexts, and combinations of variables. For example, the findings of an experiment involving asthmatic adolescents in Phoenix, Arizona might not generalize well to adults in New York City who have other respiratory problems. In other words, the findings of research using one rather narrowly determined sample will not have validity in another very different sample. External validity also refers to ecological factors that may affect decisions about generalization. For example, subjects being used in a controlled
laboratory study of bioavailability would not provide conclusions suitable for generalization to subjects in a less controlled clinical study in a field setting.

Construct validity

This aspect refers to whether the measures and variables used in the study can be considered representative of the underlying scientific constructs. Problems with construct validity can occur when an inappropriate or unreliable measure is used, or when the treatment effects are confounded. An example of an inappropriate measure might be the use of hat size to measure effects of a drug being tested for treatment of sinus headaches. Confounding may occur in crossover drug experiments where carryover effects might have some influence.16 Construct validity is often considered an element of external validity since it is difficult to generalize confounded treatments to other settings beyond the experimental situation.

Statistical conclusion validity

This involves the validity of the conclusion about the relationship between independent and dependent variables. Unreliability (or instability) of results might contribute to misleading conclusions. Statistical conclusion validity can be influenced by such concerns as sample size and power, incorrectness of the underlying model, and failure of the data and subject pools to reflect certain critical assumptions. This is the kind of experimental validity most discussed in the pharmaceutical literature. While this kind of validity is part of the overall validity of an experiment, it alone is not sufficient to ensure validity of the conclusions.

Figure 1 shows the relationship of the four kinds of validity listed above to the overall validity of an experiment. It is tempting to suggest that one kind of validity is the most essential, but a balanced analysis would reveal that they are

Figure 1. Relationship of four aspects of validity to total validity of an experiment

all important, and that pre-eminence of one kind or another is probably situational. For example, internal validity would seem essential in all comparative studies. Yet, if the purpose of the study is to yield survey information on the widespread application of a treatment, the representativeness of the sample and construct validity might be more germane.

Although statistical issues have received considerable attention in the pharmaceutical research literature, the broader aspects of experimental design validity have been recognized5 but have not been systematically analyzed. The need for such analysis is probably greatest in Phase III clinical research, in which persons with clinical orientation and training rather than those with more formal research orientation and background typically manage the research endeavor.17,18 In the next section, specific sources of threats to the valid interpretation of experiments are explored. Since statistical and construct validity are somewhat subordinate to internal and external validity, the latter are considered first.

THREATS TO INTERNAL VALIDITY

The purpose of an experiment is to enable determination of whether the independent variable (or variables) X (i.e. the treatment groups) influenced the dependent variable (or variables) Y (measures, observations, or other criteria). This determination would not be possible if the observed relationship were the result of some extraneous variable(s) Z. Further, since Z variables are extraneous (often called nuisance variables) and are not systematically studied or controlled, when they are present the experimenter typically will be able to say little about them other than to acknowledge their potential for explaining the observed relationship between X and Y. Figure 2 shows some of the ways that Z might be responsible for the relationship between X and Y. The Z variables in Figure 2 represent or result from threats to the internal validity of the experiment.
An understanding of the threats to internal validity will help experimenters to design studies that avoid these threats. Several of the recognized threats to internal validity are discussed below.12

History

This becomes a threat to validity when the observed difference between the experimental and control groups is due to uncontrolled events that occurred in the comparison groups prior to or during the experiment. Experimental and control groups can usually be made comparable initially by random assignment of subjects prior to the beginning of the experiment. However, randomization may not insure equality of groups on this dimension over a period of time. For example, in a clinical study held in a hospital setting over a 7-day period, a sample of 20 subjects is randomly assigned to receive either drug A or drug B in a double blind study. However, on the second day one subject in treatment A

Figure 2 (a), (b), (c), (d). Some possible relationships of Z (intervening variable(s)) to the independent (X) and dependent (Y) variables in an experiment

undergoes minor surgery, and another gets a change in diet which includes food substances that may react with the drugs being tested. Each of these events could have influenced the outcome by masking or enhancing the observed difference in effects between the two treatment groups. Further, in crossover designs in which each subject serves as his own control, the problem of history will exist when the situation of the subjects changes in some relevant way between treatment conditions. History effects can be controlled in much Phase I and II research in laboratory studies of short duration because the environment can be fairly well isolated from unwanted influences. However, in clinical field trials the problem is more difficult to control, and the researcher must design procedures into the study to deal with this threat. Such procedures might include daily records of patients’ progress and treatment, and random variation of the order of treatment and control conditions in crossover studies in order to insure that experimental conditions have not been contaminated by events.
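Random assignment and random ordering of crossover conditions, as recommended above, can be done reproducibly from a fixed seed so that the allocation is auditable. A minimal sketch (subject labels, group sizes, and the seed are arbitrary choices of ours):

```python
import random

rng = random.Random(2024)  # fixed seed: the assignment can be re-derived

subjects = [f"S{i:02d}" for i in range(1, 21)]  # 20 subjects, as in the example
rng.shuffle(subjects)
group_a, group_b = subjects[:10], subjects[10:]  # drug A vs drug B

# For a crossover study, also randomize the order of conditions for each
# subject so that history effects are not confounded with sequence.
orders = {s: rng.sample(["A", "B"], k=2) for s in subjects}

print("drug A group:", sorted(group_a))
print("first subject's sequence:", orders[subjects[0]])
```

Note that, as the text warns, this only equates the groups at the outset; events during the 7 days (surgery, diet changes) can still break comparability.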

Maturation

When the subjects in treatments grow or change over time within the experimental conditions, physical, physiological, or psychological changes might occur which can result in observed differences not due to the treatment. For example, when the average age of the subjects is considerably different between experimental groups, or the treatment groups are located in different settings, rates of change observed in subjects between experimental groups might be due to these differences rather than treatment effects. This particular threat can often be controlled by equating the treatment groups before the experiment begins, usually by randomly assigning subjects to treatment conditions and then controlling the settings to insure that conditions within the experimental groups are identical in every way except for the treatments under study. However, the
longer the duration of the experiment, the more difficult it tends to be to maintain control over an experimental setting.

Testing

When subjects must be tested repeatedly, the results of the study may be contaminated by the physical and psychological influences of continued testing (e.g. boredom, fatigue, soreness, venous collapse, increased tolerance or dependence). To illustrate, in a study of psychoactive medications in which some of the dependent measures are behavioral, subjects might learn over repeated trials to behave differently as a result of the testing. For example, if subjects are given a complex manual task to do after being administered a drug designed to reduce sleepiness, the subjects might become more skilled at the task over repeated testing. Thus, the observed differences between the pre- and post-test would not be due to treatments but rather to practice.

Instrumentation

This is a threat to validity when conclusions about experimental comparisons can be influenced by changes in measurement instruments that occur during the course of an experiment. This could be due to physical changes in measurement devices resulting from such external influences as humidity, failure to use the instrument properly, or failure to recalibrate or service an instrument as necessary. In addition, many drug experiments rely on the judgment of the researcher or technician to record observations accurately. If, in the course of collecting data, the basis for this human judgment changes due to fatigue of the observer, a rushed procedure to meet a deadline, or different persons collecting the data during the course of the experiment, an instrumentation problem can occur.
This can be a particularly important problem in studying patients’ reports of side-effects (such as headaches, changes in hunger, thirst, sex drive, etc.), and in research on psychoactive drug effects.18 Methods for controlling instrumentation effects include training those collecting the data to be consistent, avoiding protocols which require long and tedious judgments to be performed by an observer without sufficient rest periods, using several observers and randomly assigning them to subjects in a double blind fashion, providing clear standards and criteria for observers, and insuring that a single observer has not made more of the judgments in one of the treatment conditions than in others. In addition, attention to measurement devices, test materials and procedures, scoring standards, and observational practice is necessary to insure against unwanted instrumentation effects. Also, the conditions in which the observations are made should be similar in each treatment group. These conditions can vary when different treatments are being tested in different laboratories, hospitals, wards, or other clinical settings.

Statistical regression

This threat occurs when subjects are assigned to extreme groups on the basis of a relatively unreliable pre-test. For example, consider a study of the effects of
a medication on psychological depression. Subjects are selected for one of the treatment groups on the basis of high scores on a psychological test of depression while others are selected to be in the comparison group on the basis of low scores. One way to look at an observed score (the score the person gets) on such psychological tests is to consider the score to be the sum of two components, true score and error, i.e.,

X_i = X_t + error

where X_i is the observed score and X_t the true score. The true score is a theoretical value similar to the central parameter value in a sampling distribution. In other words, the true score is the score around which a person’s observed scores vary over repeated testing (with the same test, assuming no serial dependence of test scores). The less reliable a test, the larger the influence of the error component. Figure 3 shows a symmetrical sample distribution of observed scores, the range of error around each score and the true score contained in this range, and the score on the post-test. Notice that some of the subjects selected to be in the high group had true scores in the middle group, and some of those eliminated from the high group based on their observed scores had true scores that would have qualified them for the higher group. This phenomenon makes it likely that many of the subjects selected to be in the high group will score lower on the post-test. A similar situation occurs in the lower-scoring group. Regression toward the mean could mislead a researcher to conclude that the experimental antidepressant was effective with the more depressed subjects but increased depression in less depressed subjects. However, the observed changes would have been due to the unreliability of the test scores rather than the drug. Statistical regression can be avoided by declining to select subjects on the basis of high and low scores on a test or rating scale.
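The regression effect just described can be demonstrated by simulation: generate true scores, add independent error at pre-test and post-test, select the extreme group on the observed pre-test, and compare means. The distribution parameters and cutoff below are arbitrary choices of ours, not values from the depression example.

```python
import random

rng = random.Random(7)
N = 10_000

# Observed score = true score + independent measurement error.
true_scores = [rng.gauss(50, 10) for _ in range(N)]
pre = [t + rng.gauss(0, 8) for t in true_scores]
post = [t + rng.gauss(0, 8) for t in true_scores]

# Select the 'high' group on the observed pre-test, as in the example.
high = [i for i in range(N) if pre[i] > 65]

def mean(xs):
    return sum(xs) / len(xs)

pre_high = mean([pre[i] for i in high])
post_high = mean([post[i] for i in high])

# Even with NO treatment at all, the selected group scores lower at
# post-test, purely because of unreliable (noisy) pre-test scores.
print(f"pre-test mean of high group:  {pre_high:.1f}")
print(f"post-test mean of high group: {post_high:.1f}")
```

The post-test mean falls several points below the pre-test mean of the selected group, which is exactly the pattern a researcher might misread as a drug effect.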

Figure 3. Ranges of scores for people taking a test with x marking the true score, and o denoting the score that is observed on a single administration as a pretest. The letter p denotes a single post-test score within each range

Selection

This threat confounds the validity of an experiment when subjects in one or more experimental treatment groups are systematically different from subjects receiving other treatments. This is a major problem in clinical drug research because of ethical issues involved in withholding treatments, the small numbers of subjects that might be available, the cost of doing a study, and an interest in maintaining the real characteristics of a field or clinical setting for clinical trials.5 The preferred manner of handling these problems in drug research seems to involve a combination of crossover designs and matching subjects’ characteristics across groups. However, even when a large number of characteristics are matched (e.g. age, weight, gender, height, body frame, blood type, etc.), these approaches are not always sufficient to insure equality of the subject groups, because a large number of characteristics remain for which subjects have not been matched, and because the subject pool often is not large enough for sufficient matching.15,16 Statisticians would prefer to handle the selection factor by randomly assigning subjects to treatment groups. However, in clinical settings this is not always possible or desirable, particularly when there is an interest in studying the relationships between the characteristics of subjects and the medication. Another statistical approach is to covary the characteristics in the analysis, a procedure that is also not without subtleties and
shortcomings.19

Mortality

When certain subjects drop out, or in other ways become unavailable to the researchers during the course of the experiment, mortality results. For example, in a crossover study, if only the stronger subjects in the group that receives a particular sequence are living by the end of the experiment, then it would be difficult to interpret the effects of the treatments. Even though small groups can be equated by randomization before the start of the experiment, mortality can render the groups incomparable, and can be similar in effect to selection bias.

Interactions with selection and other threats

These occur when selection and one or more other threats combine to produce concerns about validity. For example, a selection-maturation problem might result when subjects in each treatment group are chosen from different clinical settings and the experimental treatments are assigned to settings. If the clinical group in one setting is weaker or more ill, the medication might take longer to show effects in that group, misleading the experimenter to conclude that the difference is due to the treatment rather than a combination of maturation and selection. Selection-instrumentation might occur when subjects in a treatment group are unable to respond appropriately to the researcher’s questions, or cannot read the paper-and-pencil questionnaire that they are asked to use to describe how they
feel. Such a problem might arise in studying the behavioral or psychological effects of medication if the subjects are very weak, do not provide a range of observational data, cannot respond easily, or do not speak the language of the researcher well enough to communicate how they feel. A similar set of difficulties might arise when researchers base conclusions on inferences drawn from the observed behavior of dogs and other animals. Certain situations require more or different kinds of awareness of interaction threats than others.

Maintenance of experimental conditions over time

This becomes a problem even in well-designed experiments when the researcher cannot control changes that naturally occur in the experimental treatments over the duration of the experiment. In the course of working with clinical populations, it is not unusual to hear of patients who exchange medicines with friends and relatives, cheat on their diets and medication schedules, change physicians and treatment plans, get rediagnosed, have their regular nurses go on vacation, discuss their diseases and treatments with other patients, and otherwise reduce the purity of their experimental treatments. In addition, well-meaning clinicians may have ethical and humanitarian concerns about giving patients the trial medications in an experiment. Another problem which might contribute to the breakdown of the experimental conditions is the nature of the control or comparison conditions. Such would be the case if the experimental group is given a medication while the control group is given nothing. Even in a crossover design, a no-treatment control condition is not considered an appropriate comparison to a treatment, at least partly because of the psychological side-effects of taking medication or otherwise being treated for illness. Subjects who receive no treatment can become depressed, try self-treatment, or drop out of the experiment, thus undermining the equivalence of treatment and control groups.
For this reason, experimental treatments are usually compared with placebo controls. These kinds of validity threats are generally not repairable merely by randomly assigning subjects. Researchers should design their studies so that they can be sure of the degree of comparability of the experimental conditions throughout the study.

THREATS TO EXTERNAL VALIDITY (GENERALIZABILITY OF THE FINDINGS)

External validity refers to the degree to which the results of experiments can be generalized. Three concerns seem primary in transporting results beyond their experimental origins:

1. the target population and the sample studied;
2. the setting in which the experimental data were observed;
3. the operationalization of the constructs in the study (or, ‘Was the right thing measured?’).

The first two are discussed below. The third issue is considered in detail in the section on construct validity.

Identity of the population sampled

This involves the question of whether the results of an experiment can be generalized to other subjects, animals, patients, or persons beyond those directly studied. For example, a study done with asthmatic adolescents might not yield results that can be directly applied to middle-aged incarcerated drug addicts of both sexes. Most research is done on samples of subjects that are tacitly assumed to be representative of larger populations. However, most research studies are conducted with available samples (samples made available by a particular clinic, hospital, university, or other entity), rather than samples that are designed to be statistically representative of specific populations. Further, much of the research tends to give inadequate data about the experimental groups. Thus, when an experimenter reports that a drug was tested on a sample of hypertensive patients (of a certain average age, gender, and degree of hypertension) at XYZ University Hospital, it is difficult to determine to whom such results might be relevant in the broader sense because of the limited information available about the patients. In order to make generalizations one should be able to identify the character of the sample group clearly, and be certain that it is representative of the target population. Researchers tend to look at similarities of treatments in different studies across populations when they integrate findings from several studies. However, the tendency to consider anything beyond the grossest differences in sample characteristics across populations is less pronounced. Yet, comparing studies across samples from different populations is an aspect of generalization, and population characteristics should be part of any such analysis. Specific threats to the validity of a researcher’s generalizations of results usually concern selection acting in combination with other influences. Examples are given below.
Interaction of selection and experimental arrangements. This occurs when subjects have been selected because of their tendency to give a certain kind of reaction to the experimental treatments. For example, if highly motivated volunteers are used, there is a greater probability that regimens and other aspects of treatment will be complied with than would be true in the population at large. Results could not be generalized to less motivated subjects. In another illustration, subjects might be selected who are likely to be more homogeneous with respect to the treatment than the population at large. For example, in studying the effectiveness of a treatment for reducing fever, it is likely that subjects running high fevers will be selected in order to maximize the range in which effects can be observed, and to provide a fairly homogeneous sample in which to do the tests. However, the advantages of such a sample will be somewhat offset by the reduced ability to generalize the results to subjects
with a broader range of fever levels. Use of a homogeneous group of subjects will tend to increase the probability of internally valid findings, but will make generalization to a wider group more difficult.

Reactive or interactive effects of testing and selection. These can occur when there is a pre-test that sensitizes the particular group of subjects used to the treatment conditions. For example, the pre-test of a drug being tested for its effects on hypertension might sensitize participating hypertensive subjects to the nature of the study sufficiently to affect the results more than in an unpre-tested sample. Even though the pre-test is given to both experimental and control subjects, and thus may not produce questions about internal validity, the results can only be generalized to situations in which a pre-test is given to persons using the hypertension medication.

Interaction of selection and maturation. This threat may result when subjects are selected for both the experimental and control conditions who may show considerable growth or change during treatment over the course of time. (This effect is different from the selection-maturation interaction discussed in terms of internal validity, which dealt with the assignment of subjects with certain characteristics to specific treatment groups.) For example, if the treatment will require several weeks or months, and the subjects are growing children and adolescents, it may be difficult to generalize the results to adults, even when the internal validity of the study is rather good.

Experimental settings or conditions

These present problems in generalization of results when the uniqueness of the settings or conditions in which the treatments are tested precludes generalizing to other conditions and settings. Examples of this class of threats to external validity are given below.

Interaction of experimental arrangements and treatment.
This can be a problem when the experimental conditions or setting might affect the observations recorded in a manner that would not be seen outside the experimental setting. For example, if the experimental conditions require monitoring of a drug in the bloodstream on an hourly basis, but this is not done in normal clinical application of the drug, then generalization about how the drug will work outside the experimental setting should be limited to situations which match the conditions of the experiment.

Multiple treatment interference. This may occur in an experiment involving several treatments applied to every subject. It can be an especially prevalent difficulty in crossover designs involving several treatment and control conditions.16 Depending on the design, the researcher may only be able to generalize the findings to settings in which the same sequences or combinations are used.


Interaction of pre-testing with treatments. This might occur when subjects are pre-tested. Pre-testing might be done to determine sensitivity, to establish a baseline, or for other comparative purposes. When the pre-test involves using a medication with an insufficient washout period, the pre-tested groups might metabolize or otherwise process the treatment medications differently than a group that was not pre-tested. Further, if the pre-test involved survey questions, then subjects' behavior might differ from what it would have been had they not been pre-tested. Therefore, generalization may justifiably be limited to those situations in which a pre-test is used with the treatment.

THREATS TO CONSTRUCT VALIDITY

Scientists form constructs to facilitate communication about the concepts they investigate. The precise meaning of a particular construct may be difficult to pin down, but constructs can be defined so that their meaning is generally accepted. For example, one might define the construct of 'drug stability' as '... that property which enables it to maintain its physical, chemical, and biological properties when subjected to a variety of challenges, e.g., heat, light, and moisture'.21 As a construct, drug stability is not operational. That is, the definition does not include how the construct can be operationalized as a variable. The variable might be defined in terms of changes in concentration over time, or in other ways.21,22

Two concerns involving construct validity are:

1. the appropriateness of the variables used to operationalize the constructs;
2. the validity of the generalizations made about the relationships between the dependent and independent variables of the experiment.

For example, in the first instance, if reduction of psychological depression were the effect being tested, it might make a difference whether the psychological construct of depression was defined by a ten-question paper-and-pencil survey to be completed by the patient, or by the clinical judgment of a trained psychologist. When two or more psychologists are used to make these judgments, differences in their theoretical perspectives, values, and clinical sensitivity could influence the validity of the variables representing the construct. The second issue is less explicit. It involves whether the same relationships found between the variables in an experiment can be extended to apply to all aspects of the constructs that the variables were chosen to represent.
Thus, if preparation X reduced depression according to the reports of a group of experimental subjects, whether a similar relationship would exist for the whole class of drugs which preparation X represents, and for all manner of measures of depression, is unknown. Since construct validity is related to generalizing, it is sometimes considered an aspect of external validity.12 The threats to construct validity listed below are illustrative of the kinds of problems often encountered in experimental research:


Inadequate development of underlying constructs

This results when constructs are not developed sufficiently because of inadequate research, faulty theory, poor definitions, or lack of a generally accepted standard meaning. For example, discomfort is a construct that is difficult to specify. The same discomfort that might mean minor distraction to some people would represent considerable discomfort and even pain to others. If discomfort were the construct operationalized in a study, it would have to be defined and explained very carefully to avoid ambiguity.

Inadequate operationalization of constructs

This may occur when constructs are inadequately operationalized into independent or dependent variables. For example, a study might compare differences in appetite between heavy smokers and light smokers. Although to the layman a heavy smoker is simply a person who smokes a lot, to a researcher this definition would be confusing. In addition to the vagaries associated with quantity of smoking, the confounding of such variables as age, number of years of smoking prior to the study, kind of smoking done (pipe, cigarettes, etc.), amount of time per day spent smoking, volume of smoke and tar taken into the lungs during an average day, and nicotine content of the tobacco adds to the difficulty of interpreting conclusions from such research. Indeed, additional variables might be necessary to clarify how light smokers differ from heavy smokers. To the extent that light and heavy smoking are not defined, the study will be difficult to interpret and will not offer highly generalizable results. In addition, the construct of appetite presents similar problems to the researcher.

Interaction of the subjects, experimenters, and/or the experimental setting

Such a threat can present obstacles to the construct validity of an experiment in several ways.
For example, an experimenter who is familiar with the purposes of the study might collect data in such a way that the test of the hypothesis is biased by his or her views and perceptions. This has been referred to as 'the experimenter bias effect'.23 The subjects' or the experimenters' interaction with the experimental setting can render the observed results applicable only in settings in which these effects similarly apply. Another problem occurs when subjects perceive the purpose of the experiment and attempt to provide 'good' results. Generally, these kinds of problems are handled through double-blind designs in drug research.24 However, over the course of lengthy experimental studies, subjects and/or experimenters may deduce the purpose of the study or some aspects of it, thus reducing the control offered by double-blind procedures in some settings.

STATISTICAL VALIDITY

Statistical validity of experiments is probably the most recognized aspect of validity in drug research. Statistical issues are discussed below from the


perspective of valid interpretation of experimental data revealed by statistical analysis. The position taken here is that statistically valid methods are necessary but not sufficient for valid conclusions in scientific drug experimentation. Specific aspects of statistical analysis are discussed below.

Power of a statistical test

This is the ability of the statistical test to find a difference between treatment conditions when there really is one. Because statistical precision increases with larger sample size, researchers' concerns about power tend to hover around sample size. This focus can lead to questionable validity of interpretation of results. When power is considered only as a problem of the sample size needed to reliably reveal differences, the clinical importance of the findings may be limited. For example, consider the instance where subjects' ratings of pain are used as a measure of the effectiveness of an experimental analgesic medication. In this study, subjects use a ten-point scale like the one below, and make ratings every half hour for the 4 hours in which the drug is supposed to be active in the body.

No pain                                        Very extreme pain
  1     2     3     4     5     6     7     8     9     10
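For a ten-point scale like the one above, the tension between statistical power and clinical importance can be made concrete by sketching how sample size scales with the difference to be detected. The following is an illustrative normal-approximation calculation, not a method from this paper; the standard deviation of 2.0 rating points is an assumed value chosen for the illustration.

```python
# Sketch: per-group sample size needed to detect a mean difference of
# `delta` between two groups with a two-sided z-approximation.
# Assumed values (not from the study described here): sd = 2.0.
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # ~0.84 for 80% power
    return 2 * ((z_alpha + z_beta) * sd / delta) ** 2

# Detecting a tiny 0.12-point difference demands thousands of subjects...
print(round(n_per_group(delta=0.12, sd=2.0)))   # prints 4360
# ...while a 2-point difference needs only a handful per group.
print(round(n_per_group(delta=2.0, sd=2.0)))    # prints 16
```

The point mirrors the text: with enough subjects, even a clinically trivial difference becomes statistically detectable, which is why the size of a meaningful difference should be fixed before the sample size is chosen.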

At the same time a control group is administered a placebo and makes the same ratings. With a large enough sample, a difference of 0.12 (e.g., experimental mean rating of 4.96, control mean rating of 5.08) might be statistically significant. Yet the clinical importance of this difference would be questionable due to the imprecision and low reliability of the measuring device. For this reason, many statistical references urge determination of clinically important differences before deciding on the sample size needed for acceptable power. In other words, too much power in a statistical test can lead to erroneous conclusions about meaningless differences. The term statistical significance refers only to the probability of a difference existing in the population from which the sample was drawn, not to the magnitude of the difference or even its clinical importance.

Assumptions about underlying models

These should be recognized. Statistical inference is based on assumptions about mathematical models which represent the hypothesized relationships between variables existing in the underlying data. Some analysis techniques are known to be robust to violations of these assumptions, yet this robustness does not apply to all assumptions under all conditions. Serious violations of assumptions in an analysis could lead to invalid conclusions. Further, underlying models may not be sensitive to relevant aspects of the data.33

Fishing and the error rate problem

Such threats become problematical when repeated statistical tests are performed without regard to the notion that each new inference is based on


probability. More precisely, each single statistical test carries with it a certain probability that the conclusion is in error. For example, the mean scores from two experimental groups are compared, and the conclusion is reached that the two values are significantly different. This means that if the population parameters were available, they would be expected to be different. This inference, based on the sample values, is made at a certain level of certainty expressed as a probability, usually referred to as 1 − α (with α being the probability of erroneously rejecting the null hypothesis, or, as statisticians say, making a Type I error). Assume that there were multiple scores obtained from each subject, and that the researcher wanted to individually test each pair of them. Every single test would have a probability of incorrectly finding a difference of α. If there are C tests to be performed, then the probability of incorrectly finding a difference between two of the means somewhere in the C tests is equal to:

1 − (1 − α)^C

Thus, assuming α = 0.05 and ten tests (that is, C = 10), the probability of falsely finding a difference in the data set is:

1 − (1 − 0.05)^10 ≈ 0.40
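The familywise error-rate formula can be checked numerically, and the effect of one common remedy can be shown alongside it. The Bonferroni correction used below is our illustrative addition, not a procedure the paper itself prescribes.

```python
# Numerical check of the familywise error rate 1 - (1 - alpha)^C.
alpha = 0.05   # per-test probability of a Type I error
C = 10         # number of independent tests

# Probability of at least one false positive somewhere in the C tests
familywise = 1 - (1 - alpha) ** C
print(round(familywise, 2))   # prints 0.4

# One conservative remedy (illustrative): the Bonferroni correction,
# which runs each test at alpha / C instead of alpha.
corrected = 1 - (1 - alpha / C) ** C
print(round(corrected, 3))    # prints 0.049, close to the nominal 0.05
```

This makes the text's point directly: ten unadjusted tests carry roughly a 40 per cent chance of at least one spurious finding, while the adjusted procedure restores the error rate to approximately the nominal level.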

In other words, there is a 40 per cent probability that a difference will be incorrectly found among the C comparisons, rather than the 0.05 assumed in each case. This kind of problem can be managed in the design of the analysis (e.g. by using a priori or post hoc procedures, or by other methods). Failure to do so can lead to faulty premises and mistaken conclusions.

Reliability of measures and treatment implementations

This can cause invalid conclusions due to inconsistency. Low reliability can particularly increase the tendency to miss a difference that exists in the population (Type II error). This is because low-reliability measures tend to increase the contribution of random variation to the estimate of the random error component of an experimental design.

Random and irrelevant variation in the experimental environment

This can contribute to larger error variance estimates in an experimental design and cause a researcher to miss treatment effects (i.e. make a Type II error). This can be particularly true in clinical or field trials where the experimenter has little influence over external sources of variation.

Random heterogeneity of subjects or experimental units

This can increase the estimate of error variance in a model and contribute to a tendency toward making a Type II error. Thus, when clinical or other


experimental subjects represent a wide spectrum of characteristics and conditions, these influences may mask, rather than enhance, systematic variation between treatment groups.

QUASI-EXPERIMENTS

For settings in which random assignment of subjects and strict control of the experimental conditions and environment are not possible, a quasi-experiment may be designed. Quasi-experiments are studies in which all the threats to experimental validity cannot be removed, but can be subjected to some control through design features. Quasi-experiments were originally identified as those studies which cannot utilize random assignment of subjects to groups as a way to equalize the groups prior to applying the experimental treatment conditions.11 However, in recognition of the fact that randomization may not render groups equivalent, particularly in experiments carried out over a lengthy time period, quasi-experiments are defined in the present paper as those studies which involve comparison of conditions or groups that may differ in more ways than just treatment conditions. Although many laboratory studies may justifiably be considered true experiments, because of the large variety of extraneous influences possible in clinical and field-based research, a large number of such studies may more properly be considered quasi-experimental.

The building of a valid quasi-experiment may be approached from the perspective that the various threats to validity possible in a given situation can be recognized and controlled. Quite acceptable experimental conclusion validity may be obtained by such careful design of quasi-experiments.12 Researchers and evaluators of research findings may use the validity threats listed in this paper to identify sources of alternative or conflicting explanations, or problems in drawing valid inferences from experimental results. Once the threats to the validity of an experiment have been identified, features may be built into the design to counter them.
For example, when a medication is tested in clinical trials over a period of time, an observer can be utilized to record changes in the subject's treatment setting (e.g., the primary physician changes the treatment plan, sudden deterioration in the patient's health, etc.). Thus, when a true experiment in which all sources of variation can be controlled is impossible, fairly useful quasi-experiments may be designed to account for potential threats to the validity of the experimental conclusions. A checklist containing the twenty potential sources of invalidity previously recognized has been included as an appendix to this paper.

CONCLUSIONS

An experiment is analogous to a logical argument designed to test, through observation and analysis, the validity of a conclusion. Just as a logical argument


may be valid because it contains a structure that would lead to a correct conclusion if the underlying premises are correct (but not necessarily when the underlying premises are incorrect), a valid experiment depends on the correctness of the assumptions and observations upon which the conclusions are based. In designing experiments and quasi-experiments, researchers should consider threats to internal and external validity as primary sources of invalidation. Threats to statistical validity seem more widely recognized in pharmacokinetics and clinical drug research than threats to internal, external, and construct validity. Further, the importance of these concerns is greater for field-based or clinical research than for research conducted in highly controlled laboratory settings. The checklist provided in the appendix should prove useful to consumers and producers of research alike for systematically evaluating the validity of research investigations.

REFERENCES

1. W. W. Hauck and S. Anderson, J. Pharmacokinet. Biopharm., 12, 83 (1984).
2. C. R. Buncher and J.-Y. Tsay, in Statistics in the Pharmaceutical Industry, C. R. Buncher and J.-Y. Tsay (Eds), Marcel Dekker, New York, 1981, p. 75.
3. C. W. Dunnett and M. Gent, Biometrics, 37, 213 (1977).
4. M. Gibaldi and D. Perrier, Pharmacokinetics, 2nd edn, Marcel Dekker, New York, 1982.
5. A. A. Nelson (Ed.), Research in Pharmacy Practice: Principles and Methods, American Society of Hospital Pharmacy, Bethesda, Maryland, 1981.
6. B. E. Rodda and R. L. Davis, Clin. Pharmacol. Ther., 25, 245 (1979).
7. L. B. Sheiner, Drug Metab. Rev., 15, 153 (1984).
8. W. J. Westlake, in Principles and Perspectives in Drug Bioavailability, J. Blanchard, R. J. Sawchuk and B. B. Brodie (Eds), Karger, Basel, 1979, p. 192.
9. B. Whiting, A. W. Kelman, and J. Grevel, Clin. Pharmacokinet., 11, 387 (1986).
10. I. M. Copi, Introduction to Logic, 2nd edn, Macmillan, New York, 1961.
11. D. T. Campbell and J. C. Stanley, Experimental and Quasi-experimental Designs for Research, Rand McNally, Chicago, Illinois, 1966, p. 1.
12. T. D. Cook and D. T. Campbell, Quasi-experimentation: Design and Analysis Issues for Field Settings, Rand McNally, Chicago, Illinois, 1979.
13. R. E. Kirk, Experimental Design: Procedures for the Behavioral Sciences, 2nd edn, Brooks/Cole, Belmont, California, 1982.
14. E. J. Mason and W. J. Bramble, Understanding and Conducting Research: Applications for Education and the Behavioral Sciences, McGraw-Hill, New York, 1978.
15. D. Krathwohl, Social and Behavioral Science Research, Jossey-Bass, San Francisco, 1985.
16. A. C. Fisher and S. Wallenstein, in Statistics in the Pharmaceutical Industry, C. R. Buncher and J.-Y. Tsay (Eds), Marcel Dekker, New York, 1981, p. 139.
17. C. R. Buncher and J.-Y. Tsay, in Statistics in the Pharmaceutical Industry, C. R. Buncher and J.-Y. Tsay (Eds), Marcel Dekker, New York, 1981, p. 1.
18. C. M. Metzler and G. L. Schooley, in Statistics in the Pharmaceutical Industry, C. R. Buncher and J.-Y. Tsay (Eds), Marcel Dekker, New York, 1981, p. 157.
19. J. D. Elashoff, Am. Educ. Res. J., 6, 383 (1969).
20. E. Babbie, The Practice of Social Research, Wadsworth, Belmont, California, 1983.
21. O. L. Davies and H. E. Hudson, in Statistics in the Pharmaceutical Industry, C. R. Buncher and J.-Y. Tsay (Eds), Marcel Dekker, New York, 1981, p. 355.
22. S. H. Willig, M. M. Tuckerman and W. S. Hitchings, in Good Manufacturing Practices for Pharmaceuticals: A Plan for Total Quality Control, Marcel Dekker, New York, 1975, p. 301.


23. R. Rosenthal and R. Rosnow (Eds), Artifact in Behavioral Research, Academic Press, New York, 1969.
24. F. Ederer, Am. J. Med., 58, 295 (1975).
25. S. D. Dubey, in Statistics in the Pharmaceutical Industry, C. R. Buncher and J.-Y. Tsay (Eds), Marcel Dekker, New York, 1981, p. 87.
26. J. W. Green, in Statistics in the Pharmaceutical Industry, C. R. Buncher and J.-Y. Tsay (Eds), Marcel Dekker, New York, 1981, p. 189.
27. B. E. Rodda and P. Huber, in Drug Absorption and Disposition, American Pharmaceutical Association Academy of Pharmaceutical Sciences, Washington, D.C., 1980.
28. W. J. Westlake, Int. J. Clin. Pharmacol., 11, 342 (1975).
29. W. J. Westlake, Biometrics, 30, 273 (1979).
30. S. K. Kachigan, Statistical Analysis, Radius, New York, 1986.
31. G. W. Snedecor and W. G. Cochran, Statistical Methods, 6th edn, Iowa State University Press, Ames, Iowa, 1967.
32. B. J. Winer, Statistical Principles in Experimental Design, McGraw-Hill, New York, 1962.
33. H. Scheffé, The Analysis of Variance, Wiley, New York, 1959.
34. A. Rescigno and J. S. Beck, J. Pharmacokinet. Biopharm., 15, 327 (1987).

APPENDIX

CHECKLIST FOR DETERMINING VALIDITY OF RESEARCH DESIGNS*

Hypothesis:
Type of sample and population:

Is random assignment to treatment groups possible? Yes [ ] No [ ]
Proposed design:

Validity problem | How it operates in this case | Effective counter to problem

Internal validity
1. History
2. Maturation
3. Testing
4. Instrumentation
5. Statistical regression
6. Selection
7. Mortality
8. Interaction among factors
9. Maintenance of treatment conditions over time

*Adapted from: Figure 5.5 of Mason, E. J., & Bramble, W. J., Understanding and Conducting Research: Applications in Education and the Behavioral Sciences. © McGraw-Hill Publishers, New York, 1989 (in press), by permission of authors and publisher.


External validity
10. Identity and representation of population
    (a) Interaction of selection and experimental arrangements
    (b) Reactive or interactive effects of testing and selection
    (c) Interaction of selection and maturation
11. Experimental settings or conditions
    (a) Interaction of experimental arrangements and treatments
    (b) Multiple treatment interference
    (c) Interaction of pre-test with treatment

Construct validity
12. Inadequate development of underlying constructs
13. Inadequate operationalization of constructs
14. Interaction of subjects, experimenters, and/or the experimental setting

Statistical conclusion validity
15. Power of statistical test
16. Assumptions about underlying models
17. Fishing and the error rate problem
18. Reliability of measures and treatment implementations
19. Random and irrelevant variation
20. Random heterogeneity of subjects and experimental units

