6.7 Methodology of intervention trials in individuals
Oxford Textbook of Public Health
Lawrence M. Friedman and Eleanor B. Schron
Ethical issues in intervention studies
Primary and secondary questions
Efficacy and effectiveness
Studies of equivalency
Clinical significance and statistical significance
Interventions versus intervention strategies
Quality of life and cost-effectiveness
Recruitment of participants
Data monitoring techniques
Whom to include
Reporting and interpretation
An intervention trial, or a clinical trial, has been defined in various ways and may be of several kinds. The International Conference on Harmonisation defines a clinical trial as ‘any investigation in human subjects intended to discover or verify the clinical, pharmacological, and/or other pharmacodynamic effects of an investigational product(s), and/or to identify any adverse reactions to an investigational product(s), and/or to study absorption, distribution, metabolism, and excretion of an investigational product(s) with the object of ascertaining its safety and/or efficacy’ (ICH 1996). This definition has the advantage of applying to all phases of a clinical trial. (This chapter will only address issues relating to the so-called phase III trial. For discussions of phase I and phase II studies, see FDA (1997).) The International Conference on Harmonisation definition has the disadvantage of not including trials of non-pharmacological or non-device interventions (that is, surgical procedures, diet, exercise). More generally, a phase III clinical trial may be defined as ‘a prospective study comparing the effects and value of intervention(s) against a control in human beings’ (Friedman et al. 1998).
Clinical trials are needed because only rarely is the precise pattern or outcome of a disease or condition known. It is not yet possible to identify all of the genetic and environmental factors that lead to disease progression, recovery, and relapse. Also rare is the treatment that is so overwhelmingly successful that even with a vague understanding of the course of the disease, it is possible to say that the treatment is obviously beneficial and has few major adverse effects. More often, the treatment, while useful, is less than perfect. Therefore, in order to determine the true balance of potential benefit and harm from a new treatment or intervention, it is necessary to compare people who have received the treatment with those who have not. Ideally, this comparison will be made in an unbiased objective manner so that, at the end, it is possible to say with reasonable assurance that any difference between those treated and those not treated is due to the treatment.
This chapter can only cover some of the key issues in clinical trials. For more extensive discussions, the reader is referred to any of several textbooks (Pocock 1983; Meinert and Tonascia 1986; Piantadosi 1997; Friedman et al. 1998) as well as journals such as Controlled Clinical Trials and Statistics in Medicine.
Ethical issues in intervention studies
The issue of the ethics of conducting clinical trials has generated considerable discussion and debate. Because interventions may be harmful, as well as helpful, and participants are asked to undergo potential hazards, discomforts, and expenditure of time, the question being addressed in any clinical trial must be important. Knowledge of the answer to the question must be worth these possible harms. In addition, there must be what has been termed ‘clinical equipoise’ (Freedman 1987). That is, there must be uncertainty as to the usefulness of the intervention among those knowledgeable about the intervention. Individual investigators or doctors may have personal beliefs about the benefits of a new intervention. Those beliefs may prevent them from participating in or entering participants into a clinical trial. The uncertainty in the medical community at large, however, is used to justify the conduct of the trial.
Informed consent of all study participants is essential. The nature of informed consent may differ across countries and cultures, but the concept of individual choice to join or not join a trial must be universal (CIOMS/WHO 1993; Levine 1993; World Medical Association Declaration of Helsinki 2000). In addition to informed consent at the beginning of a trial, it is sometimes necessary to modify the consent or to alert participants already in a trial to important new information. This can happen, for example, when an adverse effect that is important, but not so serious that the trial must be stopped, is noted. It may also be necessary to reinform participants when another clinical trial addressing a similar question or intervention is reported while the trial is ongoing (DHHS 1991).
Selection of the comparison group raises ethical issues. Clinical trials may compare a new or unproven intervention against standard therapy, against no therapy, against a placebo, or in combination with standard therapy against placebo in combination with standard therapy. Whenever the comparison is against no therapy or placebo, the ethics of not treating someone in the best possible way are raised. If indeed there is no good treatment, then it is not a problem. But if a treatment known to be beneficial exists, then a control consisting of no therapy or placebo must be carefully justified. This might be possible if there is no appreciable risk to health or discomfort for the time that effective therapy is withheld (Ellenberg and Temple 2000; Temple and Ellenberg 2000). Often, placebo-controlled trials or trials that have no treatment as the control use both the new intervention and the control (placebo or no treatment) in addition to the best known treatment or standard care. In such trials, the intent of evaluating the new intervention is not to replace an existing one, but to add to it. The ethics of this situation are similar to those where there is no known effective therapy.
Even when there is no known effective therapy, the ethics of using a placebo, and indeed of randomization, have been questioned (Hellman and Hellman 1991). Use of a placebo implies deception. The strictures of abiding by a study protocol reduce a clinician’s freedom to do what he or she thinks is in the best interest of the patient. The interests of the individual patient cannot be sacrificed for those of society. Conversely, it has been pointed out that a clinician’s views as to the best treatment are often misguided, that hunches about treatment are not particularly helpful to the patient, and that trials can be designed to take into account patient needs (Passamani 1991).
This last point is crucial. Trial design needs to incorporate the highest ethical standards. Whenever there is a potential conflict between the needs of the patient and those of the study, the interests of the patient must take precedence.
Primary and secondary questions
The most important factor in selection of the study design, population, and outcome measures is the question that is posed. Each intervention study has a primary question that is specified in advance and is used to determine the sample size. As implied by its name, there is typically just one primary question. It is a question that is important to answer and feasible to address. By feasibility is meant the ability to identify and enrol adequate numbers of participants, to employ the intervention in an effective and presumably safe manner, to ensure that there is adequate adherence to the protocol, and to measure the outcome accurately and completely. In addition to the primary question, there may be a variety of secondary questions. Secondary questions may be less important or less feasible to answer. There may be fewer outcome events, or the outcomes may be harder to measure. They may be explanatory. That is, they may help the investigator to understand the mechanism of action of the intervention by examining biochemical or physiological processes.
Study outcomes (also termed endpoints or response variables) may be of several sorts. One way of categorizing them is as either discrete or continuous. That is, they may consist of the occurrence of an event, such as a myocardial infarction or survival from cancer, or of a measurement, such as level of blood pressure or number of CD4 lymphocytes. For phase III trials, these outcomes are usually clinically important. That is, they may be fatal or serious non-fatal events, or other clinically meaningful conditions such as alleviation of pain, increased functional status, or change in an important risk factor such as cigarette smoking.
Regardless of the primary outcome, several features pertain (Friedman et al. 1998). Firstly, as mentioned above, it must be specified in advance and written in a protocol. Secondly, it must be capable of being assessed in the same way in all participants. Thirdly, it must be capable of unbiased assessment. Fourthly, it must be assessed in all, or almost all, of the participants. As discussed below, significant amounts of missing data can seriously affect the interpretation of the trial.
Clinical trials can require large numbers of participants, last for years, and be expensive. Trials with a continuous variable as the outcome require fewer participants than do trials with dichotomous outcomes. Also, if the outcome can be assessed before a clinical event has occurred, the study may be shorter in duration. Therefore, there is considerable interest in the use of surrogate outcomes, which are often continuous. A surrogate outcome is one that substitutes for a clinical outcome; it may not, in itself, be important to the participant. An example is blood pressure. Elevated blood pressure is important primarily because it is a risk factor for stroke and heart disease, not because it is generally symptomatic. It has been shown in numerous clinical trials of both diastolic and isolated systolic hypertension that treatment reduces the occurrence of stroke and heart disease (SHEP 1991; Psaty et al. 1997; Staessen et al. 1997). However, not all methods of reducing blood pressure are without risk. As we do not know all of the potential adverse effects of treatment, for some interventions in some people the risk may outweigh the benefits. This question has arisen with the use of calcium-channel-blocking agents and hypertension (Psaty et al. 1997). Similarly, we know that ventricular arrhythmias are associated with increased risk of sudden cardiac death (Bigger 1984). Therefore, for years it made sense to treat people with antiarrhythmic agents to reduce the occurrence of sudden death in those with heart disease and ventricular arrhythmias. Yet, when clinical trials of these agents were conducted, the results were sometimes unfavourable (CAST 1989, 1992; Waldo et al. 1995). Ventricular arrhythmia suppression is not a good surrogate for the clinical outcome of sudden cardiac death. Other examples of inadequate surrogate outcomes have been described (Fleming 1995; Fleming and DeMets 1996). 
Ideal characteristics of a surrogate outcome have been proposed (Prentice 1989), but these are unlikely to be fulfilled. Therefore, judgement as to the usefulness of a surrogate endpoint must be exercised. For phase II studies which do not attempt to address clinical questions, surrogate outcomes are entirely appropriate. For phase III trials, the kinds of issues that must be considered are the extent of correlation between the surrogate and the clinical event of interest, the ease or difficulty (and cost) of obtaining reliable surrogate outcome measurements on all of the participants, the feasibility of obtaining enough participants to conduct a clinical outcome study, the harm of a possibly wrong answer, and the urgency of obtaining an answer. With regard to the possibility of an incorrect answer if a surrogate outcome is used, this may be justified in certain circumstances. For example, if the disease or condition is life-threatening, doctors and patients may require less evidence of clinical benefit and may be less concerned with possible harm from an intervention. The results of a trial with a surrogate outcome may be sufficiently persuasive to allow use of the new intervention. Similarly, in truly life-threatening situations, getting an early answer using a surrogate outcome may outweigh the interest in getting a better, but delayed, answer using a clinical outcome. Early trials in AIDS used surrogate outcomes. At that time, no proven treatments were available.
Efficacy and effectiveness
Intervention studies are sometimes categorized as efficacy trials and effectiveness trials. An efficacy trial attempts to evaluate whether an intervention works under reasonably optimal circumstances. That is, if the active drug is taken as prescribed by essentially all in the intervention group, and if almost no one in the control group takes the active drug, will the drug alter some clinical outcome? An effectiveness trial allows for non-adherence to the assigned treatment; it resembles what is likely to happen in actual clinical practice. Most efficacy trials will be relatively short, as longer trials would have trouble maintaining optimal adherence (Friedman et al. 1999).
Studies of equivalency
Sometimes called positive control trials, studies of equivalency address whether the new intervention is as good as an agent known to be worthwhile. It is sometimes difficult to know that an agent is worthwhile. Not all agents proven to be beneficial at some previous time will always be so in all circumstances. This is particularly the case with drugs such as antidepressants (Ellenberg and Temple 2000; Temple and Ellenberg 2000). Therefore, simply showing that a new intervention is no worse than the standard one may not truly prove that the new one is better than placebo. Adding to the complexity is that ‘equivalency’ must be defined. It is not the same as failing to show a significant difference between the two agents. That could happen simply because the study has an insufficient number of participants, or because participants failed to adhere adequately to the treatments. Because the two agents cannot be shown to be identical (an infinite sample size would be needed), the new intervention must be shown to fall within some predefined boundary that is sufficiently close to the standard therapy. Defining how close will depend on the risks of inappropriately declaring the new agent to be effective and the feasibility of conducting a trial with a large enough sample size.
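The idea of a predefined equivalence boundary can be made concrete numerically. The following is a minimal sketch, not from the chapter; the function name, event rates, sample sizes, and margin are all hypothetical. It checks whether the two-sided 95 per cent confidence interval for the difference in event rates falls entirely inside the margin:

```python
import math

def within_equivalence_margin(p_new, p_std, n_new, n_std, margin, z=1.96):
    """Check whether the two-sided 95% CI for the difference in event
    rates (new minus standard) lies entirely inside +/- margin."""
    diff = p_new - p_std
    se = math.sqrt(p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std)
    lower, upper = diff - z * se, diff + z * se
    return -margin < lower and upper < margin

# Hypothetical numbers: event rates of 9.5% vs 10%, margin of 3 percentage points.
# With 2000 per arm the CI fits inside the margin; with 500 per arm it does not,
# illustrating how 'failing to show a difference' is not the same as equivalence.
print(within_equivalence_margin(0.095, 0.10, 2000, 2000, 0.03))  # True
print(within_equivalence_margin(0.095, 0.10, 500, 500, 0.03))    # False
```

Note that the same observed difference passes or fails depending only on sample size, which is why the margin, and the power to stay within it, must be fixed in advance.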
Clinical significance and statistical significance
An intervention study should have the ability to detect a clinically important difference between groups, if one exists. Conversely, simply showing that a statistically significant difference exists if the outcome is either not clinically meaningful or is so trivial in magnitude as to be unimportant is not worthwhile. Therefore, in determining the question to be posed, and the outcome to be measured, the issue of clinical significance needs to be considered. Factors that enter into the determination are the seriousness and prevalence of the condition or disease, the risks or cost of the intervention, and the usefulness of existing treatments (Friedman 1998).
Interventions versus intervention strategies
Not all interventions need to consist of single treatments. Sometimes, intervention strategies may be tested. For example, in some trials of hypertension treatment, a stepped care approach was used (Hypertension Detection and Follow-up Program Cooperative Group 1979; SHEP 1991). The intent was to see if successful lowering of elevated blood pressure resulted in reduction of stroke or heart disease. If the first antihypertensive agent did not adequately lower the blood pressure, another drug was used or added. At the end of this kind of study, it may not be possible to say that a particular drug is responsible for the observed benefits; rather, the strategy as a whole is. Sometimes, the strategy may incorporate non-pharmacological as well as pharmacological approaches (for example, diet as well as drug in order to reduce blood pressure).
In trials that compared coronary artery bypass graft surgery against medical therapy (CASS 1983), the comparison was not really coronary artery bypass graft versus medicine. Because, over the follow-up period, a large proportion of the participants in the medical arm received surgery, it was a strategy of early surgery versus surgery later if needed. Yusuf et al. (1994) reported 5-, 7-, and 10-year results from an overview of seven trials of coronary artery bypass graft surgery. At 5 years, 25 per cent of the participants assigned to the medical arms had received surgery; at 7 years 33 per cent had done so; at 10 years, 41 per cent had undergone surgery.
These kinds of studies can be important and valid, but the objectives need to be clearly stated. Otherwise, the study may be criticized for not truly making the intended comparisons.
Quality of life and cost-effectiveness
Not all study outcomes need be mortality or major morbidity. Many clinical trials have looked at outcomes such as quality of life or the cost-effectiveness of administering a particular intervention as compared with another.
Health-related quality of life is a multidimensional concept that characterizes an individual’s total well being and includes psychological, social, and physical dimensions (Naughton et al. 1996). Table 1 shows the kinds of factors generally measured in health-related quality of life instruments.
Table 1 Dimensions of health-related quality of life
Cost-effectiveness evaluation may be particularly important when interventions are expensive and where the difference between intervention and control on mortality or major morbidity is small. In a comparison of implantable cardiac defibrillators versus antiarrhythmic drugs, cost-effectiveness was a key secondary outcome. Patients with serious ventricular arrhythmias had a significant reduction in mortality from the defibrillator, but the costs were considerably greater (Larson et al. 1997). Hlatky et al. (1997) compared quality of life, employment status, and medical care costs during 5 years of follow-up among patients treated with angioplasty or bypass graft surgery. Those in the surgical group had a better quality of life than those in the angioplasty group. Only in a subset of the participants was the cost lower in the angioplasty group.
Though these outcomes are usually secondary ones, they have occasionally been used as the primary outcome in a trial. Croog et al. (1986) reported the results of a trial comparing quality of life assessment with two antihypertensive agents. Although the results can be dependent on participant selection and dose, in addition to the characteristics of the agents themselves, this trial did show the value of quality of life assessment as an outcome. Quality of life and psychosocial functioning may also be a key predictor of other outcomes, and are therefore important to measure (Ruberman et al. 1984).
Whom to include
A key part of defining the question to be answered is specifying the kinds of people who will be enrolled in the clinical trial. That is done by means of eligibility criteria, of which there are various sorts (Friedman et al. 1998). Firstly, eligible participants must have the potential to benefit from the intervention. That is, they must have the condition that the intervention might affect. Implicit in this is having the degree of severity at a time in the disease process that is modifiable. Also, any change in the condition must be detectable. That is, it cannot be so mild or slowly progressing that to detect a change, the study must be too large or last too long to be feasible. Secondly, participants cannot have known contraindications to the intervention. Thirdly, they should not have other conditions which would make it difficult to detect changes in the condition of interest. An obvious example is someone who has both heart disease and cancer. If a 3-year study of an intervention for the heart disease is planned and the expected survival due to the cancer is less than that, it is unlikely that the person will contribute to answering the question about heart disease. Fourthly, if the study requires participants to return for follow-up visits in order to assess the outcome, people who are unlikely to be able to do so should not be enrolled.
Figure 1 shows how the study participants are derived from the general population. People are excluded at various stages, based on the entry criteria. The final stage indicates that there are identified eligible participants who are not enrolled. This is because participating is strictly voluntary as a result of informed consent. Many people decide that they would prefer not to enrol in the trial.
Fig. 1 Relationship of study sample to study population and population at large (those with and without the condition under study). (Source: Friedman et al. 1998.)
The issue of who is and who is not enrolled in a trial raises the concepts of validity, generalization, and representativeness (Friedman et al. 1998). A properly designed and analysed trial will yield a valid result. That is, it will be possible to say whether or not the intervention is different from or better than the control, in the setting of the kinds of participants who were enrolled.
How narrow or broad the study sample is determines how much the results can be generalized. If the eligibility criteria are highly selective, then the results might only apply to that sort of participant. If the eligibility criteria are broad, with many identifiable kinds of participants, then the results would be more broadly applicable. The reasons for performing one or the other type of study will depend partly on how much is known about the mechanism of action of the intervention. Congestive heart failure may have several aetiologies. If it is known (or surmised) that the intervention only works in heart failure of a non-ischaemic origin, and therefore only such people are enrolled, then the results of the trial would only apply to people with non-ischaemic heart failure. Another reason for a narrowly defined study population might be concern over the risks versus benefits. For example, the first studies of blood pressure reduction were in people with quite elevated pressures (VA 1967, 1970; HDFP 1979). Any benefit from treatment would be easier to find because of the greater likelihood of clinical events in this high-risk group. Also, any adverse effects of the intervention would be more likely to be balanced by the benefits. Not until other trials were conducted in people with lower levels of blood pressure was it possible to say with certainty that such people should be treated. The first studies could not be extrapolated to the lower risk population.
An example of a trial that successfully enrolled a broad population is the Heart Outcomes Prevention Evaluation Study (Heart Outcomes Prevention Evaluation Study Investigators 2000). In that trial, an angiotensin-converting enzyme inhibitor was evaluated in over 9000 participants with either known vascular disease or diabetes plus a risk factor for cardiovascular disease, but without evidence of heart failure. Regardless of the type of patient, the intervention was found to be highly effective in reducing mortality and morbidity.
No clinical trial is truly representative of the population with the condition being studied. Investigators conduct trials in people to whom they have ready access, rather than a random sample of the population. Eligibility criteria exclude some people for study design reasons and not because it is thought that they would not respond to the intervention in the same way as those enrolled. Additionally, there are always differences between volunteers and non-volunteers. If one is rigid, the results would be applied only to people who are identical in all relevant ways to those in the trial. The key word is ‘relevant’. Judgement must be used in deciding to whom the results reasonably apply. Are the characteristics of the patient whom one wishes to treat different in respects likely to alter the effect of the intervention as observed in the trial?
In a parallel design study, participants are allocated to intervention or control and stay in that group until the end of the study. Although the typical study has two groups, one intervention and one control, many have more. Thus, there may be more than one intervention group and even more than one control group. When there are only two groups, the comparison is straightforward. When there are more than two groups, the comparisons can become complicated. For example, if there are three groups, two interventions and a control, there can be up to three main comparisons: each intervention against the control, and one intervention against the other. This has implications for the overall type I error and therefore for the sample size. Conservatively, one would correct for the number of comparisons, in this case dividing the α level by 3. Instead of requiring a p value of, for instance, 0.05 for significance, each comparison might require a p value of 0.0167. To maintain adequate power to achieve this level of significance, the sample size will need to increase considerably. The possibly lower event rates in the two intervention groups (assuming benefit from the interventions) will also lead to the need for a larger sample size. If only the comparisons of the two interventions against the control are of interest, there is less penalty, and the three-arm design may be more efficient than initiating two individual studies, as the same control group can be used. Even here, as will be seen in the section on sample size, the control group may need to be larger than if there is only one intervention group. Davis et al. (1996) have reported on the design of such a study, where four different types of antihypertensive agents are being compared.
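The conservative correction described above is simple arithmetic. A minimal sketch (the Bonferroni-style division is the only assumption):

```python
# Bonferroni-style correction for a three-arm trial with three pairwise comparisons
alpha = 0.05
n_comparisons = 3  # each intervention vs control, plus intervention vs intervention
per_comparison_alpha = alpha / n_comparisons

print(round(per_comparison_alpha, 4))  # 0.0167
```

With only the two intervention-versus-control comparisons of interest, the divisor would be 2 and each comparison would require p < 0.025, the smaller penalty mentioned above.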
If there is an interest in studying more than one intervention at a time, a factorial design study may be more efficient than a parallel design. The simplest factorial design is a two-by-two design. This design will have four groups: treatment A plus treatment B, treatment A plus the control for treatment B, control for treatment A plus treatment B, and control for treatment A plus control for treatment B. The last is the only group that has no exposure to either of the interventions being tested. When this design is analysed, there are two primary analyses. One compares the two groups receiving treatment A (with and without treatment B) against the two groups receiving the control for treatment A. The other compares the two groups receiving treatment B against the two groups receiving the control for treatment B. Because the study is designed with adequate power for comparing two groups against another two groups, it is unlikely that there will be adequate power to look at one group against another. This would usually only happen if both interventions show differences, and are additive. An example where this happened is the Second International Study of Infarct Survival (ISIS-2 1988). Factorial designs need not be just two by two. There can be more than two groups for each factor, or even more than two factors. In addition to efficiency, an advantage to the factorial design is that one might derive suggestions of differential effect of treatment in the presence or absence of the other treatment. However, this is also a weakness. If these so-called interactions are present, they may make it difficult to discern an overall effect, particularly if they go in opposite directions (Brittain and Wittes 1989). Examples of successful factorial designs, in addition to the International Study of Infarct Survival studies (1988, 1992), are the Physicians' Health Study (Hennekens and Buring 1989) and the Women's Health Initiative Study Group (1998).
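The pooling that underlies the two primary analyses of a two-by-two factorial design can be made concrete. This sketch (the group labels are ours, purely for illustration) shows how the four cells are combined into the two marginal comparisons:

```python
# Four cells of a 2x2 factorial design, keyed by (receives A?, receives B?)
cells = {
    (True, True): "A + B",
    (True, False): "A + control-for-B",
    (False, True): "control-for-A + B",
    (False, False): "control-for-A + control-for-B",
}

# Marginal analysis for treatment A: pool the cells across the B factor
a_groups = [label for (has_a, _), label in cells.items() if has_a]
a_controls = [label for (has_a, _), label in cells.items() if not has_a]

# Marginal analysis for treatment B: pool the cells across the A factor
b_groups = [label for (_, has_b), label in cells.items() if has_b]
b_controls = [label for (_, has_b), label in cells.items() if not has_b]

print(sorted(a_groups))    # ['A + B', 'A + control-for-B']
print(sorted(b_controls))  # ['A + control-for-B', 'control-for-A + control-for-B']
```

Each participant contributes to both marginal analyses, which is the source of the design's efficiency; the pooling is only valid, however, when there is no appreciable interaction between the two factors.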
Interestingly, for a three-arm parallel design, it is generally thought appropriate to adjust the α level for the number of comparisons. For a factorial design, however, the usual practice is not to make such an adjustment.
In the cross-over design, each participant serves as his or her own control (Friedman et al. 1998). In the simplest case, half of the participants would receive intervention followed by control, and the other half the reverse. The major advantage of this design is the smaller sample size. Because each participant is on both intervention and control, half the number of participants are needed. The sample may be even smaller, because the variability is less than in the standard parallel design. There are disadvantages, however. The most obvious one is that the outcomes must be reversible. A cross-over design is not possible if the primary outcome is mortality or a clinical event. A second disadvantage is that there is an assumption of no carry-over effect from one period to the next. If the effect of the intervention persists into the period when the control is being administered, then the apparent effect may be less than the real one. Often, to minimize the likelihood of carry-over, a washout period is inserted between the actual cross-over periods. Unfortunately, it is difficult to prove that a carry-over effect is absent and a participant has truly returned to baseline.
The optimal way of allocating intervention or control to clinical trial participants is by means of randomization. Randomization does not guarantee balance in all factors between the groups, but the chances of balance are increased. Unknown as well as known and measured characteristics are likely to be comparable when there is randomization. A properly performed randomization procedure also reduces the opportunity for investigator bias in the allocation of intervention or control. Finally, randomization guarantees that statistical tests of significance will be valid.
Randomization does not require a one-to-one allocation to intervention or control, only that the allocation be unpredictable. Alternative assignment or assignment based on day of the month (for example, odd or even) is predictable, and is not equivalent to randomization. Matching on the basis of important characteristics is also not considered randomization. Flipping an unbiased coin to determine whether a participant is assigned to group A or B can be a valid way of randomizing. In practice, however, tables of random numbers or computer-produced random numbers are more often used.
Randomized studies can, by definition, only have a concurrent control group. That is, a historical control study cannot have randomized allocation. This yields another advantage of randomization, namely that the participants are enrolled in the same time period in both the intervention and control groups. Therefore, temporal trends in care or in the nature of the condition being studied are equal in the two groups.
Several procedures for randomly assigning treatments to participants have been developed. The simplest are the fixed allocation procedures. If, for example, 20 participants are needed for a study, a coin may be tossed when each participant is entered. However, the likelihood that the number of participants in the two groups will be different (for instance, 12 to 8 or even more extreme) is about 50 per cent. As the sample size increases, the likelihood of such a large uneven split is reduced. If 100 participants are enrolled, the chance of a 60 to 40 split is only about 5 per cent (Friedman et al. 1998).
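The chance of an uneven split under simple coin-flip allocation can be computed exactly from the binomial distribution. A short sketch (the function name is ours; the exact value for 100 participants, about 5.7 per cent, is close to the chapter's rounded figure):

```python
from math import comb

def prob_uneven(n, k):
    """P(the larger group receives at least k of n participants) under
    coin-flip (p = 0.5) allocation, for k > n / 2."""
    one_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return 2 * one_tail  # either group could end up the larger one

print(round(prob_uneven(20, 12), 2))   # a 12-8 split or worse: about 0.50
print(round(prob_uneven(100, 60), 3))  # a 60-40 split or worse: about 0.057
```

This is why simple randomization is usually reserved for large trials, where the relative imbalance shrinks even though the absolute difference in group sizes may grow.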
Blocked randomization is commonly used because of this problem. In blocked randomization, equal numbers of participants in the groups are guaranteed after every few enrolments. For example, if the block size is four, and the sample size is 12, then after four, eight, and 12 participants are enrolled, there would be equal numbers in treatments A and B. This would be accomplished by specifying that each block of four would have two participants assigned to A and two to B. The order within the block of four would be randomized. Thus, it could be ABAB, AABB, ABBA, BABA, and so on. The hazard with this approach is that if the block size is known to the investigator and the treatments are not completely blinded, the last one of the block (and sometimes the last two) can be predicted. Therefore, often, the block size, as well as the order within the block, is random, and the investigator entering the participants is kept ignorant of the block size.
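The permuted-block scheme just described is easy to sketch in code. This is an illustrative implementation (the function and parameter names are ours, not from the chapter), with a fixed block size for simplicity:

```python
import random

def permuted_blocks(n_participants, block_size=4, arms=("A", "B"), seed=None):
    """Generate an allocation sequence in randomly permuted blocks: every
    complete block contains equal numbers of each arm, so the group sizes
    come back into balance each time a block is filled."""
    rng = random.Random(seed)
    per_arm = block_size // len(arms)
    sequence = []
    while len(sequence) < n_participants:
        block = list(arms) * per_arm  # e.g. ['A', 'B', 'A', 'B']
        rng.shuffle(block)            # random order within the block
        sequence.extend(block)
    return sequence[:n_participants]

schedule = permuted_blocks(12, block_size=4, seed=1)
# After 4, 8, and 12 enrolments the counts of A and B are equal
```

In practice, as noted above, the block size itself would also be varied at random and concealed from the enrolling investigator, so that the tail of a block cannot be predicted.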
Another advantage of blocked randomization arises if participant entry criteria are modified partway through a trial. In the absence of blocking, even if at the end of entering all participants there are more or less equal numbers in the groups, there may be imbalance in numbers when only some of the participants have been entered. If, because of lagging participant entry, the eligibility criteria are loosened, different sorts of participants may enrol later during recruitment than enrolled earlier. As a result, the characteristics of the participants in group A may differ from those in group B. With blocked randomization, equal numbers are ensured throughout the enrolment period, and changes in entry criteria would not lead to imbalances between the groups in type of participant.
Stratified randomization is a special kind of blocked randomization. Here, the investigator wishes to ensure that there is balance between groups A and B, not only in numbers of participants, but in kind of participant. If, despite the randomization process, there is concern that there will be imbalance between groups for one or two key highly prognostic variables, randomization can be stratified on those variables. Thus, within decades of age, for example, blocked randomization would occur. If sex is also a key variable, there would be blocked randomization within each age–sex category. The problem is that even with only two or three characteristics, each one having two or more levels, the number of strata can rapidly increase (Friedman et al. 1998). This can lead to unfilled cells unless many participants are being enrolled. If the sample size is large, randomization will generally lead to good balance, making stratification unnecessary. Therefore, stratified randomization should be done judiciously. If, after the trial is over, it is found that there is a major imbalance in a key factor, an adjusted analysis can be performed. In multicentre trials, randomization is usually done by centre, making the centre one of the important strata. This minimizes the chance that different sorts of participants or different medical practices among centres will confound the results.
In addition to the above fixed randomization procedures, there are various adaptive randomization procedures. In baseline adaptive procedures, the likelihood of randomization to A or B changes in order to reduce imbalances in selected characteristics. In response adaptive procedures, the likelihood of randomization to one or another group changes based on the occurrence of study outcomes. Adaptive randomization procedures are not used as frequently as fixed randomization. For more details of these procedures, see Friedman et al. (1998).
Clinical trials should be designed with adequate power to answer the question being posed. That is, by the end of the trial, there should be enough events or, in the case of a continuous response variable, sufficient precision of the estimate to say with reasonable assurance that the intervention does or does not have the postulated effect. Several factors are considered in the calculation of sample size. For dichotomous outcome studies, these factors are event rate in the control group, expected benefit from the intervention, level of adherence to the intervention, level of adherence to the control regimen, α level, and power. For continuous outcome studies, the mean and variance of the control and intervention groups, plus the level of adherence, α level, and power, would be the relevant variables.
Various references provide formulas for calculating sample size (Lachin 1981; Lakatos 1986; Wu 1988), as does Chapter 6.13. In essence, the factors that lead to the need for larger sample sizes in the dichotomous outcome situation are lower control group event rate, smaller benefit from the intervention (or lesser difference that one wants to detect), smaller α, greater power to detect a real difference (smaller β), and poorer adherence (or greater cross-over). Alpha is commonly selected to be 0.05 (two-sided); power is typically 0.8 to 0.9.
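For the dichotomous outcome case, the standard normal-approximation formula can be sketched as follows (a simplified version of the formulas in the references above; the function name and the example rates are ours):

```python
from statistics import NormalDist

def sample_size_two_proportions(p_control, p_intervention,
                                alpha=0.05, power=0.9):
    """Approximate per-group sample size for comparing two event
    rates with a two-sided test, using the standard normal
    approximation; the cited references give refinements."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_intervention) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p_control * (1 - p_control)
                             + p_intervention * (1 - p_intervention)) ** 0.5) ** 2
    return numerator / (p_control - p_intervention) ** 2

# 20% control event rate reduced to 15%, alpha 0.05, 90% power
n = sample_size_two_proportions(0.20, 0.15)
```

Under these assumptions the formula gives roughly 1200 participants per group; each of the factors listed above moves this number in the direction described.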
As discussed below, the preferred method of analysis is by ‘intention to treat’. This means that, in general, participants remain in the randomization group to which they have been assigned, regardless of their future actions or the degree to which they adhere to the assigned regimen. To the extent that those assigned to the intervention group fail to comply with the intervention, for example by not taking their medication (often called ‘drop-out’), the expected benefit from the intervention is reduced. Similarly, to the extent that those assigned to the control group begin taking the intervention (often called ‘drop-in’), the control group event rate is altered. The net effect of this non-adherence is a narrowing of the difference between the groups. This, in turn, leads to a larger sample size in order to maintain the same power to detect a real difference. Non-adherence can have an appreciable effect on sample size. A correction factor proposed by Lachin (1981) multiplies the needed sample size by 1/(1 – R0 – R1)², where R0 is the drop-out rate and R1 is the drop-in rate. Because the factor is squared, the sample size increases rapidly as soon as the combined non-adherence rate goes over 20 per cent. Even a combined non-adherence rate of only 10 per cent means a sample size increase of almost a quarter. More complicated sample size formulas take into account the fact that most non-adherence is not linear, but is often greater earlier in a trial than later (Lakatos 1986; Wu 1988). Another factor sometimes considered in sample size calculations is the estimated time for an intervention to make the postulated biological changes. For example, if cholesterol-lowering drugs act at least partly by reducing arterial plaque, then the time for that process to occur (so-called ‘lag time’) implies a larger sample size (and a longer study).
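The effect of the correction factor is easy to verify (a direct transcription of Lachin's formula; the function name is ours):

```python
def adherence_inflation(r0, r1):
    """Lachin (1981) sample size inflation factor 1/(1 - R0 - R1)**2,
    where r0 is the drop-out rate and r1 the drop-in rate."""
    return 1.0 / (1.0 - r0 - r1) ** 2

print(round(adherence_inflation(0.05, 0.05), 2))  # 10% combined non-adherence
print(round(adherence_inflation(0.10, 0.10), 2))  # 20% combined non-adherence
```

A combined non-adherence rate of 10 per cent inflates the sample size by about 23 per cent (almost a quarter, as noted above), and a combined rate of 20 per cent by more than half.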
As noted in the section above on study question, studies of equivalency may require large sample sizes, depending on what is meant by ‘equivalence’. Because the sample size formula contains the difference to be detected in the denominator, if zero difference is planned, the sample size would be infinite. Therefore, one typically specifies a difference d. If the two treatments show differences less than this, they are considered equal, or at least to have differences that are unimportant. Sample size formulas for such studies are available (Blackwelder and Chang 1984). It should be emphasized that, unlike studies where a difference is being sought, an underpowered study of equivalency will lead to the ‘desired’ outcome. That is, it will confirm the null hypothesis of no difference. Moreover, poor adherence will enhance the likelihood of seeing no difference for either the primary outcome or for adverse effects.
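The dependence on the specified difference d can be illustrated with a normal-approximation sketch (following the general approach in the reference above; the function name and the simplifying assumption of a common event rate are ours):

```python
from statistics import NormalDist

def equivalence_sample_size(p, delta, alpha=0.05, power=0.9):
    """Approximate per-group sample size to rule out a difference
    larger than delta between two treatments sharing event rate p
    (one-sided test). Halving delta quadruples the required n;
    as delta approaches zero, n grows without bound."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / delta ** 2

# shrinking the equivalence margin from 0.05 to 0.025 quadruples n
n_wide = equivalence_sample_size(0.20, 0.05)
n_narrow = equivalence_sample_size(0.20, 0.025)
```

Because n is proportional to 1/d², the choice of the equivalence margin dominates the sample size calculation.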
The needed sample size is an estimate because factors such as event rate and adherence are rarely known for certain. It may be prudent, therefore, to be conservative in the assumptions that enter into calculating sample size. The disadvantage of being conservative is that increased size or duration leads to increased cost. Also, entering more people into a trial than is necessary to answer the question may put more people at risk than is appropriate.
Recruitment of participants
A rule of thumb for all phase III clinical trials is that participant recruitment is always more difficult than expected. It is the uncommon clinical trial that finishes enrolment on schedule and the even rarer one that can do so without major recruitment strategy changes. Because recruitment is difficult, it is best to employ multiple strategies, to plan back-up strategies in advance, and to monitor progress closely throughout the enrolment period. Depending on the nature of the study population, back-up strategies might include adding sources of participants (for example, clinics, hospital units) or disseminating information about the trial more widely, to both medical personnel and potential participants. If sufficient resources are available, the strategies might include adding staff whose primary responsibilities involve enhancing enrolment or increasing incentives. The latter raises ethical questions if the incentives are inappropriate in amount or kind. Paying so-called ‘finder’s fees’, for example, would not be acceptable.
If participant enrolment remains slow, several options are available. One approach is to extend the time of enrolment. This has the advantage of not changing other study design factors, but the disadvantages of additional cost and delay in answering the question. A second approach is to accept the smaller sample size. Depending upon how large the shortfall is, the reduced power may not be too great. If the power goes from 90 to 85 per cent or even 80 per cent, that can be acceptable. However, if it falls much below 80 per cent, the study is likely to be underpowered. Some have argued that conducting even underpowered trials is useful, as cumulative meta-analyses of similar trials will yield the answers (Antman et al. 1992), but that approach is not recommended here. A third option is to change the entry criteria, so that more people are eligible. Depending on the original criteria, this might be feasible. However, care needs to be taken to make sure that other design assumptions, such as the expected effect of the intervention and the event rate, are not materially changed. In addition, as noted above, a blocked randomization scheme needs to be used to ensure that there is no gross imbalance between groups in participants enrolled before and after the criteria change.
A fourth option is to change the study outcome, so that fewer participants are needed to obtain the same number of events. A study may be originally designed with the outcome of death due to heart disease or non-fatal myocardial infarction. Because of limited resources, some of the other approaches to slow participant accrual are not feasible. It may be decided that the intervention is as likely to affect other important outcomes, such as need for coronary revascularization, as it is to affect myocardial infarction. Adding that event to the primary outcome would increase the event rate considerably, and allow for an answer with fewer participants. In another example, incidence of hypertension may be the outcome in a study looking at prevention of hypertension with weight loss. Instead of using incidence of hypertension as the primary event, mean blood pressure might be used. Going from a dichotomous outcome study to a study using a continuous variable as the outcome will reduce the needed sample size. These sorts of design changes should not be made lightly. They require considerable thought and review. If, because of the changes, the results are not persuasive to the outside community of practising clinicians, there is little point in undertaking them.
As discussed in the section above on sample size, adherence (or compliance) on the part of participants is a key factor in clinical trials. Poor adherence can reduce the power of a trial, and, if truly bad, can make the study results uninterpretable. Therefore, most investigators take steps when planning a study to minimize poor participant adherence. One approach is to design the study so that the regimen is as simple as possible. For medications, once-daily dosing is preferable to more frequent doses. For lifestyle interventions such as diet or exercise, simpler, more easily remembered programmes are better. Shorter trials have better chances of maintaining good adherence than longer ones. A second method is to select participants who are more likely to adhere. One way is by means of a run-in phase prior to randomization. Unless study participants must be enrolled immediately, for example at the time of an acute event, a run-in period can be used to determine who adheres to the regimen. Potential participants might be given the active medication for several weeks. At the end of that time, only those who took at least 80 per cent (or some other reasonable amount) of the drug would be enrolled. The participants who could not adhere, even over the short term, would not be randomized. This approach has been successfully used in the Physicians’ Health Study (PHS 1989; Glynn et al. 1994) and the Women’s Health Study (Buring and Hennekens 1992). Angiotensin-converting enzyme inhibitors may cause cough in some people. Therefore, to minimize the drop-out rate after randomization, studies of angiotensin-converting enzyme inhibitors have used a short run-in period to exclude those who might not tolerate the drug (Davis 1998). Excluding potential non-adherers on the basis of other demographic or psychosocial factors has been done, but the evidence that it successfully separates good from poor adherers is unclear (Dunbar-Jacob 1998).
Educating potential participants about the trial is not only good practice from an informed consent standpoint, but is likely to lead to the enrolment of participants who are better adherers. Being unduly persuasive in enrolling participants may improve the recruitment figures, but it can lead to worse adherence statistics. Because the analysis is done on an intention-to-treat basis, the study is more harmed by someone who drops out after enrolment than by someone who does not enrol.
A variety of techniques to maintain good adherence have been tried. Those that appear to be useful are frequent contact and reminders, providing easy transportation and access to attractive facilities, providing continuity of care, providing special medication dispensers, such as calendar packs, and involving family members, particularly when the intervention is lifestyle change (Schron and Czajkowski 2001). Other techniques include attention to aspects of the trial regimen, such as single-dose formulation for medication, intervention schedules made similar to those in clinical practice, and the use of specially trained personnel.
Adherence monitoring has two purposes. One is to be able to advise participants who are not complying with their regimen on how they might improve. The second is to be able to interpret the results of the trial more accurately. The first requires knowledge of individual adherence; the second requires only knowing how the groups are performing.
Monitoring individual adherence is important, but there is considerable debate about how accurately it can be done, except for interventions that take place entirely in clinics or hospitals (surgery, vaccine, periodic medication, food feeding studies). Self-reports are simple, but subject to considerable uncertainty. Participants may not remember accurately, and may have a desire to report better adherence than is truly the case. Assessment of activities such as nutritional intake and physical activity is particularly difficult. The use of diaries or other records may help, but such records still depend on accurate completion by the participant.
For studies involving medication, there are a variety of ways to assess adherence. Pill count is relatively simple, though there are studies that indicate that it over-reports adherence (Rand and Weeks 1998). Participants may forget to return the partially empty containers or may intentionally discard medication that was not taken. Laboratory measures of drug metabolites can be useful, but also may be misleading, as they do not reflect what was ingested long term or show the true pattern of medication usage. Use of special devices that register when a bottle cap is opened has been advocated (Rand and Weeks 1998). Electronic monitoring of this sort can provide a continuous record of dose taking. This probably provides a more accurate measure of adherence, but it is expensive. Even this technique does not prevent a participant from opening the bottle, removing a pill, and then discarding it.
Physiological or biochemical measures that reflect responses to the intervention can be used in some studies. For example, trials of cholesterol-lowering agents which have heart disease as the outcome would periodically measure lipid levels. These are not foolproof indicators of adherence, as individual responses vary, but they are particularly good at demonstrating that on average, after randomization, the intervention group has a different biochemical profile from the control group. One problem with using these sorts of measures as markers of individual adherence is that they may unmask the group to which a participant has been assigned.
Unless one is willing to go to considerable lengths and spend considerable resources, the simple measures of adherence are probably adequate for most purposes. They will certainly indicate gross problems overall, and allow the investigator to conclude, with reasonable assurance, that the intervention was or was not administered satisfactorily, and that there is or is not a difference between the groups in intermediate response variables or biomarkers. For the purposes of individual counselling, the more sophisticated assessments might be more useful than pill count, for example, but whether they are the best uses of limited funds is questionable.
Data monitoring is an essential part of any clinical trial. If the data become persuasive before the scheduled end of the trial, or if unexpected adverse events occur, the investigator is obligated either to stop the trial or to make necessary design changes. For many trials, the data monitoring function is undertaken by a person or group external to the study investigator structure. For masked studies, this helps to keep the investigator blinded. But more importantly, for all trials, an outside group is less likely to have a bias and less likely to want the study to continue inappropriately because of financial or other reasons. The primary function of this group is to maximize participant safety. Secondarily, it helps ensure the integrity of the trial.
In the process of data monitoring, several kinds of recommendations may be made. Firstly, and most commonly, there would be a recommendation to continue the trial without any change. Secondly, there might be a recommendation to modify the protocol in some way. Examples might be changing the participant entry criteria, changing the informed consent to take into consideration important new information, changing the frequency of certain tests to better ensure safety, or even dropping from the study certain types of participants for whom it may no longer be appropriate. Thirdly, there might be reason to recommend extending the trial. This could occur if the participant accrual rate is slower than expected or if the overall event rate is much lower than expected. Fourthly, there might be a recommendation to stop the trial early.
Data monitoring techniques
Regular data monitoring must be performed for ethical reasons. However, this carries a penalty.
If the null hypothesis, H0, of no difference between two groups is, in fact, true, and repeated tests of that hypothesis are made at the same level of significance using accumulating data, the probability that, at some time, the test will be called significant by chance alone will be larger than the significance level selected. That is, the rate of incorrectly rejecting the null hypothesis will be larger than what is normally considered to be acceptable (Friedman et al. 1998).
Therefore a variety of stopping boundaries or guidelines have been developed that maintain the overall prespecified α level. Biostatistics references can be consulted for the details of these methods. In essence, the methods fall into three categories: classical sequential, group sequential, and curtailed sampling. In the classical sequential approach (Whitehead 1983), there is no fixed sample size. Participant enrolment and the study end when boundaries for benefit or harm are exceeded. A theoretical advantage of the classical sequential approach is that fewer participants might need to be enrolled than in a fixed sample size design. This design requires study outcomes to occur relatively soon after enrolment, however, so that decisions about enrolment of new participants can be made. As a result, it may have limited usefulness.
The most commonly used monitoring techniques are the group sequential methods. Here, after a group of participants have been enrolled, or after a length of time, the data are examined. In order to conserve the overall α level, the study is not stopped early even if the p value falls below the nominal level (for example, 0.05). More extreme p values are required for early stopping. An example of such boundaries is that developed by O’Brien and Fleming (1979), which requires very extreme results early in a study, with boundaries that gradually become less extreme towards the end. If the study goes to the expected end, the significance value is essentially what it would be without any interim monitoring. An approach proposed by Haybittle (1971) and Peto et al. (1976) uses a constant extreme value throughout the trial, with the usual p value for significance at the end. Both of these techniques allow the final significance value to be what would be used without monitoring because of the low likelihood of stopping early, given the extreme nature of the boundaries. A modification of these techniques uses what is termed an α spending function (Lan and DeMets 1989). This technique allows for more flexible selection of the times when the data will be monitored.
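The shape of the O’Brien–Fleming boundaries can be illustrated with a short sketch: the nominal z-value at look k of K is proportional to the square root of K/k, with the final value taken from published tables (here 2.04, the tabulated value for five equally spaced looks at an overall two-sided α of 0.05; the function name is ours):

```python
from math import sqrt

def obrien_fleming_boundaries(n_looks, final_z):
    """Nominal z-value stopping boundaries for an O'Brien-Fleming
    group sequential design: very demanding at the early looks,
    relaxing to roughly the conventional critical value at the end."""
    return [final_z * sqrt(n_looks / k) for k in range(1, n_looks + 1)]

print([round(z, 2) for z in obrien_fleming_boundaries(5, 2.04)])
# very extreme early (z near 4.6), easing to 2.04 at the final look
```

The monotonically decreasing boundary is what allows the final significance test to remain essentially unchanged by the interim looks.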
Another modification of the group sequential methods employs asymmetric boundaries. As noted earlier, for many trials, even those that are not one-sided tests of the hypothesis, it would be inappropriate to continue a trial until the intervention is proven harmful, using the usual p value of 0.05. Therefore, instead of having the monitoring boundary for harm symmetric to the one for benefit, a less extreme monitoring boundary can be developed (DeMets and Ware 1982). Thus, if the one for benefit maintains the overall a at 0.05, the one for harm might maintain it at 0.1, or even less extreme. Even with one-sided tests of hypothesis, an advisory boundary for harm can be implemented, as was the case in the Cardiac Arrhythmia Suppression Trial (Pawitan and Hallstrom 1990).
Curtailed sampling addresses the probability of seeing a significant result if the trial were to continue to its end, given the data at the current time (that is, part way through the trial) (Lan and Wittes 1988). For example, if there is a strongly positive trend with three-quarters of the expected data in hand, one can examine the probabilities of having a statistically significant outcome under various assumptions regarding future data. A reasonable assumption might be that the control group event rate will continue, more or less, as it has been and that the null hypothesis is true. If, under those conditions, the outcome is still significant, there might be reason to stop the study. Conversely, if there is little or no benefit (or a trend towards harm) from the intervention, one might look at how large a benefit would be required from now on to see a significant benefit at the end. If there is little likelihood of that happening, the study might be stopped because continuation would be futile.
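A stochastic curtailment calculation of this kind can be sketched with the usual Brownian-motion approximation (the function name is ours; drift = 0 corresponds to assuming the null hypothesis for the remaining data, as in the example above):

```python
from math import sqrt
from statistics import NormalDist

def conditional_power(z_interim, info_fraction, drift=0.0, z_final=1.96):
    """Probability that the final z-statistic exceeds z_final, given
    the interim z at the stated information fraction and an assumed
    drift for the data still to come (Brownian-motion approximation)."""
    b_value = z_interim * sqrt(info_fraction)       # B-value at interim
    mean_final = b_value + drift * (1 - info_fraction)
    sd_remaining = sqrt(1 - info_fraction)
    return 1 - NormalDist().cdf((z_final - mean_final) / sd_remaining)

# strong trend (z = 3.0) with three-quarters of the information in hand:
# even if the null holds from here on, final significance is very likely
print(round(conditional_power(3.0, 0.75), 2))
```

Conversely, a near-null interim trend gives a conditional power close to zero under almost any plausible assumption about the remaining data, which is the quantitative basis for stopping for futility.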
Several descriptions of stopping decisions in clinical trials have been published (DeMets et al. 1982, 1984; Cairns et al. 1991; Friedman et al. 1993). These include stopping early for overwhelming benefit, clear harm, or futility. A decision to stop a trial early is irrevocable; therefore, such a decision must be made carefully. Whenever a recommendation is made to stop a study early, factors other than whether or not the monitoring boundaries are crossed need to be considered (Friedman et al. 1998). Might the results be due to imbalance in baseline characteristics or to bias in ascertainment of the outcome? Might poor adherence or differential use of concomitant therapy be important? What might be the impact of outcomes other than the primary one on the interpretation of the conclusions? Are there major unexpected adverse effects that need to be considered? How might other ongoing research affect the results? Will the results be persuasive to others? The issues are not just statistical in nature. If they were, there would be little need for a monitoring committee. Instead, an algorithm could be created which would make the decision. But because the decisions depend on a complex interaction of statistics, understanding of the biological mechanism of action of the intervention, knowledge of other research findings, and judgement as to how the results will be received and interpreted, decisions to recommend continuation or stopping are rarely easy and often second-guessed.
Whom to include
A purpose of assigning intervention or control by means of randomization is to ensure, as far as possible, that there is balance between the groups on both measured and unmeasured factors. Anything that alters that balance, such as removing from analysis some data from some participants, can induce bias. The reason is that it may be difficult to prove that the cause for the exclusion is unrelated to either a key baseline factor or to the intervention or control. Therefore, the general guideline for analysis is called ‘intention to treat’. That is, once randomization has taken place, the data from all participants should be included and counted in the group to which they are assigned.
There are several reasons why one would want to withdraw participants or data from the analysis. Firstly, it may be discovered, after enrolment, that a participant is not truly eligible for the trial. Therefore, that person would not contribute meaningfully to answering the question and, in fact, might confuse the issue by providing incorrect information. Also importantly, it might be hazardous for that person to be taking the intervention. Withdrawing such people from the study and the analysis might seem to be straightforward, but often it is not. If the decision that a person is ineligible is made after adverse effects or a clinical event has occurred, it might be viewed as an effort to manipulate the data. Eligibility criteria are commonly subjective and even in blinded trials there may be clues regarding the group to which the participant was assigned. Therefore, if participants are withdrawn from the trial because they are found to be ineligible, it must be done as soon as possible, before any events have occurred or follow-up measurements performed, and without knowledge of the treatment group. If that does not happen, then the best policy is to leave the participant in the study and analyse the data as if he or she is eligible. If it is possibly dangerous for the participant to be on the intervention, that can be discontinued without removing the person from the trial. If the percentage of ineligible people is small, that should not unduly affect the conclusions. If the percentage is large, such that the study integrity might be affected, then there is clearly a larger problem with the conduct of the trial.
A second reason for withdrawal of participants after randomization is poor adherence. As discussed in the sample size section, incomplete adherence to the protocol is best handled by increasing the number of people in the trial. Sometimes, however, it is decided to remove non-adherent participants from the analysis. The argument for doing this is that if they have not taken the intervention, there is no way that they can provide information as to its usefulness. The counter-argument is that lack of adherence may be a reflection of not being able to tolerate one or another of the treatments. Therefore, withdrawing poor adherers leads to an underestimate of the adverse effects. It also biases the analysis because those removed from one group are likely to be different from those removed from the other group. There have been attempts to adjust for non-adherence, but these approaches are questionable.
The classic example of how withdrawing poor adherers from analysis can lead to strange results is from the Coronary Drug Project, a trial of lipid lowering in heart disease patients (CDP 1975). As expected, those assigned to the active medication group who did not take it fared worse than those who did. But those assigned to placebo who did not take the placebo also fared worse than those who did (CDP 1980). This outcome could not be accounted for by measured differences between the adherers and non-adherers. Therefore, unknown confounding factors must have been present, as the difference is not attributable to an inert substance. It is best to include the data from all participants, regardless of level of adherence, in the analysis. If, despite the best efforts of the investigator, adherence is so poor as to compromise the integrity of the trial, then that itself says something about the usefulness of the intervention.
Poor quality or, in the extreme case, missing data is a third reason for withdrawing participants from analysis. Every effort must be made to minimize these. If participants are lost to follow-up, or do not return for key outcome measurements, the data will be missing. To the extent that this constitutes more than a few per cent of the total data, the study is severely impaired. The reason, again, is that there is no assurance that the missing data are independent of the treatment. If participants do not return to the clinic because one of the treatments makes them feel unwell, the data in that group will only be from those who are healthier or better able to tolerate the treatment.
Various statistical methods have been proposed to take into account missing data, but none is perfect (Liang and Zeger 1989; Espeland et al. 1992; Proschan et al. 2001). In general, they use prior data from the individual or some average from the group to which that individual is assigned to impute the most likely values for the missing data. These techniques can be useful, but as with simply censoring missing data, are limited if the missing data are strongly related to treatment.
The same factors apply to poor-quality data or outliers. Statistical techniques exist for deciding if unusual values are truly outliers (Dixon 1953). Ideally though, all analyses should be performed with and without including the outliers, as there may be important reasons for the apparently strange data that should not be ignored. One practice that is not encouraged is substituting data such as prior measurements from an individual for data that are thought to be incorrect or outlying.
Despite best efforts, study groups may turn out to have imbalances in important factors at baseline. In such cases, it is tempting to adjust for these imbalances. Unless the imbalance is large and the factor is highly correlated with the outcome, however, adjustment is unlikely to make a major difference. Simply showing that there is a statistically significant difference in a baseline covariate is not sufficient reason to adjust on that factor. Conversely, large and potentially important differences may not be statistically significant because of small numbers. Furthermore, there may be several covariates that are imbalanced in a similar direction. Individually, they may not be important, but in the aggregate, they may lead to enough of an imbalance that adjustment is useful. In summary, adjustment for baseline imbalances is legitimate, and should be explored if there are apparent differences, although usually it is unnecessary. Certainly, if adjustment converts a non-significant result to a significant one, it needs to be interpreted cautiously.
Conversely, adjustment for post-randomization variables is strongly discouraged. Level of adherence is one example of such a variable. Others might be biomarkers or similar interim measures of the effect of the intervention, as well as concomitant therapy. Because such variables are, or may be, related to the intervention, unlike the baseline factors, adjustment for them can lead to misleading interpretations. Response to an intervention can indicate better prognosis, even in the absence of the intervention (CDP 1980; Anderson et al. 1983). Adjustment on such a variable can make an intervention appear beneficial when it is not.
In every clinical trial it is tempting to look at the effects of the intervention in subgroups of participants. This is particularly the case with trials that show no significant difference overall. Even without overall benefit, there might be some subsets that indeed benefit. The problem is that with enough creativity, one can almost always find a group that benefits (and a group that is harmed) from the intervention. Even in trials that have significant overall differences, there is a desire to find the types of participants who benefit the most.
It is generally the case that qualitative interactions are uncommon (Peto 1995). That is, an intervention is unlikely to be beneficial in one subgroup and harmful in another. In contrast, it is quite plausible that there are differential relative effects; some kinds of people are indeed likely to be helped more than others. The problem is that unless the subgroups of interest are specified in advance, it is likely that most of the observed differences are due to chance. The best way of confirming that subgroup differences are real is to examine an independent dataset, usually from another trial of the same question. Somewhat weaker is using independent data from the same trial. This can be done if, during data monitoring, a possible subgroup difference is identified. The data accrued during the remaining period of the trial, on participants who have not yet had the event, can be confirmatory. Other approaches, such as looking at trends in subgroups defined by continuous variables, especially where there is biological plausibility, can also be used.
As noted, with enough imagination, apparent subgroup differences can be uncovered. ‘Fishing’, or ‘data-dredging’, is a natural activity, as unexpected subgroup findings can be important sources of new information and new hypotheses. Beyond raising new questions, however, conclusions should almost never be drawn from subgroups that were not prespecified. The examples of differences based on signs of the zodiac (ISIS-2 1988) or similar characteristics are cautionary.
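The multiplicity problem can be made concrete with a small simulation. Under the stated (and entirely artificial) assumptions of no true treatment effect and subgroups defined at random, nominally ‘significant’ subgroup effects still appear at roughly the rate the significance level predicts:

```python
import math
import random

random.seed(1)

def two_prop_z(e1, n1, e0, n0):
    """Normal-approximation z statistic comparing two proportions."""
    p = (e1 + e0) / (n1 + n0)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n0))
    return (e1 / n1 - e0 / n0) / se if se > 0 else 0.0

# One simulated trial with NO true treatment effect: event probability
# 0.2 in both arms. We then test 100 post-hoc subgroups defined entirely
# at random (playing the role of zodiac signs and similar characteristics).
n = 2000
arm = [i % 2 for i in range(n)]                  # 0 = control, 1 = treated
event = [random.random() < 0.2 for _ in range(n)]

spurious = 0
for _ in range(100):
    sub = [random.random() < 0.5 for _ in range(n)]    # a random subgroup
    e1 = sum(1 for i in range(n) if sub[i] and arm[i] and event[i])
    n1 = sum(1 for i in range(n) if sub[i] and arm[i])
    e0 = sum(1 for i in range(n) if sub[i] and not arm[i] and event[i])
    n0 = sum(1 for i in range(n) if sub[i] and not arm[i])
    if abs(two_prop_z(e1, n1, e0, n0)) > 1.96:         # nominal p < 0.05
        spurious += 1

# With a 5 per cent significance level, about 5 of the 100 subgroup tests
# are expected to be 'significant' by chance alone.
print(spurious, "of 100 random subgroups show a nominally significant effect")
```

Since the subgroups here carry no information whatsoever, any ‘significant’ finding among them is by construction spurious, which is the cautionary point of the zodiac example.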
A separate chapter is devoted to meta-analyses (Chapter 6.12). Therefore, only a brief summary is provided here. Meta-analyses can be important ways of synthesizing data. They enable researchers to incorporate multiple studies of the same question. Because of the added numbers of participants, they provide better estimates of intervention effects in subgroups. They allow one to put together several small studies to see if a larger study should be conducted to address a question more clearly. They do have potential limitations, however. Most important are the effort expended in collecting all of the relevant studies and the judgement that must go into the selection of the studies to be combined. Studies that show benefit from an intervention are more likely to be published (Dickerson et al. 1987). Therefore, if meta-analyses are not done carefully with clear criteria, biases can be introduced. Another limitation is that only some outcomes can be used. Typically, mortality or major morbidity is the outcome of interest. When deciding on whether or not an intervention is useful, other outcomes, such as adverse effects and quality of life, may be important, but are rarely incorporated into published meta-analyses. The ability to perform meta-analyses easily may lead to several inadequately powered studies yielding an overall statistically significant p value. There are examples of meta-analyses of small trials whose conclusions proved misleading once a single large trial was conducted (LeLorier et al. 1997). Any discouragement of the conduct of properly sized trials because of meta-analyses is unfortunate.
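As a rough sketch of the mechanics, a fixed-effect meta-analysis pools the log risk ratios of individual trials, weighting each by the reciprocal of its variance. The trial counts below are invented for illustration only; the approach assumes the trials genuinely address the same question.

```python
import math

# Invented counts from three small hypothetical trials of the same
# question: (events_rx, n_rx, events_ctl, n_ctl) per trial.
trials = [
    (12, 150, 20, 150),
    (8, 100, 14, 100),
    (15, 200, 22, 200),
]

def log_rr_and_var(a, n1, b, n0):
    """Log risk ratio and its approximate variance for one 2x2 table."""
    log_rr = math.log((a / n1) / (b / n0))
    var = 1 / a - 1 / n1 + 1 / b - 1 / n0
    return log_rr, var

weights_sum = 0.0
weighted_sum = 0.0
for a, n1, b, n0 in trials:
    log_rr, var = log_rr_and_var(a, n1, b, n0)
    w = 1 / var                          # inverse-variance weight
    weights_sum += w
    weighted_sum += w * log_rr

pooled = weighted_sum / weights_sum      # pooled log risk ratio
se = math.sqrt(1 / weights_sum)
lo, hi = pooled - 1.96 * se, pooled + 1.96 * se
print("pooled RR %.2f (95%% CI %.2f to %.2f)"
      % (math.exp(pooled), math.exp(lo), math.exp(hi)))
```

None of the three trials alone is decisive, but the pooled estimate is considerably more precise; this is the legitimate use of meta-analysis, while the danger noted above is letting such pooling substitute for a properly sized trial.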
Reporting and interpretation
Several guidelines to the proper reporting of clinical trial results have been published (Asilomar Working Group on Recommendations for Reporting of Clinical Trials in the Biomedical Literature 1996; Begg et al. 1996). In essence, they call for objective recording of all pertinent aspects. It is recognized that space limitations restrict the amount of information that can be included in a publication. The advent of journal websites, however, allows for the dissemination of supplementary material. Ideally, all of the following should be included in a clinical trial report (Friedman et al. 1998).
Background and rationale.
Specification of the primary question and the response variables used to assess it.
Prespecified secondary questions.
Nature of the study population, including eligibility criteria, major reasons why people were not entered, and the fact that informed consent was obtained.
Sample size calculations and the assumptions used for that calculation.
Basic study design features and allocation procedures.
Data collection procedures, including efforts to minimize bias, quality control, and event classification.
Presentation of key baseline characteristics, by group.
Process measures, such as adherence, concomitant therapy usage, performance of procedures, amount of missing or poor-quality data, and numbers of participants lost to follow-up.
Results for the primary outcome, secondary outcomes (prespecified and other), and adverse events. The statistics and tabulations should reflect the original intent and indicate whether the effects of repeated tests have been taken into account. Confidence intervals, relative risk reduction (or increase), and absolute risk reduction should be presented. How the data were analysed should also be noted.
Special analyses, such as subgroups, covariate adjustment, and data-derived hypotheses.
Interpretation, implications, and conclusions in the context of both study data and information external to the trial.
A structured abstract that accurately reflects the body of the paper.
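The risk summaries in the reporting checklist can be illustrated with invented counts. The number needed to treat, the reciprocal of the absolute risk reduction, is often reported alongside them; the confidence interval below uses a simple normal approximation.

```python
import math

# Invented counts for illustration only.
events_rx, n_rx = 60, 1000      # intervention arm
events_ctl, n_ctl = 90, 1000    # control arm

risk_rx = events_rx / n_rx                # 0.06
risk_ctl = events_ctl / n_ctl             # 0.09
arr = risk_ctl - risk_rx                  # absolute risk reduction
rrr = arr / risk_ctl                      # relative risk reduction
nnt = 1 / arr                             # number needed to treat

# Normal-approximation 95 per cent confidence interval for the ARR.
se = math.sqrt(risk_rx * (1 - risk_rx) / n_rx
               + risk_ctl * (1 - risk_ctl) / n_ctl)
lo_arr, hi_arr = arr - 1.96 * se, arr + 1.96 * se

print("ARR = %.3f (95%% CI %.3f to %.3f), RRR = %.1f%%, NNT = %.0f"
      % (arr, lo_arr, hi_arr, rrr * 100, nnt))
```

Reporting both measures matters because a one-third relative reduction can correspond to a very small absolute benefit when the underlying event rate is low.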
It is not easy to conduct well-designed clinical trials, and there are ethical issues that must be considered. Nevertheless, there is no substitute for good clinical trials in providing important information for clinical use and public health about the possible benefits of interventions. As a result of the development of clinical trial technologies over the past several decades, more clinical decisions are evidence based. Improvements in trial design and analysis are continuing and will have further impact, as will increasing knowledge about genetics, better understanding of disease aetiologies and processes, and pharmacology.
Anderson, J.R., Cain, K.C., and Gelber, R.D. (1983). Analysis of survival by tumor response. Journal of Clinical Oncology, 1, 710–19.
Antman, E.M., Lau, J., Kupelnick, B., et al. (1992). A comparison of results of meta-analyses of randomized control trials and recommendations of clinical experts. Journal of the American Medical Association, 268, 240–8.
Asilomar Working Group on Recommendations for Reporting of Clinical Trials in the Biomedical Literature (1996). Checklist of information for inclusion in reports of clinical trials. Annals of Internal Medicine, 124, 741–3.
Begg, C., Cho, M., Eastwood, S., et al. (1996). Improving the quality of reporting of randomized controlled trials: the CONSORT statement. Journal of the American Medical Association, 276, 637–9.
Bigger, J.T. Jr (1984). Identification of patients at high risk for sudden cardiac death. American Journal of Cardiology, 54, 3–8D.
Blackwelder, W.C. and Chang, M.A. (1984). Sample size graphs for ‘proving the null hypothesis’. Controlled Clinical Trials, 5, 97–105.
Brittain, E. and Wittes, J. (1989). Factorial designs in clinical trials: the effects of non-compliance and subadditivity. Statistics in Medicine, 8, 161–71.
Buring, J.E. and Hennekens, C.H. (1992). The Women’s Health Study: summary of the study design. Journal of Myocardial Ischemia, 4, 27–39.
Cairns, J., Cohen, L., Colton, T., et al. (1991). Issues in the early termination of the aspirin component of the Physicians’ Health Study. Annals of Epidemiology, 1, 395–405.
CASS (Coronary Artery Surgery Study) Principal Investigators and their Associates (1983). Coronary Artery Surgery Study (CASS): a randomized trial of coronary artery bypass surgery: survival data. Circulation, 68, 939–50.
CAST (Cardiac Arrhythmia Suppression Trial) Investigators (1989). Preliminary report: effect of encainide and flecainide on mortality in a randomized trial of arrhythmia suppression after myocardial infarction. New England Journal of Medicine, 321, 406–12.
CAST (Cardiac Arrhythmia Suppression Trial) II Investigators (1992). Effect of the antiarrhythmic agent moricizine on survival after myocardial infarction. New England Journal of Medicine, 327, 227–33.
CDP (Coronary Drug Project) Research Group (1975). Clofibrate and niacin in coronary heart disease. Journal of the American Medical Association, 231, 360–81.
CDP (Coronary Drug Project) Research Group (1980). Influence of adherence to treatment and response of cholesterol on mortality in the Coronary Drug Project. New England Journal of Medicine, 303, 1038–41.
CIOMS/WHO (Council for International Organizations of Medical Science/World Health Organization) (1993). International ethical guidelines for biomedical research involving human subjects. CIOMS/WHO, Geneva.
Croog, S.H., Levine, S., Testa, M.A., et al. (1986). The effects of antihypertensive therapy on the quality of life. New England Journal of Medicine, 314, 1657–64.
Davis, B.R., Cutler, J.A., Gordon, D.J., et al. (1996). Rationale and design for the Antihypertensive and Lipid Lowering Treatment to Prevent Heart Attack Trial (ALLHAT). ALLHAT Research Group. American Journal of Hypertension, 9, 342–60.
Davis, C.E. (1998). Prerandomization compliance screening: a statistician’s views. In The handbook of health behavior change (ed. S.A. Shumaker, E.B. Schron, J.K. Ockene, and W.L. McBee), pp. 485–90. Springer, New York.
DeMets, D.L. and Ware, J.H. (1982). Asymmetric group sequential boundaries for monitoring clinical trials. Biometrika, 69, 661–3.
DeMets, D.L., Williams, G.W., Brown, B.W. Jr, et al. (1982). A case report of data monitoring experience: the Nocturnal Oxygen Therapy Trial. Controlled Clinical Trials, 3, 113–24.
DeMets, D.L., Hardy, R., Friedman, L.M., and Lan, K.K.G. (1984). Statistical aspects of early termination in the Beta-Blocker Heart Attack Trial. Controlled Clinical Trials, 5, 362–72.
DHHS (Department of Health and Human Services) (1991). National Institutes of Health (NIH) and Office of Protection from Research Risks (OPRR) Reports. Protection of human subjects, Title 45, Code of Federal Regulations Part 46, pp. 4–17 (http://ohrp.osophs.dhhs.gov/humansubjects/guidance/45cfr46.htm).
Dickerson, K., Chan, S., Chalmers, T.C., et al. (1987). Publication bias and clinical trials. Controlled Clinical Trials, 8, 343–53.
Dixon, W.J. (1953). Processing data for outliers. Biometrics, 9, 74–89.
Dunbar-Jacob, J. (1998). Predictors of patient adherence: patient characteristics. In The handbook of health behavior change (ed. S.A. Shumaker, E.B. Schron, J.K. Ockene, and W.L. McBee), pp. 491–511. Springer, New York.
Ellenberg, S.S. and Temple, R. (2000). Placebo-controlled trials and active-control trials in the evaluation of new treatments. Part 2: Practical issues and specific cases. Annals of Internal Medicine, 133, 464–70.
Espeland, M.A., Byington, R.P., Hire, D., et al. (1992). Analysis strategies for serial multivariate ultrasonographic data that are incomplete. Statistics in Medicine, 11, 1041–56.
FDA (Food and Drug Administration) (1997). Department of Health and Human Services (DHHS). Federal Register, 17 December, pp. 66 113–19 (http://www.fda.gov/cder/guidance/1857fnl.pdf).
Fleming, T.R. (1995). Surrogate markers in AIDS and cancer trials. Statistics in Medicine, 13, 1423–35.
Fleming, T.R. and DeMets, D.L. (1996). Surrogate end points in clinical trials: are we being misled? Annals of Internal Medicine, 125, 605–13.
Freedman, B. (1987). Equipoise and the ethics of clinical research. New England Journal of Medicine, 317, 141–5.
Friedman, L. (1998). Clinical significance vs. statistical significance. In Encyclopedia of biostatistics (ed. P. Armitage and T. Colton), pp. 676–8. Wiley, Chichester.
Friedman, L.M., Furberg, C.D., and DeMets, D.L. (1998). Fundamentals of clinical trials (3rd edn). Springer, New York.
Friedman, L.M., Simons-Morton, D.G., and Cutler, J.A. (1999). Comparative features of primordial, primary, and secondary prevention trials. In Clinical trials in cardiovascular disease (ed. C.H. Hennekens). W.B. Saunders, Philadelphia, PA.
Glynn, R.J., Buring, J.E., Manson, J.E., et al. (1994). Adherence to aspirin in the prevention of myocardial infarction. Archives of Internal Medicine, 154, 2649–57.
Haybittle, J.L. (1971). Repeated assessment of results in clinical trials of cancer treatment. British Journal of Radiology, 44, 793–7.
Heart Outcomes Prevention Evaluation Study Investigators (2000). Effects of an angiotensin-converting enzyme inhibitor, ramipril, on cardiovascular events in high-risk patients. New England Journal of Medicine, 342, 145–53.
Hellman, S. and Hellman, D.S. (1991). Of mice but not men: problems of the randomized clinical trial. New England Journal of Medicine, 324, 1585–9.
Hennekens, C.H. and Buring, J.E. (1989). Methodologic considerations in the design and conduct of randomized trials: the US Physicians’ Health Study. Controlled Clinical Trials, 10, 142S–50S.
Hlatky, M.A., Rogers, W.J., Johnstone, I., et al. (1997). Medical care costs and quality of life after randomization to coronary angioplasty or coronary bypass surgery. Bypass Angioplasty Revascularization Investigation (BARI) Investigators. New England Journal of Medicine, 336, 92–9.
Hypertension Detection and Follow-up Program Cooperative Group (1979). Five-year findings of the hypertension detection and follow-up program. I. Reduction in mortality of persons with high blood pressure, including mild hypertension. Journal of the American Medical Association, 242, 2562–71.
ICH (International Conference on Harmonisation) (1996). Guidance for industry. E6 good clinical practice: consolidated guidance. April (http://www.fda.gov/cder/guidance/9595nl.pdf).
ISIS-2 (Second International Study of Infarct Survival) Collaborative Group (1988). Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17 187 cases of suspected acute myocardial infarction: ISIS-2. Lancet, ii, 349–60.
ISIS-3 (Third International Study of Infarct Survival) Collaborative Group (1992). ISIS-3: a randomised comparison of streptokinase vs tissue plasminogen activator vs anistreplase and of aspirin plus heparin vs aspirin alone among 41 299 cases of suspected acute myocardial infarction. Lancet, 339, 753–70.
Lachin, J.M. (1981). Introduction to sample size determination and power analysis for clinical trials. Controlled Clinical Trials, 2, 93–113.
Lakatos, E. (1986). Sample size determination in clinical trials with time-dependent rates of losses and noncompliance. Controlled Clinical Trials, 7, 189–99.
Lan, K.K.G. and DeMets, D.L. (1989). Group sequential procedures: calendar versus information time. Statistics in Medicine, 8, 1191–8.
Lan, K.K.G. and Wittes, J. (1988). The B-value: a tool for monitoring data. Biometrics, 44, 579–85.
Larson, G.C., McAnulty J.H., and Hallstrom A. (1997). Hospitalization charges in the Antiarrhythmics Versus Implantable Defibrillators (AVID) Trial: the AVID economic analysis study. Circulation, 96, 1–77.
LeLorier, J., Gregoire, G., Benhaddad, A., et al. (1997). Discrepancies between meta-analyses and subsequent large randomized, controlled trials. New England Journal of Medicine, 337, 536–42.
Levine, R.J. (1993). New international ethical guidelines for research involving human subjects. Annals of Internal Medicine, 119, 339–41.
Liang, K.Y. and Zeger, S.L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.
Meinert, C.L. and Tonascia, S. (1986). Clinical trials design, conduct, and analysis. Oxford University Press, New York.
Naughton, M.J., Shumaker, S.A., Anderson, R., and Czajkowski, S.M. (1996). Psychological aspects of health-related quality of life measurement: tests and scales. In Quality of life and pharmacoeconomics in clinical trials (2nd edn) (ed. B. Spilker), pp. 117–53. Lippincott-Raven, Philadelphia, PA.
NIH (National Institutes of Health) (1994). Women’s Health Initiative Study Protocol.
O’Brien, P.C. and Fleming, T.R. (1979). A multiple testing procedure for clinical trials. Biometrics, 35, 549–56.
Passamani, E. (1991). Clinical trials—are they ethical? New England Journal of Medicine, 324, 1589–92.
Pawitan, Y. and Hallstrom, A. (1990). Statistical interim monitoring of the Cardiac Arrhythmia Suppression Trial. Statistics in Medicine, 9, 1081–90.
Peto, R. (1995). Clinical trials. In Treatment of cancer (3rd edn) (ed. P. Price and K. Sikora), pp. 1039–43. Chapman & Hall, London.
Peto, R., Pike, M.C., Armitage, P., et al. (1976). Design and analysis of randomized clinical trials requiring prolonged observations of each patient. I. Introduction and design. British Journal of Cancer, 34, 585–612.
PHS (Physicians’ Health Study) Steering Committee of the Research Group (1989). Final report on the aspirin component of the ongoing Physicians’ Health Study. New England Journal of Medicine, 321, 129–35.
Piantadosi, S. (1997). Clinical trials. A methodologic perspective. Wiley, New York.
Pocock, S.J. (1983). Clinical trials. A practical approach. Wiley, New York.
Prentice, R.L. (1989). Surrogate endpoints in clinical trials: definition and operational criteria. Statistics in Medicine, 8, 431–40.
Proschan, M.A., McMahon, R.P., Shih, J.H., et al. (2001). Sensitivity analysis using an imputation method for missing binary data in clinical trials. Journal of Statistical Planning and Inference, in press.
Psaty, B.M., Smith, N.L., Siscovick, D.S., et al. (1997). Health outcomes associated with antihypertensive therapies used as first-line agents. A systematic review and meta-analysis. Journal of the American Medical Association, 277, 739–45.
Rand, C.S. and Weeks, K. (1998). Measuring adherence with medication regimens in clinical care and research. In The handbook of health behavior change (ed. S.A. Shumaker, E.B. Schron, J.K. Ockene, and W.L. McBee), pp. 114–32. Springer, New York.
Ruberman, W., Weinblatt, E., Goldberg, J.D., et al. (1984). Psychosocial influence on mortality after myocardial infarction. New England Journal of Medicine, 311, 552–9.
Schron, E.B. and Czajkowski, S.M. (2001). Clinical trials. In Compliance in healthcare and research (ed. L.E. Burke and I.S. Ockene). Futura, New York.
SHEP (Systolic Hypertension in the Elderly Program) Cooperative Research Group (1991). Prevention of stroke by hypertensive drug therapy in older persons with isolated systolic hypertension: final results of the systolic hypertension in the elderly program. Journal of the American Medical Association, 265, 3255–64.
Staessen, J.A., Fagard, R., Thijs, L., et al. (1997). Randomised double-blind comparison of placebo and active treatment for older patients with isolated systolic hypertension. Lancet, 350, 757–64.
Temple, R. and Ellenberg, S.S. (2000). Placebo-controlled trials and active-control trials in the evaluation of new treatments. Part 1: Ethical and scientific issues. Annals of Internal Medicine, 133, 455–63.
VA (Veterans Administration) Cooperative Study Group on Antihypertensive Agents (1967). Effects of treatment on morbidity in hypertension: results in patients with diastolic blood pressures averaging 115 through 129 mmHg. Journal of the American Medical Association, 202, 1028–34.
VA (Veterans Administration) Cooperative Study Group on Antihypertensive Agents (1970). Effects of treatment on morbidity in hypertension: II. Results in patients with diastolic blood pressures averaging 90 through 114 mmHg. Journal of the American Medical Association, 213, 1143–52.
Waldo, A.L., Camm, J.A., de Ruyter, H., Friedman, P.L., MacNeil, D.J., and Pitt, B. (1995). Survival with oral D-sotalol in patients with left ventricular dysfunction after myocardial infarction: rationale, design, and methods (the SWORD Trial). American Journal of Cardiology, 75, 1023–27.
Whitehead, J. (1983). The design and analysis of sequential clinical trials. Halstead Press, New York.
Women’s Health Initiative Study Group (1998). Design of the Women’s Health Initiative clinical trial and observation study. Controlled Clinical Trials, 19, 61–109.
World Medical Association Declaration of Helsinki (2000). Ethical principles for medical research involving human subjects. http://www.wma.net/e/policy/17-c_e.html
Wu, M. (1988). Sample size for comparison of changes in the presence of right censoring caused by death, withdrawal, and staggered entry. Controlled Clinical Trials, 9, 32–6.
Yusuf, S., Zucker, D., Peduzzi, P., et al. (1994). Effect of coronary artery bypass graft surgery on survival: overview of 10-year results from randomised trials by the Coronary Artery Bypass Graft Surgery Trialists Collaboration. Lancet, 344, 563–70.