6.12 Systematic reviews and meta-analysis

Oxford Textbook of Public Health

Matthias Egger, George Davey Smith, and Jonathan A. C. Sterne

Systematic review, overview, or meta-analysis?
The scope of meta-analysis
Historical notes
Why do we need systematic reviews?

A patient with myocardial infarction in 1981

Limitations of a single study

Limitations of traditional narrative reviews

Systematic reviews—a more transparent appraisal

The epidemiology of results

What was the evidence in 1981?
Steps in carrying out systematic reviews

Developing a review protocol

Objectives and eligibility criteria

Literature search

Selection of studies, assessment of methodological quality, and data extraction
Meta-analysis: presenting, synthesizing, and interpreting data

Measures of treatment effect


Graphical display for meta-analysis

Heterogeneity between study results

Random-effects meta-analysis

Cumulative meta-analysis

Bayesian meta-analysis

Deriving absolute measures of effect
Sources of bias in systematic reviews and meta-analysis

Publication bias

Bias in location of studies

Biased inclusion criteria
Investigating and dealing with bias and heterogeneity

Sensitivity analysis

Funnel plots

Statistical methods to detect and correct for bias
Spurious precision? Meta-analysis of observational studies

Confounding, residual confounding, and bias

Plausible but equally spurious findings?

Exploring sources of heterogeneity
Chapter References

The volume of data that needs to be considered by practitioners and researchers is constantly expanding. In many areas it has become impossible for the individual to read, critically evaluate, and synthesize the state of current knowledge, let alone keep updating this on a regular basis. Reviews have become essential tools for anybody who wants to keep up with the new evidence that is accumulating in his or her field of interest. Reviews are also required to identify areas where the available evidence is insufficient and further studies are required. However, since Mulrow (1987) and Oxman and Guyatt (1988) drew attention to the poor quality of conventional narrative reviews it has become clear that these are an unreliable source of information. In response there has, in recent years, been increasing focus on formal methods of systematically reviewing studies, to produce explicitly formulated, reproducible, and up-to-date summaries of the effects of health-care interventions. This is illustrated by the sharp increase in the number of reviews that used formal methods to synthesize evidence (Fig. 1).

Fig. 1 Number of publications concerning systematic reviews and meta-analysis 1986 to 1998. Results from Medline search using text word and medical subject heading (MeSH) ‘meta-analysis’ and text word ‘systematic review’.

This chapter discusses terminology and scope, provides some historical background, and examines the potentials and pitfalls of systematic reviews and meta-analysis.
Systematic review, overview, or meta-analysis?
A number of terms are used concurrently to describe the process of systematically reviewing and integrating research evidence, including ‘systematic review’, ‘meta-analysis’, ‘research synthesis’, ‘overview’, and ‘pooling’. Chalmers and Altman (1995) defined systematic review as a review that has been prepared using a systematic approach to minimizing biases and random errors, an approach that is documented in a materials and methods section. A systematic review may, or may not, include a meta-analysis, which is a statistical analysis of the results from independent studies, and which generally aims to produce a single typical estimate of a treatment effect (Huque 1988). The distinction between systematic review and meta-analysis is important because it is always appropriate and desirable to review a body of data systematically, but it may sometimes be inappropriate, or even misleading, to pool results statistically from separate studies (O’Rourke and Detsky 1989).
The scope of meta-analysis
A clear distinction should be made between meta-analysis of randomized controlled trials and meta-analysis of epidemiological studies. Trials of high methodological quality that examined the same intervention in comparable patient groups will provide unbiased estimates of the underlying treatment effect and the variability between trials can confidently be attributed to random variation. Meta-analysis of these trials will provide an equally unbiased estimate of the treatment effect, with an increase in the precision of this estimate. A fundamentally different situation arises in the case of epidemiological studies. As discussed in detail below, due to the effects of confounding and bias, observational studies may produce estimates of associations that deviate from the truth beyond what can be attributed to chance. Combining a set of epidemiological studies will thus often provide spuriously precise, biased estimates of associations. Davis (1992) has written:
Meta-analysis begins with scientific studies, usually performed by academics or government agencies, and sometimes incomplete or disputed. The data from the studies are then run through computer models of bewildering complexity, which produce results of implausible precision.
While systematic reviews have clear advantages over conventional reviews, it is crucial to understand the limitations of meta-analysis and the importance of exploring sources of heterogeneity and bias. Much emphasis will be given to these issues in this chapter.
Historical notes
Efforts to compile summaries of research for medical practitioners who struggle with the amount of information that is relevant to medical practice are not new. Chalmers and Tröhler (2000) drew attention to two journals published in the eighteenth century in Leipzig and Edinburgh, Comentarii de Rebus in Scientia Naturali et Medicina Gestis and Medical and Philosophical Commentaries, which published critical appraisals of important new books in medicine, including, for example, Withering’s now classic An Account of the Foxglove and Some of its Medical Uses (1785) on the use of digitalis for treating heart disease.
The statistician Pearson was, in 1904, probably the first medical researcher reporting the use of formal techniques to combine data from different studies. The rationale for pooling studies put forward by Pearson in his account of the preventive effect of serum inoculations against enteric fever is still one of the main reasons for undertaking meta-analysis today: ‘Many of the groups … are far too small to allow of any definite opinion being formed at all, having regard to the size of the probable error involved’ (Pearson 1904). Notably, Pearson’s conclusions did not go unchallenged, and heated correspondence in the British Medical Journal followed his publication (Susser 1977). Meta-analysis has continued to attract controversy since Pearson’s time.
In the following decades meta-analytical techniques were developed and applied mainly in the social sciences, in particular psychology and educational research. In 1976 the psychologist Glass coined the term ‘meta-analysis’ in a paper entitled ‘Primary, secondary and meta-analysis of research’ (Glass 1976). Three years later the physician and epidemiologist Cochrane drew attention to the fact that people who want to make informed decisions about health care do not have ready access to reliable reviews of the available evidence (Cochrane 1979). During the 1980s meta-analysis became increasingly popular in medicine, particularly in the fields of cardiovascular disease (Yusuf et al. 1985), oncology (Early Breast Cancer Trialists’ Collaborative Group 1988), and perinatal care (Chalmers et al. 1989). Meta-analysis of epidemiological studies (Greenland 1987) and ‘cross design synthesis’ (General Accounting Office 1992), the integration of observational data with the results from meta-analyses of randomized clinical trials, were also advocated. In recent years the Cochrane Collaboration (Box 1) facilitated numerous developments in the science of research synthesis, many of which are discussed in this chapter.
Box 1 The Cochrane Collaboration

The Cochrane Collaboration logo (Fig. 2) depicts the results of a systematic review of seven placebo-controlled trials of a short inexpensive course of a corticosteroid given to women about to give birth too early. A schematic representation of the forest plot (see text) is shown. The first of these trials was reported in 1972, and the last in 1980. The diagram summarizes the evidence that would have been revealed had the available trials been reviewed systematically: it shows strong evidence that corticosteroids reduce the risk of babies dying from the complications of immaturity. Because no systematic review of these trials was published until 1989, most obstetricians practising during the 1980s did not realize that the treatment was so effective, reducing the odds of neonatal and postnatal death by 30 to 50 per cent. As a result, tens of thousands of premature babies have probably suffered and died unnecessarily. By 1991, seven more trials had been reported, and the evidence had become still stronger.

Fig. 2 The logo of the Cochrane Collaboration.

The aim of the Cochrane Collaboration is to help people make well-informed decisions about health care by preparing, maintaining, and promoting the accessibility of systematic reviews in all areas of health care. The Collaboration was founded in 1993 and registered as a charity soon after, and more than 4000 health professionals, scientists, and people using the health services had participated in it by the end of the twentieth century. The main work is done by about 50 Collaborative Review Groups that take on the central task of preparing and maintaining Cochrane reviews. The members of these groups have come together because they share an interest in ensuring the availability of reliable up-to-date summaries of evidence relevant to particular health problems. The work of Collaborative Review Groups and other entities is co-ordinated and supported by 15 Cochrane Centres around the world. The Cochrane Library, which may be purchased on CD-ROM or subscribed to on the internet, is the main output of the Collaboration. The Library contains the Cochrane Database of Systematic Reviews, a rapidly growing collection of regularly updated Cochrane reviews, and the Cochrane Controlled Trials Register, a bibliography of over 250 000 controlled trials. See Chalmers and Haynes (1994), Bero and Rennie (1995), Dickersin and Manheimer (1998), or Oxman (2001), or visit http://www.cochrane.org/, for more information on the Cochrane Collaboration.

Why do we need systematic reviews?
A patient with myocardial infarction in 1981
A probable scenario in the early 1980s, when discussing the discharge of a patient who had suffered an uncomplicated myocardial infarction, is as follows. A keen junior doctor asks whether the patient should receive a β-blocker for secondary prevention of a future cardiac event. After a moment of silence the consultant states that this is a question which should be discussed in detail at the Journal Club on Thursday. The junior doctor (who now regrets that she asked the question) is told to assemble and present the relevant literature. Her Medline search identifies four clinical trials (Table 1).

Table 1 Conclusions from four randomized controlled trials of β-blockers in secondary prevention after myocardial infarction

When reviewing the conclusions from these trials the doctor finds them to be rather confusing and contradictory (Table 1). Her consultant points out that the sheer amount of research published makes it impossible to keep track of and critically appraise individual studies. He recommends a good review article. Back in the library the junior doctor finds an article which the British Medical Journal published in 1981 in a ‘Regular Reviews’ section (Mitchell 1981). This narrative review concluded that ‘Thus, despite claims that they reduce arrhythmias, cardiac work, and infarct size, we still have no clear evidence that beta-blockers improve long-term survival after infarction despite almost 20 years of clinical trials’ (Mitchell 1981).
The junior doctor is relieved. She presents the findings of the review article, the Journal Club is a success, and the patient is discharged without a β-blocker.
Limitations of a single study
Sampling variability means that treatment effect estimates will vary, even between studies performed in exactly the same way in identical populations. The smaller the study, the larger will be the sampling variability. Because the number of patients included in trials is often inadequate (Freiman et al. 1992), a single study often fails to detect, or exclude with certainty, a modest but important difference in the effects of two therapies. A trial may thus show no statistically significant treatment effect when in reality a clinically important effect exists—it may produce a false-negative result. A recent examination of 1941 trials relevant to the treatment of schizophrenia showed that only 58 (3 per cent) studies were large enough to detect an important improvement (Thornley and Adams 1998). In some cases the required sample size may be difficult to achieve. A drug that reduces the risk of death from myocardial infarction by 10 per cent could, for example, delay many thousands of deaths each year in the United Kingdom alone. In order to detect such an effect with 90 per cent certainty, over 10 000 patients in each treatment group would be needed (Collins et al. 1992).
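The arithmetic behind such sample size statements can be sketched with the usual normal-approximation formula for comparing two proportions. The 10 per cent control-group mortality assumed below is illustrative only; it is not a figure taken from Collins et al.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(p_control, rel_risk_reduction, alpha=0.05, power=0.90):
    """Approximate per-group sample size to detect a relative risk
    reduction when comparing two proportions (two-sided test,
    normal approximation)."""
    p1 = p_control
    p2 = p_control * (1 - rel_risk_reduction)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Hypothetical 10 per cent control-group mortality; a 10 per cent
# relative reduction means distinguishing 10 per cent from 9 per cent
print(n_per_group(0.10, 0.10))   # well over 10 000 per group
```

Larger treatment effects need far smaller trials, which is why modest but clinically important effects are the ones most likely to be missed by individual studies.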
The meta-analytical approach appears to be an attractive alternative to such a large, expensive, and logistically problematic study. Data from a number of smaller, but comparable, trials evaluating the same or a similar drug are considered together. In this way the necessary number of patients may be reached, and relatively small effects can be detected or excluded with confidence. Systematic reviews can also contribute to considerations regarding the applicability of study results. The findings of a particular study might be felt to be valid only for a population of patients with the same characteristics as those investigated in that study. If trials have been done in different groups of patients, with similar results being seen in the various trials, then it can be concluded that the effect of the intervention under study has some generality. By putting together all available data, meta-analyses are also better placed than individual trials to answer questions regarding whether or not an overall study result varies among subgroups—for example, among men and women, older and younger patients, or participants with different degrees of severity of disease.
Limitations of traditional narrative reviews
Traditional narrative reviews have a number of disadvantages that systematic reviews may overcome. Firstly, the conventional narrative review is subjective and therefore prone to bias and error (Teagarden 1989). Mulrow (1987) showed that among 50 reviews published in the mid-1980s in leading general medicine journals, 49 reviews did not specify the source of the information and failed to perform a standardized assessment of the methodological quality of studies. Our junior doctor could have consulted another review of the same topic, published in the European Heart Journal in the same year. This review concluded that ‘it seems perfectly reasonable to treat patients who have survived an infarction with timolol’ (Hampton 1981). Without guidance by formal rules, reviewers will inevitably disagree about issues as basic as what types of studies it is appropriate to include and how to balance the quantitative evidence they provide.
The tendency for selective inclusion of studies that support the author’s view is illustrated by the observation that the frequency of citation of clinical trials is related to their outcome, with studies in line with the prevailing opinion being quoted more frequently than unsupportive studies (Gøtzsche 1987; Ravnskov 1992). Once a set of studies has been assembled a common way to review the results is to count the number of studies supporting various sides of an issue and to choose the view receiving the most votes. This is clearly unsound, since it ignores sample size, treatment effect size, and research design. It is thus not surprising that reviewers using traditional methods often reach opposite conclusions (Mulrow 1987) and miss small, but potentially important, differences (Cooper and Rosenthal 1980). In controversial areas the conclusions drawn from a given body of evidence may be associated more with the specialty of the reviewer than with the available data (Chalmers et al. 1990). By systematically identifying, scrutinizing, tabulating, and perhaps statistically combining all relevant studies, systematic reviews allow a more objective appraisal.
Systematic reviews—a more transparent appraisal
An important advantage of systematic reviews is that they render the review process transparent. In traditional narrative reviews it is often not clear how the conclusions follow from the data examined. In an adequately presented systematic review it should be possible for readers to replicate the quantitative component of the argument. To facilitate this, it is valuable if the data included in meta-analyses are either presented in full or made available to interested readers by the authors. The increased openness and clarity required should lead to the replacement of unhelpful descriptors such as ‘some evidence of a trend’, ‘a weak relationship’, and ‘a strong relationship’ (Rosenthal 1990).
The epidemiology of results
The tabulation, exploration, and evaluation of results are important components of systematic reviews. As discussed in more detail below, this can be taken further to explore sources of heterogeneity and test new hypotheses that were not posed in individual studies. This has been termed the ‘epidemiology of results’ where the findings of an original study replace the individual as the unit of analysis (Jenicek 1989). Systematic reviews can thus lead to the identification of the most promising or the most urgent research question, and may permit a more accurate calculation of the sample sizes needed in future studies. This is illustrated by an early meta-analysis of four trials that compared different methods of monitoring the fetus during labour (Chalmers 1979). The meta-analysis led to the hypothesis that, compared with intermittent auscultation, continuous fetal heart monitoring reduced the risk of neonatal seizures. This hypothesis was subsequently confirmed in a single randomized trial of almost seven times the size of the four previous studies combined (MacDonald et al. 1985).
What was the evidence in 1981?
What conclusions would our junior doctor have reached if she had had access to a systematic review of the β-blocker trials? A total of 13 such trials had in fact been published by the end of 1981 (Table 2, discussed in more detail below). Using meta-analysis (see below) to combine the results of these 13 trials, the relative risk of mortality comparing patients treated with β-blocker with those treated with placebo is estimated as 0.78 (95 per cent confidence interval 0.69–0.88, p < 0.001). Thus conclusive evidence of the life-saving potential of this treatment, though available, was ignored.

Table 2 Characteristics of long-term trials comparing β-blockers with controls
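The kind of calculation that produces such a summary estimate can be sketched as follows. The fixed-effect inverse-variance method shown is one standard way of combining log relative risks; the per-trial figures below are invented for illustration and are not those of Table 2.

```python
from math import exp, log, sqrt

def pool_fixed_effect(log_rrs, ses):
    """Fixed-effect (inverse-variance) pooling of log relative risks.
    Each trial is weighted by the inverse of its variance, so larger,
    more precise trials contribute more to the summary estimate."""
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * lr for w, lr in zip(weights, log_rrs)) / sum(weights)
    se_pooled = sqrt(1 / sum(weights))
    return (exp(pooled),                     # summary relative risk
            exp(pooled - 1.96 * se_pooled),  # lower 95 per cent limit
            exp(pooled + 1.96 * se_pooled))  # upper 95 per cent limit

# Three hypothetical trials, each summarized by its log relative risk
# and the standard error of that log relative risk
rr, lower, upper = pool_fixed_effect(
    [log(0.70), log(0.85), log(0.95)],
    [0.15, 0.10, 0.20])
```

From a summary relative risk and an assumed control-group risk p0, absolute measures follow: the absolute risk reduction is p0 × (1 − RR), and the number needed to treat is its reciprocal (see the discussion of measures of treatment effect below).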

Steps in carrying out systematic reviews
Developing a review protocol
Systematic reviews should be viewed as observational studies of the evidence. The steps involved, summarized in Box 2, are similar to any other research undertaking: formulation of the problem to be addressed, collection and analysis of the data, and interpretation of the results. Likewise, a detailed study protocol which clearly states the question to be addressed, the subgroups of interest, and the methods and criteria to be employed for identifying and selecting relevant studies and extracting and analysing information should be written in advance. This is important to avoid bias being introduced by decisions that are influenced by the data. For example, studies which produced unexpected or undesired results may be excluded by post hoc changes to the inclusion criteria. Similarly, unplanned data-driven subgroup analyses are likely to produce spurious results (Oxman and Guyatt 1992; Gelber and Goldhirsch 1993). The review protocol should ideally be conceived by a group of reviewers with expertise in both the content area and the science of research synthesis.
Box 2 Steps in conducting a systematic review

Formulate the review question

Define inclusion and exclusion criteria:


interventions and comparisons


study designs and methodological quality

Locate studies and develop a search strategy considering the following sources:

the Cochrane Controlled Trials Register (CCTR)

electronic databases and trials registers not covered by CCTR

checking of reference lists

hand-searching of key journals and conference abstracts

personal communication with experts in the field

Select studies:

have eligibility checked by more than one observer

develop a strategy to resolve disagreements

keep a log of excluded studies, with reasons for exclusions

Assess study quality:

consider assessment by more than one observer

use simple checklists rather than quality scales

always assess concealment of treatment allocation, blinding, and handling of patient attrition

consider blinding of observers to authors, institutions, and journals

Extract data:

design and pilot data extraction form

consider data extraction by more than one observer

consider blinding of observers to authors, institutions, and journals

Analyse and present results:

tabulate results from individual studies

examine the forest plot

explore possible sources of heterogeneity

consider meta-analysis of all trials or subgroups of trials

perform sensitivity analyses and examine funnel plots

make a list of excluded studies available to interested readers

Interpret results:

consider limitations, including publication and related biases

consider the strength of evidence

consider applicability

consider numbers needed to treat to benefit/harm

consider economic implications

consider implications for future research
NB: Points 1 to 7 (all steps up to and including the analysis) should be addressed in the review protocol.

Objectives and eligibility criteria
The formulation of detailed objectives is at the heart of any research project. This should include the definition of study participants, interventions, outcomes, and settings. As with patient inclusion and exclusion criteria in clinical studies, eligibility criteria can then be defined for the type of studies to be included. They relate to the quality of trials and to the combinability of patients, treatments, outcomes, and lengths of follow-up. As discussed below, formulating assessments regarding study quality can be a subjective process, however, especially since the information reported is often inadequate for this purpose (Moher et al. 1996a). It is therefore generally preferable to define only basic inclusion criteria, to assess the methodological quality of component studies, and to perform a thorough sensitivity analysis, as illustrated below.
Literature search
The search strategy for the identification of the relevant studies should be clearly delineated. Identifying controlled trials for systematic reviews has become more straightforward in recent years (Lefebvre and Clarke 2001). Appropriate terms to index randomized trials and controlled trials were introduced in the widely used bibliographic databases Medline and Embase by the mid-1990s. However, tens of thousands of trial reports had been included prior to the introduction of these terms. In a painstaking effort the Cochrane Collaboration (Box 1) checked the titles and abstracts of almost 300 000 Medline and Embase records which were then re-indexed as clinical trials if appropriate. It was important to examine both Medline and Embase because the overlap in journals covered by the two databases is only about 34 per cent (Smith et al. 1992). The majority of journals indexed in Medline are published in the United States whereas Embase has better coverage of European journals (Lefebvre and Clarke 2001). Finally, thousands of reports of controlled trials have been identified by manual searches of journals, conference proceedings, and other sources.
All trials thus identified have been included in The Cochrane Controlled Trials Register which is available in The Cochrane Library (Anonymous 1997) on CD-ROM or online. This register currently includes over 250 000 records and is clearly the best single source of controlled trials for inclusion in systematic reviews. However, searches of Medline and Embase are still required to identify trials that were published recently (see http://www.cochrane.org/cochrane/hbappend.htm for search strategies). Specialized databases, conference proceedings, and the bibliographies of review articles, monographs, and the located studies should be scrutinized as well. Finally, hand-searching of key journals should be considered, keeping in mind that many journals are already being searched by the Cochrane Collaboration.
The search should be extended to include unpublished studies, as their results may systematically differ from published trials. A systematic review which is restricted to published evidence may produce distorted results due to publication bias (see below). Registration of trials at the time they are established (and before their results become known) would eliminate the risk of publication bias (Dickersin 1994). A number of such registers have been set up in recent years and access to these has improved, for example through the Cochrane Collaboration’s Register of Registers (see http://www.cochrane.org/cochrane/hbappend) or the internet-based Meta-Register of Controlled Trials which has been established by the publisher Current Science (see http://www.controlled-trials.com/). Colleagues, experts in the field, contacts in the pharmaceutical industry, and other informal channels can also be important sources of information on unpublished and ongoing trials.
Selection of studies, assessment of methodological quality, and data extraction
Decisions regarding the inclusion or exclusion of individual studies often involve some degree of subjectivity even if clear inclusion and exclusion criteria were formulated in the protocol. It is therefore useful to have two observers checking eligibility of candidate studies, with disagreements being resolved by discussion or a third reviewer.
Randomized controlled trials provide the best evidence of the efficacy of medical interventions but they are not immune to bias. Studies relating methodological features of trials to their results have shown that trial quality influences effect sizes (Jüni et al. 2001). Inadequate concealment of treatment allocation, resulting, for example, from the use of open random number tables, is on average associated with larger treatment effects (Chalmers et al. 1983; Schulz et al. 1995; Moher et al. 1998). Larger effects were also found if trials were not double-blind (Schulz et al. 1995). In some instances effects may also be overestimated if some participants, for example, those not adhering to study medications, were excluded from the analysis (Sackett and Gent 1979; May et al. 1981; Peduzzi et al. 1993). Although widely recommended, the assessment of the methodological quality of clinical trials is a matter of ongoing debate (Moher et al. 1996a). This is reflected by the large number of different quality scales and checklists that are available (Moher et al. 1995). Empirical evidence (Jüni et al. 1999) and theoretical considerations (Greenland 1994) suggest that although summary quality scores may in some circumstances provide a useful overall assessment, scales should not generally be used to assess the quality of trials in systematic reviews. Rather, the relevant methodological aspects should be identified a priori, and assessed individually.
It is important that two observers extract the relevant data independently, so that errors can be detected and corrected. A standardized record form is needed for this purpose. Data extraction forms should be carefully designed, piloted, and revised if necessary. Electronic data collection forms have a number of advantages, including the combination of data abstraction and data entry in one step, and the automatic detection of inconsistencies between data recorded by different observers. However, the complexities involved in programming and revising electronic forms should not be underestimated.
Meta-analysis: presenting, synthesizing, and interpreting data
Once studies have been selected and critically appraised, and data extracted, the characteristics of included studies should be presented in tabular form. Table 2 shows the characteristics of the long-term trials that were included in the systematic review of the effect of β-blockade in secondary prevention after myocardial infarction. Freemantle et al. (1999) included all parallel group randomized trials that examined the effectiveness of β-blockers versus placebo or alternative treatment in patients who had had a myocardial infarction. The authors searched 11 bibliographic databases, including dissertation abstracts and grey literature databases, examined existing reviews, and checked the reference lists of each identified study. They identified 31 trials of at least 6 months’ duration which contributed 33 comparisons of β-blocker with control groups (Table 2).
Measures of treatment effect
The results (treatment effects) from individual studies have to be measured in the same way to allow comparison between studies. If the endpoint is binary (for example disease versus no disease, or dead versus alive) then relative risks or odds ratios are often calculated (see Box 3 for definitions). The two measures will be similar when the outcome is uncommon (say, less than 20 per cent), but will differ increasingly as the outcome becomes more common (Box 3). Sackett et al. argue that relative risks should be preferred over odds ratios because they are more intuitively comprehensible to most people (Sackett et al. 1996). However, as the outcome becomes common the range of the relative risk is constrained while the odds ratio is not. The odds ratio has the further advantage that the odds ratio for non-occurrence of the outcome is exactly the inverse of the odds ratio for the outcome. Difference measures such as the absolute risk reduction or the number of patients needed to be treated for one person to benefit (Laupacis et al. 1988) are helpful when applying results in clinical practice (see below). In most meta-analyses these should be derived from the summary ratio measure of treatment effect.
Box 3 Odds ratio or relative risk?

Odds ratios are often used in order to measure treatment effects in trials with binary endpoints. What is an odds ratio and how does it relate to the relative risk? The risk is defined as the proportion of patients who experience a given endpoint, while the odds is defined as the number of patients who experience the endpoint divided by the number of those who do not. For example, if four of a group of 10 patients experience diarrhoea during treatment with an antibiotic (risk = 4/10 = 0.4) then the odds of diarrhoea are 4 to 6 (4 with diarrhoea divided by 6 without, 0.67). If one out of 10 experience diarrhoea in the control group (risk = 1/10 = 0.1) then the odds are 1 to 9 (0.11). A bookmaker (a person who takes bets, for example on horse-races, calculates odds, and pays out winnings) would, of course, refer to this as nine to one. The relative risk or risk ratio is the risk in the treated group divided by the risk in the control group: in this example the risk ratio is 0.4/0.1 = 4. The odds ratio is the odds in the treated group divided by the odds in the control group; in this example the odds ratio is 6 (0.67 divided by 0.11).
As shown in Fig. 3, the odds ratio will be close to the relative risk if the endpoint occurs relatively infrequently, say in less than 20 per cent. If the outcome is more common, as in the diarrhoea example, then the odds ratio will differ increasingly from the relative risk, and will always be further away from 1 (the null value) than the relative risk. For a relative risk of 2 the maximum possible risk in the control group is 0.5, since the risk in the treated group cannot exceed 1 (= 2 × 0.5); the odds ratio, by contrast, is not constrained in this way.

Fig. 3 Odds ratio or relative risk?

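The arithmetic in Box 3 can be sketched in a few lines of code, using the counts from the hypothetical diarrhoea example.

```python
# Risk, odds, relative risk, and odds ratio for the Box 3 example:
# 4/10 patients with diarrhoea on the antibiotic, 1/10 in the control group.

def risk(events, total):
    """Proportion of patients who experience the endpoint."""
    return events / total

def odds(events, total):
    """Patients with the endpoint divided by those without."""
    return events / (total - events)

risk_treated = risk(4, 10)                   # 0.4
risk_control = risk(1, 10)                   # 0.1
relative_risk = risk_treated / risk_control  # 4: four times the risk
odds_ratio = odds(4, 10) / odds(1, 10)       # (4/6)/(1/9) = 6

print(relative_risk, odds_ratio)
```

Because diarrhoea is a common outcome here, the odds ratio (6) lies noticeably further from 1 than the relative risk (4), exactly the divergence Box 3 describes.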
If the outcome is continuous and measurements are made on the same scale (for example blood pressure measured in millimetres of mercury) the mean difference between the treatment and control groups is used. If trials measured outcomes in different ways, differences may be presented in standard deviation units, rather than as absolute differences.
It will not always be desirable to combine the results from the different studies to produce a single estimate of the treatment effect. Indeed, there will be situations where the calculation of a combined effect estimate is inappropriate or even misleading: for example the trials of bacille Calmette–Guérin (BCG) vaccination against tuberculosis (see below). Careful consideration of the combinability of the studies in question is an important step in systematic reviews (Box 2).
Two principles are important in combining treatment effect estimates from different studies. Firstly, simply pooling the data from different studies and treating them as one large study would fail to preserve the randomization and could introduce bias. For example, a recent review and ‘meta-analysis’ of the literature on the role of male circumcision in HIV transmission concluded that the risk of HIV infection was lower in uncircumcised men (Van Howe 1999). However, the analysis was performed by simply pooling the data from 33 diverse studies. A reanalysis stratifying the data by study found that an intact foreskin was in fact associated with an increased risk of HIV infection (O’Farrell and Egger 2000). Confounding by study thus led to a change in the direction of the association (a case of Simpson’s paradox (Last 2001)). The study must therefore always be maintained as the unit of analysis when combining data.
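A toy numerical demonstration of this mechanism (the counts below are invented for illustration and are not taken from the HIV studies): because the exposure is rare in the high-risk study and common in the low-risk study, pooling the 2 × 2 tables cell by cell reverses an association that points in the same direction in both studies.

```python
# Hypothetical illustration of Simpson's paradox with two 2x2 tables,
# given as (events, non-events) for exposed then unexposed.

def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: (a/b) / (c/d)."""
    return (a / b) / (c / d)

study1 = (9, 1, 70, 30)    # high-risk setting, exposure rare
study2 = (10, 90, 1, 19)   # low-risk setting, exposure common

or1 = odds_ratio(*study1)  # within-study association: > 1
or2 = odds_ratio(*study2)  # within-study association: > 1

# Naive pooling adds the tables cell by cell, ignoring study
pooled = tuple(x + y for x, y in zip(study1, study2))
or_pooled = odds_ratio(*pooled)  # direction reversed: < 1

print(or1, or2, or_pooled)
```

Stratifying by study (for example with Mantel–Haenszel weights) avoids the reversal because individuals are only compared within their own study.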
Secondly, simply calculating an arithmetic mean of treatment effect estimates would be inappropriate. The results from small studies are more subject to the play of chance and should, therefore, be given less weight. Let us suppose that we have k studies, and have derived a treatment effect estimate yi (which might be a log odds ratio, log relative risk, risk difference, or mean difference) for each study (i = 1 to k). The ‘fixed-effects’ model for meta-analysis considers the variability between these treatment effect estimates as exclusively due to random variation (Yusuf et al. 1985), so that if all the studies were infinitely large they would give identical results. To derive a summary treatment effect estimate we calculate a weighted average of the treatment effect estimates in the individual studies:
yF = (Σ wiyi)/(Σ wi), where Σ denotes summation over i = 1 to k
The subscript F denotes the fixed-effects assumption. Use of a weighted average accords with our first principle because individuals are only compared with other individuals in the same study. The usual choice of weight wi for study i, which minimizes the variability of the summary treatment effect estimate, is the inverse variance weight wi = 1/vi, where vi is the variance of the treatment effect estimate. This accords with our second principle because the larger the study, the smaller will be the variance of the treatment effect estimate from that study. The variance of the summary treatment effect estimate yF is
vF = 1/(Σ wi)
This can be used to derive confidence intervals, a z statistic, and hence a p value for the null hypothesis that the true treatment effect is zero. An alternative weighting scheme, which may be preferable when data are sparse, is to use Mantel–Haenszel weights to combine relative risks or odds ratios. More details on statistical methods for meta-analysis are given by Deeks et al. (2001).
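The inverse-variance calculation described above can be sketched as follows. The log relative risks and their variances are invented for illustration; the two-sided normal p value is obtained via the standard error function rather than a statistics library.

```python
import math

def fixed_effects(estimates, variances):
    """Inverse-variance fixed-effects meta-analysis:
    yF = sum(wi*yi)/sum(wi) with wi = 1/vi, and var(yF) = 1/sum(wi)."""
    weights = [1 / v for v in variances]
    y_f = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    z = y_f / se
    p = 1 - math.erf(abs(z) / math.sqrt(2))  # two-sided normal p value
    ci = (y_f - 1.96 * se, y_f + 1.96 * se)
    return y_f, ci, z, p

# Three hypothetical trials: log relative risks and their variances
log_rr = [-0.3, -0.1, -0.25]
var = [0.05, 0.02, 0.08]
y_f, ci, z, p = fixed_effects(log_rr, var)
print(math.exp(y_f), p)  # summary relative risk and p value
```

Working on the log scale and exponentiating the summary at the end is the usual approach for ratio measures, for the same symmetry reasons given below for forest plots.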
Graphical display for meta-analysis
Results from each trial, together with the summary treatment effect estimate, are usefully displayed graphically in a ‘forest plot’, a form of presentation developed in the 1980s by Richard Peto’s group in Oxford. Figure 4 shows the forest plot for the trials of β-blockers in secondary prevention after myocardial infarction listed in Table 2. Each study is represented by a black square, whose centre corresponds to the treatment effect estimate, and a horizontal line representing the 95 per cent confidence interval for the treatment effect estimate. The area of the black square is proportional to the weight of the study in the meta-analysis: early plots, which used the same size of symbol for each study, were found to draw attention to the widest confidence intervals, which correspond to the smallest studies. The solid vertical line corresponds to no effect of treatment (relative risk 1.0). If the confidence interval includes 1, then the difference in the effect of experimental and control therapy is not statistically significant at conventional levels (p > 0.05). The confidence intervals of most studies cross this line.

Fig. 4 Forest plot of controlled trials of β-blockers in secondary prevention of mortality after myocardial infarction. The centre of the black square and horizontal line corresponds to the relative risk and 95 per cent confidence intervals. The area of the black squares is proportional to the weight each trial contributes to the meta-analysis. The diamond at the bottom of the graph represents the combined relative risk and its 95 per cent confidence interval, indicating a 20 per cent reduction in the risk of death. The solid vertical line corresponds to no effect of treatment (relative risk 1.0), the dotted vertical line to the combined relative risk (0.8). The relative risk, 95 per cent confidence interval and weights are also given in tabular form. The graph was produced using Stata software (Sterne et al. 2001). (Adapted from Freemantle et al. 1999.)

The diamond at the bottom of the graph displays the results of the meta-analysis: the centre of the diamond corresponds to the summary treatment effect estimate while its width corresponds to the 95 per cent confidence interval. The broken line also corresponds to the summary treatment effect estimate and is included to allow a visual assessment of the variability of the individual studies around the summary estimate.
A logarithmic scale was used for plotting the relative risk in Fig. 4. There are a number of reasons why ratio measures are best plotted on logarithmic scales (Galbraith 1988). Most importantly, the value of a risk ratio and its reciprocal, for example 0.5 and 2, which represent risk ratios of the same magnitude but opposite directions, will be equidistant from 1.0. Studies with relative risks below and above 1.0 will take up equal space on the graph and thus visually appear to be equally important. Also, confidence intervals will be symmetrical around the point estimate.
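The symmetry argument can be checked directly: on the log scale a halving and a doubling of risk sit at exactly equal distances from the null.

```python
import math

# log(0.5) and log(2) have equal magnitude and opposite sign, so a
# relative risk of 0.5 and its reciprocal 2.0 are symmetrical about
# 1.0 on a logarithmic axis...
assert math.isclose(abs(math.log(0.5)), abs(math.log(2.0)))

# ...whereas on the natural scale the distances from 1.0 differ
# (0.5 versus 1.0), so protective effects would look smaller.
assert abs(0.5 - 1) != abs(2.0 - 1)
```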
Heterogeneity between study results
The thoughtful consideration of heterogeneity between study results is an important aspect of systematic reviews (Thompson 1994). As explained above, this should start when writing the review protocol, by defining potential sources of heterogeneity and planning appropriate subgroup analyses. Once the data have been assembled, simple inspection of the forest plot is informative. The results from the β-blocker trials are fairly homogeneous, clustering between a relative risk of 0.5 and 1.0, with widely overlapping confidence intervals (Fig. 4). In contrast, trials of BCG vaccination for prevention of tuberculosis (Colditz et al. 1994) are clearly heterogeneous (Fig. 5). The findings of the British trial, which indicate substantial benefit of BCG vaccination, are not compatible with those from the Madras or Puerto Rico trials which suggest little effect or only a modest benefit. There is no overlap in the confidence intervals of the three trials.

Fig. 5 Forest plot of trials of BCG vaccine to prevent tuberculosis. Trials are ordered according to the latitude of the study location, expressed as degrees from the equator. No meta-analysis is shown. (Adapted from Colditz et al. 1994.)

The fixed-effects estimate is based on the assumption that the true effect does not differ between studies, and statistical tests of homogeneity (also called tests of heterogeneity) assess the evidence against this. The null hypothesis is that individual study results reflect a single underlying effect, so that the differences between treatment effect estimates in individual studies are a consequence of sampling variation and simply due to chance. The test statistic is
Q = Σ wi(yi − yF)², where Σ denotes summation over i = 1 to k
which is compared with the χ2 distribution on (k – 1) degrees of freedom. The greater the distance between the individual study effects and the summary effect, the more evidence there is against the null hypothesis of a common fixed effect for all studies.
The test of homogeneity gives p = 0.25 for the β-blocker trials but p < 0.001 for the BCG trials. The BCG trials are an extreme example, however, and a major limitation of statistical tests of homogeneity is their lack of power: they often fail to reject the null hypothesis even if substantial between-study differences exist. Reviewers should therefore not assume that a non-significant test of homogeneity excludes important heterogeneity. Heterogeneity between study results should not be seen as purely a problem for systematic reviews, since it also provides an opportunity for examining why treatment effects differ in different circumstances, as discussed below.
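The Q statistic defined above is easily computed by hand. With three hypothetical trials (2 degrees of freedom) the χ2 upper-tail p value has the simple closed form exp(−Q/2), which avoids the need for a statistics library; the trial estimates are invented for illustration.

```python
import math

log_rr = [-0.3, -0.1, -0.25]   # hypothetical log relative risks
var = [0.05, 0.02, 0.08]       # their variances
w = [1 / v for v in var]       # inverse-variance weights

# Fixed-effects summary, as in the formula above
y_f = sum(wi * yi for wi, yi in zip(w, log_rr)) / sum(w)

# Cochran's Q: weighted squared deviations from the summary
Q = sum(wi * (yi - y_f) ** 2 for wi, yi in zip(w, log_rr))
df = len(log_rr) - 1
p = math.exp(-Q / 2)           # chi-squared upper tail, exact for 2 df

print(round(Q, 2), round(p, 2))
```

Here Q is small relative to its degrees of freedom and p is large, so there is little evidence of heterogeneity; but, as noted in the text, a non-significant result does not prove homogeneity.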
Random-effects meta-analysis
Meta-analytical methods can be broadly classified into ‘fixed-effects’ and ‘random-effects’ models (Berlin et al. 1989). Random-effects models (DerSimonian and Laird 1986) allow for between-study heterogeneity by assuming that the treatment effect varies between studies, and take into consideration this additional source of variation. The summary treatment effect from random-effects meta-analysis then estimates the mean about which the treatment effect in different studies is assumed to vary and should thus be interpreted differently from the results of a fixed-effects meta-analysis. In practice, random-effects estimates are derived simply by modifying the weights from the fixed-effects analysis. This leads to relatively more weight being given to smaller studies: this may be undesirable given that such studies are more subject to publication and other bias (see below). Because they assume an extra source of variability, random-effects estimates have wider confidence intervals than fixed-effects estimates.
While neither of the two models can be said to be ‘correct’, a substantial difference in the combined effect calculated by fixed- and random-effects models will be seen only if studies are markedly heterogeneous, as in the case of the BCG trials (Table 3). Combining trials using a random-effects model indicates that BCG vaccination halves the risk of tuberculosis, whereas fixed-effects analysis indicates that the risk is only reduced by 35 per cent. This is essentially explained by the different weight given to the large Madras trial which showed no protective effect of vaccination (41 per cent of the total weight with fixed-effects model, 10 per cent with random-effects model) (Table 3).

Table 3 Meta-analysis of trials of BCG vaccination to prevent tuberculosis using a fixed-effects and random-effects model (inverse variance method) (note the differences in the weight allocated to individual studies and the combined relative risks)

The interpretation of random-effects meta-analyses is problematic, because the treatment effect in a particular population will differ by an unknown amount from the assumed mean treatment effect. Rather than simply ignoring heterogeneity after allowing for it in a statistical model, a better approach is to scrutinize and attempt to explain it (Bailey 1987; Thompson 1994). As shown in Fig. 5, BCG vaccination appears to be effective at higher latitudes but not in warmer regions, possibly because exposure to certain environmental mycobacteria acts as a ‘natural’ BCG inoculation in warmer regions (Fine 1995). In this situation it is more meaningful to quantify how the effect varies according to latitude than to calculate an overall estimate of effect which will be misleading, independent of the model used.
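A minimal sketch of the DerSimonian and Laird method cited above: the between-study variance τ2 is estimated by the method of moments from Cochran's Q and added to each within-study variance, so the random-effects weights are wi* = 1/(vi + τ2). The three trials are invented so that the small trials report large effects, mimicking the situation in which random-effects weighting shifts the summary towards the smaller studies.

```python
import math

def dersimonian_laird(estimates, variances):
    """Fixed- and random-effects summaries via the DerSimonian-Laird
    method-of-moments estimate of the between-study variance tau^2."""
    w = [1 / v for v in variances]
    y_f = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - y_f) ** 2 for wi, yi in zip(w, estimates))
    df = len(estimates) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)              # between-study variance
    w_star = [1 / (v + tau2) for v in variances]
    y_r = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    return y_f, y_r, tau2

# Hypothetical heterogeneous trials: the two small trials (large
# variances) report large effects, the big trial almost none
log_rr = [-0.9, -0.7, -0.05]
var = [0.2, 0.15, 0.01]
y_f, y_r, tau2 = dersimonian_laird(log_rr, var)
print(math.exp(y_f), math.exp(y_r))  # fixed vs random summary relative risk
```

Because τ2 is added to every study's variance, the large trial's dominance is diluted and the random-effects summary sits much closer to the small trials, illustrating the caution in the text about the extra weight such models give to small studies.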
Cumulative meta-analysis
A useful way to show the accumulation of evidence over time is to perform a cumulative meta-analysis (Lau et al. 1992). Cumulative meta-analysis is defined as the repeated performance of meta-analysis whenever a new relevant trial becomes available for inclusion. This allows the retrospective identification of the point in time when a treatment effect first reached conventional levels of statistical significance.
Based on the systematic review by Freemantle et al. (1999), Fig. 6 shows mortality results from a cumulative meta-analysis of trials of β-blockers in secondary prevention after myocardial infarction. A clear beneficial effect (p < 0.001) was evident by the end of 1981 (Fig. 6). Subsequent trials in a further 15 000 patients simply confirmed this result. Similarly, Lau et al. showed that for the trials of intravenous streptokinase in acute myocardial infarction, a statistically significant (p = 0.01) combined difference in total mortality was achieved by 1973 (Lau et al. 1992). The results of the subsequent 25 studies, which included the large Gruppo Italiano per lo Studio della Streptochinasi nell’Infarto Miocardico-1 (GISSI-1) (GISSI 1986) and ISIS-2 trials (ISIS-2 Collaborative Group 1988) and enrolled over 34 000 additional patients, reduced the significance level to p = 0.001 in 1979, p = 0.0001 in 1986, and to p < 0.00001 when the first mega-trial appeared, narrowing the confidence intervals around an essentially unchanged estimate of about 20 per cent reduction in the risk of death. This situation has been taken to suggest that, once a statistically significant treatment effect is evident from meta-analysis of the existing smaller trials, further studies in large numbers of patients may be at best superfluous and costly, if not unethical (Murphy et al. 1994).

Fig. 6 Cumulative meta-analysis of controlled trials of β-blockers in secondary prevention after myocardial infarction. A clear (p < 0.001) reduction of mortality was evident by 1981. The graph was produced using Stata software (Sterne et al. 2001). (Adapted from Freemantle et al. 1999.)

Another application of cumulative meta-analysis has been to correlate the accruing evidence with the recommendations made by experts in review articles and textbooks. Antman et al. (1992) showed for thrombolytic drugs that recommendations for routine use first appeared in 1987, 14 years after a statistically significant (p = 0.01) beneficial effect became evident in cumulative meta-analysis. Conversely, the prophylactic use of lidocaine (lignocaine) continued to be recommended for routine use in myocardial infarction despite the lack of evidence for any beneficial effect, and the suggestion of a harmful effect when results were combined in a meta-analysis.
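The cumulative procedure is simply a fixed-effects meta-analysis re-run each time a new trial is added, in chronological order. The trial data below are invented for illustration; each line of output gives the cumulative relative risk and its 95 per cent confidence interval up to that year.

```python
import math

# Hypothetical trials: (year, log relative risk, variance), in order
trials = [
    (1972, -0.50, 0.30),
    (1974, -0.40, 0.20),
    (1977, -0.30, 0.10),
    (1980, -0.20, 0.02),
    (1984, -0.22, 0.01),
]

estimates, variances = [], []
for year, y, v in trials:
    estimates.append(y)
    variances.append(v)
    # Inverse-variance fixed-effects summary of all trials so far
    w = [1 / vi for vi in variances]
    y_cum = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    se = math.sqrt(1 / sum(w))
    lower, upper = y_cum - 1.96 * se, y_cum + 1.96 * se
    print(year, round(math.exp(y_cum), 2),
          f"({math.exp(lower):.2f} to {math.exp(upper):.2f})")
```

Watching the confidence interval narrow around a stable estimate shows, year by year, when the evidence first became conclusive, which is exactly the retrospective question cumulative meta-analysis is designed to answer.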
Bayesian meta-analysis
Some feel that a Bayesian approach to meta-analysis is more appropriate than the ‘classical’ approaches described above. Bayesian statisticians express their belief about the size of an effect by specifying some prior probability distribution before seeing the data—and then update that belief by deriving a posterior probability distribution, taking the data into account (Lilford and Braunholtz 1996). This is done by using Bayes’ theorem, named after the eighteenth-century English clergyman Thomas Bayes (Spiegelhalter et al. 1999). Bayesian models are available in both a fixed- and random-effects framework but published applications have usually been based on the random-effects assumption. The confidence interval (or more correctly in Bayesian terminology: the 95 per cent credible interval which covers 95 per cent of the posterior probability distribution) will be slightly wider than that derived from using the conventional models (Su and Li Wan Po 1996).
Bayesian approaches to meta-analysis can integrate other sources of evidence, for example findings from observational studies or expert opinion and are particularly useful for analysing the relationship between treatment benefit and underlying risk (Song et al. 1998). Bayesian techniques are, however, controversial because the definition of prior probability will often involve subjective assessments and opinion which runs against the principles of systematic review. Furthermore, analyses are complex to implement and time-consuming.
Deriving absolute measures of effect
The amount of between-study heterogeneity is usually lower for ratio than for difference measures of treatment effect, so that meta-analyses are usually done using ratio measures (for an exception see Deeks and Altman (2001)). However, the absolute reduction in risk is a useful measure of the impact of treatment. For example, the relative risk of death associated with the use of β-blockers after myocardial infarction is 0.80 (95 per cent confidence interval, 0.74–0.86) (Fig. 4). The relative risk reduction, obtained by subtracting the relative risk from 1 and expressing the result as a percentage, is 20 per cent (95 per cent confidence interval, 14–26 per cent). However, these relative measures ignore the underlying risk of death among patients who have survived the acute phase of myocardial infarction, which varies widely. For example, among patients with three or more cardiac risk factors, the probability of death at 2 years after discharge ranged from 24 to 60 per cent (Multicenter Postinfarction Research Group 1983). Conversely, 2-year mortality among patients with no risk factors was less than 3 per cent.
The absolute risk reduction, or risk difference, reflects both the underlying risk without therapy and the risk reduction associated with therapy. Taking the reciprocal of the risk difference gives the number of patients who need to be treated to prevent one event, which is abbreviated to NNT or NNTbenefit (Laupacis et al. 1988). The number of patients who need to be treated to harm one patient, denoted as NNH or, more appropriately, NNTharm (Altman 1998) can also be calculated. It will usually be informative to calculate the risk difference, NNT or NNH for a range of baseline risks reflecting the range in the component studies of the meta-analysis.
For a baseline risk of 1 per cent per year, the absolute risk difference indicates that two deaths are prevented per 1000 treated patients (Table 4). This corresponds to 500 patients (1 divided by 0.002) treated for 1 year to prevent one death. Conversely, if the risk is above 10 per cent, fewer than 50 patients have to be treated to prevent one fatal event. Many clinicians would probably decide not to treat patients at very low risk, considering the large number of patients who would have to be exposed to the adverse effects of β-blockade to prevent one death. Appraising the NNT from a patient’s estimated risk without treatment, and the relative risk reduction with treatment, is a helpful aid when making a decision in an individual patient. A nomogram to determine NNTs is available (Chatellier et al. 1996), confidence intervals can be calculated (Altman 1998), and the concept has recently been extended to the number of healthy people needed to be screened to prevent one adverse outcome (Rembold 1998).

Table 4 Beta-blockade in secondary prevention after myocardial infarction: absolute risk reductions and NNTbenefit for different levels of control group mortality

Combining absolute effect measures in meta-analysis is often inappropriate because the combined risk difference (and the NNT calculated from it) will be applicable only to patients at levels of risk corresponding to the typical control group risk of the trials analysed. It is generally more meaningful to use relative effect measures when summarizing the evidence while considering absolute measures when applying it to a specific clinical or public health situation (Egger et al. 1997c; Smeeth et al. 1999).
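The absolute-measure arithmetic described above (absolute risk reduction = baseline risk × (1 − relative risk), NNT = 1/ARR) can be sketched as follows, using the summary relative risk of 0.80 for β-blockade from Fig. 4; the range of baseline risks is chosen for illustration.

```python
def nnt(baseline_risk, relative_risk):
    """Number needed to treat: reciprocal of the absolute risk
    reduction, rounded to the nearest whole patient."""
    arr = baseline_risk * (1 - relative_risk)
    return round(1 / arr)

rr = 0.8  # summary relative risk of death with beta-blockade (Fig. 4)
for baseline in (0.01, 0.05, 0.10, 0.20):
    print(baseline, nnt(baseline, rr))
```

At a baseline risk of 1 per cent per year this reproduces the 500 patients treated for one year to prevent one death quoted in the text; at 10 per cent the NNT falls to 50, showing how the same relative effect translates into very different absolute benefits.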
Sources of bias in systematic reviews and meta-analysis
That there are limitations to the process of systematic review and meta-analysis is illustrated by the fact that reviews of the same issue have reached opposite conclusions. Examples include assessments of low molecular weight heparin in the prevention of perioperative thrombosis (Hayes et al. 1990; Nurmohamed et al. 1992) and second-line antirheumatic drugs in the treatment of rheumatoid arthritis (Felson et al. 1990; Gøtzsche et al. 1992). In the following sections potential sources of bias are discussed in some detail.
Publication bias
The most obvious problem is brought about by the fact that some studies may never be published. If the reasons why studies remain unpublished are associated with their outcome, then the result of a meta-analysis could be seriously biased. Hypothetically, with a putative therapy which has no actual effect on a disease, it is possible that studies which suggested a beneficial treatment effect would be published, while an equal mass of data pointing the other way would remain unpublished. In this situation, a meta-analysis of the published trials would identify a spurious beneficial treatment effect. In the field of cancer chemotherapy this has indeed been demonstrated by comparing the results from studies identified in a literature search with those contained in an international trials registry (Simes 1987) (Box 4).
Box 4 A demonstration of publication bias

Studies with statistically significant results are more likely to be published than those with non-significant results. The inclusion of a study in a trials register can be assumed not to be influenced by its results because registration generally takes place before completion of the study. The studies entered in a register are therefore likely to constitute a more representative sample of all the studies which have been performed in a given area than a sample of published studies. This has been examined for trials of different cancer chemotherapies by comparing the results from meta-analysis of trials identified in a literature search and of trials registered with the International Cancer Research Data Bank (Simes 1987). As shown in Fig. 7, the analysis of published clinical trials indicates considerably better survival of patients with advanced ovarian cancer treated with combination chemotherapy as compared with alkylating agent monotherapy. However, an analysis of the registered trials failed to confirm this.

Fig. 7 Demonstration of publication bias.

Several studies have investigated the importance of publication bias in the medical literature by following up research proposals approved by ethics committees or institutional review boards. The factors associated with publication or non-publication of results could thus be examined. For example, of 285 studies approved by the Central Oxford Research Ethics Committee between 1984 and 1987 which had been completed and analysed, 138 (48 per cent) had been published by 1990 (Easterbrook et al. 1991). Studies with statistically significant (p < 0.05) results were more likely to have been published than those with non-significant results. A meta-analysis (Egger and Davey Smith 1998) of five such studies showed that this is a consistent finding: the odds of publication were three times greater if results were statistically significant (combined odds ratio 3.0, 95 per cent confidence interval 2.3–3.9). Interestingly, studies continued to appear in print many years after approval by the ethics committee. Among proposals submitted to the Royal Prince Alfred Hospital Ethics Committee in Sydney, around 85 per cent of studies with significant results as compared to 65 per cent of studies with null results had been published after 10 years (Stern and Simes 1997). The median time to publication was 4.8 years for studies with significant results and 8.0 years for studies with null results.
The source of funding was associated with publication or non-publication independent of study results. Pharmaceutical industry sponsored studies were less likely to be published than those supported by the government or by voluntary organizations, with investigators citing the data management by these companies as a reason for non-publication (Easterbrook et al. 1991; Dickersin et al. 1992). This is in agreement with a review of publications of clinical trials which separated them into those which were sponsored by the pharmaceutical industry and those supported by other means (Davidson 1986). The results of 89 per cent of published industry supported trials favoured the new therapy, as compared to 61 per cent of the other trials. Similar results have been reported from an overview of non-steroidal anti-inflammatory drug trials (Rochon et al. 1994). The implication is that the pharmaceutical industry has discouraged the publication of negative studies which it has funded.
Finally, multicentre studies were more likely to be published than studies from a single centre (Dickersin et al. 1992). Of note was that high-quality trials were not more likely to be published than clinical trials of lower quality (Easterbrook et al. 1991).
Bias in location of studies
While publication bias has long been recognized (Sterling 1959) and much discussed, other factors can contribute to biased inclusion of studies in meta-analyses. Indeed, among published studies, the probability of identifying relevant trials for meta-analysis is also influenced by their results. These biases have received much less consideration than publication bias, but their consequences could be of equal importance.
Meta-analyses are often exclusively based on trials published in English. For example, among 36 meta-analyses reported in leading English-language general medicine journals between 1991 and 1993, 26 (72 per cent) had restricted their search to studies reported in English (Grégoire et al. 1995). However, investigators working in a non-English-speaking country will publish some of their work in local journals. Indeed, many clinical studies, including randomized controlled trials, are reported in journals published in languages other than English (Dickersin et al. 1994). It is conceivable that authors are more likely to report in an international English-language journal if results are positive, whereas negative findings are published in a local journal (Grégoire et al. 1995). Language bias could thus be introduced in meta-analyses exclusively based on English-language reports (Moher et al. 1996b; Egger et al. 1997b).
In locating studies, searches of computerized databases are often supplemented by contacting experts in the field and checking the reference lists of other studies and reviews. In the latter case, citation bias could play an important role. In the field of cholesterol lowering, it has been shown that trials which are supportive of a beneficial effect are cited more frequently than unsupportive trials, regardless of the size and quality of the studies involved (Ravnskov 1992). Thus the use of reference lists would be more likely to locate supportive studies which could bias the findings of a meta-analysis (Box 5).
Box 5 A demonstration of inclusion and citation bias

A meta-analysis of seven trials of cholesterol lowering after myocardial infarction (Rossouw et al. 1990) defined its inclusion criteria as those single-factor randomized trials with at least 100 participants per group, with at least 3 years of follow-up and without the use of hormone treatment to lower cholesterol. The pooled results for all-cause mortality indicated a favourable trend (odds ratio 0.91, 95 per cent confidence interval 0.82 to 1.02) (Fig. 8). One trial met all the entry criteria but was not included (Woodhill et al. 1978). In this study, the odds ratio for overall mortality was an unfavourable 1.60 (0.95 to 2.70). For the trials included in the analysis the mean annual citation count per study for the period up to 5 years after publication was 20; for the study which was not included it was less than 1 (Ravnskov 1992). Eleven other secondary prevention trials were available at the time but did not meet the somewhat arbitrary inclusion criteria (Davey Smith et al. 1993). The pooled odds ratio for all-cause mortality for these trials is 1.14 (1.03 to 1.26). Thus inclusion bias and citation bias may have influenced the conclusions of this meta-analysis.

Fig. 8 Demonstration of inclusion and citation bias.

The production of duplicate publications from single studies can lead to bias in a number of ways (multiple publication bias) (Huston and Moher 1996). Studies with significant results are more likely to lead to multiple publications and presentations (Easterbrook et al. 1991), which makes it more likely that they will be located and included in a meta-analysis. The inclusion of duplicated data may therefore lead to overestimation of treatment effects, as recently demonstrated for trials of ondansetron for prevention of postoperative nausea and vomiting (Tramèr et al. 1997). It is not always obvious that multiple publications come from a single study, and one set of study participants may thus be included in an analysis twice. This is a particular problem in multicentre trials (Leizorovicz et al. 1992). Indeed, it may be difficult if not impossible for meta-analysts to determine whether two papers represent duplicate publications of one trial or two separate trials, since examples exist where two articles reporting the same trial do not share a single common author (Felson 1992; Huston and Moher 1996; Tramèr et al. 1997).
Studies published in journals that are not indexed in one of the major literature databases are difficult for reviewers and meta-analysts to locate. For example, only about 30 journals indexed in Medline are published in India, despite the fact that India is the developing country with the largest research output (Singh and Singh 1994) and medical research there is published in English. It is possible that trials with statistically significant results are more likely to be published in an indexed journal whereas trials with null results are published in non-indexed journals (database bias).
Biased inclusion criteria
Once studies have been located and data obtained, there is still potential for bias in setting the inclusion criteria for a meta-analysis. If, as is usual, the inclusion criteria are developed by an investigator familiar with the area under study, they can be influenced by knowledge of the results of the set of potential studies. Manipulating the inclusion criteria could lead to selective inclusion of positive studies and exclusion of negative studies. For example, some meta-analyses of trials of cholesterol-lowering therapy (Peto et al. 1985; MacMahon 1992) have excluded certain studies on the grounds that the treatments used appear to have had an adverse effect which is independent of cholesterol lowering itself. However, these meta-analyses have included trials of treatments which are likely to favourably influence risk of coronary heart disease, independent of cholesterol lowering. Clearly such an asymmetrical approach introduces the possibility of selection bias, with the criteria for inclusion into the meta-analysis being derived from the results of the studies (Box 5).
Investigating and dealing with bias and heterogeneity
As discussed above, there will often be diverging opinions on the correct method for performing a particular systematic review or meta-analysis. It is therefore important to examine the robustness of the findings to different assumptions and methods in thorough sensitivity analysis.
Sensitivity analysis
Sensitivity analysis is illustrated in Fig. 9 for the meta-analysis of β-blockers after myocardial infarction (Freemantle et al. 1999). Firstly, the overall effect was calculated by different statistical methods, using both a fixed-effects and a random-effects model. It is evident from the figure that the overall estimates are virtually identical and that confidence intervals are only slightly wider when using the random-effects model. This is explained by the relatively small amount of heterogeneity present in this meta-analysis.

Fig. 9 Sensitivity analyses examining the robustness of the meta-analysis of trials of β-blockers in secondary prevention after myocardial infarction (see text). Mortality results are shown. The dotted vertical line corresponds to the combined relative risk from the fixed-effects model (0.8).

Methodological quality was assessed in terms of concealment of allocation of study participants to β-blocker or control groups and blinding of patients and investigators (Freemantle et al. 1999). Figure 9 shows that the estimated treatment effect was similar for studies with and without concealment of treatment allocation. The eight studies that were not double-blind indicated more benefit than the 25 double-blind trials, but the confidence intervals overlap. Publication bias is more likely to affect small studies (see below), and may thus be examined by stratifying the analysis by study size. If publication bias is present, it is expected that, among published studies, the larger ones will report smaller effects. In the present example the 11 smallest trials (25 deaths or fewer) show the largest effect; however, confidence intervals overlap and exclusion of the smaller studies has little effect on the overall estimate (Fig. 9). Studies varied according to length of follow-up, but this again had little effect on estimates. Finally, two trials were terminated earlier than anticipated on the grounds of the results from interim analyses. Estimates of treatment effects from trials which were stopped early because of a significant treatment difference are liable to be biased away from the null value. Bias may thus be introduced in a meta-analysis which includes such trials (Green et al. 1987). However, exclusion of these trials again affects the overall estimate only marginally.
The sensitivity analysis thus shows that the results from this meta-analysis are robust to the choice of the statistical method and to the exclusion of trials of lesser quality or of studies terminated early. It also suggests that publication bias is unlikely to have distorted its findings.
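The comparison of fixed- and random-effects summaries described above can be sketched in a few lines of code. The following Python fragment, using invented log relative risks rather than the actual β-blocker trial data, computes an inverse-variance fixed-effect estimate and a DerSimonian–Laird random-effects estimate (DerSimonian and Laird 1986):

```python
import math

# Invented log relative risks and standard errors for five trials
# (illustrative only; not the beta-blocker trials discussed in the text)
log_rr = [-0.50, -0.05, -0.35, 0.10, -0.60]
se = [0.15, 0.10, 0.20, 0.08, 0.25]

# Fixed-effect (inverse-variance) pooled estimate
w = [1 / s**2 for s in se]
fixed = sum(wi * y for wi, y in zip(w, log_rr)) / sum(w)
se_fixed = math.sqrt(1 / sum(w))

# DerSimonian-Laird: estimate between-trial variance tau^2 from Cochran's Q
q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_rr))
c = sum(w) - sum(wi**2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(log_rr) - 1)) / c)

# Random-effects estimate: weights incorporate tau^2, widening the CI
w_star = [1 / (s**2 + tau2) for s in se]
random_ = sum(wi * y for wi, y in zip(w_star, log_rr)) / sum(w_star)
se_random = math.sqrt(1 / sum(w_star))

for label, est, s in [("fixed", fixed, se_fixed), ("random", random_, se_random)]:
    lo, hi = est - 1.96 * s, est + 1.96 * s
    print(f"{label}: RR {math.exp(est):.2f} "
          f"(95% CI {math.exp(lo):.2f} to {math.exp(hi):.2f})")
```

When there is little heterogeneity, tau² is estimated as zero and the two models coincide, which is why the estimates in Fig. 9 are virtually identical; the invented data here contain appreciable heterogeneity, so the random-effects interval is noticeably wider.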
Funnel plots
Funnel plots—scatter plots in which the treatment effects estimated from individual studies are plotted on the horizontal axis against a measure of study size on the vertical axis—have been proposed as a means of detecting publication bias (Light and Pillemer 1984). In the absence of bias, the plot should resemble a symmetrical inverted funnel, with the results of smaller studies being more widely scattered than those of the larger studies. If the plot shows an asymmetrical shape, publication bias may be present. This usually takes the form of a gap in the wide part of the funnel, which indicates the absence of small negative studies. The funnel plot for the meta-analysis of the trials of β-blockade in secondary prevention after myocardial infarction is shown in the upper panel of Fig. 10. The plot is fairly symmetrical. In contrast, the funnel plot of controlled trials of magnesium infusion in acute myocardial infarction (lower panel of Fig. 10) is clearly asymmetrical. This is a well-known example where publication bias may explain the discrepancy between meta-analyses which showed a clear treatment effect and the subsequent large ISIS-4 trial which showed no effect (Egger et al. 1997a).

Fig. 10 Funnel plots of trials of β-blockers in secondary prevention after myocardial infarction (upper panel) and of trials of magnesium infusion in acute myocardial infarction (lower panel). The relative risk is plotted on a logarithmic scale, to ensure that effects of the same magnitude but opposite directions will be equidistant from 1.0. Plotting against the standard error of the treatment effect emphasizes differences between the smaller studies among which publication and other biases are most likely to occur (Sterne and Egger 2001). The vertical line shows the summary estimate from the fixed-effects model, diagonal lines show the expected 95 per cent confidence intervals around the summary estimate.

Funnel plot asymmetry should not be considered to be proof of publication bias in a meta-analysis. Firstly, other types of bias can lead to asymmetry. For example, smaller studies are, on average, conducted and analysed with less methodological rigour than larger studies. Trials of low quality also tend to show larger effects (Jüni et al. 2001). Heterogeneity between the treatment effects in different trials may lead to funnel plot asymmetry if the true treatment effect is larger in the smaller trials. For example, some interventions may have been implemented less thoroughly in larger trials, thus explaining the more positive results in smaller trials. This is particularly likely in trials of complex interventions in chronic diseases, such as rehabilitation after stroke or multifaceted interventions in diabetes mellitus. Thus the funnel plot should be seen as a generic means of examining ‘small study effects’ (the tendency for the smaller studies in a meta-analysis to show larger treatment effects). Other graphical representations, discussed in detail elsewhere, are useful to investigate bias and heterogeneity. These include Galbraith plots (Galbraith 1988) and L’Abbé plots (L’Abbé et al. 1987; Song 1999; Deeks and Altman 2001).
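Constructing a funnel plot requires no more than the trial effects, their standard errors, and the expected 95 per cent limits around the summary estimate. The sketch below, again with invented trial results, computes these coordinates; in the absence of bias each point should fall within the widening limits:

```python
import math

# Invented log relative risks and standard errors (a symmetric funnel)
trials = [(-0.35, 0.30), (-0.22, 0.22), (-0.28, 0.15), (-0.20, 0.10), (-0.23, 0.05)]

# Fixed-effect summary from inverse-variance weights
w = [1 / se**2 for _, se in trials]
summary = sum(wi * y for wi, (y, _) in zip(w, trials)) / sum(w)

# Effect goes on the horizontal axis, standard error on the vertical axis
# (plotted inverted, so the largest trials sit at the top of the funnel);
# the diagonals are the expected 95 per cent limits around the summary
within = []
for y, se in sorted(trials, key=lambda t: t[1]):
    lo, hi = summary - 1.96 * se, summary + 1.96 * se
    within.append(lo <= y <= hi)
    print(f"SE {se:.2f}: RR {math.exp(y):.2f}, "
          f"expected limits {math.exp(lo):.2f} to {math.exp(hi):.2f}")
```

Asymmetry shows up as points for small trials (large standard errors) falling outside these limits on one side only, as in the magnesium example in the lower panel of Fig. 10.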
Statistical methods to detect and correct for bias
A number of authors (Iyengar and Greenhouse 1988; Dear and Begg 1992) have proposed methods to detect publication bias, based on the assumption that an individual study’s results (for example the p value) affect its probability of publication. These methods model the selection process that determines which results are published and which are not, and hence are known as ‘selection models’. They can be extended to estimate treatment effects corrected for the estimated publication bias (Vevea and Hedges 1995; Givens et al. 1997). However, the complexity of the statistical methods, and the large number of studies needed, mean that selection models have not been widely used in practice. More recently, Copas has argued that it is not possible to identify the precise mechanism for publication bias (Copas 1999), and advocated sensitivity analyses to assess the range of possible treatment effects according to the severity of the publication bias (Copas and Shi 2000).
An approach which does not attempt to define the selection process leading to publication or non-publication is to use statistical methods to examine associations between study size and estimated treatment effects, thus extending the graphical approach of the funnel plot. Begg and Mazumdar (1994) proposed an adjusted rank correlation method while Egger et al. (1997a) introduced a linear regression approach. Examined in detail elsewhere (Sterne et al. 2000), these methods acknowledge that publication bias is only one of many possible mechanisms that can lead to associations between treatment effects and study size. An extension to these methods is to consider a measure of study size as one of a number of different possible explanations for between-study heterogeneity in a multivariable ‘meta-regression’ model. For example, the effects of study size, adequacy of randomization, and type of blinding might be examined simultaneously. Thompson and Sharp (1998) review methods for meta-regression. Meta-regression should be interpreted with caution: the number of data points (studies in the meta-analysis) may be small, and associations are observational and may be confounded by other measured or unmeasured trial characteristics.
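The linear regression approach of Egger et al. (1997a) regresses the standardized effect (the effect divided by its standard error) on precision (one divided by the standard error); an intercept that differs from zero indicates funnel plot asymmetry. A minimal sketch in its simplest, unweighted form, using invented trial data in which the smaller trials show larger effects:

```python
import math

# Invented log odds ratios and standard errors; the smaller trials
# (larger SE) show larger effects, producing an asymmetric funnel
effects = [-0.80, -0.60, -0.45, -0.30, -0.20, -0.15]
ses = [0.40, 0.30, 0.25, 0.15, 0.10, 0.05]

x = [1 / s for s in ses]                    # precision
y = [e / s for e, s in zip(effects, ses)]   # standardized effect

# Ordinary least squares; asymmetry pulls the intercept away from zero
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
intercept = ybar - slope * xbar

# Standard error of the intercept from the residual variance
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s2 = sum(r**2 for r in resid) / (n - 2)
se_int = math.sqrt(s2 * (1 / n + xbar**2 / sxx))

print(f"intercept = {intercept:.2f} (SE {se_int:.2f})")
```

In the absence of small study effects the regression line runs approximately through the origin; here the intercept is clearly negative, reflecting the exaggerated effects in the small trials.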
Spurious precision? Meta-analysis of observational studies
The randomized controlled trial is the principal research design in the evaluation of medical interventions. However, aetiological hypotheses, for example those relating common exposures to the occurrence of disease, cannot generally be tested in randomized experiments. Does breathing other people’s tobacco smoke promote the development of lung cancer, drinking coffee cause coronary heart disease, and eating a diet rich in unsaturated fat induce breast cancer? Studies of such ‘menaces of daily life’ (Feinstein 1988) employ observational designs, or examine the presumed biological mechanisms in the laboratory. In these situations the risks involved are generally small, but if a large proportion of the population is exposed, the potential public health implications of these associations—if they are causal—can be striking.
Analyses of observational data also have a role in medical effectiveness research (Black 1996). The evidence that is available from clinical trials will rarely answer all the important questions. Most trials are conducted to establish the efficacy and safety of a single agent in a specific clinical situation. Owing to the limited size of such trials, less common adverse effects of drugs may only be detected in case–control studies, or in analyses of databases from postmarketing surveillance schemes. Also, because follow-up is generally limited, adverse effects occurring many years later will not be identified. If established interventions are later incriminated with adverse effects, there will be ethical, political, and legal obstacles to conducting a new trial. Recent examples of such situations include the controversy surrounding intramuscular administration of vitamin K to newborns and the risk of childhood cancer (Brousson and Klein 1996), and oral contraceptive use and breast cancer (Collaborative Group on Hormonal Factors in Breast Cancer 1996).
Meta-analysis, by promising a precise and definite answer when the magnitude of the underlying risks is small, or when the results from individual studies disagree, appears an attractive proposition both in aetiological studies and in observational effectiveness research.
Confounding, residual confounding, and bias
Meta-analysis of randomized trials is based on the assumption that each trial provides an unbiased estimate of the effect of an experimental treatment, with the variability of the results between the studies being attributed to random variation. The overall effect calculated from a group of sensibly combined and representative randomized trials will provide an essentially unbiased estimate of the treatment effect, with an increase in the precision of this estimate. A fundamentally different situation arises in the case of observational studies. Such studies yield estimates of association which may deviate from true underlying relationships beyond the play of chance. This may be due to the effects of confounding factors, the influence of biases, or both.
Those exposed to the factor under investigation may differ in a number of other aspects that are relevant to the risk of developing the disease in question. Consider, for example, smoking as a risk factor for suicide. Virtually all cohort studies have shown a positive association, with a dose–response relationship being evident between the amount smoked and the probability of committing suicide. Figure 11 illustrates this for four prospective studies of middle-aged men, including the massive cohort of screenees for the Multiple Risk Factor Intervention Trial (MRFIT). Based on over 390 000 men and almost 5 million years of follow-up, a meta-analysis of these cohorts produces very precise and statistically significant estimates of the increase in suicide risk that is associated with smoking different daily amounts of cigarettes: relative rate for 1 to 14 cigarettes 1.43 (95 per cent confidence intervals, 1.06–1.93), for 15 to 24 cigarettes 1.88 (95 per cent confidence intervals, 1.53–2.32), and for 25 or more cigarettes 2.18 (95 per cent confidence intervals, 1.82–2.61).
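The precision of such pooled estimates can be checked directly from the figures quoted above. Because relative rates are combined on the log scale, the standard error of a log relative rate can be recovered from a published 95 per cent confidence interval, a routine step when meta-analysing published results. A short Python check using the suicide figures quoted in the text:

```python
import math

def se_from_ci(lower, upper, z=1.96):
    """Standard error of a log relative rate, recovered from its 95% CI."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

# Pooled relative rates for suicide by cigarettes smoked per day,
# as quoted in the text (point estimate, lower limit, upper limit)
categories = {
    "1-14": (1.43, 1.06, 1.93),
    "15-24": (1.88, 1.53, 2.32),
    "25 or more": (2.18, 1.82, 2.61),
}

z_scores = {}
for label, (rr, lo, hi) in categories.items():
    se = se_from_ci(lo, hi)
    z_scores[label] = math.log(rr) / se
    print(f"{label}: SE of log RR {se:.3f}, z = {z_scores[label]:.2f}")
```

All three z scores exceed 1.96, confirming that each association is ‘statistically significant’—exactly the spurious precision that is at issue here, since the associations are confounded rather than causal.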

Fig. 11 Adjusted relative rates of suicide among middle-aged male smokers compared with non-smokers. Results from four cohort studies adjusted for age, and income, race, cardiovascular disease, diabetes (MRFIT), employment grade (Whitehall I), alcohol use, serum cholesterol, systolic blood pressure, and education (North Karelia and Kuopio). Meta-analysis by fixed effects model. (Adapted from Egger et al. 1998.)

Based on established criteria (Bradford Hill 1965), many would consider the association to be causal—if only it were more plausible. Indeed, it is improbable that smoking is causally related to suicide (Davey Smith et al. 1992). Rather, it is the social and mental states predisposing to suicide that are also associated with the habit of smoking. Factors that are related to both the exposure and the disease under study, confounding factors, may thus distort results. If the factor is known and has been measured, the usual approach is to control for its influence in the analysis. For example, any study assessing the influence of coffee consumption on the risk of myocardial infarction should control for smoking, since smoking is generally associated with drinking larger amounts of coffee and smoking is a cause of coronary heart disease (Leviton et al. 1994). However, even if adjustments for confounding factors have been made in the analysis, residual confounding remains a potentially serious problem in observational research. Residual confounding arises whenever a confounding factor cannot be measured with sufficient precision—a situation which often occurs in epidemiological studies (Phillips and Davey Smith 1991). Confounding is the most important threat to the validity of results from cohort studies whereas many more difficulties, in particular selection biases, arise in case–control studies (Sackett 1979).
Plausible but equally spurious findings?
Implausibility of results, as in the case of smoking and suicide, rarely protects us from reaching misleading conclusions. It is generally easy to produce plausible explanations for the findings of observational research (Egger et al. 1998). For example, observational studies have consistently shown that people eating more fruits and vegetables, which are rich in β-carotene, and people having higher serum β-carotene concentrations, have lower rates of cardiovascular disease and cancer (Jha et al. 1995). Beta-carotene has antioxidant properties and could thus plausibly be expected to prevent carcinogenesis and atherogenesis by reducing oxidative damage to DNA and lipoproteins (Jha et al. 1995). Contrary to many other associations found in observational studies, this hypothesis could be, and was, tested in experimental studies.
The authors performed a meta-analysis of the findings for cardiovascular mortality, comparing the results from six observational studies with those from four randomized trials (Egger et al. 1998). For the observational studies, results relate to a comparison between groups with high and low β-carotene intake or serum β-carotene level, whereas in the trials participants randomized to β-carotene supplements were compared with participants randomized to placebo. Using a fixed-effects model, the meta-analysis of the cohort studies shows a significantly lower risk of cardiovascular death (relative risk reduction 31 per cent, 95 per cent confidence intervals 20–41 per cent, p < 0.0001) (Fig. 12). The results from the randomized trials, however, indicate a moderate adverse effect of β-carotene supplementation (relative increase in the risk of cardiovascular death 12 per cent, 95 per cent confidence intervals 4–22 per cent, p = 0.005). Similarly discrepant results between epidemiological studies and trials were observed for cancer incidence and mortality. This example illustrates that in meta-analyses of observational studies, the analyst may well be simply producing narrow confidence intervals around spurious results.

Fig. 12 Meta-analysis of the association between β-carotene intake and cardiovascular mortality: the results from observational studies are compared with the findings of four large trials. The observational studies indicate considerable benefit whereas the findings from randomized controlled trials show an increase in the risk of death. (Adapted from Egger et al. 1998.)

Exploring sources of heterogeneity
Some observers suggest that meta-analysis of observational studies should be abandoned altogether (Shapiro 1994). The authors disagree, but believe that the statistical combination of studies should not generally be a prominent component of reviews of observational studies. The thorough consideration of possible sources of heterogeneity between observational study results will provide more insights than the mechanistic calculation of an overall measure of effect, which will often be biased.
Two examples are depicted in Fig. 13. The first relates to diet and breast cancer. The hypothesis from ecological analyses (Armstrong and Doll 1975) that higher intake of saturated fat could increase the risk of breast cancer generated much observational research, often with contradictory results. A comprehensive meta-analysis (Boyd et al. 1993) showed an association for case–control but not for cohort studies (odds ratio 1.36 for case–control studies versus relative rate 0.95 for cohort studies comparing highest with lowest category of saturated fat intake, p = 0.0002 for difference, upper panel of Fig. 13). This discrepancy was also shown in two separate large collaborative pooled analyses of cohort and case–control studies (Howe et al. 1990; Hunter et al. 1996). The most likely explanation for this situation is that biases in the recall of dietary items, and in the selection of study participants, have produced a spurious association in the case–control comparisons (Hunter et al. 1996). That differential recall of past exposures may introduce bias is also evident from a meta-analysis of case–control studies of intermittent sunlight exposure and melanoma (Nelemans et al. 1995) (lower panel of Fig. 13). When combining studies in which some degree of blinding to the study hypothesis was achieved, only a small and statistically non-significant effect (odds ratio 1.17, 95 per cent confidence intervals 0.98–1.39) was evident. Conversely, in studies without blinding, the effect was considerably greater and statistically significant (odds ratio 1.84, 95 per cent confidence intervals 1.52–2.25, p = 0.0004).

Fig. 13 Examples of heterogeneity in published observational meta-analyses: saturated fat intake and cancer (upper panel), intermittent sunlight and melanoma (lower panel).

Systematic review including, if appropriate, a formal meta-analysis is clearly superior to the narrative approach to reviewing research. Systematic reviews involve structuring the processes through which a thorough review of previous research is carried out. The issues of the completeness of the evidence identified, the quality of component studies, and the combinability of evidence are made explicit. The unprecedented effort to inject scientific principles into the process of research synthesis, which has taken place over the past decade, has improved the quality of reviews published in recent years (McAlister et al. 1999). However, there is much room for further improvement: less than a quarter of reviews published in six general medical journals in 1996 described how the evidence was identified, evaluated, or integrated (McAlister et al. 1999).

Other shortcomings of systematic reviews and meta-analyses are a consequence of a more general failing with respect to the dissemination of research findings. Although various initiatives mean that the identification of trials for systematic reviews has become an easier task and the quality of reports has improved, this process continues to be highly dependent on the publication of study results in peer-reviewed English-language journals.

Finally, the suggestion that formal meta-analysis of observational studies can be misleading and that insufficient attention is often given to heterogeneity does not mean that a return to the previous practice of highly subjective narrative reviews is called for. Many of the principles of systematic reviews remain: a study protocol should be written in advance, complete literature searches should be carried out, and studies selected and data extracted in a reproducible and objective fashion.
This allows for differences and similarities of the results found in different settings to be inspected, hypotheses to be formulated, and the need for future studies, including randomized controlled trials, to be defined.
Chapter References
Altman, D.G. (1998). Confidence intervals for the number needed to treat. British Medical Journal, 317, 1309–12.
Anonymous (1997). The Cochrane Controlled Trials Register. In The Cochrane Library. The Cochrane Collaboration; Issue 1. Update Software, Oxford.
Antman, E.M., Lau, J., Kupelnick, B., Mosteller, F., and Chalmers, T.C. (1992). A comparison of results of meta-analyses of randomized control trials and recommendations of clinical experts. Journal of the American Medical Association, 268, 240–8.
Armstrong, B. and Doll, R. (1975). Environmental factors and cancer incidence and mortality in different countries with special reference to dietary practices. International Journal of Cancer, 15, 617–31.
Baber, N.S., Wainwright Evans, D., Howitt, G., et al. (1980). Multicentre post-infarction trial of propranolol in 49 hospitals in the United Kingdom, Italy and Yugoslavia. British Heart Journal, 44, 96–100.
Bailey, K. (1987). Inter-study differences: how should they influence the interpretation and analysis of results? Statistics in Medicine, 6, 351–8.
Begg, C.B. and Mazumdar, M. (1994). Operating characteristics of a rank correlation test for publication bias. Biometrics, 50, 1088–99.
Berlin, J., Laird, N.M., Sacks, H.S., and Chalmers, T.C. (1989). A comparison of statistical methods for combining event rates from clinical trials. Statistics in Medicine, 8, 141–51.
Bero, L. and Rennie, D. (1995). The Cochrane Collaboration. Preparing, maintaining, and disseminating systematic reviews of the effects of health care. Journal of the American Medical Association, 274, 1935–8.
Black, N. (1996). Why we need observational studies to evaluate the effectiveness of health care. British Medical Journal, 312, 1215–18.
Boyd, N.F., Martin, L.J., Noffel, M., Lockwood, G.A., and Tritchler, D.L. (1993). A meta-analysis of studies of dietary fat and breast cancer. British Journal of Cancer, 68, 627–36.
Bradford Hill, A. (1965). The environment and disease: association or causation? Proceedings of the Royal Society of Medicine, 58, 295–300.
Brousson, M.A. and Klein, M.C. (1996). Controversies surrounding the administration of vitamin K to newborns: a review. Canadian Medical Association Journal, 154, 307–15.
Chalmers, I. (1979). Randomised controlled trials of fetal monitoring 1973–1977. In Perinatal medicine (ed. O. Thalhammer, K. Baumgarten, and A. Pollak), p. 260. Thieme, Stuttgart.
Chalmers, I. and Altman, D. (1995). Systematic reviews. BMJ Publishing, London.
Chalmers, I. and Haynes, B. (1994). Reporting, updating and correcting systematic reviews of the effects of health care. British Medical Journal, 309, 862–5.
Chalmers, I. and Tröhler, U. (2000). Medical and Philosophical Commentaries, 1773–1795: A 200-year old response to the challenge of keeping abreast of the medical literature. Annals of Internal Medicine, 133, 238–43.
Chalmers, T.C., Celano, P., Sacks, H.S., and Smith, H. (1983). Bias in treatment assignment in controlled clinical trials. New England Journal of Medicine, 309, 1358–61.
Chalmers, I., Enkin, M., and Keirse, M. (1989). Effective care during pregnancy and childbirth. Oxford University Press.
Chalmers, T.C., Frank, C.S., and Reitman, D. (1990). Minimizing the three stages of publication bias. Journal of the American Medical Association, 263, 1392–5.
Chatellier, G., Zapletal, E., Lemaitre, D., Menard, J., and Degoulet, P. (1996). The number needed to treat: a clinically useful nomogram in its proper context. British Medical Journal, 312, 426–9.
Cochrane, A.L. (1979). 1931–1971: a critical review, with particular reference to the medical profession. In Medicines for the year 2000. Office of Health Economics, London.
Colditz, G.A., Brewer, T.F., Berkey, C.S., et al. (1994). Efficacy of BCG vaccine in the prevention of tuberculosis. Journal of the American Medical Association, 271, 698–702.
Collaborative Group on Hormonal Factors in Breast Cancer (1996). Breast cancer and hormonal contraceptives: collaborative reanalysis of individual data on 53297 women with breast cancer and 100239 women without breast cancer from 54 epidemiological studies. Lancet, 347, 1713–27.
Collins, R., Keech, A., Peto, R., et al. (1992). Cholesterol and total mortality: need for larger trials. British Medical Journal, 304, 1689.
Cooper, H. and Rosenthal, R. (1980). Statistical versus traditional procedures for summarising research findings. Psychological Bulletin, 87, 442–9.
Copas, J. (1999). What works?: selectivity models and meta-analysis. Journal of the Royal Statistical Society (Series A), 162, 95–109.
Copas, J.B. and Shi, J.Q. (2000). Reanalysis of epidemiological evidence on lung cancer and passive smoking. British Medical Journal, 320, 417–18.
Davey Smith, G., Phillips, A.N., and Neaton, J.D. (1992). Smoking as ‘independent’ risk factor for suicide: illustration of an artifact from observational epidemiology. Lancet, 340, 709–11.
Davey Smith, G., Song, F., and Sheldon, T.A. (1993). Cholesterol lowering and mortality: the importance of considering initial level of risk. British Medical Journal, 306, 1367–73.
Davidson, R.A. (1986). Source of funding and outcome of clinical trials. Journal of General Internal Medicine, 1, 155–8.
Davis, B. (1992). What price safety? Risk analysis measures need for regulation, but it’s no science. Wall Street Journal, 6 August, 1.
Dear, K.B. and Begg, C.B. (1992). An approach for assessing publication bias prior to performing a meta-analysis. Statistical Science, 7, 237–45.
Deeks, J.J. and Altman, D.G. (2001). Effect measures for meta-analysis of trials with binary outcomes. In Systematic reviews in health care: meta-analysis in context (ed. M. Egger, G. Davey Smith, and D.G. Altman), p. 313. BMJ Books, London.
Deeks, J.J., Altman, D.G., and Bradburn, M.J. (2001). Statistical methods for examining heterogeneity and combining results from several studies in meta-analysis. In Systematic reviews in health care: meta-analysis in context (ed. M. Egger, G. Davey Smith, and D.G. Altman), p. 285. BMJ Books, London.
DerSimonian, R. and Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7, 177–88.
Dickersin, K. (1994). Research registers. In The handbook of research synthesis (ed. H. Cooper and L.V. Hedges). Russell Sage Foundation, New York.
Dickersin, K. and Manheimer, E. (1998). The Cochrane Collaboration: evaluation of health care and services using systematic reviews of the results of randomized controlled trials. Clinics in Obstetrics and Gynecology, 41, 315–31.
Dickersin, K., Min, Y.L., and Meinert, C.L. (1992). Factors influencing publication of research results. Follow-up of applications submitted to two institutional review boards. Journal of the American Medical Association, 267, 374–8.
Dickersin, K., Scherer, R., and Lefebvre, C. (1994). Identifying relevant studies for systematic reviews. British Medical Journal, 309, 1286–91.
Early Breast Cancer Trialists’ Collaborative Group (1988). Effects of adjuvant tamoxifen and of cytotoxic therapy on mortality in early breast cancer. An overview of 61 randomized trials among 28896 women. New England Journal of Medicine, 319, 1681–92.
Easterbrook, P.J., Berlin, J., Gopalan, R., and Matthews, D.R. (1991). Publication bias in clinical research. Lancet, 337, 867–72.
Egger, M. and Davey Smith, G. (1998). Meta-analysis: bias in location and selection of studies. British Medical Journal, 316, 61–6.
Egger, M., Davey Smith, G., Schneider, M., and Minder, C.E. (1997a). Bias in meta-analysis detected by a simple, graphical test. British Medical Journal, 315, 629–34.
Egger, M., Zellweger-Zähner, T., Schneider, M., Junker, C., Lengeler, C., and Antes, G. (1997b). Language bias in randomised controlled trials published in English and German. Lancet, 350, 326–9.
Egger, M., Davey Smith, G., and Phillips, A.N. (1997c). Meta-analysis: principles and procedures. British Medical Journal, 315, 1533–7.
Egger, M., Schneider, M., and Davey Smith, G. (1998). Spurious precision? Meta-analysis of observational studies. British Medical Journal, 316, 140–5.
Feinstein, A.R. (1988). Scientific standards in epidemiological studies of the menace of daily life. Science, 242, 1257–63.
Felson, D.T. (1992). Bias in meta-analytic research. Journal of Clinical Epidemiology, 45, 885–92.
Felson, D.T., Anderson, J.J., and Meenan, R.F. (1990). The comparative efficacy and toxicity of second-line drugs in rheumatoid arthritis. Arthritis and Rheumatology, 33, 1449–61.
Fine, P.E.M. (1995). Variation in protection by BCG: implications of and for heterologous immunity. Lancet, 346, 1339–45.
Freemantle, N., Cleland, J., Young, P., Mason, J., and Harrison, J. (1999). Beta blockade after myocardial infarction: systematic review and meta regression analysis. British Medical Journal, 318, 1709–74.
Freiman, J.A., Chalmers, T.C., Smith, H., and Kuebler, R.R. (1992). The importance of beta, the type II error, and sample size in the design and interpretation of the randomized controlled trial. In Medical uses of statistics (ed. J.C. Bailar and F. Mosteller), p. 357. NEJM Books, Boston, MA.
Galbraith, R. (1988). A note on graphical presentation of estimated odds ratios from several clinical trials. Statistics in Medicine, 7, 889–94.
Gelber, R.D. and Goldhirsch, A. (1993). From the overview to the patient: how to interpret meta-analysis data. Recent Results in Cancer Research, 127, 167–76.
General Accounting Office (1992). Cross design synthesis: a new strategy for medical effectiveness research. GAO, Washington, DC.
GISSI (Gruppo Italiano per lo Studio della Streptochinasi nell’Infarto Miocardico) (1986). Effectiveness of intravenous thrombolytic treatment in acute myocardial infarction. Lancet, i, 397–402.
Givens, G.H., Smith, D.D., and Tweedie, R.L. (1997). Publication bias in meta-analysis: a Bayesian data-augmentation approach to account for issues exemplified in the passive smoking debate. Statistical Science, 12, 221–50.
Glass, G.V. (1976). Primary, secondary and meta-analysis of research. Education Research, 5, 3–8.
Gøtzsche, P.C. (1987). Reference bias in reports of drug trials. British Medical Journal, 295, 654–56.
Gøtzsche, P.C., Podenphant, J., Olesen, M., and Halberg, P. (1992). Meta-analysis of second-line antirheumatic drugs: sample size bias and uncertain benefit. Journal of Clinical Epidemiology, 45, 587–94.
Grégoire, G., Derderian, F., and LeLorier, J. (1995). Selecting the language of the publications included in a meta-analysis: is there a Tower of Babel bias? Journal of Clinical Epidemiology, 48, 159–63.
Green, S., Fleming, T.R., and Emerson, S. (1987). Effects on overviews of early stopping rules for clinical trials. Statistics in Medicine, 6, 361–7.
Greenland, S. (1987). Quantitative methods in the review of epidemiologic literature. Epidemiologic Reviews, 9, 1–30.
Greenland, S. (1994). Quality scores are useless and potentially misleading. American Journal of Epidemiology, 140, 300–2.
Hampton, J.R. (1981). The use of beta blockers for the reduction of mortality after myocardial infarction. European Heart Journal, 2, 259–68.
Hayes, M.V., Taylor, M.S., Bayne, L.R., and Poland, B.D. (1990). Reported versus recorded health service utilization in Grenada, West Indies. Social Science and Medicine,31, 455–60.
Howe, G.R., Hirohata, T., Hislop, T.G., et al. (1990). Dietary factors and risk of breast cancer: combined analysis of 12 case–control studies. Journal of the National Cancer Institute, 82, 561–9.
Hunter, D.J., Spiegelman, D., Adami, H.-O., et al. (1996). Cohort studies of fat intake and the risk of breast cancer—a pooled analysis. New England Journal of Medicine, 334, 356–61.
Huque, M.F. (1988). Experiences with meta-analysis in NDA submissions. Proceedings of the Biopharmaceutical Section of the American Statistical Association, 2, 28–33.
Huston, P. and Moher, D. (1996). Redundancy, disaggregation, and the integrity of medical research. Lancet, 347, 1024–6.
ISIS-2 Collaborative Group (1988). Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17187 cases of suspected acute myocardial infarction: ISIS-2. Lancet, ii, 349–60.
Iyengar, S. and Greenhouse, J.B. (1988). Selection models and the file drawer problem. Statistical Science, 3, 109–35.
Jenicek, M. (1989). Meta-analysis in medicine. Where we are and where we want to go. Journal of Clinical Epidemiology, 42, 35–44.
Jha, P., Flather, M., Lonn, E., Farkouh, M., and Yusuf, S. (1995). The antioxidant vitamins and cardiovascular disease. Annals of Internal Medicine, 123, 860–72.
Jüni, P., Witschi, A., Bloch, R., and Egger, M. (1999). The hazards of scoring the quality of clinical trials for meta-analysis. Journal of the American Medical Association, 282, 1054–60.
Jüni, P., Altman, D.G., and Egger, M. (2001). Assessing the quality of controlled clinical trials. British Medical Journal, in press.
L’Abbé, K.A., Detsky, A.S., and O’Rourke, K. (1987). Meta-analysis in clinical research. Annals of Internal Medicine, 107, 224–33.
Last, J.M. (2001). A dictionary of epidemiology (4th edn). Oxford University Press.
Lau, J., Antman, E.M., Jimenez-Silva, J., Kupelnick, B., Mosteller, F., and Chalmers, T.C. (1992). Cumulative meta-analysis of therapeutic trials for myocardial infarction. New England Journal of Medicine, 327, 248–54.
Laupacis, A., Sackett, D.L., and Roberts, R.S. (1988). An assessment of clinically useful measures of the consequences of treatment. New England Journal of Medicine, 318, 1728–33.
Lefebvre, C. and Clarke, M. (2001). Identifying randomised trials. In Systematic reviews in health care, meta-analysis in context (ed. M. Egger, G. Davey Smith, and D.G. Altman), p. 69. BMJ Books, London.
Leizorovicz, A., Haugh, M.C., and Boissel, J.-P. (1992). Meta-analysis and multiple publications of clinical trial reports. Lancet, 340, 1102–3.
Leviton, A., Pagano, M., Allred, E.N., and El Lozy, M. (1994). Why those who drink the most coffee appear to be at increased risk of disease: a modest proposal. Ecology and Food Nutrition, 31, 285–93.
Light, R.J. and Pillemer, D.B. (1984). Summing up. The science of reviewing research. Harvard University Press, Cambridge, MA.
Lilford, R.J. and Braunholtz, D. (1996). The statistical basis of public policy, a paradigm shift is overdue. British Medical Journal, 313, 603–7.
McAlister, F.A. (2001). Applying the results of systematic reviews at the bedside. In Systematic reviews in health care: meta-analysis in context (ed. M. Egger, G. Davey Smith, and D.G. Altman), p. 373. BMJ Books, London.
McAlister, F.A., Clark, H.D., van Walraven, C., et al. (1999). The medical review article revisited: has the science improved? Annals of Internal Medicine, 131, 947–51.
MacDonald, D., Grant, A., Sheridan-Pereira, M., Boylan, P., and Chalmers, I. (1985). The Dublin randomised controlled trial of intrapartum fetal heart rate monitoring. American Journal of Obstetrics and Gynecology, 152, 524–39.
MacMahon, S. (1992). Lowering cholesterol: effects on trauma death, cancer death and total mortality. Australia and New Zealand Journal of Medicine, 22, 580–2.
May, G.S., Demets, D.L., Friedman, L.M., Furberg, C., and Passamani, E. (1981). The randomized clinical trial: bias in analysis. Circulation, 64, 669–73.
Mitchell, J.R.A. (1981). Timolol after myocardial infarction: an answer or a new set of questions? British Medical Journal, 282, 1565–70.
Moher, D., Jadad, A.R., Nichol, G., Penman, M., Tugwell, P., and Walsh, S. (1995). Assessing the quality of randomized controlled trials: an annotated bibliography of scales and checklists. Controlled Clinical Trials, 16, 62–73.
Moher, D., Jadad, A.R., and Tugwell, P. (1996a). Assessing the quality of randomized controlled trials. Current issues and future directions. International Journal of Technology Assessment in Health Care, 12, 195–208.
Moher, D., Fortin, P., Jadad, A.R., et al. (1996b). Completeness of reporting of trials published in languages other than English: implications for conduct and reporting of systematic reviews. Lancet, 347, 363–6.
Moher, D., Pham, B., Jones, A., et al. (1998). Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet, 352, 609–13.
Mulrow, C.D. (1987). The medical review article: state of the science. Annals of Internal Medicine, 106, 485–8.
Multicentre International Study (1977). Supplementary report. Reduction in mortality after myocardial infarction with long-term beta-adrenoceptor blockade. British Medical Journal, ii, 419–21.
Multicenter Postinfarction Research Group (1983). Risk stratification and survival after myocardial infarction. New England Journal of Medicine, 309, 331–6.
Murphy, D.J., Povar, G.J., and Pawlson, L.G. (1994). Setting limits in clinical medicine. Archives of Internal Medicine, 154, 505–12.
Nelemans, P.J., Rampen, F.H.J., Ruiter, D.J., and Verbeek, A.L.M. (1995). An addition to the controversy on sunlight exposure and melanoma risk: a meta-analytical approach. Journal of Clinical Epidemiology, 48, 1331–42.
Norwegian Multicenter Study Group (1981). Timolol-induced reduction in mortality and reinfarction in patients surviving acute myocardial infarction. New England Journal of Medicine, 304, 801–7.
Nurmohamed, M.T., Rosendaal, F.R., Bueller, H.R., et al. (1992). Low-molecular-weight heparin versus standard heparin in general and orthopaedic surgery: a meta-analysis. Lancet, 340, 152–6.
O’Farrell, N. and Egger, M. (2000). Circumcision in men and the prevalence of HIV infection: a meta-analysis revisited. International Journal of STDs and AIDS, 11, 137–42.
O’Rourke, K. and Detsky, A.S. (1989). Meta-analysis in medical research: strong encouragement for higher quality in individual research efforts. Journal of Clinical Epidemiology, 42, 1021–4.
Oxman, A.D. (2001). The Cochrane Collaboration in the 21st century: ten challenges and one reason why they must be met. In Systematic reviews in health care: meta-analysis in context (ed. M. Egger, G. Davey Smith, and D.G. Altman), p. 459. BMJ Books, London.
Oxman, A.D. and Guyatt, G.H. (1988). Guidelines for reading literature reviews. Canadian Medical Association Journal, 138, 697–703.
Oxman, A.D. and Guyatt, G.H. (1992). A consumer’s guide to subgroup analyses. Annals of Internal Medicine, 116, 78–84.
Pearson, K. (1904). Report on certain enteric fever inoculation statistics. British Medical Journal, 3, 1243–6.
Peduzzi, P., Wittes, J., Detre, K., and Holford, T. (1993). Analysis as randomized and the problem of non-adherence: an example from the Veterans Affairs randomized trial of coronary artery bypass surgery. Statistics in Medicine, 12, 1185–95.
Peto, R., Yusuf, S., and Collins, R. (1985). Cholesterol-lowering trial results in their epidemiologic context. Circulation, 72 (Supplement 3), 451.
Phillips, A.N. and Davey Smith, G. (1991). How independent are ‘independent’ effects? Relative risk estimation when correlated exposures are measured imprecisely. Journal of Clinical Epidemiology, 44, 1223–31.
Ravnskov, U. (1992). Cholesterol lowering trials in coronary heart disease: frequency of citation and outcome. British Medical Journal, 305, 15–19.
Rembold, C.M. (1998). Number needed to screen: development of a statistic for disease screening. British Medical Journal, 317, 307–12.
Reynolds, J.L. and Whitlock, R.M.L. (1972). Effects of a beta-adrenergic receptor blocker in myocardial infarction treated for one year from onset. British Heart Journal, 34, 252–9.
Rochon, P.A., Gurwitz, J.H., Simms, R.W., et al. (1994). A study of manufacturer-supported trials of nonsteroidal anti-inflammatory drugs in the treatment of arthritis. Archives of Internal Medicine, 154, 157–63.
Rosenthal, R. (1990). An evaluation of procedures and results. In The future of meta-analysis (ed. K.W. Wachter and M.L. Straf), p. 123. Russell Sage Foundation, New York.
Rossouw, J.E., Lewis, B., and Rifkind, B.M. (1990). The value of lowering cholesterol after myocardial infarction. New England Journal of Medicine, 323, 1112–19.
Sackett, D.L. (1979). Bias in analytical research. Journal of Chronic Diseases, 32, 51–63.
Sackett, D.L. and Gent, M. (1979). Controversy in counting and attributing events in clinical trials. New England Journal of Medicine, 301, 1410–12.
Sackett, D.L., Deeks, J.F., and Altman, D. (1996). Down with odds ratios! Evidence-Based Medicine, 1, 164–7.
Schulz, K.F., Chalmers, I., Hayes, R.J., and Altman, D. (1995). Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. Journal of the American Medical Association, 273, 408–12.
Shapiro, S. (1994). Meta-analysis/Shmeta-analysis. American Journal of Epidemiology, 140, 771–8.
Simes, R.J. (1987). Confronting publication bias: a cohort design for meta-analysis. Statistics in Medicine, 6, 11–29.
Singh, R. and Singh, S. (1994). Research and doctors. Lancet, 344, 546.
Smeeth, L., Haines, A., and Ebrahim, S. (1999). Numbers needed to treat derived from meta-analyses—sometimes informative, usually misleading. British Medical Journal, 318, 1548–51.
Smith, B.J., Darzins, P.J., Quinn, M., and Heller, R.F. (1992). Modern methods of searching the medical literature. Medical Journal of Australia, 157, 603–11.
Song, F. (1999). Exploring heterogeneity in meta-analysis: is the L’Abbé Plot useful? Journal of Clinical Epidemiology, 52, 725–30.
Song, F., Abrams, K.R., Jones, D.R., and Sheldon, T.A. (1998). Systematic reviews of trials and other studies. Health Technology Assessment, 2, 19.
Spiegelhalter, D.J., Myles, J.P., Jones, D.R., and Abrams, K.R. (1999). An introduction to Bayesian methods in health technology assessment. British Medical Journal, 319, 508–12.
Sterling, T.D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. Journal of the American Statistical Association, 54, 30–4.
Stern, J.M. and Simes, R.J. (1997). Publication bias: evidence of delayed publication in a cohort study of clinical research projects. British Medical Journal, 315, 640–5.
Sterne, J.A.C. and Egger, M. (2001). Funnel plots for detecting bias in meta-analysis: guidelines on choice of axis. Journal of Clinical Epidemiology, in press.
Sterne, J.A.C., Gavaghan, D.J., and Egger, M. (2000). Publication and related bias in meta-analysis: power of statistical tests and prevalence in the literature. Journal of Clinical Epidemiology, 53, 1119–29.
Sterne, J.A.C., Bradburn, M.J., and Egger, M. (2001). Meta-analysis in Stata. In Systematic reviews in health care: meta-analysis in context (ed. M. Egger, G. Davey Smith, and D.G. Altman). BMJ Books, London.
Su, X.Y. and Li Wan Po, A. (1996). Combining event rates from clinical trials: comparison of Bayesian and classical methods. Annals of Pharmacotherapy, 30, 460–5.
Susser, M. (1977). Judgment and causal inference: criteria in epidemiologic studies. American Journal of Epidemiology, 105, 1–15.
Teagarden, J.R. (1989). Meta-analysis: whither narrative review? Pharmacotherapy, 9, 274–84.
Thompson, S.G. (1994). Why sources of heterogeneity in meta-analysis should be investigated. British Medical Journal, 309, 1351–5.
Thompson, S.G. and Sharp, S. (1998). Explaining heterogeneity in meta-analysis: a comparison of methods. Statistics in Medicine, 18, 2693–708.
Thornley, B. and Adams, C. (1998). Content and quality of 2000 controlled trials in schizophrenia over 50 years. British Medical Journal, 317, 1181–4.
Tramèr, M.R., Reynolds, D.J.M., Moore, R.A., and McQuay, H.J. (1997). Impact of covert duplicate publication on meta-analysis, a case study. British Medical Journal, 315, 635–40.
Van Howe, R.S. (1999). Circumcision and HIV infection: review of the literature and meta-analysis. International Journal of STDs and AIDS, 10, 8–16.
Vevea, J.L. and Hedges, L.V. (1995). A general linear model for estimating effect size in the presence of publication bias. Psychometrika, 60, 419–35.
Withering, W. (1785). An account of the foxglove and some of its medical uses. M. Swinney for G.G.J. and J. Robinson, Birmingham.
Woodhill, J.M., Palmer, A.J., Leelarthaepin, B., McGilchrist, C., and Blacket, R.B. (1978). Low fat, low cholesterol diet in secondary prevention of coronary heart disease. Advances in Experimental Medicine and Biology, 109, 317–30.
Yusuf, S., Peto, R., Lewis, J., Collins, R., and Sleight, P. (1985). Beta blockade during and after myocardial infarction: an overview of the randomized trials. Progress in Cardiovascular Diseases, 27, 335–71.

