6.10 Concepts of validity in epidemiological research

Oxford Textbook of Public Health

Sander Greenland

Inference and validity

Validity in prediction problems

Comparison validity

Follow-up validity

Specification validity

Measurement validity

Summary of example

Validity in causal inference

Comparison validity

Follow-up validity

Specification validity

Measurement validity

Summary of example

Validity in case–control and retrospective cohort studies

Case–control studies

Summary of example

Retrospective cohort studies

Conclusion

Chapter References

Some of the major validity concepts in epidemiological research are outlined in this chapter. The contents are organized into three main sections: validity in prediction problems, validity in causal inference, and special validity problems in case–control and retrospective cohort studies. Familiarity with the basics of epidemiological study design and a number of terms of epidemiological theory, amongst them risk, competing risk, average risk, population at risk, and rate, is assumed. A number of textbooks provide more background and depth than can be given here. Among them, Checkoway et al. (1989), Walker (1991), Kelsey et al. (1996), and Chapters 1 to 11 of Rothman and Greenland (1998) provide epidemiological treatments, while Breslow and Day (1980, 1987), Clayton and Hills (1993), and Chapters 12 to 21 of Rothman and Greenland (1998) focus on statistical details.

Despite similarities, there is considerable diversity and conflict amongst the classification schemes and terminologies employed in various textbooks. This diversity reflects that there is no unique way of classifying validity conditions, biases, and errors. It follows that the classification schemes employed here and elsewhere should not be regarded as anything more than convenient frameworks for organizing discussions of study validity and epidemiological inference.

Several important study designs, including prevalence studies and ecological studies, are not discussed in this chapter. Such studies require consideration of the above validity conditions and also require special considerations of their own. Further details of these and other designs can be found in the general textbooks cited above. For a review of the special problems of ecological studies, see Greenland and Robins (1994) and Morgenstern (1998). Meta-analytic methods are discussed by Greenland (1994, 1998a).

Also not covered are a number of central problems of epidemiological inference, including choice of effect measures and interpretation of statistics. Critical discussions of effect measures are given by Greenland (1987), Greenland and Robins (1988), Greenland et al. (1986, 1991), and Chapter 4 of Rothman and Greenland (1998). Oakes (1990) and Barnett (1999) provide introductions to competing schools of statistical inference. They discuss shortcomings of the prevailing approaches to statistics and alternative approaches as well; see also Berger and Berry (1988), Goodman and Royall (1988), and Greenland (1998b). Rubin (1991) contrasts different statistical approaches to causal inference, and Greenland et al. (1999a) provide an introduction to graphical methods in causal inference. Poole (1987a, b), Goodman (1992, 1993, 1999), Greenland (1990, 1993b), and Chapter 12 of Rothman and Greenland (1998) discuss the use and misuse of statistical inference in epidemiological research.

Inference and validity

Epidemiological inference is the process of drawing inferences from epidemiological data, such as prediction of disease patterns or identification of causes of diseases or epidemics. These inferences must often be made without the benefits of direct experimental evidence or established theory about disease aetiology. Consider the problem of predicting the risk and incubation (induction) time for AIDS among people infected with HIV-1. Unlike an experiment, in which the exposure is administered by the investigator, the date of HIV-1 infection cannot be accurately estimated in most cases; furthermore, the mechanism by which ‘silent’ HIV-1 infection progresses to AIDS is not known with certainty. Nevertheless, some prediction must be made from the available data in order to prepare effectively for future health-care needs.

As another example, consider the problem of estimating how much excess risk of coronary heart disease (if any) is produced by coffee drinking. Unlike an experimental exposure, coffee drinking is self-selected; it appears that people who use coffee are more likely to smoke than non-users and probably tend to differ in many other behaviours as well (Greenland 1993a). As a result, even if coffee use is harmless, we should not expect to observe the same pattern of heart disease in users and non-users. Thus small coffee effects should be very difficult to disentangle from the effects of other behaviours. Nevertheless, because of the high prevalence of coffee use and the high incidence of heart disease, determination of the effect of coffee on heart disease risk may be of considerable public health importance.

In both these examples, and in general, inferences will depend on evaluating the validity of the available studies, that is, the degree to which the studies meet basic logical criteria for absence of bias. This chapter outlines and illustrates major concepts of validity in epidemiological research as applied in three settings: prediction from one population to another, causal inference from cohort studies, and causal inference from case–control and retrospective cohort studies. Parallel aspects of each application will be emphasized. In particular, each problem requires consideration of comparison validity, follow-up validity, specification validity, and measurement validity. Case–control studies require the additional consideration of case- and control-selection validity, and are often subject to additional sources of measurement error beyond those occurring in prospective cohort studies. Similar problems arise in retrospective cohort studies.

Validity in prediction problems

The following prediction problem will be used to illustrate the basic concepts of validity in epidemiological inference. A health clinic for homosexual men is about to begin enrolling HIV-1-negative men in an unrestricted programme that will involve retesting each participant for HIV-1 antibodies at 6-month intervals. It can be expected that, in the course of the programme, many participants will seroconvert to positive HIV-1 status. Such participants will invariably ask difficult questions, such as: What are my chances of developing AIDS over the next 5 years? How many years do I have before I develop AIDS? In attempting to answer these questions, it will be convenient to refer to such participants (that is, those who seroconvert) as the target cohort. Even though membership of this cohort is not determined in advance, it will be the target of our predictions. It will also be convenient to refer to the time from HIV-1 infection until the onset of clinical AIDS as the AIDS incubation time. Reasonable answers could be provided to a participant’s questions if it were possible to predict AIDS incubation times accurately, although we would also have to estimate the time elapsed between infection and the first positive test.

There might be someone who responds to the questions posed above with the following anecdote: ‘I’ve known several men just like the ones in this cohort, and they all developed AIDS within 5 years after a positive HIV-1 test’. No trained scientist would conclude from this anecdote that all or most of the target cohort will develop AIDS within 5 years of seroconversion. One reason is that the men in the anecdote cannot be ‘just like’ men in our cohort in every respect: they may have been older or younger when they were infected; they may have experienced a greater degree of stress following their infection; they may have been heavier smokers, drinkers, or drug users, and so on. In other words, we know that the anecdotal men and their postinfection life events could not have been exactly the same as the men in our target cohort with respect to all factors that affect AIDS incubation time, including measured, unmeasured, and unknown factors. Furthermore, it may be that some or all of the men referred to in the anecdote had been infected long before they were first tested, so that (unlike men in our target cohort) the time from their first positive test to AIDS onset was much shorter than the time from seroconversion to AIDS onset.

Any reasonable predictions must be based on observing the distribution of AIDS incubation times in another cohort. Suppose that we obtain data from a study of homosexual men who underwent regular HIV-1 testing, and then assemble from these data a study cohort of men who were observed to seroconvert. Suppose also that most of these men were followed for at least 5 years after seroconversion. It cannot be expected that any member of this study is going to be ‘just like’ any member of our target cohort in every respect. Nevertheless, if it was possible to identify no difference between the two cohorts with respect to factors that affect incubation time, it might be argued that the study cohort could serve as a point of reference for predicting incubation times in the target cohort. Thus, henceforth the study cohort shall be referred to as our reference cohort. Note that our reference and target cohorts may have originated from different populations; for example, the clinic generating the target cohort could be in New York, but the study that generated the reference cohort may have been in San Francisco. For both the target and reference cohorts, the actual times of HIV-1 infection will have to be imputed, based on the dates of the last negative and the first positive tests.

Suppose that our statistical analysis of data from the reference cohort produces estimates of 0.05, 0.25, and 0.45 for the average risk of contracting AIDS within 2, 5, and 8 years of HIV-1 infection, respectively. What conditions would be sufficient to guarantee the validity of these figures as estimates or predictions of the proportion of the target cohort that would develop AIDS within 2, 5, and 8 years of infection? If by ‘valid’ we mean that any discrepancy between our predictions and the true target proportions is purely random (unpredictable in principle), the following conditions would be sufficient.

Comparison validity (C). The distribution of incubation times in the target cohort will be approximately the same as the distribution in the reference cohort.

Follow-up validity (F). Within the reference cohort, the risk of censoring (that is, follow-up ended by an event other than AIDS) is not associated with risk of AIDS.

Specification validity (Sp). The distribution of incubation times in the reference cohort can be closely approximated by the statistical model used to compute the estimates. For example, if one employs a log-normal distribution to model the distribution of incubation times in the reference cohort, this model should be approximately correct.

Measurement validity (M). All measurements of variables used in the analysis closely approximate the true values of the variables. In particular, each imputed time of HIV-1 infection closely approximates the true infection time, and each reported time of AIDS onset closely approximates a clinical event defined as AIDS onset.

The first condition concerns the external validity of making predictions about the target cohort based on the reference cohort. The remaining conditions concern the internal validity of the predictions as estimates of average risk in the reference cohort. The following sections will explore the meaning of these conditions in prediction problems.

Comparison validity

Comparison validity is probably the easiest condition to describe, although it is difficult to evaluate. Intuitively, it simply means that the distribution of incubation times in the target cohort could be almost perfectly predicted from the distribution of incubation times in the reference cohort, if the incubation times were observed without error and there was no loss to follow-up. Other ways of stating this condition are that the two cohorts are comparable or exchangeable with respect to incubation times, or that the AIDS experience of the target cohort can be predicted from the experience of the reference cohort.

Confounding

If the two cohorts are not comparable, some or all of our risk estimates for the target cohort based on the reference cohort will be biased as a result. This bias is sometimes called confounding. There has been much research on methods for identifying and adjusting for such bias (see the textbooks cited above). (The term ‘bias’ is here used in the informal epidemiological sense, and corresponds to the formal statistical concept of inconsistency.)

To evaluate comparison validity, we must investigate whether the two cohorts differ on any factors that influence incubation time. If so, we cannot reasonably expect the incubation time distributions of the two cohorts to be comparable. A factor responsible for some or all of the confounding in an estimate is called a confounder or confounding variable, the estimate is said to be confounded by the factor, and the factor is said to confound the estimate.

To illustrate these concepts, suppose that men infected at younger ages tend to have longer incubation times and that the members of the reference cohort are on average younger than members of the target cohort. If there were no other differences to counterbalance this age difference, we should then expect that members of the reference cohort will on average have longer incubation times than members of the target cohort. Consequently, unadjusted predictions of risk for the target cohort derived from the reference cohort would be biased (confounded) by age in a downward direction. In other words, age would be a confounder for estimating risk in the target cohort, and confounding by age would result in underestimation of the proportion of men in the target cohort who will develop AIDS within 5 years.

Suppose that it is possible to compute the age at infection of men in the reference cohort, and that within 1-year strata of age, for instance, the target and reference cohorts had virtually identical distributions of incubation times. The age-specific estimates of risk derived from the reference cohort would then be free of age confounding and so could be used as unconfounded estimates of age-specific risk for men in the target cohort. Also, if we wished to construct unconfounded estimates of average risk in the entire target cohort, we could do so via the technique of age standardization.

To illustrate, let $P_x$ denote our estimate of the average risk of AIDS within 5 years of infection among members of the reference cohort who become infected at age $x$. Let $W_x$ denote the proportion of men in the target cohort who are infected at age $x$. Then the estimated average risk of AIDS within 5 years of infection, standardized to the target cohort’s age distribution, is simply the average of the age-specific reference estimates $P_x$, weighted by the age distribution (at infection) of the target cohort; algebraically, this average is the sum of the products $W_x P_x$ over all ages and is denoted by $\sum_x W_x P_x$. Considered as an estimate of the overall proportion of the target cohort that will develop AIDS within 5 years of HIV-1 infection, the standardized proportion $\sum_x W_x P_x$ will be free of age confounding.
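The standardization just described is a weighted sum, and a short sketch may make the arithmetic concrete. The age bands, risks, and weights below are hypothetical values chosen only for illustration (the text itself uses 1-year strata), and the function name is an assumption of this sketch.

```python
def standardized_risk(weights, risks):
    """Return the sum over age strata x of W_x * P_x.

    weights: stratum -> proportion of the standard population in that stratum (W_x)
    risks:   stratum -> estimated stratum-specific risk from the reference cohort (P_x)
    """
    if set(weights) != set(risks):
        raise ValueError("weights and risks must cover the same age strata")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("the weights W_x must sum to 1")
    return sum(weights[x] * risks[x] for x in weights)


# Hypothetical 5-year risks by age at infection, estimated in the reference cohort.
P = {"20-29": 0.20, "30-39": 0.25, "40-49": 0.35}

# Hypothetical age-at-infection distributions; the reference cohort is younger.
W_target = {"20-29": 0.2, "30-39": 0.5, "40-49": 0.3}
W_reference = {"20-29": 0.5, "30-39": 0.3, "40-49": 0.2}

standardized = standardized_risk(W_target, P)        # standardized to the target cohort
crude_reference = standardized_risk(W_reference, P)  # unadjusted reference-cohort risk

print(f"standardized to target: {standardized:.3f}")
print(f"crude reference risk:   {crude_reference:.3f}")
```

With these made-up figures the crude reference risk (0.245) falls below the risk standardized to the target cohort (0.270), which is the downward age confounding described earlier for a younger reference cohort.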

The preceding illustration brings forth an important and often overlooked point: when employing standardization to adjust for potential biases, the choice of standard distribution should never be considered arbitrary. In fact, the standard distribution should always be taken from the target cohort or the population about which inferences will be made. If inferences are to be made about several different groups, it may be necessary to compute several different standardized estimates.

Methods for removing bias in estimates by taking account of variables responsible for some or all of the bias are known as adjustment or covariate control methods. Standardization is perhaps the oldest and simplest example of such a method; methods based on multivariate models, which are discussed below, are more complex.

Unmeasured confounders

If all confounders were measured accurately, comparison validity could be achieved simply by adjusting for these confounders (although various technical problems might arise when attempting to do so). Nevertheless, in any non-randomized study we would ordinarily be able to think of a number of possible confounders that had not been measured or had been measured only in a very poor fashion. In such cases, it may still be possible to predict the direction of uncontrolled confounding by examining the manner in which people were selected into the target and reference cohorts from the population at large. If the cohorts are derived from populations with different distributions of predictors of the outcome, or the predictors themselves are associated with admission differentially across the cohorts, these predictors will become confounders in the analysis.

To illustrate this approach, suppose that HIV-1 infection via an intravenous route (for example through needle sharing) leads to shorter incubation times than HIV-1 infection through sexual activity. Suppose also that the reference cohort had excluded all or most intravenous drug users, whereas the target cohort was non-selective in this regard. Then incubation times in the target cohort will on average be shorter than times in the reference cohort owing to the presence of intravenously infected people in the target cohort. Thus we should expect the results from the reference cohort to underestimate average risks of AIDS onset in the target cohort.
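The direction of this bias can be made explicit with a small calculation. The symbols are introduced here for illustration only and are not part of the chapter's notation: $q$ is the proportion of the target cohort infected intravenously, and $R_{iv}$ and $R_{s}$ are the average risks of AIDS within a fixed period following intravenous and sexual infection, respectively. The sketch assumes that the reference cohort contains no intravenously infected men and that the two cohorts are otherwise comparable.

```latex
\begin{align*}
R_{\text{target}}    &= q\,R_{iv} + (1 - q)\,R_{s}, \\
R_{\text{reference}} &= R_{s}, \\
R_{\text{reference}} - R_{\text{target}} &= -\,q\,(R_{iv} - R_{s}) < 0
  \quad \text{whenever } q > 0 \text{ and } R_{iv} > R_{s}.
\end{align*}
```

Under these assumptions the size of the underestimate grows with both the proportion infected intravenously and the difference in risk between the two routes of infection.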

Random sampling and confounding

Suppose, for the moment, that our reference cohort had been formed by taking a random sample of the target cohort. Can predictions about the target made from such a random sample still be confounded? With the above definition of confounding, the answer is yes. To see this, note for example that by chance alone men in our sample reference cohort could be younger on average than the total target; this age difference would in turn downwardly bias the unadjusted risk predictions if men had longer incubation times at younger ages.

Nevertheless, random sampling can help to ensure that the distribution of the reference cohort is not too far from the distribution of the target cohort. In essence, the probability of severe confounding can be made as small as necessary by increasing the sample size. Furthermore, if random sampling is used, any confounding left after adjustment will be accounted for by the standard errors of the estimates, provided that the correct statistical model is used to compute the estimates and standard errors. The latter condition is examined below under the section on specification validity.
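A small simulation can illustrate how chance imbalance between a random sample and its source shrinks as the sample grows. Everything in this sketch is hypothetical: the target cohort of 5000 men, the normal distribution of ages, the sample sizes, and the function names are assumptions made for illustration, and age stands in for any predictor of incubation time.

```python
import random
import statistics

random.seed(0)  # fixed seed so that repeated runs give the same numbers

# Hypothetical target cohort: ages at infection for 5000 men.
target_ages = [random.gauss(35, 8) for _ in range(5000)]
target_mean = statistics.fmean(target_ages)


def mean_age_gap(sample_size):
    """Difference in mean age between one random sample and the full target cohort."""
    sample = random.sample(target_ages, sample_size)  # sampling without replacement
    return statistics.fmean(sample) - target_mean


def typical_gap(sample_size, repetitions=200):
    """Average absolute gap in mean age over repeated random samples."""
    gaps = [abs(mean_age_gap(sample_size)) for _ in range(repetitions)]
    return statistics.fmean(gaps)


typical = {n: typical_gap(n) for n in (25, 400, 5000)}
for n, gap in typical.items():
    print(f"sample size {n:>4}: typical age gap {gap:.3f} years")
```

Small samples can differ noticeably from the target in mean age by chance alone, whereas a sample comprising the entire target cohort cannot differ at all. The simulation addresses only chance imbalance in a measured variable; it says nothing about systematic differences between cohorts.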

Follow-up validity

In any cohort study covering an extended period of risk, subjects will be followed for different lengths of time. Some subjects will be lost to follow-up before the study ends. Others will be removed from the study by an event that precludes AIDS onset, which in this setting is death before AIDS onset from fatal accidents, fatal myocardial infarctions, and so on. Because subjects come under study at different times, those who are neither lost to follow-up nor removed by death before developing AIDS will still have had different lengths of follow-up when the study ends; traditionally, a subject still under follow-up at study end is said to have been ‘withdrawn from study’ at the time of study end.

Suppose that we wish to estimate the average risk of AIDS onset within 5 years of infection. The data from a member of the reference cohort who is not observed to develop AIDS but is also not followed for the full 5 years from infection are said to be censored for the outcome of interest (AIDS within 5 years of infection). Consider, for example, a subject killed in a car crash 2 years after infection but before contracting AIDS: the incubation time of this subject was censored at 2 years of follow-up.

Follow-up validity means that over any span of follow-up time, risk of censoring is unassociated with risk of the outcome of interest. In our example, follow-up validity means that over any span of time following infection, risk of censoring (loss, withdrawal, or death before AIDS) is unassociated with risk of AIDS. All common methods for estimating risk from situations in which censoring occurs (for example person-years, life table, and Kaplan–Meier methods) are based on the assumption of follow-up validity. Given follow-up validity, it can be expected that, at any time t after infection, the distribution of incubation times will be the same for subjects lost or withdrawn at t and subjects whose follow-up continues beyond t.
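As an illustration of one of the methods named above, the following is a minimal sketch of the Kaplan–Meier (product-limit) calculation. The follow-up times and outcomes are hypothetical, the function name is an assumption of this sketch, and the result is interpretable as a risk estimate only if follow-up validity holds.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one observed event.

    times:  follow-up time for each subject (years since infection)
    events: True if AIDS onset was observed at that time, False if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_censored = 0
        # Collect all subjects whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                n_events += 1
            else:
                n_censored += 1
            i += 1
        if n_events > 0:
            # Multiply by the proportion of those at risk who remain AIDS-free at t.
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        # Subjects with events and censored subjects both leave the risk set.
        n_at_risk -= n_events + n_censored
    return curve


# Hypothetical reference cohort of eight men.
times = [1, 2, 2, 3, 4, 5, 5, 5]
events = [True, False, True, True, False, True, False, False]

curve = kaplan_meier(times, events)
risk_by_5_years = 1.0 - curve[-1][1]
print(curve)
print(f"estimated risk of AIDS within 5 years: {risk_by_5_years:.2f}")
```

Censored subjects contribute to the number at risk until their follow-up ends and are then dropped without being counted as AIDS-free or as cases. That treatment is justified only if, at each time, those censored have the same risk of AIDS as those who remain under follow-up.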

Violations of follow-up validity can result in biased predictions of risk; such violations are referred to as follow-up bias or biased censoring. To illustrate, suppose that younger reference subjects tend to have longer incubation times (that is, lower risks) and are lost to follow-up at a higher rate than older reference subjects. In other words, lower-risk subjects are lost at a higher rate than higher-risk subjects. Then, after enough time, the average risk of AIDS in the observed portion of the reference cohort will tend to be higher than the average risk in the full reference cohort (as the latter includes both censored and uncensored subject experience), so that unadjusted estimates will tend to overestimate risk.

The follow-up bias in the last illustration would not affect the age-specific estimates of risk (where age refers to age at infection). Consequently, the age bias in follow-up would not produce bias in age-standardized estimates of risk. More generally, if follow-up bias can be traced to a particular variable that is a predictor of both the outcome of interest and censoring, bias in the estimates can be removed by adjusting for that variable. Thus, some forms of follow-up bias can be dealt with in the same manner as confounding.
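The following sketch works through this argument with expected counts. All figures are hypothetical, and the set-up is deliberately simplified: losses are assumed to occur at the start of follow-up and to be unrelated to risk within each age group, so that the risk observed among those remaining equals the true age-specific risk.

```python
# Hypothetical reference cohort: size, true 5-year risk, and proportion lost, by age group.
strata = {
    "younger": {"n": 1000, "risk": 0.20, "lost": 0.50},
    "older": {"n": 1000, "risk": 0.40, "lost": 0.10},
}

total_n = sum(s["n"] for s in strata.values())
true_risk = sum(s["n"] * s["risk"] for s in strata.values()) / total_n

# Expected numbers remaining under observation, and expected cases among them.
observed_n = {k: s["n"] * (1 - s["lost"]) for k, s in strata.items()}
observed_cases = {k: observed_n[k] * strata[k]["risk"] for k in strata}

# Unadjusted risk in the observed portion of the cohort.
crude_observed = sum(observed_cases.values()) / sum(observed_n.values())

# Age-specific risks in the observed portion, then standardized to the
# age distribution of the full reference cohort.
age_specific = {k: observed_cases[k] / observed_n[k] for k in strata}
standardized = sum(strata[k]["n"] / total_n * age_specific[k] for k in strata)

print(f"true risk in full cohort:     {true_risk:.3f}")
print(f"crude risk among observed:    {crude_observed:.3f}")
print(f"age-standardized among obs.:  {standardized:.3f}")
```

With these figures the crude risk among observed subjects (about 0.329) exceeds the true risk (0.300), while the age-standardized estimate recovers the true value. The recovery depends on the assumption that, within each age group, loss is unrelated to risk.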

Specification validity

All statistical techniques, including so-called ‘distribution-free’ or ‘non-parametric’ methods, as well as basic contingency table methods, are derived by assuming the validity of a sampling model or error distribution. A common example is the binomial model, which is discussed in all the textbooks cited in the introduction. For parametric methods, the sampling model is a mathematical formula that expresses the probability of observing the various possible data patterns as a function of certain unknown constants (parameters). Although the parameters of this model may be unknown, the mathematical form of this model incorporates only known or purely random aspects of the data-generation process; unknown systematic aspects of this process (such as most follow-up and selection biases) will not be accounted for by the model.

All parametric statistical techniques also assume a structural model, which is a mathematical formula that expresses the parameters of the sampling model as a function of study variables. A common example is the logistic model (Breslow and Day 1980; Checkoway et al. 1989; Kelsey et al. 1996; Rothman and Greenland 1998). The structural model is most often incorporated into the sampling model, and the combination is referred to as the statistical model. An estimate can be said to have specification validity if it is derived using a statistical model that is correct or nearly so.
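The distinction between the two components can be written out for the binomial and logistic examples just mentioned. The notation is introduced here for illustration and is an assumption of this sketch: $Y_x$ is the number of cases among $N_x$ subjects with covariate value $x$, $p_x$ is their risk, and $\alpha$ and $\beta$ are unknown parameters.

```latex
\begin{align*}
\text{Sampling model:}   \quad & Y_x \sim \mathrm{Binomial}(N_x,\, p_x), \\
\text{Structural model:} \quad & \log\!\left(\frac{p_x}{1 - p_x}\right) = \alpha + \beta x, \\
\text{Statistical model:} \quad & \Pr(Y_x = y) = \binom{N_x}{y}\, p_x^{\,y}\,(1 - p_x)^{N_x - y},
  \qquad p_x = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}.
\end{align*}
```

In this form the sampling model describes the random variation in $Y_x$ given $p_x$, and the structural model describes how $p_x$ depends on the covariate; either component may be mis-specified independently of the other.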

If either the sampling model or the structural model used for analysis is incorrect, the resulting estimates may be biased. Such bias is sometimes called specification bias, while the use of an incorrect model is known as model mis-specification or specification error. Even when mis-specification does not lead to bias, it can lead to invalidity of statistical tests and confidence intervals.

The true structural relation among the study variables is almost never known in studies of human disease. Furthermore, in the absence of random sampling and randomization, the true sampling process (that is, the exact process leading people to enter and stay in the study groups) will also be unknown. It follows that we should ordinarily expect some degree of specification error in an epidemiological analysis. Minimizing such error largely consists of contrasting the statistical model against the data and against any available information about the processes that generated the data (McCullagh and Nelder 1989), such as prior information on demographic patterns of incidence.

Many statistical techniques in epidemiology are based on assuming some type of logistic model. Examples include all the popular adjusted odds ratios, such as the Woolf, maximum likelihood, and Mantel–Haenszel estimates, as well as tests for odds ratio heterogeneity. Classical ‘indirect’ adjustment of rates and other comparisons of standardized morbidity ratios depend on similar multiplicative models for their validity (Breslow and Day 1987).

The degree of bias in traditional epidemiological analysis methods when the model assumptions fail has not been extensively studied. A few traditional methods, such as directly standardized comparisons and the Mantel–Haenszel test, remain valid under a wide variety of structural models. In addition, risk regression has been extended to situations involving more general models than assumed in classical theory (Breslow and Day 1987; Hastie and Tibshirani 1990). Leamer (1978) and White (1993) give more details on the effects of specification error in multiple regression problems, while Maldonado and Greenland (1994) and Greenland and Maldonado (1994) examine the implications of specification error in epidemiology.

Measurement validity

An estimate from a study can be said to have measurement validity if it suffers from no bias due to errors in measuring the study variables. Unfortunately there are sources of measurement error in nearly all studies, and nearly all sources of measurement error will contribute to bias in estimates. Thus evaluation of measurement validity primarily focuses on identifying sources of measurement error and attempting to deduce the direction and magnitude of bias produced by these sources.

To aid in the task of identifying sources of measurement error, it may be useful to classify such errors according to their source. Errors from specific sources can then be further classified according to characteristics that are predictive of the direction of the bias they produce. One classification scheme divides errors into three major categories, according to their source:

procedural error, arising from mistakes or defects in measurement procedures

proxy-variable error, arising from using a ‘proxy’ variable as a substitute for an actual variable of interest

construct error, arising from ambiguities in the definition of the variables.

Regardless of their source, errors can be divided into two basic types, differential and non-differential, according to whether the direction or magnitude of error depends on the true values of the study variables. Two different sources of error may be classified as dependent or independent, according to whether or not the direction or magnitude of the error from one source depends on the direction or magnitude of the error from the other source. Finally, errors in continuous measurements can be factored into systematic and random components. As described in the following subsections, these classifications have important implications for bias.

Procedural error

Procedural error is the most straightforward to imagine. It includes errors in recall when variables are measured through retrospective interview (for example, mistakes in remembering all medications taken during pregnancy). It also includes coding errors, errors in calibration of instruments, and all other errors in which the target of measurement is well defined and the attempts at measurement are direct but the method of measurement is faulty. In our example, one target of measurement is HIV-1 antibody presence in blood. All available tests for antibody presence are subject to error (false negatives and false positives), and these errors can be considered to be procedural errors of measurement.

Proxy-variable error

Proxy-variable error is distinguished from procedural error in that use of proxies necessitates imputation and hence virtually guarantees that there will be measurement error. In our example, we must impute the time of HIV-1 infection. For instance, we might take as a proxy the infection time computed as 6 weeks before the mid-point between the last negative test and the first positive test for HIV-1 antibodies. Even if our HIV-1 tests are perfect, this measurement incorporates error if (as is certainly the case) time of infection does not always occur 6 weeks before the mid-point between the last negative and first positive tests.
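The proxy described above is easily computed, which helps to show how mechanical the imputation is. The function name and the example dates below are hypothetical; the 6-week lag is the one used in the text.

```python
from datetime import date, timedelta


def impute_infection_date(last_negative, first_positive, lag_weeks=6):
    """Proxy infection date: lag_weeks before the mid-point of the two test dates."""
    if first_positive < last_negative:
        raise ValueError("first positive test must not precede last negative test")
    # date arithmetic works in whole days, so a half-day in the mid-point is dropped.
    midpoint = last_negative + (first_positive - last_negative) / 2
    return midpoint - timedelta(weeks=lag_weeks)


# Hypothetical test dates for one participant tested at roughly 6-month intervals.
last_negative = date(1990, 1, 1)
first_positive = date(1990, 7, 2)

imputed = impute_infection_date(last_negative, first_positive)
print(imputed)  # 1990-02-19
```

The true infection could have occurred at any time between the last negative test (or somewhat before it, given the delay in antibody response) and the first positive test, so with 6-month testing intervals the imputed date may be wrong by several months even when both tests are perfect.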

Construct error

Construct error is often overlooked, although it may be a major source of error. Consider our example in which the ultimate target of measurement is the time between HIV-1 infection and onset of AIDS. Before attempting to measure this time span, the events that mark the beginning and end of the span must be unambiguously defined. While it may be reasonable to think of HIV-1 infection as a point event, the same cannot be said of AIDS onset. Symptoms and signs may gradually accumulate, and then it is only by convention that some point in time is declared the start of the disease. If this convention cannot be translated into reasonably precise clinical criteria for diagnosing the onset of AIDS, the construct of incubation time (the time span between infection and AIDS onset) will not be well defined let alone accurately measurable. In such situations it may be left to various clinicians to improvise answers to the question of time of AIDS onset, and this will introduce another source of extraneous variation into the final ‘measurement’ of incubation time.

Differential and non-differential error

Errors in measuring a variable are said to be differential when the direction or magnitude of the errors tends to vary across the true values of other variables. Suppose, for example, that recall of drug use during pregnancy is enhanced among mothers of children with birth defects. Then a retrospective interview about drug use during pregnancy will yield results with differential error, since false-negative error will occur more frequently among mothers whose children have no birth defects.

Another type of differential error occurs in the measurement of continuous variables when the distribution of errors varies with the true value of the variable. Suppose, for example, that women more accurately recall the date of a recent cervical smear test (Papanicolaou or Pap test) than the date of a more distant test. Then a retrospective interview to determine length of time since a woman’s last cervical smear test would tend to suffer from larger errors when measuring longer times.

Errors in measuring a variable are said to be non-differential with respect to another variable if neither the direction nor the magnitude of the errors tends to vary with the true values of the other variable. Measurements are usually assumed to be non-differential if neither the subject nor the person taking the measurement knows the values of other variables. For example, if drug use during pregnancy is measured by examining prepartum prescription records for the mother, it would ordinarily be assumed that the error will be non-differential with respect to birth defects discovered postnatally. Nevertheless, such ‘blind’ assessments will not guarantee non-differential error if the measurement scale is not as fine as the scale of the original variable (Flegal et al. 1991; Wacholder et al. 1991) or if there is a third uncontrolled variable that affects both the measurement and the other study variables.

Dependent and independent error

Errors in measuring two variables are said to be dependent if the direction or magnitude of the errors made in measuring one of the variables is associated with the direction or magnitude of the errors made in measuring the other variable. If there is no association of errors, the errors are said to be independent.

In our example, errors in measuring age at HIV-1 infection and AIDS incubation time are dependent. Our measure of incubation time is equal to our measure of age at AIDS onset minus our measure of age at infection; hence overestimation of age at infection will contribute to underestimation of incubation time, and underestimation of age at infection will contribute to overestimation of incubation time. In contrast, in the same example it is plausible that the errors in measuring age at infection and age at onset are independent.

Misclassification and bias towards the null

Measurement of a binary (dichotomous) variable is called better than random if, regardless of the true value, the probability that the measurement yields the true value is higher than the probability that it does not. In other words, the measurement is better than random if it is more likely to be correct than incorrect, no matter what the true value is. Given two binary variables, better-than-random measurements with independent non-differential errors cannot inflate or reverse the association observed between the variables. In other words, any bias produced by independent non-differential error in better-than-random measurements can only be towards the null value of the association (which is 1 for a relative risk measure) and not beyond.
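The attenuation described above can be illustrated with a small calculation. The following is a minimal sketch, not a general proof: the cohort counts, sensitivity, and specificity are hypothetical, the outcome is assumed to be measured without error, and only the binary exposure is misclassified, with the same error probabilities among cases and non-cases.

```python
def classify(true_exposed, true_unexposed, sensitivity, specificity):
    """Expected numbers classified as exposed and as unexposed."""
    classified_exposed = sensitivity * true_exposed + (1 - specificity) * true_unexposed
    classified_unexposed = (1 - sensitivity) * true_exposed + specificity * true_unexposed
    return classified_exposed, classified_unexposed


def risk_ratio(cases_exposed, total_exposed, cases_unexposed, total_unexposed):
    return (cases_exposed / total_exposed) / (cases_unexposed / total_unexposed)


# Hypothetical true cohort: 1000 exposed (risk 0.20) and 1000 unexposed (risk 0.10).
true_cases = (200.0, 100.0)      # (exposed, unexposed)
true_noncases = (800.0, 900.0)   # (exposed, unexposed)

# Hypothetical error probabilities. They are identical for cases and non-cases
# (non-differential) and both exceed 0.5 (better than random).
SENSITIVITY, SPECIFICITY = 0.8, 0.9

obs_cases = classify(*true_cases, SENSITIVITY, SPECIFICITY)
obs_noncases = classify(*true_noncases, SENSITIVITY, SPECIFICITY)

true_rr = risk_ratio(200.0, 1000.0, 100.0, 1000.0)
observed_rr = risk_ratio(
    obs_cases[0], obs_cases[0] + obs_noncases[0],
    obs_cases[1], obs_cases[1] + obs_noncases[1],
)
print(f"true risk ratio {true_rr:.2f}, observed risk ratio {observed_rr:.2f}")
```

With these invented inputs the observed risk ratio lies between the null value of 1 and the true value, as the binary-variable result requires.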

If either variable has more than two levels, then (contrary to assertions in most pre-1990 literature) the preceding conditions are not sufficient to guarantee that the resulting bias will only be towards the null and not beyond (Dosemeci et al. 1990). Despite this insufficiency, knowing that errors are independent and non-differential can increase the plausibility that any resulting bias is towards the null. For further discussions of sufficient conditions for error to produce bias towards the null, see Dosemeci et al. (1990), Flegal et al. (1991), Wacholder et al. (1991), and Weinberg et al. (1994).

There is one important situation in which the assumption of independent non-differential measurement error and hence bias towards the null have particularly high plausibility: in a double-blind clinical trial with a dichotomous treatment and outcome, successful blinding of treatment status during outcome evaluation should lead to independence and non-differentiality of treatment and outcome measurement errors. Successful blinding thus helps to ensure (although it does not guarantee) that any bias produced by measurement error contributes to underestimation of treatment effects.

Systematic and random components of error

For well-defined measurement procedures on continuous variables, measurement errors can be subdivided into systematic and random components. The systematic component (sometimes called the bias of the measurement) measures the degree to which the procedure tends to underestimate or overestimate the true value on repeated application. The random component is the residual error left after subtracting the systematic component from the total error.

To illustrate, suppose that in our study HIV-1 infection time was unrelated to time of antibody testing and that the average time of HIV-1 seroconversion was 8 weeks after infection. Then, even if one used a perfect HIV-1 test, a procedure that estimated infection time as 6 weeks before the mid-point between the last negative and first positive test would on average yield an estimated infection time that was 2 weeks later than the true time. Thus the systematic component of the error of this procedure would be +2 weeks. Since AIDS incubation time is AIDS onset time minus HIV-1 infection time, use of this procedure would add –2 weeks (that is, a 2-week underestimation) to the systematic component of error in estimating incubation time.
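The arithmetic of this illustration can be written out explicitly. This is a minimal sketch: the 8-week and 6-week figures come from the text, while the time origin and the AIDS onset time are hypothetical values chosen only to make the subtraction concrete. It also assumes, as the illustration implicitly does, that the mid-point between tests coincides on average with the time of seroconversion.

```python
TRUE_SEROCONVERSION_LAG = 8.0  # average weeks from infection to seroconversion (from the text)
ASSUMED_LAG = 6.0              # weeks subtracted by the estimation procedure (from the text)

true_infection_week = 0.0      # hypothetical time origin
true_aids_onset_week = 400.0   # hypothetical onset time, taken as measured without error

# Assumption: on average the mid-point between the last negative and the
# first positive test equals the seroconversion time.
expected_midpoint = true_infection_week + TRUE_SEROCONVERSION_LAG
estimated_infection_week = expected_midpoint - ASSUMED_LAG

# Systematic error in infection time, and the error it induces in incubation time.
infection_error = estimated_infection_week - true_infection_week
true_incubation = true_aids_onset_week - true_infection_week
estimated_incubation = true_aids_onset_week - estimated_infection_week
incubation_error = estimated_incubation - true_incubation

print(f"infection-time error: {infection_error:+.0f} weeks")
print(f"incubation-time error: {incubation_error:+.0f} weeks")
```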

Each of the components of an error, systematic and random, may be differential (that is, may vary with other variable values) or non-differential, and may or may not be independent of the error components in other variables. We shall not explore the consequences of the numerous possibilities. However, one important (but semantically confusing) fact is that, for certain quantities, independent and non-differential systematic components of error will not harm measurement validity in that they will produce no bias in estimation.

To illustrate, suppose that in our example we wish to estimate the degree to which AIDS incubation time depends on age at HIV-1 infection. Suppose also that the systematic components of the measurements of incubation time and age at infection are –2 weeks and +2 weeks (as above), and do not vary with true incubation time or age at infection (that is, the systematic components are non-differential). Then the systematic components, being the same for every subject, will cancel out when we compute differences in incubation time and differences in age at infection. Since only these differences are used to estimate the association, the observed dependence of incubation time on age at infection will not be affected by the systematic components of error (although it may be biased by the random components of error).
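The cancellation can be checked numerically. In this minimal sketch the ages and incubation times are invented, the dependence is summarized by an ordinary least-squares slope (one possible choice of association measure, not one prescribed by the text), and the only error introduced is a constant shift of 2 weeks in each variable.

```python
def least_squares_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Hypothetical true values, in years.
true_age_at_infection = [22.0, 27.0, 31.0, 36.0, 44.0, 52.0]
true_incubation = [11.0, 10.5, 9.8, 9.0, 8.1, 7.0]

SHIFT_YEARS = 2.0 / 52.0  # a 2-week systematic error expressed in years

# Constant (non-differential) systematic errors: +2 weeks in age at infection,
# -2 weeks in incubation time, identical for every subject.
measured_age = [a + SHIFT_YEARS for a in true_age_at_infection]
measured_incubation = [t - SHIFT_YEARS for t in true_incubation]

true_slope = least_squares_slope(true_age_at_infection, true_incubation)
measured_slope = least_squares_slope(measured_age, measured_incubation)
print(f"slope from true values {true_slope:.4f}, from shifted values {measured_slope:.4f}")
```

The two slopes agree up to rounding, because a constant added to every observation drops out of all deviations from the mean.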

Summary of example

The example of this section provides an illustration of the most common threats to the validity of predictions. The unadjusted estimates of AIDS risk may be confounded if the target and reference cohorts differ in composition, and may also be biased by losses to follow-up or use of an incorrect statistical model. Finally, our predictions are likely to be compromised by errors in measurements. These sources of error should be borne in mind in any attempt to predict AIDS incidence.

Validity in causal inference

Concepts of valid prediction are applicable in evaluating studies of causation; comparison validity, follow-up validity, specification validity, and measurement validity must each be considered. In fact, as argued below, problems of causal inference can be viewed as a special type of prediction problem, namely prediction of what would happen (or what would have happened) to a population if certain characteristics of the population were (or had been) altered.

To illustrate validity issues in causal inference, we shall consider the hypothesis that coffee drinking causes acute myocardial infarction. This hypothesis can be operationally interpreted in a number of ways.

1. There are people for whom the consumption of coffee results in their experiencing a myocardial infarction sooner than they might have, had they avoided coffee.

While this hypothesis is appealingly precise, it offers little practical guidance to an epidemiological researcher. The problem lies in our inability to recognize an individual whose myocardial infarction was caused by coffee drinking. It is quite possible that myocardial infarctions precipitated by coffee use are clinically and pathologically indistinguishable from myocardial infarctions due to other causes. If so, the prospect of finding convincing physiological evidence concerning the hypothesis is not good.

This impasse could be overcome by examining a related epidemiological hypothesis, that is, a hypothesis that refers to the distribution of disease in populations. One of many such hypotheses is as follows.

2. Among five-cup-a-day coffee drinkers, cessation of coffee use will lower the frequency of myocardial infarction.

This form not only involves a population (five-cup-a-day coffee drinkers) but also asserts that a mass action (coffee cessation) will reduce the frequency of the study disease. Thus the form of the hypothesis immediately suggests a strong test of the hypothesis: conduct a randomized intervention trial to examine the impact of coffee cessation on myocardial infarction frequency. This solution has some profound practical limitations, not least of which would be persuading anyone to give up or take up coffee drinking to test a speculative hypothesis.

Having ruled out intervention, we might consider an observational cohort study. In this case our epidemiological hypothesis should refer to natural conditions, rather than intervention. One such hypothesis is as follows.

3. Among five-cup-a-day coffee drinkers, coffee use has elevated the frequency of myocardial infarction.

There have been a number of conflicting cohort and case–control studies of coffee and myocardial infarction. The present discussion will be confined to the issues arising in the analysis of a single study. For a review of issues arising in the analysis of multiple studies (meta-analysis) using the coffee–myocardial infarction literature as an example, see Greenland (1994, 1998a); additional discussion of the coffee–myocardial infarction literature may be found in Greenland (1993a).

Consider a cohort study of coffee and first myocardial infarction. At baseline, a cohort of people with no history of myocardial infarction is assembled and classified into subcohorts according to coffee use (for example never-drinkers, ex-drinkers, occasional drinkers, one-cup-a-day drinkers, two-cup-a-day drinkers, and so on). Other variables are measured as well: age, sex, smoking habits, blood pressure, and serum cholesterol. Suppose that at the end of 10 years of monitoring this cohort for myocardial infarction events, we compare the five-cup-a-day and never-drinker subcohorts, and obtain an unadjusted estimate of 1.22 for the ratio of the person–time incidence rates of first myocardial infarction among five-cup-a-day drinkers and never-drinkers (with 95 per cent confidence limits of 1.00 and 1.49). In other words, it appears that the rate of first myocardial infarction among five-cup-a-day drinkers was 1.22 times the rate among never-drinkers. (Hereafter, myocardial infarction means first myocardial infarction, risk means average risk, and rate means person–time incidence rate.)
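For readers who want to see how an unadjusted rate ratio and approximate confidence limits of this kind might be computed, the following is a minimal sketch. The case counts and person-years are hypothetical and are not the data behind the figures quoted above; the limits use a common large-sample (Wald) approximation on the log scale, which is an assumption of this sketch rather than a method named in the text.

```python
import math


def rate_ratio_with_limits(cases_exposed, persontime_exposed,
                           cases_reference, persontime_reference, z=1.96):
    """Crude rate ratio with approximate (Wald, log-scale) confidence limits."""
    ratio = (cases_exposed / persontime_exposed) / (cases_reference / persontime_reference)
    se_log_ratio = math.sqrt(1.0 / cases_exposed + 1.0 / cases_reference)
    lower = ratio * math.exp(-z * se_log_ratio)
    upper = ratio * math.exp(z * se_log_ratio)
    return ratio, lower, upper


# Hypothetical counts: five-cup-a-day drinkers versus never-drinkers.
ratio, lower, upper = rate_ratio_with_limits(
    cases_exposed=200, persontime_exposed=20000.0,
    cases_reference=164, persontime_reference=20000.0,
)
print(f"rate ratio {ratio:.2f}, approximate 95% limits {lower:.2f} to {upper:.2f}")
```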

The estimated rate ratio of 1.22 may not seem large. Nevertheless, if it accurately reflects the impact of coffee use on the five-cup-a-day subcohort, this estimate implies that people drinking five cups a day at baseline suffered a 22 per cent increase in their myocardial infarction rate as a result of their coffee use. Given the high frequency of both coffee use and myocardial infarction in many populations, this could represent a substantial health impact. Therefore a careful evaluation of the validity of the estimate should be performed.

As in the previous AIDS example, we can proceed by examining a series of conditions sufficient for validity of the estimate as a measure of coffee effect.

Comparison validity (C). If the members of the five-cup-a-day subcohort had instead never drunk coffee, their distribution of myocardial infarction events over time would have been approximately the same as the distribution among the never-drinkers.

Follow-up validity (F). Within each subcohort, the risk of censoring (that is, follow-up ended by an event other than myocardial infarction) is not associated with the risk of myocardial infarction.

Specification validity (Sp). The distribution of myocardial infarction events over time in the subcohorts can be closely approximated by the statistical model on which the estimates are based.

Measurement validity (M). All measurements of variables used in the analysis closely approximate the true values of the variables.

These four conditions are sometimes called internal validity conditions because they pertain only to estimating effects within the study cohort rather than to generalizing results to other cohorts. They are sufficient but not necessary for validity, in that certain violations of the conditions will not produce bias in the effect estimate (although most violations will produce some bias). The meaning of these conditions for an observational cohort study of a causal hypothesis is explored in the following sections. An important phenomenon known as effect modification, which is relevant to both internal validity and generalizability, is also discussed.

Comparison validity

In our example, comparison validity simply means that the distribution of myocardial infarctions among never-drinkers accurately predicts what would have happened in the coffee-drinking groups had the members of these groups never drunk coffee. Another way of stating condition C is that the five-cup-a-day and never-drinker subcohorts would be comparable or exchangeable with respect to myocardial infarction times if no one had ever drunk coffee.

Despite its simplicity, note that the comparison validity condition depends on the hypothesis of interest in a very precise way. In particular, the research hypothesis (hypothesis 3 above) is a statement about the impact of coffee among five-cup-a-day drinkers. Thus this subcohort is the target cohort, while never-drinkers serve as the reference cohort for making predictions about this target.

To illustrate further the correspondence between comparison validity and the hypothesis at issue, suppose for the moment that our research hypothesis was as follows.

4. Among never-drinkers, five-cup-a-day coffee use would elevate the frequency of myocardial infarction.

In examining this hypothesis, the never-drinkers would be the target cohort and the coffee drinkers would be the reference cohort. Thus the comparison validity condition would have to be replaced by a condition such as C’.

C’. If the never-drinkers had drunk five cups of coffee per day, their distribution of myocardial infarctions would have been approximately the same as the distribution among five-cup-a-day drinkers.

Other ways of stating condition C’ are that the five-cup-a-day and never-drinker subcohorts would be comparable or exchangeable with respect to myocardial infarction times if everyone had been five-cup-a-day drinkers, and that the myocardial infarction experience of five-cup-a-day drinkers accurately predicts what would have happened to the never-drinkers if the latter had drunk five cups a day.

Confounding

Failure to meet condition C results in a biased estimate of the effect of five-cup-a-day coffee drinking on five-cup-a-day drinkers, a condition sometimes referred to as confounding of the estimate. Similarly, failure to meet condition C’ results in a biased estimate of the effect that five-cup-a-day drinking would have had on never-drinkers.

To evaluate comparison validity, it is necessary to check whether the subcohorts differed at baseline on any factors that influence myocardial infarction time. If so, it should not be expected that the myocardial infarction distributions of the subcohorts were comparable, even if the subcohorts had the same level of coffee use. In other words, it should not be expected that condition C (or C’) would hold. If condition C failed, our estimates would suffer from confounding. This is so, regardless of whether adjustment appeared to change the association of coffee use and myocardial infarction (Greenland et al. 1999b).

In our example, it is important to note that several studies have found a positive association between cigarette smoking (an established risk factor for myocardial infarction) and coffee use (Greenland 1998a). It also seems plausible that a person habituated to a stimulant such as nicotine would be attracted to coffee use as well. Thus we should expect to see a higher prevalence of smoking among coffee users in our study.

Suppose then that, in our cohort, smoking is more prevalent among five-cup-a-day subjects than never-drinkers. This elevated smoking prevalence should have led to elevated myocardial infarction rates among five-cup-a-day drinkers, even if coffee had no effect. More generally, we should expect the myocardial infarction rate among never-drinkers to underestimate the myocardial infarction rate that five-cup-a-day drinkers would have had if they had never drunk coffee. The result would be an inflated estimate of the impact of coffee on the myocardial infarction rate of five-cup-a-day drinkers. Similarly, we should expect the myocardial infarction rate among five-cup-a-day drinkers to overestimate the myocardial infarction rate that never-drinkers would have had if they had drunk five cups a day.

Adjustment for measured confounders

As in the prediction problem, the data can be stratified on potential confounders with the objective of creating strata within which confounding is minimal or absent. We can also employ standardization to remove confounding from estimates of overall effect. Again, some care in the selection of the standard is required.

To illustrate, let Rxz denote the estimated rate of myocardial infarction among cohort members who drank x cups of coffee per day and smoked z cigarettes per day at baseline, with R0z denoting the estimated rate among never-drinkers. Let Wxz denote the proportion of person–time among x-cup-per-day drinkers that was contributed by z-cigarette-per-day smokers. Finally, let Rxc be the crude (unadjusted) rate observed among cohort members who drank x cups per day at baseline, with R0c denoting the estimated crude rate among never-drinkers.

Suppose that any change in coffee-use patterns would have negligible impact on the person–time distribution of smoking in the cohort. The predicted (that is, expected) rate among five-cup-a-day drinkers had they never drunk coffee, adjusted for confounding by smoking, is the average of the smoking-specific estimates from the never-drinker (reference) subcohort weighted by the smoking distribution of the five-cup-per-day (target) cohort. Algebraically, this average is the following sum (over z):

$$\sum_z W_{5z} R_{0z}$$

This sum is commonly termed the rate in the never-drinkers standardized to the distribution of smoking among five-cup-a-day drinkers. Such terminology obscures the fact that the sum is a prediction about the five-cup-a-day drinkers, not the never-drinkers.

Given the last computation, a smoking-standardized estimate of the increase in myocardial infarction rate produced by coffee drinking among five-cup-per-day drinkers is the rate ratio standardized to the five-cup-per-day smoking distribution:

$$\frac{\sum_z W_{5z} R_{5z}}{\sum_z W_{5z} R_{0z}}$$

This formula reveals a property common to a simple standardized rate ratio: the same weights (here W5z) must be used in the numerator and denominator sums. Some insight into this formula can be obtained by noting that the crude rate R5c among the five-cup-a-day drinkers is equal to

$$\sum_z W_{5z} R_{5z}$$

so that the standardized rate ratio can be rewritten as

$$\frac{R_{5c}}{\sum_z W_{5z} R_{0z}}$$

This version shows that the ratio is a classical observed (crude) over expected ratio, or standardized morbidity ratio.

Another standardized rate ratio is

$$\frac{\sum_z W_{0z} R_{5z}}{\sum_z W_{0z} R_{0z}}$$

This differs from the previous standardized ratio in that the weights are taken from the never-drinkers (W0z) instead of five-cup-a-day drinkers (W5z). Insight into this formula can be obtained by noting that the numerator sum is simply a prediction (expectation) of what would have happened to the never-drinkers if they had been five-cup-a-day drinkers, while the denominator sum is equal to the crude rate R0c among never-drinkers. Thus the last standardized ratio is a smoking-standardized estimate of the increase in the myocardial infarction rate that five-cup-a-day drinking would have produced among the never-drinkers.
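The two standardized ratios just described can be computed side by side. The sketch below is only illustrative: the three smoking categories, the smoking-specific rates, and the person–time weights are all hypothetical, and the rates are constructed so that coffee multiplies the rate by 1.1 at every smoking level. The variable names mirror the Rxz, Wxz, and Rxc notation of the text.

```python
SMOKING_LEVELS = ("none", "light", "heavy")  # hypothetical categories standing in for z

# Hypothetical smoking-specific rates (events per person-year).
R0 = {"none": 0.004, "light": 0.008, "heavy": 0.012}  # never-drinkers, R0z
R5 = {z: 1.1 * R0[z] for z in SMOKING_LEVELS}         # five-cup-a-day drinkers, R5z

# Hypothetical person-time distributions over smoking levels (each sums to 1).
W0 = {"none": 0.7, "light": 0.2, "heavy": 0.1}        # never-drinkers, W0z
W5 = {"none": 0.4, "light": 0.3, "heavy": 0.3}        # five-cup-a-day drinkers, W5z


def weighted_rate(weights, rates):
    """Sum over z of weight times rate."""
    return sum(weights[z] * rates[z] for z in SMOKING_LEVELS)


# Crude rates are the subcohort's own rates weighted by its own smoking distribution.
crude_R5c = weighted_rate(W5, R5)
crude_R0c = weighted_rate(W0, R0)
crude_ratio = crude_R5c / crude_R0c

# Ratio standardized to the five-cup-a-day smoking distribution (observed over expected).
expected_rate_if_never_drinkers = weighted_rate(W5, R0)
ratio_standardized_to_drinkers = crude_R5c / expected_rate_if_never_drinkers

# Ratio standardized to the never-drinker smoking distribution.
ratio_standardized_to_never_drinkers = weighted_rate(W0, R5) / crude_R0c

print(f"crude ratio {crude_ratio:.3f}")
print(f"standardized to five-cup-a-day drinkers {ratio_standardized_to_drinkers:.3f}")
print(f"standardized to never-drinkers {ratio_standardized_to_never_drinkers:.3f}")
```

Because the invented coffee effect is the same at every smoking level, both standardized ratios recover it, while the crude ratio is inflated by the heavier smoking among the coffee drinkers. When the effect varies across smoking levels, the two standardized ratios will generally differ.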

Standardization is appealingly simple in both justification and computation. Unfortunately, if the number of cases occurring within the confounder categories tends to be small (under five or so), the technique will be subject to various technical problems including possible bias. These problems can be avoided by broadening confounder categories or by not adjusting for some of the measured confounders. Unfortunately, both these strategies are likely to result in incomplete control of confounding. To avoid having to adopt these strategies, many researchers attempt to control confounding by using a multivariate model. This remedy has problems of its own, some of which are addressed in the section on specification validity below.

Another problem is that standardization procedures (as well as typical modelling procedures) take no account of potential exposure effects on the adjustment variables or their distribution. Thus, in the above example, to justify use of the fixed weights Wxz the dubious assumption had to be invoked that changes in coffee use would only negligibly affect the smoking distribution. This issue is briefly discussed in the section on intermediate variables below.

Unmeasured confounders

Among the possible confounders not measured in our hypothetical study are diet and exercise. Suppose that ‘health conscious’ subjects who exercise regularly and eat low-fat diets also avoid coffee. The result will be a concentration of these lower-risk subjects among coffee non-users and a consequent overestimation of coffee’s effect on risk.

Confounding by unmeasured confounders can sometimes be minimized by controlling variables along pathways of the confounders’ effect. For example, if exercise and low-fat diet lowered myocardial infarction risk only by lowering serum cholesterol and blood pressure, control of serum cholesterol and blood pressure would remove confounding by exercise and dietary fat. Unfortunately, such control may also generate bias if the controlled variables are intermediates between our study variable and our outcome variable.

If external information is available to indicate the relationship in our study between an unmeasured confounder and the study variables, an indirect method to adjust for the confounder can be used (Schlesselman 1978; Flanders and Khoury 1990). If external information is unreliable or unavailable, it is still possible to examine the sensitivity of our results to unmeasured confounding (Cornfield et al. 1959; Flanders and Khoury 1990; Rosenbaum 1995; Chapter 19 in Rothman and Greenland 1998).
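One simple form that such an indirect adjustment or sensitivity analysis can take is sketched below. It is offered as an illustration under strong assumptions and is not claimed to reproduce the exact procedures of the works cited above: the unmeasured confounder is binary, it multiplies the outcome rate by the same factor in both exposure groups, and there is no other source of bias. The prevalences and the confounder rate ratio are hypothetical.

```python
def confounding_bias_factor(prevalence_exposed, prevalence_reference, confounder_rate_ratio):
    """Ratio of the crude to the confounder-adjusted rate ratio for a binary
    confounder whose effect on the rate is the same in both exposure groups."""
    excess = confounder_rate_ratio - 1.0
    return (1.0 + prevalence_exposed * excess) / (1.0 + prevalence_reference * excess)


observed_rate_ratio = 1.22  # unadjusted estimate from the example in the text

# Hypothetical external information: an unhealthy lifestyle (little exercise,
# high-fat diet) is present in 60% of five-cup-a-day drinkers and 40% of
# never-drinkers, and multiplies the myocardial infarction rate by 1.5.
bias = confounding_bias_factor(0.6, 0.4, 1.5)
externally_adjusted_ratio = observed_rate_ratio / bias

print(f"bias factor {bias:.3f}, externally adjusted rate ratio {externally_adjusted_ratio:.3f}")

# Sensitivity analysis: vary the assumed strength of the confounder.
for assumed_ratio in (1.0, 1.5, 2.0, 3.0):
    adjusted = observed_rate_ratio / confounding_bias_factor(0.6, 0.4, assumed_ratio)
    print(f"assumed confounder rate ratio {assumed_ratio:.1f}: adjusted estimate {adjusted:.3f}")
```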

Randomization and confounding

Suppose, for the moment, that the level of coffee use in our cohort had been randomly assigned and that the participants diligently consumed only their assigned amount of coffee. Could our estimates of coffee effects from such a randomized trial still be confounded? By our earlier definition of confounding, the answer is yes. To see this, note for example that by chance alone the five-cup-a-day drinkers could be older on average than the never-drinkers; this difference would in turn result in an upward bias in the unadjusted estimate of the effect of five cups a day, since age is an important risk factor for myocardial infarction.

Nevertheless, randomization can help to ensure that the distributions of confounders in the different exposure groups are not too far apart. In essence, the probability of severe confounding can be made as small as necessary by increasing the size of the randomized groups. Furthermore, if randomization is used and subjects comply with their assigned treatments, any confounding left after adjustment will be accounted for by the standard errors of the estimates, provided that the correct statistical model is used to compute the effect estimates and their standard errors (Robins 1988; Greenland 1990).
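The claim that chance imbalances shrink as the randomized groups grow can be explored by simulation. This is a minimal sketch with invented inputs: ages are drawn from a normal distribution with mean 50 and standard deviation 10 years, subjects are split at random into two equal arms, and imbalance is summarized by the average absolute difference in mean age between arms over repeated randomizations.

```python
import random
import statistics


def mean_age_imbalance(n_per_arm, replications, rng):
    """Average absolute difference in mean age between two randomized arms."""
    differences = []
    for _ in range(replications):
        ages = [rng.gauss(50.0, 10.0) for _ in range(2 * n_per_arm)]  # hypothetical ages
        rng.shuffle(ages)  # random assignment to arms
        arm_a, arm_b = ages[:n_per_arm], ages[n_per_arm:]
        differences.append(abs(statistics.fmean(arm_a) - statistics.fmean(arm_b)))
    return statistics.fmean(differences)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
imbalance_small = mean_age_imbalance(n_per_arm=20, replications=200, rng=rng)
imbalance_large = mean_age_imbalance(n_per_arm=2000, replications=200, rng=rng)

print(f"average age imbalance with 20 per arm: {imbalance_small:.2f} years")
print(f"average age imbalance with 2000 per arm: {imbalance_large:.2f} years")
```

In any single small trial the age difference may still be large enough to confound the unadjusted estimate, which is the point made in the preceding paragraphs.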

Intermediate variables

In effect estimation, it is important to take care to distinguish intermediate variables from confounding variables. Intermediate variables represent steps in the causal pathway from the study exposure to the outcome event. The distinction is essential, for control of intermediate variables can increase the bias of estimates.

To illustrate, suppose that coffee use affects serum cholesterol levels (as suggested by the results of Curb et al. 1986). Then, given that serum cholesterol affects myocardial infarction risk, serum cholesterol is an intermediate variable for the study of coffee effects on this risk. Now suppose that we stratify our cohort data on serum cholesterol levels. Some coffee drinkers will be in elevated cholesterol categories because of coffee use and so will be at elevated myocardial infarction risk because of coffee effects, yet these subjects will be compared with never-drinkers in the same stratum who are also at elevated risk due to their elevated cholesterol. Therefore the effect of coffee on myocardial infarction risk via the cholesterol pathway will not be apparent within the cholesterol strata, and so cholesterol adjustment will contribute to underestimation of the coffee effect on myocardial infarction risk. Analogously, if coffee affected myocardial infarction risk by elevating blood pressure, blood pressure adjustment will contribute to underestimation of the coffee effect. Such underestimation can be termed overadjustment bias.
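A deliberately extreme numerical case makes the mechanism visible. In this minimal sketch every number is hypothetical, there is no confounding, and coffee is assumed to affect myocardial infarction risk only by raising the probability of elevated serum cholesterol.

```python
# Hypothetical probability of elevated serum cholesterol by coffee use.
P_ELEVATED_CHOLESTEROL = {"five_cups": 0.4, "never": 0.2}

# Hypothetical 10-year risk by cholesterol level; by construction it does not
# depend on coffee use once cholesterol is fixed.
RISK_BY_CHOLESTEROL = {"elevated": 0.10, "normal": 0.05}


def overall_risk(coffee_group):
    """Risk averaged over the cholesterol distribution of the coffee group."""
    p = P_ELEVATED_CHOLESTEROL[coffee_group]
    return p * RISK_BY_CHOLESTEROL["elevated"] + (1.0 - p) * RISK_BY_CHOLESTEROL["normal"]


# Total effect of coffee: compare the groups without stratifying on cholesterol.
total_risk_ratio = overall_risk("five_cups") / overall_risk("never")

# Within each cholesterol stratum, drinkers and never-drinkers have the same risk.
stratum_risk_ratios = {
    level: RISK_BY_CHOLESTEROL[level] / RISK_BY_CHOLESTEROL[level]
    for level in RISK_BY_CHOLESTEROL
}

print(f"total risk ratio {total_risk_ratio:.3f}")
print(f"cholesterol-specific risk ratios {stratum_risk_ratios}")
```

Here the cholesterol-adjusted comparison shows no association even though coffee raises risk in the cohort as a whole. With an additional pathway not involving cholesterol, adjustment would understate the effect rather than remove it.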

Intermediate variables may also be confounders and thus present the investigator with a severe dilemma. Consider that most of the variation in serum cholesterol levels is not due to coffee use and that much (perhaps most) of the association between coffee use and cholesterol is not due to coffee effects, but rather to factors associated with both coffee and cholesterol (such as exercise and dietary fat). This means that serum cholesterol may also be viewed as a confounder for the coffee–myocardial infarction study and that estimates unadjusted for serum cholesterol will be biased unless they are also adjusted for the factors contributing to the coffee–cholesterol association.

Suppose that a variable is both an intermediate and a confounder. It will usually be impossible to determine how much of the change in the effect estimate produced by adjusting for the variable is due to introduction of overadjustment bias and how much is due to removal of confounding. Nevertheless, a qualitative assessment may be possible in some situations. For example, if we know that the effects of coffee on serum cholesterol are weak and that most of the association between coffee and serum cholesterol is due to confounding of this association by uncontrolled factors (such as exercise and diet), we can conclude that the cholesterol-adjusted estimate is the less biased of the two. Alternatively, if we have accurately measured all the factors that confound the coffee–cholesterol association, we can control these factors instead of cholesterol to obtain an estimate free of both overadjustment bias and confounding by cholesterol. Finally, if we have multiple measurements of coffee use and cholesterol over time, techniques are available that adjust for the confounding effects of cholesterol but do not introduce overadjustment (Robins and Greenland 1994).

Direct and indirect effects

Often, we may wish to estimate how much of the effect under study is indirect relative to an intermediate variable (in the sense of being transmitted through the intermediate), or how much of the effect is direct relative to the intermediate (not mediated by the intermediate). For example, we might wish to estimate how much of coffee’s effect on myocardial infarction risk is due to its effect on serum cholesterol, or how much is due to coffee effects through pathways not involving cholesterol.

One common approach to this problem is to adjust the coffee–myocardial infarction association for serum cholesterol level via ordinary stratification or regression methods and then use the resulting estimate as the estimate of the direct coffee effect. This procedure is potentially biased as it may introduce new confounding by determinants of serum cholesterol, even if these determinants did not confound the total (unadjusted) association (Robins and Greenland 1992). However, given sufficient data, it is possible to obtain separate estimates for direct and indirect effects using special stratification or modelling techniques (Robins and Greenland 1994).

Follow-up validity

In our example, follow-up validity means that follow-up is valid within every subcohort being compared. In other words, over any span of time during follow-up, myocardial infarction risk within a subcohort is unassociated with censoring risk in the subcohort. Given follow-up validity, we can expect that, at any follow-up time t, the myocardial infarction rates in a subcohort will be the same for subjects lost or withdrawn at t and subjects whose follow-up continues beyond t.

In fact, we should expect follow-up to be biased by cigarette smoking: smoking is associated with mortality from myocardial infarction and from many other causes; the association of smoking with socioeconomic status might also produce an association between smoking and loss to follow-up. The result would be elevated censoring among high-risk (smoking) subjects. As a consequence, unadjusted estimates of myocardial infarction risks will underestimate those risks in the complete subcohorts (as the latter include both censored and uncensored subject experience). If the degree of underestimation varies across subcohorts, bias in the relative risk estimates will result.

In fact, the degree of underestimation should vary in this example because of the variation in smoking prevalence across subcohorts. Nevertheless, variation in smoking prevalence is not necessary for smoking-related censoring to produce biased estimates of absolute effect. For example, if smoking-related censoring produced a uniform 15 per cent underestimation of the myocardial infarction rate in each subcohort, all rate differences would also be underestimated by 15 per cent.
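The arithmetic behind this remark is short. In the sketch below the two true rates are hypothetical; the 15 per cent figure is the one used in the text.

```python
# Hypothetical true rates (events per person-year) in the complete subcohorts.
true_rate_five_cups = 0.0122
true_rate_never = 0.0100

UNDERESTIMATION = 0.15  # uniform 15 per cent underestimation, as in the text

observed_rate_five_cups = (1.0 - UNDERESTIMATION) * true_rate_five_cups
observed_rate_never = (1.0 - UNDERESTIMATION) * true_rate_never

true_ratio = true_rate_five_cups / true_rate_never
observed_ratio = observed_rate_five_cups / observed_rate_never

true_difference = true_rate_five_cups - true_rate_never
observed_difference = observed_rate_five_cups - observed_rate_never

print(f"rate ratio: true {true_ratio:.3f}, observed {observed_ratio:.3f}")
print(f"rate difference: true {true_difference:.5f}, observed {observed_difference:.5f}")
print(f"observed difference as a fraction of true: {observed_difference / true_difference:.2f}")
```

The common factor cancels in the ratio but not in the difference, so a uniform underestimation leaves the rate ratio intact while shrinking the rate difference by the same 15 per cent.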

Analogous to control of confounding, any bias produced by smoking’s association with myocardial infarction and censoring can be removed by smoking adjustment. As before, if adjustment is by standardization, the standard distribution should be chosen from the target subcohort.

Some authors classify follow-up bias as a form of confounding because the same correction methods can sometimes be applied. Nevertheless, the two phenomena are reversed with respect to the causal ordering of the third variable responsible for the bias: confounding arises from an association of the study exposure (coffee use) with other exposures (such as smoking) that affect outcome risk; in contrast, follow-up bias arises from an association between the risk of the study outcome (myocardial infarction) and risks of other end-points (such as other-cause mortality or loss to follow-up) that are affected by exposure. Furthermore, certain forms of follow-up bias cannot be removed by adjustment. These problems are discussed in the statistics literature under the topic of dependent competing risks; see Kalbfleisch and Prentice (1980) and Slud and Byar (1988) for discussions of this issue.

Some authors (Kelsey et al. 1996) classify follow-up bias as a form of selection bias. Here, we reserve the latter term for a special problem of case–control studies (discussed below).

Specification validity

As noted above, the use of a statistical method based on an incorrect model (specification error) can lead to bias in estimates and improper performance of statistical tests and interval estimates. All statistical techniques, including non-parametric methods, must assume some sort of model for the process generating the data; however, in the absence of randomization or random sampling, it will rarely be possible to identify a ‘correct’ sampling model. In addition, structural assumptions are rarely (if ever) exactly satisfied. Thus some specification error should be expected. As before, minimization of specification error must rely on checking the model against the data and against background information about the processes generating the data.

Recall that the unadjusted rate ratio estimate for five-cup-a-day versus never-drinkers is 1.22 in the present example, with 95 per cent confidence limits of 1.00 and 1.49, and a p value of 0.05. Suppose that these figures were obtained by the person–time methods given in textbooks such as Breslow and Day (1987) or Rothman and Greenland (1998). These methods are based on a binomial sampling model for the number of cases who drank five cups a day at baseline, given the combined (total) number of cases among five-cup-a-day and never-drinkers. In our example, the validity of this model depends on the assumption that the myocardial infarction rate remains constant within subcohorts over the follow-up period. It follows that the model and hence the statistics given earlier cannot be valid in our example; the subcohort members grow older over the follow-up period, and hence the myocardial infarction rates must increase with follow-up time.
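The binomial sampling model mentioned above can be made concrete. The sketch below assumes constant rates within subcohorts (the very assumption that the paragraph argues fails in this example) and uses hypothetical case counts and person-years. Under the null hypothesis of equal rates, and given the total number of cases, the number of cases among five-cup-a-day drinkers is treated as binomial with probability equal to that subcohort's share of the person–time. The one-sided exact tail probability shown is one possible test statistic, chosen here for simplicity.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n binomial trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def upper_tail_p_value(cases_exposed, total_cases, persontime_exposed, persontime_reference):
    """P(at least cases_exposed exposed cases | total_cases) under equal rates."""
    p_null = persontime_exposed / (persontime_exposed + persontime_reference)
    return sum(
        binomial_pmf(k, total_cases, p_null)
        for k in range(cases_exposed, total_cases + 1)
    )


# Hypothetical data: 200 cases in 20000 person-years among five-cup-a-day
# drinkers and 164 cases in 20000 person-years among never-drinkers.
cases_exposed, cases_reference = 200, 164
total_cases = cases_exposed + cases_reference
p_upper = upper_tail_p_value(cases_exposed, total_cases, 20000.0, 20000.0)

print(f"one-sided exact p value under the constant-rate binomial model: {p_upper:.3f}")
```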

The invalidity just noted can be rectified by stratifying either on follow-up time or on the variable responsible for the change in rates over follow-up time (here, age). The stratification need only be fine enough to ensure that the myocardial infarction rate change within strata is negligible over follow-up. As noted above, however, smoking and perhaps other factors responsible for confounding or follow-up bias must also be adjusted for. If we stratified finely enough to remove all the bias from these sources, the resulting estimates would be undefined or so unstable that they would tell us nothing about the association of coffee and myocardial infarction.

The standard solution to such problems is to compute adjusted estimates using regression models. These are structural models representing a set of assumptions (usually rather strong ones) about the joint effects of the study variables. Such models allow estimates and tests to be extracted from what would otherwise be hopelessly sparse data, at a cost of a greater risk of bias arising from violations of the assumptions underlying the models (Robins and Greenland 1986). For further details of cohort modelling, see Breslow and Day (1987), Checkoway et al. (1989), Hosmer and Lemeshow (1989), Clayton and Hills (1993), Kelsey et al. (1996), or Rothman and Greenland (1998).

Measurement validity

Unlike sex, the continuous variables of coffee use, cigarette use, blood pressure, cholesterol, and age are time-dependent covariates. With the exception of age (whose value at any time can be computed from birth date), this fact adds considerable complexity to measuring these variables and estimating their effects.

Consider that we cannot reasonably expect a single baseline measurement, no matter how accurate, to summarize adequately a subject’s entire history of coffee drinking, smoking, blood pressure, or cholesterol. Even if the effect of a subject’s history could be largely captured by using a single summary number (for example total number of cigarettes smoked), the baseline measurement may well be a poor proxy for this ideal and unknown summary. For these reasons, we should expect proxy-variable errors to be very large in our example.

Proxy-variable error in the study variables

The degree of proxy-variable error in measuring the study variables depends on the exact definitions of the variables that we wish to study. In turn, this definition should reflect the hypothesized effect that we wish to study. To illustrate, consider the following acute-effect hypothesis.

1. Drinking a cup of coffee produces an immediate rise in short-term myocardial infarction risk. In other words, coffee consumption is an acute risk factor.

This hypothesis does not exclude the possibility that coffee use also elevates long-term risk of myocardial infarction, perhaps through some other mechanism; it simply does not address the issue of chronic effects.

One way to examine the hypothesis would be to compare the myocardial infarction rates among person-days in which one, two, three, and so on cups were drunk with the rate among person-days in which no coffee was drunk (adjusting for confounding and follow-up bias). If we had only baseline data, baseline daily consumption would have to serve as the proxy for consumption on every day of follow-up. This would probably be a poor proxy for daily consumption at later follow-up times where more outcome events occur. A ‘standard’ analysis, which only examines the association of baseline coffee use with myocardial infarction rates, is equivalent to an analysis that uses baseline consumption as a proxy for consumption on all later days. Thus, estimates from a standard analysis would suffer large bias if considered as estimates of acute coffee effect.

The proxy-variable error in this example could easily be differential with respect to the outcome: person-days accumulate more rapidly in early follow-up, where the error from using baseline consumption as the proxy is relatively low; in contrast, myocardial infarction events accumulate more rapidly in later follow-up, where the error is probably higher. This difference in accumulation illustrates an important general point: errors in variables can be differential, even if the variables are measured before the outcome event. Such phenomena occur when errors are associated with risk factors for the outcome; in our example, the error is associated with follow-up time and hence age. In turn, such associations are likely to occur when measurements are based on proxy variables.

Suppose now that we examine the following chronic-effect hypothesis.

2. Each cup of coffee drunk eventually results in a long-term elevation of myocardial infarction risk.

This hypothesis was suggested by reports that coffee drinking produces a rise in serum lipid levels (Curb et al. 1986); it does not address the issue of acute effects. One way to examine the hypothesis would be to compare the myocardial infarction rates among person-months with different cumulative doses of coffee (perhaps using a lag period in calculating dose; for example, one might ignore the most recent month of consumption). If we had only baseline data, however, baseline daily consumption would have to be used to construct a proxy for cumulative consumption at every month of follow-up. This construction could be done in several different ways. For example, we could estimate subjects’ cumulative doses up to a particular date by multiplying their baseline daily consumption by the number of days that they had lived between age 18 and the date in question. This estimate assumes that coffee drinking began at age 18 and the baseline daily consumption is the average daily consumption since that age. We should expect considerable error in such a crude measure of cumulative consumption.
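The crude construction just described can be written out as a short calculation. The following Python sketch is illustrative only: the function name `proxy_cumulative_cups`, the 365.25-day year, and the example subject are assumptions introduced here, not part of the study.

```python
def proxy_cumulative_cups(baseline_cups_per_day, age_years,
                          start_age=18.0, days_per_year=365.25):
    """Crude proxy for cumulative coffee dose at a given age.

    Assumes, as in the text, that drinking began at `start_age` and that
    baseline daily consumption equals average daily consumption since then.
    """
    days_exposed = max(0.0, age_years - start_age) * days_per_year
    return baseline_cups_per_day * days_exposed


# Hypothetical subject: five cups a day at baseline, evaluated at age 40,
# i.e. 5 cups/day x 22 years x 365.25 days/year.
dose_at_40 = proxy_cumulative_cups(5, 40)
print(dose_at_40)
```

Any departure of a subject's true history from the two stated assumptions (starting age and constant consumption) passes directly into the proxy as measurement error.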

The degree of bias in estimating chronic effects could be quite different from the degree of bias in estimating acute effects. Furthermore, as discussed below, the errors in each proxy will make it virtually impossible to discriminate between acute and chronic effects.

Measurement error and confounding

If a variable is measured with error, estimates adjusted for the variable as measured will still be somewhat confounded by the variable. This residual confounding arises because measurement error prevents construction of strata that are internally homogeneous with respect to the true confounding variable (Greenland 1980).

To illustrate, consider baseline daily cigarette consumption. This variable can be considered a proxy for consumption on each day of follow-up or can be used to construct an estimate of cumulative consumption (analogous to the cumulative coffee variable discussed above). Suppose that we stratify the data on a cumulative smoking index constructed from the baseline smoking measurement. Within any stratum of the index, there would remain a broad range of cumulative cigarette consumption. For example, two subjects who were age 40 and smoked one pack a day at baseline would receive the same value for the smoking index and so end up in the same stratum. However, if one of them stopped smoking immediately after baseline, while the other continued to smoke a pack a day, after 10 years of follow-up the former subject would have 10 fewer pack-years of cigarette consumption than the continuing smoker.
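The two-smoker example can be checked with a few lines of arithmetic. This is a minimal sketch: the function name `index_pack_years` and the assumed starting age of 18 (borrowed from the coffee example) are hypothetical choices made for illustration.

```python
def index_pack_years(baseline_packs_per_day, age_years, start_age=18.0):
    """Smoking index built from baseline data only.

    Assumes constant consumption at the baseline level since `start_age`.
    """
    return baseline_packs_per_day * max(0.0, age_years - start_age)


BASELINE_AGE = 40
FOLLOW_UP_YEARS = 10
START_AGE = 18.0  # assumed age at which smoking began

# Both subjects smoked one pack a day at baseline, so the index is identical.
index_quitter = index_pack_years(1.0, BASELINE_AGE + FOLLOW_UP_YEARS)
index_continuer = index_pack_years(1.0, BASELINE_AGE + FOLLOW_UP_YEARS)

# True cumulative consumption at the end of follow-up.
true_at_baseline = 1.0 * (BASELINE_AGE - START_AGE)
true_quitter = true_at_baseline + 0.0 * FOLLOW_UP_YEARS    # stopped at baseline
true_continuer = true_at_baseline + 1.0 * FOLLOW_UP_YEARS  # kept smoking

print(index_quitter == index_continuer, true_continuer - true_quitter)
```

The index places both subjects in the same stratum even though their true pack-years differ by 10, which is the within-stratum heterogeneity that leaves residual confounding.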

Suppose now that cumulative cigarette consumption is positively associated with cumulative coffee consumption. Then, even within strata of the smoking index, we should expect subjects with high coffee consumption to exhibit elevated myocardial infarction rates simply by virtue of having higher levels of cigarette consumption. As a consequence, the estimate of coffee effect adjusted for the smoking index would still be confounded by cumulative cigarette consumption.

In some cases a study variable may appear to have an effect (or no effect) only because of poor measurement of an apparently unimportant confounder. This can occur, for example, when an important confounding variable is measured with a large amount of non-differential error. Such an error would ordinarily reduce the apparent association of the variable with the exposure, and would also make the variable appear to be a weak risk factor, perhaps weaker than the study exposure. This in turn would make the variable appear to be only weakly confounding, in that adjustment for the variable as measured would produce little change in the result. However, this appearance would be deceptive because adjustment for the variable as measured would eliminate little of the actual confounding by the variable.

As an example, suppose that coronary proneness of personality was measured only by the baseline yes/no question: Do you consider yourself a hard-driving person? Such a crude measure of the original construct would be unlikely to show more than a weak association with either coffee use or myocardial infarction, and adjusting for it would produce little change in our estimate of coffee effect. Suppose, however, that coronary-prone personalities have an elevated preference for coffee. Such a phenomenon would lead to a concentration of coronary-prone people (and hence a spuriously elevated myocardial infarction rate) among coffee drinkers, even after stratification on response to the above question.

One would ordinarily expect adjustment for a non-differentially misclassified confounder to produce an estimate lying somewhere between the crude (unadjusted) estimate and the estimate adjusted for the true values of the confounder (Greenland 1980). Unfortunately, if the true confounder has more than two levels, it is possible for adjustment by the misclassified confounder to be more biased than the crude estimate (Brenner 1993). It is also possible for adjustment by factors that affect misclassification to worsen bias (Greenland and Robins 1985).
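The ordinary expectation for a binary confounder can be illustrated with expected counts. The sketch below is not taken from the cited papers: every input value is a hypothetical assumption, risks are used in place of rates for simplicity, and the Mantel–Haenszel risk ratio is simply one convenient choice of summary estimator. Exposure is given no effect, so the correctly adjusted risk ratio is 1.

```python
from itertools import product

# Hypothetical population (all values are illustrative assumptions).
P_CONFOUNDER = 0.5               # prevalence of the true binary confounder
P_EXPOSED = {1: 0.8, 0: 0.2}     # P(exposed | confounder level)
RISK = {1: 0.04, 0: 0.01}        # outcome risk depends on the confounder only
SENSITIVITY = SPECIFICITY = 0.7  # non-differential misclassification


def expected_strata(stratify_on):
    """Expected [cases_exp, n_exp, cases_unexp, n_unexp] for each stratum.

    `stratify_on` is "none", "true", or "measured".
    """
    strata = {}
    for c, e, m in product((0, 1), repeat=3):
        p = P_CONFOUNDER if c else 1 - P_CONFOUNDER
        p *= P_EXPOSED[c] if e else 1 - P_EXPOSED[c]
        p_measured_1 = SENSITIVITY if c else 1 - SPECIFICITY
        p *= p_measured_1 if m else 1 - p_measured_1
        key = {"none": 0, "true": c, "measured": m}[stratify_on]
        row = strata.setdefault(key, [0.0, 0.0, 0.0, 0.0])
        offset = 0 if e else 2
        row[offset] += p * RISK[c]   # expected proportion becoming cases
        row[offset + 1] += p         # expected proportion at risk
    return list(strata.values())


def mantel_haenszel_rr(strata):
    """Mantel-Haenszel summary risk ratio over the given strata."""
    num = sum(a * n0 / (n1 + n0) for a, n1, b, n0 in strata)
    den = sum(b * n1 / (n1 + n0) for a, n1, b, n0 in strata)
    return num / den


rr_crude = mantel_haenszel_rr(expected_strata("none"))
rr_true = mantel_haenszel_rr(expected_strata("true"))
rr_measured = mantel_haenszel_rr(expected_strata("measured"))
print(round(rr_crude, 3), round(rr_true, 3), round(rr_measured, 3))
```

Under these assumed inputs the estimate adjusted for the misclassified confounder falls between the crude estimate and the estimate adjusted for the true confounder, so most of the confounding remains. The sketch does not reproduce the exceptions with more than two confounder levels noted above.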

Measurement error and separation of effects

Measurement errors can severely reduce our ability to separate different effects of the study variable because of their impact on the effectiveness of adjustment procedures. Suppose in our example that we wished to estimate the relative strength of acute and chronic coffee effects. To do so we must take account of the fact that acute and chronic effects will be confounded. When examining acute effects, person-days with high coffee consumption will occur most frequently among people with high cumulative coffee consumption. As a consequence, if cumulative coffee consumption is a risk factor, it will be a confounder for estimating the acute effects of coffee consumption. By similar arguments, if coffee consumption has acute effects, these will confound estimates of the chronic effects of cumulative consumption.

Unfortunately, both cumulative and daily consumption are measured with considerable error. As a result, any effect observed for one may be wholly or partially due to the other, even if the other has little or no apparent effect.

Repeated measures

One costly but effective method for reducing the degree of proxy-variable error in measuring time-dependent variables is to take repeated (serial) measurements over the follow-up period and ask subjects to report their prebaseline history of such variables at the baseline interview. In our example, subjects could be asked about their age at first use and level of consumption at different ages for coffee and cigarettes; they could then be recontacted every year or two to assess their current consumption. Of course, not all subjects may be willing to co-operate with such active follow-up, but the penalties of some extra loss may be far outweighed by the benefit of improved measurement accuracy.

Errors in assessing incidence

An important form of measurement error in assessing incidence is misdiagnosis of the outcome event. In the AIDS example, a false-positive diagnosis of AIDS would result in underestimation of incubation time, while a false-negative diagnosis would result in overestimation. In the present example, false-positive errors would result in overestimation of myocardial infarction rates, while false-negative errors would result in underestimation. These errors will be of particular concern when the study depends on existing surveillance systems or records for detection of outcome events.

There are special cases in which the errors will induce little or no bias in estimates (Poole 1985), provided the errors have little effect on the person-time observed. If the only form of misdiagnosis is false-negative error, the proportion of outcome events missed in this fashion is the same across cohorts, and there is no follow-up bias, then the relative risk estimates will not be distorted by the underdiagnosis. Suppose in our example that all recorded myocardial infarction events are true myocardial infarctions, but that in each subcohort 10 per cent of myocardial infarctions are missed. The myocardial infarction rates in each subcohort will then be underestimated by 10 per cent; nevertheless, if we consider any two of these rates, say $R_0$ and $R_5$, the observed rate ratio will be

$$\frac{0.9\,R_5}{0.9\,R_0} = \frac{R_5}{R_0},$$

which is undistorted by the underdiagnosis of myocardial infarction. Nonetheless, if coffee primarily induced ‘silent’ myocardial infarctions and these were the most frequently undiagnosed events, the coffee effect would be underestimated.

In an analogous fashion, if the only form of misdiagnosis is false-positive error, the rate of false positives is the same across cohorts, and there is no follow-up bias, then rate differences will not be distorted by the overdiagnosis. Suppose that the rate of false positives in our example is $R_f$ in all subcohorts; then if we consider any two true rates, say $R_0$ and $R_5$, the observed rate difference will be

$$(R_5 + R_f) - (R_0 + R_f) = R_5 - R_0,$$

which is undistorted by the overdiagnosis of myocardial infarction. However, if there is non-differential underdiagnosis of myocardial infarction, as is probably the case in our example, the rate difference will be underestimated.
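The two special cases above can be verified numerically. In this sketch the function `observed_rate`, the true rates, and the false-positive rate are hypothetical values chosen only to make the arithmetic concrete.

```python
import math


def observed_rate(true_rate, sensitivity=1.0, false_positive_rate=0.0):
    """Observed rate when a fraction `sensitivity` of true events is detected
    and false positives accrue at `false_positive_rate`."""
    return sensitivity * true_rate + false_positive_rate


# Hypothetical true rates per 100 000 person-years.
R0, R5 = 1000.0, 1220.0

# Case 1: only false negatives, 10 per cent missed in every subcohort.
ratio_under = observed_rate(R5, sensitivity=0.9) / observed_rate(R0, sensitivity=0.9)
diff_under = observed_rate(R5, sensitivity=0.9) - observed_rate(R0, sensitivity=0.9)

# Case 2: only false positives, at the same rate Rf in every subcohort.
Rf = 50.0
ratio_over = observed_rate(R5, false_positive_rate=Rf) / observed_rate(R0, false_positive_rate=Rf)
diff_over = observed_rate(R5, false_positive_rate=Rf) - observed_rate(R0, false_positive_rate=Rf)

print(ratio_under, diff_under)  # ratio preserved, difference shrunk
print(ratio_over, diff_over)    # ratio pulled toward 1, difference preserved
```

Uniform underdiagnosis preserves the ratio but shrinks the difference by the same 10 per cent; uniform overdiagnosis preserves the difference but pulls the ratio toward 1.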

Effect-measure modification (heterogeneity of effect)

Estimation of effects usually requires consideration of effect-measure modification, which is also known as effect modification, effect variation, or heterogeneity of effect. As an example, suppose that drinking five cups of coffee a day elevated the myocardial infarction rate of men in our cohort by a factor of 1.40 (that is, a 40 per cent increase), but elevated the myocardial infarction rate of women by a factor of only 1.10 (a 10 per cent increase). This situation would be termed modification (or variation or heterogeneity) of the rate ratio by sex, and sex would be called a modifier of the coffee–myocardial infarction rate ratio.

As another example, suppose that drinking five cups of coffee a day elevated the myocardial infarction rate in men in our cohort by 400 cases per 100 000 person-years but elevated the rate in women by only 40 cases per 100 000 person-years. This situation would be termed modification of the rate difference by sex, and sex would be called a modifier of the coffee–myocardial infarction rate difference.

As a final example, suppose that drinking five cups of coffee per day elevated the myocardial infarction rate in our cohort by a factor of 1.22 in both men and women. This situation would be termed homogeneity of the rate ratios across sex.

Effect modification and homogeneity are not absolute properties of an effect but instead are properties of the way that the effect is measured. For example, suppose that drinking five cups of coffee per day elevated the myocardial infarction rate in men from 1000 cases per 100 000 person-years to 1220 cases per 100 000 person-years, but elevated the rate in women from 400 cases per 100 000 person-years to 488 cases per 100 000 person-years. Then the sex-specific rate ratios would both be 1.22, homogeneous across sex. In contrast, the sex-specific rate differences would be 220 cases per 100 000 person-years for males and 88 cases per 100 000 person-years for females, and so are heterogeneous or ‘modified’ by sex. Examples such as this show that one should not equate effect modification with biological concepts of interaction such as synergy or antagonism (Rothman and Greenland 1998, Chapter 18).
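The dependence of homogeneity on the scale of measurement can be confirmed directly from the figures in this example. The variable names in the sketch below are illustrative.

```python
import math

# Rates per 100 000 person-years: (never-drinkers, five-cup-a-day drinkers).
rates = {"men": (1000, 1220), "women": (400, 488)}

rate_ratios = {sex: exposed / unexposed
               for sex, (unexposed, exposed) in rates.items()}
rate_differences = {sex: exposed - unexposed
                    for sex, (unexposed, exposed) in rates.items()}

print(rate_ratios)       # the same ratio in both sexes
print(rate_differences)  # different differences in the two sexes
```

The same four rates are homogeneous on the ratio scale and heterogeneous on the difference scale.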

Effect modification can be analysed by stratifying the data on the potential effect modifier under study, estimating the effect within each stratum, and comparing the estimates across strata. There are several potential problems with this approach. The number of subjects in each stratum may be too small to produce stable estimates of stratum-specific effects, particularly after adjustment for confounder effects. Estimates may fluctuate wildly from stratum to stratum owing to random error. A related problem is that statistical tests for heterogeneity in stratified data have extremely low power in many situations, and therefore are likely to miss much if not most of the heterogeneity when used with conventional significance levels (such as 0.05). Finally, the amount of bias from confounding, measurement error, and other sources may vary from stratum to stratum, in which case the observed pattern of modification will be biased (Greenland 1980).

Effect-measure modification and generalizability

Suppose that we succeed in obtaining approximately unbiased estimates from our study. We can then confront issues of generalizability (external validity) of our results. For example, we can ask whether they accurately reflect the effect of coffee on myocardial infarction rates in a new target cohort. We can view such a question as a prediction problem in which the objective is to predict the strength of coffee effects in the new target cohort. From this perspective, generalizability of an effect estimate involves just one validity issue in addition to those discussed so far, namely confounding of the predicted effect by effect modifiers.

Suppose that the rate increase (in cases per 100 000 person-years) produced by coffee use is 400 for males and 40 for females among five-cup-a-day drinkers in both our study cohort and the new target. If our study cohort is 70 per cent male while the new target is only 30 per cent male, the average increase among five-cup-a-day drinkers in our study cohort would be 0.7 × 400 + 0.3 × 40 = 292, whereas the average increase in the new target would be only 0.3 × 400 + 0.7 × 40 = 148. Thus, any valid estimate of the average increase in our study cohort will tend to overestimate greatly the average increase in the new target. In other words, modification of coffee’s effect by sex confounds the prediction of its effect in the new target. This bias can be avoided by making only sex-specific predictions of effect or by standardizing the study results to the sex distribution of the new target population.
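The averaging in this example amounts to weighting the sex-specific increases by the sex distribution of the cohort in question. A minimal sketch follows; the function name `average_increase` is a hypothetical label for that weighted average.

```python
import math

# Rate increase among five-cup-a-day drinkers, cases per 100 000 person-years.
SEX_SPECIFIC_INCREASE = {"male": 400, "female": 40}


def average_increase(proportion_male):
    """Average increase in a cohort with the given proportion of males."""
    weights = {"male": proportion_male, "female": 1 - proportion_male}
    return sum(weights[sex] * SEX_SPECIFIC_INCREASE[sex] for sex in weights)


study_average = average_increase(0.7)   # study cohort, 70 per cent male
target_average = average_increase(0.3)  # new target, 30 per cent male
print(study_average, target_average)
```

Calling `average_increase` with the target's sex distribution is the standardization described above: the sex-specific effects come from the study, and only the weights change.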

Summary of example

The example used in this section provides an illustration of the most common threats to the validity of effect estimates from cohort studies. The unadjusted estimates of coffee effect on myocardial infarction will be confounded by many variables (such as smoking), and there will be follow-up bias. As a result, the number of variables that must be controlled is too large to allow adequate control using only stratification. The true functional dependence of myocardial infarction rates on coffee and the confounders is unknown, so that estimates based on multivariate models are likely to be biased. Even if this bias is unimportant, our estimates will remain confounded because of our inability to measure the key confounders accurately. Finally, our inability to summarize coffee consumption accurately would further bias our estimates, making it impossible to separate acute and chronic effects of coffee use reliably.

Given that there are several sources of bias of unknown magnitude and different directions, it would appear that no conclusions about coffee effect could be drawn from a study like the one described above, other than that coffee does not appear to have a large effect. This type of result—inconclusive, other than to rule out very large effects—is common in thorough epidemiological analyses of observational data. In particular, inconclusive results are common when the data being analysed were collected for purposes other than to address the hypothesis at issue, for such data often lack accurate measurements of key variables.

Validity in case–control and retrospective cohort studies

Case–control studies

The practical difficulties of cohort studies have led to extensive development of case–control study designs. The distinguishing feature of such designs is that sampling is intentionally based on the outcome of individuals.

In a population-based or population-initiated case–control study, one first identifies a population at risk of the outcome of interest, which is to be studied over a specified period of time or risk period. As in a cohort study, one attempts to ascertain outcome events in the population at risk. Nevertheless, unlike a cohort study, one selects people experiencing the outcome event (cases) and a ‘control’ sample of the entire population at risk for ascertainment of exposure and covariate status.

In a case-initiated case–control study, one starts by identifying a source of study cases (for example a hospital emergency room is a source of myocardial infarction cases). One then attempts to identify a population at risk such that the source of cases provides a random or complete sample of all cases occurring in this population. Study cases recruited from the source occur over a risk period; controls are selected in order to ascertain the distribution of exposure in the population at risk over that period.

Case–control studies may also begin with an existing series of controls (Greenland 1985). Regardless of how a case–control study is initiated, evaluation of validity must ultimately refer to a population at risk that represents the target of inference for the study.

Relative risk estimation in case–control studies

The control sample may or may not be selected in a manner that excludes cases. If people who become cases over the risk period are ineligible for inclusion in the control group (as in traditional case–control designs), a ‘rare-disease’ assumption may be needed to estimate relative risks from the case–control data. In contrast, if people who become cases over the risk period are also eligible for inclusion in the control group (as in newer case–control designs), the rare-disease assumption can be discarded. These points are discussed in more detail in the textbooks cited above.

The basics of case–control estimation will be illustrated with the following example. We wish to study the effect of coffee drinking on rates of first myocardial infarction and we have selected a population for study (for example, all residents aged 40 to 64 in a particular town) over a 1-year risk period. At any point during the risk period, the population at risk comprises people in this selected population who have not yet had a myocardial infarction.

Suppose that the average number of never-drinkers in the population at risk was 20 000 over the risk period, the average number of five-cup-a-day drinkers was 10 000, there were 120 first myocardial infarctions among never-drinkers, and there were 90 first myocardial infarctions among five-cup-a-day drinkers. Then, if one observed the entire population without error, the estimated rates among never-drinkers and five-cup-a-day drinkers would be

$$\frac{120}{20\,000 \text{ person-years}} = 600 \text{ per } 100\,000 \text{ person-years}$$

and

$$\frac{90}{10\,000 \text{ person-years}} = 900 \text{ per } 100\,000 \text{ person-years}.$$

Thus, if we observed the entire population, the estimated rate ratio would be

$$\frac{90/10\,000}{120/20\,000} = \frac{90/120}{10\,000/20\,000} = 1.50.$$

This estimate depends on only two figures: the relative prevalence of five-cup-a-day versus never-drinkers among cases (90/120), and the relative prevalence in the person-years at risk (10 000/20 000). These two relative prevalences are often called the case exposure odds and the population exposure odds.

The first relative prevalence (numerator) could be estimated by interviewing an unbiased sample of all the new myocardial infarction cases that occur over the risk period, and the second relative prevalence (denominator) could be estimated by interviewing an unbiased sample of the population at risk over the risk period. The ratio of relative prevalences from the case- and control-sample interviews would then be an unbiased estimate of the population rate ratio of 1.50. This estimate is called the sample odds ratio.

Three points about the preceding argument should be carefully noted. Firstly, no rare-disease assumption was made. Secondly, the control sample of the population at risk was accumulated over the entire risk period (rather than at the end of the risk period); such sampling is called density sampling (Chapter 7 in Rothman and Greenland 1998) or risk-set sampling (Breslow and Day 1987). Thirdly, because of the density sampling, someone may be selected for the control sample, and yet have a myocardial infarction later in the risk period and become part of the case sample as well. Methods for carrying out density sampling can be found in the textbooks cited above.
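The odds-ratio argument can be checked with expected sample counts. In the sketch below the sampling fractions are hypothetical assumptions; the population figures are those of the example, and sampling is taken to be unbiased, so that cases and controls are drawn in proportion to the population figures.

```python
import math


def sample_odds_ratio(exposed_cases, unexposed_cases,
                      exposed_controls, unexposed_controls):
    """Case exposure odds divided by control exposure odds."""
    return (exposed_cases / unexposed_cases) / (exposed_controls / unexposed_controls)


# Population figures from the example (1-year risk period).
CASES = {"five_cup": 90, "never": 120}
PERSON_YEARS = {"five_cup": 10_000, "never": 20_000}

# Hypothetical sampling fractions, applied equally to both exposure groups.
CASE_FRACTION = 0.5
CONTROL_FRACTION = 0.01

expected_cases = {k: CASE_FRACTION * v for k, v in CASES.items()}
expected_controls = {k: CONTROL_FRACTION * v for k, v in PERSON_YEARS.items()}

odds_ratio = sample_odds_ratio(expected_cases["five_cup"], expected_cases["never"],
                               expected_controls["five_cup"], expected_controls["never"])
population_rate_ratio = (90 / 10_000) / (120 / 20_000)
print(odds_ratio, population_rate_ratio)
```

The sampling fractions cancel, which is why the argument needs no rare-disease assumption, only that neither fraction depends on exposure.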

Validity conditions in case–control studies

The primary advantages of case–control studies are their short time frame and the large reduction in the number of subjects needed to achieve the same statistical power as a cohort study. The primary disadvantage is that more conditions must be met to ensure their validity (in addition to the four listed in the cohort study example).

Suppose that our case–control study data yield an unadjusted rate-ratio estimate (odds ratio) of 1.50, with 95 per cent confidence limits of 1.00 and 2.25. The following series of conditions would be sufficient for the validity of this figure as an estimate of the effect of drinking five cups of coffee a day (versus none) on the myocardial infarction rate.

Comparison validity (C). If five-cup-a-day drinkers in the population at risk had instead drunk no coffee, their distribution of myocardial infarction events over time would have been approximately the same as the distribution among never-drinkers.

Follow-up validity (F). Within each subpopulation defined by coffee use, censoring risk (that is, population membership ended by an event other than myocardial infarction, such as emigration or death from another cause) is not associated with myocardial infarction risk.

Specification validity (Sp). The distribution of myocardial infarction events over time in the subpopulations can be closely approximated by the statistical model on which the estimates are based.

Measurement validity (M). All measurements of variables used in the analysis closely approximate the true values of the variables.

Selection validity (Se). This has two components:

a. case-selection validity: if one studies only a subset of the myocardial infarction cases occurring in the population over the risk period (for example, because of failure to detect all cases), this subset provides unbiased estimates of the prevalence of different levels of coffee use among all cases occurring in the population over the risk period;

b. control-selection validity: the control sample provides unbiased estimates of the prevalences of different levels of coffee use in the population at risk over the risk period.

Issues of comparison validity, follow-up validity, specification validity, effect modification, and generalizability in case–control studies parallel those in follow-up studies, and so will not be discussed here. Case–control studies are vulnerable to certain problems of measurement error that are less severe or do not exist in prospective cohort studies. These problems are discussed first, and then selection validity and modelling are examined. Finally, analogous issues in retrospective cohort studies are briefly discussed.

Retrospective ascertainment

A special class of measurement errors arises from retrospective ascertainment of time-dependent variables when attempting to measure past values of the variables. Retrospective ascertainment must be based on individual memories, existing records of past values, or some combination of the two. Therefore such ascertainment usually suffers from faulty recall, missing or mistaken records, or lack of direct measurements in existing records.

Retrospective ascertainment may be an important component of a cohort study. For example, the cohort study of coffee and myocardial infarction discussed above could have been improved by asking subjects about their coffee use and smoking prior to the start of follow-up. This information would allow one to construct better cumulative indices than could be constructed from baseline consumption alone, although the resulting indices would still incorporate error due to faulty recall.

Unless records of past measurements are available for all subjects, measurements on cases and controls must be made after the time period under study since subjects are not selected for study until after that period. Thus, unlike cohort studies, most case–control studies of time-dependent variables depend on retrospective ascertainment. Considering our example, there may be much more error in determining daily coffee consumption 10 years before interview than 1 month before interview; it might then be expected that case–control studies are more accurate for studying acute effects than for studying chronic effects. Nonetheless, if acute and chronic effects are heavily confounded, the elevated inaccuracies of long-term recall will make it impossible to disentangle short-term from long-term effects. As illustrated above, this confounding can arise in a cohort study. In a cohort study such confounding can be minimized by taking repeated measurements. In contrast, such confounding would be unavoidable in a case–control study based on recall, even if detailed longitudinal histories were requested from the subjects.

The preceding observations should be tempered by noting that some case–control studies have access to exposure measurements of the same quality as found in cohort studies, and that the exposure measurements in some cohort studies may be no better than those used in some case–control studies. For example, a cohort study in which measurements are derived by abstracting routine medical records would suffer from no less measurement error than a case–control study in which measurements are derived by abstracting the same records.

Outcome-affected measurements

One common potential problem in case–control studies is outcome-affected recall, often termed recall bias. These terms refer to the differential measurement error that originates when the outcome event affects recall of past events. Examples arise in case–control studies of birth defects, for instance. If the trauma of having an affected child either enhances recall of prenatal exposures among case mothers or increases the frequency of false-positive reports among case mothers, estimates of relative risk will be upwardly biased by effects of the outcome on case recall (although this bias may be counterbalanced by other biases, such as recall bias among controls (Drews and Greenland 1990)).

One method commonly proposed for preventing bias due to outcome-affected recall is to restrict controls to a group believed to have recall similar to the cases. Unfortunately, one usually cannot tell to what degree this restricted selection corrects the bias from outcome-affected recall. Even more unfortunately, one usually cannot tell if the selection bias produced by such restriction is worse than the recall bias one is attempting to correct (Swan et al. 1992; Drews et al. 1993).

A problem similar to outcome-affected recall can occur when the outcome event affects a psychological or physiological measurement. This is of particular concern in case–control studies of nutrient levels and chronic disease. For example, if colon cancer leads to a drop in serum retinol levels, the relative risk for the effect of serum retinol will be underestimated if serum retinol is measured after the cancer develops. Errors of this type can be viewed as proxy-variable errors in which the post-outcome value is a poor proxy for the pre-outcome value of interest.

Selection validity

Selection validity is straightforward to understand but can be extraordinarily difficult to verify. A violation of the selection validity conditions is known as selection bias. Many case–control designs and field methods are devoted to avoiding such bias (Schlesselman 1982; Kelsey et al. 1996; Chapter 7 in Rothman and Greenland 1998).

In some instances it may be possible to identify a factor or factors that affect the chance of selection into the study. If in such instances we have accurate measurements of one of these factors, we can stratify on (or otherwise adjust for) the factor and thereby remove the selection bias due to the factor. Because of this possibility, some authors classify selection bias as a form of confounding. Nevertheless, there are some forms of selection bias that cannot be removed by adjustment. These points will be illustrated in the following subsections.

Case-selection validity

Unbiased selection of a case series can be best assured if one can identify every case that occurs in the population at risk over the risk period. This requires a surveillance system for the outcome of interest, such as a population-based disease registry. In our coffee–myocardial infarction example, we would probably have to construct a myocardial infarction surveillance system from existing resources, such as emergency room admission records, ambulance service records, and paramedic records.

Even if all cases of interest can be identified, selection bias may arise from failure to obtain information on all of the cases. In our example, many cases would be dead before interview was possible. For such cases, there are only two alternatives: attempt to obtain information from some other source, such as next of kin or coworkers, or exclude such cases from the study. The first alternative increases measurement error in the study. The second alternative will introduce bias if coffee affects risk of fatal and non-fatal myocardial infarction differently, or if coffee affects survival after myocardial infarction. To illustrate, suppose that coffee drinking reduced one’s chance of reaching the hospital alive when a myocardial infarction occurred. Then the prevalence of coffee use among myocardial infarction survivors would under-represent the prevalence among all myocardial infarction cases. Underestimation of the rate ratio would result if fatal myocardial infarction cases were excluded from the study.
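The survivorship illustration can be made concrete with a small numerical sketch. All inputs below are hypothetical values chosen for illustration, not figures from any study: a coffee prevalence of 0.60 among all myocardial infarction cases, and chances of reaching hospital alive of 0.60 for coffee drinkers and 0.80 for non-drinkers.

```python
# Hypothetical inputs (not from any study): exposure prevalence among all
# myocardial infarction cases and the chance of reaching hospital alive.
prev_all_cases = 0.60  # proportion of all cases who drink coffee
p_alive = {"coffee": 0.60, "no_coffee": 0.80}


def prevalence_among_survivors(prev, survival):
    """Coffee prevalence among the cases who survive to be interviewed."""
    exposed = prev * survival["coffee"]
    unexposed = (1 - prev) * survival["no_coffee"]
    return exposed / (exposed + unexposed)


def odds(p):
    """Convert a proportion to odds."""
    return p / (1 - p)


prev_survivors = prevalence_among_survivors(prev_all_cases, p_alive)
# Factor by which the case exposure odds (and hence the rate-ratio
# estimate) are distorted when only survivors are studied.
case_odds_factor = odds(prev_survivors) / odds(prev_all_cases)

print(f"prevalence among all cases: {prev_all_cases:.2f}")
print(f"prevalence among survivors: {prev_survivors:.2f}")
print(f"case exposure odds factor:  {case_odds_factor:.2f}")
```

Under these assumed inputs the survivors' coffee prevalence falls to about 0.53, and the case exposure odds are multiplied by 0.60/0.80 = 0.75, so the rate ratio would be underestimated by the same factor if nothing else were biased.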

It might seem possible to remove the case-selection bias in this example by redefining the study outcome as non-fatal myocardial infarction. This does not remove the bias, however; it only leads to its reclassification as a bias due to differential censoring (here classified as a form of follow-up bias). In a study of non-fatal myocardial infarction, fatal myocardial infarction is a censoring event associated with risk of non-fatal myocardial infarction; if fatal myocardial infarction is also associated with coffee use, the result will be underestimation of the rate ratio for non-fatal myocardial infarction. More generally, it is usually not possible to remove bias by placing restrictions on admissible outcomes.

Unfortunately, exclusion is the only alternative for cases that refuse to participate or cannot be located. In our example, if such cases tend to be heavier coffee users than others, underestimation of the rate ratio would result. However, suppose that, within levels of cigarette use, such cases were no different from other cases with respect to coffee use. Then adjustment for smoking would remove the selection bias induced by refusals and failures to locate cases. (Of course, such adjustment would require accurate smoking measurement, which is a problem in itself.)
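A minimal sketch of why stratifying on smoking can remove this bias is given below. The case counts and participation proportions are hypothetical, and the sketch assumes what the text assumes: that participation depends on smoking status but not, within a smoking level, on coffee use.

```python
# Hypothetical complete case series, cross-classified by smoking and coffee use.
cases = {
    "smokers": {"coffee": 300, "no_coffee": 100},
    "non_smokers": {"coffee": 100, "no_coffee": 200},
}
# Assumed participation proportions: they differ by smoking status only.
participation = {"smokers": 0.60, "non_smokers": 0.90}


def exposure_odds(counts):
    """Odds of coffee use versus non-use in a table of counts."""
    return counts["coffee"] / counts["no_coffee"]


def crude(table):
    """Collapse the smoking strata into a single table of counts."""
    return {k: sum(stratum[k] for stratum in table.values())
            for k in ("coffee", "no_coffee")}


# Expected numbers of cases who actually take part in the study.
participants = {
    s: {k: n * participation[s] for k, n in counts.items()}
    for s, counts in cases.items()
}

for s in cases:
    print(s, round(exposure_odds(cases[s]), 3),
          round(exposure_odds(participants[s]), 3))
print("crude", round(exposure_odds(crude(cases)), 3),
      round(exposure_odds(crude(participants)), 3))
```

In this made-up series the crude exposure odds among participating cases (1.125) understate those among all cases (about 1.333), because smokers drink more coffee and participate less. The smoking-specific odds are unchanged, which is why a smoking-adjusted analysis would be free of this particular bias.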

Bias that arises from failure to detect certain cases is sometimes called detection bias. If our surveillance system used only hospital admissions, many out-of-hospital myocardial infarction deaths would be excluded, and a detection bias of the sort described above could result.

Control-selection validity

Unbiased selection of a control group can best be assured if one can potentially identify every member of the population at risk at every time during the risk period. In such a situation one could select controls with one of many available probability sampling techniques, using the entire population at risk as the sampling frame. Unfortunately, such situations are exceptional.

Many studies attempt to approximate the ideal sampling situation through use of existing population lists. An example is control selection by random digit dialling; here, the list (of residential telephone numbers) is not used directly but nevertheless serves as a partial enumeration of the population at risk. This list excludes people without telephone numbers. In our example, if people without telephones drink less coffee than people with telephones, a control group selected by random digit dialling would over-represent coffee use in the population at risk. The result would be underestimation of the rate ratio.

One could redefine the population at risk in the previous example so that the telephone-related selection bias did not exist by restricting the study to people with telephones. This would require excluding people without telephones from the case series. The resulting relative risk estimate would suffer no selection bias. The only important penalty from this restriction is that the resulting estimate might apply only to the population of people with telephones, which is a problem of generalizability rather than a problem of selection validity. In a similar fashion, it is often possible to prevent confounding or selection bias by placing restrictions on the population at risk (and hence the control group). In such instances, however, one must take care to apply the same restrictions to the case series and avoid using restrictions based on events that occur after exposure (Chapter 7 in Rothman and Greenland 1998; Poole 1999).

Even if all members of the population at risk can be identified, selection bias may arise from failure to obtain information on all people selected as controls. The implications are the reverse of those for case-selection bias. In our example, if controls who refuse to participate or cannot be located tend to be heavier coffee users than other controls, overestimation of the rate ratio would result. This should be contrasted with the underestimation that results from the same tendency among cases.

More generally, one might expect an association of selection probabilities with the study variable to be in the same direction for both cases and controls. If so, the resulting case-selection and control-selection biases would be in opposite directions and so, to some extent, they would cancel one another out, although not completely. To illustrate, suppose that among cases the proportions who refuse to participate are 0.05 for five-cup-a-day drinkers and 0.02 for never-drinkers, and among controls the analogous proportions are 0.20 and 0.10. These refusals will result in the odds of five-cup-a-day versus never-drinkers among cases being underestimated by a factor of 0.95/0.98 = 0.97; this in turn results in a 3 per cent underestimation of the rate ratio. Among controls, the odds will be underestimated by a factor of 0.80/0.90 = 0.89; this results in overestimation of the rate ratio by a factor of 1/0.89 = 1.12, that is, a 12 per cent overestimation. The net selection bias in the rate-ratio estimate will then be a factor of 0.97/0.89 = 1.09, or 9 per cent overestimation.
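The arithmetic in this illustration can be checked with a few lines of code. The refusal proportions are those given in the text; the function and variable names are only illustrative.

```python
# Refusal proportions from the text, by case status and coffee level.
refusal = {
    "cases": {"five_cup": 0.05, "never": 0.02},
    "controls": {"five_cup": 0.20, "never": 0.10},
}


def odds_bias_factor(refuse_exposed, refuse_unexposed):
    """Factor applied to the exposure odds when each exposure group
    participates with probability (1 - refusal proportion)."""
    return (1 - refuse_exposed) / (1 - refuse_unexposed)


case_factor = odds_bias_factor(refusal["cases"]["five_cup"],
                               refusal["cases"]["never"])
control_factor = odds_bias_factor(refusal["controls"]["five_cup"],
                                  refusal["controls"]["never"])

# The rate-ratio estimate is the case odds divided by the control odds,
# so its net bias factor is the ratio of the two odds factors.
net_factor = case_factor / control_factor

print(f"case odds factor:    {case_factor:.2f}")
print(f"control odds factor: {control_factor:.2f}")
print(f"net bias factor:     {net_factor:.2f}")
```

The three printed factors round to 0.97, 0.89, and 1.09, matching the figures in the text.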

For further discussions of control-selection validity, see the textbooks cited above, and also Schlesselman (1982), Savitz and Pearce (1988), Swan et al. (1992), and Wacholder et al. (1992).

Matching

In cohort studies, matching refers to selection of exposure subcohorts in a manner that forces the matched factors to have similar distributions across the subcohorts. If the matched factors are accurately measured and the proportion lost to follow-up does not depend on the matched factors, cohort matching can prevent confounding by the matched factors, although there are statistical reasons to control the matched factors in the analysis (Weinberg 1985).

In case–control studies, matching refers to selection of subjects in a manner that forces the distribution of certain factors to be similar in cases and controls. Because the population at risk is not changed by case–control matching, such matching does not prevent confounding by the matched factors. In fact, it is now widely recognized that case–control matching is a form of selection bias that can be removed by adjusting for the matching factor; to the extent the factor has been closely matched and accurately measured, this adjustment also controls for confounding by the factor (Rothman and Greenland 1998, Chapter 10).

As an example, suppose that our population at risk is half male, that the men tend to drink less coffee than the women, and that about 75 per cent of our cases are men. Unbiased control selection should yield about 50 per cent men in the control group. However, if we matched controls to cases on sex, about 75 per cent of our controls would be men. Since men drink less coffee than women and men would be over-represented in the matched control group, the matched control group would under-represent coffee use in the population at risk. Note, however, that matching does not affect the sex-specific prevalence of coffee use among controls, and so the sex-specific and sex-adjusted estimates would be unaffected by matching. In other words, the selection bias produced by matching could be removed by adjustment for the matching factor.
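This example can be sketched numerically. The proportions of men (50 per cent in the population at risk, 75 per cent among matched controls) come from the text; the sex-specific prevalences of coffee use, 0.40 for men and 0.60 for women, are hypothetical values assumed only for illustration.

```python
# Hypothetical sex-specific prevalence of coffee use in the population at risk.
prevalence = {"men": 0.40, "women": 0.60}


def overall_prevalence(share_men):
    """Coffee prevalence in a group in which `share_men` are men,
    holding the sex-specific prevalences fixed."""
    return (share_men * prevalence["men"]
            + (1 - share_men) * prevalence["women"])


unmatched = overall_prevalence(0.50)  # controls mirror the population at risk
matched = overall_prevalence(0.75)    # controls matched to the cases on sex

print(f"coffee prevalence, unmatched controls: {unmatched:.2f}")
print(f"coffee prevalence, matched controls:   {matched:.2f}")
```

With these assumed prevalences, matching lowers the overall coffee prevalence among controls from 0.50 to 0.45 while leaving the prevalence within each sex untouched, so only the crude (sex-unadjusted) comparison is distorted.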

The conclusion to be drawn is that matching can necessitate control of the matching factors. Thus, in order to avoid increasing the number of factors requiring control unnecessarily, one should limit matching to factors for which control would probably be necessary anyway. In particular, matching is usually best limited to known strong confounders, such as age and sex in the above example (Schlesselman 1982; Rothman and Greenland 1998, Chapter 10).

More generally, the primary theoretical value of matching is that it can sometimes reduce the variance of adjusted estimators. However, there are circumstances in which matching can facilitate control selection and so is justified on practical grounds. For example, neighbourhood controls may be far easier to obtain than unmatched general population controls. In addition, although neighbourhood matching would necessitate use of a matched analysis method, the neighbourhood-matched results would incorporate some control of confounding by factors associated with neighbourhood (such as socio-economic status and air pollution).

Special control groups

It is not unusual for investigators to select a special control group that is clearly not representative of the population at risk if they can argue either (a) that the group will adequately reflect the distribution of the study factor in the population at risk, or (b) that the selection bias in the control group is of the same magnitude as (and so will cancel with) the selection bias in the case group. The first rationale is common in case–control studies of mortality in which people dying of other selected causes of death are used as controls; in such studies, selection validity can be assured only if the control causes of death are unrelated to the study factor. The second rationale is common in studies using hospital cases and controls; in particular, selection validity can be assured in such studies if the control conditions are unrelated to the study factor, and the study disease and the control conditions have proportional exposure-specific rates of hospital admission (Schlesselman 1982).

Selection into a special control group usually requires membership in a small and highly select subset of the population at risk. Thus use of a special control group requires careful scrutiny for mechanisms by which the study factor may influence entry into the subset. See Schlesselman (1982) and Kelsey et al. (1996) for discussions of practical issues in evaluating special control groups, and Rothman and Greenland (1998) for validity principles in mortality case–control studies (so-called proportionate mortality studies).

Case–control modelling

The most popular model for case–control analysis is the logistic model. Details of logistic modelling for case–control analysis are covered in many textbooks, including Breslow and Day (1980), Schlesselman (1982), Hosmer and Lemeshow (1989), Clayton and Hills (1993), Kelsey et al. (1996), and Rothman and Greenland (1998, Chapter 21).

One important aspect of case–control modelling is that matched factors require special treatment. For example, suppose that matching is done on age in 5-year categories and age is associated with the study exposure. To control for the selection bias produced by matching, one must either employ conditional logistic regression with age as a stratifying factor, or else enter indicator variables for each age-matching category into an ordinary logistic regression (the latter strategy has the drawback of requiring about 10 or more subjects per age stratum to produce valid estimates). Simply entering age into the model as a continuous variable may not adequately control for the matching-induced bias (Rothman and Greenland 1998, Chapter 21).
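The two acceptable strategies can be written out as a sketch. The notation is an assumption of this illustration rather than the chapter's own: x denotes the study exposure, k indexes the K age-matching categories, and α_k and β are the model parameters.

```latex
\operatorname{logit}\,\Pr(\text{case} \mid x,\ \text{age category } k)
  = \alpha_k + \beta x, \qquad k = 1, \dots, K .
```

Ordinary logistic regression with indicator variables estimates each α_k along with β, whereas conditional logistic regression conditions on the number of cases in each category so that the α_k drop out of the likelihood. Entering age as a single continuous term amounts to replacing α_k by α + γ·age, which constrains the category-specific intercepts to lie on a straight line in age and so may not absorb the matching-induced differences between categories.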

Summary of example

The example in this section provides an illustration of the most common threats to validity in case–control studies (beyond those already discussed for cohort studies). After adjustments for possible confounding and follow-up bias (along the lines described for the cohort study), there may still be irremediable selection bias, especially if we use only select case groups (for example myocardial infarction survivors) or control groups (for example hospital controls). In addition, retrospective ascertainment will lead to greater measurement error than prospective ascertainment, and some of this additional error may be differential.

Given the even greater number of potential biases of unknown magnitude and different directions, it would appear that (as in the cohort example) no conclusions about the effect of coffee could be drawn from a study like the one described above, other than that coffee does not have a large effect. Again, this is a common result in thorough epidemiological analyses of observational data.

Retrospective cohort studies

Two major types of cohort studies can be distinguished depending on whether members of the study cohort are identified before or after the follow-up period under study. Studies in which all members are identified before their follow-up period are called concurrent or prospective cohort studies, while studies in which all members are identified after their follow-up period are called historical or retrospective cohort studies. Like case–control studies, retrospective cohort studies often require special consideration of retrospective ascertainment and selection validity.

In particular, retrospective cohort studies that obtain exposure or covariate histories from post-event reconstructions are vulnerable to bias from outcome-affected measurements. Suppose, for example, that a study of cancer incidence at an industrial facility had to rely on company personnel to determine the location and nature of various exposures in the plant during the relevant exposure periods. If these personnel were aware of the locations at which cases worked (as when a publicized ‘cluster’ of cases has occurred), biased exposure assessment could result. Such problems can also occur in a prospective cohort study if exposure or covariate histories are based on post-event reconstructions.

Retrospective cohort studies can also suffer from selection biases analogous to those found in case–control studies. Suppose, for example, that a retrospective cohort study relied on company records to identify members of the cohort of plant employees. If retention of an employee’s records (and hence identification of the employee as a cohort member) were associated with both the exposure and outcome status of the employee, the exposure–outcome association observed in the incomplete study cohort could poorly represent the exposure–outcome association in the complete cohort of plant employees.

Conclusion

Uncertainty about validity conditions is responsible for most of the inconclusiveness inherent in epidemiological studies. This inconclusiveness can be partially overcome when multiple complementary studies are conducted, that is, when new studies are conducted under conditions that effectively limit bias from one or more of the sources present in earlier studies. Ideally, after enough complementary studies have been conducted, each known or suspected source of bias will have been rendered unimportant in at least one study. If at this point all the study results appear consistent with one another (which is not the case for coffee and myocardial infarction, although the studies of smoking and lung cancer provide a good example), the epidemiological community may reach some consensus about the existence and strength of an effect.

Even in such ideal situations, however, one should bear in mind that consistency is not validity. For example, there may be some unsuspected source of bias present in all the studies, so that they are all consistently biased in the same direction. Alternatively, all the known sources of bias may be in the same direction, so that all the studies remain biased in the same direction if no one study eliminates all known sources of bias. For these and other reasons, many authors warn that all causal inferences should be considered tentative, at least if drawn from observational epidemiological data alone (Rothman 1988; Rothman and Greenland 1998, Chapter 2).

Chapter References

Barnett, V. (1999). Comparative statistical inference (3rd edn). Wiley, New York.

Berger, J.O. and Berry, D.A. (1988). Statistical analysis and the illusion of objectivity. American Scientist, 76, 159–65.

Brenner, H. (1993). Bias due to nondifferential misclassification of a polytomous confounder. Journal of Clinical Epidemiology, 46, 57–63.

Breslow, N.E. and Day, N.E. (1980). Statistical methods in cancer research. I: The analysis of case–control studies. IARC, Lyon.

Breslow, N.E. and Day, N.E. (1987). Statistical methods in cancer research. II: The analysis of cohort data. IARC, Lyon.

Checkoway, H., Pearce, N., and Crawford-Brown, D. (1989). Research methods in occupational epidemiology. Oxford University Press, New York.

Clayton, D. and Hills, M. (1993). Statistical models in epidemiology. Oxford University Press, New York.

Cornfield, J., Haenszel, W.H., Hammond, E.C., et al. (1959). Smoking and lung cancer: recent evidence and a discussion of some questions. Journal of the National Cancer Institute, 22, 173–203 (Appendix A).

Curb, J.D., Reed, D.M., Kautz, J.A., et al. (1986). Coffee, caffeine, and serum cholesterol in Japanese men in Hawaii. American Journal of Epidemiology, 123, 648–55.

Dosemeci, M., Wacholder, S., and Lubin, J.H. (1990). Does nondifferential misclassification of exposure always bias a true effect towards the null value? American Journal of Epidemiology, 132, 746–8.

Dosemeci, M., Wacholder, S., and Lubin, J.H. (1991). The authors clarify and reply. American Journal of Epidemiology, 134, 441–2.

Drews, C.D. and Greenland, S. (1990). The impact of differential recall on the results of case–control studies. International Journal of Epidemiology, 19, 1107–12.

Drews, C., Greenland, S., and Flanders, W.D. (1993). The use of restricted controls to prevent recall bias in case–control studies of reproductive outcomes. Annals of Epidemiology, 3, 86–92.

Flanders, W.D. and Khoury, M.J. (1990). Indirect assessment of confounding: graphic description and limits on effects of adjusting for covariates. Epidemiology, 1, 239–46.

Flegal, K.M., Keyl, P.M., and Nieto, F.J. (1991). Differential misclassification arising from nondifferential errors in exposure measurement. American Journal of Epidemiology, 134, 1233–44.

Goodman, S.N. (1992). A comment on replication, P-values and evidence. Statistics in Medicine, 11, 875–9.

Goodman, S.N. (1993). P-values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. American Journal of Epidemiology, 137, 485–96.

Goodman, S.N. (1999). Toward evidence-based medical statistics. I: The P value fallacy. Annals of Internal Medicine, 130, 995–1021.

Goodman, S.N. and Royall, R.M. (1988). Evidence and scientific research. American Journal of Public Health, 78, 1568–74.

Greenland, S. (1980). The effect of misclassification in the presence of covariates. American Journal of Epidemiology, 112, 564–9.

Greenland, S. (1985). Control initiated case–control studies. International Journal of Epidemiology, 14, 130–4.

Greenland, S. (1987). Interpretation and choice of effect measures in epidemiologic analyses. American Journal of Epidemiology, 125, 761–8.

Greenland, S. (1990). Randomization, statistics, and causal inference. Epidemiology, 1, 421–9.

Greenland, S. (1993a). A meta-analysis of coffee, myocardial infarction, and coronary death. Epidemiology, 4, 366–74.

Greenland, S. (1993b). Summarization, smoothing, and inference. Scandinavian Journal of Social Medicine, 21, 421–9.

Greenland, S. (1994). A critical look at some popular meta-analytic methods. American Journal of Epidemiology, 140, 290–6.

Greenland, S. (1998a). Meta-Analysis. In Modern epidemiology (ed. K.J. Rothman and S. Greenland), (2nd edn). Lippincott-Raven, Philadelphia, PA.

Greenland, S. (1998b). Probability logic and probabilistic induction. Epidemiology, 9, 322–32.

Greenland, S. and Maldonado, G. (1994). The interpretation of multiplicative model parameters as standardized parameters. Statistics in Medicine, 13, 989–99.

Greenland, S. and Robins, J.M. (1985). Confounding and misclassification. American Journal of Epidemiology, 122, 495–506.

Greenland, S. and Robins, J.M. (1988). Conceptual problems in the definition and interpretation of attributable fractions. American Journal of Epidemiology, 128, 1185–97.

Greenland, S. and Robins, J.M. (1994). Ecologic studies: biases, misconceptions, and counterexamples. American Journal of Epidemiology, 139, 747–60.

Greenland, S., Schlesselman, J.J., and Criqui, M.H. (1986). The fallacy of employing standardized regression coefficients and correlations as measures of effect. American Journal of Epidemiology, 123, 203–8.

Greenland, S., Maclure, M., Schlesselman, J.J., et al. (1991). Standardized coefficients: a further critique and a review of alternatives. Epidemiology, 2, 387–92.

Greenland, S., Pearl, J., and Robins, J.M. (1999a). Causal diagrams for epidemiologic research. Epidemiology, 10, 37–48.

Greenland, S., Robins, J.M., and Pearl, J. (1999b). Confounding and collapsibility in causal inference. Statistical Science, 14, 29–46.

Hastie, T. and Tibshirani, R. (1990). Generalized additive models. Chapman & Hall, New York.

Hosmer, D.W. and Lemeshow, S. (1989). Applied logistic regression. Wiley, New York.

Kalbfleisch, J.D. and Prentice, R.L. (1980). The statistical analysis of failure-time data. Wiley, New York.

Kelsey, J.L., Whittemore, A.S., Evans, A.S., and Thompson, W.D. (1996). Methods in observational epidemiology (2nd edn). Oxford University Press, New York.

Leamer, E.E. (1978). Specification searches. Wiley, New York.

McCullagh, P. and Nelder, J.A. (1989). Generalized linear models, (2nd edn). Chapman & Hall, New York.

Maldonado, G. and Greenland, S. (1994). A comparison of the performance of model-based confidence intervals when the correct model form is unknown. Epidemiology, 5, 171–82.

Morgenstern, H. (1998). Ecologic studies. In Modern epidemiology (ed. K.J. Rothman, and S. Greenland), (2nd edn). Lippincott-Raven, Philadelphia, PA.

Oakes, M. (1990). Statistical inference. Epidemiology Resources, Chestnut Hill, MA.

Poole, C. (1985). Exceptions to the rule about nondifferential misclassification (abstract). American Journal of Epidemiology, 122, 508.

Poole, C. (1987a). Beyond the confidence interval. American Journal of Public Health, 77, 197–9.

Poole, C. (1987b). Confidence intervals exclude nothing. American Journal of Public Health, 77, 492–3.

Poole, C. (1999). Controls who experienced hypothetical causal intermediates should not be excluded from case–control studies. American Journal of Epidemiology, 150, 547–51.

Robins, J.M. (1988). Confidence intervals for causal parameters. Statistics in Medicine, 7, 773–85.

Robins, J.M. and Greenland, S. (1986). The role of model selection in causal inference from nonexperimental data. American Journal of Epidemiology, 123, 392–402.

Robins, J.M. and Greenland, S. (1992). Identifiability and exchangeability for direct and indirect effects. Epidemiology, 3, 143–55.

Robins, J.M. and Greenland, S. (1994). Adjusting for differential rates of prophylaxis therapy for PCP in high- versus low-dose AZT treatment arms in an AIDS randomized trial. Journal of the American Statistical Association, 90, 737–49.

Rosenbaum, P.R. (1995). Observational studies. Springer-Verlag, New York.

Rothman, K.J. (1988). Causal inference. Epidemiology Resources, Chestnut Hill, MA.

Rothman, K.J. and Greenland, S. (1998). Modern epidemiology (2nd edn). Lippincott-Raven, Philadelphia, PA.

Rubin, D.B. (1991). Practical implications of modes of statistical inference for causal effects, and the critical role of the assignment mechanism. Biometrics, 47, 1213–34.

Savitz, D.A. and Pearce, N. (1988). Control selection with incomplete case ascertainment. American Journal of Epidemiology, 127, 1109–17.

Schlesselman, J.J. (1978). Assessing the effects of confounding variables. American Journal of Epidemiology, 108, 3–8.

Schlesselman, J.J. (1982). Case-control studies: design, conduct, analysis. Oxford University Press, New York.

Slud, E. and Byar, D. (1988). How dependent causes of death can make risk factors appear protective. Biometrics, 44, 265–70.

Swan, S.H., Shaw, G.M., and Schulman, J. (1992). Reporting and selection bias in case–control studies of congenital malformations. Epidemiology, 3, 356–63.

Wacholder, S., Dosemeci, M., and Lubin, J.H. (1991). Blind assignment of exposure does not always prevent differential misclassification. American Journal of Epidemiology, 134, 433–7.

Wacholder, S., McLaughlin, J.K., Silverman, D.T., and Mandel, J.S. (1992). Selection of controls in case–control studies. American Journal of Epidemiology, 135, 1019–50.

Walker, A.M. (1991). Observation and inference: an introduction to the methods of epidemiology. Epidemiology Resources, Chestnut Hill, MA.

Weinberg, C.R. (1985). On pooling across strata when frequency matching has been followed in a cohort study. Biometrics, 41, 103–16.

Weinberg, C.R., Umbach, D., and Greenland, S. (1994). When will non-differential misclassification preserve the direction of the trend? American Journal of Epidemiology, 140, 565–71.

White, H. (1993). Estimation, inference, and specification analysis. Cambridge University Press, New York.
