Technical Comments

Genetic Analysis of Complex Diseases


Science  28 Feb 1997:
Vol. 275, Issue 5304, pp. 1327-1330
DOI: 10.1126/science.275.5304.1327

Neil Risch, in a series of seminal papers in 1990 (1, 2), demonstrated the utility of sib-pair linkage analysis in identifying genes for complex genetic traits. In doing so, he defined what is the current paradigm for the genetic dissection of common complex genetic diseases. The recent Perspective by Risch and Kathleen Merikangas (3) again shapes the future of human disease gene mapping by defining what will undoubtedly become the statistical “state of the art.” Risch and Merikangas advocate conducting genomic screens based on association studies of candidate genes using the transmission disequilibrium test (TDT). It is important, however, not to infer from their arguments that current linkage analysis methods cannot detect most genes underlying complex disease.

Risch and Merikangas extend the current paradigm to include anticipated technological advances. However, the magnitude of γ (defined as the relative risk in the heterozygote) in complex traits and the application of this approach using current molecular technology must be considered.

In their formulation, Risch and Merikangas show that a TDT approach is more powerful than a sib pair approach, particularly for disease alleles with small genetic effects. This conclusion is based on the sample size required to detect a gene with a γ ≤ 4. They show that, while sib pair analysis requires a practical (for example, 100 to 400) number of sib pairs to detect a gene with γ = 4 and a disease allele frequency p between 0.1 and 0.5, the number of sib pairs required becomes impractical (for example, more than 1000) when γ ≤ 2.

Previously, Risch (1, 2) established the use of a sibling recurrence risk ratio (λs) to estimate the power of a sib pair design to detect linkage. Estimable from epidemiologic data, λs is calculated as the ratio of the recurrence risk in siblings of an affected individual to the population prevalence of the disorder. This λs represents the overall recurrence risk ratio, which may result from the actions of a single gene or multiple genes acting additively or epistatically. If a number of genes are hypothesized, the gene-specific λs (referred to here as λgs) may be estimated by assuming a model (additive or epistatic) and a number of genes, and partitioning λs accordingly.
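For the multiplicative (epistatic) case with n loci of equal effect, λs is the product of the locus-specific ratios, so λgs = λs^(1/n). A one-line sketch of this partitioning (the function name is ours):

```python
def partition_lambda_s(lambda_s, n_loci):
    """Locus-specific recurrence risk ratio, assuming n_loci genes of
    equal effect acting multiplicatively (epistatically)."""
    return lambda_s ** (1.0 / n_loci)

# Multiple sclerosis example from the text: lambda_s = 30 split over 10 loci
print(round(partition_lambda_s(30, 10), 2))  # -> 1.41
```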

Many researchers are accustomed to evaluating the magnitude of genetic effects by using λs rather than γ. We calculated the λgs values corresponding to γ ≤ 2 for p = 0.01, 0.10, 0.5, and 0.8. The results indicated that for γ ≤ 2, λgs < 1.3. It has previously been shown that genes with λgs < 1.3 would be difficult to detect using sib pair methods (2, 4). The curve comparing λgs with γ shows that even genes of moderate effect (for example, λgs < 2) may produce values of γ large enough to be detected by linkage analysis in reasonably sized samples of sib pairs.
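Under Risch and Merikangas's multiplicative model, λgs follows directly from γ and p. A minimal sketch of the calculation (the function name and the identity-by-descent route are ours), using λgs = 1/4 + λ1/2 + λ1²/4 with λ1 = (pγ² + q)/(pγ + q)² and q = 1 − p:

```python
def lambda_gs(gamma, p):
    """Locus-specific sibling recurrence risk ratio for a diallelic
    locus with disease allele frequency p and multiplicative genotypic
    relative risks gamma (Aa) and gamma**2 (AA)."""
    q = 1.0 - p
    # Risk ratio for sibs sharing one allele identical by descent (IBD)
    lam1 = (p * gamma**2 + q) / (p * gamma + q) ** 2
    # Average over IBD sharing of 0, 1, or 2 alleles (probs 1/4, 1/2, 1/4)
    return 0.25 + 0.5 * lam1 + 0.25 * lam1**2

# For gamma = 2, lambda_gs stays below 1.3 at every allele frequency tested
for p in (0.01, 0.10, 0.5, 0.8):
    print(p, round(lambda_gs(2, p), 3))
```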

These results indicate that genomic screens using current microsatellite markers can make linkage analysis of complex genetic diseases a fruitful enterprise. An excellent example is the discovery of the late onset Alzheimer's disease susceptibility gene APOE. Using Risch and Merikangas's formulas, we calculated the number of sib pairs that would have been necessary to detect the effect of APOE on the risk of AD. The γ in individuals heterozygous for APOE-4 is 4.5 (5), and the frequency of APOE-4 in the general population is about 15% (6); the resulting probability of allele sharing is Y = 0.625, and the minimum number of affected sib pairs required to detect linkage is 164. Alzheimer's disease has an overall λs of 5, with a λgs of only 2 for APOE (7). Other complex diseases, such as multiple sclerosis (λs = 30) (8) and autism (λs = 150) (9), have substantial genetic components. Even if 10 epistatic genes of equal multiplicative effect underlie multiple sclerosis (λs = 30; λgs = 1.4), linkage analysis should be able to detect them. Because it is difficult to determine a priori which disease alleles have minor or moderate genetic effects, linkage analysis should not be arbitrarily abandoned.
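The APOE figures above can be reproduced from the same multiplicative model. This is a sketch under our own assumptions about the testing conventions (one-sided significance near 10⁻⁴, z ≈ 3.72, roughly a lod score of 3, with 80% power, z ≈ 0.84); the normal-approximation sample-size formula and function name are ours, and they give values close to the 0.625 and 164 cited:

```python
import math

def sharing_and_sample_size(gamma, p, z_alpha=3.72, z_beta=0.84):
    """Per-parent allele-sharing probability Y for affected sib pairs,
    and an approximate number of pairs needed to detect linkage,
    assuming a fully informative marker at the disease locus."""
    q = 1.0 - p
    lam1 = (p * gamma**2 + q) / (p * gamma + q) ** 2  # IBD = 1 risk ratio
    lam2 = lam1**2                                    # IBD = 2 (multiplicative)
    lam_s = 0.25 + 0.5 * lam1 + 0.25 * lam2           # locus-specific lambda_s
    # Posterior IBD-sharing probabilities among affected sib pairs
    z2, z1 = 0.25 * lam2 / lam_s, 0.5 * lam1 / lam_s
    Y = z2 + 0.5 * z1                                 # sharing per parent
    # Normal approximation: test Y against 1/2 across parental meioses
    n_parents = (0.5 * z_alpha + z_beta * math.sqrt(Y * (1 - Y))) ** 2 / (Y - 0.5) ** 2
    return Y, lam_s, n_parents / 2.0                  # sib pairs = parents / 2

Y, lam_s, n_pairs = sharing_and_sample_size(4.5, 0.15)  # APOE-4 example
print(round(Y, 3), round(n_pairs))  # Y near 0.625, roughly 160-165 pairs
```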

Risch and Merikangas point out that genomic screening of candidate genes is several years from becoming reality. When the molecular resources become available, the advantages of genomic screening using TDT, such as increasing power to detect minor genetic effects, allowing the use of singleton cases, and testing effects of functional polymorphisms in genes, will make this the method of choice. Until then, well-designed linkage studies of complex traits will still be able to detect genes of major or moderate effect.

REFERENCES AND NOTES


Risch and Merikangas (1) point out the efficiency of association studies for statistical power in identifying genetic markers of disease. But they limit themselves to studies of family-based association, affected sib-pairs, and parental transmission of alleles and do not mention population-based association studies (either cohort or case-control). While family-based association studies do have certain strengths, population-based studies can be far more efficient in terms of time, money, and logistics. It can take much longer to identify and collect samples from a single affected family than to collect samples from 10 or 100 patients with disease. Some studies, such as those of parental transmission, may not be practical in adult onset diseases where parents are deceased. Such practical issues, as well as our ability to generalize the results to the larger population, favor the use of population-based studies.

Perhaps as important, population-based studies commonly measure environmental exposures and can assess gene-environment interaction, data for which are nearly always lacking in family-based studies. An association between a susceptibility gene and a disease may not be apparent if there is a second factor required to initiate the disease process, such as an environmental exposure. Similarly, one may detect the effect of exposure only among genetically susceptible subpopulations. There are a number of neurologic and other diseases in which this model functions, but the cases of genes that modulate carcinogen-induced cancers (such as the polymorphic glutathione S-transferases and N-acetyltransferases) are perhaps the best examples (2). Simple tests of gene-disease association are likely to be misleading without due attention to environmental factors.

REFERENCES AND NOTES


Risch and Merikangas make the intriguing point that, given a relatively small number of families, the transmission-disequilibrium test (1) has enough statistical power to determine if any of a large (in some sense complete, genome-wide) set of diallelic markers is associated with a disease. Another approach, based on disequilibrium between marker alleles and disease in a randomly ascertained population sample, can be considered. Like Risch and Merikangas, we can show that, when the disease is relatively common, the disease-allele frequency is intermediate and its effect small, statistical power comparable to that of standard family-based linkage studies is achieved with a smaller number of randomly sampled individuals. The sample sizes required for the disequilibrium method are generally larger than those for transmission-disequilibrium, but the random-ascertainment scheme has practical advantages.

If one assumes that π is the probability that an individual with genotype aa has the disease, with Aa and AA individuals being, respectively, γ and γ2 more likely to develop the disease than aa individuals, one can show the expected frequencies of four categories defined by allelic state and disease status in a random population sample (2). The association between marker allele A and disease can be tested with the “chi-square” statistic. The same statistic can be used to calculate sample sizes needed to detect such an association, if indeed it exists, with a given significance level and power, for fixed values of π, γ, and p (3). When the prevalence of the disease is greater than about 5% and the disease allele is not rare, the random sample approach requires no more than 10 times the number of genotyped individuals in an affected offspring study (4). Once genotyped, the same sample can be used to study a number of diseases for the additional, small cost of ascertaining the presence or absence of a disease in each individual. The random sample approach could be especially useful for efficiently diagnosed late onset diseases, where it may not be possible to type parents for affected offspring studies. Non-insulin-dependent diabetes and hypertension, with prevalences of 6% (5) and 23% (6), respectively, could be effectively studied using this approach.
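The four expected frequencies and the resulting sample sizes can be sketched as follows. This is our own construction: it assumes random mating, treats the two alleles sampled per person as independent, and uses the standard one-degree-of-freedom effect-size approximation N ≈ ((zα + zβ)/w)² rather than Long et al.'s exact method:

```python
def allele_disease_table(pi, gamma, p):
    """Expected frequencies of the four allele-by-disease categories in a
    random population sample, assuming random mating and multiplicative
    genotypic risks pi, gamma*pi, gamma**2*pi for aa, Aa, AA."""
    q = 1.0 - p
    risk_A = pi * gamma * (p * gamma + q)  # P(affected | sampled allele A)
    risk_a = pi * (p * gamma + q)          # P(affected | sampled allele a)
    return {("A", "D"): p * risk_A, ("A", "U"): p * (1 - risk_A),
            ("a", "D"): q * risk_a, ("a", "U"): q * (1 - risk_a)}

def sample_size(pi, gamma, p, z_alpha, z_beta):
    """Approximate number of individuals (2 alleles each) needed for the
    1-df chi-square test of allele-disease association."""
    cells = allele_disease_table(pi, gamma, p)
    # Null cell probabilities: products of the marginal frequencies
    pd = cells[("A", "D")] + cells[("a", "D")]   # disease prevalence
    pa = cells[("A", "D")] + cells[("A", "U")]   # allele A frequency
    null = {("A", "D"): pa * pd, ("A", "U"): pa * (1 - pd),
            ("a", "D"): (1 - pa) * pd, ("a", "U"): (1 - pa) * (1 - pd)}
    w2 = sum((cells[k] - null[k]) ** 2 / null[k] for k in cells)
    n_alleles = (z_alpha + z_beta) ** 2 / w2
    return n_alleles / 2.0
```

A quick check of internal consistency: the disease margin of the table equals the population prevalence π(pγ + q)², as required by the model.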

A realistic program for mapping disease markers using the random-ascertainment scheme may require a prospective design in which a cohort is fully genotyped and monitored for disease. The successes of the Framingham Study (7) and others like it show that large-scale prospective studies are not beyond reach. The effort required to genotype a large sample at many marker loci seems formidable, but the automated methods envisioned by Risch and Merikangas greatly reduce the labor for the random-ascertainment scheme as well. The utility of both approaches depends on the existence of marker alleles strongly associated with disease-causing polymorphisms, but as yet the nature and extent of such associations in the human genome are not well understood (8). Population structure (the result of admixture or other factors) introduces complications for simple disequilibrium methods that are minimized in family-based transmission-disequilibrium studies (9). However, if study populations are defined carefully and data are examined for the effects of population structure, these difficulties may be balanced by gains in efficiency that accrue when a single large sample is used to study several diseases.

REFERENCES AND NOTES


Risch and Merikangas (1) show the great power of genetic association studies such as the TDT in the detection of genes with modest effects. As they mention, all TDT computations were based on the optimal assumption that the analyzed allele was the disease allele itself. A more common situation is, and could well remain, the analysis of polymorphisms that have a low prior probability of being the disease allele even if they lie within the actual disease gene. The power of the TDT depends strongly not only on the linkage disequilibrium between the disease allele and the analyzed allele but also on the relative frequencies of these two alleles.

With the same genetic model as that used by Risch and Merikangas—a disease locus with two alleles, A and a, with population frequencies of p and 1-p, respectively, and a multiplicative model with genotypic relative risks of γ and γ2 for Aa and AA subjects, respectively—one can assume a closely linked diallelic marker (recombination fraction = 0) with alleles B and b of respective frequencies m and 1-m. The coefficient of linkage disequilibrium, δ, is defined as freq(AB)-pm, and the maximum value of δ, δmax, is reached when freq(AB) equals the lower of the two frequencies m and p. The probability that a heterozygous Bb subject carries A in coupling with B is α1 = p + δ/m, and the probability that the same subject carries A in coupling with b is α2 = p − δ/(1 − m) (2). In a sample of single affected individuals with their parents, the probability for a Bb parent to transmit B to an affected child is P(tr − B) = [1 + (γ − 1)α1]/[2 + (γ − 1)(α1 + α2)] (3). The situation described by Risch and Merikangas corresponds to complete linkage disequilibrium, that is, δ = δmax with m = p, with P(tr − B) reducing to γ/(1 + γ). In other cases, the number of necessary families increases dramatically as p differs from m even when δ = δmax, and also as δ decreases. Thus, the power of association studies such as the TDT can be quite high when there is a high probability that the allele studied is the causal allele, as shown by Risch and Merikangas. In other cases, researchers should be aware that the power of such association studies can be greatly diminished as soon as the ratio m/p departs from unity and the linkage disequilibrium becomes weaker.
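These expressions are easy to check numerically. A minimal sketch (the function name is ours) confirming that P(tr − B) reduces to γ/(1 + γ) under complete disequilibrium with m = p, and falls toward 1/2 when the frequencies are mismatched:

```python
def p_transmit_B(gamma, p, m, delta):
    """Probability that a Bb parent transmits marker allele B to an
    affected child, for a diallelic disease locus (allele A, frequency p,
    multiplicative risks gamma and gamma**2) completely linked to a
    diallelic marker (allele B, frequency m) with disequilibrium delta."""
    a1 = p + delta / m           # P(A in coupling with B | Bb parent)
    a2 = p - delta / (1.0 - m)   # P(A in coupling with b | Bb parent)
    return (1 + (gamma - 1) * a1) / (2 + (gamma - 1) * (a1 + a2))

gamma, p = 2.0, 0.3
# Complete LD with m = p: delta_max = p(1 - p), and P(tr-B) = gamma/(1+gamma)
print(p_transmit_B(gamma, p, p, p * (1 - p)))   # -> 0.666...
# Maximal delta but mismatched frequencies (m = 0.6): power is eroded
print(p_transmit_B(gamma, p, 0.6, 0.12))        # -> 0.6, closer to 1/2
```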

REFERENCES AND NOTES


Response

We agree with Scott et al. that linkage analysis will be able to identify genes of major, but not genes of modest, effect. As such, we also agree that linkage analysis should not be arbitrarily abandoned, because undoubtedly it will lead to the discovery of some important disease susceptibility genes. However, we do not agree that linkage analysis can detect most genes underlying complex diseases, and we anticipate that few genes for complex disorders will be identified in this fashion.

As indicated by Scott et al., one measure of the total genetic effect for a complex disease is λs, the sibling risk ratio (1). However, it is generally impossible to determine the number of loci contributing to that total; if the number is large, even for a large value of λs, then none of the loci may be easily detected by linkage analysis.

We showed in our Perspective (2) that loci which confer a genotypic relative risk γ less than 4 would be difficult or impossible to identify with current linkage strategies. The numbers of sib pairs required to detect linkage that were given in the table in our Perspective (2, p. 1516) were actually underestimates, for two reasons: (i) There was an error in the computer program producing the required number of sib pairs for linkage; the actual numbers are approximately 50% larger than given (3); and (ii) the numbers given correspond to the ideal case of completely informative markers and no recombination. Allowing for more realistic circumstances of reduced marker informativeness and moderate recombination, the corrected numbers probably would be about two to three times larger than given in the table. Thus, while it is still possible to detect a locus with γ of 4 or greater in a large family collection (say 500 or more), loci with smaller values of γ are unlikely to be detected.

How many loci are likely to exist for complex diseases with γ > 4? While it is difficult to know beforehand, animal models might offer a clue. As an example, the non-obese diabetic (NOD) mouse provides a useful model for human insulin dependent diabetes mellitus: it is genetically complex, has an autoimmune etiology, and shares the importance of the major histocompatibility loci. However, backcross experiments have shown that at least 10 other loci are probably involved in susceptibility, and only one of these loci had a value of γ greater than 4, with the rest in the range of 2 or less (4). We also note that animal backcross experiments are more analogous to human association studies than to linkage studies, which is why they have been more successful in identifying susceptibility loci than human linkage studies.

As indicated by Scott et al., multiple sclerosis is a complex disease with a presumed substantial genetic component (5). However, three recently published genome screens (6) of moderate size did not produce clear and replicable evidence of linkage in any chromosomal region. This lack of susceptibility loci of large effect in this disease suggests that a very large number of families may be required to detect linkage.

The discovery of apoE as a major risk factor for late onset Alzheimer's disease is surely one of the major success stories of modern human genetics. Thus, it is important to evaluate the means by which this discovery was made. As indicated by Scott et al., it has been estimated that apoE confers a λs value of around 2, with some modification for age of onset (7). Thus, in theory, this locus would be identifiable by linkage analysis with a sufficient number of sib pairs (several hundred minimum). In fact, the initial linkage observation on chromosome 19 (8), which produced a lod score of 4, was based on an analysis with markers that were likely to be in linkage disequilibrium with apoE. Performing linkage analysis with a marker associated with disease leads to an increase in the lod score (9). Similar linkage analysis with a nearby marker with little or no linkage disequilibrium (for example, the apo CII microsatellite) in the same material does not produce significant evidence for linkage (8, 10). Thus, in reality, the “linkage” discovery on chromosome 19 was actually based on an association between marker loci and the disease.

We agree with Scott et al. that genome-wide association studies will be based on future rather than current technology (as indicated in our title), and for the present we are still limited to the technology that exists. Although we agree that linkage studies should continue to be pursued, we also believe that this approach will produce only a modest number of loci for complex diseases.

We agree with Bell and Taylor that candidate genes are best tested in the framework of a biological hypothesis, often involving an interaction with a predisposing environmental agent, and the examples they provide are illuminating [for others, see (11)]. Also, as they point out, classic epidemiologic study designs, such as case-control or cohort, are excellent for testing such gene-environment interaction effects. The primary drawback from such designs for detecting genetic effects, however, is the potential for confounding, leading to an incorrect inference of causality for an observed association (12). Specifically, consider a population that has ethnic stratification and a tendency toward endogamy within strata. Further suppose these strata differ both in disease prevalence and allele frequencies at an unrelated locus. When performing a case-control study from such an admixed population, if the cases and controls are unbalanced for these strata, an allele frequency difference between cases and controls may emerge which is artifactual and not causal. The solution, of course, is to precisely match the cases and controls according to these strata, or to perform a stratified analysis; this would be possible for the major ethnic groups, such as those in the United States. However, further strata are likely to exist within the major ethnic groupings (for example, European subgroups of Caucasians) for which matching and stratification might generally be quite difficult. Of course, this problem disappears in a completely randomly mating population.

This problem can also be solved by resorting to family-based association tests, such as the TDT we used in our analysis. This test has been shown to be immune to confounding due to population stratification (13). Also, in the absence of population stratification, this test has similar power to the usual case-control design (14). Furthermore, cases or families (or both) can also be classified according to a relevant environmental exposure and allelic transmission compared across these classes to search for gene-environment interactions. We also showed that unless the disease predisposing allele frequency is high, families with more than one affected child can be substantially more powerful than singletons, although they are also likely to be more difficult to find.

Because of the potential problem of genetic stratification, the optimal design for searching for genes of modest effect, especially in the absence of a clear biological model, is the family-based design, such as singleton or multiple affected sibs with parents. For early onset diseases, such samples should not be difficult to obtain, and are likely worth the potential additional cost. We would add that precise ethnic matching in a case-control paradigm can also lead to increased expense, if achievable at all. In the situation of late onset diseases, where parents are usually unavailable, an alternative design is discordant sib pairs, where effectively an unaffected sib serves as a control for the affected sib. This design also protects against genetic stratification artifact, but may lead to somewhat reduced power because of the genetic correlation between sibs (14).

Long et al. suggest a prospective study design in which a random population sample is subsequently followed for development of disease. Presumably, at initiation, everyone in the study is genotyped for a large number of loci. They show that if the disease is sufficiently common, reasonable power is obtained by contrasting the allele frequencies in those who develop the disease with those in individuals who do not. The primary benefit of this approach is that multiple diseases can be studied using the same population of subjects, again provided the diseases are sufficiently common. It would appear that a minimum frequency of 10% is required to obtain plausible sample sizes for sufficient power.

There are also several drawbacks to this approach. First, as for the typical epidemiologic paradigms, such as case-control studies, there is the problem of population substructure as we have described (in our response to Bell and Taylor) and also mentioned by Long et al. Second, with this approach, sample pooling is not possible, because it is unknown a priori which individuals will become affected. Thus, this approach requires construction of individual genotypes, which can greatly magnify the technical effort. By contrast, for a typical case-control design, two pools can be formed—one for affected individuals, another for those unaffected, and overall allele frequencies within the two groups determined. Thus, for a study of n cases and n controls and t loci, genotypes for only 2t samples need to be determined as opposed to 2nt samples (15). The same efficiency may obtain for a family-based design, such as affected individuals and their parents, where those affected are pooled and contrasted to the pooled group of parents. While this approach cannot give the precise data needed for a TDT analysis, it still provides a robust, powerful, and efficient means for initial screening; any positive loci can subsequently be subjected to individual genotyping (14).

The approach of Long et al. would not be practical for rare diseases, for example, those with a population frequency less than 5%. However, a compromise is possible. Numerous studies already exist that sample affected individuals, with parents or unaffected sibs, for a variety of diseases. The subjects from these studies can be followed for a variety of other diseases and then subjected to analysis as they develop these other, more frequent diseases. Pooling across studies could then provide sufficient material.

As indicated by Müller-Myhsok and Abel, our analysis was based on association studies where the actual disease predisposing polymorphism is in hand. This is why we incorporated such a large number of tested alleles (1,000,000). We also indicated that the number of loci to be tested might be substantially reducible if one allows for linkage disequilibrium. However, as pointed out by Müller-Myhsok and Abel, depending on linkage disequilibrium is not without risk. The power of the association test can decline dramatically as linkage disequilibrium diminishes or if the tested allele has a substantially different frequency than the disease allele. To a large extent, the expectation with regard to linkage disequilibrium across the genome is uncharted territory, and thus it is difficult to predict the power of using a less dense map at this point in time. However, we can present two cases that provide some degree of optimism. The first pertains to apoE and late onset Alzheimer's disease. Several polymorphisms in the apoE region show strong linkage disequilibrium and comparable allele frequencies, allowing association to be readily detected with other neighboring polymorphisms (16). A second example is the insulin VNTR region of chromosome 11p. Several polymorphisms in this region have been identified showing strong disequilibrium and similar allele frequencies, leading to comparable degrees of association with disease (17).

As genome-wide linkage studies are supplanted by genome-wide association studies, and the distribution of linkage disequilibrium across chromosomes and populations is further explored, the degree to which linkage disequilibrium as opposed to direct causality can be utilized to locate disease susceptibility loci in the genome will become more apparent.

REFERENCES AND NOTES

