Harvesting Medical Information from the Human Family Tree

Science  18 Feb 2005:
Vol. 307, Issue 5712, pp. 1052-1053
DOI: 10.1126/science.1109682

A central goal of human genetics is to identify and understand causal links between variant forms of genes and disease risk in patients. To date, most progress has been made studying rare, Mendelian diseases in which a mutation in a single gene acts strictly in a deterministic manner, that is, the mutation causes the disease. The fact that such mutations strictly cosegregate with disease in families offers a shortcut to identifying the relevant chromosomal region, and means that the enrichment of mutations in patients with the disease compared with healthy controls can be convincingly documented in small numbers of individuals. In contrast, common diseases typically are caused by a complex combination of multiple genetic risk factors, environmental exposures, and behaviors. Because mutations involved in complex diseases act probabilistically—that is, the clinical outcome depends on many factors in addition to variation in the sequence of a single gene—the effect of any specific mutation is smaller. Thus, such effects can only be revealed by searching for variants that differ in frequency among large numbers of patients and controls drawn from the general population.

Limited knowledge about genetic variants in the human population, and the scarcity of technologies to score them accurately and at a reasonable cost, have been key impediments to performing this type of search. On page 1072 of this issue, Hinds et al. (1) describe dramatic progress toward overcoming these impediments. They describe a publicly available, genome-wide data set of 1.58 million common single-nucleotide polymorphisms (SNPs)—genome sequence sites where two alternative “spellings” exist in the population—that have been accurately genotyped in each of 71 people from three population samples. A second public data set of more than 1 million SNPs typed in each of 270 people has been generated by the International Haplotype Map (HapMap) Project (2). These two public data sets, combined with multiple new technologies for rapid and inexpensive SNP genotyping, are paving the way for comprehensive association studies involving common human genetic variations.

The rationale for genetic mapping by association to a dense map of common polymorphisms is based on two observations. The first is that most heterozygosity in the human population is due to a finite collection of common variants (on the order of 10 million and with a frequency exceeding 1%). The second is that nearby variants tend to correlate with one another in the population (known as linkage disequilibrium). Correlations among variants exist because when a mutation first arises, it does so on a single chromosome that carries a particular combination of alleles at flanking polymorphisms. Over time the mutation may spread to become common in a population, carrying with it the nearby flanking markers (see the figure). This correlation is eroded over the generations by recombination, just as in a pedigree study, except that the time scale may be thousands of generations, instead of one or two. In essence, genetic mapping with linkage disequilibrium treats the entire human population as a large family study with an unknown pedigree. The use of unrelated individuals makes it feasible to obtain sample sizes large enough to demonstrate modest relationships between genotype and phenotype through statistical associations.
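The correlation between nearby variants is conventionally quantified by the disequilibrium coefficient D and the squared correlation r². As a minimal sketch, both can be computed directly from phased two-locus haplotype counts (the counts below are hypothetical, chosen only to illustrate the arithmetic):

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Compute D and r^2 from phased two-locus haplotype counts.

    A/a are the alleles at locus 1, B/b at locus 2; n_AB is the number
    of chromosomes carrying haplotype A-B, and so on.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n               # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / n       # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / n       # frequency of allele B at locus 2
    D = p_AB - p_A * p_B          # departure from independence
    r2 = D * D / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return D, r2

# Hypothetical sample of 100 chromosomes
D, r2 = ld_stats(n_AB=40, n_Ab=10, n_aB=10, n_ab=40)
```

Under complete independence (linkage equilibrium) D and r² are both zero; recombination over many generations drives them toward zero for all but the most closely linked pairs of sites.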

Gene mapping in families and in populations.

(Left) The cotransmission of genes in families remains an important approach for the genetic mapping of human diseases. Pedigrees are collected, and genetic markers of many types (SNPs, microsatellites, insertions and deletions) are scored in each individual. Computer programs then calculate the probability that the pattern of transmission through the family is consistent with linkage of the disease and certain markers. (Right) For linkage disequilibrium mapping, the time scale is much longer, going back thousands of years. The diagram depicts a gene genealogy. At the top is an ancestral chromosome, with time flowing down the page, and the tips of the tree are individual chromosomes in the population today. Across a population sample, linkage is inferred if there is a statistical correlation (linkage disequilibrium) between the disease and a SNP marker. Numbers indicate mutations that generate SNP variations. A Mendelian disease is caused by mutation 2 (blue); all descendant chromosomes also carry mutation 1. Because recombination may occur over many generations, this correlation between variants is found only when the two are very close together (less than about 100 kb).

Patterns of linkage disequilibrium are shaped by the local recombination rate, genealogical history, and chance. In the human genome, recombination is highly variable (3) and often clusters in regions of local high intensity or “hotspots” (4). Moreover, the human population has expanded recently from a much smaller founder pool, experiencing bottlenecks as well as expansions in its history. These forces combine to make human SNP patterns simpler and broader, such that they extend over longer distances than would otherwise be the case. Nevertheless, because the typical span of linkage disequilibrium is from thousands to more than 100,000 base pairs, genetic maps of very high density are needed to use linkage disequilibrium for mapping genes. These goals motivated the creation of the public human SNP map, which today contains more than 8 million variants (5). Developing genotyping assays for large numbers of these variants, determining their frequencies in population samples, and establishing their patterns of correlation have been the goals of both Perlegen—a private company whose work is described in the Hinds et al. (1) paper—and of the International HapMap Project.

In their study, Hinds et al. describe the genotyping of 1.58 million SNPs in each of 71 individuals. Critically, the authors document that the data are highly accurate. They identified 157,000 SNPs and nine individuals in their own data set that had also been collected and released by the public HapMap Project (6). Comparing these overlapping genotypes, the authors show that both data sets are of exceptionally high accuracy: 99.6% of genotype calls were identical in the two independent studies. High-quality data are extremely important because the goal is to identify associations among variants. Errors cause both an underestimation of correlation and an overestimation of diversity.
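The cross-validation the authors performed amounts to counting agreements over the genotype calls made at (SNP, individual) positions typed in both studies. A minimal sketch of that comparison, using hypothetical SNP and sample identifiers:

```python
def concordance(calls_a, calls_b):
    """Fraction of genotype calls that agree between two studies.

    Each argument maps (snp_id, sample_id) -> genotype call, e.g. "AG".
    Only positions typed in both studies are compared; returns None
    if there is no overlap.
    """
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        return None
    matches = sum(calls_a[k] == calls_b[k] for k in shared)
    return matches / len(shared)

# Hypothetical overlapping calls from two independent studies
perlegen = {("rs1", "NA1"): "AA", ("rs2", "NA1"): "AG", ("rs3", "NA1"): "GG"}
hapmap = {("rs1", "NA1"): "AA", ("rs2", "NA1"): "AA", ("rs4", "NA1"): "CC"}
rate = concordance(perlegen, hapmap)  # 1 agreement of 2 shared calls
```

Applied to the roughly 157,000 SNPs and nine individuals shared between the two real data sets, this calculation yielded the 99.6% figure reported by the authors.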

The SNPs genotyped by Hinds et al. are distributed across the genome, but as with all methods, certain biases of experimental design have shaped the data collected. Two major biases arise in the Perlegen study: One is caused by a desire to study variants that have appreciable frequency in the population (also shared by the HapMap project); the other is a particular technical aspect of the oligonucleotide chips used by Perlegen that limits analysis to unique (nonrepetitive) DNA sequences. A critical question is how completely this subset of SNPs allows prediction of the larger set of all common variants.

To answer this question, the authors cleverly included in their study a set of DNA samples in which a large collection of genes had been resequenced by a project at the University of Washington called SeattleSNPs (7). Hinds and colleagues measured the fraction of variants in this more complete data set that could be highly correlated with Perlegen's less complete set of genetic markers. The results are encouraging: 73% of all common variants in the SeattleSNP genes showed a strong correlation (r2 > 0.8) with Perlegen's 1.58 million SNPs, and the mean correlation coefficient was 0.84. Moreover, the authors find that for future studies, an equivalent level of statistical power can be maintained by typing a selected set of just 300,000 SNPs in the samples with ancestry from Europe and Asia, and 500,000 SNPs in the African American sample. Because it is likely that the pairwise method of tagging used by the authors is conservative (8, 9), even fewer markers would be likely to achieve a similar power.
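Pairwise tagging of the kind used by the authors can be viewed as a greedy set cover: repeatedly choose the SNP that captures, at r² above the threshold, the largest number of not-yet-captured variants. The sketch below is an illustrative simplification, not Perlegen's actual procedure, and the r² values and SNP names are invented:

```python
def select_tag_snps(r2, threshold=0.8):
    """Greedy pairwise tag-SNP selection (illustrative simplification).

    r2 maps frozenset({snp_a, snp_b}) -> pairwise r^2. Returns a list
    of tag SNPs such that every SNP either is a tag or has
    r^2 >= threshold with some tag.
    """
    snps = {s for pair in r2 for s in pair}

    def covers(tag, snp):
        return tag == snp or r2.get(frozenset({tag, snp}), 0.0) >= threshold

    tags, uncovered = [], set(snps)
    while uncovered:
        # Pick the SNP capturing the most still-uncaptured variants
        best = max(snps, key=lambda t: sum(covers(t, s) for s in uncovered))
        tags.append(best)
        uncovered -= {s for s in uncovered if covers(best, s)}
    return tags

# Hypothetical r^2 matrix: rs1 tags rs2; rs3 is uncorrelated with both
r2 = {frozenset({"rs1", "rs2"}): 0.9,
      frozenset({"rs1", "rs3"}): 0.1,
      frozenset({"rs2", "rs3"}): 0.05}
tags = select_tag_snps(r2)  # two tags suffice for three SNPs
```

Because this pairwise criterion ignores multimarker combinations that can jointly predict an untyped SNP, it tends to select more tags than strictly necessary, which is why the authors' estimates of 300,000 to 500,000 tag SNPs are likely conservative.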

In addition to the potential utility for disease research, such data are an excellent resource for population and evolutionary geneticists. Of particular interest is the inference of past natural selection. For example, if a mutation with a strong positive effect is “swept” to fixation, it leaves a footprint of low diversity and a skewed spectrum of allele frequencies nearby (10). Methods that incorporate information on this genomic scale are being devised to find these selective footprints. The hope is, of course, to locate genes that have evolved under positive selection in the recent history of humans, presumably because those changes were required for local adaptation to different environments. Some of these changes may cause differences in susceptibility to modern diseases in today's human populations.
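One classic signature of such a sweep is that nucleotide diversity (π, the mean number of pairwise differences) falls below Watterson's θ, which depends only on the count of segregating sites; the contrast between the two quantities underlies test statistics such as Tajima's D. A toy sketch of both estimators on a hypothetical alignment (real analyses operate on genome-scale data and apply a variance normalization omitted here):

```python
from itertools import combinations

def diversity_stats(seqs):
    """Nucleotide diversity (pi) and Watterson's theta for an alignment.

    seqs is a list of equal-length sequence strings, one per sampled
    chromosome. A recent selective sweep depresses pi relative to
    Watterson's theta in the surrounding region.
    """
    n = len(seqs)
    # pi: average pairwise differences over all n*(n-1)/2 pairs
    total_diffs = sum(sum(a != b for a, b in zip(s1, s2))
                      for s1, s2 in combinations(seqs, 2))
    pi = total_diffs / (n * (n - 1) / 2)
    # Watterson's theta: segregating sites / harmonic number a_n
    S = sum(len(set(column)) > 1 for column in zip(*seqs))
    a_n = sum(1 / i for i in range(1, n))
    return pi, S / a_n

# Toy alignment of four chromosomes, one segregating site
pi, theta_w = diversity_stats(["AAT", "AAA", "AAT", "AAA"])
```

Scanning windows of the genome for regions where π is unusually low relative to θ (and where allele frequencies are skewed toward rare variants) is one way such selective footprints are sought.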

Although the data described by Hinds et al. represent a major step forward, much more is needed to develop the resources for comprehensive genetic association studies. As the number and density of SNPs typed in reference samples become more complete, the power and efficiency of the markers selected will rise. In this regard, it is exciting that Perlegen and the public HapMap Project are now working together to generate an even denser map for the 270 HapMap samples. Integrating these SNP data with duplication, deletion, and inversion polymorphisms (11, 12) will be required to fully capture all common sequence variations. It will be important to document how well allele frequencies and patterns of linkage disequilibrium observed in the 71 samples studied by Perlegen, and the 270 samples studied by the HapMap Project, will project over disease cohorts collected across the globe. Collecting data on diet, exercise, and relevant environmental exposures in long-term studies is key if we are ever to understand the confounding roles of genes and environment in influencing disease risk. Although there are many promising technologies for collecting genotype data, there is an acute need for improved methods to analyze these data for association with disease and to achieve robust results (13).

Ultimately, a complete description of each disease will require finding all variants, common and rare, and understanding their interactions with one another, with environmental exposures, and with multiple disease phenotypes. Association studies with common variants represent a screening method to find the most prevalent genetic risk factors. Although our population clearly contains common allelic variants that contribute to disease, the ultimate explanatory power of this approach depends critically on the unknown frequency distribution of genetic variants that contribute to disease risk, and on the magnitude of the effect of each allelic variant. There may be diseases for which there are no common alleles, presumably because the mutations that occurred long ago have been lost due to purifying selection, leaving only the more recent, rare mutations in the population. In such cases, because it is so hard to demonstrate association with rare variants, even direct resequencing data may be difficult to interpret. Where effects of common alleles are particularly weak, or if they are entangled in complex interactions with other genes and environmental factors, all methods will have correspondingly lower power. Suppose, for example, that a disease had the genetic architecture of the oil content of corn, where at least 50 genes, all of small effect, have been found to influence the trait (14, 15). Such a disease would demand an enormous amount of resources and yield little predictive information of use to public health—although the biological insights could still be of tremendous value. In short, we need to pick the targets for these approaches judiciously, and to modify the approach in light of what is learned.

Although population genetic theory has played a vital role in shaping our thinking about these problems (16), ultimately the contribution of common and rare variants in complex disorders is an empirical question that will only be answered by collecting data on an adequate scale. It is exciting to live in a time when the necessary tools are becoming available so that we can stop debating the hypothetical, and turn our attention to what we can learn from the data about real human diseases.

