Genetic Association by Whole-Genome Analysis?

Science  23 Nov 2001:
Vol. 294, Issue 5547, pp. 1669-1670
DOI: 10.1126/science.1066921

Geneticists have long dreamed of determining the genetic basis of disease susceptibility by comparing variations in the human genome sequences of a large number of individuals. But it has been considered an impossible dream because of the technical difficulties involved in obtaining human genome sequence data. For example, it took an international team consisting of hundreds of scientists many years just to produce a “working draft” DNA sequence of a reference human genome (1, 2). Furthermore, humans are diploid organisms containing two genomes in each nucleated cell, making it very hard to determine the DNA sequence of the haploid genome. Yet, armed with the complete DNA sequence of one of our smallest chromosomes, human chromosome 21, scientists at Perlegen Sciences (a subsidiary of Affymetrix Inc.) have undertaken a pilot study to demonstrate that this dream is within reach. On page 1719 of this issue, Patil et al. (3) report their scan of some 21.7 million base pairs of unique (nonrepetitive) DNA sequence in human chromosome 21.

These investigators set out to identify all sequence variations, called single nucleotide polymorphisms (SNPs), in human chromosome 21 and to group them into blocks called haplotypes. First, they established human-rodent hybrid cell lines, each containing one copy of human chromosome 21 from a different individual. Then they performed long-range polymerase chain reaction (LR-PCR) to amplify the regions containing unique DNA sequences. Finally, they obtained the complete DNA sequences of all of the copies of human chromosome 21 using high-density oligonucleotide arrays. The Perlegen team scanned 20 different copies of human chromosome 21 across the unique two-thirds of the chromosome, and an additional 19 copies of human chromosome 21 across one-eighth of the unique DNA sequence. The authors conclude that just three common haplotypes can describe variations among 80% of the human population, a far smaller haplotype number than previously thought.

Although only 65% of all the bases on the microarrays yielded high-quality data, the 20 sets of data each containing 14 million base pairs still constitute one of the largest sequence comparison studies ever. For the most part, the results agree with other large-scale sequence comparisons in terms of the rate of discovery of SNPs, the unpredictable pattern of haplotype structure across the chromosome, and the lack of haplotype diversity along much of the chromosome. The haplotype patterns observed across such a large span of DNA are certainly interesting, but broad conclusions about haplotype structures within the human genome cannot be made with a high degree of certainty because the number of chromosomes analyzed is still quite small.
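The figure of 14 million base pairs per data set follows from the numbers quoted above: 65% of the 21.7 million unique base pairs scanned. A minimal Python sketch of that arithmetic is given here; the variable and function names are illustrative only and do not come from Patil et al.

```python
# Figures quoted in the text; names are illustrative, not from the study.
UNIQUE_BASES = 21_700_000       # unique chromosome 21 sequence scanned
HIGH_QUALITY_FRACTION = 0.65    # share of array bases giving high-quality data
FULL_COPIES = 20                # copies scanned across the full unique sequence


def usable_bases(unique_bases: int, fraction: float) -> float:
    """Bases per chromosome copy expected to yield high-quality data."""
    return unique_bases * fraction


per_copy = usable_bases(UNIQUE_BASES, HIGH_QUALITY_FRACTION)
total = per_copy * FULL_COPIES

print(f"{per_copy / 1e6:.1f} million usable bases per copy")
print(f"{total / 1e6:.0f} million usable bases across {FULL_COPIES} copies")
```

The per-copy value rounds to the 14 million quoted in the text; the total across 20 copies is simply that value multiplied out and ignores the additional 19 partially scanned copies.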

Although human geneticists and population geneticists will continue to debate the merits of the conclusions of this study, all will agree that it marks a dramatic shift in strategic thinking. Traditionally, one looked at a limited set of markers (even at 1 million SNPs, one is still looking at just 0.03% of the human genome) in a genetic association study where the genetic makeup of individuals with a disease is compared with that of healthy individuals. Now, one can aspire to analyze all of the unique DNA sequences in the genome simultaneously.
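The 0.03% figure can be checked with a short calculation. The sketch below assumes a haploid genome of roughly 3.2 billion base pairs; that size is an assumption introduced here for illustration and is not stated in the text.

```python
# ASSUMPTION: haploid human genome size of about 3.2 billion base pairs.
GENOME_SIZE_BP = 3_200_000_000
MARKER_COUNT = 1_000_000  # "even at 1 million SNPs", from the text


def marker_coverage_percent(markers: int, genome_size_bp: int) -> float:
    """Percentage of genome positions examined when each marker is one base."""
    return 100.0 * markers / genome_size_bp


coverage = marker_coverage_percent(MARKER_COUNT, GENOME_SIZE_BP)
print(f"{coverage:.2f}% of the genome is examined")
```

Under this assumption the result rounds to 0.03%, in line with the figure given above; a slightly different genome size changes the third decimal place but not the conclusion.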

To ensure technical success, Patil et al. wisely adopted the best features of several previous approaches. First, they realized that haplotype information had been helpful in genetic association studies, and so they decided to physically separate the two homologous copies of human chromosome 21 up front. Although this was quite laborious, the resulting DNA samples yielded not only haplotype data but also data that were much easier to analyze on high-density oligonucleotide microarrays. Second, they avoided amplification of background rodent DNA (from the somatic cell hybrids) by designing LR-PCR assays with uniquely human PCR primers. If they had amplified human DNA sequences in short PCR assays, some amplification of the rodent background would have been unavoidable. Third, high-density oligonucleotide microarrays are most effective when the experiment requires only a small number of reactions (exemplified by gene expression studies where one RNA preparation is used to study the global expression pattern of a particular cell type). The authors followed the lead of previous groups (4, 5) and used LR-PCR products as the DNA targets in their experiments. This made it possible to have ∼400 LR-PCR products for each hybridization experiment and thus to interrogate roughly 4 million bases simultaneously.

Patil et al. emphasize the “common haplotype structure” of the human genome. Their whole-genome scanning approach has defined and produced a dense set of SNPs that then have been used to select the most common haplotypes of the human population. Instead of endorsing this strategy, I suggest that we adopt the new thinking provoked by this study and work toward comparing whole human genomes when performing genetic association studies. To achieve this, a number of improvements will be needed. We must be able to convert diploid cells to haploid cell lines readily so that even very large population studies are possible. Likewise, the generation of DNA targets needs to be accomplished with much less effort. Patil and colleagues performed 3253 LR-PCR assays to scan the unique sequence of 1% of the human genome. Clearly, performing 325,300 LR-PCR assays is not practical. Perhaps a whole-genome amplification strategy would solve the problem. Resequencing by hybridization has obvious limitations. For example, duplicated sequence motifs cannot be analyzed by hybridization on microarrays. Similarly, certain sequence contexts will always yield low-quality signals. Some of the low-quality data points could be salvaged in association studies where the focus is not so much on the correct call of a particular base in the sequence, but instead on pattern alterations when the hybridization signals from two genomes are compared. The pattern changes could then be followed up using other types of analyses. Finally, it goes without saying that analytical tools and algorithms capable of analyzing data generated from whole genomes must be developed to handle the comparisons. It is no easy task to compare two genomes from each of 1000 patients with a particular disease against those from 1000 normal controls in order to identify the genetic factors associated with a disorder.
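The scale of the problem can be made concrete with the numbers already quoted in this article. The following Python sketch is a rough extrapolation only: it assumes that assay counts scale linearly with the fraction of the genome covered and that each hybridization takes about 400 LR-PCR products, as in the chromosome 21 experiments. All names are illustrative.

```python
import math

# Figures quoted in the article; linear scaling is an ASSUMPTION.
ASSAYS_CHR21 = 3253          # LR-PCR assays for the unique sequence scanned
PERCENT_OF_GENOME = 1        # share of the genome those assays covered
PRODUCTS_PER_HYB = 400       # approximate LR-PCR products per hybridization
PATIENTS = 1000
CONTROLS = 1000
GENOMES_PER_PERSON = 2       # humans are diploid

# Assays needed for one haploid genome, scaled up from 1% to 100%.
assays_per_genome = ASSAYS_CHR21 * 100 // PERCENT_OF_GENOME

# Hybridizations per haploid genome at ~400 products each.
hybs_per_genome = math.ceil(assays_per_genome / PRODUCTS_PER_HYB)

# Haploid genomes in a 1000-patient, 1000-control association study.
haploid_genomes = (PATIENTS + CONTROLS) * GENOMES_PER_PERSON
assays_for_study = assays_per_genome * haploid_genomes

print(f"{assays_per_genome:,} LR-PCR assays per haploid genome")
print(f"{hybs_per_genome:,} hybridizations per haploid genome")
print(f"{assays_for_study:,} LR-PCR assays for {haploid_genomes:,} haploid genomes")
```

The per-genome count reproduces the 325,300 assays mentioned above, and multiplying by the 4000 haploid genomes of such a study shows why a whole-genome amplification strategy, or some other shortcut, would be needed.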

Despite the obstacles, Patil et al. have shown us that when one sets out to achieve the almost impossible and acts on that ambition, one comes a step closer to realizing the dream.

