Report

Targeted Investigation of the Neandertal Genome by Array-Based Sequence Capture

Science  07 May 2010:
Vol. 328, Issue 5979, pp. 723-725
DOI: 10.1126/science.1188046

Abstract

It is now possible to perform whole-genome shotgun sequencing as well as capture of specific genomic regions for extinct organisms. However, targeted resequencing of large parts of nuclear genomes has yet to be demonstrated for ancient DNA. Here we show that hybridization capture on microarrays can successfully recover more than a megabase of target regions from Neandertal DNA even in the presence of ~99.8% microbial DNA. Using this approach, we have sequenced ~14,000 protein-coding positions inferred to have changed on the human lineage since the last common ancestor shared with chimpanzees. By generating the sequence of one Neandertal and 50 present-day humans at these positions, we have identified 88 amino acid substitutions that have become fixed in humans since our divergence from the Neandertals.

The fossil record provides a rough chronological overview of the major phenotypic changes during human evolution. However, the underlying genetic bases for most of these events remain elusive. This is partly because it is not known when most human-specific genetic changes, identified from genome comparisons to living relatives, occurred during the ~6.5 million years since the separation of the human and chimpanzee evolutionary lineages. However, shotgun sequencing of the Neandertal, a human form whose ancestors split from modern human ancestors 270,000 to 440,000 years ago, has been performed to ~1.3-fold coverage of the entire genome (1). Comparison of Neandertal and present-day human genomes can reveal information about whether genetic changes occurred before or after the ancestral population split of modern humans and Neandertals. However, low-coverage whole-genome shotgun sequencing inevitably leaves a substantial proportion of the genome uncovered. Although deeper shotgun sequencing of one or a few individuals may produce higher coverage across the whole genome, simple shotgun approaches cannot economically retrieve specific loci from multiple individuals, both due to the size of the mammalian genome per se and to the very high proportion (up to 99.9%) of microbial DNA in the vast majority of ancient tissue remains, with the exception of some instances of preservation in permafrost (2, 3). Primer extension capture can isolate specific DNA sequences from multiple Neandertal individuals (4). However, although useful for capture of small target regions such as mitochondrial DNA (mtDNA) (4, 5), this method is unlikely to be scalable up to megabase target regions, ruling out experiments such as the retrieval of exomes, large chromosomal regions, or validation of sites of interest identified in the low-coverage shotgun genome data.

Because microarrays can carry hundreds of thousands of probes, we investigated the use of massively parallel hybridization capture on glass slide microarrays (6, 7) on Neandertal DNA at thousands of genomic positions where nucleotide substitutions changing amino acids (nonsynonymous substitutions) have occurred on the human lineage since its split from chimpanzees. For any substitution that is fixed, i.e., occurs in all present-day humans, it is currently impossible to judge how long ago either the original mutation or the subsequent fixation event occurred. However, by ascertaining the Neandertal state at these positions, we can separate fixed substitutions into two classes: (i) sites where a Neandertal carries the derived state, which indicates that the substitution must have occurred before the population split of modern humans and Neandertals; and (ii) sites where a Neandertal is ancestral, which indicates that fixation of a substitution in modern humans occurred after the population split with Neandertals (Fig. 1A).

Fig. 1

(A) Identification of protein-coding changes that are likely to have become fixed recently (red bar) in modern humans after the population split from Neandertals. Such positions would be derived in all present-day humans but ancestral in the Neandertal. (B) Distribution of Neandertal coverage for ~14,000 amino acid substitution sites found in the human genome by comparison to primate outgroups. The same sites were also sequenced in 50 present-day humans. Of these, 88 were found to be fixed derived in present-day humans and ancestral in Neandertal, representing recently fixed protein-coding changes in the human genome.

To identify substitutions that occurred on the human lineage since the ancestral split from chimpanzees, we aligned human, chimpanzee, and orangutan protein sequences for all orthologous proteins in HomoloGene (8, 9). Comparison of these three species allowed us to assign human/chimpanzee differences to their respective evolutionary lineages. We designed a 1 million-feature Agilent oligonucleotide array covering, at 3–base pair tiling, all 13,841 nonsynonymous substitutions inferred to have occurred on the human lineage (9). We used this array to capture DNA from a ~49,000-year-old Neandertal bone (Sidrón 1253) from El Sidrón Cave, Spain (10, 11). This bone contains a high amount of Neandertal DNA in absolute terms, but also a high proportion (99.8%) of microbial DNA (4), making it unsuitable for shotgun sequencing. To identify which of the 13,841 substitutions are fixed in present-day humans, we also collected data from 50 individuals from the Human Genome Diversity Panel (12) with the same array design as used for the El Sidrón Neandertal (table S1). The DNA libraries from these individuals were barcoded, pooled, and captured on a single array (13). All captured products were sequenced on the Illumina GAII platform and aligned to the human genome (9). Overall, 37% of the Neandertal sequence reads aligned to the target regions, representing ~190,000-fold target enrichment. We retrieved Neandertal sequence for 13,250 (96%) of the substitutions targeted on the array, with an average coverage of 4.8-fold after filtering for polymerase chain reaction (PCR) duplicates (Fig. 1B). We considered a Neandertal position ancestral if all overlapping reads matched the chimpanzee state, and derived if all reads carried the modern human state or if we found a mixture of derived and third-state reads. We disregarded positions that carried only a third state and positions where Neandertal reads were found in both the ancestral and the derived state.
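The two decision rules in this paragraph, lineage assignment with an outgroup and the Neandertal state call, can be sketched in a few lines of Python. This is an illustrative reading of the text, not the authors' pipeline: the function names and return labels are hypothetical, states are treated as single characters (amino acids or bases), and the treatment of a mixture of ancestral and third-state reads as "disregarded" is an assumption, since the text does not address that case explicitly.

```python
from typing import Optional, Sequence


def assign_lineage(human: str, chimp: str, orangutan: str) -> str:
    """Assign a human/chimpanzee difference to a lineage by parsimony,
    using orangutan as the outgroup (hypothetical helper)."""
    if human == chimp:
        return "no_difference"
    if chimp == orangutan:
        return "human_lineage"       # human carries the derived state
    if human == orangutan:
        return "chimpanzee_lineage"  # chimpanzee carries the derived state
    return "unresolved"              # all three species differ


def call_neandertal_state(
    reads: Sequence[str], ancestral: str, derived: str
) -> Optional[str]:
    """Return 'ancestral', 'derived', or None if the position is disregarded."""
    if not reads:
        return None
    has_ancestral = any(state == ancestral for state in reads)
    has_derived = any(state == derived for state in reads)
    if has_ancestral and has_derived:
        return None          # conflicting ancestral and derived reads
    if has_derived:
        return "derived"     # all derived, or derived mixed with a third state
    if all(state == ancestral for state in reads):
        return "ancestral"   # every overlapping read matches chimpanzee
    return None              # only a third state (or, by assumption, ancestral + third state)


print(assign_lineage("A", "G", "G"))                   # human_lineage
print(call_neandertal_state(["A", "T"], "G", "A"))     # derived
print(call_neandertal_state(["A", "G"], "G", "A"))     # None
```

Requiring unanimity for an ancestral call is conservative: with an average coverage below 5-fold, a single conflicting read is enough to set a position aside rather than risk a wrong call.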

From each present-day individual, a total of 25% (23 to 27%) of reads aligned to the target regions. In each individual, we retrieved on average 98% (97 to 99%) of targeted positions, with an average coverage of 10-fold (fig. S1). We estimated genotypes for each individual and considered a position to be fixed derived if it was homozygous and derived in all humans observed, and if data were available for at least 25 individuals (50 chromosomes) (9).
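The fixed-derived criterion for present-day humans can be written as a short predicate. The sketch below is a minimal illustration in Python: the genotype representation (a pair of alleles, or None when no call was made) and the function name are assumptions, and genotype estimation itself is not shown.

```python
from typing import Optional, Sequence, Tuple

# None means no genotype could be called for that individual (assumed encoding).
Genotype = Optional[Tuple[str, str]]


def is_fixed_derived(
    genotypes: Sequence[Genotype], derived: str, min_individuals: int = 25
) -> bool:
    """True if every called individual is homozygous derived and at least
    `min_individuals` individuals (50 chromosomes by default) have data."""
    called = [g for g in genotypes if g is not None]
    if len(called) < min_individuals:
        return False
    return all(a == derived and b == derived for a, b in called)


# 30 homozygous-derived individuals and 20 without data: passes the threshold.
print(is_fixed_derived([("A", "A")] * 30 + [None] * 20, "A"))   # True
# A single heterozygote among 50 individuals is enough to reject fixation.
print(is_fixed_derived([("A", "A")] * 49 + [("A", "G")], "A"))  # False
```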

We included several additional target regions on the array to assess levels of human DNA contamination, which can frequently affect ancient DNA experiments (14). One such region was the complete human mtDNA, which is known to differ between the Sidrón 1253 Neandertal analyzed here and almost all (99%) present-day humans at 130 positions (4). Even though the array probes were designed to match present-day human mtDNA, 253,549 of the 254,296 (99.71%) fragments that overlapped these 130 positions matched the Neandertal state. We therefore conclude that the vast majority of mtDNA in the Sidrón 1253 library is of Neandertal origin.
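The percentage quoted above follows directly from the two fragment counts. The short Python check below reproduces it; the variable names are illustrative. Note that fragments not matching the Neandertal state may reflect sequencing error as well as contamination, so the non-matching fraction is best read as a rough ceiling rather than a contamination estimate.

```python
matching = 253_549  # fragments matching the Neandertal state (from the text)
total = 254_296     # fragments overlapping the 130 diagnostic positions (from the text)

neandertal_fraction = matching / total
non_matching = total - matching

print(f"Neandertal-state fragments: {neandertal_fraction:.2%}")                 # 99.71%
print(f"Non-matching fragments: {non_matching} ({non_matching / total:.2%})")   # 747 (0.29%)
```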

For a more direct estimate of contamination in the nuclear DNA, we used 46 nucleotide sites on the X chromosome that differ between present-day humans and chimpanzees and that were found to be ancestral in a Neandertal from Croatia (Vindija 33.16) by shotgun sequencing (1), whereas ~1000 present-day humans in the human diversity panel carry a derived state. The Sidrón 1253 individual will obviously not match Vindija 33.16 at all of these sites. However, because Sidrón 1253 is a male (15) and thus carried a single X chromosome, at sites where he does match Vindija 33.16 all reads should carry the ancestral base, and apparent heterozygosity will indicate human DNA contamination. By analyzing the consistency of reads overlapping these sites on the X chromosome, we calculated a maximum likelihood estimate of X-chromosomal contamination of 4%, although confidence intervals are large (1 to 12%) due to the small number of relevant positions (9).
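The logic of the X-chromosome estimate can be illustrated with a deliberately simplified likelihood. The Python sketch below assumes that the male is truly ancestral at every site used, that contaminating human DNA always carries the derived base, and that sequencing error is negligible, so each read is derived with probability equal to the contamination rate c. The actual model is described in the supporting material (9) and may differ. The read counts in the example are hypothetical, not the paper's data, and the interval uses a standard likelihood-ratio cutoff (3.84, the 95% point of a chi-square with one degree of freedom).

```python
import math


def log_likelihood(c: float, derived_reads: int, total_reads: int) -> float:
    """Binomial log-likelihood of the contamination rate c (constant term dropped)."""
    c = min(max(c, 1e-9), 1 - 1e-9)  # keep log() finite at the grid edges
    return derived_reads * math.log(c) + (total_reads - derived_reads) * math.log(1 - c)


def estimate_contamination(derived_reads: int, total_reads: int, steps: int = 1000):
    """Grid-search MLE of c with an approximate 95% likelihood-ratio interval."""
    grid = [i / steps for i in range(steps + 1)]
    lls = [log_likelihood(c, derived_reads, total_reads) for c in grid]
    best = max(lls)
    mle = grid[lls.index(best)]
    inside = [c for c, ll in zip(grid, lls) if 2 * (best - ll) <= 3.84]
    return mle, min(inside), max(inside)


# Hypothetical counts: 4 derived reads out of 100 at hemizygous ancestral sites.
mle, low, high = estimate_contamination(derived_reads=4, total_reads=100)
print(f"MLE = {mle:.3f}, approx. 95% interval = ({low:.3f}, {high:.3f})")
```

With few informative reads the likelihood surface is flat near its maximum, which is why an estimate from a few dozen sites carries a wide interval.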

Another way to estimate contamination across autosomes is to investigate patterns of allele counts. Because at every site an individual is either homozygous derived, homozygous ancestral, or heterozygous, DNA from a single individual will yield at each site either only derived alleles, only ancestral alleles, or a draw with equal chance for either. Contamination from other individuals would cause systematic deviation from these patterns. We thus produced a likelihood model that estimated contamination at the positions recovered from Sidrón 1253, and calculated a 95% upper bound for contamination of 2% (9). From these results we conclude that the Sidrón 1253 data are not substantially affected by human DNA contamination.
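The allele-count argument can be sketched as a small mixture likelihood. In the Python sketch below, each site has an unknown genotype whose expected derived-read fraction is 0, 0.5, or 1, and contamination at rate c shifts that fraction toward the contaminant's derived-allele frequency f. Everything beyond the three-genotype idea is an assumption made for illustration: the equal genotype priors, f = 1, the sequencing error rate, the grid search, and the one-sided likelihood-ratio cutoff of 2.71. The authors' model is given in the supporting material (9), and the site counts used here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

GENOTYPE_DERIVED_FRACTION = (0.0, 0.5, 1.0)  # hom. ancestral, heterozygous, hom. derived


def site_likelihood(d: int, n: int, c: float, f: float = 1.0,
                    error: float = 0.001,
                    priors: Sequence[float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Likelihood of d derived reads out of n, summed over the three genotypes."""
    total = 0.0
    for g, prior in zip(GENOTYPE_DERIVED_FRACTION, priors):
        p = (1 - c) * g + c * f                # endogenous reads plus contaminant reads
        p = p * (1 - error) + (1 - p) * error  # symmetric sequencing error (assumed)
        total += prior * math.comb(n, d) * p ** d * (1 - p) ** (n - d)
    return total


def log_likelihood(sites: List[Tuple[int, int]], c: float) -> float:
    return sum(math.log(site_likelihood(d, n, c)) for d, n in sites)


def contamination_upper_bound(sites: List[Tuple[int, int]],
                              steps: int = 200, max_c: float = 0.5):
    """Grid-search MLE of c and an approximate one-sided 95% upper bound."""
    grid = [max_c * i / steps for i in range(steps + 1)]
    lls = [log_likelihood(sites, c) for c in grid]
    best = max(lls)
    mle = grid[lls.index(best)]
    upper = max(c for c, ll in zip(grid, lls) if 2 * (best - ll) <= 2.71)
    return mle, upper


# Hypothetical (derived, total) read counts for an uncontaminated individual.
sites = [(0, 5)] * 40 + [(5, 5)] * 40 + [(2, 5), (3, 5)] * 10
mle, upper = contamination_upper_bound(sites)
print(f"MLE = {mle:.4f}, approx. 95% upper bound = {upper:.4f}")
```

In this toy setup the information comes mainly from sites with no derived reads at all: under contamination by derived alleles, a homozygous ancestral site would be expected to show occasional derived reads, and their absence pushes the estimate toward zero.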

In total, we determined with high confidence the Neandertal and present-day human state for 10,952 nonsynonymous substitutions. In 10,015 (91.5%) of all cases the Neandertal carries the derived state, whereas in 937 (8.5%) cases the ancestral state was found (fig. S2). Of the positions that are fixed in the derived state in present-day humans, 9525 (87%) are derived in Neandertal, whereas 88 (0.8%) (table S2) are ancestral (fig. S2). In agreement with previous results generated by PCR (15), two substitutions that change amino acids in the gene FOXP2 (16), involved in speech and language (17), are both derived in this Neandertal individual.

The 88 recently fixed substitutions occur in 83 genes (tables S2 and S3). We asked if these genes cluster in any group of functionally related genes relative to the genes that were targeted in the capture array (18) (as defined in the Gene Ontology) but found no such groups. We furthermore asked if the 88 substitutions that recently became fixed in humans differ from those that occurred before the divergence from the Neandertal with respect to how evolutionarily conserved the positions in the encoded proteins are (9, 19) (Fig. 2). We found that the 88 recent substitutions tend to affect amino acid positions that are more conserved than the older substitutions (Wilcoxon rank test; P = 0.014). Similarly, the recently fixed substitutions caused more radical amino acid changes with respect to the chemical properties of the amino acids (Wilcoxon rank test; P = 0.04). One possible explanation for these observations is that the effective population size of humans since their separation from the Neandertal lineage has been small, leading to a reduced efficiency of purifying selection, as seen, e.g., in Europeans (20). We also looked for evidence that the recent substitutions may have been fixed by positive selection. One recent substitution occurred in SCML1, a gene involved in spermatogenesis (21) that has been previously proposed as a target of positive selection in humans (22) as well as frequent positive selection in primates (23). However, we found no significant overrepresentation of the 83 genes among candidate genes in three genome-wide scans for positive selection (24) (table S4). Nevertheless, we believe that all of these amino acid substitutions warrant functional studies.
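A comparison of conservation scores between two groups of substitutions, as in the Wilcoxon tests quoted above, can be sketched with a rank-sum test. The Python sketch below implements the two-sided Wilcoxon rank-sum (Mann-Whitney U) test with a normal approximation and tie correction, without continuity correction. Whether this matches the exact variant the authors used is an assumption, and the scores in the example are hypothetical placeholders, not GERP values from the study.

```python
import math
from typing import Sequence, Tuple


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided rank-sum test. Returns (U statistic for x, approximate p-value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j for a tied run
        for k in range(i, j):
            ranks[k] = average_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence of a shift
    z = (u_x - mean_u) / math.sqrt(var_u)
    return u_x, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical conservation scores for recent vs. older substitutions.
recent = [5.0, 6.0, 7.0, 8.0, 9.0]
older = [1.0, 2.0, 3.0, 4.0]
u, p = rank_sum_test(recent, older)
print(f"U = {u}, p = {p:.4f}")
```

A rank-based test suits this kind of comparison because conservation scores need not be normally distributed and the two groups differ greatly in size.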

Fig. 2

Evolutionary conservation at positions affected by substitutions that are fixed in present-day humans. For each bin of GERP (Genomic Evolutionary Rate Profiling) conservation scores, the fractions of all positions where the Neandertal carries derived (blue) and ancestral (red) alleles, respectively, are given. Error bars are 95% binomial confidence intervals.

Our results demonstrate that hybridization capture arrays can generate data from genomic target regions of megabase size from ancient DNA samples, even when only ~0.2% of the DNA in a sample stems from the endogenous genome. By generating an average coverage of 4- to 5-fold, errors from sequencing and small amounts of human DNA contamination can be minimized. A further approximately 5-fold reduction of errors was achieved here by the enzymatic removal of uracil residues that are frequent in ancient DNA (25). Because the Sidrón 1253 Neandertal library used for this study has been amplified and effectively immortalized, the same library should be able to provide similar-quality data for any other genomic target region, or even the entire single-copy fraction of the Neandertal genome.

Supporting Online Material

www.sciencemag.org/cgi/content/full/328/5979/723/DC1

Materials and Methods

Figs. S1 to S4

Tables S1 to S5

References and Notes

  1. Materials and methods are available as supporting material on Science Online.
  2. We thank C. S. Burbano, C. de Filippo, J. Kelso, and D. Reich for helpful comments; M. Kircher, K. Pruefer, and U. Stenzel for technical support; C. D. Bustamante and K. E. Lohmueller for access to human resequencing databases; D. L. Goode and A. Sidow for providing conservation scores; J. M. Akey for providing coordinates of genome-wide scans for selection; I. Gut for human genotyping; E. Leproust and M. Srinivasan for providing early access to the 1 Million feature Agilent microarrays; and the Genome Center at Washington University for pre-publication use of the orangutan genome assembly (http://genome.wustl.edu/genomes/view/pongo_abelii/). The government of the Principado de Asturias funded excavations at the Sidrón site. J.M.G. was supported by an NSF international postdoctoral fellowship (OISE-0754461) and E.H. by a postdoctoral training grant from the NIH and by a gift from the Stanley Foundation. G.J.H. is an investigator of the Howard Hughes Medical Institute, which together with the Presidential Innovation Fund of the Max Planck Society provided generous financial support. DNA sequences are deposited in the European Bioinformatics Institute short read archive, with accession number ERP000125. The array capture technologies used in this study are the subject of pending patent filings U.S. 60/478,382 (filed 2003) and U.S. 61/205,834 (filed 2009), on which G.J.H. and E.H. are listed as inventors.