Review

A Paleogenomic Perspective on Evolution and Gene Function: New Insights from Ancient DNA


Science  24 Jan 2014:
Vol. 343, Issue 6169, 1236573
DOI: 10.1126/science.1236573

Structured Abstract

Background

After three decades of research aimed at recovering DNA from preserved remains, the field of ancient DNA is moving rapidly toward the sequencing and analysis of complete paleogenomes. These data provide a means to better understand evolutionary processes through time, including inference of ancient demography and admixture between lineages, as well as adaptive evolution within populations.


The increasing scope of paleogenomics. The proportion of sequence reads that are mappable to a reference genome decreases rapidly with evolutionary distance (blue bars). Recent divergence from a living species is therefore key to successful paleogenomic assembly. Fortunately, most species are diverged from a living relative by <50 million years, so it should in principle be possible to generate paleogenomes for a wide taxonomic variety of organisms. Ma, millions of years ago.

Advances

Key advances enabling a paleogenomic perspective include improvements in DNA extraction and library preparation, as well as methods to enrich ancient libraries for targeted loci. These methods have made it possible to isolate ancient DNA from a far wider range of preservation environments than has been assumed to be attainable, including extending the temporal reach of ancient DNA back to nearly 1 million years.

Outlook

Although relatively few paleogenomes have been published to date, their number is rising rapidly, and it is increasingly clear that the range of specimens from which paleogenomes could be produced is much larger than has been assumed previously. As more data become available, including genomic data from living organisms, the capacity to use paleogenomic data to infer evolutionary change through time will continue to expand, particularly with respect to the evolution of populations and the link between genotype and phenotype.

The Present of the Past

A major goal of evolutionary biology is to understand the process of speciation and the changes that have accompanied and shaped the current distribution of species. Paleogenomics directly addresses these questions through the use of ancient DNA. Shapiro and Hofreiter (10.1126/science.1236573) review the origins and growth of this field and explain how challenges owing to the limited amount of DNA available for analysis and the possibility of contamination by modern material have been overcome.

Abstract

The publication of partial and complete paleogenomes within the last few years has reinvigorated research in ancient DNA. No longer limited to short fragments of mitochondrial DNA, researchers can now infer evolutionary processes through time from genome-wide data sampled as far back as 700,000 years. Tremendous insights have been gained, in particular regarding the hominin lineage. With rare exception, however, a paleogenomic perspective has been limited by the quality and quantity of recoverable DNA. Though conceptually simple, extracting ancient DNA remains challenging, and sequencing ancient genomes to high coverage remains prohibitively expensive for most laboratories. Still, with improvements in DNA isolation and declining sequencing costs, the taxonomic and geographic purview of paleogenomics is expanding at a rapid pace. With improved capacity to screen large numbers of samples for those with high proportions of endogenous ancient DNA, paleogenomics is poised to become a key technology to better understand recent evolutionary events.

The field of molecular genetics that studies ancient DNA has been among those most dramatically transformed by high-throughput, “next-generation” DNA sequencing (NGS) technologies (Fig. 1). Within the last several years, these technologies have made it possible to sequence and assemble ancient genomes (Table 1), an accomplishment that for much of the history of the field was widely believed to be impossible.

Fig. 1 Improvements in ancient DNA recovery through time.

The introduction of NGS substantially increased the amount of DNA that could be targeted in a single experiment, and more recent methodological advances have resulted in increasingly efficient DNA extraction and library preparation. Paleogenomics will always be limited by the amount of DNA that survives in a given sample; future advances will stem from continued improvements in DNA recovery efficiency, as well as from technical advances in sequencing, such as single-molecule sequencing, which will allow better characterization of surviving fragments of DNA.

Table 1 Paleogenomic and partial paleogenomic data sets generated using NGS and used for genome-scale evolutionary inference, as of December 2013.

n/r, not reported; n/a, not applicable; SNP, single-nucleotide polymorphism; indels, insertions and deletions.


The first ancient DNA sequences were reported three decades ago from a museum-preserved skin of the extinct quagga (1), and, nearly simultaneously, from an Egyptian mummy (2). Although the latter is now widely accepted to be the result of contamination, highlighting a major issue to be overcome, these early studies garnered enthusiasm to obtain DNA from fossils. The invention of the polymerase chain reaction (PCR), which amplifies nucleic acids (3), soon made it possible to target specific sequences, allowing for replication and validation. Not surprisingly, many of the most improbable results (DNA from dinosaurs and amber, for example) could not be validated and are now known to have been the result of contamination (4, 5). In response, the ancient DNA community adopted a suite of criteria for authenticating data (6, 7). These included replicability (if an ancient DNA sequence is real, it should be possible to reproduce it) and reliability (replicates of the same target sequence should be identical). Although some early ancient DNA studies targeted nuclear DNA (8–10), ancient nuclear DNA sequences authenticated with these criteria were rare.

The DNA sequences discussed here are labeled “ancient,” but this classification is less about age than about biochemical condition. Most recoverable fragments of ancient DNA are shorter than 100 base pairs (bp) in length (11) and contain miscoding lesions (12–14) that can result in erroneous sequences. Although it was predicted that ancient DNA would not survive for more than 100,000 years (15), it is now known that DNA can survive nearly an order of magnitude longer than that (16–18).

Ancient DNA makes it possible to observe changes in genetic diversity through time. It can be used to test hypotheses about the relationships between environmental events and evolutionary changes in populations [e.g., (19–21)]. It can also resolve controversy about evolutionary relationships between species [e.g., (22–25)] and provide calibrations for the molecular clock [e.g., (26)]. However, as ancient DNA has remained restricted primarily to high–copy number mitochondrial and chloroplast DNA, these inferences tend to come from single loci. Without access to the nuclear genome, it is not possible to infer extinct phenotypes, detect episodes of selection, or investigate hypotheses about ancient admixture.

With a complete genome, however, it is possible to infer even complex evolutionary relationships (Fig. 2). For example, if their age is known, paleogenomes can resolve and provide calibration for molecular phylogenies, as in a recent study of horse evolution (Fig. 2A) (18). If sequenced to sufficiently high coverage, paleogenomes can be used to infer long-term demographic trends. For example, using coalescent theory combined with genome-wide heterozygosity (27), the demographic history of the Denisovans, an archaic hominin group known only from Denisova cave in the Altai Mountains in Siberia, was inferred (Fig. 2B) (28). Paleogenomic data can also be used to reveal otherwise cryptic relationships between past and present populations. Most notable has been the discovery of admixture between Neandertals, Denisovans, and anatomically modern humans (29–31). As greater numbers of paleogenomes become available, it is likely that similar situations will be revealed for other taxa, providing increased power to understand the relationship between environmental change and biodiversity (Fig. 2C). Beyond demographic inferences, paleogenomes can be used to identify selection within the genome, including genetic changes that may underlie species-specific traits (Fig. 2D). For example, 367 mutations in genes, regulatory regions, and splice sites that have become fixed in humans since divergence from Denisovans were identified from the Denisova genome (28), presenting potential targets for future functional analyses. Finally, paleogenomes provide a means to investigate genome evolution (32), including the evolution of pathogenicity (33–38).

Fig. 2 Insights made possible through the analysis of paleogenomes.

(A) Paleogenomes can be used both to resolve evolutionary relationships and to provide a source of calibration for a molecular clock: Multiple genome alignments of horses including a 700,000-year-old paleogenome pushed back estimates of the divergence among Equus to more than 4 Ma (18). (B) High-coverage paleogenomes can be used to infer complex demographic histories of an extinct lineage: The 30× Denisova genome was used to infer the size of the Denisova population through time (28). ka, thousands of years ago. (C) A simulated data set describing how paleogenomic data can reveal the effect of environmental change on genetic diversity. A previously widespread population (orange circles) becomes subdivided into two isolated populations (orange and blue) during the glacial maximum (~20 ka), when ice sheets block dispersal between the north and south. As the ice recedes, both populations expand into the deglaciated area, resulting in a hybrid zone (shaded circles). Only admixed individuals survive to the present day. (D) Comparison between loci makes it possible to distinguish regions of the genome or phenotypes that (i) are evolving neutrally versus those that (ii) have undergone a recent selective sweep. Based on comparison with the Neandertal genome, 4235 genomic regions >25 kb in length were identified as having swept to fixation in modern humans (28).

The First Paleogenomes and the Enduring Curse of Contamination

In 2005, a high-throughput approach was used to sequence ~15,000 bacterial colonies containing DNA sampled from two ~40,000-year-old Austrian cave bears (39). The result was a mixed sample of cave bear, bacteria, fungi, plant, and other sequences, where less than 6% of the recovered DNA was determined to be that of cave bear (39). Nevertheless, the 27 kb of cave bear nuclear DNA established that, in principle, it would be possible to sequence and assemble a paleogenome.

The low percentage of endogenous DNA in these samples is not surprising. When an organism dies, its DNA begins to decay almost immediately and continues to decay at a rate determined by the environment (15, 40). Cold, dry environments discourage the growth of microorganisms and minimize chemical damage. Remains that are quickly buried and, ideally, frozen tend to be best preserved. Extracted ancient DNA is always a mixture of organismal and environmental DNA, including DNA from bacteria, fungi, and other organisms that colonize the sample during burial, and any contamination occurring during excavation and processing. However, low endogenous percentages do not rule out paleogenomic analysis. For example, paleogenomic data were used to infer the evolutionary relationship between a 6000-year-old Myotragus (an extinct bovid) and other bovid species despite 0.27% endogenous content (41). Even lower values (0.01 to 0.03% endogenous DNA) were reported from a ~40,000-year-old human bone from Tianyuan Cave, China, and yet a complete mitochondrial genome and several nuclear loci were reconstructed (42). Although samples with more endogenous DNA are better targets for sequencing, endogenous DNA content varies widely, even between samples with similar preservation histories (43–45), making sample selection a difficult but important step.
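To make the cost implication of low endogenous content concrete, the following is a minimal back-of-the-envelope sketch, not a method taken from any of the cited studies. It assumes that every endogenous base maps uniquely and that there are no duplicate reads, so it gives a lower bound on sequencing effort. The function name and the 3-Gb genome size are illustrative assumptions.

```python
def raw_bases_needed(genome_size_bp, target_coverage, endogenous_fraction):
    """Rough lower bound on total bases to sequence for a target coverage.

    Assumes (optimistically) that all endogenous bases map uniquely and
    that no reads are duplicates; real projects need more sequencing.
    """
    if not 0 < endogenous_fraction <= 1:
        raise ValueError("endogenous_fraction must be in (0, 1]")
    if genome_size_bp <= 0 or target_coverage <= 0:
        raise ValueError("genome size and coverage must be positive")
    # Only a fraction of sequenced bases come from the target organism,
    # so the raw total scales inversely with that fraction.
    return genome_size_bp * target_coverage / endogenous_fraction


# Illustrative values: a hypothetical 3-Gb genome at 1-fold coverage,
# with the 0.27% endogenous content reported for the Myotragus sample.
estimate = raw_bases_needed(3_000_000_000, 1.0, 0.0027)
print(f"{estimate:.3g} raw bases")
```

Under these assumptions, a sample with 0.27% endogenous DNA requires on the order of a trillion sequenced bases for 1-fold coverage of a 3-Gb genome, whereas a sample with 70% endogenous content requires only a few billion.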

The first paleogenomic studies using NGS produced ~13 Mb of nuclear DNA from a 28,000-year-old mammoth fossil (46) and ~1 Mb of Neandertal DNA (47). A simultaneous project using bacterial cloning to sequence the same Neandertal extract produced ~60 kb of Neandertal nuclear DNA. Of the two Neandertal studies, the one that used bacterial cloning inferred an older common ancestor for the lineages leading to humans and Neandertals. A reanalysis of the NGS-derived data (48) suggested that more than 50% of the sequences may have been contaminants from modern humans.

Remains can become contaminated with human DNA at any point during excavation, storage, and processing. The most reliable methods to estimate contamination do so directly, by identifying sequence motifs that differ between the paleogenome and the potential contaminant and then calculating the proportion of contaminating sequences (49). This approach was used to estimate the amount of contamination in the mitochondrial component of the NGS Neandertal data set (47). Positions that differed between the newly available, complete Neandertal mitochondrial genome and humans were identified and counted, and it was estimated that ~11% of the original mitochondrial data were modern human contaminants (50). This direct approach is now widely used in paleogenomic analyses [e.g., (29, 31, 51)].
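The direct approach described above reduces to a proportion: among reads that overlap positions where the ancient genome and the potential contaminant differ, count how many carry the contaminant allele. The sketch below illustrates that arithmetic only. It is not the published pipeline, the read counts are hypothetical, and the Wilson score interval is an added assumption for expressing uncertainty rather than something the source specifies.

```python
import math


def contamination_estimate(n_contaminant, n_endogenous, z=1.96):
    """Point estimate and Wilson score interval for the contamination rate.

    n_contaminant: reads at diagnostic positions carrying the contaminant
                   (e.g., modern human) allele
    n_endogenous:  reads at the same positions carrying the ancient allele
    z:             normal quantile (1.96 gives roughly a 95% interval)
    """
    total = n_contaminant + n_endogenous
    if total == 0:
        raise ValueError("no reads overlap diagnostic positions")
    p = n_contaminant / total
    # Wilson score interval, which behaves sensibly for small counts
    # and for proportions near 0 or 1.
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts chosen only to illustrate the calculation.
rate, low, high = contamination_estimate(n_contaminant=11, n_endogenous=89)
print(f"contamination ~ {rate:.2%} (interval {low:.2%} to {high:.2%})")
```

In practice, the choice of diagnostic positions matters as much as the arithmetic: positions affected by the miscoding lesions typical of ancient DNA would bias a simple count like this one.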

Although the source of contamination in the first NGS-derived Neandertal data set remains unknown and later Neandertal research has superseded these early data, the issue provided an important lesson to the paleogenomics community: The sequencing library was not prepared in a sterile laboratory (50), and this may have provided an opportunity for contamination. Consequently, paleogenomic libraries are now routinely prepared in dedicated ancient DNA facilities.

Hominin Paleogenomics

In 2010, a 20-fold coverage genome of a 4000-year-old paleo-Eskimo from Greenland’s Saqqaq culture was isolated from a tuft of hair (51). Hair is a good source of ancient DNA because its hydrophobic exterior limits colonization by bacteria and makes it possible to clean the surface before extraction (52). In assembling these data, a potential general limitation of paleogenomics was revealed: Even with 20-fold coverage, only 79% of the Saqqaq genome could be determined. This is likely a consequence of the short length of ancient DNA fragments (an average for the Saqqaq specimen of 55 bp) (51). Although there is no strict rule, most very short fragments cannot be mapped unambiguously to a single location in a genome, particularly when that genome is highly repetitive, as are most eukaryotic genomes. Unfortunately, most ancient sequences are as short as or shorter than the Saqqaq sequences [e.g., (17, 28, 29)] (Table 1). Even those isolated from a tuft of 100-year-old hair from an Australian aborigine were, on average, only 69 bp long, despite the specimen’s young age (53). Given the challenge of accurately mapping short reads to a reference genome, it has been standard in paleogenomic assemblies to discard sequence fragments <30 bp in length (17, 28, 29). This suggests that, even with improved methodologies to recover the shortest surviving DNA fragments (17), it may not be possible to sequence any eukaryotic paleogenome truly to completion.
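The length cutoff mentioned above is simple to express in code. The following is a minimal sketch, assuming reads are held in memory as plain strings; real pipelines operate on sequencing file formats and also filter on mapping quality, which is omitted here. The function names and toy reads are hypothetical.

```python
def filter_short_fragments(fragments, min_length=30):
    """Keep only fragments at least min_length bases long.

    The default of 30 follows the cutoff described in the text
    (fragments shorter than 30 bp are discarded).
    """
    return [seq for seq in fragments if len(seq) >= min_length]


def mean_fragment_length(fragments):
    """Average fragment length in bases."""
    if not fragments:
        raise ValueError("no fragments supplied")
    return sum(len(seq) for seq in fragments) / len(fragments)


# Hypothetical toy reads of length 20, 40, 55, and 27 bases.
reads = ["ACGT" * 5, "ACGT" * 10, "ACGTA" * 11, "ACG" * 9]
kept = filter_short_fragments(reads)
print(len(reads) - len(kept), "reads discarded;",
      "mean retained length:", mean_fragment_length(kept))
```

Raising the cutoff reduces ambiguous mappings but also discards a growing share of the surviving molecules, which is the trade-off behind the incomplete Saqqaq assembly described above.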

Soon after the Saqqaq paleogenome, a 1.3-fold coverage Neandertal genome (29) was produced from bones from Vindija Cave in Croatia that contained only 1 to 5% endogenous DNA. This was quickly followed by a 1.9-fold coverage genome from a hominin from Denisova cave (30) and an 11-fold coverage genome from an Australian aborigine (53). The Denisova genome was later improved to 30-fold coverage (28) thanks to very high (~70%) endogenous DNA content and a new, more efficient method to prepare sequencing libraries (54). Recently, a ~50-fold coverage Neandertal paleogenome was recovered from another extremely well preserved bone with a high (~70%) endogenous content, also from a cave in the Altai Mountains of Siberia (31).

Analyses of these paleogenomes revealed several episodes of admixture between hominin lineages during recent evolutionary history. For example, 1 to 4% of the genomes of all modern humans except sub-Saharan Africans is derived from admixture from Neandertals (29, 31). This finding remains controversial; ancient population structure in the African population ancestral to humans and Neandertals has been proposed as an alternative explanation [e.g., (55–57)]. Analyses of the Denisovan paleogenome convincingly support the admixture model, however. Although the Denisovan mitochondrial genome is distantly related to that of both humans and Neandertals (58), analysis of the Denisovan nuclear genome shows that Denisovans and Neandertals are sister groups with respect to humans. Thus, they are likely descended from the same original hominin group (30, 59). If ancient population structure in Africa were to explain the sharing of alleles between the Vindija Neandertals and modern Eurasians, the Denisova genome should show the same pattern of allele sharing. However, there is no evidence of allele sharing between Denisovans and modern Europeans or East Asians (30). Instead, the Denisova genome shares a number of rare polymorphisms (around 5 to 7% of the genome) with modern Australian and Melanesian populations (28, 53, 60).

Paleogenomes have also been used to learn specific details about an individual or population. For example, the high-coverage Altai Neandertal paleogenome revealed that inbreeding among close relatives was common in Neandertal population history (31). Within modern humans, paleogenomic analyses have confirmed that the Saqqaq culture represented a different migration from that which later established Inuit populations in Greenland (51), and that Australian aborigines arrived in Australia during a wave of human dispersals before divergence between modern Europeans and Asians (53). At the level of the individual, analyses of a paleogenome of the 5300-year-old Tyrolean Iceman (61) showed that his closest genetic affiliations were with modern Sardinians, even though his remains were recovered from the European Alps. A partial paleogenome from 5000-year-old remains of a human farmer from Gotland, southeastern Sweden, also revealed close affiliations with living southern Europeans and not with 5000-year-old hunter-gatherers from Gotland (62). Together with the Iceman’s genome, these data provide evidence that the spread of agriculture across Europe involved the movement of people and not only ideas.

Beyond Hominins and the Future of Paleogenomics

As of October 2013, the only vertebrate lineages other than hominins for which a >1-fold coverage paleogenome is published are polar bears (Ursus maritimus) (63, 64) and horses (Equus) (18). A partial mammoth genome has been published (65), but despite the excellent biomolecular preservation of permafrost-preserved mammoth remains, no high-coverage genome yet exists. The small number of paleogenomes is at least partly due to the paucity of fossils with high proportions of endogenous DNA. Some substrates, such as hair, contain higher fractions of endogenous versus environmental DNA than do others, but hair is uncommon in the fossil record. Although there is presently no widely implemented method to predict endogenous DNA content, quantitative PCR can estimate relative abundance of environmental versus endogenous DNA (66). Sequencing pooled, barcoded libraries to low coverage can also estimate the quality of each library at low cost. Recently, progress has been made in both ancient DNA isolation and target enrichment. For example, a new extraction protocol increases recovery of the shortest DNA fragments (17) and, consequently, may enrich for endogenous DNA, which tends to be more fragmented than environmental and other contaminants. Also, a new method for preparing genomic libraries retains single-stranded, as well as double-stranded, molecules (54).

DNA hybridization capture methods also aim to enrich genomic libraries for endogenous relative to environmental DNA. Large-scale enrichment approaches rely on a process of selective hybridization, whereby synthesized bait molecules representing targeted regions hybridize with and immobilize ancient DNA sequences in the library, and anything that does not hybridize is washed away (67). These methods have targeted genomic DNA from hominins (42, 68) and maize (69); mitochondrial genomes of archaic and modern humans (42, 44, 58, 70, 71), horses (18), a cave bear (17), and the oldest putative dog remains (72); and DNA from multiple pathogenic organisms (33–35, 37). One potential complication of whole-genome enrichment is high–copy number sequences, which tend to dominate enriched libraries (69). Nonetheless, capture-enrichment methods have increased the range of samples useful for paleogenomic research and remain a promising area of research.

Interpreting Paleogenomes

Although nonhuman and pathogenic organisms represent a major area of growth in paleogenomics, most nonhuman taxa lack high-quality, annotated, reference genomes. This presents a challenge to genome assembly and limits biological insight. For example, the mammoth paleogenomic data confirmed a slower evolutionary rate among elephantids than among hominids (65), but this was described previously from an analysis of mitochondrial genomes (73). Also, the main insight obtained from the ancient horse paleogenome was that the genus Equus began to diverge 4 to 4.5 million years ago (Ma), much older than previous estimates (18). Assembling and interpreting paleogenomes will undoubtedly become simpler as more genomes are produced, in particular genomes from taxonomically diverse organisms. As coverage depth increases, it will also become possible to perform analyses that rely on accurate estimates of heterozygosity, such as estimates of changes in population size through time (27).

Demographic inference and admixture analysis are not the only applications of paleogenomes. Paleogenomes whose ages are well constrained may be useful to calibrate a molecular clock or to investigate genome stability, for example by tracing movements of transposable elements through time (32). Multiple paleogenomes of the same species will enable inference of changes in selection pressure over time, allowing direct observation of Darwinian evolution. For example, as a reaction to potato blight, plant breeders introduced genes from wild relatives into the potato genome, which provided resistance to infection by the oomycete Phytophthora infestans. In response, P. infestans evolved new effector protein alleles that enabled it to infect resistant plants. P. infestans paleogenomes isolated from historic specimens were missing these new alleles (36, 38). Similarly, by observation of genomic differences that accumulate through time, paleogenomes could provide a means to discover domestication-associated genes (74), particularly where comparison between wild and domestic genotypes is not possible, either because the wild form is extinct (e.g., European cattle) (75) or because of relatively recent interbreeding between the wild and domestic forms (e.g., pigs) (76).

Understanding how extinct organisms differed from living organisms remains another major objective of paleogenomics. Linking genotype to phenotype has been possible by using PCR (77, 78); however, few insights have been gained thus far from paleogenomes. Lists of genes that may influence phenotype, generated from positive selection scans, have been published for hominins (29), bears (63), and horses (18). However, these data lack functional verification and remain speculative.

More progress has been made in identifying and describing the function of genes passed into the human lineage after admixture with archaic humans. Abi-Rached et al. (79) identified a specific human leukocyte antigen (a gene involved in the human immune response) allele that was acquired by humans from Denisovans and has since risen to high frequency in some west Asian populations. Understanding how the archaic version of this gene differs from human versions, and why the archaic version may be increasing in frequency, may shed light on the evolution of the human immune system. Interestingly, alleles for two other immune-related genes, OAS and STAT2, have also been identified as having introgressed into modern humans from Neandertals and Denisovans (80, 81).

Looking Ahead

As methods to isolate and sequence endogenous ancient DNA continue to improve, the next few years will almost certainly see an explosion in the taxonomic diversity, number, and temporal range of published paleogenomes. Although the de novo assembly of most paleogenomes will remain limited by the short fragment length of ancient DNA (82), increasingly evolutionarily diverse genomes from living organisms will provide scaffolds against which most paleogenomes can be assembled (Fig. 3).

Fig. 3 The relationship between evolutionary distance and the utility of using an extant taxon as a reference for paleogenome assembly.

Increasing evolutionary distance results in a rapid decrease in the proportion of reads mappable to the reference genome (blue bars) (82). We selected 11 species for which obtaining a paleogenome is feasible on the basis of known DNA preservation and plotted them against approximate divergence from their closest living relative (x axis). Apart from the moa, most species have a living relative that is diverged by no more than ~50 Ma, suggesting that it should in principle be possible to use their genomes as references for assembling paleogenomes.

A major goal of genomics is to infer function directly from a genome. Although it is not possible to observe many ancient phenotypes, it may be possible to recover epigenetic information from some paleogenomic data sets (83); additional work in this very new area will reveal how useful such epigenetic information will be. Improved annotation of modern genomes will also greatly facilitate the analysis and interpretation of paleogenomes. Better integration with other fields of research, including developmental and synthetic biology and biochemistry, will no doubt facilitate achieving these goals. Although paleogenomes are not necessary to understand how a genome encodes an organism, genomic data from extinct lineages will reveal extinct alleles, such as the changes observed in mammoth hemoglobin that appear to have provided an adaptive advantage to elephantids in cold climates (84). Finally, although deextinction remains a controversial topic with many barriers to success (85), a multidisciplinary approach may make it possible to revive extinct phenotypes (86), suggesting that at least some aspects of extinction may not be forever.

Perhaps most importantly, the last few years of paleogenomic research have revealed that the ancient DNA community may have been overcautious with regard to the time scale and range of substrates suitable for analysis. We have learned, for example, that with a conscientious approach to avoiding contamination, it is possible to generate high-quality ancient human genomes. Paleogenomes isolated from pathogenic organisms have confirmed that pathogen DNA survives in the fossil record. A 700,000-year-old horse genome (18) and >300,000-year-old mitochondrial genomes from a cave bear (17) and a hominin (71), both from bones preserved in Spanish caves, indicate that DNA preservation extends further back in time and across a wider range of environments than has been generally assumed. Although in many cases individual specimens will continue to yield key information (for example, about demography or paleoecology), the next phase of paleogenomic inference is likely to come from population-level data sets, which will provide a means to explore adaptive evolution directly through time. As the number and range of published paleogenomes grow, paleogenomics is poised to play an increasingly important role in improving our understanding of evolutionary processes over the short and medium term.

References and Notes

  1. Acknowledgments: We thank J. Cahill for providing coverage statistics for ancient polar bears, and apologize in advance if any studies were inadvertently omitted from Table 1. We thank R. E. Green for helpful comments on earlier drafts of the manuscript.