Evidence for DNA Loss as a Determinant of Genome Size

Science  11 Feb 2000:
Vol. 287, Issue 5455, pp. 1060-1062
DOI: 10.1126/science.287.5455.1060


Eukaryotic genome sizes range over five orders of magnitude. This variation cannot be explained by differences in organismic complexity (the C value paradox). To test the hypothesis that some variation in genome size can be attributed to differences in the patterns of insertion and deletion (indel) mutations among organisms, this study examines the indel spectrum in Laupala crickets, which have a genome size 11 times larger than that of Drosophila. Consistent with the hypothesis, DNA loss is more than 40 times slower in Laupala than in Drosophila.

Wide variation in eukaryotic genome size is a pervasive feature of genome evolution. Large differences in haploid DNA content (C value) are found within protozoa (5800-fold range), arthropods (250-fold), fish (350-fold), algae (5000-fold), and angiosperms (1000-fold) (1). This variation is called the C value paradox (2, 3) because genome size is not correlated with the structural complexity of organisms or with the estimated number of genes. Despite much progress in the study of genomes, the C value paradox remains largely unresolved.

Drosophila species, which have small genomes, spontaneously lose DNA at a much higher rate than mammalian species, which have large genomes (4–7). Many mechanisms can affect genome size, including polyploidy, fixation of accessory chromosomes or large duplications (8), and expansions of satellite DNA or transposable elements (9). Even so, the Drosophila findings suggest that some differences in haploid genome size may result from variation in the rate of spontaneous loss of nonessential DNA (4). Here, we test this hypothesis by examining the indel spectrum in Hawaiian crickets (Laupala), which have a genome size ∼11-fold larger than that of Drosophila (10). Specifically, we test the prediction of a lower rate of DNA loss in Laupala than in Drosophila, corresponding to the large difference in genome size.

Sequences unconstrained by natural selection exhibit patterns of substitution that reflect the underlying spectra of spontaneous mutations (11). As pseudogene surrogates we chose nontransposing copies of non-LTR (long terminal repeat) retrotransposable elements (4, 12). Transposition of non-LTR elements usually results in a 5′-truncated copy that is unable to transpose because it lacks a promoter and the capacity to encode functional proteins (13, 14); these “dead-on-arrival” (DOA) elements are essentially pseudogenes.

We identified a new non-LTR element in Laupala, here designated Lau1, by means of polymerase chain reaction (PCR) with degenerate primers to conserved regions of the non-LTR reverse transcriptase (15). Evolution of unconstrained DOA elements can be distinguished from that of the constrained, active elements via phylogenetic analysis of nucleotide sequences of individual DOA elements (4, 12). Substitutions in a transpositionally active lineage are represented in multiple DOA elements generated by transposition of the active copy, whereas substitutions in each DOA lineage are unique (barring parallel mutations) because of the inability of DOA elements to transpose. This implies that, in a gene tree of non-LTR sequences from closely related species, the active lineages map to internal branches (identified through substitutions shared among elements), whereas DOA lineages map to terminal branches (identified through unique substitutions). Some DOA lineages may also map to internal branches, because elements from different species may be identical by descent (IBD), having been transmitted as the same (allelic) DOA copy from a common ancestor (7, 12). Nevertheless, as long as the number of active lineages is small and the sampling is dense, substitutions in the terminal branches will correspond primarily to the DOA element evolution (12).

If the terminal branches of the Lau1 gene tree (Fig. 1) represent unconstrained evolution of DOA elements, we predict the absence of purifying selection operating along these branches. Confirming this prediction, point substitutions in terminal branches map with equal frequencies to all three codon positions (G test; P = 0.64). In addition, the terminal branches feature numerous element-specific indels (48 deletions and 18 insertions in 49 terminal branches). The internal branches show evidence of relaxed selection as well, which suggests that many elements in our sample are IBD through inheritance of ancestral allelic DOA copies. The substitutions in the internal branches are found at equal frequencies among the three codon positions (G test; P = 0.55), with many indels shared among sequences. Most shared indels can be assigned to the tree without homoplasy (21 of 24 deletions and 8 of 8 insertions), which offers strong independent support for our phylogenetic inference (16). Although the deep internal branches are expected to show evidence of purifying selection (4, 12), in this case they are too short and poorly resolved to demonstrate this feature.
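The codon-position test described above can be sketched as a G test of goodness of fit against equal expected frequencies. The counts used here are hypothetical placeholders, since the per-position counts are not given in the text, and the function name is illustrative only. With three categories there are 2 degrees of freedom, for which the chi-square tail probability has the closed form exp(−G/2).

```python
import math

def g_test_uniform(observed):
    """G test of goodness of fit against equal expected counts.

    Returns (G, P). The closed-form P value exp(-G/2) holds only for
    exactly three categories (2 degrees of freedom).
    """
    if len(observed) != 3:
        raise ValueError("closed-form P value assumes three categories")
    total = sum(observed)
    expected = total / 3
    # Zero counts contribute nothing to G (limit of x*ln(x) as x -> 0).
    g = 2 * sum(o * math.log(o / expected) for o in observed if o > 0)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    return g, math.exp(-g / 2)

# Hypothetical substitution counts at codon positions 1, 2, and 3.
counts = [100, 110, 95]
g, p = g_test_uniform(counts)
print(f"G = {g:.3f}, P = {p:.3f}")
```

A large P value, as reported for both terminal and internal branches, means the data give no reason to reject equal frequencies across codon positions, which is the pattern expected in the absence of purifying selection.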

Figure 1

Phylogenetic analysis of Lau1. The 50% majority-rule tree of 1000 equally parsimonious trees is shown, rooted at midpoint. Numbers of unambiguous nucleotide substitutions are shown above each branch. Indels on internal branches are shown by solid (deletions) and open (insertions) bars. The tree length is 1014 steps (consistency index = 0.64). The Lau1 gene trees were estimated by NJ, UPGMA, maximum-likelihood (F84 model), and maximum-parsimony methods [as implemented in PAUP4.0b (24)], with equal weighting of all positions, ignoring insertions, and treating deletions as missing data. Although trees differ somewhat, depending on the reconstruction method, none of the conclusions is sensitive to these differences. This is because the conclusions depend only on the changes observed in the terminal branches (the DOA lineages), whereas the alternative trees differ primarily in the deep internal branches. Sequence alignment was done by hand with the aid of Sequencher 3.0 (GeneCodes).

Our analysis of the relative rates of indels and nucleotide substitutions is confined to the terminal branches only. The length of these branches varies widely, from 0 (in Laupala kohalensis 514) to 50 (in L. kohalensis 183), most likely because individual elements have been accumulating independent nucleotide substitutions for different lengths of time, starting from either the original DOA insertion or, for allelic copies, the most recent speciation. As expected on the basis of this supposition, we find strong positive correlations both between the numbers of insertions and nucleotide substitutions (Spearman's rank correlation, r_s = 0.58; P = 8 × 10⁻⁶) and between the numbers of deletions and nucleotide substitutions (r_s = 0.79; P = 1.4 × 10⁻¹¹).
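Spearman's rank correlation, used above, is the Pearson correlation of the rank-transformed data, with tied values given the mean of their ranks. The following is a minimal sketch under that definition. The per-branch counts are hypothetical, and the function names are illustrative rather than taken from any package.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's r_s: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical counts per terminal branch (not the Lau1 data).
substitutions = [0, 3, 7, 12, 20, 50]
deletions = [0, 0, 1, 1, 2, 5]
print(f"r_s = {spearman(substitutions, deletions):.2f}")
```

Because the statistic depends only on ranks, it is insensitive to the wide spread of branch lengths, which range from 0 to 50 substitutions. The sketch does not compute a P value and is undefined when either variable is constant.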

These positive correlations provide a basis for estimating relative rates of indels and point substitutions in Laupala (Table 1). The maximum likelihood (ML) estimate of the rate of deletions relative to nucleotide substitutions (4) is about half as great in Laupala as in Drosophila (Fig. 2), and the difference is statistically significant (ML analysis, P = 1 × 10⁻³). Laupala exhibits a 40% higher rate of insertions than does Drosophila, although this is not statistically significant (ML analysis, P = 0.14). The size distribution of indels also implies a lower rate of DNA loss in Laupala. On average, Laupala deletions are almost four times smaller (Wilcoxon test, P = 0.009), and insertions are almost two times larger (17) (Wilcoxon test, P = 0.03) than those in Drosophila. Most of the difference in DNA loss is due to Laupala having a much smaller fraction of deletions larger than 15 base pairs (bp) (Fig. 3). For deletions smaller than 15 bp, the rates of deletions per substitution are indistinguishable (ML analysis, P = 0.34).
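One simple way to picture an ML comparison of deletion rates is a Poisson model in which the number of deletions on a branch has mean equal to a rate times the number of substitutions on that branch. This is an assumption made for illustration and is not necessarily the procedure of ref. 4. Under this model the ML rate is total deletions divided by total substitutions, and two species can be compared with a likelihood-ratio test on 1 degree of freedom. All names and counts below are hypothetical.

```python
import math

def ml_rate(deletions, substitutions):
    """ML deletion rate per substitution under the assumed Poisson model."""
    return sum(deletions) / sum(substitutions)

def loglik(deletions, substitutions, rate):
    """Poisson log-likelihood, dropping terms that do not depend on rate."""
    total_del, total_sub = sum(deletions), sum(substitutions)
    log_term = total_del * math.log(rate) if total_del > 0 else 0.0
    return log_term - rate * total_sub

def lrt_equal_rates(del_a, sub_a, del_b, sub_b):
    """Likelihood-ratio test of one shared rate versus one rate per species.

    Returns (G, P), with P from the chi-square distribution on 1 degree of
    freedom, whose tail probability equals erfc(sqrt(G / 2)).
    """
    r_a, r_b = ml_rate(del_a, sub_a), ml_rate(del_b, sub_b)
    r_pooled = ml_rate(del_a + del_b, sub_a + sub_b)
    ll_separate = loglik(del_a, sub_a, r_a) + loglik(del_b, sub_b, r_b)
    ll_pooled = loglik(del_a, sub_a, r_pooled) + loglik(del_b, sub_b, r_pooled)
    g = max(2 * (ll_separate - ll_pooled), 0.0)
    return g, math.erfc(math.sqrt(g / 2))

# Hypothetical per-branch counts for two species (not the published data).
g, p = lrt_equal_rates([1, 0, 2], [10, 8, 25], [4, 3, 6], [12, 9, 20])
print(f"G = {g:.2f}, P = {p:.4f}")
```

The same construction applies to insertions, or to deletions restricted to a size class such as those smaller than 15 bp.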

Figure 2

Relationships between numbers of deletions and numbers of terminal-branch substitutions corrected for sequence length and multiple substitutions (Jukes-Cantor correction) in Drosophila (4, 7) (open circles) and Lau1 (solid circles). ML regression line for Drosophila is dashed and that for Lau1 is solid.
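The Jukes-Cantor correction named in the Figure 2 legend converts an observed proportion of differing sites, p, into an estimated number of substitutions per site, d = −(3/4) ln(1 − 4p/3), which accounts for multiple substitutions at the same site. A minimal sketch follows. The function name and the example numbers are illustrative assumptions, not values from the study.

```python
import math

def jukes_cantor_distance(n_differences, n_sites):
    """Estimated substitutions per site under the Jukes-Cantor model.

    n_differences: observed number of differing sites
    n_sites: number of aligned sites compared
    """
    p = n_differences / n_sites
    if p >= 0.75:
        # The correction is undefined once sequences are saturated.
        raise ValueError("observed proportion of differences must be < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

# Hypothetical branch: 10 observed differences over 100 aligned sites.
d = jukes_cantor_distance(10, 100)
print(f"corrected distance = {d:.4f} substitutions per site")
```

Dividing by the number of sites handles the correction for sequence length, and the logarithm handles multiple hits. For short branches the corrected value is only slightly larger than the raw proportion.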

Figure 3

Distribution of deletion sizes in Lau1 (hatched bars) and Drosophila (7) (open bars).

Table 1

Comparison of deletion and insertion profiles in Laupala and Drosophila.

The differences in indel spectra result in a 10.8-fold lower rate of DNA loss per nucleotide substitution in Laupala versus Drosophila. Our data also suggest that the rates of nucleotide substitution vary in a way that magnifies the difference in the rates of DNA loss. Figure 1 includes five clusters of sequences that share indels within each cluster and that also contain a sequence of Lau1 from Prolaupala kukui (for example, the cluster of Laupala molokoiensis 52 and P. kukui 102). Within each cluster the sequences are likely to be IBD from a DOA copy present in a common ancestor of Laupala and Prolaupala. By dividing the pairwise nucleotide distances between Prolaupala and Laupala elements within each cluster by 10 million years (Myr) (twice the divergence time of Laupala from Prolaupala estimated from biogeography) (18), we obtain an estimate of the absolute rate of point substitution. The average over all clusters is 3.8 × 10⁻³ nucleotide substitutions per Myr, which is almost four times lower than the 15 × 10⁻³ nucleotide substitutions per Myr estimated in Drosophila (19). Combined with the 10.8-fold lower rate of DNA loss per nucleotide substitution in Laupala, the fourfold lower rate of nucleotide substitution yields an overall rate of DNA loss per Myr that is 42-fold lower in Laupala than in Drosophila. Thus, as we predicted, the rate of DNA loss in Laupala is substantially lower than that in Drosophila.
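The arithmetic behind the overall estimate can be retraced in a few lines. The rates and the 10.8-fold figure are taken from the text. The variable names, the helper function, and the example pairwise distance of 0.038 (back-calculated from the reported average rate) are illustrative assumptions.

```python
def absolute_substitution_rate(pairwise_distance, elapsed_myr=10.0):
    """Substitution rate per Myr from a pairwise distance.

    elapsed_myr is twice the divergence time, because both lineages
    accumulate substitutions independently after the split.
    """
    return pairwise_distance / elapsed_myr

# Values reported in the text.
relative_loss_fold = 10.8   # lower DNA loss per substitution in Laupala
laupala_rate = 3.8e-3       # substitutions per Myr
drosophila_rate = 15e-3     # substitutions per Myr

# How many times slower substitutions accumulate in Laupala.
substitution_fold = drosophila_rate / laupala_rate

# The two effects multiply to give the difference in DNA loss per Myr.
overall_fold = relative_loss_fold * substitution_fold

# Hypothetical pairwise distance consistent with the reported average rate.
example_rate = absolute_substitution_rate(0.038)

print(f"substitution rate ratio: {substitution_fold:.2f}")
print(f"overall fold difference in DNA loss per Myr: {overall_fold:.1f}")
```

The product falls between 42 and 43, in line with the 42-fold difference reported above. Using the rounded factor of 4 instead of the exact ratio would give 43.2.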

We have examined the possibility that, to optimize genome size, each individual DOA copy may be selected for length (4, 5, 20). Selection of this type should produce a correlation between the number of substitutions and the lengths of deletions in the terminal branches (12, 20), but the predicted correlation is not observed in either our Laupala data (r_s = −0.07; P = 0.62) or the sample of DOA elements studied in Drosophila (7, 20). These results imply that our estimates of the indel spectra are not significantly biased by natural selection for individual indels.

Our results have no bearing on the presence or absence of selective forces that may affect genome size (1), nor do they imply anything about the lengths of particular classes of constrained sequences, such as introns, intergenic spacers, or 5′ and 3′ untranslated regions (21, 22). Our data do suggest one reason why some small genomes have few pseudogenes (21, 23): a high rate of DNA loss should result in a lower steady-state number of pseudogenes in small genomes.

The key question that remains is empirical and quantitative: how much of the variation in genome size can be explained by variation in the indel spectra? The relative ease with which indel patterns can be assayed using non-LTR elements should enable us to answer this question in a wide variety of eukaryotes and thus to test the mutational hypothesis for the C-value paradox in a comprehensive fashion.

* To whom correspondence should be addressed. E-mail: dpetrov{at}

