Evidence for a High Frequency of Simultaneous Double-Nucleotide Substitutions


Science  18 Feb 2000:
Vol. 287, Issue 5456, pp. 1283-1286
DOI: 10.1126/science.287.5456.1283


Point mutations are generally assumed to involve changes of single nucleotides. Nevertheless, the nature and known mechanisms of mutation do not exclude the possibility that several adjacent nucleotides may change simultaneously in a single mutational event. Two independent approaches are used here to estimate the frequency of simultaneous double-nucleotide substitutions. The first examines switches between TCN and AGY (where N is any nucleotide and Y is a pyrimidine) codons encoding absolutely conserved serine residues in a number of proteins from diverse organisms. The second reveals double-nucleotide substitutions in primate noncoding sequences. These two complementary approaches provide similar high estimates for the rate of doublet substitutions, on the order of 0.1 per site per billion years.

Mutational events can be studied either by direct observation of mutations in the laboratory or by comparing sequences that have been accumulating mutations naturally, during evolution. Studies of the first kind have suggested that some mutations can involve multiple nucleotide changes (1, 2), and indeed, mechanisms that affect neighboring nucleotides are known. Examples include template-directed mutations occurring during DNA repair and replication (1) or dipyrimidine lesions induced by ultraviolet light (2, 3). Some evolutionary comparisons have also suggested that simultaneous double-nucleotide substitutions occur at neighboring sites (4), but the significance and generality of these observations have been questioned (5). Thus, changes in neighboring nucleotides are usually attributed to coincidence of independent mutations.

We used two independent and complementary approaches based on sequence comparisons to study double-nucleotide substitutions and to obtain estimates of their frequency. The first approach examined changes that have occurred over long evolutionary time scales, between two particular dinucleotides, TC and AG. Serine is unique among amino acids in that it is encoded by two groups of codons, TCN and AGY, which cannot be interconverted by a single-nucleotide mutation. Switches between these groups of codons could occur indirectly, by two separate single-nucleotide mutations (TC↔AC↔AG or TC↔TG↔AG), or perhaps directly by simultaneous double-nucleotide mutation (TC↔AG). In the former case, the switch would involve an intermediate step whereby the triplet would encode either threonine (ACN) or cysteine (TGY), residues that are ionically and sterically different from serine (6), so such changes are unlikely to be tolerated in critical functional or structural sites of a protein. Nevertheless, TCN↔AGY switches have been observed at sites encoding extremely conserved serine residues, for example in ubiquitin (7) and in the active site of serine proteases (8). Switches at these sites seem most likely to result from simultaneous double-nucleotide mutations, which in this context are synonymous and most likely selectively neutral.
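The codon argument above can be checked mechanically. The following is a minimal sketch, not part of the original analysis: it considers only the first two codon positions, uses the codon groups named in the text (TCN, AGY, ACN, TGY), and the helper names `hamming` and `two_step_paths` are illustrative assumptions.

```python
# First two codon positions -> amino acid, restricted to the codon groups
# named in the text (N = any nucleotide, Y = pyrimidine at the third position).
DINUC_TO_AA = {
    "TC": "Ser (TCN)",
    "AG": "Ser (AGY)",
    "AC": "Thr (ACN)",
    "TG": "Cys (TGY)",
}


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def two_step_paths(start, end):
    """Dinucleotides that are one change away from both start and end."""
    bases = "ACGT"
    candidates = [x + y for x in bases for y in bases]
    return [m for m in candidates
            if hamming(start, m) == 1 and hamming(m, end) == 1]


# TC and AG differ at both positions, so no single change converts one to the other.
print(hamming("TC", "AG"))

# Each indirect route passes through a nonserine intermediate.
for mid in two_step_paths("TC", "AG"):
    print("TC ->", mid, "-> AG via", DINUC_TO_AA[mid])
```

Under these assumptions the only intermediates are AC and TG, matching the threonine and cysteine intermediates described in the text.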

To investigate the generality and frequency of such switches, we studied 23 data sets of homologous proteins containing serine residues absolutely conserved over a wide range of eukaryotes and/or prokaryotes (Fig. 1A). We analyzed the distribution of TCN and AGY codon types in these conserved serines, inferring the position and frequency of codon switches during evolution (illustrated in Fig. 1B) (9). Our analysis reveals a widespread occurrence of codon switches at such sites (Table 1), with an estimated frequency of about 0.1 per site per billion years (94/774 = 0.12 per site per Gyr). This rate appears to be consistent among different phylogenetic lineages and different genes (Fig. 2). Rate estimates from bacteria and eukaryotes are very similar, 0.11 and 0.12 per site per billion years (Gyr), respectively.
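The rate quoted above is a simple ratio, and the arithmetic can be sketched as follows. This assumes that 774 is the total time sampled across all data sets, expressed in site-Gyr (conserved serine sites multiplied by the summed branch lengths), as the Fig. 1B legend suggests for a single data set. The function names are illustrative, not from the original analysis.

```python
def time_sampled(n_sites, branch_length_sum_gyr):
    """Time sampled by a data set, in site-Gyr (assumed: sites x summed branch lengths)."""
    return n_sites * branch_length_sum_gyr


def switch_rate(n_switches, site_gyr):
    """Codon switches per site per Gyr."""
    return n_switches / site_gyr


# Overall estimate reported in the text: 94 switches over 774 site-Gyr.
overall = switch_rate(94, 774)
print(round(overall, 2))  # 0.12

# Example data set from the Fig. 1B legend: 3 conserved serines, 14.71 Gyr of branches.
example_time = time_sampled(3, 14.71)
print(round(example_time, 2))  # 44.13
```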

Figure 1

(A) Overview of the phylogeny and divergence times used for the analysis of serine codon switches. The phylogeny is based on a number of recent phylogenetic analyses (20, 24), with points of uncertainty shown as unresolved polychotomies. Times of common ancestors are indicated in Gyr before present. (B) Determination of serine codon switches. The data set of glutamine fructose-6-phosphate transaminase is shown as an example. There are three sites where serine is absolutely conserved in the protein sequence (alignment sites 780, 953, and 990). At least four codon switches can be observed. The time sampled by this data set (sum of branch lengths) is 3 × 14.71 Gyr.

Figure 2

Rate of observed serine codon switches for 23 proteins. Data are from Table 1. The line has a slope of 0.12 switches per site per Gyr.

Table 1

Rates of serine codon switches in 23 data sets of highly conserved proteins. The phylogenetic assemblages (species) represented in each data set are indicated by numbers as specified in Fig. 1A. The inferred number of codon switches and estimated time sampled by each data set (in Gyr) are indicated.

Of the 70 switches where the direction of change could be inferred (by parsimony and with reference to outgroups), 60 were in the TC→AG rather than the AG→TC direction. However, independent rate estimates for each direction are very similar, 0.10 and 0.11 per site per Gyr, respectively. The bias therefore reflects a preponderance of TCN-type codons as potential targets, rather than a bias in the direction of mutation [this points to a strong codon bias in the ancestral representation of serines (10)].

Most codon switches at such highly conserved serines appear to result from simultaneous double-nucleotide mutations. However, it is conceivable that these switches could occur by two separate single-nucleotide mutations, through intermediates that encode threonine or cysteine. Kimura suggested that slightly deleterious intermediates may sometimes survive to be rescued by rapidly selected compensatory mutations (11), but there are a number of observations that argue against this possibility in this case. First, Kimura's model applies to situations where compensatory mutations are relatively frequent (e.g., when many different mutations can have a compensatory effect) or when the selective coefficient against the intermediates is rather low; both conditions seem very unlikely here. Second, if deleterious alleles were involved, we would expect these to survive much more frequently in the presence of additional copies of the gene, but we observe very similar rates of codon switches in haploid and diploid genomes, as well as in proteins that belong to multigene families (12). Moreover, we have also noticed TCN↔AGY switches among codons encoding highly conserved serines in closely related sequences, with no evidence of a transition through nonserine intermediates (13).

Other mechanisms have also been proposed that could explain switches in serine codons through nondeleterious intermediates (8, 14–16). For example, a transient substitution of serine by another amino acid could be complemented by the presence of a neighboring serine residue (16), an alternative genetic code may have allowed TGN to encode serine (15), or the two types of serine codon may reflect independent origins from a different ancestral amino acid (8). These explanations may apply in special cases and could contribute to a small proportion of codon switches. However, they are unlikely to account for the widespread distribution of codon switches, as observed in diverse phylogenetic lineages, in different proteins, and in serine residues whose position and identity have been absolutely conserved.

In our second approach, we examined double-nucleotide substitutions among noncoding sequences of closely related species. In these sequences, substitutions are expected to accumulate in a manner that is unbiased by selection, and so directly reflect mutational processes. We compared a long (about 7 kb) noncoding sequence from the pseudo eta globin locus of seven closely related catarrhine primates (Fig. 3) (17) to determine whether mutations in that region involve a significant fraction of clustered nucleotide changes (18). Using parsimony analysis, we determined the number of single- and double-nucleotide changes that have occurred during the evolution of these species and found a significant excess of double-nucleotide substitutions relative to what would be expected by coincidence of single-nucleotide changes alone (Table 2). The excess, apparently simultaneous, dinucleotide mutations are estimated to have occurred at a rate of 0.1 per site per Gyr (19), on average, at any nucleotide doublet.
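The logic of comparing observed doublets with those expected by coincidence can be illustrated with a simple null model. This is only a sketch under stated assumptions, not the authors' calculation (the report does not give its formula here): it assumes that the k substituted positions on a branch fall uniformly at random among L aligned sites, in which case the expected number of adjacent substituted pairs is k(k − 1)/L. The function names and the input numbers are hypothetical and are not taken from Table 2.

```python
def expected_adjacent_pairs(k_changed, n_sites):
    """Expected number of adjacent substituted pairs if k distinct sites
    are chosen uniformly at random among n sites of a linear sequence.

    There are (n - 1) adjacent site pairs, and each is fully substituted
    with probability k(k - 1) / (n(n - 1)), giving k(k - 1) / n overall.
    """
    if n_sites < 2 or k_changed < 2:
        return 0.0
    return k_changed * (k_changed - 1) / n_sites


def excess_doublets(observed_doublets, expected_doublets):
    """Observed doublets beyond those expected by coincidence (ObsD - ExpD)."""
    return observed_doublets - expected_doublets


# Hypothetical branch: 7000 aligned sites, 100 substituted positions,
# 5 of the changes observed as adjacent doublets.
exp_d = expected_adjacent_pairs(100, 7000)
real_d = excess_doublets(5, exp_d)
print(round(exp_d, 2), round(real_d, 2))
```

In this toy case fewer than two doublets are expected by chance, so most of the five observed doublets would be counted as excess. A real analysis would also need to account for variation in substitution rate along the sequence.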

Figure 3

Phylogeny of the catarrhine primates (17) used in the analysis of pseudo eta globin sequences (Table 2). Branch lengths are not to scale.

Table 2

Analysis of single- and double-nucleotide substitutions in the pseudo eta globin locus on each branch of the catarrhine primate phylogeny (Fig. 3). Positions of substitutions were inferred by parsimony. L, number of aligned nucleotides; ObsS, ObsD, numbers of changes observed as single- or double-nucleotide substitutions, respectively; ExpD, number of doublet substitutions expected by coincidence of two separate single-nucleotide substitutions; RealD, number of excess double changes, inferred to have occurred as simultaneous double-nucleotide substitutions.

These two analyses are complementary: they examine double-nucleotide substitutions in different contexts and over very different time scales. Any concerns that the serine codon switches might have involved compensatory changes via nonserine intermediates are offset by the observation of similarly high levels of doublet changes in closely related noncoding sequences. Equally, although the rates for all dinucleotide changes were estimated from just one particular region of the primate genome, the rates of TC↔AG changes estimated from serine switches apply to a wide range of loci from diverse organisms. Both approaches point to the conclusion that the rate of double-nucleotide substitutions is high compared to expectations based on the coincidence of individual neutral nucleotide substitutions, which typically occur at a rate of around 1 to 10 per site per Gyr (20, 21).

We expect that the rates of different doublet mutations will vary considerably depending on a cell's exposure to different mutational mechanisms. For example, we would expect to see a much higher incidence of dipyrimidine lesions in cells that are exposed to ultraviolet light (e.g., exposed unicellular organisms, skin cells) than in cells that are not (e.g., the germ line of large multicellular animals). Such differences might explain why the estimated frequency of specific TC→AG and AG→TC substitutions in serine codons, which may involve dipyrimidines (TC in the coding strand or CT in the noncoding strand, respectively), is higher than would be predicted by the average frequency of double-nucleotide substitution estimated from the eta globin pseudogene. The sequence-specificity of mutational mechanisms could result in different rates of substitution among various doublets in different cell types. These observations may be important in the context of models of molecular evolution and phylogenetic reconstruction, as well as mutational mechanisms of human disease.

  • * To whom correspondence should be addressed. E-mail: averof{at} (M.A.) or paul{at} (P.M.S.)

