The Evolutionary Fate and Consequences of Duplicate Genes


Science  10 Nov 2000:
Vol. 290, Issue 5494, pp. 1151-1155
DOI: 10.1126/science.290.5494.1151


Gene duplication has generally been viewed as a necessary source of material for the origin of evolutionary novelties, but it is unclear how often gene duplicates arise and how frequently they evolve new functions. Observations from the genomic databases for several eukaryotic species suggest that duplicate genes arise at a very high rate, on average 0.01 per gene per million years. Most duplicated genes experience a brief period of relaxed selection early in their history, with a moderate fraction of them evolving in an effectively neutral manner during this period. However, the vast majority of gene duplicates are silenced within a few million years, with the few survivors subsequently experiencing strong purifying selection. Although duplicate genes may only rarely evolve new functions, the stochastic silencing of such genes may play a significant role in the passive origin of new species.

Duplications of individual genes, chromosomal segments, or entire genomes have long been thought to be a primary source of material for the origin of evolutionary novelties, including new gene functions and expression patterns (1–3). However, it is unclear how duplicate genes successfully navigate an evolutionary trajectory from an initial state of complete redundancy, wherein one copy is likely to be expendable, to a stable situation in which both copies are maintained by natural selection. Nor is it clear how often these events occur.

Theory suggests three alternative outcomes in the evolution of duplicate genes: (i) one copy may simply become silenced by degenerative mutations (nonfunctionalization); (ii) one copy may acquire a novel, beneficial function and become preserved by natural selection, with the other copy retaining the original function (neofunctionalization); or (iii) both copies may become partially compromised by mutation accumulation to the point at which their total capacity is reduced to the level of the single-copy ancestral gene (subfunctionalization) (1–12). Because the vast majority of mutations affecting fitness are deleterious (13), and because gene duplicates are generally assumed to be functionally redundant at the time of origin, virtually all models predict that the usual fate of a duplicate-gene pair is the nonfunctionalization of one copy. The expected time that elapses before a gene is silenced is thought to be relatively short, on the order of the reciprocal of the null mutation rate per locus (a few million years or less), except in populations with enormous effective sizes (11, 12).

These theoretical expectations are only partially consistent with the limited data that we have on gene duplication. First, comparative studies of nucleotide sequences suggest that although both copies of a gene may often accumulate degenerative mutations at an accelerated rate following a duplication event, selection may not be relaxed completely (14–16). Second, the frequency of duplicate-gene preservation following ancient polyploidization events, often suggested to be in the neighborhood of 30 to 50% over periods of tens to hundreds of millions of years (17–20), is unexpectedly high.

Further insight into the rates of origin of duplicate genes and their evolutionary fates can now be acquired by using the genomic databases that have emerged for several species. We focused on nine taxa for which large numbers of protein-coding sequences are available through electronic databases: human (Homo sapiens), mouse (Mus musculus), chicken (Gallus gallus), nematode (Caenorhabditis elegans), fly (Drosophila melanogaster), the plants Arabidopsis thaliana and Oryza sativa (rice), and the yeast Saccharomyces cerevisiae. For each of these species, the complete set of available open reading frames was screened to eliminate sequences that were unlikely to be functional proteins (21). Each sequence retained after this initial filtering was then compared against all other members of the intraspecific set to identify pairs of gene duplicates, which were then analyzed for the degree of nucleotide divergence (21). The analyses for C. elegans, D. melanogaster, and S. cerevisiae were based on the complete genomic sequences available for these species.

The traditional approach to inferring the magnitude of selective constraint on protein evolution focuses on codons, comparing the rates of nucleotide substitution at replacement and silent sites (7, 15, 16). With this sort of analysis, only the cumulative pattern of nucleotide substitution is identified, making it difficult to determine whether duplicate genes typically undergo different phases of evolutionary divergence, e.g., an early phase of near neutrality followed by a later phase of selective constraint. Some clarification of this issue can be achieved by considering the features of sets of gene duplicates separated by an array of divergence times.

Under the assumption that silent substitutions are largely immune from selection and accumulate at a stochastic rate that is proportional to time, we take the number of substitutions per silent site, S, separating two members of a pair of duplicates to be a measure of the relative age of the pair. Letting R denote the number of substitutions per replacement site, a net (cumulative) selective constraint since the time of origin of a pair of duplicates will be reflected in an R/S ratio < 1, whereas a net acceleration of protein evolution will be revealed by an R/S ratio > 1. Complete relaxation of selection will result in R/S ≈ 1. For the duplicate genes that we have identified, there is often considerable scatter around the neutral expectation when S < 0.05 (Fig. 1), suggesting that early in their history, many gene duplicates experience a phase of relaxed selection or even accelerated evolution at replacement sites. The progressive decline of R/S beyond this point reflects a gradual increase in the magnitude of selective constraint. The vast majority of gene duplicates with S > 0.1 exhibit an R/S ratio ≪ 1.

Figure 1

Cumulative numbers of observed replacement substitutions per replacement site as a function of the number of silent substitutions per silent site. Each point represents a single pair of gene duplicates. The dashed line denotes the expectation under the neutral model, whereas the solid line is the least-squares fit of Eq. 2 to the data (22). Open points denote gene pairs for which the ratio R/S is not significantly different from the neutral expectation of 1.

From the qualitative behavior of the cumulative R/S ratio, some insight into the temporal development of increasing selective constraint on duplicate-gene evolution can be obtained by considering a simple model in which R declines relative to S, according to the function

[Eq. 1: equation image not available]

Under this model, assuming positive m, the ratio of rates of replacement to silent substitutions initiates with an expected value of 1/(ab) at S = 0 (reflecting the evolutionary properties of newly arisen duplicates) and declines to 1/a as S → ∞ (reflecting ancient duplicates). Integrating this equation, the expected cumulative number of substitutions per replacement site (R) can be described as a function of the cumulative number of substitutions per silent site (S),

[Eq. 2: equation image not available]

The parameters a, b, and m can then be estimated by performing least-squares analysis on the pairwise gene-specific estimates of R and S (22).
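The equations themselves are not reproduced in this copy. As a purely illustrative sketch, the code below uses one functional form that matches the limits stated above (a rate ratio of 1/(ab) at S = 0, decaying at rate m toward 1/a as S grows). Both this form and the parameter values are assumptions and are not necessarily the authors' Eq. 1 and Eq. 2.

```python
import math

def rate_ratio(S, a, b, m):
    """Assumed form of dR/dS: equals 1/(a*b) at S = 0 and tends to 1/a as S grows."""
    return 1.0 / (a * (1.0 + (b - 1.0) * math.exp(-m * S)))

def cumulative_R(S, a, b, m):
    """Closed-form integral of rate_ratio from 0 to S (valid for this assumed form only)."""
    return math.log((math.exp(m * S) + b - 1.0) / b) / (a * m)

def trapezoid(f, lo, hi, n=20000):
    """Plain trapezoidal rule, used only to cross-check the closed form."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

# Hypothetical parameter values: a and b are chosen so that the early ratio is
# 0.43 and the asymptotic ratio is 0.053; m is an arbitrary decay constant.
a = 1.0 / 0.053
b = 0.053 / 0.43
m = 10.0

early = rate_ratio(0.0, a, b, m)      # ratio for newly arisen duplicates
late = rate_ratio(100.0, a, b, m)     # ratio for very old duplicates
numeric_R = trapezoid(lambda s: rate_ratio(s, a, b, m), 0.0, 1.0)
closed_R = cumulative_R(1.0, a, b, m)
print(early, late, numeric_R, closed_R)
```

Under this assumed form, fitting a, b, and m to observed (S, R) pairs would amount to minimizing the squared difference between `cumulative_R` and the observed R values.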

Given the inherently stochastic nature of molecular evolutionary processes, Eq. 2 describes the average rate of accumulation of amino acid–replacing substitutions fairly well, explaining more than 50% of the variance in the data in all cases (Fig. 1). Moreover, the pattern is quite similar across species. The estimates of dR/dS at low S are all < 1, with a narrow range of 0.37 to 0.46 and a mean value of 0.43 (SE = 0.01), and dR/dS gradually declines to asymptotic values in the range of 0.022 to 0.106 (mean = 0.053, SE = 0.009) (Table 1). These results imply that, early in their evolutionary history, duplicate genes tend to be under moderate selective constraints with the rate of amino acid substitution averaging about 43% of the neutral expectation. The efficiency of purifying selection subsequently increases approximately 10-fold, to the point at which only about 5% of amino acid–changing mutations are able to rise to fixation.

Table 1

Fitted coefficients for the function describing cumulative replacement substitutions per replacement site versus silent substitutions per silent site, Eq. 2, and for the function describing the rate of loss of young duplicates, Eq. 3. The value r² gives the proportion of variance in the observed values described by the model; standard errors are in parentheses.


Some caveats in the interpretation of these results are in order. First, the nucleotide divergence statistics describe the average pattern of molecular evolution. Individual codons may, in many cases, deviate substantially from the norm. Second, for gene pairs with S > 1, potentially large inaccuracies in the estimates of nucleotide divergence are expected to result from multiple substitutions per site. Nevertheless, as can be seen in Fig. 1, the patterns that we describe are fully apparent within the subset of gene duplicates with S < 1. Third, although we have taken special precautions to avoid the inclusion of nonfunctional gene duplicates in our analyses (21), in the absence of actual expression pattern data, we cannot be certain that all of the genes we have included are functional. However, the fact that most of the pairs that we have identified have R/S < 1 and that many pairs with small S have R/S ≫ 1.0 suggests that we have not inadvertently included many pseudogenes in our analyses.

Assuming that the number of silent substitutions increases approximately linearly with time, the relative age-distribution of gene duplicates within a genome can be inferred indirectly from the distribution of S (23). For all species, the highest density of duplicates is contained within the youngest age classes, with the density dropping off very rapidly with increasing S (Fig. 2). For Arabidopsis, there is a conspicuous secondary peak in the age distribution centered around S = 0.8, which is consistent with conclusions from comparative mapping data that the lineage containing this species experienced an ancient polyploidization event (24). Using an estimated rate of silent-site substitution of 6.1 per silent site per billion years (25), this event dates to approximately 65 million years ago. Unfortunately, this type of analysis cannot shed much light on the debate over whether complete genome duplications preceded the divergence of ray-finned fishes and tetrapods (13, 26–28). With a divergence time between these two lineages at approximately 430 million years ago (29), the average S for a pair of older duplicates would be expected to be in excess of 1.0. Levels of substitution of this magnitude are estimated with a great degree of inaccuracy, which would weaken the signature of ancient genome-duplication events.
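The dating of the Arabidopsis peak can be reproduced with a short calculation. The sketch below assumes that both copies of a pair accumulate silent substitutions independently, so that S grows at twice the per-lineage rate; this factor of 2 is inferred from the numbers quoted above rather than stated explicitly, and the function name is illustrative.

```python
def divergence_time_my(S, rate_per_site_per_by):
    """Age in million years of a duplicate pair with silent-site divergence S.

    Assumes S accumulates at twice the per-lineage silent substitution rate,
    because both copies diverge from their common ancestor.
    """
    rate_per_my = rate_per_site_per_by / 1000.0  # per billion years -> per million years
    return S / (2.0 * rate_per_my)

# Arabidopsis secondary peak at S = 0.8, with 6.1 substitutions per silent site
# per billion years
arabidopsis_event_my = divergence_time_my(0.8, 6.1)
print(round(arabidopsis_event_my, 1))
```

With these inputs the estimate comes out at roughly 65 to 66 million years, in line with the figure given in the text.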

Figure 2

Frequency distributions of pairs of duplicates as a function of the number of silent substitutions per silent site.

For levels of divergence less than S = 0.25, problems with saturation effects in the estimation of substitutions per site should be minimal, and the time scale is short enough that it is reasonable to expect the rate of evolution at silent sites to be approximately constant. If the origin and loss of duplicates is then viewed as having been an essentially steady-state process over the time period S = 0 to 0.25, the rate of loss of gene duplicates can be estimated by using the survivorship function

$$N_S = N_0 e^{-dS} \qquad (3)$$

where N_S is the number of duplicates observed at divergence level S, and N_0 and d are fitted constants obtained by linear regression of the log-transformed data (Fig. 3) (30). For the species for which adequate data are available for analysis, the loss coefficients fall in the range of d = 7 to 24, with a mean value of 13.0 (SE = 2.8) (Table 1). For d = 7, 13, and 24, the half-life of a gene duplicate on the scale of S is 0.099, 0.053, and 0.029, respectively, and 95% loss is expected at 4.3 times these S values. Thus, assuming they are not nonfunctional at the time of origin, most gene duplicates are apparently nonfunctionalized by the time silent sites have diverged by only a few percent.
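The half-lives quoted above follow from an exponential survivorship function, N_S = N_0 e^{-dS}, for which the half-life is ln 2 / d. The sketch below checks those figures and illustrates the log-linear regression on a set of hypothetical, noise-free bin counts; the function names and the synthetic data are assumptions made for illustration, not the authors' data or code.

```python
import math

def half_life_S(d):
    """Divergence S at which half of the duplicates remain, for N_S = N_0 * exp(-d * S)."""
    return math.log(2.0) / d

def fit_survivorship(S_values, counts):
    """Ordinary least-squares fit of ln(N_S) = ln(N_0) - d * S; returns (N_0, d)."""
    ys = [math.log(c) for c in counts]
    n = len(S_values)
    mean_x = sum(S_values) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in S_values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(S_values, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Half-lives on the scale of S for the loss coefficients discussed in the text
half_lives = {d: round(half_life_S(d), 3) for d in (7, 13, 24)}

# 95% loss means N_S / N_0 = 0.05, i.e. S = ln(20) / d, expressed in half-lives
loss_95_multiple = math.log(20.0) / math.log(2.0)

# Hypothetical bin midpoints and counts generated with N_0 = 100 and d = 13
S_bins = [0.025, 0.075, 0.125, 0.175, 0.225]
synthetic_counts = [100.0 * math.exp(-13.0 * s) for s in S_bins]
N0_hat, d_hat = fit_survivorship(S_bins, synthetic_counts)
print(half_lives, round(loss_95_multiple, 1), N0_hat, d_hat)
```

Real counts would be noisy and might include empty bins, which would need handling before taking logarithms.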

Figure 3

Survivorship curves for gene duplicates, based on the complete genomic sequences of C. elegans (•), D. melanogaster (○), and S. cerevisiae (▴). The fitted parameters for these and other species are contained in Table 1.

Some insight into the absolute time to duplicate-gene loss can be acquired for the groups in which estimated rates of nucleotide evolution at silent sites are available. The average estimate of d for mouse and human is 18.9, which, using an average rate of silent substitution in mammalian genes of 2.5 per silent site per billion years (31), translates to a half-life of 7.3 million years. The estimates of d for the two invertebrates Drosophila and Caenorhabditis are very similar, averaging to 7.6. Although a direct estimate of the rate of silent substitution is not available for nematodes, indirect evidence suggests that the rate of molecular evolution in C. elegans is elevated relative to that in other invertebrates (32). Using the estimated rate of silent-site substitution in Drosophila of 15.6 per silent site per billion years (7), we obtain a possibly upwardly biased estimate of 2.9 million years as the average half-life of duplicate genes in invertebrates. For Arabidopsis, d = 17.6, which translates into a half-life of 3.2 million years using the silent substitution rate cited above.
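These conversions from d to an absolute half-life can be sketched as follows, assuming exponential loss on the scale of S (half-life ln 2 / d) and assuming that S accumulates at twice the per-lineage silent substitution rate. Both assumptions are inferred from the quoted numbers, and the function name is illustrative.

```python
import math

def half_life_my(d, rate_per_site_per_by):
    """Half-life of a gene duplicate in million years.

    d: loss coefficient on the scale of S.
    rate_per_site_per_by: silent substitutions per site per billion years.
    """
    s_half = math.log(2.0) / d                   # half-life on the scale of S
    rate_per_my = rate_per_site_per_by / 1000.0  # per billion years -> per million years
    return s_half / (2.0 * rate_per_my)          # both copies diverge, hence the factor 2

mammals = half_life_my(18.9, 2.5)
invertebrates = half_life_my(7.6, 15.6)
arabidopsis = half_life_my(17.6, 6.1)
print(round(mammals, 1), round(invertebrates, 1), round(arabidopsis, 1))
```

The three results round to 7.3, 2.9, and 3.2 million years, matching the values in the text.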

Finally, we note that for the three species for which the complete genomic sequence is available, the rate of origin of gene duplicates can be estimated from the abundance of the very youngest pairs. For D. melanogaster, there are 10 pairs of duplicates with S < 0.01, which translates to a rate of origin of approximately 31 new duplicates per genome per million years, or by using the estimated 13,601 genes per genome (33), to 0.0023 per gene per million years. There are 32 identifiable duplicates in yeast with S < 0.01. Although no direct estimates of the rate of nucleotide substitution exist for fungi, there is no evidence that the fungal rate is very different from that of animals or plants either. Using the average silent substitution rate for mammals, Drosophila, and vascular plants (8.1 per nucleotide site per billion years), the crudely estimated number of new duplicates arising in the yeast genome per million years is 52; with a total genome of approximately 6241 open reading frames, this translates to 0.0083 per gene per million years. The rate of origin of gene duplicates in C. elegans over the past few hundred thousand years appears to be substantially greater than that for D. melanogaster and S. cerevisiae. There are 164 pairs of gene duplicates with S < 0.01 in C. elegans. Again using the rate of silent-site substitution from Drosophila, the rate of origin of new duplicates in this species is at least 383 per genome per million years; with a genome size of approximately 18,424 open reading frames (33), this translates to a per-gene rate of duplication of 0.0208 per million years.
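The Drosophila and yeast figures can be reproduced with the sketch below. It assumes that the window S < 0.01 corresponds to a time span of 0.01 / (2 × rate), with the factor of 2 again inferred from the quoted numbers; the function name is illustrative. The C. elegans value is left out because the text presents it as a lower bound.

```python
def origin_rates(n_young_pairs, s_cutoff, rate_per_site_per_by, n_genes):
    """Return (new duplicates per genome per My, new duplicates per gene per My)."""
    rate_per_my = rate_per_site_per_by / 1000.0   # per billion years -> per million years
    window_my = s_cutoff / (2.0 * rate_per_my)    # time span covered by S < s_cutoff
    per_genome = n_young_pairs / window_my
    return per_genome, per_genome / n_genes

# D. melanogaster: 10 pairs with S < 0.01, 15.6 per site per billion years, 13,601 genes
fly_genome, fly_gene = origin_rates(10, 0.01, 15.6, 13601)

# S. cerevisiae: 32 pairs with S < 0.01, 8.1 per site per billion years, 6241 ORFs
yeast_genome, yeast_gene = origin_rates(32, 0.01, 8.1, 6241)

print(round(fly_genome), round(fly_gene, 4))
print(round(yeast_genome), round(yeast_gene, 4))
```

Because the estimate scales inversely with the assumed time window, any error in the silent substitution rate carries through proportionally to the rate of origin.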

These estimated rates of origin of new gene duplicates could be inflated if gene conversion keeps substantial numbers of older duplicates appearing as if they were younger. Of the young duplicates identified in the previous paragraph, 100% of those in Drosophila, 56% of those in Saccharomyces, and 71% of those in Caenorhabditis are located on the same chromosome. However, although significant, the correlation between S and the physical distance between duplicates residing on the same chromosome tends to be quite weak, and many spatially contiguous gene duplicates are highly divergent (see figure). In addition, a genome-wide analysis of C. elegans suggests that gene-conversion events arise only rarely in duplicate genes and are largely concentrated in multigene families (34). Such multigene families have been excluded from our analyses (21).

These results suggest a conservative estimate of the average rate of origin of new gene duplicates on the order of 0.01 per gene per million years, with rates in different species ranging from about 0.02 down to 0.002. Given this range, 50% of all of the genes in a genome are expected to duplicate and increase to high frequency at least once on time scales of 35 to 350 million years. Thus, even in the absence of direct amplification of entire genomes (polyploidization), gene duplication has the potential to generate substantial molecular substrate for the origin of evolutionary novelties. The rate of duplication of a gene is of the same order of magnitude as the rate of mutation per nucleotide site (7).
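The 35 to 350 million year range can be recovered by treating duplication as a constant-rate process per gene, so that the fraction of genes duplicated at least once by time t is 1 − e^(−rate × t). This constant-rate (Poisson) model is an assumption made for the sketch below, and the function name is illustrative.

```python
import math

def time_to_half_duplicated_my(rate_per_gene_per_my):
    """Time in million years by which half of all genes have duplicated at least once.

    Solves 1 - exp(-rate * t) = 0.5 under an assumed constant per-gene rate.
    """
    return math.log(2.0) / rate_per_gene_per_my

fast = time_to_half_duplicated_my(0.02)    # upper end of the estimated rates
slow = time_to_half_duplicated_my(0.002)   # lower end of the estimated rates
print(round(fast), round(slow))
```

The two values come out near 35 and 347 million years, consistent with the range quoted in the text.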

However, the fate awaiting most gene duplicates appears to be silencing rather than preservation. For the species that we have examined, the average half-life of a gene duplicate is approximately 4 million years, consistent with the theoretical predictions mentioned above (11, 12). The contrast between the high rate of silencing observed in this study and the high level of duplicate-gene preservation that occurs in polyploid species (17–20) may be reconciled if dosage requirements play an important role in the selective environment of gene duplicates. Polyploidization preserves the necessary stoichiometric relationships between gene products, which may be subsequently maintained by stabilizing selection, whereas duplicates of single genes that are out of balance with their interacting partners may be actively opposed by purifying selection.

Despite the rather narrow window of opportunity for evolutionary exploration by gene duplicates, such genes may play a prominent role in the generation of biodiversity by promoting the origin of postmating reproductive barriers (35, 36). Consider a young pair of functionally redundant duplicate genes in an ancestral species. If a geographic isolating event occurs, a random copy will be silenced in the two sister taxa with very high probability within the next 1 to 2 million years. The probability that alternative copies will be silenced in the two sister taxa is 0.5, so if the copies are unlinked and the two taxa are then brought back together, there will be a 0.0625 probability that an F2 derivative will be a double-null homozygote for the two loci. With tens to hundreds of young, unresolved gene duplicates present in most eukaryotic genomes, such genes may provide a common substrate for the passive origin of isolating barriers. Moreover, this process does not simply rely on gene duplicates in ancestral species. With rates of establishment of 0.002 to 0.02 duplicates per gene per million years and a moderate genome size of 15,000 genes, we can expect on the order of 60 to 600 duplicate genes to arise in a pair of sister taxa per million years, many of which will subsequently experience divergent resolution.
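The 0.0625 figure and the 60 to 600 range can be checked by direct enumeration. The sketch below assumes that the two taxa have fixed null alleles at alternative, unlinked loci, so that the F1 hybrid is heterozygous (functional/null) at both loci; variable and function names are illustrative.

```python
from fractions import Fraction
from itertools import product

ALLELES = ("functional", "null")

# An F1 hybrid carries one functional and one null allele at each of the two loci.
# With unlinked loci, its four gamete types (allele at locus A, allele at locus B)
# are equally likely.
gametes = list(product(ALLELES, repeat=2))

# Each F2 individual is formed from two independently drawn F1 gametes.
f2_offspring = list(product(gametes, repeat=2))
double_null = [
    pair for pair in f2_offspring
    if all(allele == "null" for gamete in pair for allele in gamete)
]
p_double_null = Fraction(len(double_null), len(f2_offspring))

def new_duplicates_per_my(rate_per_gene_per_my, n_genes=15000, n_taxa=2):
    """Expected number of duplicates arising per million years across the sister taxa."""
    return rate_per_gene_per_my * n_genes * n_taxa

low = new_duplicates_per_my(0.002)
high = new_duplicates_per_my(0.02)
print(p_double_null, float(p_double_null), low, high)
```

Only one of the 16 equally likely gamete pairings is null at both loci on both chromosomes, which gives 1/16 = 0.0625.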

The passive build-up of reproductive isolation induced by gene duplicates, with no loss (and in most cases, no gain) of fitness in sister taxa, provides a simple mechanism for speciation that is consistent with the Bateson-Dobzhansky-Muller model (37), without requiring the presence of negative epistatic interactions between gene products derived from isolated genomes. The microchromosomal repatterning induced by recurrent gene duplication is also consistent with the chromosomal model for speciation (38), without requiring the large-scale rearrangements that are typically thought to be necessary (39). Finally, the time scale of the process is consistent with what we know about the average time to postreproductive isolation (40, 41).

* To whom correspondence should be addressed. E-mail: mlynch{at}

