Very Low Gene Duplication Rate in the Yeast Genome

Science  19 Nov 2004:
Vol. 306, Issue 5700, pp. 1367-1370
DOI: 10.1126/science.1102033


The gene duplication rate in the yeast genome is estimated without assuming the molecular clock model to be ∼0.01 to 0.06 per gene per billion years; this rate is two orders of magnitude lower than a previous estimate based on the molecular clock model. This difference is explained by extensive concerted evolution via gene conversion between duplicated genes, which violates the assumption of the molecular clock in the analyses of duplicated genes. The average length of the period of concerted evolution and the gene conversion rate are estimated to be ∼25 million years and ∼28 times the mutation rate, respectively.

Gene duplication is considered to be an important mechanism for the generation of genomic novelty (1, 2). A crucial question is the rate at which gene duplication occurs. Lynch and Conery (3) estimated this rate with the use of complete genome sequences of three model eukaryote species, including yeast. Because comparative genomic data were not available at that time, the molecular clock model (4) was assumed in order to estimate the rate from a single genomic sequence. Their estimates of the gene duplication rate per gene were surprisingly high, roughly on the order of one per 100 million years. However, although the molecular clock model should be reasonably accurate when applied to sequence data between species (2), it is not clear whether the molecular clock works for the divergence between duplicated genes, as gene conversion may homogenize interlocus variation. This nonindependent evolution of copy members in a multigene family is known as concerted evolution (5–7). If concerted evolution is a common phenomenon, a molecular clock–based estimate of the gene duplication rate is expected to be inflated (8). Here, we report a method for estimating the gene duplication rate without the molecular clock assumption, and we derive a much lower rate of gene duplication.

The complete genome sequences of Saccharomyces cerevisiae (9) and six of its relatives (10, 11) were used to estimate the gene duplication rate without assuming the molecular clock model. On the basis of these genomic data, a highly reliable species tree is provided (12) (Fig. 1A). Ks, the synonymous nucleotide divergence, is used to measure the time of the speciation events, which are denoted by T1, T2, T3, T4, T5, and T6 in chronological order. In this study, we “mapped” the timings of recent gene duplication events in the intervals between the nodes on the species tree, and from this we derived the rate of duplication.

Fig. 1.

(A) Species tree for the seven yeast species (S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, S. bayanus, S. castellii, and S. kluyveri) estimated from 106 widely distributed orthologous genes (12). Using these 106 genes, the average synonymous nucleotide divergence (Ks) was estimated by the PAML package (27). (B) Strategy for identifying the orthologs of a pair of duplicated genes, X and Y. (C) Frequency distribution of Ks between gene pairs in S. cerevisiae.

In the S. cerevisiae genome, we identified 68 complete duplicated genes (i.e., two-copy gene families) with Ks < 1.05 (13). Duplicated copies for most of them are located on different chromosomes. Because Ks = 1.05 approximately corresponds to the divergence between S. cerevisiae and S. bayanus, most of these duplications should have occurred after T4 under the prediction of the molecular clock model. The two duplicated genes for each pair were randomly denoted by X and Y (Fig. 1B). The two adjacent genes (A and B for X; C and D for Y) were used to examine whether the orthologs of X and Y exist in the whole-genome draft sequences of the six relatives of S. cerevisiae (13). The results appear in table S1 (a portion of the results is shown in Table 1).

Table 1.

List of the studied duplicate gene pairs of S. cerevisiae and their orthology in its six relatives. See table S1 for the full list of the investigated gene pairs.

No.  X          Y          Ks      S. paradoxus     S. mikatae       S. kudriavzevii  S. bayanus       S. castellii     S. kluyveri      Tm*
1†   YHR053C    YHR055C    0       AXB/0            AXB/0            AX-/0            AXB/0            ---/0            ---/0            0
2    YCL066W    YCR040W    0       AXB/-YD/2        ---/CYD/2        -XB/CYD×2/1      AXB/-YD/1        A-B/---/0        ---/---/0        4
3    YNL019C    YNL033W    0       AXB/-YD/1        ---/-YD/0        ---/CYD/0        AX-/CYD/0        ---/---/0        ---/---/0        1
4    YAR064W    YHR213W-B  0       ---/-YD/3        ---/---/0        ---/---/1        ---/---/0        ---/---/0        ---/---/0        0
5    YFL061W    YNL335W    0.0053  ---/-YD/0        ---/---/1        ---/---/0        ---/---/0        ---/---/0        ---/---/0        0
6    YGL135W    YPL220W    0.0209  AXB/CYD/0        AXB/CYD/0        -XB/-YD/0        AXB/CY-/0        A-B/CYD/0        AX-/---/0        4
7    YOR390W    YPL279C    0.0209  AXB/---/0        A-B/---/1        -XB/---/1        ---/---/1        ---/---/0        ---/---/0        0
8    YDL136W    YDL191W    0.0265  AXB/CYD/0        -XB/CY-/0        AXB/CY-/0        AXB/CYD/0        A-B/CYD/0        ---/CYD/0        4
9    YBR181C    YPL090C    0.0298  ---/CYD/1        AX-/CYD/1        AXB/CYD/0        AXB/---/1        ---/CYD/1        ---/---/1        3
10   YNL018C    YNL034W    0.0329  AXB/CY-/1        ---/CY-,-YD/0    ---/CYD/0        ---/CYD/1        ---/---/0        ---/---/0        1
11   YBR031W    YDR012W    0.0354  AXB/CYD/0        AXB/CY-,-YD/0    AXD/CYB/0        AXD/CYB/0        AXD/---/0        AX-/---/0        4
12   YHR141C    YNL162W    0.0677  AXB/CYD/0        -XB/CYD/0        AXB/CYD/0        AXB/CYD/0        AXB/-YD/0        ---/---/1        5
13   YHR203C    YJR145C    0.0699  AXB/CYD/0        AXB/CYD/0        AXB/---/2        AXB/CY-/0        ---/---/2        ---/---/1        4
14   YER074W    YIL069C    0.0699  AXB/CYD/0        AXB/CYD/0        AXB/CYD/0        AXB/CYD/0        ---/CY-/1        ---/---/1        4
15   YBL072C    YER102W    0.0947  AXB/CYD/0        AX-/CYD/0        AX-/CYD/0        AX-/---/1        ---/CY-/1        ---/---/1        3
16   YBR009C    YNL030W    0.1339  AXB/CYD/0        AXB/CYD/0        ---/---/2        -XB/CYD/0        -XB/CYD×2/0      ---/CY-/1        5
17   YHL001W    YKL006W    0.1343  AXB/CYD/0        AX-/-YD/1        AXB/CYD/0        AXB/CY-/0        -XB/CYD/0        ---/---/1        5
18   YIL018W    YFR031C-A  0.1412  AXB/CYD/0        AX-/CYD/0        AXB/CYD/0        AXB/CY-/0        -XB/---/1        ---/CY-/0        4
19   YGR085C    YPR102C    0.1445  AXB/CYD/0        AX-,-XB/CYD/0    AXB/CYD/0        AXB/CYD/0        ---/---/1        ---/---/1        4
20   YHL033C    YLL045C    0.1457  AXB/CYD/0        AXB/CY-/0        AXB/CYD/0        AXB/CYD/0        ---/---/1        ---/---/0        4

* Only the subscript of Tm is shown.

† Tandem duplicated genes for which the gene order of X, Y, and markers is given by AXYB in S. cerevisiae.

Each entry in Table 1 (and table S1) lists the candidate orthologs of X and Y, followed by the number of BLAST hits of the focal duplicates (X and Y) for which orthology could not be established. For example, the entry for the ninth pair of duplicated genes (X: YBR181C; Y: YPL090C) indicates that the focal pair has two BLAST hits in S. kudriavzevii, one flanked by A and B and the other by C and D (AXB/CYD/0). The synteny around X and Y is therefore assumed to be conserved in S. kudriavzevii, which suggests that the duplication event is older than T3, because independent duplications inserted into the same gene interval should be extremely unlikely. Support is provided by the result for S. mikatae (AX-/CYD/1), where orthologous regions of both members of the focal gene pair are found (with an extra BLAST hit for the duplicates), although we were not able to identify the ortholog of X in S. paradoxus (---/CYD/1). For S. bayanus, S. castellii, and S. kluyveri, it was not possible to obtain sufficient evidence that the synteny around X and Y is conserved. In this way, we can estimate Tm, the minimum age of the duplication event (in this case, T3).
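The entry notation described above can be decoded mechanically. The following is a minimal sketch, assuming the non-tandem entry format `X-side/Y-side/unresolved hits` used in Table 1; the names `Entry` and `parse_entry` are hypothetical and are not taken from the authors' pipeline. It only reports which species show both orthologs. The Tm values in Table 1 involve further judgment (described in the supporting material), and not every row follows from this check alone.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    x_found: bool         # an X ortholog was located beside at least one flanking marker
    y_found: bool         # a Y ortholog was located beside at least one flanking marker
    unresolved_hits: int  # BLAST hits whose orthology could not be established

def parse_entry(text: str) -> Entry:
    """Decode a non-tandem Table 1 entry such as 'AX-/CYD/1'."""
    x_side, y_side, hits = text.split("/")
    return Entry("X" in x_side, "Y" in y_side, int(hits))

# Species in the column order of Table 1 (nodes T1 to T6).
SPECIES = ["S. paradoxus", "S. mikatae", "S. kudriavzevii",
           "S. bayanus", "S. castellii", "S. kluyveri"]

# Ninth gene pair (X: YBR181C; Y: YPL090C), copied from Table 1.
pair_9 = ["---/CYD/1", "AX-/CYD/1", "AXB/CYD/0", "AXB/---/1", "---/CYD/1", "---/---/1"]

parsed = [parse_entry(e) for e in pair_9]
both = [sp for sp, e in zip(SPECIES, parsed) if e.x_found and e.y_found]
print(both)  # ['S. mikatae', 'S. kudriavzevii']
```

For this pair, the most distant species in which both orthologs are found is S. kudriavzevii, which agrees with the Tm = T3 assignment discussed in the text.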

The gene duplication rate can be directly estimated from Tm without assuming the molecular clock model. Only one duplication event can be confidently placed between T0 and T1. For this pair (first gene pair, X: YHR053C; Y: YHR055C), the ancestral state at T1 was inferred to be AXB, and a tandem gene duplication event created AXYB in S. cerevisiae. Although there are several pairs with Tm = T0, it is not clear whether they were created after T1. Therefore, the number of gene duplication events between T0 and T1 may be from one to five. Given that the average Ks between S. cerevisiae and S. paradoxus is ∼0.36 (i.e., ∼0.18 along each lineage), the ratio of the gene duplication rate per genome to the synonymous substitution rate per site is estimated to be (1 to 5)/0.18 = 5.6 to 28. If the synonymous substitution rate per site per year is assumed to be 8.1 × 10⁻⁹ (3) and the number of single-copy genes in the yeast genome is ∼3500 (9), the gene duplication rate is ∼0.01 to 0.06 per gene per billion years.
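The arithmetic behind this estimate can be retraced in a few lines. This is a sketch under the values quoted in the text; the constant and function names are hypothetical, and halving Ks to obtain the single-lineage divergence is inferred from the 0.18 used above.

```python
KS_CER_PAR = 0.36     # average Ks between S. cerevisiae and S. paradoxus (from the text)
SUBST_RATE = 8.1e-9   # synonymous substitutions per site per year (from the text)
N_SINGLE_COPY = 3500  # approximate number of single-copy genes (from the text)

def duplication_rate_per_gene_per_gyr(n_events: int) -> float:
    """Nonclock duplication rate for n_events duplications placed between T0 and T1."""
    lineage_divergence = KS_CER_PAR / 2              # 0.18, one lineage since T1
    per_genome_per_subst = n_events / lineage_divergence
    per_genome_per_year = per_genome_per_subst * SUBST_RATE
    return per_genome_per_year / N_SINGLE_COPY * 1e9  # per gene per billion years

low = duplication_rate_per_gene_per_gyr(1)
high = duplication_rate_per_gene_per_gyr(5)
print(f"{low:.3f} to {high:.3f} per gene per billion years")
```

With one and five events, the result falls at roughly 0.01 and 0.06 per gene per billion years, matching the range reported above.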

We also used our data to estimate the gene duplication rate with the molecular clock–based method, because our estimate is not directly comparable to the estimate reported by Lynch and Conery (3). Because there are five gene pairs with Ks < 0.01, the gene duplication rate is estimated to be 2.3 per gene per billion years, two orders of magnitude larger than our nonclock-based estimate. This difference is highly significant. If we suppose that our nonclock-based estimate is correct, the expected number of duplicated genes with Ks < 0.01 is less than 0.14. The probability of observing five or more gene duplication events is then less than 4 × 10⁻⁷. Note that by “gene duplication rate” we mean the rate at which a duplicated gene is created by mutation and becomes fixed in the population. The fixation probability of duplicated genes should be strongly affected by natural selection (14).
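The clock-based figure and the significance statement can be approximated as follows. This is a sketch, not the authors' calculation: it assumes that Ks between two copies accumulates at twice the per-lineage substitution rate (which reproduces the 2.3 quoted above) and that the number of young duplicates is Poisson distributed. All names are hypothetical.

```python
import math

SUBST_RATE = 8.1e-9   # synonymous substitutions per site per year (from the text)
N_SINGLE_COPY = 3500  # approximate number of single-copy genes (from the text)
KS_CUTOFF = 0.01      # pairs below this Ks are treated as "young" under the clock
N_YOUNG_PAIRS = 5     # observed pairs with Ks < 0.01 (from the text)

# Assumption: divergence between the two copies grows at 2 * SUBST_RATE per year.
window_years = KS_CUTOFF / (2 * SUBST_RATE)
clock_rate_per_gyr = N_YOUNG_PAIRS / (N_SINGLE_COPY * window_years) * 1e9

def poisson_tail(k_min: int, mean: float) -> float:
    """P(X >= k_min) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean**k / math.factorial(k) for k in range(k_min))
    return 1.0 - below

# Upper bound on the expected count under the nonclock rate (from the text).
p_value = poisson_tail(N_YOUNG_PAIRS, 0.14)
print(f"clock-based rate: {clock_rate_per_gyr:.1f} per gene per billion years")
print(f"P(5 or more young pairs): {p_value:.1e}")
```

Under these assumptions the tail probability stays just below the 4 × 10⁻⁷ bound given in the text.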

The difference between the molecular clock–based and nonclock-based estimates could be explained by extensive concerted evolution via gene conversion. With gene conversion, many old duplicated genes can appear as if they are young. Note that the molecular clock predicts that duplicated genes with low nucleotide divergence are “young.” This inflates a molecular clock–based estimate of the gene duplication rate, because that estimate depends on the number of duplicated genes that look young. Evidence for extensive gene conversion is provided by two estimated gene trees (Fig. 2). With high bootstrap support, the two duplicates in each species are more closely related to each other than to their orthologs in other species. We therefore tested the applicability of the molecular clock model to duplicated genes. The frequency distribution of Ks between two duplicated genes for each Tm is shown in Fig. 1C. Most pairs (55/68 = 81%) have Tm ≥ T4, whereas the molecular clock predicts that most duplicated genes should be younger than T4. We used 66 gene pairs with Ks + 1.96 × SD < 1.05, where SD is the standard deviation of Ks. For these pairs, because the probability that the real age predates T4 may be <0.025 under the null hypothesis, the expected number of gene pairs with Tm ≥ T4 should be smaller than 66 × 0.025 = 1.65. Even with this conservative value, our observation of 55 pairs with Tm ≥ T4 clearly rejects the null hypothesis (P = 10⁻⁷⁶). Figure 1C indicates that duplicated genes created before T4 may have very small values of Ks; this suggests that it is very difficult to estimate the age of a duplication event from Ks, although some gene pairs with low Ks may be young. Another piece of evidence for extensive gene conversion is that many of the analyzed gene pairs should have been created by the whole-genome duplication event that occurred 100 to 150 million years ago (15). Of the 68 gene pairs, we found that 37 were in the genome duplication blocks recently identified by Kellis et al. (16). Some of them have very low values of Ks, in contrast to the expectation of Ks ≫ 1 under the molecular clock hypothesis.
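The order of magnitude of the P value for the clock test can be checked with a binomial tail. This is a sketch only: the text does not name the exact test, so the binomial model and the assumption that all 55 old pairs fall among the 66 tested pairs are ours. The function name is hypothetical.

```python
import math

def log10_binomial_tail(n: int, k_min: int, p: float) -> float:
    """log10 of P(X >= k_min) for X ~ Binomial(n, p), summed in log space."""
    log_terms = [
        math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1 - p)
        for k in range(k_min, n + 1)
    ]
    peak = max(log_terms)  # factor out the largest term to avoid underflow
    log_sum = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return log_sum / math.log(10)

N_TESTED = 66  # pairs with Ks + 1.96 * SD < 1.05 (from the text)
N_OLD = 55     # pairs with Tm >= T4 (from the text; assumed to lie among the 66)
P_NULL = 0.025 # upper bound on P(age predates T4) under the clock (from the text)

expected_old = N_TESTED * P_NULL
log10_p = log10_binomial_tail(N_TESTED, N_OLD, P_NULL)
print(f"expected old pairs under the clock: {expected_old:.2f}")
print(f"log10 P(55 or more old pairs): {log10_p:.1f}")
```

The log-space summation is a design choice: the individual terms are far smaller than anything that can be usefully subtracted from 1 in floating point.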

Fig. 2.

Evidence for extensive concerted evolution shown in neighbor-joining (NJ) gene trees. (A) Gene tree for the orthologs of YGL135W and YPL220W (the sixth gene pair in Table 1). (B) Gene tree for the orthologs of YDL136W and YDL191W (the eighth gene pair in Table 1).

Many old duplicate genes that look young also inflate a molecular clock–based estimate of the gene loss rate (8). The gene loss rate can be estimated from the age distribution of duplicated genes in a genome, so the estimate is very sensitive to the estimated ages. If we underestimate ages [e.g., by assuming the molecular clock (3)], the distribution is skewed toward younger genes, causing an overestimation of the gene loss rate. Although we were not able to estimate the gene loss rate from our data, the gene loss rate may not be much higher than the gene duplication rate. We have estimated the gene duplication rate from duplicated genes with Tm ≤ T1 and Tm ≤ T2, and these estimates (0.007 to 0.05 per billion years and 0.005 to 0.04 per billion years, respectively) are similar to the estimate from genes with Tm = T0, which is not expected if the gene loss rate is much higher than the gene duplication rate.

The data also allow us to quantify the duration of concerted evolution and the level of gene conversion. The question of the duration of concerted evolution is addressed according to the recent theoretical result of Teshima and Innan (8), who showed that the period of concerted evolution approximately follows an exponential distribution with parameter 1/τ, where τ is the expected length of concerted evolution. The probability (f) that the duration of concerted evolution from a certain time point (ts) exceeds another time point (te) is given by exp[–(te – ts)/τ]. To estimate τ assuming a constant τ for all gene pairs, we considered two time points, T4 and T1, on the species tree (Fig. 1A). We focused on the 51 gene pairs for which concerted evolution was likely occurring at T4. For each of these 51 gene pairs, we considered whether concerted evolution was still going on at T1 by comparing Ks among four gene sequences, two from S. cerevisiae and two from S. paradoxus. For nine gene pairs, Ks was smaller between the paralogs within species than between orthologs; these pairs were considered to be under concerted evolution at T1. We could then estimate f = 9/51, from which an estimate of τ = 0.2 was obtained by solving exp(–0.35/τ) = f, where 0.35 is the time between T1 and T4 measured in units of 1/Ks. Assuming a synonymous substitution rate of 8.1 × 10⁻⁹ per site per year, the estimate yields τ = 25 million years (13).
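Solving the exponential survival relation for τ takes one line. The sketch below uses only the values quoted in the text; the variable names are hypothetical, and the conversion to years simply divides by the assumed substitution rate.

```python
import math

N_AT_T4 = 51          # pairs likely under concerted evolution at T4 (from the text)
N_AT_T1 = 9           # of those, pairs still under concerted evolution at T1
ELAPSED = 0.35        # time between T4 and T1 in the text's substitution-based units
SUBST_RATE = 8.1e-9   # synonymous substitutions per site per year (from the text)

f = N_AT_T1 / N_AT_T4           # fraction whose concerted evolution lasted past T1
tau = -ELAPSED / math.log(f)    # from exp(-ELAPSED / tau) = f
tau_years = tau / SUBST_RATE    # convert to years

print(f"f = {f:.3f}, tau = {tau:.2f}, tau = {tau_years / 1e6:.0f} million years")
```

The result is close to τ = 0.2, or about 25 million years, as stated above.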

The rate of gene conversion is one of the important factors in determining the period of concerted evolution. The gene conversion rate can be directly estimated from the nucleotide divergence between gene pairs currently under concerted evolution. Because it was not possible to determine such gene pairs, we used the nine gene pairs that are likely under concerted evolution at T1 as a proxy, for which the average d = 0.036. The expectation of d is given by μ/c, where μ is the mutation rate per site and c is the gene conversion rate per site (17, 18); hence, we estimate that the gene conversion rate is ∼28 times the mutation rate, assuming that c is constant for all duplicated genes (13). This is within the range of estimates (10 to 100) in Drosophila duplicated genes (18).
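Since the expected divergence is d = μ/c, the conversion rate relative to the mutation rate is 1/d. A brief sketch with the value quoted above (the variable names are hypothetical):

```python
MEAN_D = 0.036  # average divergence of the nine proxy pairs (from the text)

# E[d] = mu / c, so c / mu = 1 / d.
conversion_to_mutation_ratio = 1.0 / MEAN_D

# Range reported for Drosophila duplicated genes (from the text).
within_drosophila_range = 10 <= conversion_to_mutation_ratio <= 100

print(f"c / mu is about {conversion_to_mutation_ratio:.0f}")
print(f"within the Drosophila range of 10 to 100: {within_drosophila_range}")
```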

Our demonstration of extensive concerted evolution via gene conversion on a genome scale is consistent with molecular genetic studies showing frequent interlocus gene conversion in yeast (19). Although yeast is a model species for studying gene conversion, there is no reason to believe that the effect of gene conversion in duplicated genes is negligible in other organisms. Increasing evidence for gene conversion (interlocus as well as intralocus) is also available in higher eukaryotes, such as humans (20–23), Drosophila (18, 24, 25), and other species (26).

Supporting Online Material

Materials and Methods

SOM Text

Table S1

Figs. S1 and S2


    References and Notes
