Research Article

The Origins of Genomic Duplications in Arabidopsis


Science  15 Dec 2000:
Vol. 290, Issue 5499, pp. 2114-2117
DOI: 10.1126/science.290.5499.2114


Large segmental duplications cover much of the Arabidopsis thaliana genome. Little is known about their origins. We show that they are primarily due to at least four different large-scale duplication events that occurred 100 to 200 million years ago, a formative period in the diversification of the angiosperms. A better understanding of the complex structural history of angiosperm genomes is necessary to make full use of Arabidopsis as a genetic model for other plant species.

A. thaliana has one of the smallest angiosperm genomes (1). It is a well-behaved diploid with only five haploid chromosomes (2). Despite this, much of the genome is internally duplicated (3–14). It has been hypothesized that the duplicated blocks originated in a single polyploidy event and have since been scrambled by chromosomal rearrangements (5). This hypothesis predicts that each region of the Arabidopsis genome should be present in exactly two copies. Recent comparative mapping results suggest that some regions are present in three or more copies (6, 8), but it is not clear how prevalent such regions are. Here, we use the nearly complete genome sequence of Arabidopsis to study the evolutionary origins of duplicated blocks on a genome-wide scale. Sequence and map data used in our analyses, along with more detailed results, are available on the Internet (15).

Duplicated blocks were identified by the presence of neighboring genes with high sequence similarity to neighboring genes elsewhere in the genome. We considered only protein-coding genes because little conservation exists between noncoding duplicated regions in Arabidopsis (5, 16). We used BLAST to identify genes with high sequence similarity (17, 18). Our data set contained 20,269 composite open reading frames (cORFs), of which 2796 represented tandem arrays of related genes or the same gene present on overlapping clones (19). After removing low-quality matches (20), there were matches between 18,569 pairs of cORFs; 64% of the cORFs had at least one match. To identify duplicated blocks, we considered the proximity and transcriptional orientation of matches in both segments (21). We allowed singleton (nonmatching) genes within duplicated blocks because gene loss is known to follow duplication (8, 22) and because some genes may be transposed from their original positions. We also did not prohibit inversions within duplicated blocks because small-scale inversions are not uncommon in eukaryotic genomes (8, 23). To ensure that spurious duplicated blocks would not be identified, we chose a conservative set of parameter values based on the outcome of randomization tests (24).
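As an illustration only, the proximity criterion described above can be sketched as a simple chaining procedure. This is not the authors' algorithm (their procedure and parameter values are given in notes 21 and 24, which are not reproduced here). The function name `find_blocks`, the use of gene-order indices as coordinates, and the `max_gap` and `min_matches` defaults are all assumptions made for the example, and transcriptional orientation is ignored for brevity.

```python
from collections import defaultdict


def find_blocks(matches, max_gap=5, min_matches=3):
    """Group matches into candidate duplicated blocks (illustrative sketch).

    matches: list of ((chrom_a, idx_a), (chrom_b, idx_b)), where idx is the
             gene-order position of a cORF on its chromosome (an assumption).
    Two matches are chained when they join the same pair of chromosomes and
    both ends lie within max_gap genes of each other. Using absolute
    differences tolerates singletons and local inversions inside a block.
    """
    parent = list(range(len(matches)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Quadratic pairwise comparison: fine for a sketch, slow for a genome.
    for i in range(len(matches)):
        (ca1, ia1), (cb1, ib1) = matches[i]
        for j in range(i + 1, len(matches)):
            (ca2, ia2), (cb2, ib2) = matches[j]
            same_chromosomes = ca1 == ca2 and cb1 == cb2
            close = abs(ia1 - ia2) <= max_gap and abs(ib1 - ib2) <= max_gap
            if same_chromosomes and close:
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i, match in enumerate(matches):
        groups[find(i)].append(match)

    # Keep only chains with enough matches to count as a block.
    return sorted(sorted(g) for g in groups.values() if len(g) >= min_matches)


# Hypothetical toy input: three chained matches and one isolated match.
toy_matches = [
    ((1, 10), (3, 50)),
    ((1, 12), (3, 53)),
    ((1, 15), (3, 51)),
    ((2, 5), (4, 9)),
]
toy_blocks = find_blocks(toy_matches)
```

In a real analysis the thresholds would be calibrated against shuffled gene orders, in the spirit of the randomization tests mentioned above, so that chains of the chosen size rarely arise by chance.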

We identified 103 duplicated blocks containing seven or more matching cORFs (Fig. 1). Candidates with fewer than seven genes are much more abundant in real than randomized data, suggesting that many smaller blocks may be present. There are duplications between all chromosomes except chromosome 2 with itself. Over 81% of cORFs fall within the bounds of at least one block. However, only 28% of these are actually present in duplicate. Interestingly, pericentromeric genes account for much of the genome that is not covered by blocks. Nearly 25% of all cORFs fall within two or more blocks, and one region, near ATAP22 on chromosome 4, falls within five blocks. Such extensive overlap among blocks provides prima facie evidence for multiple duplication events.

Figure 1

Genomic map of duplicated blocks in Arabidopsis. The two copies of each putative duplicated block (e.g., 1a and 1b) are shown. Color denotes age class (red, A; blue, B; green, C; purple, D; orange, E; gray, F). Centromeres (9, 10, 42) are shown with black circles, and ribosomal DNA with white circles. Direction of arrowhead indicates the predominant relative orientation of duplicated cORFs within each block (right, direct; left, inverted). Landmarks are given at 200 cORF intervals.

The number of independent duplication events can be inferred from patterns of sequence divergence between duplicated genes. A single polyploidization event will produce a unimodal distribution of divergence estimates with homogeneity among blocks. Many small independent events can also result in unimodality but with heterogeneity among blocks. A limited number of asynchronous, independent duplication events will produce a multimodal distribution.

The median estimated amino acid divergence (d A) was between 0.325 and 0.725 amino acid substitutions per site (Δaa/site) for all but nine blocks (Fig. 2) (25). Excluding these nine, there is still significant among-block heterogeneity in d A (26). Thus, we reject the hypothesis that they belong to a single age class. The best-fit mixture of normal distributions to these 94 medians is trimodal (27). Each block can be assigned to one of the three age classes (labeled C, D, and E in progressing order of age) with greater than 50% posterior probability. The remaining blocks can be assigned, ad hoc, to two younger age classes (A and B) and one older age class (F). Some spatial overlap remains between duplicated blocks within age classes (Fig. 1), but this may be an artifact of erroneous age class assignments and spatially overextended blocks.
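The assignment of a block to an age class by posterior probability can be sketched as follows. This is a minimal illustration of how posteriors are computed under a mixture of normal distributions once the mixture has been fitted. The component weights, means, and standard deviations in `EXAMPLE_COMPONENTS` are placeholders invented for the example, not the fitted values from note 27, and the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior(x, components):
    """Posterior probability of each mixture component given a block median x.

    components: dict mapping label -> (weight, mean, sd).
    """
    joint = {k: w * normal_pdf(x, m, s) for k, (w, m, s) in components.items()}
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}


def assign(x, components):
    """Return (most probable label, its posterior probability)."""
    post = posterior(x, components)
    label = max(post, key=post.get)
    return label, post[label]


# PLACEHOLDER parameters for illustration only (not taken from the article).
EXAMPLE_COMPONENTS = {
    "C": (0.4, 0.40, 0.03),
    "D": (0.4, 0.52, 0.03),
    "E": (0.2, 0.65, 0.03),
}

label, prob = assign(0.41, EXAMPLE_COMPONENTS)
```

A block whose median lies midway between two component means receives a posterior near 50% for each, which is one way the erroneous age class assignments mentioned above could arise.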

Figure 2

Multimodal distribution of block ages. The distribution of median d A values among blocks. Age classes are marked with arrows. An approximate time scale is given.

The blocks in age class C collectively bound 48% of the cORFs in the genome. Adding the number of duplicated pairs of cORFs to singleton cORFs, we estimate that more than 9000 cORFs were duplicated at this time (28). This is far larger than chromosome 1, the largest of the present complement, which we estimate to contain fewer than 6000 genes (29). Thus, it is likely that age class C represents either a whole-genome polyploidy event or the near-simultaneous duplication of multiple chromosomes.

The progressively older blocks in age classes D, E, and F bound 39, 11, and 3% of the cORFs in the genome, respectively. The true extent of these more ancient duplications is likely to be much greater because gene loss after each duplication event tends to obscure older blocks. Thus, these older age classes may also represent very large-scale duplication events.

We estimated the absolute ages of the duplicated blocks by assuming that the average extent of amino acid substitution (d A) is linearly related to time (30). The average d A values for age classes B through F (Table 1) yield age estimates of approximately 50, 100, 140, 170, and 200 million years ago (Mya), respectively. Thus, age classes C through F appear to date from the Mesozoic Era (65 to 245 Mya). Age class E is marginally older than the reported age of divergence between rosid and asterid eudicots, 112 to 156 Mya, whereas age class F is within the estimated time window for the divergence of monocots and dicots, 180 to 220 Mya (31–33). Thus, the older duplicated blocks reported here are likely to be a common feature of diverse groups of angiosperms. Regions contained within blocks 45, 48, 85, 88, and 100 were recently found to share common ancestry with a 105-kilobase genomic sequence from tomato, an asterid (8). Phylogenetic analysis in that study suggested that the rosid-asterid divergence occurred before the events leading to age classes C and E (duplicated blocks 45 and 85, respectively) and was nearly contemporaneous with age class E. Thus, the divergence and duplication dates are consistent. The two blocks assigned to age class A are likely to be artifacts of erroneous genome assembly. In both cases, the copies are nearly identical even at the nucleotide level and are restricted to individual large-insert clones (34). Thus, the youngest duplicated block appears to be the sole member of age class B.

Table 1

Features of the five age classes of duplicated blocks. d A is the minimum change in amino acids per site between dispersed duplicated cORFs, averaged among all cORFs. Retained duplicates, ratio of presently duplicated to inferred ancestral cORFs. Block size, mean number of cORFs (including singletons) per copy.


If the Arabidopsis genome has experienced multiple large-scale duplications, then the present complement of five chromosomes suggests a history of chromosome fusions. In fact, three such fusions have occurred since A. thaliana diverged from its closest extant relatives (35). Subchromosomal rearrangements, such as inversions and translocations, are expected to cause the average size of duplicated blocks to decrease with age, as is observed for blocks C through F (Table 1). Inversions can, in some cases, be inferred from our data set by the orientation of neighboring blocks or by gene order and orientation within blocks. Reciprocal translocations would be expected to conserve the orientation of blocks relative to the centromere (22) but, contrary to an earlier report (5), we do not see an excess of blocks with identical orientations in any age class.

We have estimated the number of deleted genes in each duplicated block by counting the number of singleton cORFs (36). The proportion of deleted genes increases with the inferred age of the duplication event (Table 1). A small number of blocks deviate significantly from a 1:1 distribution of singletons between the two copies, suggesting that the loss of duplicate genes between segments may sometimes be biased, as previously observed (5).
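One common way to ask whether singletons are split 1:1 between the two copies of a block is an exact two-sided binomial test with success probability 0.5. The sketch below shows that test; it is an assumption that a test of this kind is appropriate here, since the authors' own procedure is described in note 36, which is not reproduced. The function name and the example counts are hypothetical.

```python
import math


def singleton_bias_p_value(singletons_copy_a, singletons_copy_b):
    """Exact two-sided binomial test of a 1:1 split of singleton cORFs.

    Under the null hypothesis each singleton is equally likely to fall in
    either copy of the block, so the count in copy A is Binomial(n, 0.5).
    """
    n = singletons_copy_a + singletons_copy_b
    if n == 0:
        return 1.0
    smaller = min(singletons_copy_a, singletons_copy_b)
    # Probability of a split at least as lopsided in one direction.
    tail = sum(math.comb(n, i) for i in range(smaller + 1)) / 2 ** n
    # The distribution is symmetric at p = 0.5, so double the tail; the cap
    # handles the perfectly even split, where the two tails overlap.
    return min(1.0, 2.0 * tail)


# Hypothetical counts: 2 singletons in one copy, 8 in the other.
p_value = singleton_bias_p_value(2, 8)
```

With many blocks tested at once, a few small p-values are expected by chance, so a correction for multiple comparisons would be needed before calling any single block biased.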

The 103 duplicated blocks account for only 15% of the matches in the data set. Some proportion of the remaining matches may lie within undetected duplicated blocks or may have been transposed from their original position. Still, the remaining matches are not randomly dispersed in the genome, suggesting the presence of a separate gene duplication process. Matches not in blocks are 20% more likely to occur on the same chromosome than if the distribution were proportional to size. Of those matches on the same chromosome, the average distance is 86% of that expected between two random points. Similar findings have been reported for Caenorhabditis elegans (37), a genome that lacks large-scale duplications.
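The baseline for the distance comparison can be made concrete. If two distinct genes are drawn at random from a chromosome of n genes, the expected separation in gene-order units is (n + 1)/3, the discrete counterpart of the L/3 expected between two random points on a segment of length L. The sketch below computes this and checks it by enumeration. Measuring distance in gene-order units is an assumption, since the text does not state the unit, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def expected_distance(n_genes):
    """Expected |i - j| for two distinct random positions among n_genes."""
    return Fraction(n_genes + 1, 3)


def brute_force_expected_distance(n_genes):
    """Same quantity by enumerating every unordered pair of positions."""
    pairs = list(combinations(range(n_genes), 2))
    return Fraction(sum(j - i for i, j in pairs), len(pairs))


def relative_distance(observed_mean, n_genes):
    """Observed mean distance as a fraction of the random expectation."""
    return observed_mean / float(expected_distance(n_genes))


# The closed form agrees with enumeration on small chromosomes.
agrees = all(
    expected_distance(n) == brute_force_expected_distance(n)
    for n in range(2, 30)
)
```

A ratio below 1 from `relative_distance` indicates that matching genes sit closer together than random placement would predict, which is the pattern reported above.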

Many insertion mutants in Arabidopsis have no obvious phenotypic effect (38). This may be due, in part, to redundant functions among duplicated genes. One example appears to be the shatterproof genes SHP1 and SHP2, MADS-box regulatory factors that must be simultaneously removed before fruit nondehiscence is observed (39). SHP1 and SHP2, on chromosomes 3 and 2, are within duplicated block 67 in age class C.

Our analysis implies that regions of the Arabidopsis genome homoeologous to genomes that diverged from Arabidopsis before ∼100 Mya will be small, generally less than ∼10 centimorgans in size. This, coupled with massive gene loss in Arabidopsis, has likely been responsible for the difficulty in identifying regions homoeologous between Arabidopsis and rice (40). Knowledge of the duplication history of Arabidopsis should facilitate such mapping efforts. For example, it has recently been proposed that segments of Arabidopsis chromosome 4 and rice chromosome 2 are homoeologous (41). The Arabidopsis segment is within duplicated block 92 (age class D), implying that the rice segment is also homoeologous to a part of Arabidopsis chromosome 5.

Our understanding of plant evolution and our use of Arabidopsis as a genetic model for other plants will clearly depend on a deeper appreciation for the complex duplication history of this small genome.

* To whom correspondence should be addressed. E-mail: tv23{at}

