Research Article

Genetic Definition and Sequence Analysis of Arabidopsis Centromeres


Science  24 Dec 1999:
Vol. 286, Issue 5449, pp. 2468-2474
DOI: 10.1126/science.286.5449.2468


High-precision genetic mapping was used to define the regions that contain centromere functions on each natural chromosome in Arabidopsis thaliana. These regions exhibited dramatic recombinational repression and contained complex DNA surrounding large arrays of 180–base pair repeats. Unexpectedly, the DNA within the centromeres was not merely structural but also encoded several expressed genes. The regions flanking the centromeres were densely populated by repetitive elements yet experienced normal levels of recombination. The genetically defined centromeres were well conserved among Arabidopsis ecotypes but displayed limited sequence homology between different chromosomes, excluding repetitive DNA. This investigation provides a platform for dissecting the role of individual sequences in centromeres in higher eukaryotes.

Centromeres mediate chromosome segregation during mitosis and meiosis by nucleating kinetochore formation, providing a target for spindle attachment, and maintaining sister chromatid cohesion. Although centromere function in lower eukaryotes requires defined DNA sequences, identifying similar essential elements in higher eukaryotes has been a long-standing challenge (1, 2). Different criteria have been used to define centromeres in higher organisms. Cytogeneticists and cell biologists have used specific DNA probes and distinct DNA binding proteins to characterize the primary chromosomal constriction, delimiting regions encompassing several megabases of DNA. In contrast, geneticists and evolutionary biologists have characterized centromere activity by monitoring reduced recombination and chromosome segregation. Here, we used a genetic method, tetrad analysis, to localize regions conferring centromere activity within natural Arabidopsis thaliana chromosomes. This technique reveals crossovers between genetic markers and centromeres, pinpointing regions on paired homologous chromosomes that always migrate to opposite poles during meiosis I (3). We used tetrad analysis to monitor marker assortment on each chromosome in >1000 meioses, identifying the boundaries of all five Arabidopsis centromeres.

In some multicellular organisms, artificial chromosome constructs (4) or chromosome fragments (5) can recapitulate centromere functions, which suggests that specific DNA sequences are necessary. However, complete DNA sequence information is not available for these constructs, which makes it difficult to discern the contributions of individual sequence elements and chromosome context. Alternatively, centromere function in multicellular organisms also may be determined by epigenetic factors, including DNA modification, secondary structure, association with specialized chromatin components, or differential timing of replication (6). Here we analyzed the virtually complete sequence of two Arabidopsis centromeres, providing an unparalleled view of centromere composition and enabling a comprehensive analysis of the sequence motifs, DNA modifications, and structural features that contribute to centromere function.

Physical Mapping of Centromeric Regions

Previously, DNA fingerprint and hybridization analysis of two bacterial artificial chromosome (BAC) libraries enabled the assembly of physical maps covering nearly all single-copy portions of the Arabidopsis genome (7). However, repetitive DNA near the Arabidopsis centromeres, including 180–base pair (bp) repeats, retroelements, and middle repetitive sequences (8, 9), complicated efforts to anchor contiguous centromeric BAC clones (contigs) to particular chromosomes. We used genetic mapping to unambiguously assign these unanchored contigs to specific centromeres (Fig. 1), scoring polymorphic markers in 48 plants with crossovers informative for the entire genome (3). In this manner, we connected several centromeric contigs to the physical maps of the chromosome arms and simultaneously generated a large set of DNA markers useful for defining centromere boundaries. For chromosomes II and IV, DNA sequence analysis confirmed the structure of these contigs (10).

Figure 1

Physical maps of the genetically defined Arabidopsis centromeres. Physical sizes were derived from DNA sequencing (chromosomes II and IV) or from estimates based on BAC fingerprinting (chromosomes I, III, and V) (7). Positions of markers used to confirm contig structure (above), the number of tetratype/total tetrads at those markers (below), the boundaries of the centromere (thick black bars), and the name of contigs derived from fingerprint analysis are indicated for each chromosome (7). For each contig, more than two genetic markers, developed from the database of BAC-end sequences (27), were scored. PCR primers corresponding to these sequences were used to identify size or restriction site polymorphisms in the Columbia and Landsberg ecotypes (28); primer sequences are available (29). Tetratype tetrads/total tetrads resulting from treatments that stimulate crossing over (boxes); positions of markers in centimorgans shared with the recombinant inbred (RI) map (ovals) (14); and sequences bordering gaps in the physical map that correspond to 180-bp repeats (open circles), 5S rDNA (black circles), or 160-bp repeats (gray circles) are indicated.

Although this analysis substantially extends the understanding of the centromeric regions, gaps in the physical maps remain at each centromere. BAC clones near these gaps have end sequences corresponding to repetitive elements that likely constitute the bulk of the DNA between contigs, including 180-bp repeats, 5S ribosomal DNA (rDNA), or 160-bp repeats (Fig. 1). Fluorescence in situ hybridization has shown that these repetitive sequences are abundant components of Arabidopsis centromeres (8). Genetic mapping and pulsed-field gel electrophoresis indicate that many 180-bp repeats reside in long arrays that measure between 0.4 and 1.4 Mb in the centromeric regions (11); sequence analysis revealed additional interspersed copies near the gaps (10).

Genetic Mapping of Centromere Functions

To determine which portions of the centromeric regions participate in centromere function, we used tetrad analysis, monitoring centromere marker assortment through individual meioses (Fig. 1). In Arabidopsis, this is possible with quartet1 (qrt1), a mutation that causes the four products of male meiosis to be released as a tetrad of pollen grains (12). We generated plants useful for tetrad analysis by crossing qrt strains from the Landsberg and Columbia ecotypes and pollinating Landsberg stigmas with individual pollen tetrads from F1 plants (3). Crosses typically yielded three or four progeny plants per tetrad; we analyzed the assortment of DNA polymorphisms in the progeny from >1000 tetrads.

Monitoring the position of crossovers in this population identified chromosomal regions that could be separated by recombination from centromeres (tetratype) as well as regions that always cosegregated with centromeres (ditype) (3). Tetratype frequencies decrease to zero at the centromere; consequently, we defined centromere boundaries as the positions that exhibited small but detectable numbers of tetratype patterns. Scoring the segregation of centromere-linked markers in about 400 tetrads localized centromeres 1 to 5 (CEN1–CEN5) to regions on the physical map corresponding to contigs of 550, 1445, 1600, 1790, and 1770 kb, respectively. Analysis of polymorphisms corresponding to 180-bp repeats (RCEN markers) (11) confirmed that these repeats map within the genetically defined centromeres (13). In genetic units, the centromere intervals averaged 0.44 centimorgan (cM) (percent recombination = 1/2 tetratype frequency), reflecting recombination rates at least 10 to 30 times below the genomic average of 221 kb/cM (14).
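The relation quoted above (percent recombination = 1/2 tetratype frequency) and the comparison with the genomic average of 221 kb/cM can be made concrete with a short calculation. The following is a minimal sketch, not the authors' analysis code: the function names and the example counts (4 tetratype tetrads out of 400, over a 1445-kb interval) are hypothetical and chosen only for illustration.

```python
GENOME_AVG_KB_PER_CM = 221.0  # genomic average quoted in the text (14)


def map_distance_cm(tetratype: int, total: int) -> float:
    """Marker-to-centromere distance in cM: 100 * (1/2) * tetratype frequency."""
    if total <= 0:
        raise ValueError("total tetrads must be positive")
    if not 0 <= tetratype <= total:
        raise ValueError("tetratype count must lie between 0 and total")
    return 100.0 * 0.5 * tetratype / total


def repression_fold(interval_kb: float, distance_cm: float,
                    genome_avg: float = GENOME_AVG_KB_PER_CM) -> float:
    """How many times more kb per cM the interval has than the genomic average."""
    if distance_cm <= 0:
        raise ValueError("distance must be positive to form a kb/cM ratio")
    return (interval_kb / distance_cm) / genome_avg


# Hypothetical example values, not data from the study.
distance = map_distance_cm(tetratype=4, total=400)
fold = repression_fold(interval_kb=1445.0, distance_cm=distance)
print(f"{distance:.2f} cM, {fold:.1f}-fold below the genomic average")
```

With these illustrative inputs the interval spans 0.5 cM, which corresponds to a recombination rate roughly 13-fold below the genomic average. An interval with no observed tetratypes has no finite kb/cM ratio, which is why the sketch rejects a distance of zero rather than returning a number.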

The low recombination frequencies typically observed near higher eukaryotic centromeres may be due to DNA modifications or unusual chromatin states (15, 16). We attempted to modify these states and thus improve centromere mapping resolution by raising recombination frequencies. F1 Landsberg/Columbia plants were treated with one of a series of compounds known to cause DNA damage, modify chromatin structure, or alter DNA modifications (17). We crossed tetrads from treated plants with Landsberg stigmas and recovered and analyzed progeny from 8 to 107 tetrads subjected to each treatment, yielding >600 additional tetrads. These tetrads exhibited higher recombination in regions immediately flanking the centromeres (1.6% versus 3.4% recombination in untreated and treated plants, respectively), although the sample size was insufficient to determine whether any individual treatment had a profound effect. These efforts refined the map locations of centromeres on chromosomes 2 to 5 (Fig. 1), yielding intervals spanned by contigs of 880, 1150, 1260, and 1070 kb, respectively, with all tetrads consistently localizing centromere functions to the same region.

Efforts to increase recombination yielded a large number of tetrads with crossovers near the centromeres; these crossovers clustered within a narrow region at the centromere boundaries. Five crossovers occurred over a 70-kb region near CEN2, and seven occurred over a 200-kb region near CEN1, yet no crossovers were detected in the adjacent centromeric intervals of 880 and 550 kb, respectively (Fig. 1). Thus, the centromeres were found within large domains that restrict recombination machinery activity; the transition between these domains and surrounding, recombination-proficient DNA is remarkably abrupt (Fig. 2, A and K). Although analysis of more tetrads would yield additional recombination events, the observed distribution of crossovers suggests that centromere positions would not be significantly refined.

Figure 2

Properties of the chromosome II and IV centromeric regions. (Top) Drawing of genetically defined centromeres (gray shading: CEN2, left; CEN4, right), adjacent pericentromeric DNA, and a distal segment of each chromosome, scaled in megabases as determined by DNA sequencing (gaps in gray shading correspond to gaps in the physical maps) (10). Physical distances (megabases) starting at the telomeres of the short arm and also at the centromeric gaps are shown, as are centimorgan positions (RI map). (Bottom) The density of each feature is plotted relative to chromosomal position (megabases). (A and K) Centimorgan positions of markers on the RI map (solid squares) compared with the genomic average of 1 cM/221 kb (dashed line). One crossover within CEN4 occurred in the RI mapping population (14), perhaps reflecting a difference between male meiotic recombination (this study) and recombination in female meiosis. (B to E and L to O) Percent DNA occupied by repetitive elements in a 100-kb sliding window at 10-kb intervals: (B and L) 180-bp repeats; (C and M) sequences with similarity to retroelements, including del, Ta1, Ta11, copia, Athila, LINE, Ty3, TSCL, 106B (Athila-like), Tat1, LTRs, and Cinful (10); (D and N) sequences with similarity to transposons, including Tag1, En/Spm, Ac/Ds, Tam1 MuDR, Limpet, MITES, and mariner (10); (E and O) previously described centromeric repeats including 163A, 164A, 164B, 278A, 11B7RE, mi167, pAT27, 160- and 500-bp repeats, and telomeric sequences (8, 9). (F and P) Percent A+T in a 50-kb sliding window at 25-kb intervals. (G to J and Q to T) Number of predicted genes or pseudogenes in a 100-kb sliding window at 10-kb intervals. Predicted genes (G and Q) and pseudogenes (I and S) typically not found on mobile DNA elements; predicted genes (H and R) and pseudogenes (J and T) often carried on mobile DNA, including reverse transcriptase, transposase, and polyproteins (10).
Annotation was obtained from GenBank records, from the AGAD database, and by BLAST comparisons with the database of repetitive Arabidopsis sequences (24, 30); annotation in progress is noted (dashed lines) (10). Although updates to annotation records may change individual entries, the overall structure of the region will not be significantly altered.

Centromeric DNA Content

The virtually complete and annotated sequence of chromosomes II and IV allows analysis of centromeres at the nucleotide level (10). We examined the sequence composition within the genetically defined centromere boundaries and compared it with adjacent pericentromeric regions (Fig. 2). Analysis of two centromeres facilitates comparisons of sequence patterns and identification of conserved sequence elements. CEN2 and CEN4 are particularly well suited for this analysis; they reside on structurally similar chromosomes with 3.5-Mb rDNA arrays on their distal tips, regions measuring 3 and 2 Mb, respectively, between the rDNA and centromeres, and 16- and 13-Mb regions on their long arms (10, 18).

Repeat Abundance in Centromeric Regions

All higher eukaryotes examined to date contain repetitive DNA in their centromeric regions. If this DNA is sufficient for centromere functions, its abundance outside a genetically defined centromere is likely to be dramatically reduced, thus preventing the formation of multiple centromeres along the chromosome arm. The Arabidopsis 180-bp repeat sequences have such properties. They were found in the gaps of each centromeric contig (Figs. 1 and 2, B and L), with few repeats and no long arrays elsewhere in the genome (10, 11). DNA hybridization experiments have shown that other repeats, including retrotransposons, middle repetitive elements, and telomeric sequences, map near Arabidopsis centromeres (8, 9). The annotated sequence of chromosomes II and IV identified regions with homology to these repeats (10), within both the functional centromeres and the adjacent regions (Fig. 2, B to E and L to O).

Sequences resembling retrotransposons are rare on Arabidopsis chromosome arms, yet these elements are abundant in centromeric regions (10). In a 4.3-Mb sequenced region that includes CEN2 and a 2.7-Mb sequenced region that includes CEN4, retrotransposon homology accounted for >10% of the DNA sequence, with maxima of 62% and 70%, respectively (Fig. 2, C and M). Sequences with similarity to transposons or middle repetitive elements occupied a similar zone but were less common (11% and 29% maximum density for chromosomes II and IV, respectively) (Fig. 2, D, E, N, and O). Finally, low-complexity DNA, including microsatellites, homopolymer tracts, and A+T–rich isochores, is enriched in Drosophila and Neurospora centromeres (19) but not within Arabidopsis centromeres. Near CEN2, simple repeat sequence densities were comparable to those on chromosome arms, occupying 1.5% of the sequence within the centromere and 3.2% in the flanking regions and ranging from 20 to 319 bp in length (71-bp average). Except for an insertion of mitochondrial DNA at CEN2 (10), the DNA in and around the centromeres did not contain any large regions that deviate significantly from the genomic average of 64% A+T (Fig. 2, F and P) (20).
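Figure 2 reports percent A+T in a 50-kb sliding window at 25-kb intervals. A minimal sketch of that kind of sliding-window calculation follows. It is an illustration only: the function name, the toy sequence, and the tiny window and step sizes are assumptions, and the text does not describe the software actually used to produce the plots.

```python
def at_percent_windows(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """Return (window start, percent A+T) for each full window along seq."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    seq = seq.upper()
    results = []
    # Only full-length windows are reported; a trailing partial window is skipped.
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        at_count = chunk.count("A") + chunk.count("T")
        results.append((start, 100.0 * at_count / window))
    return results


# Toy sequence with a 4-base window and a 2-base step (hypothetical values;
# the figure uses window=50_000 and step=25_000 on chromosome sequence).
profile = at_percent_windows("AATTGGCC", window=4, step=2)
for start, percent in profile:
    print(start, percent)
```

The same loop structure applies to the repeat-density panels, with the per-window count of A and T replaced by the number of bases annotated as a given repeat class.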

Repetitive Elements and Centromere Functions

Unlike the 180-bp repeats, all other repetitive elements near CEN2 and CEN4 were less abundant within the genetically defined centromeres than in the flanking regions. The high concentration of repetitive elements outside the functional centromere domain indicates that they are insufficient for centromere activity. Thus, identifying segments of the Arabidopsis genome enriched in these repetitive sequences does not pinpoint regions that provide centromere function; a similar situation may occur in other higher eukaryotic genomes.

Repetitive DNA flanking the centromeres may play an important role, forming an altered chromatin conformation that nucleates or stabilizes centromere structure. Alternatively, other mechanisms could result in the accumulation of repetitive elements near centromeres. Although evolutionary models predict that repetitive DNA accumulates in regions of low recombination (16), many Arabidopsis repetitive elements are more abundant in the recombinationally active pericentromeric regions than in the centromeres themselves. Instead, retroelements and other transposons may insert preferentially into regions flanking centromeres or may be eliminated from the rest of the genome at a higher rate.

Abundance of Genes in Centromeric Regions

Expressed genes are located within 1 kb of essential centromere sequences in Saccharomyces cerevisiae, and multiple copies of tRNA genes reside within an 80-kb fragment necessary for centromere function in Schizosaccharomyces pombe (21). In contrast, genes are thought to be relatively rare in the centromeres of higher eukaryotes, although there are notable exceptions. The Drosophila light, concertina, responder, and rolled loci all map to the centromeric region of chromosome 2, and translocations that remove light from its native heterochromatic context inhibit gene expression (22). In contrast, many Drosophila and human euchromatic genes become inactive when they are inserted near a centromere (22). Thus, genes that reside near centromeres likely have special control elements that allow expression. The sequences of Arabidopsis CEN2 and CEN4 provide a powerful resource for understanding how gene density and expression correlate with centromere position and associated chromatin.

Annotation of chromosomes II and IV identified many genes within and adjacent to CEN2 and CEN4 (10) (Figs. 2 and 3). The abundance of mobile elements resulted in a relatively high frequency of reverse transcriptase, retroviral polyprotein, and transposase genes compared with chromosome arms (Fig. 2, H and R). However, other genes typically not associated with transposable elements were also predicted in the centromeric regions (Fig. 2, G and Q). The density of predicted genes on Arabidopsis chromosome arms averaged 25 per 100 kb (20), and in the repeat-rich regions flanking CEN2 and CEN4 this decreased to 9 and 7 genes per 100 kb, respectively. Many predicted genes also were found within the recombination-deficient, genetically defined centromeres. Within CEN2, there were 5 predicted genes per 100 kb; CEN4 was strikingly different, with 12 genes per 100 kb.

There is strong evidence that several predicted centromeric genes are transcribed. The phosphoenolpyruvate phosphate translocator gene (CUE1) defines one CEN5 border; mutations in this gene cause defects in light-regulated gene expression (23). Within the sequenced portions of CEN2 and CEN4, 17% (27/160) of the predicted genes share >95% identity with cloned cDNAs (expressed sequence tags), with threefold more matches in CEN4 than in CEN2 (Table 1) (10, 24). Twenty-four of these genes have multiple exons, and four correspond to single-copy genes with known functions (Table 1). To investigate whether the remaining 23 genes are uniquely encoded at the centromere, we queried the database of annotated genomic Arabidopsis sequence. With two exceptions, no homologs with >95% identity were found elsewhere in the 80% of the genome that has been sequenced (Table 1). The number of independent cDNA clones corresponding to a single-copy gene provides an estimate of gene expression levels. On chromosome II, predicted genes highly similar to entries in the cDNA database (>95% identity) matched an average of four independently derived cDNA clones (range, 1 to 78). Within CEN2 and CEN4, 11 of 27 genes exceeded this average (Table 1). Finally, most genes encoded at CEN2 and CEN4 were not related to each other, nor did they correspond to genes predicted to play a role in centromere functions; instead they have diverse roles (Table 1).

Table 1

Predicted genes within CEN2 and CEN4 that correspond to the cDNA database. EST = expressed sequence tag.


Many genes in the Arabidopsis centromeric regions appear to be nonfunctional because of early stop codons or disrupted open reading frames, whereas few pseudogenes reside on the chromosome arms (10). Although many of these pseudogenes exhibit similarity to mobile elements, some correspond to genes that typically are not mobile (Fig. 2, I, J, S, and T). Within the genetically defined centromeres there were 1.0 (CEN2) and 0.7 (CEN4) nonmobile pseudogenes per 100 kb; the repeat-rich regions bordering the centromeres had 1.5 and 0.9 genes per 100 kb, respectively. The distributions of pseudogenes and transposable elements are overlapping, which suggests that DNA insertions in these regions contributed to gene disruptions.

Model for Centromere Expansion

Arabidopsis centromeres contain a central region of 180-bp repeats surrounded by moderately repetitive DNA with dramatically reduced recombination. Flanking this genetically defined centromere are regions with normal recombination levels that are highly enriched in mobile elements. The abundance of repetitive sequences suggests that insertion events have contributed to substantial structural change of the centromeric regions over evolutionary time. Modern Arabidopsis centromeres potentially evolved from smaller domains composed of unique or low-copy sequences, and the accumulation of insertions presumably generated the larger, more repetitive regions observed today. In this view, integration of mobile elements could separate domains important for centromere function.

There may be constraints that limit centromere growth. For example, essential centromeric domains may require an arrangement suitable for assembly into higher order structures. In addition, mechanisms that inhibit crossing over in centromeric regions are likely required to prevent unequal sister chromatid exchanges that cause imbalances in critical DNA elements (16). Reduced recombination might be achieved by delaying replication or pairing of centromeric DNA or by limiting access of the recombination machinery through specialized chromatin structures. Thus, centromere structure is likely shaped by the action of mobile DNA element insertions balanced by selective pressures that maintain centromere function.

Conservation of Centromeric DNA

Saccharomyces cerevisiae centromeres on homologous chromosomes are highly conserved among different strains (25). However, the abundance of mobile DNA elements at Arabidopsis centromeres could contribute to a high sequence divergence among ecotypes. To investigate the conservation of CEN2 and CEN4, we designed polymerase chain reaction (PCR) primer pairs corresponding to unique regions in the Columbia sequence and surveyed the centromeric regions of Landsberg and Columbia at about 20-kb intervals (Fig. 4). We obtained amplification products of the same length in both ecotypes for most primer pairs (80%), which indicates that the amplified regions are highly similar (Fig. 4), and 5% of the products revealed a size polymorphism. In the remaining cases, primer pairs amplified Columbia but not Landsberg DNA, even at low stringencies. In these regions, additional primers were designed to determine the extent of nonhomology. In addition to a large insertion of mitochondrial DNA in CEN2 (10), we identified two other nonconserved regions (Fig. 4). Because this DNA is absent from the Landsberg centromeres, it is unlikely to be required for centromere function; consequently, the relevant portion of the centromeric sequence is reduced to 577 kb (CEN2) and 1250 kb (CEN4). Extensive sequence conservation between Landsberg and Columbia centromeres indicates that reduced recombination frequencies are not the result of large regions of nonhomology but instead are a property of the centromeres themselves.

Figure 3

Sequence features at CEN2 (A) and CEN4 (B). Central bars depict annotated genomic sequence of indicated BAC clones (10); black, genetically defined centromeres; white, regions flanking the centromeres; //, gaps in physical maps. Sequences corresponding to genes and repetitive features, filled boxes (above and below the bars, respectively), are defined as in Fig. 2.

Figure 4

Conservation of centromere DNA. BAC clones (bars) sequenced in CEN2 (A) and CEN4 (B) are indicated; arrows denote boundaries of the genetically defined centromeres. PCR primer pairs yielding products from only Columbia (filled circles) or from both Landsberg and Columbia (open circles); BACs encoding DNA with homology to the mitochondrial genome (gray bars); 180-bp repeats (gray boxes); unannotated DNA (dashed lines); and gaps in the physical map (double slashes) are shown.

Sequence Similarity Between CEN2 and CEN4

Discerning the rules that govern centromere function will likely require analysis of both primary DNA sequence and higher order structures. As a first step, we searched for previously unidentified sequence motifs shared between CEN2 and CEN4, excluding retroelements, transposons, characterized centromeric repeats, and coding sequences resembling mobile genes (10). After masking additional repetitive sequences, including homopolymer tracts and microsatellites, we compared contigs of 417 kb (CEN2) and 851 kb (CEN4) with BLAST (25).

This comparison showed that the complex DNA within the centromere regions is not highly homologous; only 16 DNA segments in CEN2 matched 11 regions in CEN4 with >60% identity (Fig. 5). These homologous sequences comprise a total of 17 kb (4%) of CEN2, have an average length of 1017 bp, and have an A+T content of 65%. On the basis of their similarity, we sorted the matching sequences into groups including two families that contain eight sequences each (AtCCS1 and AtCCS2), three sequences from a small family (AtCCS3), and four sequences found once within each centromere (AtCCS4-AtCCS7), one of which (AtCCS6) corresponds to predicted CEN2 and CEN4 proteins with similarity throughout their exons and introns (Fig. 5). Searches of the Arabidopsis genomic sequence database demonstrated that AtCCS1 to AtCCS5 are moderately repetitive sequences that appear in centromeric and pericentromeric regions; the remaining sequences are present only in the genetically defined centromeres.
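A summary such as "17 kb (4%) of CEN2" is the kind of figure obtained by keeping alignment hits above an identity threshold and summing the non-overlapping query bases they cover. The sketch below assumes the hits have already been parsed into (start, end, percent identity) tuples with half-open coordinates; that tuple layout, the function name, and the toy values are hypothetical and do not reproduce the actual BLAST output or the authors' procedure.

```python
def covered_fraction(hits, query_length, min_identity=60.0):
    """Fraction of the query covered by hits whose identity exceeds min_identity.

    hits: iterable of (start, end, percent_identity), half-open [start, end).
    Overlapping hits are merged so shared bases are counted once.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    kept = sorted((s, e) for s, e, ident in hits if ident > min_identity)
    covered = 0
    cur_start = cur_end = None
    for s, e in kept:
        if cur_end is None or s > cur_end:
            # Close the previous merged interval before starting a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / query_length


# Hypothetical hits on a 10,000-base query: two overlapping hits pass the
# threshold, one hit at 55% identity is discarded.
toy_hits = [(0, 1000, 72.0), (500, 1500, 65.0), (3000, 3200, 55.0)]
print(covered_fraction(toy_hits, query_length=10_000))
```

For scale, 17 kb out of the 417-kb CEN2 contig compared here is about 4.1%, consistent with the 4% quoted above.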

Figure 5

Sequences common to CEN2 and CEN4. Genetically defined centromeres (bold lines), sequenced BAC clones (thin lines), and unannotated BAC clones (dashed lines) are displayed as in Fig. 4. Repeats AtCCS1 (A. thaliana centromere conserved sequence) and AtCCS2 (closed and open circles, respectively); AtCCS3 (triangles); and AtCCS4-7 (4–7) are indicated (GenBank accession numbers AF204874 to AF204880) and were identified with BLAST 2.0 (31).

Similar comparisons of all 16 S. cerevisiae centromeres defined a consensus consisting of a conserved 8-bp CDEI motif, an A+T–rich 85-bp CDEII element, and a 26-bp CDEIII region with seven highly conserved nucleotides (25). In contrast, surveys of the three S. pombe centromeres revealed conservation of overall centromere structure but no universally conserved motifs (2). Additional tests will be required to determine whether the conserved sequences we identified in the Arabidopsis centromeres contribute to function.


By combining genetic analysis with investigation of DNA sequence, we have defined chromosomal regions with specific properties. They confer meiotic centromere activity and are dramatically deficient in recombination. Structurally, they are composed of moderately repetitive DNA and a core of 180-bp repeats embedded in a highly repetitive pericentromeric region. Further analysis of these regions will yield insights into the role of specific binding proteins, assembly of unique chromatin structures, and altered patterns of DNA modification, replication, and pairing. Moreover, these studies provide a platform for identification of the minimal sequence that provides centromere function. Such sequences might be spread across the entire genetically defined region, be concentrated at a discrete point, or exist as redundant copies within the centromere (26).

The centromeres of other multicellular eukaryotes, like those of Arabidopsis, may harbor numerous expressed genes that specify important functions. Investigating how genes are maintained in recombination-deficient, repeat-rich regions will improve the understanding of genome evolution. Obtaining the sequences of centromeres from a diverse array of organisms will elucidate the general mechanisms that govern centromere function.

  • * To whom correspondence should be addressed. E-mail: dpreuss{at}

