Report

In-Depth View of Structure, Activity, and Evolution of Rice Chromosome 10

Science  06 Jun 2003:
Vol. 300, Issue 5625, pp. 1566-1569
DOI: 10.1126/science.1083523

This article has a correction.

Abstract

Rice is the world's most important food crop and a model for cereal research. At 430 megabases in size, its genome is the most compact of the cereals. We report the sequence of chromosome 10, the smallest of the 12 rice chromosomes (22.4 megabases), which contains 3471 genes. Chromosome 10 contains considerable heterochromatin with an enrichment of repetitive elements on 10S and an enrichment of expressed genes on 10L. Multiple insertions from organellar genomes were detected. Collinearity was apparent between rice chromosome 10 and sorghum and maize. Comparison between the draft and finished sequence demonstrates the importance of finished sequence.

Rice (Oryza sativa) has been cultivated for more than 9000 years and is a major food staple for over 50% of the human population. Rice is considered a model system for plant biology largely because of its compact genome (430 Mb) and evolutionary relationships with other large genome cereals such as sorghum (750 Mb), maize (2500 Mb), barley (5000 Mb), and wheat (15,000 Mb) (1, 2). Both whole-genome draft sequences (3–5) and finished sequences for rice chromosomes 1 and 4 (6, 7) have been reported. Here, we present a detailed analysis of rice chromosome 10, which is the smallest of the 12 rice chromosomes with a genetic length of 83.7 centimorgans (cM) (8) and an estimated physical length of 23 Mb (∼5.2% of the genome) (9).

A pseudomolecule representing 22,422,563 base pairs (bp) of unique, non-overlapping sequence of chromosome 10, derived from 202 clones (10), was constructed by resolving discrepancies between overlapping bacterial artificial chromosomes (BACs), trimming regions of overlap, and linking the unique sequences to form a contiguous sequence. The pseudomolecule contains seven physical gaps totaling less than 1 Mb (4% of the total) in addition to gaps at the telomeres and centromere, which represent 130 and ∼500 kb, respectively (Fig. 1). Three physical gaps may be closed by the identification of BACs from a new MboI BAC library. The remaining gaps are located in the highly repetitive heterochromatic portion of chromosome 10, and BAC clones have not been identified in multiple libraries, possibly owing to the instability of the sequences in Escherichia coli or the lack of appropriate restriction enzyme sites in these regions. The complete sequence and annotation of rice chromosome 10 can be found in GenBank (accession number AE016959).
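The trim-and-link step described above can be illustrated with a minimal Python sketch. It assumes the clones are already ordered along the chromosome and that the length of each overlap with the preceding clone is known. The function name `build_pseudomolecule`, the input format, and the toy sequences are hypothetical and are not taken from the study's pipeline.

```python
def build_pseudomolecule(clones):
    """Join ordered clone sequences, trimming the overlap with the previous clone.

    `clones` is a list of (sequence, overlap_with_previous) tuples; the first
    clone has an overlap of 0. This input format is an assumption made for
    illustration only.
    """
    parts = []
    for seq, overlap in clones:
        if overlap < 0 or overlap > len(seq):
            raise ValueError("overlap must be between 0 and the clone length")
        # Keep only the part of this clone not already covered by the previous one.
        parts.append(seq[overlap:])
    return "".join(parts)


# Toy example: three clones whose ends overlap by 4 and 3 bases.
toy_clones = [("ACGTACGTAA", 0), ("GTAACCGGTT", 4), ("GTTTTGACA", 3)]
pseudomolecule = build_pseudomolecule(toy_clones)
print(pseudomolecule, len(pseudomolecule))
```

A real assembly would also have to resolve sequence discrepancies inside each overlap before trimming, which this sketch does not attempt.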

Fig. 1.

Distribution of features on rice chromosome 10. (Top) A pachytene chromosome 10 image. The green signal is derived from fluorescence in situ hybridization with the CentO satellite repeat that is specific to rice centromeres (25). (Bottom) Features of chromosome 10 represented along the length include the position of genetic markers (in cM), physical gaps (centromere in red), gene density, expression density, Tos17 insertion sites, transposable elements (TE), long terminal repeats (LTR), MITEs, tRNA genes, chloroplast insertions (CP; green), and mitochondrial insertions (MT; red). From left to right, the approximate size of each physical gap (in kb) is 30, 119, 69, 500 (centromere, in red), 102, 104, 124, and 30. Note that gaps at the telomere are not shown and are estimated at 80 kb (10S) and 50 kb (10L).

A total of 3471 genes (including transposable elements) and 67 tRNA genes were identified on chromosome 10 (Table 1). The average gene density (1 gene per 6.46 kb) on rice chromosome 10 is comparable to that on rice chromosomes 1 and 4 (6, 7) but lower than in the dicotyledonous plant Arabidopsis thaliana. Compared with Arabidopsis, the G + C content of rice exons is substantially higher and, although the average exon number per gene is lower, the average exon length in rice is longer. We were able to assign annotation to 51.4% of the genes, leaving 8.3% of the genes that only match expressed sequence tags (ESTs) and 40.3% of the genes as hypothetical proteins. On average, the gene density on 10L is higher than on 10S, with regions close to the centromere having the lowest gene density (Fig. 1). With the exception of two tRNA genes located at ∼0.32 Mb (Fig. 1), the remaining tRNA genes are located on 10L. There are two clusters (10.2 Mb and 19.7 Mb) of tRNA genes with 28 and 11 genes, respectively, that are derived from chloroplast-insertion events. Based on high stringency alignments to ESTs, 856 (24.7%) of the 3471 genes on chromosome 10 are expressed (Fig. 1).
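The summary figures in this paragraph follow directly from the counts reported in the text and in Table 1. The short Python check below recomputes them; the variable names are illustrative, and only numbers stated in this report are used.

```python
# Counts reported in the text and in Table 1.
PSEUDOMOLECULE_BP = 22_422_563   # unique, non-overlapping sequence
TOTAL_GENES = 3471
EXPRESSED_GENES = 856            # genes with high-stringency EST alignments
KNOWN, UNKNOWN, HYPOTHETICAL = 1785, 288, 1398

# The three annotation classes should account for every gene.
assert KNOWN + UNKNOWN + HYPOTHETICAL == TOTAL_GENES

bp_per_gene = PSEUDOMOLECULE_BP / TOTAL_GENES
expressed_pct = 100 * EXPRESSED_GENES / TOTAL_GENES
class_pct = {
    name: round(100 * n / TOTAL_GENES, 1)
    for name, n in [("known", KNOWN), ("unknown", UNKNOWN), ("hypothetical", HYPOTHETICAL)]
}

print(f"{bp_per_gene / 1000:.2f} kb per gene")
print(f"{expressed_pct:.1f}% of genes with EST support")
print(class_pct)
```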

Table 1.

Statistics of rice chromosome 10. The rice chromosome 10 statistics were generated in this study. The Arabidopsis genome numbers are from the TIGR ATH1 database (Release 3.0; www.tigr.org/tdb/e2k1/ath1).

Features                                Rice chromosome 10    Arabidopsis whole genome
Total number of BACs/PACs               202                   1,578 (5 chromosomes)
Total BAC length (Mb)                   28.1                  132.9
Total non-overlapping sequence (Mb)     22.4                  117.3
Short arm (Mb)                          7.6
Long arm (Mb)                           14.8
G + C content
    Overall                             43.5%                 35.9%
    Exons                               53.6%                 43.8%
    Introns                             39.4%                 32.6%
    Intergenic regions                  41.2%                 31.7%
    Protein-coding DNA                  53.6%                 44.0%
Total number of genes                   3,471                 29,084
Integrated genetic markers              140 (loci)
Average gene size (bp)                  2,556                 1,975
Average gene density (bp per gene)      6,464                 4,008
Average number of exons per gene        4                     4.9
Average exon size (bp)                  344                   275
Average intron size (bp)                389                   163
Known genes                             1,785 (51.4%)         19,301 (66.4%)
Average known gene size (bp)            3,209                 2,253
Unknown genes                           288 (8.3%)            4,725 (16.2%)
Average unknown gene size (bp)          2,322                 1,967
Hypothetical genes                      1,398 (40.3%)         5,058 (17.4%)
Average hypothetical gene size (bp)     1,771                 1,784
tRNA genes                              67                    611

Repetitive sequences account for ∼18% of the chromosome 10 sequence (table S1; fig. S1). More than 54% of the repetitive sequences were similar to retrotransposons. Of the 2922 sequences that matched retrotransposons, 1473 matched gypsy-like sequences, whereas 244 matched copia-like sequences. However, only 139 gypsy-like sequences and 173 copia-like retrotransposon proteins were flanked by long terminal repeats. Retrotransposon sequences were enriched on 10S and the pericentromeric regions relative to 10L (Fig. 1). MITEs (miniature inverted repeat transposable elements) account for 4.3% of chromosome 10 sequence and are enriched on 10L, consistent with their tendency to locate near genes. Other repetitive sequences present on chromosome 10 are minor in their representation, whereas simple sequence repeats account for ∼3% of the total chromosome 10 sequence (table S2).

Collectively, the frequencies of genes, transcripts, tRNA genes, and repetitive sequences are consistent with cytological data that indicate the short arm of chromosome 10 is one of the most heterochromatic regions of the rice genome (9). Much of chromosome 10, including the entire short arm, is composed of regions where little comparative data exists, perhaps because of the lack of markers and recombination in the area. Also, rice chromosome 10 seems wholly derived from the same common source as a region in the center of wheat chromosome 1 and oat linkage group A (2). Taken together, these data are consistent with the expectation of a large percentage of heterochromatin in rice chromosome 10. In Arabidopsis, 61.3% of the genes are expressed as determined by alignment to ESTs (11, 12), and the low percentage of expressed genes in rice chromosome 10 is striking. This is unlikely to reflect a bias in EST representation between the two species, because the 124,635 rice ESTs used in this analysis were derived from 114 libraries, similar to the 172,783 Arabidopsis ESTs derived from 94 libraries. It may be that genes found in heterochromatin in Arabidopsis are less likely to be expressed than are those in euchromatic regions (13, 14), and this may be true for rice chromosome 10 in which there is a high degree of heterochromatin.

Approximately 67% of rice chromosome 10 proteins have homologs in Arabidopsis (fig. S2). Fewer homologs were found for rice chromosome 10 proteins in other prokaryotic and eukaryotic model organisms. Within the entire rice chromosome 10 protein data set, 29% of the proteins could be assigned a function and 26% could be assigned to a process. The top three function associations are enzyme, nucleic acid binding and ligand binding (fig. S3), whereas the top three process associations are cell growth and maintenance, cell communication, and development. Not surprisingly, of the proteins with no functional assignment, there was an enrichment of hypothetical and unknown genes. Pfam domains were detected in 39.6% of the rice chromosome 10 proteins, whereas 55.3% of the Arabidopsis proteins contained Pfam domains. Because repetitive sequences account for ∼18% of chromosome 10, it was not surprising that transposable element–related domains accounted for four of the top five Pfam families identified. Excluding transposable element–related proteins, Arabidopsis and rice chromosome 10 share three of their top five Pfam domains (protein kinase, leucine-rich repeat, and cytochrome P450) and 6 of their top 10 Pfam domains (table S3). Enriched Pfam domains in rice chromosome 10 relative to the Arabidopsis genome include zinc knuckle, BTB/POZ, and two glutathione S-transferase families that reflect the localized duplication of glutathione S-transferases present on rice 10L (15).

Based on a stringent two-tiered clustering method for paralogous family classification that involves Pfam domains and BLASTP similarities, a total of 398 paralogous families (fig. S4) of 2 or more members containing a total of 1482 proteins were present on chromosome 10, leaving 1983 singletons. Most of the families that have more than 15 members are transposon-related proteins, kinases, cytochrome P-450–related proteins, and disease-resistance proteins, consistent with Pfam domain classification alone. The amplified gene copies account for 24.5% of the total genes (without counting transposon open reading frames) in rice chromosome 10 compared with 17% of the total Arabidopsis gene pool, consistent with a previous assessment (16).
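The report does not spell out the clustering procedure here, so the following Python sketch is only one plausible reading of a two-tiered scheme, not the method used in the study. It assumes that tier 1 groups proteins with identical Pfam domain composition and that tier 2 groups the remaining domain-free proteins by single-linkage over BLASTP hits that already passed some similarity cutoff. The function name, input formats, and toy labels are all hypothetical.

```python
def cluster_families(pfam_domains, blastp_pairs):
    """Group proteins into families (size >= 2) and singletons.

    pfam_domains: dict mapping protein id -> frozenset of domain labels
                  (an empty set means no Pfam domain was detected).
    blastp_pairs: iterable of (protein_a, protein_b) pairs assumed to have
                  passed a similarity cutoff chosen elsewhere.
    """
    parent = {p: p for p in pfam_domains}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    def union(a, b):
        parent[find(a)] = find(b)

    # Tier 1 (assumed): identical, non-empty domain composition.
    representative = {}
    for protein, domains in pfam_domains.items():
        if domains:
            if domains in representative:
                union(protein, representative[domains])
            else:
                representative[domains] = protein

    # Tier 2 (assumed): BLASTP links between proteins lacking Pfam domains.
    for a, b in blastp_pairs:
        if not pfam_domains[a] and not pfam_domains[b]:
            union(a, b)

    groups = {}
    for p in pfam_domains:
        groups.setdefault(find(p), set()).add(p)
    families = [g for g in groups.values() if len(g) >= 2]
    singletons = [next(iter(g)) for g in groups.values() if len(g) == 1]
    return families, singletons


# Toy data with made-up protein ids and domain labels.
toy_domains = {
    "p1": frozenset({"kinase"}), "p2": frozenset({"kinase"}),
    "p3": frozenset(), "p4": frozenset(),
    "p5": frozenset({"p450"}), "p6": frozenset(),
}
families, singletons = cluster_families(toy_domains, [("p3", "p4")])
```

In this toy run, `p1` and `p2` form one family through their shared domain, `p3` and `p4` form another through their BLASTP link, and `p5` and `p6` remain singletons.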

Rice chromosome 10 contains 43 candidate disease resistance genes (R-genes), including examples of the four major classes of plant R-genes (17, 18) (table S4), but no evidence of the TIR-NBS-LRR type of R-genes, which have only been found in dicots (19). The NBS superclass dominates the list, including 16 CC-NBS-LRR, 4 NBS-LRR and 3 CC-NBS members, with 9 Cf-like genes and 9 Xa21-like genes also present. Similar to patterns of R-gene distribution observed in other species (19, 20), 34 R-genes (79%) are located in three clusters (fig. S5).

A large collection of insertion lines has been generated for rice with the endogenous Tos17 retrotransposon (21). A total of 48 insertion sites were identified in chromosome 10 (Fig. 1): 17 insertions were in intergenic regions, 1 in a retrotransposon, and 30 in putative genes. There were four examples of genes with multiple hits, reducing the number of tagged genes to 24. Because chromosome 10 represents ∼5.2% of the genome, the relative number of genes affected by the current collection is low. The most likely explanation is that the Tos17 insertions occur preferentially in low–copy number sequences (21), and because of the large proportion of repetitive sequences on chromosome 10, there is an underrepresentation of insertion events on this chromosome.

We analyzed chromosome 10 for the presence of intrachromosomal segmental duplications. A total of 186 duplicated blocks were identified, most of them quite small, with only 14 blocks containing eight or more genes. Chromosome 10 appears to contain no megabase-scale duplicated segments, and alignments between chromosomes 1, 4, and 10 indicate no recent large-scale duplications between any of these three chromosomes. A total of 28 chloroplast DNA fragments (80 bp minimum) in chromosome 10 had high sequence identity (>95%) with plant chloroplast DNA (Fig. 1), with two large insertions of ∼131 and 33 kb. As described previously (15), the 33-kb chloroplast insertion was derived from the large, single copy and the IRA regions of the rice chloroplast genome, whereas the 131-kb chloroplast insertion contained nearly the entire chloroplast genome sequence. In contrast, only small mitochondrial DNA fragments (57 fragments ranging from 80 to 2552 bp with >95% sequence identity) were present (Fig. 1). Although the chloroplast insertions were enriched on 10L, the mitochondrial insertions were randomly distributed throughout the chromosome. We do not anticipate that the inserted organellar genes will be expressed, because organellar transcription resembles prokaryotic transcription and the requisite machinery is not present in the rice nucleus. Nor do we anticipate that the organellar proteins would be functional, because they would require the appropriate transit sequences for targeting to the respective organelles.
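The thresholds quoted above for organellar fragments (80 bp minimum length, more than 95% sequence identity) amount to a simple filter over alignment hits. The Python sketch below shows that filter on made-up records; the `Hit` record, its field names, and the toy coordinates are assumptions for illustration and do not reflect the alignment software or output format used in the study.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A hypothetical alignment of organellar DNA to chromosome 10."""
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    identity: float   # percent identity of the alignment

    @property
    def length(self):
        return self.end - self.start + 1


def organellar_fragments(hits, min_len=80, min_identity=95.0):
    """Keep hits at least `min_len` bp long with identity above `min_identity`."""
    return [h for h in hits if h.length >= min_len and h.identity > min_identity]


# Toy hits: only the first and last satisfy both thresholds.
toy_hits = [
    Hit(1000, 1079, 98.5),     # 80 bp, passes
    Hit(5000, 5060, 99.0),     # 61 bp, too short
    Hit(9000, 9500, 94.0),     # identity too low
    Hit(20000, 22551, 96.2),   # 2552 bp, passes
]
kept = organellar_fragments(toy_hits)
print([(h.start, h.length) for h in kept])
```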

Rice is highly syntenic with other cereals (2), and a comparison of the rice chromosome 10 sequence to genetic maps from sorghum (22) and maize (23) identified matches to sequence-tagged sites (STSs) from these closely related grasses (Fig. 2). Seventy-four out of 1397 sorghum STSs had best matches to rice chromosome 10 (table S5), of which 46 (62%) mapped to sorghum linkage group C. Eighty-three out of 1362 maize STSs had best matches to rice chromosome 10 (table S6), of which 34, 21, and 8 mapped to maize chromosomes 1, 5, and 9, respectively. Hence, these results support previous analyses of synteny among the cereals (2).
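The STS tallies above can be summarized as counts and percentages per linkage group. The following Python sketch rebuilds those tallies from the numbers given in the text. The "other" bucket is an assumption that simply collects the STSs whose map location is not itemized in this report (28 for sorghum, 20 for maize), and the function name is illustrative.

```python
from collections import Counter


def tally_best_matches(assignments):
    """Return {group: (count, percent of all best matches)}."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {group: (n, 100 * n / total) for group, n in counts.items()}


# Reconstructed from the counts in the text; "other" is a placeholder bucket.
sorghum = ["C"] * 46 + ["other"] * (74 - 46)
maize = ["1"] * 34 + ["5"] * 21 + ["9"] * 8 + ["other"] * (83 - 34 - 21 - 8)

sorghum_tally = tally_best_matches(sorghum)
maize_tally = tally_best_matches(maize)

for group, (n, pct) in maize_tally.items():
    print(f"maize {group}: {n} STSs ({pct:.0f}%)")
```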

Fig. 2.

Correspondence of rice chromosome 10 to sorghum and maize chromosomes. For sorghum and maize, respectively, among 74 and 83 genetically mapped STSs that showed best BLAST matches of ≤1 × 10⁻¹⁰ with rice chromosome 10 sequences (as compared to all available rice BAC sequence), 46 map to sorghum linkage group C (connected by green lines), and 34 and 21 to homoeologous regions of maize chromosomes 1 (red) or 5 (blue). The sequence of rice chromosome 10 (scaled to its physical length, but with cM addresses of selected markers shown) shows general conservation of gene order, although with many localized rearrangements to sorghum linkage group C and to maize chromosomes 1 and 5 (both scaled to their recombinational lengths, with addresses of selected markers shown). Relatively few corresponding genes are found on the heterochromatic short arm of rice chromosome 10 (from 0 cM to the centromere, indicated by the open circle).

In our study, we annotated 3471 genes on rice chromosome 10. Previous reports with draft sequences estimated 1724 genes on chromosome 10 (4), whereas another draft report (5) did not report anchoring of any genes to the chromosomes. This difference is likely due to restrictive counting of genes based on homology to known proteins, missing transposon-based genes as a result of the need to filter whole-genome data, and lack of coverage provided by the draft sequence.

To assess the coverage and impact of draft versus finished sequence, we determined the fraction of rice chromosome 10 covered by the O. sativa L. ssp. indica sequence (5). In the draft sequences, typically only 4% of a given 1-Mb region of chromosome 10 is not covered, although this rises to ∼9% in the heterochromatic short arm (fig. S6). Although 96% coverage seems to indicate that the draft sequence differs very little from the complete sequence, further analysis indicates otherwise. An analysis of the coding fraction of our chromosome 10 sequence reveals that a large proportion of the genes are interrupted in the indica sequence (Fig. 3). In many of the 1-Mb windows, half or more of the genes are interrupted in the draft sequence. Simulations indicate that an increase in the predicted gene number occurs when a sequence is interrupted by gaps (11). These data are consistent with other analyses such as average gene size versus average contig size (5). In the chromosome 10 pseudomolecule, the median predicted protein size is 333 amino acids, whereas in the indica draft (5), the median predicted protein size is 232 amino acids, indicating that a large number of the predicted genes based on the indica sequence represent gene fragments (24). This greatly complicates both accurate gene counts and functional assignment of genes.
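The per-window comparison described above can be sketched in Python as follows. The sketch assumes that genes and draft scaffolds are both given as 1-based inclusive intervals on the finished pseudomolecule, and that each gene is assigned to the 1-Mb window containing its start coordinate. These conventions, the function name, and the toy intervals are illustrative assumptions, not the procedure documented in the supporting material.

```python
from collections import defaultdict

WINDOW_BP = 1_000_000


def intact_fraction_by_window(genes, scaffolds, window_bp=WINDOW_BP):
    """Fraction of genes per window that lie entirely within a single scaffold.

    genes, scaffolds: lists of (start, end) intervals, 1-based inclusive,
    in pseudomolecule coordinates (an assumed input format).
    """
    totals = defaultdict(int)
    intact = defaultdict(int)
    for g_start, g_end in genes:
        window = (g_start - 1) // window_bp   # window 0 covers bases 1..1,000,000
        totals[window] += 1
        # A gene counts as intact only if one scaffold spans it completely.
        if any(s_start <= g_start and g_end <= s_end for s_start, s_end in scaffolds):
            intact[window] += 1
    return {w: intact[w] / totals[w] for w in totals}


# Toy data: the second gene is split across two scaffolds and so is interrupted.
toy_genes = [(100, 2000), (5000, 9000), (1_200_000, 1_203_000)]
toy_scaffolds = [(1, 3000), (2500, 6000), (6001, 10_000), (1_199_000, 1_205_000)]
fractions = intact_fraction_by_window(toy_genes, toy_scaffolds)
print(fractions)
```

The toy case shows how a region can be fully covered by scaffolds while still interrupting a gene, which is the distinction the paragraph draws between coverage and intact gene content.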

Fig. 3.

Percentage of intact predicted genes found in indica scaffolds. With the same mapped scaffolds as in fig. S6, the percentages of FGENESH-predicted genes (total genes in blue) that were found to be either completely intact within a single scaffold (green) or completely contained by one or more scaffolds (red) were calculated from 1-Mb windows across chromosome 10 (26).

Our analyses revealed that, with the exception of a high frequency of repetitive elements, the coding portion of rice chromosome 10 is very similar to that of the Arabidopsis genome, suggesting substantial similarities between monocotyledonous and dicotyledonous angiosperms. The high frequency of repeats is consistent with a high fraction of heterochromatic DNA seen cytologically in chromosome 10. In contrast to the high frequency of large-scale duplication reported in Arabidopsis (12), we found no evidence of large-scale duplication within rice chromosome 10 or between rice chromosomes 1, 4, and 10. Analysis of contiguous sequences of the remaining rice chromosomes will be required to determine whether chromosome 10 is exceptional in this regard. We demonstrated a high degree of collinearity between rice and two major cereal crop species, validating the utility of rice as a foundation species in cereal comparative genomics. Although the previously published rice draft sequences (4, 5) provide a useful first look at the rice genome, the finished sequence yields more complete and accurate information essential to our understanding of plant biology.

Supporting Online Material

www.sciencemag.org/cgi/content/full/300/5625/1566/DC1

Materials and Methods

Supplemental Text

Figs. S1 to S6

Tables S1 to S6
