Collection, Mapping, and Annotation of Over 28,000 cDNA Clones from japonica Rice


Science  18 Jul 2003:
Vol. 301, Issue 5631, pp. 376-379
DOI: 10.1126/science.1081288

This article has a correction.


We collected and completely sequenced 28,469 full-length complementary DNA clones from Oryza sativa L. ssp. japonica cv. Nipponbare. Through homology searches of publicly available sequence data, we assigned tentative protein functions to 21,596 clones (75.86%). Mapping of the cDNA clones to genomic DNA revealed that there are 19,000 to 20,500 transcription units in the rice genome. Protein informatics analysis against the InterPro database revealed the existence of proteins present in rice but not in Arabidopsis. Sixty-four percent of our cDNAs are homologous to Arabidopsis proteins.

Rice (Oryza sativa) is an important food crop; it is also a good model for studies of monocot plants because its genome (430 Mb) is small relative to those of other Poaceae crop species. Draft sequences of the Oryza sativa L. ssp. indica (1) and japonica (2) genomes by the “whole-genome shotgun” sequencing method have been published, as have essentially complete sequences of chromosomes 1 and 4 by a physical mapping (“clone-by-clone”) method (3, 4). In addition to genomic data, full-length cDNA clones are necessary to identify exon-intron boundaries and gene-coding regions within genomic sequences and for comprehensive gene-function analyses at the transcriptional (transcriptomic) and translational (protein informatic) (5) levels.

Here we describe the collection, grouping, sequencing (6), mapping, and functional annotation of full-length cDNA (FL-cDNA) clones from ssp. japonica (cv. Nipponbare). All the data we present are available to the public through our Web site, the Knowledge-based Oryza Molecular Biological Encyclopedia (KOME), and the 28,469 complete sequences have been assigned GenBank accession numbers AK058203 through AK074028 and AK98843 through AK111488. (An example of a KOME report is in fig. S1.) Clones will be distributed from the Rice Genome Resource Center (Tsukuba, Japan).

We used the BLASTN and BLASTX programs (both of which use the Basic Local Alignment Search Tool) for the gene homology searches between monocot and dicot plants (6). We found 2603 cDNA clones that were identical to already-known rice genes. We identified 5607 clones as gene products from paralogs to already-known rice genes. Of the rice FL-cDNA clones, 12,527 were homologous to already-known genes of other plants, and 859 were homologous to already-known genes in organisms other than plants. In total, these homology searches enabled us to assign potential functions to 21,596 (75.86%) of our FL-cDNA clones. Homology-based comparative analysis of calcium signal-transduction proteins between plants and animals suggested that some important families, such as voltage-dependent calcium channel proteins, do not exist in plants (7).
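The category counts above can be checked against the reported total with a few lines of arithmetic. The following is only a bookkeeping sketch; the dictionary keys are paraphrases of the categories in the text, not labels used by the authors.

```python
# Clone counts per homology category, as reported in the text.
homology_counts = {
    "identical to known rice genes": 2603,
    "paralogs of known rice genes": 5607,
    "homologous to genes of other plants": 12527,
    "homologous to genes of non-plant organisms": 859,
}
TOTAL_CLONES = 28469  # fully sequenced FL-cDNA clones

# Sum the four categories and express the sum as a share of all clones.
annotated = sum(homology_counts.values())
fraction = annotated / TOTAL_CLONES

print(f"clones with a tentative function: {annotated}")
print(f"share of all clones: {fraction:.2%}")
```

The four categories sum to the 21,596 annotated clones, which is 75.86% of the 28,469 sequenced clones.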

We mapped the 28,469 FL-cDNA clones to the rice genome sequences—the indica draft genome sequence (1), the japonica draft genome sequence (2), and the japonica BAC/PAC (bacterial artificial chromosome and P1-derived artificial chromosome) clones (6) (Table 1). More than 94% of our FL-cDNA clones could be mapped to rice genomic (japonica and indica) sequences. Mapping results of our japonica-originated clones to the indica genome show that the nucleotide sequences of the gene-coding regions are very similar in these two subspecies. Genome sequence alone could not correctly identify the gene structure, but mapping of cDNA clones and comparison of genome sequences indicate the correct structure of the genes in rice. Rice FL-cDNA clones are useful not only to determine rice gene structure but also to understand gene structure in other Poaceae species by the integrative analyses with the expressed sequence tag (EST) clones of these species.

Table 1.

Results of mapping FL-cDNA clones to rice genomic sequences. Mbp, million base pairs.

Genomic sequence | Source | Size (Mbp) | No. of mapped clones (%) | No. of nonredundant TUs
japonica draft genome | (2) | 390 | 26,930 (94.6) | 18,933
indica draft genome | (1) | 363 | 26,784 (94.1) | 19,036
BAC/PAC clones | IRGSP | 368 | 22,162 (77.8) | 15,523

Of the 18,933 transcription units (TUs) located on the Syngenta genome sequence (2), 5045 are multi-exon TUs that contain two or more transcripts. We searched for alternative forms in those 5045 loci. By pairwise comparison of transcripts mapped to a given locus, we identified alternative 5′ or 3′ ends, cryptic exons (exons present in one transcript and entirely absent from another), and those exons flanked by alternative donor/acceptor sites (Table 2). We identified putative alternative transcripts in 2471 loci (45.6% of the redundant TUs and 13.1% of the total TUs). This is notable variability, given that our strategy selects against alternative forms. Alternative initiation sites were shown by 1673 loci (8.8%), whereas variation of termination sites was observed at only 853 loci (4.5%). This imbalance occurred despite our having used 3′ single-pass sequences in clustering clones to select a single representative for the full-length sequencing; this result implies more frequent variation in initiation sites than in termination sites in the rice genome. We detected alternatively spliced internal exons in 94 (0.5%) TUs. Exons were spliced at internal donor and acceptor sites in 180 (1.0%) and 241 (1.3%) loci, respectively. The alternative 5′ or 3′ ends, cryptic exons, and internal donor/acceptor sites occurred in coding regions [the longest open reading frame (ORF)] in 1937 (78.4%) loci, and these variations may lead to a functional change in the coded proteins. (This estimate may be affected by our having used the longest ORF as the coding region.)

Table 2.

Number of transcription units (TUs) with alternative structures. Data represent the 5045 multi-exon TUs that contain at least two transcripts.

Alternative structure | No. (% of total) of TUs with alternative structure
Initiation sites | 1673 (8.8%)
Internal exons | 94 (0.5%)
Termination sites | 853 (4.5%)
Splice donor sites | 180 (1.0%)
Splice acceptor sites | 241 (1.3%)
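The pairwise comparison of transcripts at one locus can be illustrated with a small sketch. This is not the procedure used for Table 2, whose details are in the Materials and Methods; it assumes plus-strand transcripts given as sorted exon coordinate lists, and the function name, the labels, and the rule that only internal exons count as cryptic are all illustrative assumptions.

```python
from typing import List, Set, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive, plus strand assumed


def overlaps(a: Exon, b: Exon) -> bool:
    """True if two exons share at least one nucleotide."""
    return a[0] <= b[1] and b[0] <= a[1]


def compare_transcripts(t1: List[Exon], t2: List[Exon]) -> Set[str]:
    """Label structural differences between two transcripts of one locus."""
    labels: Set[str] = set()
    if t1[0][0] != t2[0][0]:
        labels.add("alternative initiation site")
    if t1[-1][1] != t2[-1][1]:
        labels.add("alternative termination site")
    # Compare in both directions so exons unique to either transcript are seen.
    for a, b in ((t1, t2), (t2, t1)):
        for i, exon in enumerate(a):
            partners = [(j, e) for j, e in enumerate(b) if overlaps(exon, e)]
            is_internal = 0 < i < len(a) - 1
            if is_internal and not partners:
                # Internal exon with no counterpart in the other transcript.
                labels.add("cryptic exon")
            for j, partner in partners:
                # Donor site: 3' end of an exon followed by an intron in both.
                if i < len(a) - 1 and j < len(b) - 1 and exon[1] != partner[1]:
                    labels.add("alternative donor site")
                # Acceptor site: 5' end of an exon preceded by an intron in both.
                if i > 0 and j > 0 and exon[0] != partner[0]:
                    labels.add("alternative acceptor site")
    return labels


# Hypothetical locus: the second transcript skips the middle exon.
full = [(100, 200), (300, 400), (500, 600)]
skipped = [(100, 200), (500, 600)]
print(compare_transcripts(full, skipped))  # {'cryptic exon'}
```

Minus-strand transcripts would need the coordinate roles swapped (the lowest coordinate is then the 3′ end), which the sketch leaves out for brevity.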

Antisense RNA inhibits gene expression (8) and genetic imprinting (9). To identify antisense RNA genes, we searched for pairs of transcripts that are transcribed bidirectionally from an overlapping genomic region. We found 902 transcript pairs (1443 cDNA clones), the exons of which overlap at least one nucleotide on opposite strands. The clustering patterns of the variants of the transcripts on the same TU are shown on our Web site. An example is in fig. S2.
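The overlap criterion described above (exons sharing at least one nucleotide on opposite strands) can be sketched as follows. This is an illustration under assumed data structures, not the code used for the analysis; the `Transcript` class, its field names, and the clone names are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Transcript:
    name: str
    chrom: str
    strand: str                           # "+" or "-"
    exons: Tuple[Tuple[int, int], ...]    # (start, end), 1-based inclusive


def exons_overlap(a: Transcript, b: Transcript) -> bool:
    """True if any exon of a shares at least one nucleotide with an exon of b."""
    return any(s1 <= e2 and s2 <= e1
               for s1, e1 in a.exons
               for s2, e2 in b.exons)


def antisense_pairs(transcripts: List[Transcript]) -> List[Tuple[str, str]]:
    """Pairs on the same sequence, opposite strands, with overlapping exons."""
    return [(a.name, b.name)
            for a, b in combinations(transcripts, 2)
            if a.chrom == b.chrom
            and a.strand != b.strand
            and exons_overlap(a, b)]


# Hypothetical mapped clones; cloneA and cloneB share exactly one nucleotide.
mapped = [
    Transcript("cloneA", "chr1", "+", ((100, 200), (300, 400))),
    Transcript("cloneB", "chr1", "-", ((400, 500),)),
    Transcript("cloneC", "chr1", "-", ((201, 299),)),   # lies within an intron
    Transcript("cloneD", "chr2", "-", ((100, 200),)),   # different sequence
]
print(antisense_pairs(mapped))  # [('cloneA', 'cloneB')]
```

The all-against-all comparison is quadratic in the number of transcripts; for a genome-scale set, sorting by position or using an interval index would be the more practical choice.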

From the 5′-end sequences of mRNAs, the promoter sequences can be obtained by comparison with the rice genomic sequences. In a search of the PLACE database, we found cis-acting elements in genomic sequences 1000 base pairs upstream from the 5′ termini of each mapped FL-cDNA clone (10). The data are available on our Web site (fig. S1). We identified 20,403 potential promoters.
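Extracting the 1000 base pairs upstream of a mapped 5′ terminus depends on the strand of the transcript. The sketch below shows one way to do this and to scan the region for a motif; the function names are hypothetical, the genome is assumed to be an in-memory string, and the motif scan is an exact-match stand-in for a PLACE search rather than a reproduction of it.

```python
from typing import List

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(genome: str, tss: int, strand: str, length: int = 1000) -> str:
    """Return up to `length` bases upstream of the 5' terminus.

    `tss` is the 1-based genomic position of the first transcribed base.
    The result is given in transcript orientation and is shorter than
    `length` if the sequence ends first.
    """
    if strand == "+":
        start = max(0, tss - 1 - length)
        return genome[start:tss - 1]
    if strand == "-":
        end = min(len(genome), tss + length)
        return reverse_complement(genome[tss:end])
    raise ValueError("strand must be '+' or '-'")


def find_motif(region: str, motif: str) -> List[int]:
    """0-based start positions of all (possibly overlapping) exact matches."""
    hits, pos = [], region.find(motif)
    while pos != -1:
        hits.append(pos)
        pos = region.find(motif, pos + 1)
    return hits


# Toy sequence, for illustration only.
toy_genome = "AAACCCGGGTTT"
print(upstream_region(toy_genome, 7, "+", length=3))  # CCC
print(upstream_region(toy_genome, 6, "-", length=3))  # CCC
```

For a minus-strand transcript the upstream sequence lies at higher genomic coordinates, so it is reverse-complemented to read in the same direction as the transcript.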

We searched the InterPro database (6, 11) to compare the profiles of proteins encoded in the rice genome with those in the Arabidopsis genome. InterPro domains found in each ORF amino acid sequence can be seen on the KOME report page (fig. S1).

Protein domains found in rice but not in Arabidopsis (table S2b) and those more frequent in rice than in Arabidopsis (table S2c) could be categorized as those from tissue-specific proteins (e.g., pollen: major pollen allergen Lol pI; seed: Bowman-Birk serine protease inhibitor domain and BURP domain) or related to environmental stresses (e.g., cold: antifreeze protein type I; drought: ABA/WDS–induced protein). The abundance of these proteins may be characteristic of rice. In contrast, domains found more frequently in Arabidopsis than in rice could be categorized into components of signal transduction (e.g., receptor: TIR domain; diacylglycerol-related signal transduction: DC1 domain) and transposon-related proteins (e.g., putative plant transposon protein, reverse transcriptase, retrotransposon gag protein, integrase, catalytic domain, and plant MuDR transposase). The abundance of domains associated with transposable element–related proteins is a peculiarity of Arabidopsis.

Transcription factors are important proteins in rice. Our InterPro search yielded 18 DNA binding domains related to 1336 transcription factors (Table 3). Zinc finger–type transcription factors are most numerous, followed by Myb-type factors; these results are similar to those for Arabidopsis (12). To analyze membrane-spanning domains, cellular localization, glycosylation sites, and phosphorylation sites, we used the Membrane Protein Structure And Topology (MEMSAT) (13) and PSORT (14, 15) programs (tables S4 and S5). All the clone-by-clone results of analyses by these programs can be seen on the KOME report page.

Table 3.

Transcription factors identified through InterPro search. An InterPro search yielded 18 DNA-binding domains related to 1336 transcription factors, and the proteins with these domains are shown. Zinc finger–type transcription factors are most numerous, followed by Myb-type factors; these results are similar to those for Arabidopsis (12).

Category of domain | No. of FL-cDNA clones | Comment
Zn finger | 588 | Including RING, C2H2, Cx8Cx5C3H, Dof, GATA, CONSTANS, and NF-X1
Myb | 158 |
ERF | 83 | Including one ERF
NAM | 74 |
Homeobox | 73 | Including ELK, KNOX1, and KNOX2
bZIP | 63 | Not including Zn finger
AUX/IAA | 47 | Including four TFB3
TFB3 | 27 | Not including ERF or AUX/IAA
GRAS | 27 | Not including Zn finger
HSF | 25 |
Tubby | 24 |
BRCT | 19 | Not including Zn finger
Fungal TF | 18 |
SBP | 17 |
Jumonji | 11 | Including jmjC and JmjN; not including Zn finger
TCP | 10 |
Total | 1336 |

To understand the profile and complexity of rice proteins, we compared the 28,444 ORF amino acid sequences from our clones with 27,288 predicted coding sequences (CDS) of the Arabidopsis genome, using the same set of data as in the InterPro analysis (table S4). At an expect value threshold of E < 10^-7 (BLASTP), 18,900 FL-cDNA clones (12,996 TUs; 64%) showed homology to Arabidopsis-predicted CDS, whereas 9544 (7263 TUs; 36%) did not (Fig. 1). There were 20,473 Arabidopsis genes (75%) that had a homolog in our rice FL-cDNAs, whereas 6815 (25%) did not. For the indica genome (1), when both cDNAs and predicted genes were compared against the genomic sequence, 49.4% of predicted rice genes had a homolog in Arabidopsis, and 80.6% of Arabidopsis genes had a homolog in rice. If we assume that the numbers of rice genes in the japonica and indica genomes are nearly the same, these results suggest that more than 5% of Arabidopsis homologous genes (the difference between 80.6% and 75%) remain to be collected and that the rice-specific genes to be collected could reduce the fraction of Arabidopsis-common TUs to 49.4%. It remains difficult to estimate the correct number of rice genes from the genomic DNA sequence, but this method of comparison with data from other species might help us to estimate the number of genes.
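The thresholding step and the resulting percentages can be illustrated as follows. The parser assumes BLAST hits in the 12-column tab-separated layout written by current BLAST+ (`-outfmt 6`, with the E-value in the eleventh column); that layout and the function name are assumptions for illustration and say nothing about how the original searches were stored. The counts in the second half are those reported in the text.

```python
import csv
import io
from typing import Set

E_CUTOFF = 1e-7  # threshold reported in the text (E < 10^-7)


def queries_with_hit(tabular_text: str, cutoff: float = E_CUTOFF) -> Set[str]:
    """Query IDs with at least one hit below the E-value cutoff.

    Assumes 12 tab-separated columns per hit, with the query ID in
    column 1 and the E-value in column 11.
    """
    hits: Set[str] = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        if float(row[10]) < cutoff:
            hits.add(row[0])
    return hits


# Counts reported in the text.
rice_tus_with_homolog, rice_tus_without = 12996, 7263
arabidopsis_with_homolog, arabidopsis_total = 20473, 27288

rice_share = rice_tus_with_homolog / (rice_tus_with_homolog + rice_tus_without)
arabidopsis_share = arabidopsis_with_homolog / arabidopsis_total

print(f"rice TUs with an Arabidopsis homolog: {rice_share:.0%}")
print(f"Arabidopsis genes with a rice homolog: {arabidopsis_share:.0%}")
```

The two shares reproduce the 64% and 75% given above; the first is computed over the 20,259 TUs, not over the 28,444 individual ORFs.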

Fig. 1.

Homology between rice FL-cDNA clones and Arabidopsis genes. (A) Comparison of the genes predicted by FGENESH in the indica rice genome (1) (53,398 genes) and the annotated genes from Arabidopsis (2) (25,426 genes) at BLASTP (E < 10^-7). We found that 49.4% of predicted rice genes had a homolog in A. thaliana, and 80.6% of A. thaliana genes had a homolog in rice, as described in (1). (B) A similar comparison from our FL-cDNA data (28,444 ORFs; 20,259 TU clusters). Sixty-four percent of rice TUs had a homolog in A. thaliana (1), and 75% of A. thaliana genes had a homolog in rice (2). Although it is still difficult to estimate the number of rice genes from the genomic DNA sequence, and the result shown in the indica sequence paper (1) remains to be determined, we still have to collect the genes specific to rice and the genes in common with Arabidopsis, as indicated by the dotted rectangles.

To classify the genes according to their putative function, we used the Gene Ontology (GO) term attached to the InterPro domain names (Fig. 2). The InterPro GO assignment yielded 9734 clones with GO terms associated with “biological processes.” The total number of GO terms associated with “biological processes” in our clones is 21,708; these can be divided into 12 categories (Table 4). The InterPro GO assignment yielded 12,346 FL-cDNA clones with GO terms associated with “function.” The total number of GO terms associated with “function” was 18,877. The number of terms that shared the description “function” is shown in table S6. Classification of protein kinases was done by GO term (table S7). Finally, the InterPro GO assignment yielded 4025 FL-cDNA clones with GO terms associated with “cellular component,” and the total number of GO terms associated with “cellular component” was 4051 (table S8). GO terms attached to each clone can be seen at the KOME page (fig. S1).

Fig. 2.

Flow chart of the classification of FL-cDNA clones by GO term. We classified 21,708 FL-cDNA clones with InterPro GO terms according to the accession number of the associated GO term “biological process.” The results of the number of clones in each category are shown in Table 4.

Table 4.

GO terms of FL-cDNAs with “biological processes” associated with InterPro domains. The total number of GO terms associated with “biological processes” in our clones is 21,708; they can be exclusively classified into 12 categories. The process of classification is as shown in Fig. 2.

Category | No. of FL-cDNA clones | %
Unclassified | 11,974 | 55.1
Metabolism | 5,397 | 24.9
Transport | 1,283 | 5.9
Translation | 884 | 4.1
Transcription | 640 | 2.9
Cell communication | 556 | 2.6
Communication, defense | 249 | 1.1
Energy | 245 | 1.1
Cell growth/maintenance | 207 | 1.0
Developmental process, aging, death | 130 | 0.6
DNA replication | 107 | 0.5
Others | 36 | 0.2
Total | 21,708 | 100.0

We compared the FGENESH-predicted genes for the whole indica genome with FL-cDNA clones in terms of functional classification. Updated InterPro and GO annotations for the indica genome were downloaded from the Bioverse Web site (16). The FGENESH-predicted genes in each category and the FL-cDNA clones in each category are nearly equally distributed, thus demonstrating that there were no functional biases in acquisition of the cDNA data (fig. S4). We also compared the Bioverse annotations of the predicted Arabidopsis genes with rice FL-cDNA, and these also show a similar distribution among different functional categories (fig. S5).

Analysis of transcriptional products, as represented by cDNA clones, adds to a greater understanding of genome function and contributes to improving gene prediction tools.

Supporting Online Material

Materials and Methods

Figs. S1 to S5

Tables S1 to S8


References and Notes
