Functional Annotation of a Full-Length Arabidopsis cDNA Collection


Science  05 Apr 2002:
Vol. 296, Issue 5565, pp. 141-145
DOI: 10.1126/science.1071006


Full-length complementary DNAs (cDNAs) are essential for the correct annotation of genomic sequences and for the functional analysis of genes and their products. We isolated 155,144 RIKEN Arabidopsis full-length (RAFL) cDNA clones. The 3′-end expressed sequence tags (ESTs) of the 155,144 RAFL cDNAs were clustered into 14,668 nonredundant cDNA groups, about 60% of predicted genes. We also obtained 5′ ESTs from 14,034 nonredundant cDNA groups and constructed a promoter database. The sequence database of the RAFL cDNAs is useful for promoter analysis and correct annotation of predicted transcription units and gene products. Furthermore, the full-length cDNAs are useful resources for analyses of the expression profiles, functions, and structures of plant proteins.

Arabidopsis thaliana has been adopted as a model organism in the study of plant biology because of its small size, short generation time, and high efficiency of transformation (1). To sequence its small genome [125 megabases (Mb)] (2), scientists in Japan, Europe, and the United States collaborated in the Arabidopsis genome sequencing project (3). Two of five chromosomes (chromosomes 2 and 4, except for the nucleolar organizer regions and centromeres) were sequenced in 1999 (4, 5), and the remaining three chromosomes were sequenced in 2000 (2).

About 127,000 expressed sequence tags (ESTs) from Arabidopsis had been deposited in the EST database (dbEST) as of May 2001, including sequences from large-scale EST projects promoted by laboratory consortia in France (6, 7), the United States (8, 9), and Japan (10). These projects have produced EST data from different tissues, organs, seeds, and developmental stages (6–10). However, these EST projects are based on cDNA libraries in which most of the inserts are not full-length. ESTs are useful for making a catalog of expressed genes, but not for further study of gene function. Consequently, genome-scale collections of the full-length cDNAs of expressed genes become important for the analysis of the structure and function of genes and their products in the functional genomics era.

We previously made full-length cDNA libraries using the biotinylated CAP trapper method (11, 12) from Arabidopsis plants (13). Here, we constructed Arabidopsis full-length cDNA libraries from plants grown under different conditions as reported previously (11–15) by the biotinylated CAP trapper method using trehalose-thermoactivated reverse transcriptase. We used λZAP (11, 13) and λFLC (16) vectors for construction of the cDNA libraries. The λFLC vectors accommodate cDNAs in a broad range of sizes and are useful for the high-efficiency cloning of long cDNA fragments (16). The λFLC vectors can also be bulk-excised by a Cre-lox–based system free of size bias to produce the plasmid libraries. In the construction of full-length cDNA libraries [RIKEN Arabidopsis full-length (RAFL) 12, 13, 14, 15, 16, 17, 18, 19, and 21 (Table 1)], we used a single-strand linker ligation method (17), which uses DNA ligase to add a double-stranded (ds) DNA linker to single-stranded (ss) full-length cDNA. Subsequent sequencing of clones and translation of proteins from full-length cDNA are easier and more efficient because of the elimination of the GC tail. Normalization and subtraction procedures (18) were also introduced in the construction of full-length cDNA libraries [RAFL11, 12, 13, 17, 18, 19, and 21 (Table 1)] to reduce the representation of highly expressed mRNAs in the library and to remove cDNAs already categorized by means of one-pass sequencing, respectively. The method is based on hybridization of the first-strand full-length cDNA with several RNA drivers, including starting mRNA as the normalizing driver and run-off transcripts from rearrayed clones as subtracting drivers. This method should dramatically enhance the discovery of new cDNAs. The overall strategy for preparing cDNA libraries, including standard, normalized, and subtracted libraries, has been described previously (19).
We constructed 19 full-length cDNA libraries from Arabidopsis plants grown under various stress, hormone, and light conditions, from plants at various developmental stages, and from various plant tissues.

Table 1

Summary of 3′-end single-pass sequencing of RAFL cDNA clones isolated from A. thaliana full-length cDNA libraries. 155,144 RAFL cDNA clones were clustered by mapping of the 3′-end single-pass-sequencing data on the genomic sequence to produce more than 14,668 cDNA groups. n.d., not determined; UV, ultraviolet; ABA, abscisic acid; JA, jasmonic acid; SA, salicylic acid; GA, gibberellin; BTH, benzo-(1,2,3)-thio-diazole-7-carbothionic acid S-methyl ester.


We performed single-pass sequencing of the cDNA clones from the 3′ end. The 155,144 3′ ESTs were clustered and then mapped onto the Arabidopsis genome (Fig. 1 and supplemental text) (15). Finally, 14,668 nonredundant RAFL cDNA clones were identified and mapped on the Arabidopsis genome (Table 1 and Fig. 1). The information on the 14,668 RAFL cDNA clones (the “RAFL cDNA” genes) is available in Web tables 1 and 2 (20). Assuming that the total number of Arabidopsis genes is about 25,000, the RAFL clones should account for about 60% of all Arabidopsis genes. Our evaluation of 349 RAFL cDNA clones by single-pass sequencing showed that ∼98% of the clones contained both start and stop codons. Thus, the cDNA libraries constructed by the biotinylated CAP trapper contained a very high proportion of full-length cDNAs.

Figure 1

Strategy for clustering of the RAFL cDNA clones. A total of 155,144 RAFL cDNA clones isolated from 19 full-length cDNA libraries were subjected to single-pass sequencing from the 3′ ends of the cDNA. The 3′-end single-pass sequencing data were used in the two steps for clustering as described in supplemental methods (15). After the second clustering, the best quality sequence was chosen as the representative of the group. The 3′ EST of each representative clone was then mapped onto the Arabidopsis genome as described in the supplemental text (15). As a result, 14,878 nonredundant representative 3′ ESTs were mapped on the Arabidopsis genome. Next, the 14,878 cDNA clones were subjected to single-pass sequencing from the 5′ end of the cDNA. The 5′ end sequencing data were then mapped onto the Arabidopsis genome with the BlastN program (15). Finally, the 14,668 nonredundant RAFL cDNA clones mapped on the Arabidopsis genome were identified.

From the 5′-end sequences of mRNAs, the promoter sequences can be obtained by comparison with the Arabidopsis genomic sequences. We also obtained 5′ ESTs of 14,034 RAFL cDNA clones and constructed a promoter database (21) using the PLACE database (22). The Arabidopsis promoter database shows genomic sequences 1000 base pairs (bp) upstream from the 5′ termini of each RAFL cDNA clone and about 300 cis-acting elements known from plants (Web table 1) (20).
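The upstream-region lookup described above can be illustrated with a minimal sketch. It assumes a chromosome sequence held as a plain string and a 0-based coordinate for the 5′ terminus. The function names, the toy sequence, and the TATA-like motif are hypothetical and chosen only for illustration; they are not taken from the promoter database or from PLACE.

```python
# Illustrative sketch only: helper names, coordinates, and the toy sequence
# are assumptions, not part of the RAFL promoter database.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_region(chrom_seq: str, tss: int, strand: str, length: int = 1000) -> str:
    """Return up to `length` bp upstream of a cDNA 5' terminus.

    `tss` is the 0-based genomic index of the first transcribed base.
    The result is written 5'->3' on the strand of the transcript and is
    shorter than `length` if the chromosome end is reached.
    """
    if strand == "+":
        start = max(0, tss - length)
        return chrom_seq[start:tss]
    if strand == "-":
        end = min(len(chrom_seq), tss + 1 + length)
        return reverse_complement(chrom_seq[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")


def find_motif(promoter: str, motif: str) -> list[int]:
    """Return 0-based offsets of exact motif matches (overlaps allowed)."""
    hits = []
    pos = promoter.find(motif)
    while pos != -1:
        hits.append(pos)
        pos = promoter.find(motif, pos + 1)
    return hits


# Toy example: 8 bp upstream of a plus-strand transcript starting at index 10.
toy_chrom = "CCTATAAATCGATGAAACCC"
plus_promoter = upstream_region(toy_chrom, 10, "+", length=8)
minus_promoter = upstream_region(toy_chrom, 9, "-", length=4)
tata_hits = find_motif(plus_promoter, "TATAAAT")
```

Exact string matching is the simplest possible scan. Real cis-element definitions often contain degenerate bases, which would call for pattern matching rather than `str.find`.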

Of the 14,668 RAFL cDNA clones mapped onto the Arabidopsis genome, 13,831 were matched to Munich Information Center for Protein Sequences (MIPS) protein entry codes (Fig. 2), leaving 837 RAFL cDNA clones unmatched (Fig. 2, Web fig. 1C, and Web table 3-3) (20). These 837 RAFL cDNAs have not yet been predicted by the Arabidopsis Genome Initiative (AGI) and thus represent false negatives in the genome annotation.

Figure 2

Current compilation of expressed genes in Arabidopsis. The left-hand Venn diagram shows the two classes of the 17,956 experimentally identified genes. Of these genes, there are 14,668 RAFL cDNA genes isolated in this study (red and pink circles) and 14,682 reported EST or cDNA genes (yellow circle), including EST genes identified by EST analysis, CERES cDNAs, and Arabidopsis expressed genes that Arabidopsis researchers have cloned and sequenced by traditional cloning. Of the 14,668 RAFL cDNA genes, 837 are newly identified genes that were not predicted and 2437 are newly identified genes that were predicted. The right-hand Venn diagram shows the intersection between the total number of predicted genes (26,285, blue circle) and the experimentally identified genes (17,956, pink circle). The green region of intersection shows the 17,119 experimentally identified genes that have been predicted. The blue region of nonintersection shows the 9286 predicted genes that have not been experimentally confirmed yet. The pink region of nonintersection shows the 837 identified genes that are not predicted by AGI. In addition, in some cases, pairs of seemingly separate predicted genes correspond to a single experimentally identified gene. Conversely, single predictions sometimes correspond to more than two experimentally identified genes. The last two facts explain why 17,119 genes correspond to 16,999 predicted genes.

To analyze all known expressed Arabidopsis genes, we used data from: (i) 5100 complete cDNAs that Arabidopsis researchers have sequenced and deposited in GenBank as of 18 August 2001 (23), (ii) 127,031 Arabidopsis ESTs identified as of 22 May 2001 (24), and (iii) 5000 Arabidopsis full-length cDNAs that Ceres, Inc., released to The Institute for Genomic Research on 19 December 2000 (25). Altogether, these genes (the “reported EST or cDNA” genes) were subjected to homology search (26) against the sequence database of their corresponding MIPS protein entry codes using the BlastN program. The reported EST or cDNA genes covered a total of 14,551 MIPS protein entry codes (Fig. 2). Also, 2437 of the RAFL cDNAs mapped to the MIPS protein entry codes were novel genes not identified so far (Fig. 2). ESTs or cDNA genes have been reported for 3288 MIPS protein entry codes for which no RAFL cDNA genes have been identified (Fig. 2). A total of 11,394 genes corresponded to both reported EST or cDNA and RAFL cDNA genes. These results bring the total number of Arabidopsis genes whose expression has been experimentally confirmed to 17,956 (Fig. 2). In comparison, AGI lists 17,119 experimentally confirmed genes, of which 16,999 were predicted (Fig. 2). The discrepancies are likely due to two predicted genes corresponding to a single experimentally identified gene (Web fig. 1A) (20), or single predicted genes corresponding to more than two experimentally identified genes (Web fig. 1B) (20). Some RAFL cDNA clones correspond to each of these circumstances (Web tables 3-1 and 3-2) (20).
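The gene counts in this paragraph and in the Fig. 2 legend can be cross-checked with simple set arithmetic. The sketch below assumes that the four region counts (837, 2437, 11,394, and 3288) partition the experimentally identified genes in the way the Venn diagram suggests; the variable names are illustrative.

```python
# Region counts as stated in the text and the Fig. 2 legend.
rafl_not_predicted = 837       # RAFL genes with no AGI prediction
rafl_new_predicted = 2437      # RAFL genes predicted but without earlier EST/cDNA
shared = 11_394                # both RAFL and reported EST/cDNA genes
reported_only = 3288           # reported EST/cDNA genes with no RAFL clone

predicted_total = 26_285       # all predicted genes (Fig. 2, blue circle)
predicted_confirmed = 16_999   # predicted genes with experimental support

# Totals derived under the partition assumption named above.
rafl_total = rafl_not_predicted + rafl_new_predicted + shared
reported_total = shared + reported_only
expressed_total = rafl_total + reported_only
expressed_and_predicted = expressed_total - rafl_not_predicted
predicted_unconfirmed = predicted_total - predicted_confirmed

print(rafl_total, reported_total, expressed_total)
print(expressed_and_predicted, predicted_unconfirmed)
```

Under that assumption the derived totals reproduce the 14,668 RAFL cDNA genes, the 14,682 reported EST or cDNA genes, the 17,956 experimentally identified genes, the 17,119 identified genes that were predicted, and the 9286 predicted genes still lacking experimental support.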

We conclude that 9286 predicted genes need further data to be confirmed as expressed genes or unidentified genes (Fig. 2). Because these unidentified genes have not been confirmed by any ESTs, some of the predicted genes represent false positives or pseudogenes. Alternatively, these unidentified genes might have remained undetected by the EST approach because of their weak expression in specific tissues.

The biological roles and biochemical functions of RAFL cDNA clones were identified by homology search using the BLAST program (Table 2). The results show that cDNA clones of some functional categories, such as energy production, protein synthesis, and ion homeostasis are well represented in RAFL. More than 80% of cDNAs for genes involved in energy production, protein synthesis, and ionic homeostasis were found in RAFL, and ∼70% of cDNAs for genes involved in metabolism, protein destination, cellular transport and transport mechanisms, and cellular organization were found in RAFL. It has been estimated that ∼1500 transcription factor genes (27) and about 1000 protein kinase genes (28) exist in the Arabidopsis genome. The RAFL cDNA collection includes 1087 transcription factor and 506 protein kinase genes (Table 2).
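As a rough illustration of what the last two sentences imply, the fraction of each gene family represented in RAFL can be computed from the counts given. The denominators are the approximate estimates quoted above (∼1500 and about 1000), so the resulting percentages are indicative only; the dictionary names are illustrative.

```python
# Estimated family sizes and RAFL counts, as quoted in the text.
estimated_genes = {"transcription factor": 1500, "protein kinase": 1000}
found_in_rafl = {"transcription factor": 1087, "protein kinase": 506}

# Fraction of each estimated family represented in the RAFL collection.
fractions = {
    family: found_in_rafl[family] / estimated_genes[family]
    for family in estimated_genes
}

for family, fraction in fractions.items():
    print(f"{family}: {fraction:.1%} of the estimated gene count")
```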

Table 2

Functional classification of RAFL cDNA clones.


Although many algorithms have been written to predict a transcription unit from genomic sequence data, the accuracy of their predictions is still limited. A more direct and efficient approach to identifying coding sequences is to sequence full-length cDNAs. Complete sequences of RAFL cDNAs will be useful for gene identification and positional cloning. The RAFL cDNA clones are publicly available from the RIKEN Bioresource Center.

* To whom correspondence should be addressed. Laboratory of Plant Molecular Biology, RIKEN Tsukuba Institute, 3-1-1 Koyadai, Tsukuba 305-0074, Japan. E-mail: sinozaki{at}

