Research Article

A Draft Sequence of the Rice Genome (Oryza sativa L. ssp. japonica)


Science  05 Apr 2002:
Vol. 296, Issue 5565, pp. 92-100
DOI: 10.1126/science.1068275

This article has a correction.


The genome of the japonica subspecies of rice, an important cereal and model monocot, was sequenced and assembled by whole-genome shotgun sequencing. The assembled sequence covers 93% of the 420-megabase genome. Gene predictions on the assembled sequence suggest that the genome contains 32,000 to 50,000 genes. Homologs of 98% of the known maize, wheat, and barley proteins are found in rice. Synteny and gene homology between rice and the other cereal genomes are extensive, whereas synteny with Arabidopsis is limited. Assignment of candidate rice orthologs to Arabidopsis genes is possible in many cases. The rice genome sequence provides a foundation for the improvement of cereals, our most important crops.

Cereal crops constitute more than 60% of total worldwide agricultural production (1), and rice, wheat, and maize are the three most important cereals. More than 500 million tons of each are produced annually worldwide; per capita consumption averages as high as 1.5 kg per day (2). Most rice grown is consumed directly by humans, and about one-third of the population depends on rice for more than 50% of caloric intake (3).

The cereals have been evolving independently from a common ancestral species for 50 to 70 million years (4), but despite this long period of independent evolution, cereal genes and genomes display high conservation. Comparisons of the physical and genetic maps of the grass genomes show conservation of gene order and orientation, or synteny (5–7). Despite gene similarity and genome synteny, cereal genome sizes vary considerably. The genomes of sorghum, maize, barley, and wheat are estimated at 1000, 3000, 5000, and 16,000 megabase pairs (Mbp), respectively. Rice has a much smaller genome, estimated at 420 Mbp. The small genome and predicted high gene density of rice make it an attractive target for cereal gene discovery efforts and genome sequence analysis.

Over the past several years, selected regions of the japonica and indica rice genomes have been sequenced. The International Rice Genome Sequencing Project (IRGSP) was organized to achieve >99.99% accurate sequence using a mapped clone sequencing strategy (8). In addition, expressed gene sequencing has been actively pursued. More than 104,000 expressed sequence tags (ESTs) from a variety of rice tissues have been entered into the EST database (9). Other rice genome sequencing projects have been reported by Monsanto Co. (10) and by the Beijing Genomics Institute (11).

The two major groups of flowering plants, monocots and dicots, diverged 200 million years ago (12). In late 2000, the 125-Mbp genome of the dicot model plant Arabidopsis thaliana was reported (13–15). Similar high-accuracy sequencing projects of important cereals would be expensive and slow because their genomes are so large. Recent improvements in automated DNA sequencing have made whole-genome shotgun sequencing an attractive approach for gene discovery in both small and large genomes (16–18). Here, we describe the random-fragment shotgun sequencing of Oryza sativa L. ssp. japonica (cv. Nipponbare) to discover rice genes, molecular markers for breeding, and mapped sequences for the association of candidate genes and the traits they control. Also reported are the linkages of sequence assemblies to rice bacterial artificial chromosome (BAC) end sequences and fingerprints (19–22), anchoring of the physical and genetic maps, and the syntenic relationship between rice and other plants. The finding that most cereal genes have strong rice homologs suggests that the rice genome will be useful as a foundation for sequencing the genomes of related cereals. Synteny among cereals should allow placement of low-copy cereal genes on a rice genome framework.

Sequence generation, assembly, coverage analysis, and repeats.

A shotgun library of the rice genome was constructed in a low-copy plasmid from purified, sheared nuclear DNA. Clones were sequenced from both ends as described (Web site link 1; for this and other supplementary Web data, see Science Online). More than 2.5 billion bases with a 98% probability of being correct were generated, representing more than sixfold coverage of the estimated 420-Mbp rice genome (23). About 80% of the sequencing reads were linked to a second sequence generated from the opposite end of the same template. After removal of an estimated 38 Mbp of repetitive DNA, more than 5.5 million sequences assembled into 42,109 contiguous sequences (contigs) with a total coverage of 389,809,244 bp (93% of the predicted 420-Mbp rice genome) and a GC content of 44%. This sequence assembly will be referred to as Syd (Syngenta draft sequence; data access information is available at

Contigs of nonrice origin were identified by sequence homology to known bacteria, high GC content, lack of homology to rice BAC end sequences, and/or depth of coverage. Sequence analysis identified 6 Mbp as originating from two related bacterial species (Xanthomonadales), likely representing endophytes present in the plant material used for DNA isolation. These sequences were not included in Syd; however, some bacterial contigs may remain (Web site link 2).

Syd data were compared to almost 1 million bases of IRGSP's completed rice genome sequence to determine coverage and quality. Syd coverage ranged from 98.7 to 99.8% on different clones (Table 1). Sixty-three gaps were found, totaling 6400 bp or 0.6%; the average gap size was 101 bp. Most of the Syd gaps were the result of conservative assembly; more than 70% of the gaps could be closed by manual editing. A single base pair difference and an insertion-deletion difference (indel) were found once every 1000 and 2000 bp, respectively; these findings indicate that Syd is 99.8% accurate. Gene coverage was assessed by comparison of 495 full-length rice genes with Syd. Of 648,792 bp, 643,528 bp (99.2%) were found in Syd (Web site link 3). Most genes were represented at more than 90% of their full length. The only three genes not found in Syd were determined to be misannotated. The analysis of coverage and quality indicates that Syd contains the overwhelming majority of rice genes, although some gene sequences are partial and/or in more than one contig. Unedited Syd sequence was used for the analysis presented below.
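The coverage and accuracy figures quoted above can be re-derived with a few lines of arithmetic. The sketch below is only a consistency check on the published numbers; the variable names are illustrative, and treating every observed single-base difference and indel as an error in Syd (and simply adding the two rates) is an assumption made here, not a method described in the text.

```python
# Figures quoted in the text; all names are illustrative.
GENOME_SIZE_BP = 420_000_000   # estimated rice genome size
ASSEMBLED_BP = 389_809_244     # total contig coverage in Syd
RAW_BASES = 2_500_000_000      # lower bound: "more than 2.5 billion" bases


def error_rate(bp_per_substitution: int, bp_per_indel: int) -> float:
    """Combined per-base difference rate (assumes the two rates simply add)."""
    return 1 / bp_per_substitution + 1 / bp_per_indel


fold_coverage = RAW_BASES / GENOME_SIZE_BP           # lower bound on read depth
assembled_fraction = ASSEMBLED_BP / GENOME_SIZE_BP   # share of genome in contigs
accuracy = 1 - error_rate(1000, 2000)                # one mismatch/1000 bp, one indel/2000 bp
gene_coverage = 643_528 / 648_792                    # full-length gene bases found in Syd

print(f"read depth >= {fold_coverage:.2f}x")
print(f"assembled fraction = {assembled_fraction:.1%}")
print(f"accuracy = {accuracy:.2%}")
print(f"gene coverage = {gene_coverage:.1%}")
```

Because 2.5 billion is a lower bound on the number of bases generated, the computed depth is likewise a lower bound and is consistent with the roughly sixfold coverage reported.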

Table 1

Genome coverage in Syd. Six fully sequenced rice BACs (GenBank) were compared to sequences in Syd to determine coverage and analyze gaps. Sequences were aligned and gaps analyzed for length and number. Manual gap closure was performed on contigs covering two BACs, and the analysis was repeated.


Simple sequence repeats (SSRs) are highly polymorphic sequences found throughout plant and animal genomes. SSR repeat unit lengths are easily detectable and have made SSRs a popular type of codominant molecular marker for accelerated breeding. Syd analysis revealed a total of 48,351 dinucleotide SSRs (eight repeat units minimum), trinucleotide SSRs (five repeat units minimum), and tetranucleotide SSRs (four repeat units minimum), or about one SSR every 8000 bp (Table 2). Di-, tri-, and tetranucleotide SSRs account for 24%, 59%, and 17%, respectively, of the SSRs found in rice. The frequency of specific SSRs is not random nor representative of the genome GC content. The most frequent dinucleotide SSR is AG/CT and variants, representing 58% of all dinucleotides. The most frequent trinucleotide is CGG/CCG and variants, at 44% of all trinucleotide SSRs. ATCG/CGAT is the most common tetranucleotide repeat unit (Web site link 4). More than 7000 SSRs were found in predicted genes. Most of these SSRs (92%) are trinucleotides, so length changes should maintain the open reading frame. In addition to SSRs, ∼38 Mbp of long repetitive DNA and 150 Mbp of short repetitive DNA were identified. A detailed analysis of the repetitive DNA, retrotransposons, organellar DNA, rDNA, and tRNA genes, as well as miniature inverted-repeat transposable elements (MITEs) in the rice genome, can be found at Web site link 5.
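The minimum repeat-unit thresholds given above (eight for dinucleotides, five for trinucleotides, four for tetranucleotides) are enough to sketch a simple SSR scan. The following is a minimal illustration using regular-expression backreferences; it is not the software used for the Syd analysis, and the function names and the rule for skipping homopolymers and tetranucleotide motifs that are really dinucleotide repeats are assumptions.

```python
import re
from collections import Counter

# Minimum repeat units per motif length, as stated in the text.
MIN_UNITS = {2: 8, 3: 5, 4: 4}


def find_ssrs(seq):
    """Yield (start, motif, units) for di-, tri-, and tetranucleotide SSRs."""
    seq = seq.upper()
    for k, min_units in MIN_UNITS.items():
        # A k-base motif followed by at least (min_units - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_units - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Assumption: ignore homopolymers and 4-mers such as AGAG,
            # which are dinucleotide repeats in disguise.
            if len(set(motif)) == 1 or (k == 4 and motif[:2] == motif[2:]):
                continue
            yield m.start(), motif, len(m.group(0)) // k


def ssr_class_counts(seq):
    """Count SSRs by motif length (2, 3, or 4)."""
    return Counter(len(motif) for _, motif, _ in find_ssrs(seq))


# Toy sequence with one SSR of each class.
demo = "TTT" + "AG" * 9 + "CCA" + "CGG" * 5 + "T" + "ATCG" * 4 + "AA"
for start, motif, units in find_ssrs(demo):
    print(start, motif, units)
```

Dividing the class counts by their total gives the kind of di-/tri-/tetranucleotide breakdown reported in the text. Grouping base-shifted variants (for example CGG, GGC, and GCG) would need an extra canonicalization step that is not shown here.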

Table 2

Di-, tri-, and tetranucleotide SSRs in Syd. Simple sequence repeats in Syd were identified computationally and classified as di-, tri-, or tetranucleotide SSRs. The most frequent SSR types do not include base-shifted variants (e.g., the frequency for CGG/CCG does not include GGC/GCC and GCG/CGC SSRs). For distributions with variants, see Web site link 4.


Prediction and classification of Syd genes.

Gene prediction in silico is an imperfect process (24), and no single gene-prediction algorithm was found to be highly accurate on Syd; therefore, several approaches were combined to identify genes. Five different combinations of gene prediction programs and training models were used to identify genes (Web site link 6). Additional genes were predicted on the basis of homology to plant and fungal genes. Timelogic's Decypher FrameSearch algorithm was used to detect and guide the correction of frameshifts caused by indels. For each predicted gene, the fraction of the length with homology to known genes, predicted genes from other species, Prosite motifs (25), or Pfam domains (26) was used as a confidence score. When predicted genes overlapped, the one with the highest confidence score was selected. Predicted genes were separated into three categories: high (Hgenes), with confidence scores of ≥75%; medium (Mgenes), with confidence scores from 1 to 75%; and low (Lgenes), with confidence scores of <1%. Predicted genes were often found to be incomplete, lacking either an NH2-terminal or COOH-terminal coding region (Web site link 6).
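The confidence categories and the overlap rule described above can be sketched as follows. This is a hypothetical illustration: the `PredictedGene` record, the greedy strategy for resolving overlaps, and the choice to ignore strand are assumptions, since the text states only that the overlapping prediction with the highest confidence score was selected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictedGene:
    name: str
    start: int         # inclusive contig coordinate
    end: int           # inclusive contig coordinate
    confidence: float  # percent of gene length with supporting homology (0-100)


def confidence_class(score):
    """Map a confidence score to the H/M/L categories given in the text."""
    if score >= 75:
        return "H"
    if score >= 1:
        return "M"
    return "L"


def resolve_overlaps(genes):
    """Greedy sketch: visit genes from highest to lowest confidence and keep
    each one only if it does not overlap a gene that was already kept."""
    kept = []
    for g in sorted(genes, key=lambda g: g.confidence, reverse=True):
        if all(g.end < k.start or g.start > k.end for k in kept):
            kept.append(g)
    return sorted(kept, key=lambda g: g.start)


predictions = [
    PredictedGene("a", 1, 100, 80.0),
    PredictedGene("b", 50, 150, 40.0),   # overlaps "a" and has a lower score
    PredictedGene("c", 200, 300, 0.5),
]
for gene in resolve_overlaps(predictions):
    print(gene.name, confidence_class(gene.confidence))
```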

The number of genes identified in Syd depends on the minimum gene length chosen. More than 78% of Lgenes are shorter than 500 bp, whereas only 42% of Mgenes and 28% of Hgenes are shorter than 500 bp. Including Hgenes and Mgenes longer than 500 bp and Lgenes longer than 1000 bp yields 17,164 Hgenes, 12,030 Mgenes, and 3083 Lgenes, for a total estimate of 32,277 genes (Web site link 6). Alternatively, if the minimum gene length for Lgenes is also set at 500 bp, then the total estimate is 40,884 genes. The more inclusive set of 61,668 HMLgenes longer than 300 bp (HMLgenes300) was used for the rest of this study, unless noted otherwise. Determining the definitive number of rice genes will require substantial efforts in functional genomics.
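To make the dependence on length cutoffs concrete, the sketch below counts genes per confidence class under class-specific minimum lengths and rechecks the totals quoted above. The `count_genes` helper and the toy gene list are hypothetical; only the reported counts come from the text.

```python
from collections import Counter


def count_genes(genes, min_length):
    """genes: iterable of (confidence_class, length_bp) pairs.
    min_length: per-class cutoff in bp; a gene counts if it is longer."""
    return Counter(cls for cls, length in genes if length > min_length[cls])


# Counts reported for the stricter cutoffs (H and M > 500 bp, L > 1000 bp).
reported = {"H": 17_164, "M": 12_030, "L": 3_083}
strict_total = sum(reported.values())

# Relaxing the L cutoff to 500 bp gives the second reported total, so the
# implied number of Lgenes longer than 500 bp follows by subtraction.
relaxed_total = 40_884
l_genes_over_500 = relaxed_total - reported["H"] - reported["M"]

toy = [("H", 1200), ("H", 450), ("M", 700), ("L", 800), ("L", 1500)]
print(count_genes(toy, {"H": 500, "M": 500, "L": 1000}))
print(count_genes(toy, {"H": 500, "M": 500, "L": 500}))
print(strict_total, l_genes_over_500)
```

The subtraction implies that about 11,690 Lgenes exceed 500 bp, of which only 3083 exceed 1000 bp, which is why the estimate is so sensitive to the cutoff applied to low-confidence predictions.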

Translated HMLgenes300 were classified with the software package INTERPRO (27, 28). INTERPRO output was filtered to create sets of the longest protein domain for each associated protein, and domains were categorized using Gene Ontology (GO) software (29). The results of these classifications are shown in Fig. 1; about 44% of Hgenes, 32% of Mgenes, and 5% of Lgenes were classified. Most of the classified proteins fall into the categories of metabolism and cell communication/signal transduction.

Figure 1

Rice gene prediction classifications. HMLgenes300 were classified with INTERPRO and GO software (27–29); the categories generated are shown.

Gene and chromosomal duplications.

Global duplication of predicted genes was determined by comparison (using BLAST) of all Hgenes and Mgenes (37,777) longer than 300 bp against one another. Of these, 77% (29,226) were found to be homologous to at least one other predicted gene (TBLASTX E value ≤ 10^−20). HMgenes300 comprise about 15,000 distinct gene families, similar to the 11,000 to 15,000 families predicted for Arabidopsis (13), Caenorhabditis elegans (30), and Drosophila (16). Local duplications were determined by comparing Hgenes300 and Mgenes300 that were mapped to a single BAC contig. A total of 791 BAC contigs, averaging 500,000 bp, contained 25,728 genes. The fraction of locally duplicated genes ranged from 15.4 to 30.4%, depending on the chromosome (Table 3).
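One common way to turn all-against-all homology hits into gene families is single-linkage clustering, sketched below with a union-find structure. The text does not say how families were delineated for Syd, so the clustering rule is an assumption, as is reading the stated threshold as an E value of 1e-20; the function name and the toy hits are hypothetical.

```python
def gene_families(genes, hits, max_evalue=1e-20):
    """Single-linkage clustering of genes from pairwise homology hits.

    genes: iterable of gene identifiers.
    hits: iterable of (gene_a, gene_b, evalue) tuples from an all-vs-all search.
    Returns a list of families, each a list of gene identifiers.
    """
    genes = list(genes)
    parent = {g: g for g in genes}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue in hits:
        if a != b and evalue <= max_evalue:
            parent[find(a)] = find(b)

    families = {}
    for g in genes:
        families.setdefault(find(g), []).append(g)
    return list(families.values())


toy_genes = ["g1", "g2", "g3", "g4", "g5"]
toy_hits = [("g1", "g2", 1e-50), ("g2", "g3", 1e-25), ("g4", "g5", 1e-5)]
families = gene_families(toy_genes, toy_hits)
with_homolog = sum(len(f) for f in families if len(f) > 1)
print(sorted(sorted(f) for f in families))
print(with_homolog / len(toy_genes))
```

The final ratio is the toy-data analogue of the 77% of predicted genes reported to have at least one homolog. Single linkage tends to merge families that share a common domain, so family counts obtained this way are best treated as rough.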

Table 3

Local duplication of Hgenes300 and Mgenes300 in Syd. Genes from individual BAC contigs were compared to each other (TBLASTX), and pairs with E values ≤ 10^−20 were defined as paralogs representing locally duplicated genes.


Chromosomal duplications were identified by comparison (BLASTN) of more than 2000 mapped rice cDNA markers (31) to the anchored portion of Syd. At a minimum of 80% identity over 100 bp, 851 markers (41%) were single loci, 509 (24%) had two copies, and the remainder (35%) had three or more copies. Duplications were plotted by chromosomal region (Web site link 7). The smallest conserved evolutionary units (SCEUS) method (32, 33) was applied to determine the extent of genome duplication. Most SCEUS carry four or fewer mapped markers, suggesting extensive recombination/rearrangement since the duplication events.

The amino acid substitution rate (dA) was used to estimate the age of genome duplications (34–37). Whereas a maize whole-genome duplication is reported to have occurred 11.4 million years ago (38), an apparent rice whole-genome duplication occurred 40 to 50 million years ago (Web site link 8). The largest chromosomal duplication is on chromosomes 11 and 12 (31, 39). In an effort to estimate the age of this duplication, proteins on chromosome 11 were compared to their homologs on chromosome 12. The distribution of dA values indicates a major duplication about 25 million years ago (Web site link 9) (34).

Syd genes compared to Arabidopsis genes.

Eighty-five percent of Arabidopsis predicted proteins (21,590 of 25,554) were significantly homologous to HMLgenes300 predicted proteins; of these, 2565 show very strong conservation between Arabidopsis and rice (Fig. 2). The overall mean identity of Arabidopsis proteins to rice contigs is 49.5%, with a mode of 33% (Web site link 10). About 30% of the highly conserved genes are classified as “hypothetical,” “unknown,” or “putative” (13), which suggests that many important plant proteins remain completely uncharacterized. One-third (∼8000) of Arabidopsis predicted proteins are found in rice, but not in Drosophila, C. elegans, Saccharomyces, or sequenced bacterial genomes. These genes are likely to represent the plant-specific set. About 4000 predicted Arabidopsis proteins lack significant homology to HMLgenes300 or to Syd. Most of these are classified as “hypothetical” or “unknown” and are not found in genome sequences of other organisms, which suggests that some may be inaccurately predicted and others could be dicot-specific. Homologs of more than 13,000 HMLgenes300 are not found in other nonplant sequenced genomes but are found in the Arabidopsis genome. About 3886 Hgenes300 and 31,387 HMLgenes300 are not found with significant homology (BLASTP E value ≤ 10^−6) in the Arabidopsis genome, but most of these are low-evidence predictions.

Figure 2

Similarity of 25,554 Arabidopsis proteins and best rice homologs. Predicted Arabidopsis proteins (October 2001) were compared (BLASTP E value ≤ 10^−6) with HMLgenes300 translations. The expectation values range from E < 10^−180 (high homology) to E > 10^−6 (low homology) and are depicted in intervals spanning 10 exponents (e.g., < −180, −180 to −171, −170 to −161, etc.).

The Arabidopsis genome is reported to lack several classes of genes found in other sequenced eukaryotic genomes (13). These gene classes are also not found in Syd, nor are members of the following families: nuclear steroid receptor, Janus kinase (JAK)/signal transducers and activators of transcription (STAT), Notch/Lin12, transforming growth factor–β/SMAD, Rel, forkhead/winged helix, POU, IF, Wingless/Wnt, caspase, p53, and Hedgehog. Specific gene classes are overrepresented in rice and Arabidopsis. For example, RING zinc finger proteins and F-box domain proteins are overrepresented in the Arabidopsis genome relative to yeast, Drosophila, and C. elegans (13). These proteins are involved in intracellular protein degradation pathways and their regulation. RING zinc finger and F-box domain proteins are also overrepresented in Syd. More than 840 predicted proteins in HMLgenes300 were found to contain RING zinc finger domains, and 150 F-box–containing proteins were identified. This finding is consistent with the speculation that protein turnover and the regulation of protein degradation in plants play an important role in the maintenance of plant homeostasis.

The small phytochrome gene family is underrepresented in rice relative to Arabidopsis. Syd contains only three phytochromes (phyA, phyB, and phyC; Web site link 11) of the five genes (phyA to phyE) found in Arabidopsis; this finding confirms previous work demonstrating that grasses contain a subset of the five phytochromes found in dicotyledonous plants (40). The absence of phyD and phyE, which are partially redundant in function with phyB in Arabidopsis, has intriguing evolutionary and regulatory implications, and it suggests that increased gene number in rice relative to Arabidopsis is not uniform across the genome.

Synteny between rice and Arabidopsis.

Rice and Arabidopsis diverged from a common ancestor about 200 million years ago (12). Although the existence of Arabidopsis-rice synteny has been controversial (41, 42), evolutionary models based on estimated mutation rate predict some syntenic relationships between distantly related species such as Arabidopsis and rice (43). To address this issue, all annotated Arabidopsis proteins were compared to anchored Syd sequence contigs. This approach links Arabidopsis proteins to related mapped rice sequences, forming syntenic groups (44). When a 99.9% significance threshold is applied, 137 Arabidopsis-rice syntenic groups are found at 75 rice chromosomal locations (Table 4) (44) throughout the genome, with no discernible pattern. This is a conservative estimate; reducing the significance threshold to 99% increases the number of syntenic groups to 508 (Table 4). Of the 137 high-confidence syntenic groups, the largest mapped to Arabidopsis chromosome 5 (from 0 to 26 Mbp) and rice chromosome 4 (from 116 to 129 cM). This syntenic block includes 119 Arabidopsis proteins. The predicted roles of these proteins do not suggest an obvious reason for their conservation.

Table 4

Rice-Arabidopsis synteny. Rice and Arabidopsis syntenic groups were detected at various significance thresholds (44).


Within the 137 high-confidence syntenic groups, several rice blocks map to more than one site in the Arabidopsis genome. One such block maps to all five Arabidopsis chromosomes, 8 map to four Arabidopsis chromosomes, 10 map to three Arabidopsis chromosomes, 14 map to two Arabidopsis chromosomes, and the remaining 42 map to a single Arabidopsis chromosome. This observation suggests that multiple rounds of duplication occurred within the Arabidopsis genome, and it is consistent with the results of studies comparing distantly related dicot pairs such as tomato-Arabidopsis (45) and soybean-Arabidopsis (46).

Syntenic protein pairs are two proteins found in close proximity in both rice and Arabidopsis, excluding tandem duplications. Only 2% of the syntenic protein pairs on Arabidopsis chromosome 5 are adjacent to one another (Web site link 12). Fifty-two percent of the syntenic protein pairs are separated by 1 to 150 intervening proteins. This distribution of related protein pairs in rice and Arabidopsis is not random, providing further support for a syntenic relationship between Arabidopsis and rice. Selective gene loss and large-scale chromosomal duplication during Arabidopsis genome evolution (45) could be responsible for the distribution observed.

These observations support previous hypotheses that detectable synteny exists between monocots and dicots even after 200 million years of divergence, although the conservation is less extensive than previously predicted (43). The rice and Arabidopsis genomes are rearranged to such an extent that constructing a monocot-dicot comparative framework based on these two genomes would be difficult. The low but detectable synteny between rice and Arabidopsis may provide clues to orthologous gene identification in future functional genomics studies.

Genes involved in disease resistance: Conservation between Arabidopsis and rice.

Disease resistance genes (R genes) are responsible for early and specific recognition of pathogen attack and initiation of signal transduction, leading to deployment of defense mechanisms (47). R genes fall into two major and three minor structural classes. The largest class of known R gene products contains characteristic nucleotide binding (NB) sites, leucine-rich repeats (LRRs), and an apoptosis-resistance-conserved (ARC) domain. The rice genome has ∼600 genes that encode proteins with clear NB-ARC homology (48). In dicots, NB-LRR genes that encode TIR (Toll–interleukin 1 receptor resistance) motifs at their NH2-termini are very common. For example, among 128 NB-LRR genes in the Arabidopsis genome, 85 belong to the TIR-NB-LRR subclass (13). In contrast, the rice genome lacks obvious TIR-encoding genes. The regions NH2-terminal to the NB-ARC domains in five predicted rice NB-LRR proteins have very weak homologies to TIR. Possibly these domains function like TIR domains in dicots but are highly diverged. Clearly, TIR-NB-LRR genes are not a major class of R genes in rice. This class likely evolved after the divergence of monocots and dicots.

R genes encoding extracellular LRRs and either short cytoplasmic tails or COOH-terminal serine-threonine protein kinase domains constitute another major class. Rice has ∼450 extracellular LRR genes, and about half encode COOH-terminal protein kinase domains. In contrast to NB-LRR genes, this structural class is known to include proteins with functions unrelated to disease resistance (47). Minor classes of R genes include those encoding the cytoplasmic serine-threonine kinases Pto and PBS1, each of which has 14 rice homologs; Hs1pro-1, which has one rice homolog; and RPW8, which has one rice homolog.

Several plant genes controlling disease resistance signal transduction cascades are known, mainly from Arabidopsis (49). The site of action of these genes in the signal transduction network is presented at Web site link 13. One rice homolog was found for each of the Arabidopsis disease signal transduction genes NDR1, PAD4, and EDS1, as well as the barley gene RAR1. Three rice homologs were found for COI1, a gene required for responses to the signal molecule jasmonic acid; six for NPR1, a gene required for responses to the signal molecule salicylic acid (one closely related and five distantly related); and six for LSD1, a gene required for control of programmed cell death (Web site link 14). Many rice homologs for the Arabidopsis mitogen-activated protein kinase (MAPK) gene MPK4 and the MAPK kinase kinase gene EDR1 were identified, preventing the assignment of putative orthologs. No rice gene similar to Arabidopsis SNI1 was found. Nearly all genes known to control disease resistance responses in Arabidopsis have putative orthologs in rice, suggesting extensive conservation of disease resistance signaling between monocots and dicots.

Flowering-time and flower development genes.

Flowering in Arabidopsis is initiated by flowering-time genes that activate floral meristem identity genes, leading to the patterned expression of floral organ identity genes (Fig. 3). Rice contains single-copy homologs of the Arabidopsis flowering-time genes GI, CO, LD, and FCA (Web site link 15). The rice CO-like flowering-time gene Hd1 (50) is more similar to the uncharacterized Arabidopsis CO-related gene At3g02380 than to CO. Tandemly duplicated FRI homologs exist in rice, and they are more similar to the Arabidopsis FRI-related gene At5g48390 than to FRI. This suggests that FRI homologs in rice and Arabidopsis play distinct roles not necessarily restricted to the vernalization response. Four rice homologs of the Arabidopsis GA1 gene that encodes a rate-limiting step in GA biosynthesis were identified (Web site link 16). The FT/TFL gene family encodes proteins with homology to phosphatidylethanolamine binding proteins that have been shown to be involved in major aspects of whole-plant architecture (51, 52). The rice genome contains 17 members of the FT/TFL gene family; one member is most similar to TFL, and nine are more similar to FT (Web site link 17). Putative orthologs of FLC and AGL20, both MADS domain genes, could not be identified among the large rice gene family. Genes from other cereals that have been shown to affect flowering time, including Id1 in maize (53) and Rht-B1/d8 in wheat and maize (54, 55), are represented by clear homologs in the rice genome. Although rice homologs for most of the Arabidopsis flowering-time genes can be identified, it is currently not clear whether the genetic network that integrates them is conserved.

Figure 3

Flowering-time and flower development genes in the rice genome. A simplified model shows the predicted genetic network regulating flowering time and flower development in Arabidopsis, with gene names color-coded to indicate clear identification of an ortholog in the rice genome (red) or no clear identification (white). In Arabidopsis there are three genetic pathways that control flowering time (100, 101). The long-day pathway represented by GI and CO and the autonomous pathway represented by LD, FCA, and FLC are likely integrated through FT and AGL20 to promote activation of meristem identity genes LFY, AP1, and CAL. The vernalization pathway, represented by FRI, feeds into the autonomous pathway upstream of FLC. The GA pathway, represented by GA1, leads to the activation of LFY. TFL serves to restrict the expression of the meristem identity genes to floral meristems, where they promote the patterned expression of floral organ identity genes AP2, AP3, PI, and AG, which are also affected by the regulatory genes ANT, UFO, and SUP (102, 103).

Cereal and Arabidopsis flowers differ in perianth structure and in their arrangement on the flowering stem, but they likely develop under similar genetic control mechanisms (56). Rice homologs of the Arabidopsis floral meristem and organ identity genes LFY, AP3, PI, and AG have been described [RFL (57), osMADS16 (58), osMADS2 and osMADS4 (59, 60), and osMADS3 (60), respectively]. A search of Syd with the Arabidopsis sequences confirmed that the rice genes previously identified are those most closely related to their Arabidopsis homologs. The rice osMADS7 and osMADS1 genes are the most similar to the Arabidopsis meristem identity gene AP1 (Web site link 18). Definitive rice homologs of the Arabidopsis CAL, UFO, and SUP genes could not be identified within the large rice gene families. The Arabidopsis AP2 gene has a single homolog in rice, whereas the AP2-domain gene ANT is represented by two homologs (Web site link 19). In agreement with previous studies demonstrating conservation of organ identity genes between monocots and dicots (56, 60), homologs of most Arabidopsis meristem and organ identity genes can be identified in rice. Functional analysis is required to demonstrate conservation of specific regulatory activities for the remaining genes (57).


Sequence homology suggests that about 25% of rice genes are involved in metabolism. Syd contains genes for all central metabolic processes (glycolysis; citric acid cycle; pentose phosphate pathway; photosynthesis and respiration; synthesis and degradation of amino acids, nucleotides, fatty acids, and lipids; cofactors; carbohydrates; cell wall materials) and nutrient exchange (assimilation of carbon, nitrogen, sulfur, and phosphorus; absorption of minerals). In rice, as in Arabidopsis, extensive gene redundancy exists across all metabolic pathways (Web site link 20). Multiple-copy genes may facilitate the tightly regulated expression of specific isoenzymes in specialized tissues, at certain developmental stages, or in response to environmental challenges (61, 62). Large gene families exist for enzymes putatively involved in the biosynthesis of secondary metabolites. These structurally diverse compounds are generated by only a few types of reactions (63), which are catalyzed by (i) enzymes forming core structures (e.g., chalcone synthase, encoded by a gene family with four members in Arabidopsis and 26 members in rice), (ii) redox enzymes (e.g., cytochrome P450s, encoded by >250 genes in Arabidopsis and >200 genes in rice), and (iii) substitution enzymes (e.g., O-methyltransferases, encoded by a gene family with >15 members in Arabidopsis and >10 members in rice). Metabolic diversity in plants is partly due to the occurrence of multifunctional enzymes. For example, terpene synthases (encoded by a gene family with >40 members in Arabidopsis and >15 members in rice) are known for their ability to synthesize multiple products from a single substrate (64), and 2-oxoglutarate–dependent dioxygenases (encoded by a gene family of >80 members in Arabidopsis and >50 members in rice) can typically accept multiple substrates and produce multiple products (65).

Specific classes of secondary metabolites produced in different plant lineages function as signal molecules, attract pollinators, or defend against herbivores and pathogens (66). For example, rice synthesizes sakuranetin (a flavanone); momilactones A and B (diterpenes); and oryzalexins A, B, C, D, E, F, and S (diterpenes) as the predominant antifungal phytoalexins (67). In contrast, pathogen-challenged Arabidopsis accumulates glucosinolate-derived isothiocyanates and the indole camalexin (68, 69). Interestingly, the genomes of rice and Arabidopsis contain gene families putatively encoding strictosidine β-glucosidase, berberine-bridge enzyme, and strictosidine synthase (Web site link 21). Biochemically characterized members of these enzyme families are involved in several pathways of alkaloid biosynthesis that are not known to operate in rice or Arabidopsis (70). However, repeated evolution (a process that leads to orthologous or paralogous genes with modified biochemical functions) appears to be a common theme in secondary metabolism (66). Hence, it is often impossible to assign the catalytic function of a novel enzyme solely on the basis of sequence similarity. Enzymes encoded by such gene families should be regarded as representatives of enzyme classes with common catalytic mechanisms (e.g., berberine-bridge enzyme is a C-C bond-forming oxidoreductase), the functions of which need to be determined biochemically.

Phosphate transporters in the rice genome.

Improved nutrient assimilation is becoming increasingly important for modern crops. A subset of transporters involved in the acquisition and translocation of phosphate (Pi) is important for uptake of this often limiting macronutrient. Plants have both high- and low-affinity phosphate transporter systems (71, 72). Low-affinity phosphate transporters are constitutively expressed and operate at millimolar Pi concentrations, whereas most high-affinity phosphate transporters are transcriptionally induced at limiting phosphate concentrations and operate at micromolar concentrations. High- and low-affinity Pi transporter genes have been cloned from a number of different plant species by homology to known Pi transporters from yeast (73, 74). For the high-affinity transporters, six genes from A. thaliana, three from Solanum tuberosum, one from Nicotiana tabacum, two from Lycopersicon esculentum, two from Medicago truncatula, and one from Catharanthus roseus have been reported (71, 75). In contrast, only one low-affinity Pi transporter, preferentially expressed in leaves, has been isolated from A. thaliana (76). Plant Pi transporter genes were identified and compared by searching Syd and the Arabidopsis genome for homologs of Arabidopsis high-affinity transporter AtPT1 (77, 78) and low-affinity transporter Pht2;1 (76, 79). Arabidopsis and Syd were found to have 9 and 13 members, respectively, of the high-affinity Pi transporter gene family (Web site link 22). Both Syd and Arabidopsis have only one detectable low-affinity Pi transporter gene (Web site link 23).

Transcription factors in the rice genome.

The complement of rice transcription factors (TFs) appears quite similar to that of Arabidopsis and shows many of the same overall biases of family types and numbers (80). A TBLASTN comparison of the TRANSFAC database against HMLgenes revealed 1306 TF genes (Table 5). This number is similar to the 1533 TFs reported in Arabidopsis (81), although it is probably an underestimate because some plant-specific TF families (e.g., Aux/IAA, NAC, GARP) are not represented in TRANSFAC. BLAST analysis was used to estimate the TF family sizes reported (80).

Table 5

Transcription factor (TF) classes. Family sizes of select TF classes found in Syd are based on homology with genes in TRANSFAC (80).

The MYB superfamily in rice is quite large relative to other sequenced organisms, with 156 readily identifiable genes, although smaller than the family of 190 MYB and MYB-related genes identified in Arabidopsis (81). The MADS box family, which also appears to have been amplified in plants, comprises 71 genes in rice, comparable to the 82 genes found in Arabidopsis. The C2H2 zinc finger class in rice has 125 members, close to the number in C. elegans and slightly more than in Arabidopsis.

Rice harbors all other plant-specific TF families identified in Arabidopsis. The C2C2 zinc finger class comprises several subtypes of proteins, including GATA/CO, Dof, and YABBY. Rice has 36 members identified as GATA/CO, whereas Arabidopsis has 61. The single zinc-finger Dof family comprises 21 proteins in rice and 36 in Arabidopsis. The YABBY family appears quite limited in rice, with only four identifiable members. Eighty-three WRKY proteins were identified in rice, slightly more than the 72 found in Arabidopsis. The Ap2/ERF/RAV family appears to have a similar number of members in rice (143 genes) and Arabidopsis (144 genes). This comparison alone provides limited functional information. However, as described above, queries with Arabidopsis TFs of known function can identify candidate rice orthologs with considerable similarity.
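The family sizes quoted above can be put side by side as rice-to-Arabidopsis ratios. The following is a minimal sketch that uses only the counts stated in the text; the dictionary and function names are illustrative and are not part of the original analysis.

```python
# Family sizes quoted in the text, as (rice, Arabidopsis) gene counts.
FAMILY_SIZES = {
    "MYB": (156, 190),
    "MADS box": (71, 82),
    "GATA/CO": (36, 61),
    "Dof": (21, 36),
    "WRKY": (83, 72),
    "Ap2/ERF/RAV": (143, 144),
}


def rice_to_arabidopsis_ratio(sizes):
    """Return {family: rice_count / arabidopsis_count} for each family."""
    return {
        family: rice / arabidopsis
        for family, (rice, arabidopsis) in sizes.items()
    }


ratios = rice_to_arabidopsis_ratio(FAMILY_SIZES)

# Print families from the most reduced in rice to the most expanded.
for family, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{family:12s} {ratio:.2f}")
```

A ratio below 1 indicates fewer identified members in rice than in Arabidopsis. Because the rice counts depend on TRANSFAC coverage, the ratios describe what was detected, not true family sizes.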

Rice as a model for other cereals.

Extensive synteny among cereals (7, 82–84) allows integration of their genetic and physical maps. Sequence-based markers from syntenic regions of one cereal can be used for fine mapping and candidate gene identification across cereals. The small genome of rice provides the genomic foundation for all cereals, enabling efficient identification of orthologous genes, regulatory regions, and gene functions, and may facilitate the sequencing of other cereal genomes. The extent of gene conservation was determined by compiling a set of full-length, nonredundant complete coding sequences for each nonrice cereal species, and comparing these to Syd. At significant similarity levels, almost every cereal protein was found to have a related gene in rice (Table 6). At higher stringency, 80 to 90% of cereal gene queries identified rice homologs (Table 6). These observations suggest that most genes are conserved across cereals, and that phenotypic variation may be due to a small number of different genes or to functional differences within similar genes.

Table 6

TBLASTN comparison of rice versus other cereal proteins from GenBank. A set of full-length nonredundant cereal protein sequences was compiled using all available sequences from GenBank. Pairs of proteins with greater than 90% identity over an alignment of at least 100 amino acids were considered redundant and one of the two was removed.
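The redundancy rule in this caption can be sketched as a greedy filter over precomputed pairwise protein alignments. This is an illustration under stated assumptions, not the authors' procedure: the `PairHit` record, the `nonredundant` function, and the choice to drop the second member of each redundant pair are all hypothetical, since the caption does not say which of the two proteins was removed.

```python
from typing import Iterable, NamedTuple


class PairHit(NamedTuple):
    """One pairwise protein alignment (hypothetical record layout)."""
    a: str
    b: str
    pct_identity: float
    aln_length: int  # aligned amino acids


def nonredundant(ids: Iterable[str], hits: Iterable[PairHit],
                 min_identity: float = 90.0,
                 min_length: int = 100) -> list[str]:
    """Drop one protein from every pair above the redundancy thresholds.

    Assumption: the second protein of a redundant pair is removed, and
    pairs involving an already removed protein are ignored.
    """
    removed: set[str] = set()
    for hit in hits:
        if hit.a == hit.b or hit.a in removed or hit.b in removed:
            continue
        # "Greater than 90% identity over at least 100 amino acids".
        if hit.pct_identity > min_identity and hit.aln_length >= min_length:
            removed.add(hit.b)
    return [protein for protein in ids if protein not in removed]


example_hits = [
    PairHit("p1", "p2", 95.0, 150),  # redundant: p2 is dropped
    PairHit("p2", "p3", 99.0, 200),  # skipped: p2 is already removed
    PairHit("p1", "p3", 85.0, 300),  # below the identity threshold
]
kept = nonredundant(["p1", "p2", "p3"], example_hits)
```

The result of a greedy filter depends on the order in which pairs are visited, so a different ordering of the same alignments could retain a different representative.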

The level of synteny among cereals was determined by comparing anchored rice genomic sequence to mapped sequence from other cereals. Related regions of the rice and maize genomes were aligned (Fig. 4). Significant genomic alignment was also achieved using the limited number of available sequence-based mapped markers from other cereals (Web site link 24), consistent with previous reports (82, 85–87). Using such alignments, traits mapped in other cereals can be associated with rice sequences, facilitating identification of the underlying genes. About 2000 cereal quantitative trait loci (QTLs) have been mapped (88–98) and can be placed on the rice genome map en masse. For example, many maize QTLs were associated with the top of rice chromosome 1 by aligning maize chromosomes 1, 2, and 7 with this region (Fig. 5A). As a more specific example, a QTL influencing grain yield (QTL 21) that maps to maize chromosome 1 (99) was localized to the syntenic region of rice chromosome 3, containing ∼220 HMLgenes300 and more than 120 rice SSRs (Fig. 5B). With the use of these genes, ∼100 unmapped maize cDNAs were identified by homology and are therefore candidate genes influencing yield.
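Once a QTL has been placed on a rice chromosome interval, collecting the candidate genes amounts to an interval-overlap query against the gene annotation. The sketch below assumes a simple in-memory annotation; the `Gene` record, the function name, and all coordinates are hypothetical and do not come from the rice assembly.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """A predicted gene with its position (hypothetical record layout)."""
    name: str
    chromosome: int
    start: int  # base pairs, inclusive
    end: int    # base pairs, inclusive


def genes_in_interval(genes, chromosome, start, end):
    """Return names of genes overlapping [start, end] on one chromosome."""
    return [
        gene.name
        for gene in genes
        if gene.chromosome == chromosome
        and gene.start <= end
        and gene.end >= start
    ]


# Made-up annotation and interval, for illustration only.
annotation = [
    Gene("geneA", 3, 1_000, 2_500),
    Gene("geneB", 3, 9_000, 9_800),
    Gene("geneC", 1, 1_200, 1_900),
    Gene("geneD", 3, 4_900, 5_100),
]
candidates = genes_in_interval(annotation, chromosome=3, start=2_000, end=5_000)
```

A linear scan is adequate for a few hundred genes in one region; an interval index would be the usual choice for genome-wide queries.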

Figure 4

Rice-maize synteny. Maize markers were mapped to the rice genome in silico. Maize map and sequence information were derived from MaizeDB (610 markers) and GenBank, respectively. Maize chromosomes are indicated along the vertical black lines; positions of specific markers and bins are defined by horizontal lines. Rice chromosomes are represented by numbered, colored rectangles. Significant homology (at least 80% identity, over 100 continuous base pairs, between a maize chromosomal region and a particular rice region) is indicated by a colored rectangle to the right of the maize chromosome. For a more detailed version of this map, see Web site link 24.
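The homology criterion in this caption (at least 80% identity over 100 continuous base pairs) can be expressed as a filter over alignment hits. The sketch below is illustrative only: the `Hit` record and the `significant_hits` function are hypothetical names, and the input is assumed to be precomputed maize-versus-rice alignments, not the output format of any specific tool.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One maize-marker-to-rice alignment (hypothetical record layout)."""
    maize_marker: str
    rice_chromosome: int
    pct_identity: float
    aln_length_bp: int  # length of the continuous aligned stretch


def significant_hits(hits, min_identity=80.0, min_length_bp=100):
    """Map each maize marker to the rice chromosomes it matches.

    A hit counts only if it meets both thresholds from the caption.
    """
    by_marker = defaultdict(set)
    for hit in hits:
        if (hit.pct_identity >= min_identity
                and hit.aln_length_bp >= min_length_bp):
            by_marker[hit.maize_marker].add(hit.rice_chromosome)
    return {marker: sorted(chroms) for marker, chroms in by_marker.items()}


# Made-up hits, for illustration only.
example = [
    Hit("m1", 1, 85.0, 120),   # passes both thresholds
    Hit("m1", 3, 79.9, 500),   # identity too low
    Hit("m2", 3, 92.0, 99),    # aligned stretch too short
    Hit("m1", 5, 80.0, 100),   # passes exactly at the thresholds
]
placed = significant_hits(example)
```

Markers with no qualifying hit are absent from the result, which corresponds to maize positions without a colored rectangle in the figure.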

Figure 5

Maize QTLs mapped to the rice genome. (A) Rice-maize comparative QTL mapping. Portions of maize chromosomes, represented by numbered, colored rectangles, that show sequence similarity (at least 80% identity over 100 continuous base pairs) with specific regions of the top of rice chromosome 1 are shown. The rice map is from the IRGSP. Genetic distance is indicated by the numbers to the left of the rice chromosome (e.g., 1004.2 means 4.2 cM from the tip of chromosome 1); specific markers that map to this region are indicated to the right. Regions from maize chromosomes 1, 2, and 7 show similarity with the tip of rice chromosome 1 as shown, and maize QTLs in these regions are indicated. The region represented by the thick black line comprises ∼650 kbp in rice; each colored block represents varying amounts of maize DNA. (B) Detailed example of rice-maize comparative QTL mapping. Grain yield QTL 21 is mapped to maize map bin 1.03 between cDNA markers csu 710 and csu 392, and is syntenic with rice chromosome 3. Additional markers from the same maize bin confirm microsynteny in this target region, which contains ∼220 candidate genes and 120 SSR markers in rice. Dotted lines connect homologous genes with the indicated BLAST expectation values.


Efficient shotgun sequencing strategies have been applied to microorganisms and recently to organisms with larger genomes such as Drosophila, human, and mouse. The goal of this project was to create a database of mapped cereal genes and markers, and provide a foundation for cereal functional genomics studies. The rice genome was chosen as the appropriate model cereal genome and sequenced to greater than 99% coverage and accuracy. The resulting genomic information enables development of RNA profiling, proteomics, and accelerated crop breeding technologies. Homologs of most of the known cereal proteins were found in rice. Homologs of most predicted Arabidopsis proteins were also identified, although synteny between rice and Arabidopsis is limited. Several thousand genes were found to be present only in the Arabidopsis and rice genomes, and are candidates for plant-specific genes. Many rice genes were assigned putative roles via comparison with Arabidopsis genes. Biosynthetic enzymes, signal transduction proteins, developmental regulators, and specific ion transporters were readily identified in the rice genome. Assembled sequence data were aligned to the rice physical and genetic maps, and anchored to heterologous cereal maps. The resulting universal cereal map allows placement of most mapped cereal QTLs and assignment of trait candidate genes. The draft genome sequence described here provides a foundation for cereal genomics; however, highly accurate, finished sequence should remain the ultimate goal for plant science. Continued application of genomic and biotechnology tools to crop improvement will be necessary to meet future food, health, and material challenges.

The rice genome sequence is available at Copies of the agreements governing access to the data are available on Science Online ( and A summary of the agreements is available at the Science Online URL.

  • * To whom correspondence should be addressed. E-mail: stephen.goff{at}

  • Present address: Illumina Inc., 9885 Towne Centre Drive, San Diego, CA 92121, USA.

