Supplemental Data


A Draft Sequence of the Rice Genome (Oryza sativa L. ssp. indica)
Jun Yu, Songnian Hu, Jun Wang, Gane Ka-Shu Wong, Songgang Li, Bin Liu, Yajun Deng, Li Dai, Yan Zhou, Xiuqing Zhang, Mengliang Cao, Jing Liu, Jiandong Sun, Jiabin Tang, Yanjiong Chen, Xiaobing Huang, Wei Lin, Chen Ye, Wei Tong, Lijuan Cong, Jianing Geng, Yujun Han, Lin Li, Wei Li, Guangqiang Hu, Xiangang Huang, Wenjie Li, Jian Li, Zhanwei Liu, Long Li, Jianping Liu, Qiuhui Qi, Jinsong Liu, Li Li, Tao Li, Xuegang Wang, Hong Lu, Tingting Wu, Miao Zhu, Peixiang Ni, Hua Han, Wei Dong, Xiaoyu Ren, Xiaoli Feng, Peng Cui, Xianran Li, Hao Wang, Xin Xu, Wenxue Zhai, Zhao Xu, Jinsong Zhang, Sijie He, Jianguo Zhang, Jichen Xu, Kunlin Zhang, Xianwu Zheng, Jianhai Dong, Wanyong Zeng, Lin Tao, Jia Ye, Jun Tan, Xide Ren, Xuewei Chen, Jun He, Daofeng Liu, Wei Tian, Chaoguang Tian, Hongai Xia, Qiyu Bao, Gang Li, Hui Gao, Ting Cao, Juan Wang, Wenming Zhao, Ping Li, Wei Chen, Xudong Wang, Yong Zhang, Jianfei Hu, Jing Wang, Song Liu, Jian Yang, Guangyu Zhang, Yuqing Xiong, Zhijie Li, Long Mao, Chengshu Zhou, Zhen Zhu, Runsheng Chen, Bailin Hao, Weimou Zheng, Shouyi Chen, Wei Guo, Guojie Li, Siqi Liu, Ming Tao, Jian Wang, Lihuang Zhu, Longping Yuan, and Huanming Yang

Supplementary Material


Web Supplement 1: Size, GC content, and AG (purine) content for exons, introns, and genes in multi-cellular eukaryotes with significant quantities of GenBank sequence data. One table (cDNA-to-genomic alignments) contains high-quality data based on cDNA-to-genomic alignments, restricted to genes where the entire cDNA is aligned. Only a few organisms can be studied this way, so we generated a second table (Parsed GenBank annotations) by parsing GenBank annotations, which are of admittedly uneven quality. Except for the A. thaliana, O. sativa, and H. sapiens data discussed in the main paper, the sequence data are mostly from GenBank release 123, April 15 2001. We define mean sequence content by $\mathrm{mean}_N = \sum_i GC_i / \sum_i 1$ and $\mathrm{mean}_W = \sum_i L_i \cdot GC_i / \sum_i L_i$, where $GC_i$ and $L_i$ represent the GC content and size of the i-th segment. For mean size, no such distinction is made, and $\mathrm{mean}_W$ is set equal to $\mathrm{mean}_N$. As a crude indication of the range of variation, we give size and sequence content for the 10th and 90th percentile. Since intron and gene size distributions are sensitive to contig/scaffold sizes, we indicate for each organism the N50 size above which half the genome sequence can be found. For D. melanogaster, we use Celera scaffolds because they give a reasonable estimate of the gap sizes between linked contigs. For H. sapiens, we had to break the scaffolds into their constituent contigs since the International Human Genome Sequencing Consortium does not estimate gap sizes. Contig/scaffold size is a major problem for mammalian genomes, which have extremely large genes, some up to a megabase or more. There is a method to compensate for contig/scaffold size in the estimate of mean intron and gene sizes [G.K.S. Wong, D.A. Passey, J. Yu, Genome Res. 11, 1672 (2001)], but since the method cannot be applied to all data sets, we chose not to use it here.
For the record, the corrected mean gene size is 72-Kb in human (and most mammals), but the cDNA-based human data have a mean gene size of only 31-Kb, and the annotation-based rat data have a mean gene size of just 4-Kb. Mean intron sizes are similarly distorted.
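
As an illustration of the definitions above, the following minimal Python sketch computes the unweighted mean, the length-weighted mean, and an N50 size. The function names (mean_n, mean_w, n50) and the sample values are hypothetical and are not taken from the supplement's tables; the N50 routine follows the common convention of sorting sizes in descending order, which we assume matches the description given here.

```python
from typing import Sequence


def mean_n(gc: Sequence[float]) -> float:
    """Unweighted mean: every segment counts once (sum of GC_i over sum of 1)."""
    return sum(gc) / len(gc)


def mean_w(gc: Sequence[float], lengths: Sequence[int]) -> float:
    """Length-weighted mean: each segment counts in proportion to its size L_i."""
    return sum(l * g for l, g in zip(lengths, gc)) / sum(lengths)


def n50(sizes: Sequence[int]) -> int:
    """Size S such that contigs/scaffolds of size >= S hold at least half the sequence."""
    total = sum(sizes)
    running = 0
    for size in sorted(sizes, reverse=True):
        running += size
        if 2 * running >= total:
            return size
    raise ValueError("sizes must be non-empty")


# Made-up example: a short GC-poor segment and a long GC-rich segment.
gc_values = [0.40, 0.60]
segment_lengths = [100, 300]
print(mean_n(gc_values))                   # treats both segments equally
print(mean_w(gc_values, segment_lengths))  # pulled toward the longer segment
print(n50([10, 20, 30, 40]))
```

The two means differ whenever segment size and GC content are correlated, which is why both are reported for sequence content.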


Web Supplement 2: Summary of 53,398 complete FGeneSH predictions with initial and terminal exons, classified by InterPro ("InterPro Classifications") and Gene Ontology Consortium ("Gene Ontology Consortium"). [See user instructions below before downloading these two compressed files.] Every rice gene is compared to A. thaliana by two methods. TblastN compares rice protein sequences to all six reading frames of the A. thaliana genome sequence, returning the quantities "extent of hit", "AA identity", and "hits per gene" that are depicted in the main paper as Figures 10 and 12. Extent of hit and AA identity are given as a percentage of the rice gene's coding region. BlastP performs a protein-to-protein comparison between the predicted gene sets for rice and A. thaliana, returning the functional assignments that are depicted in the main paper as Figures 9 and 13. Here, we reveal the "extent of hit" and "AA identity" for each assignment. Where possible, cDNA sequences are matched to rice and A. thaliana gene annotations, by DNA-to-DNA comparisons, and the number of identical bases is given as a percentage of the matched cDNA. The gene descriptions are adopted from rice cDNAs, arabidopsis cDNAs, or arabidopsis annotations, in that order of preference, with prefixes "rice-" or "arab-" to indicate when cDNAs are used. The predicted rice protein sequences are given in the rightmost column.
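
The order of preference for gene descriptions can be sketched as follows. This is only an illustration of the rule stated above: the function name pick_description and its arguments are hypothetical, and the exact way the prefix is joined to the description text in the Excel files is an assumption.

```python
from typing import Optional


def pick_description(
    rice_cdna: Optional[str],
    arab_cdna: Optional[str],
    arab_annotation: Optional[str],
) -> Optional[str]:
    """Return a description using rice cDNA, then arabidopsis cDNA, then annotation."""
    if rice_cdna:
        # Prefix marks that the description comes from a rice cDNA.
        return "rice-" + rice_cdna
    if arab_cdna:
        # Prefix marks that the description comes from an arabidopsis cDNA.
        return "arab-" + arab_cdna
    # Arabidopsis annotations carry no prefix; may be None if nothing matched.
    return arab_annotation


print(pick_description(None, "protein kinase", "putative kinase"))
```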


Web Supplement 3: Distributions for "number of genes" and "hits per gene", in rice and A. thaliana, decomposed by InterPro functional classifications. These are the counterparts of Figures 9 and 13 in the main paper. Note that only 15.9% of the rice genes and 27.3% of the A. thaliana genes are classified.


Instructions for downloading and decompressing data files (Web Supplement 2):

The two files that make up Web Supplement 2 are offered as compressed archives of Microsoft Excel files, in *.zip format. Users should download the compressed files to their machine and decompress them on their local hard drive, following the instructions below. The sizes of the compressed files are as follows:

Interpro_Classifications (13 MB .zip file; expands to 26 MB Excel file)
Gene_Ontology_Consortium (13 MB .zip file; expands to 26 MB Excel file)

  1. Create a temporary folder on your machine's hard drive.
  2. Save the .zip archive or archives to the temporary folder you created, using the links above.
  3. Decompress the compressed file in the temporary folder using decompression software such as WinZip (Windows; www.winzip.com) or StuffIt Expander (Windows and Mac; www.stuffit.com).
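
For users who prefer a script to a graphical tool, the steps above can be approximated with Python's standard zipfile module. This is a minimal sketch, not part of the original instructions: the function name extract_archive is hypothetical, and the archive file name shown in the usage comment is an assumption, since only the labels and sizes of the files are given here.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir=None):
    """Extract a .zip archive into dest_dir (or a new temporary folder)."""
    # Step 1: create a temporary folder if none was supplied.
    if dest_dir is None:
        dest = Path(tempfile.mkdtemp(prefix="web_supplement_2_"))
    else:
        dest = Path(dest_dir)
        dest.mkdir(parents=True, exist_ok=True)
    # Step 3: decompress the archive that was saved in step 2.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        names = archive.namelist()
    return [dest / name for name in names]


# Hypothetical usage, assuming the archive was saved under this name:
# files = extract_archive("Interpro_Classifications.zip")
```

Each archive expands to roughly twice its compressed size, so about 26 MB of free disk space is needed per file.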