Report

A Draft Sequence for the Genome of the Domesticated Silkworm (Bombyx mori)

See allHide authors and affiliations

Science  10 Dec 2004:
Vol. 306, Issue 5703, pp. 1937-1940
DOI: 10.1126/science.1102210

Abstract

We report a draft sequence for the genome of the domesticated silkworm (Bombyx mori), covering 90.9% of all known silkworm genes. Our estimated gene count is 18,510, which exceeds the 13,379 genes reported for Drosophila melanogaster. Comparative analyses to fruitfly, mosquito, spider, and butterfly reveal both similarities and differences in gene content.

Silk fibers are derived from the cocoon of the silkworm Bombyx mori, which was domesticated over the past 5000 years from the wild progenitor Bombyx mandarina (1). Silkworms are second only to fruitfly as a model for insect genetics, owing to their ease of rearing, the availability of mutants from genetically homogeneous inbred lines, and the existence of a large body of information on their biology (2). There are about 400 visible phenotypes, and ∼200 of these are assigned to linkage groups (3). Silkworms can also be used as a bioreactor for proteinaceous drugs and as a source of biomaterials. Here, we present a draft sequence of the silkworm genome with 5.9× coverage.

B. mori has 28 chromosomes. More than 1000 genetic markers have been mapped at an average spacing of 2 cM (∼500 kb) (4). A physical map is being constructed through the fingerprinting and end sequencing of bacterial artificial chromosome (BAC) clones (5). Many expressed sequence tags (ESTs) have been produced (6), and a 3× draft sequence has just been announced by the International Lepidopteran Genome Project (7). Our project is independent of, but complementary to, that of the consortium. Our sequence has been submitted to the DNA Data Bank of Japan/European Molecular Biology Laboratory/GenBank (project accession number AADK00000000, version AADK01000000) and is also accessible from our Web site (http://silkworm.genomics.org.cn) (8). ESTs discussed in this Report can be found at GenBank (accession numbers CK484630 to CK565104).

DNA for genome sequencing is derived from an inbred domesticated variety, Dazao (posterior silk gland, fifth-instar day 3, on a mix of 1225 males). A whole-genome shotgun (9) technique was used, and our coverage is 5.9×. Including the unassembled reads, the total estimated genome size is 428.7 Mb, or 3.6 and 1.54 times larger than that of fruitfly (10) and mosquito (11). The N50 contig and scaffold sizes are 12.5 kb and 26.9 kb. Our assembly contains 90.9% of the 212 known silkworm genes (with full-length cDNA sequence), 90.9% of ∼16,425 EST clusters, and 82.7% of the 554 known genes from other Lepidoptera. Additional details of our quality analyses are given in the supporting online material (fig. S1 and tables S1 to S6).

We developed a gene-finder algorithm BGF (BGI GeneFinder) (fig. S2), based on GenScan and FgeneSH. To determine a gene count for silkworm, one must correct for erroneous and partial predictions (Table 1). The final corrected gene count for silkworm is 18,510 genes, which far exceeds the official gene count of 13,379 for fruitfly (our BGF-based procedures predict 13,366 genes for fruitfly). We find that 14.9% of predicted genes are confirmed by ESTs (based on aligning the ESTs to the genome and looking for a 100–base pair overlap with the predicted exons); 60.4% and 63.1% are confirmed by similarity to fruitfly genes and GenBank nonredundant proteins (BlastP at 10–6 E-value). Overall, 69.7% are confirmed by at least one method.

Table 1.

Number of predicted genes from BGF. We show the initial count, the number of erroneous predictions, and the gene count after likely errors are removed. There are four successive filters, which include rules to remove TEs and pseudogenes, as described in the SOM Text. The final gene count is computed as row 1 minus the sum of rows 2 to 5. Predictions are classified into single-exon genes, partial genes (no head = no start, no tail = no stop, neither) or complete genes. We correct for partial genes by stipulating that each is worth only half a gene. The final corrected gene count is then 18,510.

Single exon No head No tail Neither Complete All genes Corrected
Total predicted 10,512 6,366 4,903 550 21,199 43,530 37,621
CDS < 100 bp or max exon score < 0.2 107 974 299 15 84 1,479 835
RepeatMasker TEs or copy number >10 7,334 2,233 2,111 124 7,575 19,377 17,143
Similarity to TE-associated proteins 132 71 68 7 294 572 499
Processed “single-exon” pseudogenes 314 146 179 8 153 800 634
Final annotated 2,625 2,942 2,246 396 13,093 21,302 18,510

Not only did we find more genes in silkworm than in fruitfly, but we also found larger genes as a result of the insertion of transposable elements (TEs) in introns. For example, in calcineurin B (cnb), the silkworm gene was 12 times as large as that of fruitfly. To generalize, we compared annotations, found reciprocal best matches, and computed gene size ratios. Because prediction errors are unlikely to be alignable across species, we restricted our analysis to aligned regions, giving us a mean (median) ratio of 2.29 (2.75) (Fig. 1). This combination of more and bigger genes can explain 86% of the factor of 3.67 increase in genome size from fruitfly (116.8 Mb) to silkworm (428.7 Mb). Silkworm genes also had slightly more exons than fruitfly, with a mean (median) ratio of 1.15 (1.12) for number of exons per gene.

Fig. 1.

Comparison of gene size in silkworm-fruitfly orthologs. We use reciprocal best matches, and calculate a ratio over the aligned portion. Size is shown with (gene size) or without (CDS size) introns. The minor peak is due to single-exon alignments.

As shown by our TE annotations, most of this increase in the genome size of silkworm is relatively recent. Of the 21.1% of the genome that is recognizable as being of TE origins, 50.7% is from a single gypsy-Ty3–like retrotransposon (12) (table S7). Mean sequence divergence is 7.7%, which dates the initial appearance of this TE to 4.9 million years ago, if we use the fruitfly neutral rate of 15.6 × 10–9 substitutions per year (13). Most other TEs are comparably recent in origins (fig. S3). GC-rich regions contain a higher density of TEs, particularly LINEs (long interspersed nuclear elements), which is the exact opposite of what is reported for the human and mouse genomes.

Unlike silkworm, which is a lepidopteran, fruitfly and mosquito are dipterans. The two insect orders diverged about 280 to 350 million years ago (14). Comparisons of their genome content were done at the level of InterPro domains. Functional assignments were mapped according to Gene Ontology (GO). Domain clustering (15) (table S8) produced 8947 groups, with 2565 shared among insects and 1793 unique to silkworm (Fig. 2). Consistent with the observed TE expansion, domains like reverse transcriptase, integrase, and transposase stand out for their prevalence in silkworm. A complete list of predicted silkworm genes is shown in table S9, with a special indexing table for the genes discussed in this paper.

Fig. 2.

InterPro domain clusters shared among or unique to all possible combinations of silkworm, fruitfly, and mosquito. Clusters are constructed with the algorithm detailed in table S8, which is based on a similar earlier analysis (14).

The silk gland, essentially a modified salivary gland, is a highly specialized organ whose function is to synthesize silk proteins. We identified a set of 1874 annotated genes that are confirmed by silk gland ESTs. Only 45 of these genes had been previously described in B. mori. GO function categories for silk gland and 11 other tissue libraries were compared (fig. S4). Several hormone-processing enzymes are active in silk gland, which is of interest because hormones participate in regulation of silk protein genes (16). Not counting low expressed genes undetectable at current EST depths, genes found only in silk gland include juvenile hormone (JH) esterase, ecdysone oxidase, and JH-inducible protein 1. Ecdysteroid UDP (uridine 5′-diphosphate)–glucosyl transferase is found in silk gland, testis, and ovary. Fibroin forms the bulk of the cocoon mass. It has two major components, a heavy (350 kD) and a light chain (25 kD). We found 1126 ESTs for the light chain, but only 4 ESTs for the heavy chain, suggesting that the one-to-one ratio for light and heavy chains is maintained at the post-transcription level. The heavy chain has five predominant amino acids: Gly (45.9%), Ala (30.3%), Ser (12.1%), Tyr (5.3%), and Val (1.8%). A complete tRNA gene set (table S10) was detected, including 41 Gly-tRNA and 41 Ala-tRNA, twice as many as in the other two insects and consistent with the requirements for fibroin production.

Another well-studied silk-secreting arthropod is the spider. We compared those 1874 genes expressed in B. mori silk gland with all available spider data (1482 from GenBank) and identified 107 homologs, including four B. mori counterparts for the major ampullate gland peroxidase in spider, which is involved in silk fiber formation (17).

We found 87 neuropeptide hormones, hormone receptors, and hormone-regulation genes. Drosophila melanogaster and Anopheles gambiae have 101 and 73 such genes, respectively. For B. mori, 52 genes were unknown, and 35 others were previously reported. Ecdysone oxidase and ecdysteroid UDP–glucosyl transferase (UGT) are implicated in ecdysone metabolism. We classified 20 UGT genes into five major clades (fig. S5), similar to the 34 UGT genes analyzed for D. melanogaster (18). Juvenile hormone (JH), ecdysone hormone (EH), and prothoracicotropic hormone (PTTH) work in coordination of ecdysis and metamorphosis. We identified 18 EH-sensitive receptors and receptor-like transcription factors. Four BRC Z4 genes contain intact DNA binding BTB domains. One has two additional zinc finger C2H2 type domains, with a zinc-coordinating cysteine pair and a histidine pair. These are involved in completing the larval-pupal transition, and later morphogenetic defects, or in programmed cell death of larval silk glands (19). We found many neuropeptide hormone genes too, like diapause hormone (DH), pheromone biosynthesis activating neuropeptide (PBAN), adipokinetic hormone (AKH), eclosion hormone, and bombyxin (4K-PTTH). In addition, diuretic hormone precursor and its receptor, allatotropin, and allatostatin were found. There was also a homolog to Lymnaea stagnalis neuropeptide Y precursor, a gene with pancreatic hormone activity that had not been detected in D. melanogaster and other insects and may therefore be new to silkworm.

Developmental genes for D. melanogaster have been extensively studied. We focused on 83 genes (20) that include 41 maternal genes, 12 gap genes, 9 pair-rule genes, 12 segment polarity genes, and 9 homeotic genes. The maternal genes are subdivided into four groups according to their function in patterning the early embryos (anterior, posterior, terminal, and dorsalventral). Only six genes [oskar, swallow, trunk, fs(1)k10, gurken, and tube], all from the maternal group, were not detected in B. mori. This confirms that the basic mechanism of development is largely conserved across insects. It had been reported that swallow and trunk have no homologs in A. gambiae. We find that tube has no homolog in A. gambiae. Loss of the other three genes is interesting. Localization of the maternal determinant oskar at the posterior pole of the D. melanogaster oocyte provides positional information for pole plasm formation (21). Gurken encodes a ligand for torpedo (Egf-r), which triggers dorsal differentiation (22), whereas fs(1)k10 is a probable negative regulator of gurken translation.

Lepidopteran wing patterning has stimulated a number of experimental studies. Although domesticated silkworm moths have long lost their ability to fly, as well as their colorful wing patterns, we expected that many of these genes would still be found in the sequence. We detected 18 silkworm homologs of wing-patterning genes from other Lepidoptera, primarily Junonia coenia. They include the Distal-less homeodomain gene, which affects eyespot number, positions, and sizes (23); Ubx, which represses Distal-less expression and leads to haltere formation in D. melanogaster, but may not act in the same manner in butterfly (24); Hh signaling pathway genes like Hh, Ci, En, and Ptc, which are important in eyespot focus formation; Wg, which plays a key role in band formation; and EcR, which is expressed in prospective eyespots and is coexpressed with Distal-less (25). Many of these genes are shared with the Diptera. Of the 323 wing-development genes known in D. melanogaster, 300 are found in silkworm. Most are well conserved, in that 87% and 56% align at E-values of better than 10–20 and 10–50.

Silkworm is a female-heterogametic organism (ZZ in male, ZW in female). Sex in B. mori is determined by a dominant feminizing factor on W, as compared to the intricate X:A counting system known in D. melanogaster. A homolog of the D. melanogaster sex-determining gene dsx has been isolated in B. mori. It is called Bmdsx. Although structural features and splice sites are conserved in these two genes, regulatory mechanisms are not (26). The splicing regulator tra was not identified in B. mori. Neither was the TRA/TRA2 binding site for Bmdsx, suggesting that the upstream sex-determining cascade for B. mori and D. melanogaster differ. However, homologs for most known sex-determining factors can be found. Among daughterless (da), hermaphrodite (her), extra macrochaetae (emc), groucho (gro), sisterless A (sisA), scute (sc), outstretched (os), deadpan (dpn), and runt (run) (27), homologs for da, emc, gro, sc, dpn, and run were identified in B. mori. For D. melanogaster, dosage compensation is known to equalize transcription of X-chromosome genes between sexes. At least six genes (msl-1, msl-2, msl-3, mle, mof, JIL-1) are required, and of these, homologs of mle, mof, and msl-3 were found in B. mori, despite the growing evidence for absence of Z-linked dosage compensation in B. mori (28). In these and other cases in which insect genes were not found in B. mori, wemanually checked our automated procedures (see SOM Text). However, further experiments will be needed, given the incompleteness of the genome and the level of homology needed for detection.

Humoral immune factors together with wound healing, homeostasis, and adaptive humoral immune responses are important components of immunity and defense in insects (29). We identified a total of 69 such genes, including 34 antibacterial genes, of which 23 appear to be newly identified. They encode the innate immune factors synthesized in fat bodies and hemocytes, which kill bacteria by permeabilizing their membranes. One of them is the Lepidopteran moricin, a highly alkaline antibacterial peptide initially isolated from B. mori. A new cluster of 8 moricin genes was found, with amino acid sequence identities of greater than 90% among members, but only 20% similarity to known moricins. Defensins specific to Gram-positive bacteria were found, as were cecropins (30). We detected a previously unknown class of cecropins. Other found genes related to insect defense include lysozymes, hemolin, lectins, and prophenoloxidases. As a member of the immunoglobulin (Ig) family, hemolin is unique to the Lepidoptera. Lectins are abundant, with 29 found in B. mori, compared to 35 and 22 in D. melanogaster and A. gambiae (31), respectively. We also identified three prophenoloxidases, of which two were previously known.

Lepidoptera are unusual because they have holocentric chromosomes with diffuse kinetochores. This characteristic is a potential driver of evolution because of the ability to retain chromosome fragments through many cell divisions. The nematode also has diffuse kinetochores, and five key chromosomal proteins are known (32, 33): hcp-1, hcp-2, hcp-3, hcp-4, and hcp-6. (The prefix hcp stands for “holocentric protein.”) Hcp-3 is detected in all eukaryotic centromeres, similar to histone H3 in its histone-fold domain, but dissimilar in its N-terminal region. It is also known as Cse4p in yeast, Cid in fruitfly, and CENP-A in human. Their proteins are highly diverged. The putative homolog in silkworm has only 23% identity to the histone-fold domain of hcp-3, but their lengths are similar: 268 amino acids for silkworm and 288 amino acids for nematode. There are many homologs of hcp-1 and hcp-2—18 and 72, to be specific—making it difficult to determine which ones might be the true orthologs. We could not find a homolog for hcp-4, but we did identify a homolog for a related gene that is known as CENP-C and was previously found in human, mouse, and chicken. Finally, we were not able to identify the silkworm homolog for hcp-6.

Supporting Online Material

www.sciencemag.org/cgi/content/full/306/5703/1937/DC1

SOM Text

Figs. S1 to S5

Tables S1 to S10

References and Notes

View Abstract

Navigate This Article