A Drosophila Complementary DNA Resource


Science  24 Mar 2000:
Vol. 287, Issue 5461, pp. 2222-2224
DOI: 10.1126/science.287.5461.2222


Collections of nonredundant, full-length complementary DNA (cDNA) clones for each of the model organisms and humans will be important resources for studies of gene structure and function. We describe a general strategy for producing such collections and its implementation, which so far has generated a set of cDNAs corresponding to over 40% of the genes in the fruit fly Drosophila melanogaster.

Collections of full-length sequenced cDNAs corresponding to each gene in an organism are widely recognized to be of great utility (1). They allow expression of the encoded proteins in a variety of contexts, which facilitates comprehensive structural and functional studies. In addition, they allow the accurate prediction of gene structures, particularly of 5′ and 3′ untranslated regions (UTRs) that are refractory to computational prediction based on genomic DNA sequence alone. The first steps in producing such a collection are the generation of high-quality cDNA libraries and the identification of a full-length clone, or minimally a clone containing the full-length open reading frame (ORF), for each gene. Here we present a strategy that has so far allowed us to obtain such clones for over 40% of all Drosophila genes. We also discuss how clones corresponding to less highly expressed genes might be obtained.

Our approach is outlined in Fig. 1. We first constructed oligo(dT)-primed cDNA libraries from high-quality RNA isolated from a variety of developmental stages and tissues using well-established methods (2) (Table 1). We did not attempt to decrease the contribution of abundant mRNAs to these libraries by normalization because such protocols are difficult to perform without compromising cDNA length (3). We then generated expressed sequence tags (ESTs) (4) from the 5′ ends of 80,000 cDNAs (5). A comparison with the 13,600 genes predicted from the genomic sequence indicates that these ESTs represent 8900 different genes, 65% of all Drosophila genes (6).
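The gene-coverage figure quoted above is a simple ratio of distinct genes hit to genes predicted. The sketch below shows that bookkeeping under stated assumptions: the function name, the `(est_id, gene_id)` input shape, and the example identifiers are hypothetical and are not taken from the original analysis, which is described only in note (6).

```python
from typing import Iterable, Optional, Tuple


def gene_coverage(
    est_hits: Iterable[Tuple[str, Optional[str]]],
    n_predicted_genes: int,
) -> Tuple[int, float]:
    """Count distinct genes hit by at least one EST and the percent coverage.

    est_hits: (est_id, gene_id) pairs; gene_id is None when the EST did not
    match any predicted gene (hypothetical input format).
    """
    if n_predicted_genes <= 0:
        raise ValueError("n_predicted_genes must be positive")
    genes = {gene for _, gene in est_hits if gene is not None}
    return len(genes), 100.0 * len(genes) / n_predicted_genes


# Toy input: three ESTs hit two distinct genes, one EST has no match.
toy_hits = [("est1", "geneA"), ("est2", "geneA"), ("est3", None), ("est4", "geneB")]
print(gene_coverage(toy_hits, n_predicted_genes=4))

# The figures reported in the text: 8900 genes out of 13,600 predicted.
print(f"{100.0 * 8900 / 13600:.0f}%")
```

Many ESTs collapse onto the same gene in a non-normalized library, which is why 80,000 reads correspond to far fewer distinct genes.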

Figure 1

Diagram of the process used to generate the DGC. See text for details.

Table 1

Summary of construction of the Drosophila Gene Collection. RNA for the various libraries was obtained from the following sources: LD, 0- to 22-hour embryos; GM, ovaries, stage 1 to 6 of oogenesis; HL and GH, adult head; LP, mixed larval and early pupal stages; and SD, Schneider L2 cell line. Sequence reads were quality trimmed before submission to GenBank essentially as described in (14); we estimate the accuracy of the high-quality region to be better than 99% and that of the additional bases included in the total submission to be 97%. A list of the clones that make up the current DGC can be found at www.fruitfly.org/DGC.


The use of 5′ ESTs allows us to evaluate the quality of each library rapidly and to identify the clone that extends farthest toward the 5′ end of each gene. We assessed the fraction of clones in each library likely to be full length by aligning the 5′ EST sequences derived from that library to the sequences of a test set of clones. This test set consists of the 9% of all Drosophila genes for which a cDNA clone having the full-length ORF exists in GenBank (see Fig. 2). About 80% of the clones that matched the test set contain the full-length ORF; for 33%, the 5′ extent is equal to or greater than the GenBank sequence (Table 1).
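The library-quality estimate described above reduces to counting alignments by where the EST begins relative to the reference cDNA. The following is a minimal sketch, assuming each alignment has already been summarized as a signed offset using the convention of Fig. 2 (0 means the EST and the GenBank clone begin at the same position; a negative value means the EST extends farther 5′). The function names and the `utr5_len` parameter are illustrative assumptions, not part of the original pipeline.

```python
from typing import Iterable


def fraction_as_long_or_longer(offsets: Iterable[int]) -> float:
    """Fraction of ESTs whose 5' end is at or beyond the GenBank 5' end."""
    offsets = list(offsets)
    if not offsets:
        raise ValueError("no alignments supplied")
    return sum(1 for offset in offsets if offset <= 0) / len(offsets)


def contains_full_orf(offset: int, utr5_len: int) -> bool:
    """True if the EST starts at or upstream of the ORF start.

    utr5_len: length of the 5' UTR in the GenBank reference (assumed known),
    i.e. the position of the ORF start relative to the GenBank 5' end.
    """
    return offset <= utr5_len


# Toy offsets: two ESTs reach the reference 5' end, two fall short of it.
toy_offsets = [-5, 0, 10, 200]
print(fraction_as_long_or_longer(toy_offsets))
print(contains_full_orf(10, utr5_len=150), contains_full_orf(200, utr5_len=150))
```

An EST that starts inside the 5′ UTR still comes from a clone containing the full ORF, which is why the full-ORF fraction is higher than the fraction matching or exceeding the GenBank sequence.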

Figure 2

Estimating the quality of the LD cDNA library. 5′ ESTs derived from LD library clones were compared with the 1213 sequenced Drosophila cDNAs in GenBank that are reported to extend farther 5′ than the start of the ORF. When an EST corresponded to one of these 1213 genes, it was aligned to its GenBank counterpart with LALIGN (13). Each dot represents the result of one such alignment. A position of 0 on the x axis indicates that the GenBank clone and the EST are the same length; a negative number indicates the EST extends farther 5′. As reported in Table 1, 33.6% of LD clones are as long as or longer than the corresponding GenBank clone. This percentage is higher for clones under 4 kb and drops off markedly with increasing clone size, although apparently full-length clones are seen up to 6.5 kb.

We then clustered the 5′ ESTs by sequence (7) and selected the one clone representing each gene that extends farthest 5′. We estimate that by selecting the longest clone for each gene, we increased the percentage of clones containing the full ORF to greater than 95%. We next obtained the sequence of the 3′ ends of 9080 of the selected clones (8). We performed two quality-control tests at this point: First, we discarded clones for which a polyadenylate [poly(A)] tail was not apparent. Second, we aligned the 5′ and 3′ sequences of each clone to the genomic DNA sequence (9) and discarded clones for which the two sequences were not in proximity. This eliminated clones that contain two unrelated cDNAs coligated into the same cloning vector, which occurred in about 5% of clones, as well as clones for which a data tracking error occurred such that the 5′ and 3′ reads of that clone were not appropriately associated in our database. We also determined the insert size of each clone (10).
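The selection and quality-control steps in this paragraph can be sketched as a small filter. This is an illustration under stated assumptions, not the authors' software: the `Clone` record, its field names, and the `max_span_bp` threshold are hypothetical (the text says only that the two reads must be "in proximity"), and for simplicity every clone carries 3′ data here, whereas in the described process 3′ reads were obtained only for the selected clones.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Clone:
    clone_id: str
    cluster_id: str         # gene-level cluster from the 5' EST clustering
    five_prime_offset: int  # 5' EST start vs. a shared reference; smaller = farther 5'
    has_poly_a: bool        # poly(A) tail apparent in the 3' read
    arm_5: str              # chromosome arm hit by the 5' read
    pos_5: int              # genomic position of the 5' read
    arm_3: str              # chromosome arm hit by the 3' read
    pos_3: int              # genomic position of the 3' read


def select_farthest_5prime(clones: Iterable[Clone]) -> Dict[str, Clone]:
    """Keep, for each cluster, the one clone that extends farthest 5'."""
    best: Dict[str, Clone] = {}
    for clone in clones:
        current = best.get(clone.cluster_id)
        if current is None or clone.five_prime_offset < current.five_prime_offset:
            best[clone.cluster_id] = clone
    return best


def passes_qc(clone: Clone, max_span_bp: int) -> bool:
    """Apply the two tests: a poly(A) tail, and 5'/3' reads in proximity."""
    if not clone.has_poly_a:
        return False
    if clone.arm_5 != clone.arm_3:
        return False  # likely coligated inserts or a tracking error
    return abs(clone.pos_3 - clone.pos_5) <= max_span_bp


def validated_set(clones: Iterable[Clone], max_span_bp: int) -> List[Clone]:
    """Select one clone per cluster, then discard those failing QC."""
    chosen = select_farthest_5prime(clones).values()
    return [clone for clone in chosen if passes_qc(clone, max_span_bp)]


toy = [
    Clone("a", "g1", 10, True, "2L", 1000, "2L", 3000),
    Clone("b", "g1", -5, True, "2L", 985, "2L", 3000),   # farthest 5' in g1
    Clone("c", "g2", 0, False, "3R", 500, "3R", 2500),   # no poly(A) tail
    Clone("d", "g3", 0, True, "X", 500, "3R", 2500),     # reads on different arms
]
# max_span_bp=50_000 is an arbitrary illustrative threshold.
print([clone.clone_id for clone in validated_set(toy, max_span_bp=50_000)])
```

A failed clone is simply dropped in this sketch; choosing a replacement from the same cluster would be a separate step.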

We clustered the remaining clones on the basis of their 3′ end sequences. This allowed us to eliminate remaining duplicate clones; such clones might escape detection in the 5′ end clustering if they so differ in length that their 5′ ESTs do not overlap. In these cases, the longer clone was retained.
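The duplicate-removal step can be illustrated as follows. This is a minimal sketch, assuming the 3′ end clustering has already assigned each clone a cluster label and that insert size is used as the measure of clone length; the `SelectedClone` record and its field names are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple


class SelectedClone(NamedTuple):
    clone_id: str
    three_prime_cluster: str  # label from clustering the 3' end sequences
    insert_size_kb: float     # measured insert size of the clone


def dedupe_by_3prime(clones: Iterable[SelectedClone]) -> List[SelectedClone]:
    """Retain only the longest clone within each 3' end cluster."""
    longest: Dict[str, SelectedClone] = {}
    for clone in clones:
        current = longest.get(clone.three_prime_cluster)
        if current is None or clone.insert_size_kb > current.insert_size_kb:
            longest[clone.three_prime_cluster] = clone
    return list(longest.values())


toy = [
    SelectedClone("short", "t1", 1.1),
    SelectedClone("long", "t1", 3.4),   # same 3' end, non-overlapping 5' ESTs
    SelectedClone("other", "t2", 2.0),
]
print([clone.clone_id for clone in dedupe_by_3prime(toy)])
```

The two `t1` clones would have fallen into separate 5′ clusters because their 5′ reads do not overlap, and the shared 3′ end is what reveals them as the same gene.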

These steps resulted in the generation of a validated set of 5849 clones, estimated to represent 42% of all predicted Drosophila genes. The average size of the cDNAs in this set is 2.2 kb (Table 1). These clones are now being colony purified and arrayed to generate what we call the Drosophila Gene Collection (DGC), Release 1.0 (11). We are also selecting replacements, if they exist in our EST collection, for clones that failed to pass quality-control tests. We anticipate that this selection of replacements will increase our representation from 42% to over 50% of all genes.

We envision two complementary strategies to isolate cDNAs representing the remaining genes. Because we want to determine the sequence of the 5′ and 3′ UTRs, we do not intend to simply amplify predicted ORFs using reverse transcription polymerase chain reaction (RT-PCR). Given funding, we would propose to generate an additional 200,000 5′ ESTs from both existing and newly constructed libraries. Given the availability of a highly annotated genome sequence, we need only sequence 50 to 100 base pairs (bp) to obtain an unambiguous alignment with the genome. We can then computationally determine whether a particular EST is likely to derive from a clone containing a complete ORF not represented in our current DGC set. Promising clones would then be sequenced from the 3′ end and subjected to our other quality-control criteria. We anticipate that 100-bp ESTs can be generated for a fraction of the cost of the 500-bp ESTs used in our initial work and that 200,000 additional ESTs will be sufficient to bring our DGC set to 80% of all genes. The remaining 20% of clones can be isolated by library screening with PCR-based methods (12); our existing libraries have an estimated total complexity in excess of 5 million clones.

The approach we have demonstrated, as well as the extensions outlined above, will serve as a useful model for the generation of similar clone sets in other organisms. Indeed, some aspects of these ideas have already been adopted by the Mammalian Gene Collection project (1).

