In this issue of Science on page 1304 and this week's issue of Nature appear versions of the sequence of the human genome (1, 2) that signal the dawn of a new era. For the research biologist, it is easy to think about the advantages of having the sequence of every gene of potential interest, but another thing altogether to think about how to find all of them and to validate their identities and structures. The use of genome sequences to solve biological problems has even been afforded its own label; for better or worse, it's called “functional genomics.” This new way of doing biology means some real changes, many of which are well under way in the community.
Since the publication of the Saccharomyces cerevisiae genome in 1996 (3), we have become familiar with the use of the full genome sequence in investigations of gene expression patterns and controls, protein-protein interaction networks, and other biological problems (4–6). These investigations are marked by a global point of view that was simply not possible before we had the sequence. Although we still do not know the function of about a third of the yeast genes, we do know that all possible protein and RNA participants in cellular function are encoded in the sequence we have.
As simple as it sounds, to know that there are no other unknown genetic components that can provide alternative explanations of experimental results is a fundamental shift of perspective. This shift is beginning to transform our approach to science, enabling researchers to face the challenge of identifying all the molecular components of the cell, as well as understanding how they are controlled, interact, and function. From a picture of the “software” of the single cell, we can look to the future when researchers will begin building, with as fine a degree of resolution, an integrated view of the universe of cell-cell interactions, differentiation, and development from single cell to organism. The availability of complete sequences of Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana (7–9) is already beginning to revolutionize such studies, and this list may soon include significant sequence from other biological models of metazoan development.
Estimates from genes analyzed to date suggest that the average number of alternates spliced from the transcript of a single mammalian gene might be in the range of two to three or more. As the present sequence yields estimates of about 30,000 genes (1, 2), this would give us an estimated 90,000 or more distinct proteins encoded by the human genome, without considering proteolytic processing or posttranslational modifications. Thus, the complexity of the mammalian genome relative to that of yeast still presents formidable technical obstacles.
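The arithmetic behind this estimate can be laid out as a minimal sketch. The function name is hypothetical, and the inputs are simply the figures quoted above (about 30,000 genes, two to three spliced forms per gene); the 90,000 figure corresponds to the upper end of that range.

```python
def estimated_proteins(gene_count: int, forms_per_gene: float) -> int:
    """Rough count of distinct proteins, ignoring proteolytic processing
    and posttranslational modification."""
    return round(gene_count * forms_per_gene)


# Figures quoted in the text: ~30,000 genes, two to three forms per gene.
for forms in (2, 3):
    print(forms, estimated_proteins(30_000, forms))
```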
So how can the working biologist take advantage of all this new information and bring about the advances predicted? The first step is to understand that the present form of the available sequence information of the human genome is not a complete, fully annotated inventory of the human genes in each chromosome. Nor is the available sequence a single continuous and exact sequence for each chromosome. The reported genome sequence is represented by a set of sequences that cover the genome in a statistical sense but have a very large number of interruptions and gaps. Although the completeness and continuity will continue to improve, there are significant uncertainties when inferences are made from these data. The concept of the “contig” is essential to our understanding of this limitation. A contig is a contiguous piece of sequence information inferred by assembling sequence reads from single reactions (usually 400 to 800 bases in length). The number of contigs reported in the sequence data and their spectrum of sizes are important parameters in the analysis of genes. As of 12 December 2000, the public database at the U.S. National Center for Biotechnology Information (NCBI) reported that the largest contig in the entire available sequence was 28.5 megabase-pairs (Mb) in size; there were 43 contigs larger than 1 Mb, 566 contigs between 250 kb and 1 Mb, and 1628 contigs between 100 and 250 kb in size. This represented a total of approximately 600 Mb in contigs larger than 100 kb, less than 20% of the full sequence of the genome. As illustrated in figure 8 of IHGSC (2), half of the sequence lies in contigs 22 kb or smaller, though they can be joined to form larger contigs. We must distinguish here “initial sequence contigs,” derived from sequenced clones, and “merged sequence contigs,” derived by merging sequence contigs from overlapping sequenced clones [see figures 6 and 7 in (2)]. Because Venter et al. (1) assemble sequence contigs, not from sequenced clones, but from the entire collection of sequence reads, this distinction is not necessary in their report.
Because the average gene (a good estimate might be about 30,000 base pairs) is of the same order of magnitude as, or larger than, many of the contigs, a significant fraction of human genes are unlikely to be represented on a single sequence contig in these data sets. The likelihood of finding one of the largest genes, such as Titin [∼250 kb in size with >200 exons (1)], on a single contig is much smaller than for small, simple genes like the olfactory receptor genes, which average less than 2 kb (2). It will be a while before the gaps get filled in and the contigs are joined together.
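The dependence on gene size can be made concrete with a rough sketch. It assumes, purely for illustration, that a gene is equally likely to start at any sequenced base; the function name and the contig lengths in the usage example are hypothetical, not values from either report.

```python
def fraction_on_single_contig(contig_lengths, gene_length):
    """Fraction of possible start positions at which a gene of the given
    length lies entirely within one contig (uniform-start assumption)."""
    total_bases = sum(contig_lengths)
    # A contig of length L has L - gene_length + 1 start positions that
    # leave room for the whole gene, or none if the gene is longer.
    fitting_starts = sum(max(0, length - gene_length + 1)
                         for length in contig_lengths)
    return fitting_starts / total_bases


# Hypothetical contig lengths in base pairs, for illustration only.
contigs = [500_000, 120_000, 60_000, 22_000, 22_000, 10_000]
for gene in (2_000, 30_000, 250_000):
    print(gene, round(fraction_on_single_contig(contigs, gene), 3))
```

Under this simple model, a gene longer than most contigs can fit only in the few largest ones, while a 2-kb gene fits almost anywhere.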
Therefore, in the near future, many genes will have to be synthesized from an inferred organization of the contigs into a gapped mosaic of assemblies called “scaffolds.” This means that an even more important factor than continuity for using the sequence to construct models of genes is the uncertainty associated with positioning the contigs relative to each other. Ambiguities in order and orientation of the contigs will sharply increase the number of possible ways that the sequence can be fit together and will thereby obscure the actual gene structure.
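How quickly these ambiguities multiply can be sketched as follows. With no linking information, n contigs can be placed in n! orders, each contig in either of two orientations, giving n! × 2^n layouts (a layout and its overall reverse complement are counted separately here). The function names are hypothetical.

```python
from itertools import permutations, product
from math import factorial


def arrangement_count(n_contigs):
    """Number of ordered, oriented layouts of n unlinked contigs."""
    return factorial(n_contigs) * 2 ** n_contigs


def enumerate_arrangements(contigs):
    """Explicitly list every layout as a tuple of (contig, strand) pairs.
    Only practical for a handful of contigs."""
    layouts = []
    for order in permutations(contigs):
        for strands in product("+-", repeat=len(contigs)):
            layouts.append(tuple(zip(order, strands)))
    return layouts


for n in (2, 5, 10):
    print(n, arrangement_count(n))
```

Even a gene spread over only ten unplaced contigs admits billions of candidate layouts, which is why reliable order and orientation information matters so much.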
The definition of a scaffold appears to be quite different in the two papers. Venter et al. (1) report that they built scaffolds by using the paired-end sequences of their plasmid clones to link together and orient sequence contigs. They could put together these chains of sequence contigs, in the right order and orientation and at known distances apart, because they used several known sizes of plasmid clones for sequencing and always generated sequence pairs at known distances from each other. The advantages of relying on these kinds of sequence data were substantial in the assembly process. One of these advantages is that sequence contigs could be linked in the proper orientation and distance from each other even when they could not be merged into a single contig. Thus, the self-consistent assembly from these data would appear to have ensured a high level of order and orientation of contigs at every scale of length.
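The geometric idea behind paired-end linking can be sketched in a few lines. This is an illustration under simplifying assumptions (exact insert sizes, one read of each pair on each of two adjacent contigs), not the assembler described by Venter et al.; the function names, coordinates, and insert sizes are all hypothetical.

```python
from statistics import mean


def gap_from_mate_pair(insert_size, left_contig_len, fwd_read_start, rev_read_end):
    """Gap implied by one clone whose forward read lies on the left contig
    and whose reverse read lies on the right contig.

    fwd_read_start: 0-based start of the forward read on the left contig.
    rev_read_end:   end coordinate (exclusive) of the reverse read on the
                    right contig.
    """
    bases_in_left = left_contig_len - fwd_read_start
    bases_in_right = rev_read_end
    return insert_size - bases_in_left - bases_in_right


def estimate_gap(links, left_contig_len):
    """Average the gap estimates from several (insert_size, fwd_read_start,
    rev_read_end) links spanning the same pair of contigs."""
    return mean(gap_from_mate_pair(size, left_contig_len, start, end)
                for size, start, end in links)


# Two hypothetical clones (10-kb and 2-kb inserts) spanning the same gap.
links = [(10_000, 2_000, 5_300), (2_000, 5_200, 500)]
print(estimate_gap(links, left_contig_len=6_000))
```

Because each clone fixes both the relative strand of its two reads and their separation, consistent links of this kind order, orient, and space the contigs even across unsequenced gaps.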
It may not be possible to fully assemble genes that fall into these scaffold segments if a gene segment falls into an unsequenced gap, but the picture of the gene that emerges should be fairly reliable. A gene would look something like the picture on a reconstructed Grecian urn, with blank clay segments holding the places for the real, picture-completing fragments. A critical parameter for gene assembly and analysis for the Venter et al. approach is the size and coverage distribution of scaffolds [see figure 5 in (1)]. The average scaffold length reported was more than a megabase, with 25% of the genome in scaffolds of at least 10 Mb in size. As the average gap length between scaffolds was only 2 kb, this data set seems to represent a high level of coverage for gene analyzers, with a high level of consistent order and orientation.
IHGSC (2) report that they built their scaffolds quite differently, largely by linking sequenced bacterial artificial chromosomes (BACs). This will still leave some sequence contigs within the BACs of the scaffold unordered or unoriented. The Grecian urn analogy does not fit here because the sizes and shapes of the gaps are not well known and, in some cases, the pieces may be in backwards or in the wrong order. The critical factor for the gene-analyzing biologist is the degree of ordering and orientation of contigs within the BACs that were linked to make the scaffolds, which is difficult to estimate from the report. Relevant measures include a reported overall estimate in the range of 10 to 15% for misordered or misoriented sequence contigs (2). This paper contains a useful new statistic to indicate sequence contig length or scaffold length that is systematically larger than the simple average length: the N50 length (the largest length such that 50% of all base-pairs are contained in contigs of this length or larger). The reported N50 length for sequence contigs was 82 kb (including data from the finished chromosomes 21 and 22), and the scaffold N50 was 270 kb. Direct comparison of these statistics with the averages from Venter et al. is not meaningful. To understand some of the subtleties, the interested reader will have to venture further into the data on distribution of lengths and other important complexities, keeping in mind the differences between the processes of assembly. It would appear, however, that the scaffold data reported in the Science paper, having 90% genome coverage with end-to-end, long scaffolds, is a powerful resource for the biologist that will steadily improve as new sequence fills in the contig gaps and resolves remaining ambiguities.
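The N50 definition given above translates directly into a short calculation. The sketch below follows that definition; the function name and the toy lengths are illustrative only.

```python
def n50(lengths):
    """Largest length L such that contigs of length >= L together contain
    at least half of all bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downward until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be a non-empty list of positive numbers")


toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print("mean:", sum(toy_lengths) / len(toy_lengths))
print("N50: ", n50(toy_lengths))
```

In this toy set the mean length is 4 while the N50 is 8: many short contigs pull the mean down, but they contribute few bases, so N50 is weighted toward the lengths in which most of the sequence actually resides. This is also why an N50 cannot be compared directly with a simple average.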
The effectiveness of finding genes by similarity to a given sequence segment is determined by a much simpler statistic, the total coverage of the genome by the collective set of sequence contigs. As the overall coverage of the genome is virtually complete (≥90%), there is a strong likelihood that every gene is represented, at least in part, in the data. Thus, finding any gene by sequence similarity searches using sufficient sequence to ensure significance is almost always possible using the data published this week. Caution must be exercised, however, as the identification of the gene may still be ambiguous. This is because a highly similar sequence from a receptor gene from Drosophila, for example, could be found in several different, homologous genes, which may have similar or entirely different functions or are nonfunctioning pseudogenes. In other words, common domains or motifs can be present in many different genes. The use of the approximate similarity search tool BLAST is probably still the best way to find similar sequences. The excellent primer at the NCBI site (10) should be used to understand the nature of the growing armament of BLAST-based tools, as well as the sometimes subtle issue of statistical significance and the limitations of this kind of approximate algorithm. For most purposes, the approximation used by the BLAST algorithm is irrelevant, but the user should be aware of the specific kinds of similarities that may be missed by each available form of the algorithm. For example, since certain kinds of interrupted similarities are ignored, the more widely separated two similar sequences are, the less reliable will be the assessments of statistical significance. Newer methods attempting to use the structural cues inherent in the coding sequence to detect similarities are pushing back the detection limits for significant similarity (11).
Although enormous progress has been made in automating the identification of genes in genomic sequence, building accurate models of genes from the sequence still requires a lot of human, “hands-on” effort. The best models are built of genes whose full-length mRNA sequences are available. The RNA sequence [in the form of complementary DNA (cDNA)] can be used to thread together the exon structure of the gene from genomic sequence no matter where the pieces may reside—continuity, order, and orientation of the fragments are not essential to this process. Of course, the presence of pseudogenes and highly similar repeats can defeat even this strategy. Nonetheless, this represents a strong argument for gathering much more full-length cDNA sequence data.
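Why continuity, order, and orientation are not essential here can be illustrated with a toy sketch. It uses exact string matching and a greedy longest-match rule, both simplifications chosen for brevity; real alignment tools tolerate mismatches and model splice sites. All names and sequences are hypothetical.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def thread_cdna(cdna, contigs, min_len=8):
    """Greedily cover a cDNA with exact matches to contigs in either
    orientation. Returns (cdna_start, cdna_end, contig, strand, offset)
    tuples; for strand '-', offset is measured on the reverse-complemented
    contig. Bases that cannot be placed are skipped."""
    oriented = []
    for name, seq in contigs.items():
        oriented.append((name, "+", seq))
        oriented.append((name, "-", reverse_complement(seq)))

    segments = []
    i = 0
    while i < len(cdna):
        best = None
        # Try the longest possible piece first, then shorten it.
        for length in range(len(cdna) - i, min_len - 1, -1):
            piece = cdna[i:i + length]
            for name, strand, seq in oriented:
                offset = seq.find(piece)
                if offset != -1:
                    best = (i, i + length, name, strand, offset)
                    break
            if best is not None:
                break
        if best is None:
            i += 1          # unplaced base: move on
        else:
            segments.append(best)
            i = best[1]     # continue after the matched piece
    return segments


# Two toy exons; the second sits on another contig, on the opposite strand.
exon_1 = "ATGGCCATTGTAATG"
exon_2 = "CGCCGCTGAAAGGGT"
contigs = {
    "contig_1": "TTTT" + exon_1 + "GTAAGTCC",
    "contig_2": "CACA" + reverse_complement(exon_2) + "TTGG",
}
for segment in thread_cdna(exon_1 + exon_2, contigs):
    print(segment)
```

The cDNA itself supplies the order and orientation of the exons, so the contigs can be consulted in any arrangement. Pseudogenes and repeats would appear here as competing matches, which this greedy sketch makes no attempt to resolve.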
There are two general approaches to gene finding. The homology-based methods include the use of known mRNA sequences as well as gene families and inter-specific sequence comparisons. The ab initio methods include detection of exons and other sequence signals, like splice sites, by various computational methods within the sequence being analyzed.
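As a deliberately naive illustration of the ab initio idea, the sketch below scans a sequence for the dinucleotides that most commonly bound introns (GT at the 5′ end, AG at the 3′ end) and pairs them into candidate introns. Real gene finders score such signals statistically and in context rather than by exact matching; the function names and the minimum-length parameter are assumptions made for this example.

```python
def splice_site_candidates(seq):
    """Return positions of candidate donor (GT) and acceptor (AG) sites."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


def candidate_introns(seq, min_len=20):
    """Pair every donor with every downstream acceptor, keeping intervals
    [start, end) of at least min_len bases that begin with GT and end
    with AG."""
    donors, acceptors = splice_site_candidates(seq)
    return [(d, a + 2)
            for d in donors
            for a in acceptors
            if a + 2 - d >= min_len]


toy = "CCGTAAACCCAGCC"
print(candidate_introns(toy, min_len=5))
```

Even in a short sequence, these dinucleotides occur by chance far more often than true splice sites do, which is one reason signal detection alone overpredicts and is usually combined with homology evidence.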
In every gene model, the location and structure of the sequences involved in regulation and control stand as one of the most difficult annotation problems. Finding and dissecting these important sequence regions can be done in some cases by means of motifs known to be conserved in transcription factor-binding regions (12), but our ability to define and predict control regions is currently rather poor and unreliable. Interspecific genome comparisons are one of the ways of getting at these regions, under the assumption that the regions will stand out as being conserved (13). New experimental methods, like an array-based technique to locate genome-wide sites of action of transcription factors (4), will also make significant contributions to sorting out the cis-regulatory signals in the genome.
A number of tools are currently available for automated annotation, but a discussion of their advantages and limitations in specific circumstances is beyond the scope of this viewpoint. Approaches that use a combination of statistical and heuristic methods to recognize genes and gene features are prevalent (hidden Markov models, neural nets, and Bayesian networks are among the methods used). They are most effective, however, in finding genes, rather than modeling them accurately, and are usually used in concert with homology-based methods. Factors that can have strong effects on the effectiveness of such algorithms include errors in sequencing and statistical biases like base composition. Noise in the data can sharply degrade performance, so draft sequence, in which the error rate is higher, can be markedly inferior to finished sequence for ab initio prediction.
GENSCAN is a widely used piece of software for gene finding and prediction, but newer developments like Genie also look promising (14, 15). Genie is a hidden-Markov-model system that allows for the integration of information from different sources such as signal sensors (splice sites, start codon, etc.); sensors of introns and exons; and alignments of mRNA, expressed sequence tag (EST), and peptide sequences. Other software tools, like GENEBUILDER, GLIMMERM, FGENES, and GRAIL, have also been reviewed recently (16, 17). There is no one simple way to compare them, as they appear to perform differently in different tests. Using the Drosophila genome as a primary example, the Genome Annotation Assessment Project (GASP1; see table below) provides a very useful analysis of progress and problems in eukaryotic genome annotation (18). A similar comparison has been done using the Arabidopsis genome (19).
The two genome papers have used systems consisting of multiple tools to create their initial gene inventories. IHGSC (2) used a system called Ensembl that follows ab initio predictions by GENSCAN with mRNA, EST, and protein motif information comparisons for the initial predictions (19). It then uses a program called GeneWise, which has been used on the Drosophila genome (20), to extend protein matches. In contrast, Venter et al. (1) report the development of a rule-based expert system for annotation they call “Otto” that attempts to embed some human curatorial functions in software.
All these annotation efforts in the community are also being linked to new ways of visualizing genomes with their annotation (18, 21, 22). Genome browsers that let the reader navigate through many levels of genome information are now available and take the first steps in this direction. These tools can be accessed at several sites (see table, above). Commercial firms are also beginning to market similar kinds of software and are likely to continue to develop sophisticated, user-friendly packages for these purposes.
In the future, when the annotation of the genome is complete, the information from the sequence will be indicated in agreed-upon terms that can be searched directly by the text of the annotation—for example, a gene will be found by its name, by its family, by the protein domains it codes for, etc. Clearly, a combination of sequence similarity searching tools, ab initio methods, and annotation-based searches must be used by researchers for the foreseeable future. The next stage of annotation will also require the integration of independent experimental information into the gene annotation. To fully explore the properties of complex, highly interactive systems, databases will need to have pointers that link a gene to other genes by a variety of causal interactions, such as gene product X has a binding partner Y, exerts control on the expression of gene Y through a cis-regulatory site, produces a metabolic product that interacts with the product of gene Y, or participates in the same (or linked) signaling pathway with the product of gene Y. The way to this future has been opened by the availability of sequence information. Now we have to learn to use it to understand the biology of the organism.