Research Article

Finding Functional Features in Saccharomyces Genomes by Phylogenetic Footprinting


Science  04 Jul 2003:
Vol. 301, Issue 5629, pp. 71-76
DOI: 10.1126/science.1084337

Abstract

The sifting and winnowing of DNA sequence that occur during evolution cause nonfunctional sequences to diverge, leaving phylogenetic footprints of functional sequence elements in comparisons of genome sequences. We searched for such footprints among the genome sequences of six Saccharomyces species and identified potentially functional sequences. Comparison of these sequences allowed us to revise the catalog of yeast genes and identify sequence motifs that may be targets of transcriptional regulatory proteins. Some of these conserved sequence motifs reside upstream of genes with similar functional annotations or similar expression patterns or those bound by the same transcription factor and are thus good candidates for functional regulatory sequences.

Functional non–protein-coding DNA sequences, such as gene regulatory elements, are difficult to recognize because they are usually short, often degenerate, and can reside on either strand of DNA at variable distances from the genes they control. Because functional sequences tend to be conserved through evolution, they can appear as “phylogenetic footprints” in alignments of genome sequences of different species (1–3). To investigate the use of phylogenetic footprinting for identifying gene regulatory elements on a genome-wide scale, we compared the genome sequences of six yeast species. On the basis of our initial analysis of the Saccharomyces phylogeny (4), we selected for sequencing three Saccharomyces sensu stricto (“strict sense”) species that are relatively closely related to Saccharomyces cerevisiae (S. mikatae, S. kudriavzevii, and S. bayanus). Their intergenic sequences average 59 to 67% identity to their S. cerevisiae orthologs in global alignments (table S1). At this evolutionary distance, nonfunctional sequences have diverged enough to allow many functional sequence signals to stand out from the “noise,” but the sequences retain enough overall similarity to enable their alignment. Because of the relatively high degree of similarity of sequences at these evolutionary distances, genome sequences of several species need to be compared to lend sufficient acuity to the phylogenetic footprints.

We also chose to include in the analysis two Saccharomyces species that are more distantly related to S. cerevisiae (S. castellii and S. kluyveri) (5). To estimate the degree to which these species have diverged from S. cerevisiae, we compared the sequences of synonymous codons, because few of their intergenic sequences align to their S. cerevisiae orthologs. S. castellii and S. kluyveri average 33.9% identity to S. cerevisiae, compared with 54.5% identity for the sensu stricto species (table S1). Several Saccharomyces species are approximately this diverged from S. cerevisiae (4). We chose these particular species for sequencing because they are thought to have the smallest genomes (6). In addition, S. kluyveri is of particular biological interest because of an unusual aspect of its physiology: Unlike most other Saccharomyces species, it does not primarily ferment glucose (7). We included these species in the analysis mainly for two reasons. First, we wished to compare the relative utility of sequences at different phylogenetic distances for comparative sequence analysis. Second, based on our preliminary analysis (4), we expected that many intergenic regions would be too highly conserved among the sensu stricto species to provide adequate definition of functional sequences, and more divergent sequences were expected to sharpen the definition of conserved sequence motifs. An additional reason for obtaining genome sequences of the more distantly related species is that, unlike the case with the closely related species, functional domains of their proteins are often apparent in multiple sequence alignments.

The sensu stricto species are so closely related that their genome organization is almost identical; only a few chromosomal rearrangements have occurred in these species, making their chromosomes almost completely syntenic with their S. cerevisiae counterparts (8, 9). In contrast, many chromosomal rearrangements have occurred in the genomes of the two species that are more distantly related to S. cerevisiae, resulting in relatively short stretches of chromosomes that are syntenic with the S. cerevisiae genome (8).

Genome sequencing. When a complete, highly accurate genome sequence is available, as it is for S. cerevisiae, a relatively small amount of sequence data of related genomes is sufficient for substantial comparative analysis. We therefore sought only twofold to threefold genome coverage in random (“shotgun”) sequence reads (10) of the five genomes, which is expected to yield 85 to 95% coverage of the sequence of each genome (11). We then used a semiautomated method for closing the gaps between assembled contigs (12). Because we were primarily interested in identifying functional elements in intergenic sequences, we limited our finishing efforts to gaps between contigs that fall in these regions of the genome. These strategies made this a relatively economical project. Details of sequence assembly and finishing are presented in table S2.
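The 85 to 95% estimate is consistent with the usual Poisson approximation for random shotgun reads, in which the expected fraction of a genome covered at c-fold redundancy is 1 - e^(-c). The sketch below assumes that model; the text cites (11) for the estimate but does not state the formula, and the function name is illustrative only.

```python
import math


def expected_coverage(fold_coverage: float) -> float:
    """Expected fraction of a genome hit by at least one read.

    Assumes reads start uniformly at random along the genome, so the
    number of reads covering a base is approximately Poisson with mean
    equal to the fold coverage (an assumption, not the authors' stated method).
    """
    return 1.0 - math.exp(-fold_coverage)


for c in (2.0, 3.0):
    # Prints 86.5% for twofold and 95.0% for threefold coverage.
    print(f"{c:.0f}x coverage -> {expected_coverage(c):.1%} of the genome")
```

Under this assumption, twofold and threefold coverage bracket the range quoted above, which is why modest redundancy can suffice when a finished reference genome is available for comparison.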

Improving genome annotation. DNA sequence comparisons can be used to refine genome annotation. Even for a small, well-studied, and extensively annotated genome like that of S. cerevisiae, in which genes are relatively easy to recognize, our analysis affected annotation of more than 10% of the genes. We were able to predict 43 previously unannotated genes [all of them small open reading frames (ORFs) fewer than 100 codons in length, including 7 with an intron], based on their degree of sequence conservation (13). We also predict that 515 annotated genes are false, based on the nonsense codons and frameshift mutations present in their orthologs in the other species (14). We noticed several likely oversights in the current annotation of the S. cerevisiae genome (15). For example, the intron branch sequence tACTAAC appeared frequently as a sequence motif conserved in several orthologous intergenic regions, signaling the presence of potential introns. This allowed us to recognize 40 likely introns that had not been annotated, based on consensus 5′ splice donor and 3′ splice acceptor sequences flanking the conserved intron branch sequence. Our comparative genome analysis leads to a more accurate gene count for S. cerevisiae of 5773 (16), rather than the 6331 genes currently annotated in the Saccharomyces Genome Database (SGD) (17).

Alignment of intergenic sequences. Our primary goal was to identify functional non–protein-coding sequences. We therefore compared entire intergenic regions, which are relatively short in Saccharomyces [average ∼500 base pairs (bp)]. Intergenic sequences of the sensu stricto species are similar enough that most orthologous sequences can be accurately identified and aligned with CLUSTALW (18), but the two distantly related species are so diverged from S. cerevisiae that their intergenic sequences almost never aligned to the sequence of their S. cerevisiae ortholog. The orthologous intergenic sequences of the distantly related species could only be identified by aligning the sequences of their associated predicted proteins [using BLASTX (19)]. Over half of the intergenic regions are available from all four sensu stricto species (20); 40% are available from all six species (21). The genome sequences are available in the public databases (22); the CLUSTALW alignments of the intergenic sequences can be obtained at www.genetics.wustl.edu/saccharomycesgenomes/.

The four-way CLUSTALW alignments of orthologous intergenic sequences of the closely related sensu stricto species had an average sequence identity of 37.1%. This is significantly more identity than expected for nonselected (neutral) sequences at these phylogenetic distances (∼16%, see table S1). The distribution of sequence identity within intergenic regions was not uniform: A peak of conservation spanned approximately 125 to 250 bp upstream of the translational start codon (23), suggesting that this region is enriched in regulatory sequence elements (Fig. 1A) (24). This is consistent with the view that most regulatory sequences in yeast promoters lie relatively close to the genes they regulate (25). It also suggests that a substantial number of the conserved sequence elements are conserved because they are functional, rather than because of their shared ancestry (the latter case would be expected to result in their more uniform distribution in intergenic regions). This is in contrast to the relative uniformity of sequence identity in intergenic regions downstream of genes (i.e., in terminators), which harbor no promoter elements (Fig. 1B). Genes that have in their promoter a conserved TATA box (a key promoter sequence element that is the binding site for TATA box–binding protein, the protein around which RNA polymerase II and its many associated proteins assemble) have a broader and higher peak of conservation (Fig. 1A), suggesting that their promoters contain more regulatory sequences than the average promoter, or that they are more slowly evolving than the average promoter. The lower average sequence identity that is 75 to 100 bp upstream of the ATG codon suggests that there may be a spatial restriction on regulatory sequences that prevents them from acting close to the transcriptional start site.

Fig. 1.

Profiles of the average sequence conservation over the length of intergenic regions for different classes of promoters. The average sequence identity in intergenic regions is 37.1% with a range of less than 10% to almost 95% identity, and a median of 36.8%. The percent identity of CLUSTALW alignments of orthologous intergenic sequences of four sensu stricto species was averaged over the length of the intergenic region in 25-bp windows. (A) Average sequence identity profile of all 3523 four-way alignments (black), and average sequence identity of intergenic regions containing aligned TATA box sequences (gray). (B) Profile for intergenic regions between 595 convergently transcribed genes (that is, intergenic regions consisting entirely of sequences downstream of genes).
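A conservation profile of the kind plotted in Fig. 1 can be sketched as follows. This is a minimal illustration, assuming that a column counts as identical only when all aligned sequences carry the same base and none has a gap; the exact identity definition used for the CLUSTALW alignments is not given in the text, and the function names are hypothetical.

```python
def column_is_identical(column):
    """True when every sequence has the same base and none has a gap."""
    first = column[0]
    return first != "-" and all(base == first for base in column)


def windowed_identity(alignment, window=25):
    """Percent of identical columns in consecutive windows of an alignment.

    `alignment` is a list of equal-length gapped strings, one per species.
    The final window may be shorter than `window` columns.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    flags = [column_is_identical(col) for col in zip(*alignment)]
    profile = []
    for start in range(0, length, window):
        chunk = flags[start:start + window]
        profile.append(100.0 * sum(chunk) / len(chunk))
    return profile


# Toy four-way alignment: 25 identical columns, then 25 columns in which
# the third sequence differs from the others.
toy = [
    "A" * 25 + "C" * 25,
    "A" * 25 + "C" * 25,
    "A" * 25 + "G" * 25,
    "A" * 25 + "C" * 25,
]
print(windowed_identity(toy))  # [100.0, 0.0]
```

Averaging such per-window values across many alignments, indexed by distance from the start codon, would give a profile comparable in spirit to Fig. 1A.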

Several of the intergenic regions have highly conserved sequences immediately upstream of the translational start site that are longer and more conserved than the 6- to 10-nucleotide (nt) length expected for transcriptional regulatory elements (26). These sequences are not misannotated coding regions, because they are not encompassed in ORFs. The location of these conserved sequences makes them good candidates for translational regulatory elements. [Indeed, two of them are upstream of genes known to be translationally regulated (27, 28).] Ribosomal protein genes showed the highest degree of sequence identity within 30 bp of the translational start codon (fig. S1) (29).

Identification of conserved sequence motifs. We searched for conserved sequence motifs in two overlapping sets of orthologous intergenic sequences: 3523 four-way alignments of intergenic regions of the closely related sensu stricto species, and 3084 six-way sequence comparisons that included sequences of both of the more distantly related species. The advantage of analyzing only the sequences of the closely related sensu stricto species is that they can be aligned in multiple sequence alignments, which provide a powerful visual tool for identifying evolutionarily conserved sequences. The value of the multiple sequence alignments is proportional to the degree that the regulatory sequence architecture is maintained among the different species, and this is high among the sensu stricto species. In addition, because functional sequences are expected to be in the same position and orientation in the closely related species, unaligned sequence motifs, which could easily occur by chance, are not considered, thereby restricting the amount of sequence to be searched.

The more distantly related species enhance the analysis in primarily two ways (30). First, because many intergenic regions have a high degree of sequence identity over the length of the CLUSTALW sequence alignments of the four sensu stricto species (31), the more distantly related species provide the sequence divergence necessary for conserved sequence motif identification in these intergenic regions. Second, the length of individual conserved sequence motifs in the more distantly related species correlates better with the length of transcription factor binding sites than it does in the sensu stricto species' sequences, where the conserved sequence motifs are often longer than known binding sites. The average length of sequence motifs conserved in the sensu stricto species' sequence alignments is 10.7 nt, with 43% of them being 10 nt or longer; the average length of conserved sequence motifs identified in the six-way sequence comparison is only 7.3 nt, and only 2% of them are 10 nt or longer. Similarly, the 7.5 nt average length of known conserved protein-binding sites identified in the six-way sequence comparisons correlates better to the length of the binding site than does the 11.7 nt average length found for the same protein-binding sites in the sensu stricto species' sequence alignments (32).

A number of algorithms have been developed to identify similar sequence motifs within sets of DNA sequences (33–36). Because these algorithms were designed for searching unrelated sequences, they tend to identify a large number of conserved sequence motifs in our related sequences, many of which are likely conserved as a result of their shared ancestry. Therefore, we used more stringent criteria that require the sequence motifs to be precisely conserved in all sequences being compared. A limitation of this approach is that some functional sequences will be missed because they need not be exactly conserved through evolution, but searching among a large number of intergenic regions increases the chance that a functional sequence motif will be found precisely conserved. We focused on ungapped motifs because most of the characterized sequence motifs (62 of 71) are ungapped, and many gapped motifs are likely artifacts as a result of simple sequences associated with ungapped motifs (such as simple A+T-rich sequences associated with an ungapped sequence motif), which makes it statistically difficult to distinguish the likely real gapped motifs. We expected that this relatively straightforward approach for identification of conserved sequence motifs would be successful because of our judicious choice of the number and phylogenetic distances of sequences to be compared. Indeed, we identified 53 of 62 characterized ungapped transcriptional regulatory motifs that are precisely conserved upstream of at least five genes [we also identified many conserved instances of all nine known gapped sequence motifs (37)].
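The stringent criterion described above (ungapped motifs precisely conserved in every aligned sequence) can be illustrated with a short sketch. It assumes that a conserved n-mer is a maximal run of gap-free columns identical in all sequences, with run lengths of 6 to 30; how the authors handled longer runs or overlapping motifs is not stated, so those details are assumptions, and the function name is hypothetical.

```python
def conserved_nmers(alignment, min_len=6, max_len=30):
    """Return (start_column, sequence) for each maximal conserved run.

    `alignment` is a list of equal-length gapped strings. A column is
    conserved when it has no gap and the same base in every sequence.
    Runs shorter than `min_len` or longer than `max_len` are skipped.
    """
    motifs = []
    run_start = None
    # A trailing gap-only sentinel column closes a run that reaches the end.
    columns = list(zip(*alignment)) + [("-",)]
    for i, col in enumerate(columns):
        conserved = col[0] != "-" and len(set(col)) == 1
        if conserved and run_start is None:
            run_start = i
        elif not conserved and run_start is not None:
            if min_len <= i - run_start <= max_len:
                motifs.append((run_start, alignment[0][run_start:i]))
            run_start = None
    return motifs


# Toy four-way alignment with a conserved ACGCGT core.
toy = [
    "TTACGCGTAA",
    "GAACGCGTCC",
    "CTACGCGTGA",
    "ATACGCGTTT",
]
print(conserved_nmers(toy))  # [(2, 'ACGCGT')]
```

Because the run must be identical in every sequence, unaligned or partially degenerate occurrences are ignored, which mirrors the trade-off between specificity and sensitivity discussed in the paragraph above.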

We identified 8873 conserved 6- to 30-oligomers (mers) in the four-way CLUSTALW alignments of orthologous intergenic sequences of the sensu stricto species. This is significantly more (>150 SDs) than the 1090 that were found in the same sequence alignments when their columns had been randomly shuffled (38). Thus, we are about 88% confident that the 6- to 30-mers do not occur by chance. Our confidence in each n-mer increases with its size: The 98% confidence level is reached with 10-mers, because the number of 10-mers in the shuffled alignments is less than 2% of those in the real alignments (fig. S2). Because the shuffled sequence alignments maintain the same degree of conservation, these results suggest that most of the n-mers result from functional selection rather than common ancestry. The six-way sequence comparison that includes the more distantly related sequences yielded 7915 conserved sequence motifs (39), with the 98% confidence level reached with 8-mers [see table S3 (40)]. The most statistically significant n-mers seem to be biologically significant because many of them are known binding sites for characterized DNA binding proteins (41).
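The confidence figures quoted here follow from comparing motif counts in the real alignments with counts in column-shuffled alignments. The sketch below shows one way to shuffle alignment columns and to compute the fraction of motifs not attributable to chance; the function names are illustrative, and the authors' exact shuffling procedure (38) may differ.

```python
import random


def shuffle_columns(alignment, seed=None):
    """Return the alignment with its columns in random order.

    Each column is kept intact, so the per-column conservation is
    preserved while the order of conserved columns is randomized.
    """
    rng = random.Random(seed)
    columns = list(zip(*alignment))
    rng.shuffle(columns)
    return ["".join(row) for row in zip(*columns)]


def fraction_not_by_chance(real_count, shuffled_count):
    """One minus the ratio of shuffled-alignment to real-alignment counts."""
    return 1.0 - shuffled_count / real_count


# Counts reported in the text for 6- to 30-mers in the four-way alignments.
print(f"{fraction_not_by_chance(8873, 1090):.0%}")  # 88%
```

Applying the same ratio to each motif length separately would give the length-dependent confidence levels described above, such as the 98% level reached at 10-mers.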

About one-third (2771) of the 6- to 30-mers identified from multiple sequence alignments of sensu stricto species' intergenic regions and about 20% (1535) of the conserved n-mers identified in the six-way sequence comparisons contain at least 1 of 71 known sequence motifs (fig. S3). A few known sequence motifs accounted for the majority of the matching n-mers: 13 characterized sequence motifs accounted for about three-fourths of the known n-mers identified in the alignments of the closely related sequences; 10 characterized sequence motifs accounted for about three-fourths of the known n-mers found in the six-way sequence comparisons. Thus, a few sequence motifs seem to be regulating a large number of genes. The TATA box, which is by far the most frequent of the known conserved sequence motifs (fig. S3), is conserved in surprisingly few intergenic regions (42).

Predicting novel functional sequence motifs. In an attempt to identify novel functional sequences among the 6- to 30-mers, we first determined whether any tend to reside upstream of genes that are functionally related (43). Among the unknown 6- to 30-mers that occur upstream of several genes, 18 n-mers identified from the alignments of the closely related sequences and 18 n-mers identified in the six-way sequence comparisons lie upstream of gene sets significantly enriched for similar functional annotations (44) (Table 1), and are thus good candidates for functional sequence motifs.

Table 1.

Conserved sequence motifs upstream of genes with similar function.

Motif       Function*                                       P value†
From sensu stricto alignments (18)
CTAAACGA‡   Lipid, fatty acid, and isoprenoid biosynthesis  <1.0 × 10^-6
TTGGAG      Lipid, fatty acid, and isoprenoid utilization   <1.0 × 10^-6
ACTCTTTT‡   Amino acid metabolism                           1 × 10^-5
TGGCGC      Amino acid biosynthesis                         <1.0 × 10^-6
GAAAAAG‡    Amino acid biosynthesis                         7 × 10^-5
AAAGAAA‡    Amino acid transport                            1 × 10^-5
TGTGGCG‡    Peptide transporters                            <1.0 × 10^-6
GTACGGAT‡   Ribosome biogenesis                             <1.0 × 10^-6
TCTAGA‡     Metabolism of cyclic and unusual nucleotides    <1.0 × 10^-6
AAGCCACA    Nitrogen and sulfur utilization                 <1.0 × 10^-6
ATAGAAA     Fermentation                                    1 × 10^-5
AGATCT‡     Phosphate transport                             <1.0 × 10^-6
AACGCCG‡    Peroxisome                                      1 × 10^-5
TGTTTAT     Cytoskeleton-dependent transport                <1.0 × 10^-6
CGCGCG      Vacuolar degradation                            <1.0 × 10^-6
GTGCAC      Homeostasis of metal ions (Na, K, Ca, etc.)     <1.0 × 10^-6
TTTTTCCT‡   Chromosome function                             8 × 10^-5
ACGCCAAA    Centrosome                                      <1.0 × 10^-6
From six-way alignments (18)
GGAAAAA‡    C-compound and carbohydrate utilization         <1.0 × 10^-6
TCGTTTA‡    Lipid, fatty acid, and isoprenoid metabolism    <1.0 × 10^-6
GGAATT      Amino acid metabolism                           <1.0 × 10^-6
AGAAAT‡     Ribosome biogenesis                             <1.0 × 10^-6
CATACA‡     Ribosome biogenesis                             <1.0 × 10^-6
TACGTA      Translation                                     <1.0 × 10^-6
TTCAAG      Pre-mRNA splicing                               <1.0 × 10^-6
ATATGT      Ion transporters (Na, K, Ca, NH4, etc.)         <1.0 × 10^-6
TATTGT      Chromosome function                             <1.0 × 10^-6
AACAAC      Chromosome function                             <1.0 × 10^-6
AAGGAA      Chromosome function                             <1.0 × 10^-6
GAAACA      Pheromone response, mating-type determination   <1.0 × 10^-6
AAAACG      Directional cell growth (morphogenesis)         <1.0 × 10^-6
AAGGGAA‡    Cell polarity and filament                      <1.0 × 10^-6
TTGCAA      Peroxisome                                      <1.0 × 10^-6
TGTGGC      Nitrogen and sulfur utilization                 <1.0 × 10^-6
AGAGAG‡     Nitrogen and sulfur utilization                 <1.0 × 10^-6
TTTCTTT‡    Plasma membrane                                 <1.0 × 10^-6

* Functional categories from MIPS.

† The probability that this set of genes is enriched in this functional classification by chance (52).

‡ In these cases, several overlapping sequence motifs were found as enriched in the same functional category.

Another way to predict which of the unknown conserved sequence motifs are functional is to identify those that reside upstream of genes that exhibit a similar pattern of expression. This seems to be a valid way to evaluate the functionality of conserved sequence motifs because the expression profiles of the 736 genes in the S. cerevisiae genome whose promoters contain an MCB box [a sequence motif that contributes to regulation of gene expression in the G1 phase of the cell cycle (12, 45, 46)] were essentially random through the cell cycle (Fig. 2A), but the subset of genes whose MCB box is conserved and aligned in the orthologous sequences of all four sensu stricto species exhibits coherent expression through the cell cycle (Fig. 2B). Even genes in which the MCB box is present but not aligned in the orthologous intergenic sequences of all four species exhibited coherent expression (Fig. 2C), although it is not as obvious as it is in genes that have these motifs aligned. Thirty-nine of the unknown conserved sequence motifs that we identified in sensu stricto species sequence alignments, and 13 of the unknown conserved sequence motifs that we identified in the six-way sequence comparisons that occur in multiple intergenic regions, reside upstream of a set of genes that are significantly enriched for similar gene expression patterns (Table 2) (47).

Fig. 2.

Expression profiles of genes containing particular sequence motifs in their promoters, normalized for mean and variance. Expression coherence (EC) values and P values were calculated as described in (61). (A) Cell cycle expression profiles (53) of all genes in the S. cerevisiae genome that contain an exact match to the MCB box in their promoters. (B) Cell cycle expression profiles of S. cerevisiae genes containing an MCB box that is aligned in the CLUSTALW alignment of orthologous sensu stricto promoters. (C) Cell cycle expression profiles of S. cerevisiae genes containing an MCB box that is present but not aligned in each of the orthologous sensu stricto promoters.
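An expression coherence score of the kind reported in Fig. 2 and Table 2 can be sketched as follows. This is only an illustration, assuming that EC is the fraction of gene pairs in a set whose mean- and variance-normalized profiles lie within some Euclidean distance threshold; the precise definition and threshold are those of the cited reference and are not given in the text, so the `threshold` parameter and function names here are placeholders.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def normalize(profile):
    """Scale a profile to zero mean and unit (population) variance."""
    mu, sigma = mean(profile), pstdev(profile)
    if sigma == 0:
        raise ValueError("a flat profile cannot be variance-normalized")
    return [(x - mu) / sigma for x in profile]


def expression_coherence(profiles, threshold):
    """Fraction of gene pairs with similar normalized expression profiles.

    Returns 0 when no pair is closer than `threshold` and 1 when every
    pair is. `threshold` is a hypothetical parameter for this sketch.
    """
    normalized = [normalize(p) for p in profiles]
    pairs = list(combinations(normalized, 2))
    if not pairs:
        raise ValueError("at least two profiles are required")
    close = sum(1 for a, b in pairs if math.dist(a, b) < threshold)
    return close / len(pairs)


# Three toy genes: the first two rise together, the third falls.
toy_profiles = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]
print(expression_coherence(toy_profiles, threshold=1.0))
```

In this toy set only one of the three gene pairs is coherent, so the score is one-third; a set of genes driven by a shared functional motif would be expected to score higher than a random set of the same size.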

Table 2.

Unknown sequence motifs upstream of genes with coherent expression.

Sequence motif  Condition(s)*   EC†‡
From sensu stricto alignments (39)
TCCCTT          MMS             0.393
TTCCAGAA        DNA damage      0.476
CAACTTT         DNA damage      0.333
ACGGAT          DNA damage      0.489
GATTGA          DNA damage      0.333
CAGAAC          DNA damage      0.333
TAATAG          DNA damage      0.393
                Stress          0.357
TTTCAGA§        DNA damage      0.352
                Stress          0.619
TGTACGG§        DNA damage      0.6
                Stress          0.4
GCGATGC§        DNA damage      0.303
                Stress          0.53
CAAGGG§         DNA damage      0.333
                Stress          0.321
ACTGAA§         DNA damage      0.393
                Stress          0.321
CGATGCCC§       DNA damage      0.81
                Mitochondria    0.429
TGTTCT§         Mitochondria    0.306
                Stress          0.429
CAAACAAA        Stress          0.333
ACTCTTTT§       Stress          0.321
AACTTTTC        Stress          0.333
ATGCGATG        Stress          0.41
AAACAAGA§       Stress          0.381
TAATCT          Stress          0.467
AAAGTA          Stress          0.4
TCCGTA          Stress          0.429
TTTCTAGA        Stress          0.333
ACATTC          Stress          0.381
ATACCT          Stress          0.333
CCCTTAAA        Cell cycle      0.357
AACGCCAA§       Cell cycle      0.533
TTGCCACT§       Cell cycle      0.4
TAAACAAT        Cell cycle      0.333
AAAGAT          Cell cycle      0.321
TATTAG          Cell cycle      0.4
CACCAC          Cell cycle      0.333
TTTTTTGT§       Meiosis         0.451
ACAAAAAC§       Meiosis         0.371
GTTGTTTT§       Meiosis         0.333
ATCAAA          Meiosis         0.357
CGACAC          Meiosis         0.4
TATGTATA§       Pheromone       0.321
GCTACC          Pheromone       0.333
From six-way alignments (13)
AATGTA§         DNA damage      0.393
AAAAGTA§        DNA damage      0.333
                Stress          0.333
ACATAC          DNA damage      0.357
                Stress          0.6
ATACAT          DNA damage      0.364
                Mitochondria    0.361
                Stress          0.327
TTTTCAT§        Stress          0.35
AGTGAA§         Stress          0.381
CTGAAAA§        Stress          0.867
TCAAAAT§        Stress          0.333
TTCAAG          Stress          0.372
TAGAAA          Cell cycle      0.4
TTCTTTC§        Cell cycle      0.361
ACAAAA§         Meiosis         0.307
CCCTTTT§        Meiosis         0.333

* The following gene expression profiling data sets were used: cell cycle (53), meiosis (54), MMS damage (55), sporulation (56), stress response (57), DNA damage (58), MAPK (59), mitochondrial dysfunction (60).

† Gene expression coherence (0 = no similar expression of the genes, 1 = maximal similarity of expression of the set of genes) was calculated as previously described (54).

‡ The probability of each set of genes having the EC score by chance is less than 10^-6 (61).

§ In these cases several overlapping sequence motifs were found as enriched in the same functional category.

Potentially functional conserved sequence motifs can also be predicted by identifying those that tend to reside in the intergenic regions to which a particular transcription factor binds. The intergenic regions of the S. cerevisiae genome to which 106 known or predicted DNA binding proteins bind have been identified by genomewide chromatin immunoprecipitation (ChIP) experiments (48). We determined the significance of the overlap between 23 test sets of intergenic regions that contain a conserved occurrence of a known transcription factor binding site with each of the 106 sets of intergenic regions bound by a transcription factor (49). Twenty-one of these 23 sets of intergenic regions overlapped significantly (P < 10^-5) with the intergenic regions bound by the transcription factor known to bind the site, suggesting that this is a valid approach for predicting functional sequence motifs. Considering unknown conserved sequence motifs that are present in multiple intergenic regions, nine that we identified in sensu stricto species sequence alignments and four that we identified in the six-way sequence comparisons significantly overlapped with one of the 106 sets of genes bound by a transcription factor (Table 3). These 13 conserved sequence motifs are candidates for sequences that either bind one of the 106 transcription factors, or are bound by an unknown transcription factor that interacts with one of the 106 known or predicted transcription factors.

Table 3.

Unknown sequence motifs correlated with ChIP experiments.

Sequence motif  Associated transcription factor  P value*
From sensu stricto alignments (9)
TGTACGG†        Fhl1                             2.50 × 10^-10
TGTATGG†        Fhl1                             1.97 × 10^-7
GTTCTTG†        Fhl1                             6.66 × 10^-7
TAATCT          Fhl1                             4.21 × 10^-7
TAGCCA          Fhl1                             1.97 × 10^-7
TTCTAGA         Hsf1                             1.92 × 10^-7
GCCAAG          Smp1                             4.49 × 10^-7
GGACCC          Smp1                             1.10 × 10^-9
ATTATCA         Smp1                             2.99 × 10^-7
From six-way alignments (4)
TTGAAA†         Fhl1                             1.15 × 10^-10
ACATAC†         Fhl1                             1.70 × 10^-7
GTTTAT          Hir1                             3.18 × 10^-6
                Hir2                             1.74 × 10^-7
TCTTTC          Sfp1                             3.47 × 10^-6

* The probability that the set of intergenic regions with a conserved sequence motif overlaps one of the 106 sets of intergenic regions bound by a transcription factor by chance was calculated by the hypergeometric probability distribution, as described in (54).

† In these cases, several overlapping sequence motifs were found as enriched in the same functional category.
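The overlap significance named in the Table 3 footnote can be illustrated with a hypergeometric tail probability. The sketch below assumes the P value is the chance of observing at least the given overlap when a motif-containing set is drawn at random from all intergenic regions; the function name, parameter names, and the numbers in the example are hypothetical and are not taken from the study.

```python
from math import comb


def hypergeom_overlap_pvalue(population, bound, with_motif, overlap):
    """P(X >= overlap) for a hypergeometric overlap between two sets.

    population: total number of intergenic regions considered
    bound:      regions bound by the transcription factor (ChIP set)
    with_motif: regions containing the conserved motif
    overlap:    regions present in both sets
    """
    upper = min(bound, with_motif)
    total = comb(population, with_motif)
    favorable = sum(
        comb(bound, k) * comb(population - bound, with_motif - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / total


# Hypothetical example: 6000 regions, 200 bound by a factor, 50 carrying
# a conserved motif, 20 of which are also bound.
p = hypergeom_overlap_pvalue(6000, 200, 50, 20)
print(f"P = {p:.3g}")
```

Exact integer binomial coefficients are used so that very small tail probabilities are not lost to rounding before the final division.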

In summary, we identified 59 conserved sequence motifs in the four-way sequence alignments and 32 in the six-way sequence comparisons for which there is some evidence of functionality (Tables 1, 2, 3). Twelve of these were identified in both the four-way and six-way sequence comparisons, leaving 79 unique conserved sequence motifs that are good candidates for functional regulatory sequences (50).

Conclusions. We have shown that phylogenetic footprinting on a genome-wide scale identifies many statistically significant conserved sequence motifs. Because we compared multiple genome sequences that are as optimally diverged as possible, we were able to predict functional sequence motifs by relatively straightforward methods using fairly stringent criteria for sequence motif definition (i.e., searching for n-mers). The fact that most known regulatory sequences turn up as conserved n-mers in our analysis validates this approach for identifying functional sequences and bolsters our confidence that many of the novel sequence motifs we identified are likely to be functional. The novel sequence motifs of which we are most confident are the 79 that lie upstream of sets of genes that tend to have similar functional annotations or similar expression or are bound by the same transcription factors. Of course, these are only predictions of functional sequences; experimental results will be necessary to validate them. The large number of conserved sequence motif predictions provided by comparative DNA sequence analysis should catalyze development and application of the high-throughput experimental methods necessary for testing their function.

Supporting Online Material

www.sciencemag.org/cgi/content/full/1084337/DC1

Figs. S1 to S3

Tables S1 to S4
