A Computational Screen for Methylation Guide snoRNAs in Yeast


Science  19 Feb 1999:
Vol. 283, Issue 5405, pp. 1168-1171
DOI: 10.1126/science.283.5405.1168


Small nucleolar RNAs (snoRNAs) are required for ribose 2′-O-methylation of eukaryotic ribosomal RNA. Many of the genes for this snoRNA family have remained unidentified in Saccharomyces cerevisiae, despite the availability of a complete genome sequence. Probabilistic modeling methods akin to those used in speech recognition and computational linguistics were used to computationally screen the yeast genome and identify 22 methylation guide snoRNAs, snR50 to snR71. Gene disruptions and other experimental characterization confirmed their methylation guide function. In total, 51 of the 55 ribose methylated sites in yeast ribosomal RNA were assigned to 41 different guide snoRNAs.

The genome of the yeast Saccharomyces cerevisiae has been completely sequenced and is thought to contain about 6000 protein-coding genes (1). However, some of the largest eukaryotic gene families produce functional RNAs rather than protein products. For example, yeast contains about 140 tandemly repeated copies of ribosomal RNA (rRNA) genes (1) and 275 dispersed transfer RNA genes (2).

The small nucleolar RNAs (snoRNAs) (3) are involved at various stages of eukaryotic ribosome biogenesis within the nucleolus (4). Ribosomal RNA undergoes cleavages and dozens of nucleotide modifications before assembly with ribosomal proteins into the mature ribosome (4). The two major families of snoRNAs appear to be involved in the two most common types of rRNA modification: Box H/ACA snoRNAs are required for specific rRNA pseudouridylations (5), and most of the C/D box snoRNAs appear to be involved in rRNA ribose methylation (6, 7). A small number of snoRNAs in each family are involved in other steps of rRNA processing (3).

Conserved among all C/D box snoRNAs, the C and D box sequence motifs are required for snoRNA nucleolar localization, accumulation, and association with the ribonucleoprotein particle complexes that carry out rRNA processing (8). C/D box snoRNAs involved in ribose methylation also contain an internal “guide” sequence that is able to base pair with a specific segment of rRNA (Fig. 1). In association with protein cofactors, a guide snoRNA specifies the precise location for a particular 2′-O-ribose modification through its guide sequence (6, 9).

Figure 1

C/D box methylation guide snoRNA consensus. The position of 2′-O-methylation of rRNA is within the helix formed by the complementary guide sequence of the snoRNA and precisely 5 nucleotides (nt) upstream of box D′ (or D, in snoRNAs that have their guide region adjacent to D instead of an internal D′ box). The C′ box feature was recognized (29) after the original snoRNA model was conceived; bp, base pairs.
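The positional rule in this caption can be made concrete with a short sketch. It is an illustration only, not the authors' program: it assumes the guide region directly abuts box D′ (or D), requires exact Watson-Crick complementarity (no G-U pairs or mismatches), and counts the snoRNA nucleotide immediately 5′ of the box as position 1. The function names and the default guide length of 12 nt are hypothetical.

```python
# Watson-Crick complement for RNA; G-U wobble pairs are ignored in this sketch.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA string (A, C, G, U only)."""
    return "".join(COMPLEMENT[nt] for nt in reversed(rna))


def predicted_methyl_site(snorna, box_start, rrna, guide_len=12):
    """Return the 0-based rRNA index predicted to be 2'-O-methylated, or None.

    snorna    : snoRNA sequence, 5' to 3'
    box_start : 0-based index where box D' (or D) begins in the snoRNA
    rrna      : rRNA sequence, 5' to 3'
    guide_len : assumed length of the guide directly upstream of the box
    """
    if guide_len < 5 or box_start < guide_len:
        return None
    guide = snorna[box_start - guide_len:box_start]
    hit = rrna.find(reverse_complement(guide))
    if hit < 0:
        return None
    # The duplex is antiparallel: the guide nucleotide next to the box pairs
    # with rrna[hit], so the fifth nucleotide upstream pairs with rrna[hit + 4].
    return hit + 4
```

Under these assumptions the predicted site is always the fifth rRNA nucleotide of the matched segment, which is why the distance from the box, rather than the guide length, fixes the target.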

Although ribose 2′-O-methyls are numerous in all studied eukaryotic rRNAs (10) and have been known for decades (11), the precise function of these modifications remains unknown. The total number of rRNA 2′-O-methylations in Saccharomyces carlsbergensis, a close relative of S. cerevisiae, has been estimated at 55 (12), 42 of which have been mapped to specific nucleotide positions in the rRNA (10, 13, 14). In S. cerevisiae, 19 C/D box snoRNAs have been predicted to be responsible for methylation at 20 sites (3, 6, 7, 15), a little more than one-third of the total rRNA ribose methyl groups. Experimental evidence supporting these predictions is available only for U24 (6). If the hypothesis is correct that snoRNAs guide most or all ribose 2′-O-methylations in eukaryotes, the majority of this snoRNA gene family remains unidentified in S. cerevisiae.

Because the S. cerevisiae genome is completely sequenced (1), it is reasonable to consider identifying methylation guide snoRNAs computationally. However, sequence similarity of snoRNAs across phyla and within the gene family is generally weak, so commonly used computational methods such as BLAST and FASTA fail to identify new genes by similarity to known snoRNAs. Attempts have been made to identify snoRNAs by pattern searches based on the rRNA complementary guide sequence and other conserved features, but feature consensus is poor, so this approach has had only modest success (7, 16).

Formal probabilistic models, based in part on methods used in speech recognition and computational linguistics, have been introduced for searching for complicated consensus features in biological sequences (17). Hidden Markov models (HMMs) (18) are probably the best known of these approaches. Another class of model called stochastic context-free grammars (SCFGs) has been used to construct probabilistic profiles of RNA consensus, allowing sensitive searching for RNA secondary structure (19). Using these probabilistic modeling techniques, we can produce an integrated model of snoRNAs that is based on the sequence features specific to this RNA gene family (Fig. 2).

Figure 2

Schematic of the probabilistic model. States (boxes and ovals) are connected by transitions (arrows). Each numbered state is a probabilistic model of a sequence feature (Table 1). Transition probabilities are 1.0, except those shown for transitions 2 → 3 and 2 → 8, which account for the proportion of snoRNAs with a guide sequence adjacent to box D′ and those with a guide sequence adjacent to box D, respectively.

To rapidly scan for 2′-O-methylation guide snoRNA candidates in the genome sequence, we used a greedy search algorithm. The program sequentially identified six components characteristic of these genes: box D, box C, a region of sequence complementary to rRNA, box D′ if the rRNA complementary region is not directly adjacent to box D, the predicted methylation site within the rRNA based on the complementary region, and the terminal stem base pairings, if present. The program also takes into account the relative distance between identified features within the snoRNA, information we found critical to reducing the false positive identification rate.
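The first steps of such a sequential scan might be sketched as follows. This is a simplified illustration, not the authors' program: it locates only box C and box D by exact pattern match and filters pairs by their separation. The patterns RUGAUGA and CUGA are the commonly cited consensus motifs rather than values taken from Table 1, and the distance limits `min_len` and `max_len` are hypothetical placeholders.

```python
import re

# Commonly cited consensus motifs (an assumption here, not values from Table 1).
# Lookaheads are used so that overlapping occurrences are all reported.
BOX_C = re.compile(r"(?=[AG]UGAUGA)")
BOX_D = re.compile(r"(?=CUGA)")


def find_candidates(seq, min_len=60, max_len=120):
    """Return (box_c_start, box_d_start) pairs with a plausible separation.

    min_len and max_len bound the span from the start of box C to the end
    of box D; both limits are hypothetical placeholders.
    """
    rna = seq.upper().replace("T", "U")  # accept DNA or RNA input
    c_starts = [m.start() for m in BOX_C.finditer(rna)]
    d_starts = [m.start() for m in BOX_D.finditer(rna)]
    candidates = []
    for c in c_starts:
        for d in d_starts:
            span = d + 4 - c  # box D is 4 nt long
            if min_len <= span <= max_len:
                candidates.append((c, d))
    return candidates
```

A fuller scan would go on to test each surviving pair for an rRNA-complementary region, an optional box D′, and a terminal stem. Filtering on relative distance early keeps the number of pairs passed to those later, more expensive steps small.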

Each candidate snoRNA alignment was scored against our probabilistic model (Table 1). SnoRNAs were ranked on the basis of a final logarithmic odds score (20) that incorporated information from each of the snoRNA features. The initial model was trained on 35 human C/D box snoRNAs proposed to function as methylation guides (6). Nine previously isolated yeast snoRNAs matched this snoRNA gene model with significant scores (25.91 to 43.55 bits). In a search of randomly generated sequences (21) equivalent in size to four complete yeast genomes, the maximum score for a false positive (29.65 bits) exceeded the score for only one of the nine known snoRNAs. Thus, we believed we had sufficient training data to search for unidentified snoRNAs in the yeast genome.
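As a rough illustration of how a log-odds score in bits is computed for one fixed-length motif, the sketch below sums, over motif positions, the base-2 logarithm of the ratio between the model probability and a background probability. The uniform background, the toy probability of 0.5 for each consensus nucleotide, and the helper names are assumptions for the example, not parameters of the published model.

```python
import math

# Uniform null model; an assumption made only for this example.
BACKGROUND = {nt: 0.25 for nt in "ACGU"}


def consensus_model(consensus, p_match=0.5):
    """Build a toy ungapped motif model from a consensus string.

    Each position gives p_match to the consensus nucleotide and splits
    the remainder evenly among the other three. p_match is hypothetical.
    """
    p_other = (1.0 - p_match) / 3.0
    return [
        {nt: (p_match if nt == c else p_other) for nt in "ACGU"}
        for c in consensus
    ]


def motif_log_odds(site, position_probs, background=BACKGROUND):
    """Log-odds score in bits of `site` under the motif versus background."""
    if len(site) != len(position_probs):
        raise ValueError("site and model must have the same length")
    return sum(
        math.log2(position_probs[i][nt] / background[nt])
        for i, nt in enumerate(site)
    )
```

Because the scores are logarithms of probability ratios, per-feature scores for independent features can be added to give a single total for ranking, which is consistent with a final score that combines information from each feature.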

Table 1

Summary of states within the snoRNA probabilistic model. State numbers correspond to Fig. 2. “Ungapped HMM” states represent fixed-length conserved sequence motifs. The state for the terminal stem is analogous but models base pairs rather than single positions [for example, an SCFG (17), instead of an HMM]. Duration models for gaps are estimated from binned length distributions (for example, the probability that a gap will be 11 to 20 nt, 21 to 30 nt, and so forth). The guide state is an HMM dependent on the rRNA target sequence; it includes terms for the probability of starting the complementarity at a given position relative to rRNA (this probability is high near known methylation positions), the length of the complementarity, and the probability of mismatches and noncanonical base pairs in the complementarity. For each state, the most common feature (“consensus”) is shown to indicate the overall pattern we search for. The best, average, and worst feature scores are given for 41 methylation guide snoRNAs as an indication of the relative contribution of each state to the overall information in the model. For more details, see the program source code (28).
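A binned duration model of the kind described here can be sketched in a few lines. The bin width of 10 nt follows the example ranges in the caption (11 to 20 nt, 21 to 30 nt); the add-one pseudocount, the pooling of long gaps into the last bin, and the function names are assumptions for illustration, and the authors' source code (28) remains the authoritative description.

```python
from collections import Counter


def bin_index(length, width=10):
    """Map a gap length in nt to its bin: 1-10 -> 0, 11-20 -> 1, and so on."""
    if length < 1:
        raise ValueError("gap length must be at least 1 nt")
    return (length - 1) // width


def fit_duration_model(lengths, n_bins, width=10, pseudocount=1.0):
    """Estimate per-bin probabilities from observed gap lengths.

    Lengths beyond the last bin are pooled into it. The pseudocount keeps
    unobserved bins from receiving zero probability (an assumed choice).
    """
    counts = Counter(min(bin_index(x, width), n_bins - 1) for x in lengths)
    total = len(lengths) + pseudocount * n_bins
    return [(counts[b] + pseudocount) / total for b in range(n_bins)]


def gap_probability(length, model, width=10):
    """Probability assigned to the bin that contains `length`."""
    return model[min(bin_index(length, width), len(model) - 1)]
```

Binning trades resolution for robustness: with only a few dozen training snoRNAs, a per-length distribution would be mostly zeros, whereas a handful of bins can be estimated from the same data.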


We began our search for previously unidentified snoRNAs by identifying family members that target 42 known 2′-O-methyl sites in S. cerevisiae, inferred from mapping data from S. carlsbergensis (10). Candidate snoRNAs were divided on the basis of target methyl site and sorted by score, producing 42 lists of best-to-worst snoRNA predictions, one for each methyl site. Depending on search parameter cutoffs and the specific target methyl site, the program found up to several dozen predictions for each methyl site. Candidates overlapping predicted protein-coding regions were noted and disfavored relative to other strong, nonoverlapping candidates. Seven previously published snoRNAs have been predicted to guide methylation at eight of the 42 sites (U14, U18, U24 at two sites, snR39, snR39b, snR40, and snR41) (6, 7). Our searches did not show improved snoRNA predictions over the previously identified snoRNAs, so we did not pursue different assignments for these eight sites.
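The bookkeeping described in this paragraph, dividing candidates by target site and sorting each list from best to worst, can be sketched as below. The record fields (`site`, `name`, `score`, `overlaps_orf`) are hypothetical. The sketch only carries the coding-region overlap flag along for later inspection, because the text does not specify how overlapping candidates were weighted.

```python
from collections import defaultdict


def rank_by_site(candidates):
    """Group candidate records by target methyl site, best score first.

    Each candidate is a dict with hypothetical keys 'site', 'name',
    'score' (bits), and 'overlaps_orf' (bool). The overlap flag is kept
    but not used for sorting.
    """
    by_site = defaultdict(list)
    for cand in candidates:
        by_site[cand["site"]].append(cand)
    return {
        site: sorted(group, key=lambda c: c["score"], reverse=True)
        for site, group in by_site.items()
    }
```

Each value in the returned dictionary corresponds to one of the per-site lists, so the top entry for a site is the first candidate one would consider for experimental testing.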

We tested the top scoring snoRNA gene predictions corresponding to the remaining sites by gene disruption (22). Each snoRNA-disrupted strain was tested for the ability to methylate at the predicted rRNA site by a deoxynucleoside triphosphate (dNTP) concentration-dependent primer extension assay (Fig. 3) (23). Out of 30 gene disruptions, 24 loci were verified as encoding methylation guide snoRNAs. Seven of these had been previously identified as C/D box snoRNAs, and 17 snoRNAs had not been previously identified (Table 2). Primer extension assays for two of the snoRNA disruption mutants, snR55 and snR70, showed a noticeable but minor change in the primer extension pattern at the expected sites (24); thus, we qualify these assignments as “inconclusive.” None of these snoRNA gene disruptions were lethal, nor did we observe impaired growth on rich media.

Figure 3

Experimental confirmation of methylation guide function for yeast snoRNAs. 2′-O-methylation is detected as a reverse transcriptase primer extension stop on rRNA that occurs one nucleotide 3′ of the modified base in low but not high dNTP concentrations [for example, for wild-type (wt) rRNA, compare lane 5, high dNTP, to lane 6, low dNTP, for five methylated LSU rRNA positions indicated at the right]. Homologous disruption of a guide snoRNA (here, snR60, snR50, the snR72-snR78 array, and snR40 lanes) causes precise loss of corresponding methylation-dependent stops on the rRNA; for example, observe the loss of the Um896 band in the snR40 knockout in lane 14.

Table 2

C/D box snoRNAs in S. cerevisiae that function as methylation guides. Previously unidentified snoRNAs or methylation sites are in boldface. Previously identified snoRNAs that have now been determined to be methylation guides are in italics. “Match/Mismatch” column refers to the number of base pairings (G-U included) and mismatches found within the snoRNA complementary region–rRNA duplex. “Len” refers to the known or predicted mature snoRNA length in nucleotides. “Position” and “Chr” refer to the 5′ end and chromosomal genomic locus according to the current version of the yeast genome available at the S. cerevisiae database (30). Strand designations: W, Watson, upper/forward strand; C, Crick, lower/complement strand. Ribosomal RNA positions are numbered as in (10). The last column gives the GenBank accession numbers.


Some of the 24 snoRNAs described above guide modification at more than one methyl site, as previously seen for U24 (6). The search program predicted and we experimentally verified one additional methylation target site for snR47, snR48, and snR51 (Table 2). We also found an additional target site for snR41 (Table 2), a snoRNA previously predicted to guide at a different methyl site (6). We verified snR41 methylation guide function for both the previously predicted site [small subunit (SSU) Gm1123] and the newly predicted site (SSU-Am541). With these additional site assignments, plus the eight previously assigned sites (6, 7), 36 of 42 known methyl sites can be attributed to guide snoRNA.

Our searches gave no strong snoRNA candidates for four of the remaining six ribose methyl sites, all on large subunit (LSU) rRNA: Cm648, Gm1448, Am2279, and Gm2919. One common factor among these sites is that they are all one nucleotide 3′ to other ribose methyl sites that have confirmed or strong snoRNA assignments. Previous disruption of U24 results in unexpected loss of the methyl at Gm1448, adjacent to the predicted loss at Am1447 (6). Disruption of U18 results in loss at both Am647 and Cm648 (25). Our disruption of snR13 resulted in loss at Am2278 and Am2279. snR52 is strongly predicted, although not confirmed, to guide methylation at Um2918, one nucleotide adjacent to Gm2919. In each of these cases, we hypothesize that a change in the snoRNA-rRNA base pairing could allow a single snoRNA to guide modification at the observed 3′ adjacent ribose 2′-O-methyl sites. A previous alternative proposal suggests that an independent methyltransferase catalyzes addition of the 3′ adjacent methyl groups and that the reaction is dependent on the existence of the snoRNA-guided 5′ methyl sites (6).

We then turned to the 13 ribose methyl sites whose exact positions in the rRNA were not known. We used three lines of evidence to predict and then experimentally verify the position as well as the snoRNA assignment for 12 of the 13 unmapped sites. First, between one and five nucleotides of sequence context surrounding each of these methyl sites are known from ribonuclease T1 fingerprints (12, 14). Second, we checked the existing collection of C/D box snoRNAs for previously unrecognized rRNA complementary regions that could target sites not included in the list of known ribose 2′-O-methyls. Third, we went back to the S. cerevisiae genome search results from our program and extracted all high-scoring snoRNAs that could target previously unidentified rRNA methyl sites.

For each newly predicted methyl site, we experimentally checked for an rRNA primer extension pause typical of ribose 2′-O-methyls (23). For supported methyl sites, we then disrupted the corresponding guide snoRNA (all except snR38) to confirm anticipated loss of the methylation (Table 2; snoRNAs assigned to methyl sites in boldface). Six sites were assigned to known C/D box snoRNAs (Table 2; snR40, for example) and six to newly identified snoRNAs (Table 2; snR58, for example). Each of these 12 ribose methyl sites could be correlated with a T1-RNase digest fragment for one of the 13 unmapped ribose methyls (14). We could not identify the location of the single unmapped methyl site in SSU rRNA (T1 fragment GmU). snR190 has also been predicted to target a potential methyl site at LSU-Gm2393 (6). In our primer extension assay, this site did not give a visible band, nor did its sequence context correspond to an unassigned T1 fragment. None of the verified guide snoRNAs were found to be essential, nor did gene disruption cause noticeably impaired growth.

In summary, we can attribute snoRNA-directed modification to 51 of the 55 ribose 2′-O-methyls in yeast rRNA (Table 2). This leaves four sites for which we could not assign a prediction (SSU-Am436), locate the methyl site (SSU-Gm?), or experimentally verify a prediction (LSU-Um2918 and LSU-Gm2919). Protein methyltransferases targeting these specific sites may account for our difficulty in finding or verifying guide snoRNAs in these cases.

From the perspective of the RNA gene family, we count 41 total guide snoRNAs assigned to 51 rRNA methylation sites (Table 2), 22 of which we identified in this work. We estimate that up to two methylation guide snoRNAs remain to be identified for the two unassigned methylation sites (SSU-Am436 and SSU-Gm?) and that two to four snoRNAs may be identified as being redundant with known snoRNAs for SSU-Um1265, SSU-Cm1637, LSU-Um2918, and LSU-Gm2919.

With nearly all 2′-O-methylation guide snoRNAs identified, we can assess the general genomic organization of the gene family (Table 2). Most are dispersed as independent singlets or within five small clusters of two to seven tandemly arrayed guide snoRNAs. A total of 19 singlets occur outside of known protein-coding genes, presumably as independent transcription units. All tandemly arrayed snoRNAs within the same cluster are oriented on the same strand, and recent results indicate that these genes are polycistronic (26). Six yeast snoRNAs occur within the introns of host protein genes, all on the pre-mRNA coding strand. The mixture of snoRNAs in yeast occurring within introns and tandem arrays and as singlets is in contrast to vertebrates, where all currently known guide snoRNAs are within host gene introns. Polycistronic arrays of snoRNAs have also been reported in plants (27). Some plant polycistrons contain a mix of snoRNAs from both major families of guide snoRNAs (C/D box and H/ACA box snoRNAs), whereas none of the yeast tandem arrays contain members outside of the C/D box family.

It is possible that a large number of noncoding RNAs remain to be discovered. Both computational screens and experimental screens tend to be biased against RNAs. Many functional RNAs are not polyadenylated, so they are not well represented in oligo(dT) primed cDNA libraries or in expressed sequence tag sequencing projects. Often the genes for RNAs are small and may occur in multiple copies. RNAs are of course not affected by stop codons or frameshifts, so they are probably somewhat refractory to genetic screens. Most functional RNAs known today have been identified by biochemical means, but these approaches are best suited to abundant RNAs. Using probabilistic modeling methods, we are beginning to gather the tools necessary to computationally screen genome sequences for noncoding RNAs.

