RNA Maps Reveal New RNA Classes and a Possible Function for Pervasive Transcription


Science  08 Jun 2007:
Vol. 316, Issue 5830, pp. 1484-1488
DOI: 10.1126/science.1138341


Significant fractions of eukaryotic genomes give rise to RNA, much of which is unannotated and has reduced protein-coding potential. The genomic origins and the associations of human nuclear and cytosolic polyadenylated RNAs longer than 200 nucleotides (nt) and whole-cell RNAs less than 200 nt were investigated in this genome-wide study. Subcellular addresses for nucleotides present in detected RNAs were assigned, and their potential processing into short RNAs was investigated. Taken together, these observations suggest a novel role for some unannotated RNAs as primary transcripts for the production of short RNAs. Three potentially functional classes of RNAs have been identified, two of which are syntenically conserved and correlate with the expression state of protein-coding genes. These data support a highly interleaved organization of the human transcriptome.

A large fraction of the noncoding part of a eukaryotic genome is used to make RNA that is sufficiently stable in a cell to be detected by different technological approaches (1–4). The biological significance of this pervasive transcription is unclear and controversial. One possibility is that only very short regions of such unannotated RNA are biologically relevant (5). In-depth characterization of RNAs as to their subcellular compartmentalization, size, modifications, and genomic origins can potentially provide clues to their functions. This study reports two general observations derived from the maps of nuclear and cytosolic polyadenylated [poly(A)+] RNAs longer than 200 nucleotides (nt) (long RNAs, lRNAs) and whole-cell RNAs less than 200 nt (short RNAs, sRNAs) over the entire nonrepetitive portion of the human genome. First, the potential biological function of an appreciable portion of long unannotated transcripts is to serve as precursors for sRNAs. Second, these maps reveal three classes of RNAs that have specific genomic localization at gene boundaries. Biological relevance of these classes of RNAs is supported by strong correlation with the expression state of genes they associate with, as well as their syntenic conservation between human and mouse.

The complexity of steady-state RNA populations was profiled by using tiling arrays at 5-nt resolution to detect transcribed regions in the human genome (6, 7). Overall, we found the extent and general properties of the lRNA portion of the human transcriptome to be similar to our earlier study (7). Patterns of annotated and unannotated transcription were similar among cell lines and within subcellular compartments (fig. S1, A to J, and table S2, A and B). About 64% of detected poly(A)+ transcription (nucleus and cytosol) did not align with annotations (fig. S1N) (7). Of the 265,237 annotated exons, 80% were expressed in at least one cell line (fig. S2).

A total of 1.1% of the interrogated genome is covered by transcribed fragments (transfrags) representing sRNAs (summarized in tables S1 and S2C and figs. S3 and S4). sRNA transfrags have a nonrandom association with genomic features, including EvoFold structure predictions (figs. S5 and S7 and table S1B). In addition, some showed a tendency to map antisense to splice junctions. sRNAs were found in intronic, intergenic, and annotated regions (fig. S4). Unannotated sRNAs were verified by Northern blots (table S3 and fig. S6) and real-time reverse transcription polymerase chain reaction (table S3), with an overall verification rate of ∼70% (7).

Maps of sRNAs and lRNAs from different subcellular compartments can be further used to provide a virtual “genealogy” describing origins of particular classes of RNAs (Fig. 1 and table S4). A total of 12.7% of all interrogated nucleotides can be detected as composing lRNAs or sRNAs in HeLa or HepG2 cell lines. One-third of this total (33.2%) is exclusively observed in the nucleus as lRNAs and overlapping sRNAs (Fig. 1A). Another 15.1% is exclusively found as cytosolic lRNAs and overlapping sRNAs. A total of 46.3% of the sequences detected are found in both the nucleus and cytosol (Fig. 1A). Finally, 5.3% of the nucleotides were detected in sRNA transfrags exclusively.

Fig. 1.

Relations among the transcribed bases in the nonrepeat portions of the human genome. (A) Distribution of nucleotides in transfrags from long nuclear and cytosolic RNAs and sRNAs from HeLa and HepG2 (table S4). (B) Genealogy of nucleotides detected in lRNAs and sRNAs based on the association of the array-detected transcription shown in (A). Putative product-precursor associations between lRNAs and sRNAs are indicated by arrows.

The origin of 79.6% of all transcribed bases can be mapped back to the nucleus as lRNAs with 15.1% found only in cytosol (Fig. 1B). About 41.8% of the sequences seem to remain exclusively in the nucleus; the remainder is transported into the cytosol. Furthermore, 3.1% of the exclusively nuclear lRNA sequences and 6.6% of the nuclear sequences transported into the cytosol overlap sRNA sequences, which suggests that ∼40% of the latter may be processed from long nuclear transcripts (Fig. 1B and table S4).
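The bookkeeping behind these percentages can be checked with a few lines of arithmetic. The sketch below is an illustration, not the authors' analysis: it takes the four category percentages quoted for Fig. 1A as inputs, and the function and key names are invented for this example. Summing the two rounded nuclear categories gives 79.5%, so the 79.6% quoted in the text presumably reflects unrounded values.

```python
# Percentages of detected nucleotides per category, as quoted in the text (Fig. 1A).
NUCLEAR_ONLY = 33.2   # lRNAs (and overlapping sRNAs) seen only in the nucleus
CYTOSOL_ONLY = 15.1   # lRNAs (and overlapping sRNAs) seen only in the cytosol
BOTH = 46.3           # sequences seen in both nucleus and cytosol
SRNA_ONLY = 5.3       # nucleotides seen only in sRNA transfrags


def genealogy_summary(nuclear_only, cytosol_only, both, srna_only):
    """Derive the summary fractions discussed in the text (hypothetical helper)."""
    total = nuclear_only + cytosol_only + both + srna_only
    # Bases whose origin can be traced back to nuclear lRNAs.
    nuclear_origin = nuclear_only + both
    # Share of nuclear-origin bases that are never seen in the cytosol.
    retained_in_nucleus = 100.0 * nuclear_only / nuclear_origin
    return {
        "total": total,
        "nuclear_origin": nuclear_origin,
        "retained_in_nucleus": retained_in_nucleus,
        "exported_to_cytosol": 100.0 - retained_in_nucleus,
    }


summary = genealogy_summary(NUCLEAR_ONLY, CYTOSOL_ONLY, BOTH, SRNA_ONLY)
for name, value in summary.items():
    print(f"{name}: {value:.1f}%")
```

The four categories sum to 99.9% because the published values are rounded.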

About one-fifth of sRNA transfrags, 20.9% (HepG2) and 18.4% (HeLa), were identified as evolutionarily conserved. The PhastCons scores (7) associated with sRNAs are significantly enriched in conserved sequences (P value 2.2 × 10⁻¹⁶, Wilcoxon nonparametric test) over random (fig. S8, A and B), which points to the possible biological relevance of these transcripts (Fig. 2A). There is also a statistically significant concordance (P < 0.01, permutation test) observed between the locations of sRNA and nuclear lRNA transfrags. A total of 13% (HepG2) and 9% (HeLa) of nuclear lRNA transfrags overlap sRNAs. Conversely, 44% and 31% of sRNA transfrags overlap with nuclear lRNA transfrags. Such an association is potentially confounded by the elevated G-C composition of these regions (7). To explore this further, we divided the transfrags obtained from nuclear lRNA into those that do and do not overlap sRNA transfrags (Fig. 2B). Mean PhastCons scores for the lRNA transfrags that do overlap with sRNAs are significantly higher than those of the transfrags that do not (Fig. 2C and table S5). For lRNA transfrags overlapping sRNAs, a total of 23.9% (HepG2) and 26.2% (HeLa) exhibit PhastCons scores equal to or higher than the average score observed for annotations (Fig. 2D). The conservation of the nuclear lRNA transfrags often extends beyond the sRNA transfrag it overlaps (Fig. 2B), indicating that other sequences outside of the overlapping regions may be important, reminiscent of the extended conservation seen in the miRNA precursors (8).
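The reciprocal overlap percentages quoted above differ because each is normalized by a different set of transfrags. A minimal sketch of that calculation follows, assuming transfrags are given as half-open (start, end) coordinates on a single strand of one chromosome. The coordinates and the function name are hypothetical, and this is not the pipeline used in the study.

```python
from bisect import bisect_left


def fraction_overlapping(query, reference):
    """Fraction of query intervals that overlap at least one reference interval.

    Intervals are half-open (start, end) tuples on one chromosome and strand.
    """
    # Merge the reference intervals so they are sorted and non-overlapping.
    merged = []
    for start, end in sorted(reference):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    starts = [m[0] for m in merged]

    hits = 0
    for start, end in query:
        # Index of the first merged interval that begins at or after `end`;
        # only the interval just before it can still overlap the query.
        i = bisect_left(starts, end)
        if i > 0 and merged[i - 1][1] > start:
            hits += 1
    return hits / len(query) if query else 0.0


# Hypothetical transfrag coordinates, for illustration only.
lrna = [(100, 500), (800, 1200), (2000, 2600)]
srna = [(150, 180), (600, 650), (1190, 1230), (3000, 3050)]

print(fraction_overlapping(srna, lrna))  # share of sRNA transfrags overlapping lRNAs
print(fraction_overlapping(lrna, srna))  # share of lRNA transfrags overlapping sRNAs
```

Because the denominators differ, the two fractions need not agree, which is why the text reports both directions.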

Fig. 2.

Sequence conservation analysis of short and long nuclear RNA (7). (A) Conserved sRNAs surrounding the first exon of RYR3 gene. (B) Long nuclear transfrag that overlaps sRNA is more conserved than adjacent lRNA transfrag that does not. (C) Quantile-quantile plot of PhastCons scores of long nuclear transfrags that do (x axis) and do not (y axis) overlap sRNAs. For any given point on the curve, an equal proportion of each “quantile distribution” occurs at this juncture. (D) Distribution of PhastCons scores of long nuclear transfrags that overlap sRNAs, binned on the basis of PhastCons scores (x axis), versus percentage of transfrags in each bin (y axis). Highly conserved transfrags (scores > 0.4) are indicated.
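For readers unfamiliar with the quantile-quantile plot in panel (C), the following sketch shows how such points can be computed. It is only an illustration: the scores are invented, the function name is hypothetical, and the standard-library `statistics.quantiles` (Python 3.8 or later) stands in for whatever software produced the figure.

```python
from statistics import quantiles


def qq_points(sample_x, sample_y, n=4):
    """Pair up matching quantiles of two samples.

    Returns n - 1 (x, y) points; each point compares the same quantile
    (for example the median) of the two score distributions.
    """
    qx = quantiles(sample_x, n=n)
    qy = quantiles(sample_y, n=n)
    return list(zip(qx, qy))


# Hypothetical PhastCons-like scores, for illustration only.
overlapping_srna = [0.05, 0.12, 0.30, 0.41, 0.55, 0.62, 0.78, 0.90]
not_overlapping = [0.01, 0.03, 0.08, 0.10, 0.15, 0.22, 0.35, 0.50]

for x, y in qq_points(overlapping_srna, not_overlapping):
    # Points with x > y fall below the diagonal: the x-axis sample is
    # shifted toward higher scores at that quantile.
    print(f"x={x:.3f}  y={y:.3f}  below_diagonal={x > y}")
```

With the axes as in panel (C), a curve lying below the diagonal would indicate higher scores for the transfrags that overlap sRNAs.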

Taken together, these data suggest a possible product-precursor relation between overlapping transfrags derived from lRNAs and sRNAs, underscored by the enrichment of evolutionarily conserved sequences in genomic regions found transcribed in both lRNAs and sRNAs. Conservatively, 3.1% of HepG2 and 2.4% of HeLa nuclear lRNA transfrags may be parts of precursors of sRNAs. The full extent of transcription, which may serve as precursors of sRNA, could, however, be much larger, because lRNA transfrags that directly overlap sRNAs are almost certainly connected to other transfrags in a precursor transcript. Thus, any given lRNA transfrag can be an order of magnitude smaller than the lRNA transcript it represents.

sRNA mapping and sequence conservation analysis (fig. S8C) indicate that sRNAs cluster at the 5′ and 3′ boundaries of genes (Fig. 3). We denote these classes of sRNAs “promoter-associated sRNAs” (PASRs) and “termini-associated sRNAs” (TASRs). The occurrence of sRNAs centers around 5′ or 3′ termini and is statistically significant compared with G-C-matched random regions (Fig. 3, B and C, and table S1). Northern hybridization analysis revealed that PASRs and TASRs can vary in length (22 to 200 nt), with one prominent class of PASRs with lengths of ∼26, 38, and 50 nt (Fig. 3, B and C, and figs. S9 and S12 and tables S6 and S7). PASRs were expressed at levels similar to those of the protein-coding genes they overlap (7).
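The enrichment around gene termini can be pictured as a ratio of two distance histograms: one for the observed sRNAs and one for control positions. The sketch below makes several simplifying assumptions that are not from the source: positions are single coordinates on one strand, the control set is simply supplied by the caller (no G-C matching is performed here), and all names and numbers are hypothetical.

```python
from collections import Counter


def distance_histogram(positions, anchors, bin_size, max_dist):
    """Count positions per signed-distance bin relative to the nearest anchor."""
    counts = Counter()
    for p in positions:
        nearest = min(anchors, key=lambda a: abs(p - a))
        d = p - nearest
        if abs(d) <= max_dist:
            counts[d // bin_size] += 1  # floor division keeps the sign of the bin
    return counts


def fold_enrichment(observed, control, anchors, bin_size=100, max_dist=1000):
    """Observed-over-control frequency per distance bin, keyed by bin start."""
    obs = distance_histogram(observed, anchors, bin_size, max_dist)
    ctl = distance_histogram(control, anchors, bin_size, max_dist)
    result = {}
    for b in sorted(obs):
        if ctl[b] == 0:
            continue  # ratio undefined without control counts in this bin
        result[b * bin_size] = (obs[b] / len(observed)) / (ctl[b] / len(control))
    return result


# Hypothetical coordinates, for illustration only.
gene_starts = [1000]
srna_positions = [990, 1010, 1020, 1500]
control_positions = [600, 1010, 1400, 1900]

print(fold_enrichment(srna_positions, control_positions, gene_starts))
```

A real analysis would also separate sense from antisense strands, as in Fig. 3, B and C, and would need enough control positions to populate every bin.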

Fig. 3.

sRNAs are enriched at boundaries of transcripts. (A) Smoothed density of sRNAs and the map of long nuclear RNA from HepG2 are shown for a region of chromosome 21. (B and C) Association of sRNAs with 5′ and 3′ boundaries of annotated transcripts is enriched compared with a set of random regions with matched G-C-content. The fold enrichment over random is plotted as a function of a distance from the 5′ or 3′ termini for sRNAs on the same (“sense”) or opposite (“antisense”) strand as the annotations. Examples of Northern blots for PASRs and TASRs are shown below. (B) PASRs; (C) TASRs. (D) A positive correlation between the density of PASRs and the expression level of the associated genes. Violin plots illustrate the frequency distribution of measured expression levels for bins of genes (7). The median and mean expression levels for each bin are indicated by “=” and “*”, respectively. The numbers on top indicate the number of genes in each bin.

Several characteristics of both PASRs and TASRs support the biological significance of these sRNAs. As explained below, gene expression correlates with the density of PASRs, and PASRs associate with other lRNAs at the 5′ boundaries of genes. Also, expressed PASRs are syntenic with mouse.

The correlation of gene expression with the density of PASRs (Fig. 3D) is similar to a trend seen for antisense TASRs (fig. S10, A and B). Overall, 44.6 and 43.8% of genes found to be expressed in cytosol or nucleus have PASR association. Another 11.8 and 18.1% of genes had signal only in the first exon in cytosolic or nuclear RNAs, respectively. Almost half of those are observed to have PASRs (fig. S10, C and D). Conversely, for ∼80% of silent genes (<10% of exons detected), no PASRs were observed.

A third class of RNAs is the long transcripts that overlap 5′ boundaries of protein-coding genes but do not include most of the other exons. This is exemplified by genes that show signal only in their first exons (Fig. 3A). To characterize these promoter-associated lRNAs (PALRs), we performed 5′ and 3′ RACE analysis (rapid amplification of cDNA ends) followed by hybridization to tiling arrays (fig. S11). These experiments revealed that transcripts overlapping the promoter and the first exon and intron regions, ranging in length from hundreds of base pairs to more than 1 kb, are made and map to the same genomic regions as PASRs.

We constructed sRNA maps in two syntenic regions of the human and mouse genomes (IL4R cytokine cluster and four Hox loci) using mouse STO and R1 and human HepG2 and HeLa cell lines (7). Both species-specific and conserved PASRs and TASRs were found, with ∼39% of PASR sequences and 35% of TASR sequences mapping into syntenically conserved regions (Fig. 4A). Genomic regions shared by the PASR (HMSY19) at the 5′ boundaries of the HoxD9 and the TASR (HMSY5) in the 3′ termini of HoxD10 genes of both species are illustrated in Fig. 4. The sizes of PASRs and TASRs are similar for mouse and human cell lines (fig. S13 and table S8).

Fig. 4.

(A) Distribution of syntenically conserved sRNAs. The fractions of sRNAs in each class (ordinate) found in a syntenic location in both species are shown as percentages of the total number of sRNAs in the class. (B and C) Characterization of syntenically conserved PASRs (B) and TASRs (C). Combined maps of syntenic sRNAs from HeLa and HepG2 human cell lines (black) and R1mES and MEF mouse cell lines (gray) are shown. Syntenic PASR HMSY19 and TASR HMSY5 are shown on either top (+) or bottom (–) strands. Northern blots show HMSY19 and HMSY5 in both species with comparable sizes.

We have found that ∼10% of detected transcription is present in sRNA sequences (ranging from 22 to 200 nt in length). The distribution of these sRNAs is not uniform across the genome, because sRNAs are more frequent among genes than in intergenic regions. Furthermore, sRNA transfrags overlap a collection of lRNA transfrags that are significantly enriched in conserved sequences. Taken together, these data suggest that these lRNA transfrags potentially represent parts of nuclear primary transcripts that encode conserved functional sRNAs.

Several other observations are also derived from these mapping data. First, an appreciable fraction of protein-coding genes have expression only in the first exon and intron. This suggests that transcription may have two different states that are characterized by the lengths of transcripts made from the transcriptional start site of a gene locus. Second, the fate of the transcripts derived from a particular detected transcribed region could be predicted on the basis of their retention in the nucleus, transport into cytosol, or processing into sRNAs. Overall, these RNA maps provide a virtual genealogy of RNAs (Fig. 1B). Third, PASRs often align within the boundaries of some of the PALRs. The genomic loci and boundaries of PASRs appear to be well conserved in two human cell lines and, in some cases, between mouse and human cells; this may indicate that common processing signals are used to create them. Fourth, the ends of almost half of human protein-coding genes were found to be bracketed by PASRs and TASRs. Given that large regions (i.e., >1 kb) are contained in the sequences covered by the sRNAs, the functional roles of these sRNAs may involve broad domains consistent with involvement in chromatin alterations.

Other recent studies also report the presence of multiple transcripts at the 5′ boundaries of genes (9), including unstable lRNAs postulated to be involved in regulation of gene expression (10, 11). Thus, these results suggest a model of genome organization where protein-coding genes are at the center of a complex network of overlapping sense and antisense lRNA transcription, with interleaved sRNAs often marking their boundaries and correlating with their expression state (fig. S14). Our studies also highlight a potentially important biological function for a portion of unannotated nuclear transcription as precursors for sRNAs. Such interleaved transcription produces a variety of non–protein coding sRNA and lRNA species that offer cis- and trans-regulatory potential (12–14).

Supporting Online Material

Materials and Methods

Figs. S1 to S14

Tables S1 to S8


References and Notes
