Special Viewpoints

Fewer Genes, More Noncoding RNA


Science  02 Sep 2005:
Vol. 309, Issue 5740, pp. 1529-1530
DOI: 10.1126/science.1116800

Abstract

Recent studies showing that most “messenger” RNAs do not encode proteins finally explain the long-standing discrepancy between the small number of protein-coding genes found in vertebrate genomes and the much larger and ever-increasing number of polyadenylated transcripts identified by tag-sampling or microarray-based methods. Exploring the role and diversity of these numerous noncoding RNAs now constitutes a main challenge in transcription research.

A few months before the publication of the first drafts of the human genome sequence (1, 2), online bids predicting the number of human protein-coding genes ranged from 30,000 to 150,000 [see (3)]. To the surprise of many (4), initial bioinformatic analyses revealed no more than 35,000 human genes, an estimate that has steadily declined to the present 25,000 genes (5). On the other hand, the largest estimates based on the number of distinct polyadenylated transcript 3′-ends identified through the single-pass sequencing of cDNA libraries (6) [i.e., expressed sequence tags (ESTs)] have not followed a diminishing trend. On the contrary, more transcripts keep being discovered, many of which do not correspond to annotated genes [e.g., (7)], in particular when using the serial analysis of gene expression (SAGE) approach (8).

Over the last 5 years, this discrepancy (4) between the number of recognized protein-coding genes and the apparent number of transcripts has not been reduced. As early as 1997, the then-thriving genomics industry had already sequenced several million ESTs and had come up with estimates of well over 100,000 human genes. For example, Incyte Genomics estimated 140,000 genes by grouping overlapping EST sequences [cited in (9)]; this total did not include more than 200,000 EST sequences seen only once.
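The sensitivity of such estimates to how ESTs are grouped can be illustrated with a minimal sketch. It assumes that some upstream alignment step has already reported which pairs of ESTs overlap; the function name `cluster_ests`, the EST identifiers, and the pair list are hypothetical and are not taken from any of the cited pipelines. ESTs linked by a chain of overlaps are merged into one cluster (single linkage), and ESTs seen only once are counted separately.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def cluster_ests(est_ids: List[str],
                 overlapping_pairs: Iterable[Tuple[str, str]]) -> Tuple[int, int]:
    """Group ESTs by transitive overlap using a union-find structure.

    Returns (number of clusters with two or more ESTs, number of singletons).
    The overlap pairs are assumed to come from a separate alignment step.
    """
    parent = {est: est for est in est_ids}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in overlapping_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    sizes = Counter(find(est) for est in est_ids)
    multi_member = sum(1 for n in sizes.values() if n > 1)
    singletons = sum(1 for n in sizes.values() if n == 1)
    return multi_member, singletons


# Toy data (hypothetical): six ESTs, three reported overlaps.
toy_ests = ["est1", "est2", "est3", "est4", "est5", "est6"]
toy_pairs = [("est1", "est2"), ("est2", "est3"), ("est4", "est5")]
clusters, singles = cluster_ests(toy_ests, toy_pairs)
print(clusters, singles)
```

Whether singletons are added to the cluster count or set aside changes the resulting "gene" estimate substantially, which is why the 200,000 once-seen ESTs mentioned above matter.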

Comparable numbers emerged a few years later in the public domain. The Human Gene Index of the Institute for Genomic Research predicted in excess of 75,000 human genes (10), whereas the Unigene database of the National Center for Biotechnology Information indicated 84,000 genes (6). These sequences are still in the databases, awaiting reconciliation with the much smaller number of human genes identified by the direct analysis of the human genome sequence.

Recent results may put an end to the paradox, albeit in a rather unexpected manner: A large fraction of the human (vertebrate) genome appears to give rise to polyadenylated transcripts that do not code for proteins. The notion of noncoding RNAs is not new; for example, the 17-kb X-inactive specific transcript (Xist) was discovered in 1991 (11). However, it is only recently that the sheer scale of the phenomenon has begun to be realized. Unfortunately, initial analyses of the transcriptome were based on hybridization with probes derived from predefined or predicted gene sequences, and thus they did not reveal unexpected transcripts. A vastly different picture of transcriptional activity emerged as soon as tiling arrays were introduced, allowing the interrogation of genome sequences for corresponding transcripts at fixed intervals irrespective of predicted gene locations. For instance, a tiling array with 5-nucleotide resolution that mapped transcription activity along 10 human chromosomes revealed that an average of 10% of the genome (compared to the 1 to 2% represented by bona fide exons) corresponds to polyadenylated transcripts, of which more than half do not overlap with known gene locations (12).
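The two quantities reported from the tiling-array study (the fraction of the genome detected as transcribed, and the share of transcribed regions lying outside known genes) reduce to simple interval arithmetic. The sketch below is illustrative only: the coordinates are toy values, the function names are hypothetical, and counting novelty per merged region rather than per base is an assumption, not the method of (12).

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), in base pairs


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlaps_any(region: Interval, annotations: List[Interval]) -> bool:
    """True if the region shares at least one base with any annotation."""
    start, end = region
    return any(a_start < end and start < a_end for a_start, a_end in annotations)


def summarize(transcribed: List[Interval],
              known_genes: List[Interval],
              genome_length: int) -> Dict[str, float]:
    """Fraction of the genome transcribed, and fraction of merged
    transcribed regions that do not overlap any known gene."""
    merged = merge_intervals(transcribed)
    covered = sum(end - start for start, end in merged)
    novel = [r for r in merged if not overlaps_any(r, known_genes)]
    return {
        "fraction_transcribed": covered / genome_length,
        "fraction_novel": len(novel) / len(merged) if merged else 0.0,
    }


# Toy example (hypothetical coordinates on a 1,000-bp "genome").
toy_transcribed = [(0, 50), (40, 80), (300, 320), (900, 920)]
toy_genes = [(10, 100)]
result = summarize(toy_transcribed, toy_genes, genome_length=1000)
print(result)
```

In the toy data, 120 of 1,000 bases are transcribed, and two of the three merged regions fall outside the single annotated gene.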

Recent data from the FANTOM 3 project (13, 14) confirm and amplify these findings. Through a technical tour de force, the members of this consortium have established that a staggering 62% of the mouse genome is transcribed. They have identified more than 181,000 independent transcripts, of which half consist of noncoding RNA. Moreover, they found that more than 70% of the mapped transcription units overlap to some extent with a transcript from the opposite strand (13, 14).

These results provide a solution to the discrepancy between the number of (protein-coding) genes and the number of transcripts: noncoding polyadenylated mRNA accounts for a large fraction of the 3′-EST sequences (and SAGE tags) subsequently clustered or remaining as singletons. Indeed, the noncoding Xist mRNA is abundantly represented in all EST projects. It is thus likely that sequences of noncoding transcripts have been accumulating in EST databases and have for the most part (including singleton and antisense ESTs) been erroneously interpreted as coming from the 3′-untranslated regions of protein-coding transcripts. Noncoding transcripts originating from intergenic regions, introns, or antisense strands have probably been right before our eyes for 8 years without having been recognized!

The notion that transcription is limited to protein-coding genes is also being challenged in microbial systems. For Escherichia coli, the first analysis with a genome tiling microarray revealed a substantial number of antisense and intergenic transcripts (15). Noncoding short-lived “cryptic” mRNAs have also recently been seen in yeast, the transcription of which may maintain chromatin in an open state (16). The consequences of certain RNA polymerase II mutations for the status of pericentromeric heterochromatin also suggest a direct coupling between the transcription of noncoding RNAs and chromatin structure (17).

The intergenic, intronic, and antisense transcribed sequences that were once deemed artifactual are now a testimony to our collective refusal to depart from an oversimplified gene model. But what if transcription is even more complex? Could it, for instance, lead to mRNAs generated from two different chromosomes (Fig. 1)? A year ago, we would have immediately suspected such sequences as further artifacts arising from large-scale cDNA sequencing programs. But now? Perhaps it's time to go back to the cDNA sequence databases and reevaluate the numerous unexpected objects they contain (18). Transcription will never be simple again, but how complex will it get?

Fig. 1.

Relationship between the KIAA0510 cDNA sequence and a FLJ00128 protein-encoding transcript. The FLJ00128 cDNA (GenBank identification number 18676462) looks like a standard transcript with more than 20 exons (not drawn), all mapping to human chromosome 14. This transcript encodes a large protein of more than 1500 residues without known or predicted function. The KIAA0510 cDNA sequence (GenBank identification number 3413954) corresponds to a single exon, mapping on chromosome 1 and devoid of significant open reading frames. The 3′ noncoding part of this cDNA is fused to a 188-nucleotide sequence (boxed) 100% identical to a sequence unique to chromosome 14 and encoding 62 residues of protein FLJ00128. This region does not match the boundaries of an exon (as would be expected for trans-splicing) in the gene encoding FLJ00128. Both transcript sequences were assembled from multiple independently isolated ESTs and are devoid of low-complexity regions or repeats. Thus, they cannot easily be dismissed as cloning or sequencing artifacts.

References and Notes
