Technical Comments

Comment on “Widespread RNA and DNA Sequence Differences in the Human Transcriptome”

Science  16 Mar 2012:
Vol. 335, Issue 6074, p. 1302
DOI: 10.1126/science.1209658

Abstract

Li et al. (Research Articles, 1 July 2011, p. 53; published online 19 May 2011) reported large numbers of differences between DNA and messenger RNA in human cells, indicating unprecedented levels of RNA editing, and including sequence changes not produced by any of the known RNA editing mechanisms. However, common sources of systematic errors in high-throughput sequencing technology, which were not properly accounted for in this study, explain most of the claimed differences.

Li et al. (1) reported widespread RNA and DNA sequence differences (RDDs) in the human transcriptome, representing all 12 possible residue changes, and suggesting unexpectedly high frequencies and spectra of RNA editing. Because RNA editing and transcriptional noise are known phenomena (2–4), the novelty of this work resides in the extreme extent to which the differences are reported to happen in human cells. However, although high-throughput sequencing (HTS) does provide unprecedented opportunities to study the transcriptome, Li et al. did not properly control for a number of technical limitations of HTS and the downstream analysis of the data, resulting in an unacceptably high false-positive rate within this study. In this comment, we revisit the data analyzed by Li et al., highlight the major sources and estimated frequency of errors, and suggest a much more conservative interpretation of the results presented in the original work.

Spurious RNA-DNA differences in HTS data may originate from (i) systematic sequencing artifacts (technology specific), (ii) alignment errors, and (iii) incorrect genotypes resulting from limited coverage of the target regions. The first two types of error will result in specific distributions of presumed polymorphic sites within sequencing reads. Both sequencing and alignment errors most often occur at the termini of reads. In addition, they both tend to occur preferentially on one sequencing strand, whereas for true variants an approximately unbiased distribution can be observed. We analyzed these two biases in the distributions of RNA sequencing calls obtained from the presumed edited sites reported by Li et al. as compared with actual confirmed polymorphic DNA sites (Fig. 1, A and B). Based on this analysis, we estimate that false-positive results identified by their presence in only the first or last positions of sequencing reads, along with those that are supported only by unidirectional reads, account for more than half of the entire dataset (Table 1).
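The two read-level biases described above can be illustrated with a minimal sketch. It is not the procedure used in this comment, which estimates false positives from the tails of the distributions in Fig. 1; it only computes, for one candidate site, the proportion of alternative-base reads at a read terminus and the proportion on the forward strand. The `AltRead` record and the function names are hypothetical and introduced for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AltRead:
    """One read supporting the alternative base at a candidate site (hypothetical record)."""
    offset: int        # 0-based position of the alternative base within the read
    read_length: int   # read length in bases
    is_forward: bool   # True if the read aligned to the forward strand


def terminal_fraction(reads: List[AltRead]) -> float:
    """Proportion of reads carrying the alternative base at their first or last base."""
    if not reads:
        return 0.0
    terminal = sum(1 for r in reads if r.offset in (0, r.read_length - 1))
    return terminal / len(reads)


def forward_fraction(reads: List[AltRead]) -> float:
    """Proportion of alternative-base reads aligned to the forward strand."""
    if not reads:
        return 0.0
    return sum(1 for r in reads if r.is_forward) / len(reads)


def looks_like_artifact(reads: List[AltRead]) -> bool:
    """Flag a site supported only by terminal positions or only by one strand."""
    if not reads:
        return False
    n_terminal = sum(1 for r in reads if r.offset in (0, r.read_length - 1))
    n_forward = sum(1 for r in reads if r.is_forward)
    only_terminal = n_terminal == len(reads)
    unidirectional = n_forward in (0, len(reads))
    return only_terminal or unidirectional
```

A site supported by very few reads will be unidirectional by chance alone, so a flag of this kind would in practice need to be combined with a minimum read count or compared against a control distribution, as in Fig. 1.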

Fig. 1

(A and B) Distributions of RNA sequencing calls observed in RDD sites and in a control data set of dbSNP sites that are polymorphic in the samples under study. False positives are calculated using the tails of the distributions (where the RDD and control distributions intersect), as the number of RDD sites minus the expected number of sites [single-nucleotide polymorphisms (SNPs)]. (See supporting online material for details.) (A) Proportion of reads where the alternative base is found in the first or last base of the sequencing read. Distributions are interpolated using polynomial regression. Solid line, RDD; dashed line, control SNPs. (B) Proportion of reads belonging to the forward sequencing strand. Solid red line, polynomial interpolation of RDD distribution; dashed red line, power law interpolation of control distribution. (C) Screenshots taken from the Integrative Genomics Viewer (5), illustrating the read coverage of several samples for the site chr19:60,590,467 in RPL28.

Table 1

Major sources of errors, frequency of occurrence in RDD sites, and estimated number of false positives within the RDD data set. (See supporting online material for details.) FPs, false positives.

A common alignment error occurs when reads span splice junctions. Short overhangs of a few nucleotides can often be misattributed to an incorrect exon-exon junction. In Fig. 1C, we show one such case in the gene RPL28, which was extensively analyzed by the authors as an example of an edited site resulting in a vastly altered protein isoform. This specific case illustrates all the typical features of a false positive alignment and sequencing problem: (i) all the reads align in one direction only; (ii) the variant site is present at the extremity of the read; (iii) it is directly adjacent to another variant; and (iv) it flanks a splice junction and no supporting reads extend past the 5th nucleotide of the exon. It should also be noted that the sequence AGAT, which aligns across the junction, corresponds to the first four bases of the Illumina adapter sequence and is most likely not a part of any expressed sequence but the result of sequencing the adapter. We estimate that more than 700 of the reported RDD sites are due to misalignments close to exon boundaries (Table 1).
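Two of the features listed for the RPL28 case, the short overhang into the exon and the match to the adapter, lend themselves to a simple check. The sketch below is illustrative only and is not taken from the supporting material. The adapter prefix is an assumption (a commonly used Illumina adapter start, consistent with the AGAT noted above) and should be replaced by the adapter actually used in library preparation. The function names and the coordinate convention are hypothetical.

```python
from typing import Iterable

# Assumption: start of a commonly used Illumina adapter sequence.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def overhang_matches_adapter(overhang: str, adapter: str = ADAPTER_PREFIX) -> bool:
    """True if the bases aligned across the junction equal the start of the adapter."""
    overhang = overhang.upper()
    return 0 < len(overhang) <= len(adapter) and adapter.startswith(overhang)


def max_overhang(read_end_positions: Iterable[int], exon_start: int) -> int:
    """Largest number of bases any supporting read extends into the exon.

    Coordinates are 1-based and inclusive (an assumption of this sketch);
    reads ending before the exon start contribute nothing.
    """
    return max(
        (end - exon_start + 1 for end in read_end_positions if end >= exon_start),
        default=0,
    )
```

For example, `overhang_matches_adapter("AGAT")` returns `True`. A four-base match arises by chance in about 1 of 256 random sequences, so the check is supporting evidence to be read alongside the other three features, not a test on its own.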

Alignment errors are also frequent in repetitive regions and their neighboring sequences. As a solution, the authors use a repeat-masked version of the genome. However, when using short reads, repeat masking with standard parameters is not sufficient because such reads may still map to multiple places. Hence, we identified all regions in the genome allowing multiple mappings of 54 nucleotide reads. RDD sites reported by Li et al. are greatly overrepresented in such regions (threefold over the null expectation), as well as in sequences flanking these and all types of repetitive regions (Table 1).
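An overrepresentation of this kind can be expressed as a fold enrichment: the fraction of sites that fall in a class of regions divided by the fraction expected under a null model. The sketch below assumes the simplest null, in which sites are spread uniformly over the bases considered. The null model actually used is described in the supporting online material and may differ. The function name and the figures in the usage note are hypothetical.

```python
def fold_enrichment(sites_in_region: int, total_sites: int,
                    region_bases: int, total_bases: int) -> float:
    """Observed fraction of sites in a region class over the fraction expected
    if sites were uniformly distributed across all bases considered."""
    if total_sites <= 0 or region_bases <= 0 or total_bases <= 0:
        raise ValueError("total_sites, region_bases and total_bases must be positive")
    # (sites_in_region / total_sites) / (region_bases / total_bases), rearranged
    # so that the integer products are formed before the single division.
    return (sites_in_region * total_bases) / (total_sites * region_bases)
```

With made-up numbers, 300 of 1,000 sites falling in regions that cover 10% of the bases considered gives `fold_enrichment(300, 1000, 100, 1000)`, a threefold enrichment.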

Finally, very-low-coverage DNA sequencing data were used throughout this analysis, introducing uncertainty in genotype information, particularly for regions that are difficult to sequence. The authors argue that this effect is negligible. However, it should be noted that in the two samples where the DNA coverage was most complete (~13X), the number of RDD sites detected was lowest and of the same order of magnitude as has been previously reported by others (2). We investigated whether the reported RDD sites are enriched for single-nucleotide variants present in dbSNP132, and found a sixfold enrichment over the null expectation. Given that dbSNP is incomplete, the actual error rate due to missing DNA information is likely to be even higher. As an extreme illustration, four of the six presumed novel exonic coding sites reported by the authors in table 2 have also been reported as polymorphisms in dbSNP.
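The effect of limited DNA coverage on genotype calls can be illustrated with a simple binomial model. This is an assumption of the sketch, not a calculation from this comment: at a heterozygous site, each read is taken to sample the alternative allele independently with probability 0.5, ignoring sequencing error, mapping bias, and local variation in depth. Under that model, the chance that a heterozygous DNA site is called homozygous (and a true RNA allele then mistaken for an RDD) rises sharply as depth falls. The function name and its parameters are hypothetical.

```python
from math import comb


def prob_het_missed(depth: int, min_alt_reads: int = 1,
                    alt_fraction: float = 0.5) -> float:
    """Probability of observing fewer than `min_alt_reads` alternative-allele
    reads at a heterozygous site covered by `depth` reads (binomial sampling)."""
    if depth < 0 or min_alt_reads < 0:
        raise ValueError("depth and min_alt_reads must be non-negative")
    return sum(
        comb(depth, k) * alt_fraction ** k * (1 - alt_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
```

Under these assumptions, a site covered by four reads shows no alternative-allele read with probability 0.5^4 = 6.25%. Average coverage such as ~13X also conceals many positions with much lower local depth, where the probability is correspondingly higher.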

In fact, for most of the examples cited by the authors throughout the main text, we found a plausible alternative explanation involving a technical artifact (see table S3). HTS is an exciting new tool for understanding a variety of biological problems and constitutes a very active area of research. With progressively longer sequence reads and improved modeling of sequencing biases in the alignment/assembly steps, it is certainly suited for detecting RDDs. Analyses must, however, take into account both technical and analytical limitations, which the work presented by Li et al. does not do properly. We provide evidence that the authors overestimate the frequency of RDDs by an order of magnitude. In view of the above shortcomings, we question some of the boldest findings of this work—i.e., the extent of RNA editing resulting in protein modifications, noncanonical editing types, and variation across individuals.

Supporting Online Material

www.sciencemag.org/cgi/content/full/335/6074/1302-c/DC1

Materials and Methods

Tables S1 to S3

References
