Technical Comments

Response to Comments on “Widespread RNA and DNA Sequence Differences in the Human Transcriptome”

+ See all authors and affiliations

Science  16 Mar 2012:
Vol. 335, Issue 6074, pp. 1302
DOI: 10.1126/science.1210419


Kleinman and Majewski, Pickrell et al., and Lin et al. suggest that mapping and sequencing errors and genetic variants led to false discovery of RNA-DNA sequence differences in our paper. We repeated our analysis using two different sequence alignment methods and carried out additional experiments including whole genome DNA sequencing. The results are consistent with our finding of widespread RNA-DNA sequence differences.

We reported uncovering RNA sequences that differ from the underlying DNA sequences (1). With the RNA sequences and corresponding DNA sequences of B cells from 27 individuals, we found more than 10,000 sites where the RNA and DNA sequences differ. We termed these RNA-DNA differences (RDDs). Because DNA serves as the template for RNA synthesis, systematic differences between DNA and RNA sequences are thought to be rare. The known mechanisms that can lead to such differences are RNA editings mediated by two families of deaminases, ADARs (adenosine deaminases that act on RNA) and APOBECs (apolipoprotein B mRNA editing enzymes, catalytic polypeptide-like), which lead to A-to-G and C-to-U changes. We found all 12 types of differences, including transversions that are not likely to be caused by known editing mechanisms. In addition to B cells, we found RDDs in other cell types, including primary skin cells and brain tissues. By mass spectrometry, we showed that the RNA forms of RDDs are translated into proteins.

Since our publication, several groups (25) have published about finding RNA-DNA differences beyond the A-to-G and C-to-U types. In these projects, different sequencing platforms, tissue types, analysis methods, and inclusion criteria were used (Table 1A), thus supporting the conclusion that RDDs are more common than previously known. We also found examples in the literature where mutations (like RDDs) were found in the RNA and not the DNA of the patients; these have been implicated in human diseases, including Wilm’s tumor (6)and Alzheimer’s disease (7) (see Table 1B for additional examples). These findings of systematic differences between RNA and their corresponding DNA sequences beyond the known RNA editing events show that there are unknown steps involved in how DNA is copied into RNA. It also reveals additional genetic variation that extends beyond DNA sequence polymorphisms.

Table 1

Studies that identify RNA-DNA differences.

View this table:

Kleinman and Majewski (8), Pickrell et al. (9), and Lin et al. (10) raise concerns about our analysis and suggest that mapping and sequencing errors, as well as genetic variation, led to false positives in our study. Their Comments do not refute the presence of RDDs; rather, the disagreement is over how many there are. We view the discovery of RDDs as the important point and find the exact number to be less salient, because it depends on thresholds and sample sizes. Just as in single-nucleotide polymorphism (SNP) discovery, a comprehensive list requires a large sample; we do not intend for our initial report to be a complete catalog of RDDs; rather, our goal is to show that the phenomenon exists. Here, we repeated our analysis using two different sequence alignment algorithms (GSNAP and BLAT), explained the basis for the characteristic features of RDD sites in sequence reads, and carried out additional sequencing (with next-generation and Sanger sequencings) as well as protein immunoblot analysis to support our conclusions.

First, we analyzed the sequence data on the 27 individuals using a different alignment program, GSNAP (11). Among the 10,210 RDD sites, 8069 (79%) were also identified by GSNAP. The correlation of RDD levels obtained by Bowtie and GSNAP was very high (r = 0.9).

Second, we examined some of the unexpected features of the RDD sites in sequence reads. Authors of the Comments raised the concern that more than the expected number of RDD sites are found at ends of reads (positional bias) and reads in one direction (directional bias). We, too, were puzzled by this; but, upon close examination, we found that both features result from clustering of RDD sites, insertions/deletions (indels) that accompany the RDDs, and unannotated exon-exon junctions.

Previous studies showed that A-to-G sites tend to cluster (1214); our results show that RDDs also cluster. Among the identified RDD sites, 2613 sites (26%) are within 25 nucleotides (nt), and 1059 sites (10%) are adjacent to each other (1). This is likely an underestimate, because only reads with two or fewer mismatches (in the seed) were included in our analysis (Fig. 1A). Sites that are in strings of multiple RDDs would not have met our inclusion criteria. A-to-G editing events are known to be subject to hyperediting (13, 14); thus, they allow us to test our hypothesis that clustering of sites contributes to positional bias. We examined our results and indeed found that sites that show positional bias are enriched for A-to-G editing events. Forty-one percent of the sites found only in the first or last base of RNA sequencing (RNA-Seq) reads are A-to-G editing sites, whereas among all the RNA-DNA sequence differences, only 23% are A-to-G sites.

Fig. 1

Clustering of single-nucleotide RDDs, association of RDD sites with insertion/deletion RDDs, and unannotated exon-exon junctions led to positional and directional biases of RDDs in RNA-Seq reads. (A) In our Bowtie alignment, we allowed up to two mismatches (in the seed); thus, reads with multiple RDDs and/or insertion/deletion would not be aligned. Only those reads that happen to capture one or few of the RDDs that are clustered can be mapped; this only occurs when the RDDs are at the ends of the reads. (B) If reads represent exon-exon junctions that were not annotated by Gencode, they also would not be mapped by Bowtie. If a RDD site is found near such an unannotated junction, it can only be included if it is at the end of a read that does not contain the junction. The relative order of a single-nucleotide RDD with an insertion/deletion or an unannotated junction is important because reads that include the insertion/deletion or junction cannot be mapped; only reads in the direction that “meets” the single-nucleotide RDD have a chance to be included in the analysis.

In addition, the indels that accompany RDDs and unannotated exon junctions contribute to positional and directional biases (Fig. 1, A and B). The reads that contain RDDs and insertion/deletions or unannotated exon junctions were removed because they exceeded the number of mismatches we allowed (Fig. 1). Only reads that happen to capture one or few RDDs in a string, or those with only the RDD sites near the ends of sequence reads but without the insertion/deletion, were kept (Fig. 1). To quantify the impact of indels on positional bias, we examined the GSNAP data, because GSNAP allows indels to be incorporated (the version of Bowtie we used did not allow indels). GSNAP indeed “rescued” reads that were previously discarded due to indels, and in these “rescued” reads, the RDD sites are no longer at the ends of the reads (Fig. 2). The GSNAP results show that 1586 of the RDD sites are accompanied by at least one insertion/deletion within 5 bp of the RDD sites. Among them, 14% were found only in the first or last nucleotides of mapped reads, thus illustrating that indels contribute to the positional bias. The positions of RDDs in the sequence reads are explained by the clustering of RDD sites and the association of RDDs with indels and are not due to technical errors.

Fig. 2

Examples of insertion or deletion RDDs next to single-nucleotide RDDs. By Bowtie alignment, only reads that happened to have the single-nucleotide RDD at the ends of the reads were mapped and, therefore, included in our analysis. By GSNAP alignment, we show that some of the single-nucleotide RDDs were accompanied by insertions; when these insertions were allowed to be aligned, the RDD sites were no longer found only in the ends of reads. The examples show mapped reads from Bowtie and GSNAP alignments for two RDD sites: one for CCND2, where the A-to-T RDD site was accompanied by a T insertion (indicated by a blue bar); another is a T-to-G RDD in MRS2 that is accompanied by a C deletion (indicated by a black line).

The authors of the Comments also questioned whether RDDs could be artifacts of paralogous genes. We ruled out this possibility in three ways: (i) by examining the examples listed by Pickrell and colleagues, (ii) by read-depth analysis of RDD sites relative to the rest of the genome, and (iii) by realigning the RDD sites and surrounding sequences back to the entire genome using BLAT (15).

First, we examined the cases that Pickrell and others used as examples of paralogous genes. They pointed out correctly that some of the genes, such as AP2A2 in table 3 of our original publication, belong to a family of genes that share similar sequences. However, to lead to a false RDD call, the sequences of the other genes (paralogs) must correspond to the RNA sequences of the RDD sites; this is not the case for our results. The RNA form of the peptide (aspartic acid, D, coded by GAC) for AP2A2 (table 3 of Li et al.) cannot be explained by the DNA sequences of AP2A1 or AP2A2 (both read TAC that codes for tyrosine, Y) or any other DNA sequences in the genome. The DNA sequences of AP2A1 and AP2A2 (T in both genes) differ from the RNA-Seq reads (G) at the RDD site (see Fig. 3A). Furthermore, compared with AP2A1, our RNASeq reads have two mismatches, whereas, compared with AP2A2, there is only one mismatch (at the RDD site). Thus, the reads are most likely derived from AP2A2. The “equally good matches” listed in table 1 of Pickrell et al. cannot explain the RNA form of the peptides: Even though they may match the DNA forms of the peptides, they do not match the RNA forms of the peptides.

Fig. 3

(A) Alignment of sequences centered around a T-to-G RDD site (11:976,858) in AP2A2 to AP2A2 and to its paralogous gene, AP2A1. The RDD site was found regardless of whether the sequence was compared to AP2A1 or AP2A2. There was an additional mismatch in alignment to AP2A1. (B) Distributions of read depths for all nucleotides and for RDD sites on chromosome 22 based on low-coverage 1000 Genomes data are highly similar.

The above focused on specific examples; to broaden the analysis, we carried out a genome-wide “read-depth” analysis (16) to ask whether RDD sites result from known or unknown paralogous genes. We used the 1000 Genomes data for the CEU (Northern and Western European ancestry from the CEPH collection) individuals to determine the read depth for each base in the human genome and compared that distribution to the read depth for our 10,210 RDD sites. The two distributions are highly similar (Fig. 3B). If RDD sites are results of paralogous genes, then we would expect to see that read depths for RDD sites are higher than those for non-RDD sites; instead, we found that the distribution of read depths for RDD sites is not different from the rest of the genome. This result suggests that RDD sites are not enriched in regions of copy number variation or segmental duplications.

To further examine whether the RDD sites arise due to misalignment to repetitive regions or paralogs, we took the 10,210 RDD sites and surrounding sequences (in three windows of 25, 50 and 75 nt 5′ and 3′ of each site) and aligned those sequences by BLAT to the genome. In this analysis, we examined sequences from the entire genome and not just the sequences corresponding to the Gencode genes; thus, if the original analysis—which focused only on Gencode regions—led to misalignments, we would uncover them here. Among the 10,210 sites, 9289 mapped only to the sites that we identified in Li et al. (1) (see Table 2). We examined the remaining 921 sites that mapped to additional genomic regions. We found that 237 of them contain our RDD alleles, although in most of these cases, the match to the original sites had fewer mismatches than to the other locations; nevertheless, these could be miscalled as RDDs due to misalignment. Authors of the Comments suggest that any RDD that maps closely to another region in the genome is a false-positive finding. This is not correct; only sites that contain the identical sequences as our RDD alleles can lead to incorrect RDD calls. Based on BLAT results, less than 3% of the RDD sites may be miscalled due to paralogous sequences. This relatively low percentage is not surprising; in the original analysis, to avoid miscalling RDDs due to misalignment, we removed all sites in repetitive regions of the genome.

Table 2

BLAT analysis of RDD-containing sequences.

View this table:

The authors of the Comments were also concerned about unknown SNPs. They proposed that some of the RDDs can be explained by nonreference DNA sequences at the sites. The samples used are those studied extensively by several consortia, including HapMap and 1000 Genomes, that focus on identifying DNA sequence variants; thus, unknown SNPs are unlikely. The Comment authors pointed out correctly that some of the RDD sites are known SNPs. As noted in our original publication, we used known SNP sites only if the genotypes of our samples are homozygous based on results from the HapMap and the 1000 Genomes projects. These studies used different technologies to obtain the genotypes, so we are confident of the results. To address this concern experimentally, we validated the DNA sequences corresponding to several hundred RDD events by Sanger sequencing and deep sequenced the DNA of two individuals by next-generation sequencing.

In addition to the Sanger sequencing we carried out previously (1), here we sequenced an additional 73 sites in two to five individuals (average of 3) (Table 3 and Fig. 4). These sites were picked mostly randomly, except for some genes that are of particular interests to us, including the site in RPL28. Over the few hundred sites that were sequenced, two of the sites (chr7:73173273 and chr1:199732016) differ in sequences from 1000 Genomes/ Reference genome data. The remaining ones have identical sequences as those in 1000 Genomes/ Reference genome. Thus the discordance rate is less than 1%.

Table 3

Additional sites sequenced by Sanger sequencing. (Single-letter abbreviations for the amino acid residues are as follows: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu; M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W, Trp; and Y, Tyr.)

View this table:
Fig. 4

Examples of Sanger sequencing traces of genomic DNA sequences at RDD sites in Table 3. The results confirm DNA sequences used in our analysis.

In addition, we carried out whole-genome sequencing of the DNA of two subjects, GM12004 to 60X and GM12750 to 30X coverage. We asked whether deep- and whole-genome sequencing would allow us to identify DNA variants that led to misidentification of RDD alleles. We checked the DNA sequences of these two individuals at their RDD sites: 48 (10%) of the 486 RDD events in GM12004 and 84 (6%) of 1350 RDD events in GM12750 have one or more DNA reads that contain the RDD alleles (Table 4). In most cases (>60%), less than 5% of DNA reads contain the RDD alleles. Even if we eliminated all events with one or more DNA reads that correspond to the RDD alleles, however, we would only decrease the number of RDDs by ~10%. This confirms that DNA sequence errors or unknown DNA sequence variants do not lead to considerably more false-positive RDDs.

Table 4

Examples of DNA and RNA sequences at RDD sites

View this table:

Last, we carried out immunoblotting to validate our mass spectrometry results. We showed that a T-to-A RDD site in RPL28 leads to a loss of a stop codon and a 55-amino acid extension of the RPL28 protein. In our original publication (1), we showed mass spectra for the extended peptides that result from that RDD. Subsequently, we raised an antibody against the extended peptide and confirmed that the RNA form of RPL28 is expressed at protein level with an expected size of 13.7 kD (Fig. 5).

Fig. 5

Validation of RPL28 RDD (chr19:60590467, T>A) using an antibody specific for the RNA form of RPL28. (A) Schematic diagram of the validation procedure. (B) Detection of the RNA form of RPL28 by mass spectrometry, as reported in (1). The sequence colored in red shows the peptides detected by liquid chromatography–tandem mass spectrometry, and the underlined sequence corresponds to the extended 55 amino acids that resulted from the T>A RDD and subsequent loss of a stop codon. The representative mass spectra are shown. (C) Dot blot confirms that the RPL28 antibody detects the RPL28 peptide (extended peptide in the RNA form) specifically, but not the negative control peptide. (D) Western blot shows that the specific antibody detects RPL28 protein from human B cells. (Lane 1) Input lysate for the mass spectrometry analysis (pooled sample). (Lane 2) Lysate from one individual.

In summary, we reported the unexpected finding that there are many sites where RNA sequences differ from the underlying DNA sequences. In response to the Comments, we showed that clustering of RDDs and insertions/deletions that accompany RDDs explain why RDD and A-to-G editing sites are found more often at the ends of sequencing reads than expected. We also carried out read-depth analysis, realignment with BLAT, and examination of individual examples to show that paralogs and copy number variations do not contribute considerably to false RDD identifications. We also provided additional Sanger and next-generation sequencing data, as well as immunoblot results, to validate the RDD sites. With results from these analyses and recent reports from other groups (24), we are certain that there are many more mismatches between RNA and their underlying DNA sequences than previously known. Next, it will be important to uncover the mechanisms that lead to RDDs and study their functional impact.

Supporting Online Material




View Abstract


Navigate This Article