Report

Effect of predicted protein-truncating genetic variants on the human transcriptome

See allHide authors and affiliations

Science  08 May 2015:
Vol. 348, Issue 6235, pp. 666-669
DOI: 10.1126/science.1261877

Expression, genetic variation, and tissues

Human genomes show extensive genetic variation across individuals, but we have only just started documenting the effects of this variation on the regulation of gene expression. Furthermore, only a few tissues have been examined per genetic variant. In order to examine how genetic expression varies among tissues within individuals, the Genotype-Tissue Expression (GTEx) Consortium collected 1641 postmortem samples covering 54 body sites from 175 individuals. They identified quantitative genetic traits that affect gene expression and determined which of these exhibit tissue-specific expression patterns. Melé et al. measured how transcription varies among tissues, and Rivas et al. looked at how truncated protein variants affect expression across tissues.

Science, this issue p. 648, p. 660, p. 666; see also p. 640

Abstract

Accurate prediction of the functional effect of genetic variation is critical for clinical genome interpretation. We systematically characterized the transcriptome effects of protein-truncating variants, a class of variants expected to have profound effects on gene function, using data from the Genotype-Tissue Expression (GTEx) and Geuvadis projects. We quantitated tissue-specific and positional effects on nonsense-mediated transcript decay and present an improved predictive model for this decay. We directly measured the effect of variants both proximal and distal to splice junctions. Furthermore, we found that robustness to heterozygous gene inactivation is not due to dosage compensation. Our results illustrate the value of transcriptome data in the functional interpretation of genetic variants.

Genetic variants predicted to shorten the coding sequence of genes—termed protein-truncating variants (PTVs)—are typically expected to have large effects on gene function. These variants are enriched for disease-causing mutations (1, 2), but some may be protective against disease (3). However, PTVs are abundant in the genomes of healthy individuals (4), indicating that they often do not have major phenotypic consequences. In addition, although PTVs are often described as loss-of-function (LOF) variants, in most cases their precise molecular effect has not been characterized and in other cases show gain-of-function effects (1). Clinical interpretation of PTVs will thus require direct characterization of their biochemical effects.

We cataloged predicted PTVs and their transcriptomic effect in 462 healthy individuals with DNA and mRNA sequencing (RNA-seq) from lymphoblastoid cell lines (LCLs) in the Geuvadis study (5, 6) and 173 individuals with exome sequencing and RNA-seq from a total of 1634 samples from multiple tissues in the Genotype-Tissue Expression (GTEx) study [supplementary materials section S1 (SM S1)] (7, 8). Each GTEx individual has RNA-seq data from 1 to 30 tissues, with 9 tissues having >80 samples. We defined PTVs (4) (table S1) as single-nucleotide variants (SNVs) predicted to introduce a premature stop codon or to disrupt a splice site, small insertions or deletions (indels) predicted to disrupt a transcript’s reading frame, and larger deletions that remove the full protein coding sequence (CDS) (SM S2, Fig. 1, and figs. S1 and S2). We identified 13,182 candidate PTVs using phase 1 data of the 1000 Genomes Project (9) of the 421 individuals included in the Geuvadis RNA-seq project, as well as 4584 candidate PTVs in the GTEx data, for a combined total of 16,286 candidate variants (table S2).

Fig. 1 Schematic overview of the study.

We prepared an integrated DNA and RNA sequencing data set by combining the pilot phase of the GTEx project of 173 individuals with up to 30 tissues per individual (total of 1634 samples) and the Geuvadis project of LCL DNA and RNA sequencing in 462 individuals. From these data, we analyzed the effect of predicted protein-truncating genetic variants on the human transcriptome, including (A) nonsense SNVs, (B) frameshift indels, (C) large deletion variants, and (D) splice-disrupting SNVs.

We measured total gene expression levels in reads per kilobase of exon per million mapped reads, allele-specific expression (ASE) detecting different expression levels of two haplotypes of an individual, and split mappings across annotated exon junctions to quantify splicing (SM S3 and S4). Transcripts containing common PTVs are more weakly expressed and more tissue-specific than transcripts that do not contain common PTVs (SM S5 and figs. S3 to S7), consistent with previous work (4).

PTVs that generate premature stop codons may trigger nonsense-mediated decay (NMD). Such variants are often recessive and may protect against detrimental phenotypic effects but also may cause disease via haploinsufficiency (1). Variants that escape NMD may create a truncated protein with dominant-negative or gain-of-function effects (1). We compared transcript levels between the PTV and the non-PTV alleles within the same individual (SM S6) (4, 5, 10) for a total of 1814 PTVs (SM S6, figs. S8 to S12, and table S3) and validated the allelic ratios obtained from RNA-seq data (figs. S13 to S18 and table S4) (11). We also generated a method to assess the ASE effect of frameshift indels (SM S6 and figs. S8 to S12), which were not previously examined (5, 10) due to the technical challenges of mapping bias (1214).

Allelic count data were analyzed with a Bayesian statistical method to address whether a variant exhibits ASE in a given tissue and whether this signal is shared across multiple tissues of the same individual (SM S7 and figs. S19 to S26) (15). We observe a higher proportion of strong or moderate allelic imbalance in rare and singleton nonsense SNVs compared with common nonsense variants (54.3%, 55.4%, and 35.7%, respectively), suggesting that rare PTVs are more likely to trigger NMD (fig. S19).

Rare nonsense SNVs predicted to trigger NMD according to the 50-bp (base pair) rule (SM S7) (16) have a larger proportion of ASE than SNVs that escape NMD (69.5% versus 31.9%, respectively), and both classes demonstrate ASE more often than synonymous variants (7.9%, P < 0.001 across all comparisons, two-proportion z test) (Fig. 2A). A higher proportion of ASE is also observed for frameshift indels predicted to trigger NMD (52.1%) compared with those predicted to escape NMD (30.6%) and at higher levels than that predicted for in-frame indels (18.4%) (Fig. 2B). Testing alternative simple distance rules showed that the 50-bp rule has the highest predictive value (Fig. 2C).

Fig. 2 Allele-specific expression analysis.

(A) Proportion of rare SNVs with allele-specific expression (ASE) for synonymous variants (n = 25,233) and nonsense variants predicted to escape (n = 158) or trigger (n = 287) NMD. (B) Proportion of rare indels with ASE for in-frame (n = 355) and frameshift indel variants predicted to escape (n = 77) or trigger (n = 129) NMD. Due to different quality filters, the proportions are not directly comparable to those in (A). (C) Receiver operating characteristic curve for predicting NMD with binary classification defined as no ASE (escape) and moderate, strong, or heterogeneous ASE (trigger). The filled circles show the specificity and sensitivity for NMD prediction with alternative simple distance rules (inset). (D) Multitissue ASE classification for rare nonsense variants predicted to trigger NMD (n = 287). (E) Example of ASE data across six tissues for a heterozygous carrier of the nonsense variant rs149244943 in gene PHKB (phosphorylase kinase, beta) classified as having heterogeneous ASE effects across the six tissues. We confirmed that this effect is not driven by a common tissue-specific expression quantitative trait locus. (F) Example of ASE data across 16 tissues for a heterozygous carrier of the nonsense variant rs119455955, a disease mutation for recessive late-infantile neuronal ceroid lipofuscinosis in gene TPP1 (tripeptidyl peptidase I), classified as having moderate ASE across all tissues. For all plots, 95% CIs are shown.

We next generated an improved predictive model for no ASE versus strong/moderate ASE for all nonsense SNVs (SM S7). Our model predicts NMD better than the 50-bp rule, with an area under the curve (AUC) of 80.8% [95% confidence interval (CI) 77.3 to 84.4%] compared to a 50-bp rule AUC of 72.9% (69.3 to 76.5% CI) (Fig. 2C and figs. S21 and S22). Our results provide a quantitative estimate of the value of NMD predictions and illustrate that the 50-bp rule (16) remains a valuable heuristic. Nonetheless, our model improves NMD prediction, allows a more flexible analysis of the probability that a variant will trigger NMD from variant data (fig. S21), and provides data for understanding the molecular mechanisms of NMD (fig. S22).

The GTEx study design allows us to study variation in NMD across tissues. We applied a Bayesian hierarchical model (SM S7) (15) to rare nonsense variants predicted to trigger NMD, according to the 50-bp rule, with ASE data from at least two tissues. We estimate that 30.5% of these nonsense variants have no ASE in any tissue, and 48.3% and 3.3% have moderate or strong ASE across all tissues, respectively. Finally, 17.9% have heterogeneous effects across tissues, and 8.1% of ASE effects are specific to a single tissue (Fig. 2, D to F, and figs. S23 to S26). The tissue specificity of NMD implies that the same PTV may have different effects across tissues, which could contribute to tissue-specific effects of disease-causing mutations (17).

We examined whether heterozygous carriers of PTVs exhibit compensatory up-regulation of the functional allele, which could contribute to tolerance of PTVs and partially explain the widespread haplosufficiency of human genes (18). Dosage compensation has been reported to correlate with gene expression levels (19) and occur in over 80% of deleted genes in Drosophila melanogaster (20). To minimize the effect of genotyping error, we focused only on biallelic whole-gene deletions with strong experimental support and manual curation (SM S2 and figs. S27 to S29). We first analyzed the few examples of common whole-gene deletion polymorphisms (SM S8). For 5/6 of these genes, an additive model relating gene expression to gene copy number provided a better fit than a dominant model, providing no evidence for dosage compensation (table S6). Additionally, heterozygous carriers of rare deletions also had consistently decreased expression of the respective gene compared with the population median (P = 1.37 × 10−5, one-sided binomial test of 11 rare PTV deletions in 25 genes) (figs. S27 and S28). Similar results were obtained for 53 nonsense PTVs with strong ASE signals (P = 2.90 × 10−9, one-sided binomial test) (figs. S30 and S31). These results suggest that full dosage compensation is rare for human genes.

Disruption of splicing can result in changes in protein structure either via in-frame changes in exon structure or by introducing a premature stop codon (21). Splicing variant annotation tools typically focus only on the two bases at either end of a spliced intron, “essential splice sites” (22), despite the fact that more distant sites are also known to affect splicing (21, 23, 24).

Variation around splice junctions tends to be rare (minor allele frequencies ≤ 0.01). We standardized the population distribution of each splice-junction quantification per tissue and grouped variants by their distance from their respective donor and acceptor sites. We then analyzed whether individuals carrying variants in these positions differ from the population in the quantification of the splice junction and the proximal exon and intron (Fig. 3, A to D).

Fig. 3 Splicing disruption.

(A) Proportion of variants disrupting splicing at each distance ±25 bp from donor and acceptor site (*P < 0.05, **P < 0.01, ***P < 0.001; green for P < 0.05; upper limit of 95% CI is shown; P value evaluated using the estimated proportion of variants supporting the alternative distribution times the effect size of the alternative distribution). (B) Classification of splice disruption events: exon skipping (low exon quantification value, no effect on intron quantification), exon elongation (high intron quantification value, no effect on exon quantification), and mixture (high intron and low exon quantification values). (C) Diagram of donor and acceptor splice junctions and sequence logo of represented sequences. (D) Effect size estimates (in standard deviations from the population distribution; 95% CI is shown) of the variants on splice junction quantification value. (E) Median genomic evolutionary rate profiling score (GERP) of all variants. (F) Distribution of common variants identified in an independent exome sequencing study of 4500 Swedish individuals. (G) Distribution of reported disease-causing variants in ClinVar.

In the Geuvadis data set, up to 79% of variants in the four essential splice-site loci cause splice disruptions (P < 0.01, Fig. 3A; GTEx results, figs. S33 to S37). We also find evidence of splice disruption from variants outside these regions, especially at positions 1 to 5 bp downstream of intronic donor splice sites, 1 bp into the adjacent exon, and also more distally—including the −24 position from the acceptor site, which likely reflects the branch-point position required for pre-mRNA splicing (25).

These patterns are consistent with other estimates of functional effects (Fig. 3E), depletion of common variants in exome sequencing data sets (Fig. 3F) (26), and a higher prevalence of disease-causing mutations (Fig. 3G). Analyses of common variants did not capture these patterns of enrichment (table S7, figs. S38 to S40, and SM S9). Our posterior probability estimates for sites with significant alternative distributions (P < 0.05) provide a resource for analyses (figs. S41 and 42 and SM S9 and S10).

By drawing on data from a wide range of adult tissues across 635 individuals, we provide a systematic assessment of the effect of predicted PTVs on the human transcriptome. Furthermore, this study indicates that nonsense-mediated decay has heterogeneous effects across tissues and also shows how to better detect splice-disrupting variants outside the “essential” sites at the splice junction.

We find no evidence for widespread dosage compensation maintaining normal expression levels of genes affected by heterozygous PTVs. This, together with the fact that most human genes are haplosufficient (18), suggests that homeostatic mechanisms at the cellular level, possibly as proposed in the theory of dominance (27), maintain biological function in the face of heterozygous, or even homozygous (4), inactivation of human genes.

The resource made available with this study provides a starting point for cataloging variants affecting protein function, but larger data sets will be required to increase our power to predict molecular consequences of variants from sequence data alone. These results highlight the benefits of direct RNA sequencing of either patient tissue or genetically engineered cell lines for interpretation of genetic variation and suggest that personal transcriptomics will become an important complement to genome analysis.

Supplementary Materials

www.sciencemag.org/content/348/6235/666/suppl/DC1

Materials and Methods

Figs. S1 to S42

Tables S1 to S7

Data File S1

References (2856)

References and Notes

  1. Acknowledgments: We thank all the members of the GTEx and Geuvadis consortia and L. Solomon for assistance with the figures. This work was supported by the National Institutes of Health (NIGMS R01GM104371 to D.G.M.; NIMH R01MH101814 to E.T.D, C.D.B., S.B.M., R.G., T.L., and M.I.M.; R01MH090941 to E.T.D. and M.I.M.; U01HG007593 to J.B.L. and S.B.M.; and R01MH101810 to D.F.C.); Academy of Finland (257654 to M.P.); a Hewlett-Packard Stanford Graduate Fellowship and a doctoral fellowship from the Natural Science and Engineering Research Council of Canada to E.K.T.; a National Defense Science and Engineering Graduate Fellowship (NDSEG) from the United States Department of Defense (DoD) to K.R.K.; European Research Council, Swiss National Science Foundation, and Louis-Jeantet Foundation to E.T.D.; Wellcome Trust (095552/Z/11/Z and 090532/Z/09/Z to P.D. and 098381 to M.I.M.); and a Clarendon Scholarship, NDM Studentship, and Green Templeton College Award from University of Oxford to M.A.R. The Genotype-Tissue Expression (GTEx) project was supported by the Common Fund of the Office of the Director of NIH. Additional funds were enrolled at Biospecimen Source sites funded by NCI/SAIC-Frederick, Inc. (SAIC-F) subcontracts to the National Disease Research Interchange (10XS170), Roswell Park Cancer Institute (10XS171), and Science Care, Inc. (X10S172). The Laboratory, Data Analysis, and Coordinating Center (LDACC) was funded through a contract (HHSN268201000029C) to the Broad Institute, Inc. Biorepository operations were funded through an SAIC-F subcontract to Van Andel Institute (10ST1035). Additional data repository and project management were provided by SAIC-F (HHSN261200800001E). The Brain Bank was supported by a supplement to University of Miami grant DA006227. Statistical methods development grants were made to the University of Geneva (MH090941), the University of Chicago (MH090951 and MH090937), the University of North Carolina–Chapel Hill (MH090936), and Harvard University (MH090948). The primary and processed data used to generate the analyses presented here are available in the following locations: All primary sequence and clinical data files, and any other protected data, are deposited in and available from the database of Genotypes and Phenotypes (www.ncbi.nlm.nih.gov/gap) (phs000424.v3.p1, except for whole-exome sequencing data in phs000424.v5.p1 and mmPCR-seq data and processed ASE data in phs000424.v6.p1); derived analysis files are available on the GTEx Portal (www.gtexportal.org). Biospecimens remaining from the study may be requested for research studies. The sample request form, biospecimen access policy, and material transfer agreement are available on the GTEx Portal (www.gtexportal.org/home/samplesPage). The Geuvadis data are available in ArrayExpress accession E-GEUV-1. Further details and links to data and software are available in www.well.ox.ac.uk/~rivas/ptv2015. C.D.B. is a paid member of the Scientific Advisory Boards of Personalis, InVitae, and Ancestry.com; he is founder and chair of the SAB of Identify Genomics, LLC; he also owns stock options in Personalis, Invitae, and Identify Genomics, LLC.
View Abstract

Navigate This Article