Research Article

Genetic regulatory variation in populations informs transcriptome analysis in rare disease

See allHide authors and affiliations

Science  18 Oct 2019:
Vol. 366, Issue 6463, pp. 351-356
DOI: 10.1126/science.aay0256

A statistical model to find disease genes

Genetic variation is high among individuals, which makes it difficult to identify any one specific pathogenetic variant in patients with idiopathic disease, especially those that are in noncoding regions of the genome. Examining tissue-specific and population-level RNA sequencing data, Mohammadi et al. developed a statistical test, analysis of expression variation (ANEVA), that can quantify how one individual's gene expression fits in the context of the variation within the general population. By applying ANEVA to a dosage outlier test, the authors identified pathogenic gene transcripts in patients with Mendelian muscle dystrophy.

Science, this issue p. 351

Abstract

Transcriptome data can facilitate the interpretation of the effects of rare genetic variants. Here, we introduce ANEVA (analysis of expression variation) to quantify genetic variation in gene dosage from allelic expression (AE) data in a population. Application of ANEVA to the Genotype-Tissues Expression (GTEx) data showed that this variance estimate is robust and correlated with selective constraint in a gene. Using these variance estimates in a dosage outlier test (ANEVA-DOT) applied to AE data from 70 Mendelian muscular disease patients showed accuracy in detecting genes with pathogenic variants in previously resolved cases and led to one confirmed and several potential new diagnoses. Using our reference estimates from GTEx data, ANEVA-DOT can be incorporated in rare disease diagnostic pipelines to use RNA-sequencing data more effectively.

Large reference databases of human exomes and genomes have enabled the characterization of genomic variation in human populations (13). These data have been used to summarize genic intolerance to damaging variants, where depletion of gene-disrupting variants (e.g., stop-gain variants) indicates deleterious fitness consequences (1, 4, 5). Such analyses are essential for prioritizing rare and de novo coding variants that can underlie Mendelian disease and provide a genetic diagnosis for 25 to 50% of patients (6, 7). However, despite advances in DNA sequencing, the search for rare disease-causing variants outside the coding sequence has been hindered by the difficulty of interpreting rare regulatory variants and identifying their target genes.

Integration of genome- and transcriptome-sequencing data has provided improved diagnosis by better detection of rare variants with functional effects (6, 810). However, the often laborious analysis is further complicated by the transcriptome being affected by the environment, disease state, and technical variation. This has made it challenging to quantify when an effect is genetic and beyond the normal population range. Thus, most analyses have been limited to only a small fraction of variants that induce clear alterations in the transcriptome, such as total loss of expression and splice defects.

One promising data type is the allelic expression (AE), which measures the relative expression of the paternal and maternal haplotype of a gene in an individual. Departure from equal AE, allelic imbalance, is largely unaffected by environmental and technical factors with a reported heritability of 85% (11), and therefore has a higher sensitivity to capture cis-acting genetic effects, including those induced by rare variants (6, 1215). However, a quantitative framework for interpreting this data type to identify rare pathogenic variants has been lacking.

Here, we quantify the effects of genetic regulatory variation in populations using a mechanistic model of cis-regulatory variation. Specifically, for each gene, we estimate VG, the expected variance in the dosage that is due to interindividual genetic differences within a population. Next, we use VG as a reference to identify genes affected by potentially pathogenic regulatory variants in patients.

Results

Generative model for population allelic expression data and the ANEVA method

Cis-regulatory variant effect sizes can be quantified with allelic fold change (aFC) (16). aFC has an analytical link to gene dosage, which would allow calculation of VG if all regulatory variants were known (supplementary materials, Eqs. 1 to 7). In practice, we can use AE data to estimate the overall distribution of regulatory effects on a gene without having to identify these variants explicitly. Across individuals, AE data represent a series of comparisons between the net expression effects of all variants on two random haplotypes at a time. A major complication for applications of AE data is that, within a population, genes have diverse patterns depending on the properties of regulatory variants present and the single-nucleotide polymorphism (SNP) used to measure the allelic expression (aeSNP; Fig. 1, A to D) (14, 15). We derived a generative model for population AE data under a realistic scenario in which a gene is regulated by several regulatory variants, of which only some are identifiable. Under this assumption, population AE data are described by a constrained mixture of binomial-logit-normal (BLN) probability distribution functions (supplementary materials, Eqs. 8 to 19). We fit this model to population AE data (supplementary materials, Eqs. 20 to 28, and Fig. 1, E to H) and use the maximum likelihood parameters to estimate VG indirectly (supplementary materials, Eqs. 29 and 30). We refer to this method as analysis of expression variation (ANEVA). Simulations show that the inferred VG is accurate (R2 = 0.92; fig. S1). Thus, ANEVA allows biologically interpretable estimates of genetic variation in gene expression within a population to be derived from AE read count data.

Fig. 1 Cis-regulatory variation, allelic expression, and ANEVA.

(A to D) Examples of allelic expression across individuals (dots) for four genes with a single aeSNP each. In (A), similar haplotype expression levels for the gene indicate little cis-regulatory variation. In (B) to (D), there is relatively more variation. In (C) and (D), there are distinct clusters driven by different haplotype combinations of a common, strong regulatory variant and the aeSNP, with strong linkage disequilibrium in (D). These examples illustrate the challenge of consistently modeling the underlying regulatory variants. Ref., reference; Alt., alternative. (E) Schematic representation of ANEVA, which uses a generative model of population AE data and a mechanistic model of cis-regulatory variation to estimate the magnitude of genetic variation in expression for each gene. FC, fold change. (F to H) Generative model of population AE data represented mechanistically (F), in population AE data (G), and as a Bayesian plate diagram (H) (supplementary materials, Eqs. 20 to 22). AE data are modeled with one distinctly strong regulatory biallelic variant. If present, this variant is specified by its effect size, sH,L, and a measure of linkage disequilibrium (LD) with the aeSNP. Residual cis-regulatory variation is modeled as an infinite-allelic regulatory variant summarized by variance term σr2. Allelic expressions eR and eA are measured at a heterozygous aeSNP with reference (R) and alternative (A) alleles, and s*R,A is the aeSNP reference allele alignment bias. Haplotypes h1 and h2, basal expression level eB, and N cis-regulatory variant sites v1vN, are components of our complete formal model of cis-regulatory variation.

ANEVA estimates from AE data are consistent with eQTL data and heritability of gene expression

We applied ANEVA to 10,361 RNA-sequencing (RNA-seq) samples from 48 tissues and 620 individuals with whole-genome sequencing (WGS) data from the Genotype-Tissues Expression (GTEx) version 7 data (17, 18). Overall, we estimated VG at a median of 43,219 autosomal aeSNPs per tissue. Gene-level VG was derived as a weighted harmonic mean of SNP-level estimates for a median of 4962 genes per tissue and a total of 14,084 genes (Fig. 2, A and B, and table S1). First, we ensured that our AE-derived estimates were consistent with what is expected from expression quantitative trait locus (eQTL) data [median correlation = 0.73; Fig. 2C, fig. S2, and table S2]. Next, we benchmarked ANEVA estimates against gene expression cis heritability (h2). For GTEx whole blood, we calculated the ratio of AE- and eQTL-derived VG to the total variance of gene expression. These ANEVA-based h2 estimates were consistent and comparable to those from studies using standard methods and with larger datasets, confirming that VG measures the genetic variation in gene expression (Fig. 2D and fig. S3). Because AE-based ANEVA VG estimates are more applicable to AE-based outlier detection, we used these estimates for all subsequent analyses (fig. S14) (19).

Fig. 2 Estimates of genetic regulatory variation in GTEx.

(A) Number of genes with VG estimates across one to 49 GTEX tissues. Min., minimum. (B and C) Distribution of SDG, VG, for 7556 genes in GTEx subcutaneous adipose (B) and its comparison with eQTL data [(C); correlation (corr.) = 0.71]. The red line is Deming regression fit (fig. S2). SDG is capped at 0.5 for visualization. (D) Benchmarking of ANEVA by gene expression heritability (h2) estimates. GTEx h2 was calculated by the linear mixed-model based BSLMM, PrediXcan R2, and ANEVA (19). These were compared with two larger cohorts: BLSMM h2 from the Depression Genes and Networks cohort (DGN) cohort [n =922 (32)] and local identity-by-descent (IBD)–based h2 from the Iceland family blood (IFB) cohort [n = 722 (33)].

Genetically driven variation in gene expression across tissues, populations, and gene sets

Next, we analyzed how VG varies between tissues and populations. The estimates were well correlated between tissues (median correlation = 0.57; Fig. 3A). For a given gene, VG tends to be smaller in tissues where the gene is more highly expressed (Wilcoxon signed-rank test p < 10−300; Fig. 3B). Because this was not an artifact of differences in read depth (fig. S4), it suggests that there is an increased dosage sensitivity and a higher selective constraint in tissues where the gene has a more pronounced functional role (see fig. S5 for an example). To analyze population differences in VG, we used ANEVA on AE data from three European and one African subpopulation from the GEUVADIS data (20). We found a high correlation between estimates from all subpopulations (correlation range: [0.75, 0.83]; fig. S6). This suggests that the total amount of genetic dosage variation is not highly variable between populations, and approaches that aggregate genetic effects at the gene level may have better applicability across populations than analyses of individual variants.

Fig. 3 Biological sources of regulatory variation between genes.

(A) Correlation (corr.) of genetic regulatory variation across GTEx tissues (see table S1 for tissue names). (B) Rank correlation between median expression in a tissue and VG for 9158 genes with VG estimates in at least five tissues. The distribution is shifted (median rank correlation –0.20). Significant genes are shown in red (5% FDR). (C) Rank correlation of V¯G with enhancer size, coding constraint (RVIS, pLI), noncoding constraint (ncRVIS), and noncoding conservation (ncGERP) in UTRs and promoters. (D) V¯G for different gene sets (DD, developmental disorder; CHD, congenital heart disease; MDM, congenital muscular dystrophies and myopathies; table S3), with nominal p-values from rank-sum test compared with the background of all genes (p ≤ 0.01 highlighted) and the number of genes in parentheses. Boxes span the middle 50% values and the whiskers span ±1.5 interquartile range (IQR) from first and the third quartile.

To characterize differences in the amount of genetic regulatory variation between genes, we correlated VG to statistics of gene regulation and constraint. For each gene, we calculated a weighted harmonic mean of VG across tissues (V¯G; table S1). Gene enhancer size had a minimal correlation to V¯G (Fig. 3C) (21), suggesting that the size of the mutational target, a proxy for the background mutation rate, plays a minor role. Genes with high purifying selection for coding gene-disrupting variants or noncoding variants in the promoter or untranslated regions (UTRs) were depleted of genetic regulatory variation (Fig. 3C), as previously observed by eQTL analysis (1). Rare disease genes had lower V¯G, whereas loss-of-function tolerant genes had higher V¯G (Fig. 3D), showing that dosage sensitivity is captured by both exome and regulatory variation analysis. Genes identified by genome-wide association studies (GWAS) showed little deviation from the background, but schizophrenia genes had the lowest VG and blood metabolite genes the highest, suggesting a link to genetic architecture of these traits. The amount of genetic regulation variation measured as VG can complement previous coding and regulatory variation analyses of selective constraint on genes and traits.

Genetically driven variation in gene expression and dosage outlier testing from AE data

In addition to these biological insights, VG has a direct practical application in identifying population outliers that may be pathogenic. To this end, we developed the ANEVA dosage outlier test (ANEVA-DOT) to identify genes likely affected by a heterozygous genetic variant with an unusually strong effect on gene dosage. Using VG for each gene, ANEVA-DOT tests against the null hypothesis that the observed allelic imbalance in an individual is consistent with dosage variation in the general population (Fig. 4A) while accounting for a number of additional technical and biological sources of variation (supplementary materials, Eqs. 31 to 42). We used extensive simulations to ensure that the test is well calibrated (fig. S7). ANEVA-DOT is implemented in R and runs in a few seconds per sample (22).

Fig. 4 Regulatory outliers detected with the ANEVA-DOT.

(A) Illustration of the ANEVA-DOT method. For each gene, the null distribution of allelic imbalance is estimated using the VG and the model of cis-regulatory genetic effect. Allelic counts in a test individual are compared with this null, accounting for sampling noise, sequencing noise, reference bias, and the variant haplotype. (B and C) Enrichment of all rare variants in ANEVA-DOT genes as a function of allele frequency (B) and for putative gene-disrupting variants [minor allele frequency < 1% (C)]. (D to H) Example of AE data for all genes from one previously diagnosed muscle dystrophy patient (N13). The disease gene is shown in blue. Outlier genes identified by different tests (5% FDR) are marked in red: binomial [n = 387 (D)], binomial with a 15% allelic imbalance threshold [Bin-Thr, n = 246 (E)], beta-binomial [Beta-Bin, n = 83 (F)], excess allelic imbalance against GTEx data by z-test [AI z-test, n = 94 (G)], and ANEVA-DOT [n = 15 (H)]. Genes marked in gray are excluded from each test. (I) Fraction of true causal genes identified in previously diagnosed patients (recall) and its 95% bootstrap confidence intervals versus the number of outliers reported. Empirical recall (left) is calculated using all cases in which imbalanced AE would be expected (n = 21), whereas idealized recall (right) excludes five cases in which detecting the gene from AE data is impossible (e.g., when the causal gene is not expressed; fig. S12).

We first tested ANEVA-DOT in the general population of 466 skeletal muscle samples from GTEx. Each sample had a median of 3390 genes tested and 10 genes identified as outliers at a 5% false discovery rate (FDR) (hereafter referred to as ANEVA-DOT genes; 90% range: [3, 22]). An average of 56% of the genes previously implicated in neuromuscular disorders (6, 23) and up to 46% of the highly expressed genes were testable per individual (fig. S8). As a quality filter, 113 out of 5848 tested genes that appeared as outliers in >1% of the individuals were excluded from further analysis (fig. S8, D to F, and table S4). After this step, a median of 4.5 ANEVA-DOT genes were retained per individual (90% range: [1, 14]; Fig. 4B). ANEVA-DOT genes were highly enriched for rare heterozygous variants in a 10-kb window upstream of the transcriptional start site and in the gene body (Fig. 4B). This enrichment was particularly pronounced for rare putative gene-disrupting variants that are expected to have a strong effect on gene expression levels by nonsense-mediated decay (Fig. 4, B and C). This confirms that ANEVA-DOT captures rare genetic effects on gene dosage.

Next, we evaluated how sensitive ANEVA-DOT is to differences in the reference population where VG is calculated. First, using the GEUVADIS data (20), we looked for ANEVA-DOT genes in 86 European (GBR) individuals using VG estimates derived from two European populations (FIN and TSI) and one African population (YRI). The three reference populations performed similarly, with an average of 74% (range 69 to 78%) of ANEVA-DOT genes identified using one confirmed by another (fig. S9), suggesting that the lack of full concordance is likely driven by noise and threshold effects. However, larger sample sizes will be needed for a comprehensive evaluation of the population effects. Next, we investigated whether ANEVA-DOT genes in GTEx skeletal muscle could be identified by analyzing other accessible tissues of these individuals. The detection rate varied from 23.3% in fibroblast to 12.3% in whole blood, which indicates that ANEVA-DOT can capture some outlier effects also from proxy tissues (fig. S10).

ANEVA-DOT accurately identifies disease genes in AE data from rare disease patients

To test ANEVA-DOT’s performance in the diagnosis of rare disease patients, we applied it to AE data from 70 rare Mendelian muscle dystrophy and myopathy (MDM) patients using the VG reference from GTEx skeletal muscle (figs. S11 to S17, and table S5). Of the 65 patients with high-quality data, 32 had a previous diagnosis, of which 21 were expected to lead to allelic imbalance (6). These cases were used as positive controls to benchmark ANEVA-DOT against previous tests of allelic imbalance: binomial and beta-binomial tests, binomial test with an allelic imbalance threshold, and a naïve population-aware test of excess allelic imbalance against GTEx data using the z-test (Fig. 4, D to H, and fig. S12). ANEVA-DOT identified a median of 11 outlier genes per individual (out of a median of 2190 tested), substantially fewer than other tests, (Fig. 4H). This small number of outliers always included the previously diagnosed gene when there was a detectable allelic imbalance present (76%; figs. S11 and S12), typically (69%) among the top five most significant genes (table S5). ANEVA-DOT’s high recall and precision outperformed all of the other tests by a substantial margin (Fig. 4I and figs. S12 to S14) (19).

In the 33 patients without a genetic diagnosis from previous whole-exome sequencing (WES) and/or WGS or RNA-seq analysis (6), we found a median of nine ANEVA-DOT genes per sample (in total 349 genes), which included at least one neuromuscular disease gene (6, 23) in 12 patients (in total 17 genes; figs. S15 and S16). One of these potential new diagnoses from ANEVA-DOT was confirmed: Patient N10, with a limb-girdle muscular dystrophy–like phenotype, had 13 ANEVA-DOT genes, with the one known Mendelian muscle disease gene, DES, being the most statistically significant. Further RNA-seq and reverse transcription polymerase chain reaction analysis identified a pseudo-exon insertion caused by a variant creating an intronic splice site. This had been missed by the prior gene panel, WES, WGS, and RNA-seq analysis because of challenging in silico interpretation of intronic variants and the relatively low number of RNA-seq reads. The variant is in trans with a pathogenic missense variant that had not been identified as a diagnosis because of the lack of a second variant (fig. S18). Additionally, ANEVA-DOT identified strong candidates in six cases and possible candidates in 11 others (19). By design, ANEVA-DOT does not rely on identifying which variant underlies the dosage-outlier effect, but genetic analysis can be applied after prioritizing genes by ANEVA-DOT. This is currently mostly limited to gene- or splice-disrupting variants because of their easier annotation compared with rare regulatory variant candidates that may also exist. Overall, we expect up to 10.5 of the 17 known MDM and 18.8 of all 349 identified ANEVA-DOT genes in the 33 undiagnosed patients to be true disrupted causative genes (19).

Discussion

In this study, we introduce a method, ANEVA, and its extension, ANEVA-DOT, to quantify genetic variation in gene dosage in the general population, and to identify genes where a patient appears to carry a heterozygous variant with an unusually strong effect on gene expression. This enables individual transcriptome comparison to previously generated reference data without the caveats of technical and reverse causation noise in total gene expression analysis.

The ANEVA framework uses biologically interpretable units of gene dosage, allowing interpretation of regulatory and coding gene-disrupting variants on the same scale. Furthermore, the statistical methods introduced here for modeling allelic expression data are applicable to other uses of this data type.

ANEVA-DOT is a fast and powerful approach for finding genes with likely disease effects, with the small numbers of outliers making further manual curation feasible in a clinical setting without compromising sensitivity. The use of VG estimates from GTEx as a shared reference for ANEVA-DOT analysis of patients is analogous to the use of coding constraint metrics for prioritization of pathogenic coding variants. ANEVA-DOT outlier genes can be further prioritized by candidate gene lists and by tools that are currently used in exome sequencing follow-up (1, 2, 5, 24, 25). Because ANEVA-DOT captures transcriptome outcomes of genetic effects without having to identify rare regulatory variants themselves, this method is particularly advantageous for rare genetic effects from poorly defined regulatory elements, but it will also detect, for example, variants triggering transcript decay. However, identifying the specific variants underlying ANEVA-DOT outliers is still challenging despite existing variant prioritization approaches, especially for noncoding regions (2628).

Despite these advantages, our methods have several limitations. The main caveat is that AE data are sparse, and VG estimates may be lacking or noisy for genes with few common coding variants owing to small size, high coding constraint, or low expression levels. These issues will, however, improve with increasingly large RNA-seq datasets. ANEVA-DOT is only applicable to about half of expressed genes per individual that have an aeSNP. Finally, allelic imbalance is not informative of recessive effects without family analysis. Thus, like other genetic diagnosis tools, ANEVA-DOT should be used in conjunction with other methods to capture different types of rare variants underlying disease. We envision that in clinical genetics, when practically feasible, transcriptome data will become a powerful additional layer of data for interpreting the genome and its disease-contributing variants.

Supplementary Materials

science.sciencemag.org/content/366/6463/351/suppl/DC1

Material and Methods

Supplementary Text

Tables S1 to S5

Data S1

Figs. S1 to S18

References (3447)

References and Notes

  1. See the supplementary materials.
Acknowledgments: We thank the GTEx consortium, especially F. Aguet, N. Ferraro, and S. Montgomery. Funding: This work was funded by the following NIH grants: UL1TR002550 (P. Mohammadi), UL1TR001114 (P. Mohammadi), K99HG009916 (S.E.C.), R01MH106842 (T.L., P.H.), UM1HG008901 (T.L., J.E.), R01GM122924 (T.L.), R01MH107666 (H.K.I. and H.E.W.), P30DK020595 (H.K.I.), R15HG009569 (H.E.W.), and UM1 HG008900 (D.G.M.), as well as intramural funds from the NIH (C.G.B., A.R.F., P. Mohassel, S.D.). Author contributions: P. Mohammadi and T.L. designed the study. P. Mohammadi developed all statistical models. P. Mohammadi, S.E.C., B.C., and J.E. analyzed the data. P. Mohammadi, C.S., and P.H. developed software tools. S.D., Z.J., P. Mohassel, A.R.F., H.E.W., H.K.I., C.G.B., and D.G.M. provided data and materials. P. Mohammadi and T.L. wrote the paper with input from all authors. Competing interests: D.G.M. is a founder with equity in and T.L. is an adviser for Goldfinch Bio. S.E.C. is a cofounder, chief technology officer, and stock owner for Variant Bio. T.L. is an adviser with equity in Variant Bio. H.K.I. has received honoraria from GSK and AbbVie. The other authors have no competing interests. Data and materials availability: Software packages for the presented methods are available online as follows: BLN distribution functions (29), ANEVA (30), and ANEVA-DOT (22). Outlier summary statistics for all GTEx tissues are available in (31). The GTEx v7 data and MDM cohort data are available in dbGaP (phs000424.v7.p2 and phs000655.v3.p1, respectively). Gene-level AE data and ANEVA-DOT results in the MDM cohort are provided in supplementary data S1.

Stay Connected to Science

Navigate This Article