Research Article

Allele-specific epigenome maps reveal sequence-dependent stochastic switching at regulatory loci

See allHide authors and affiliations

Science  28 Sep 2018:
Vol. 361, Issue 6409, eaar3146
DOI: 10.1126/science.aar3146

Dissecting the epigenomic footprint

Genome-wide epigenetic marks regulate gene expression, but the amount and function of variability in these marks are poorly understood. Working with human-derived samples, Onuchic et al. examined disease-associated genetic variation and sequence-dependent allele-specific methylation at gene regulatory loci. Regulatory sequences within individual chromosomal DNA molecules showed full or no methylation at specific sites corresponding to “on” and “off” switches. Interestingly, methylation did not occur on each DNA molecule, resulting in a variable fraction of methylated chromosomes. This stochastic type of gene regulation was more common for rare genetic variants, which may suggest a role in human disease.

Science, this issue p. eaar3146

Structured Abstract

INTRODUCTION

A majority of imbalances in DNA methylation between homologous chromosomes in humans are sequence-dependent; the DNA sequence differences between the two chromosomes cause differences in the methylation state of neighboring cytosines on the same chromosome. The analyses of this sequence-dependent allele-specific methylation (SD-ASM) traditionally involved measurement of average methylation levels across many cells. Detailed understanding of SD-ASM at the single-cell and single-chromosome levels is lacking. This gap in understanding may hide the connection between SD-ASM, ubiquitous stochastic cell-to-cell and chromosome-to-chromosome variation in DNA methylation, and the puzzling and evolutionarily conserved patterns of intermediate methylation at gene regulatory loci.

RATIONALE

Whole-genome bisulfite sequencing (WGBS) provides the ultimate single-chromosome level of resolution and comprehensive whole-genome coverage required to explore SD-ASM. However, the exploration of the link between SD-ASM, stochastic variation in DNA methylation, and gene regulation requires deep coverage by WGBS across tissues and individuals and the context of other epigenomic marks and gene transcription.

RESULTS

We constructed maps of allelic imbalances in DNA methylation, histone marks, and gene transcription in 71 epigenomes from 36 distinct cell and tissue types from 13 donors. Deep (1691-fold) combined WGBS read coverage across 49 methylomes revealed CpG methylation imbalances exceeding 30% differences at 5% of the loci, which is more conservative than previous estimates in the 8 to 10% range; a similar value (8%) is observed in our dataset when we lowered our threshold for detecting allelic imbalance to 20% methylation difference between the two alleles.

Extensive sequence-dependent CpG methylation imbalances were observed at thousands of heterozygous regulatory loci. Stochastic switching, defined as random transitions between fully methylated and unmethylated states of DNA, occurred at thousands of regulatory loci bound by transcription factors (TFs). Our results explain the conservation of intermediate methylation states at regulatory loci by showing that the intermediate methylation reflects the relative frequencies of fully methylated and fully unmethylated epialleles. SD-ASM is explainable by different relative frequencies of methylated and unmethylated epialleles for the two alleles. The differences in epiallele frequency spectra of the alleles at thousands of TF-bound regulatory loci correlated with the differences in alleles’ affinities for TF binding, which suggests a mechanistic explanation for SD-ASM.

We observed an excess of rare variants among those showing SD-ASM, which suggests that an average human genome harbors at least ~200 detrimental rare variants that also show SD-ASM. The methylome’s sensitivity to genetic variation is unevenly distributed across the genome, which is consistent with buffering of housekeeping genes against the effects of random mutations. By contrast, less essential genes with tissue-specific expression patterns show sensitivity, thus providing opportunity for evolutionary innovation through changes in gene regulation.

CONCLUSION

Analysis of allelic epigenome maps provides a unifying model that links sequence-dependent allelic imbalances of the epigenome, stochastic switching at gene regulatory loci, selective buffering of the regulatory circuitry against the effects of random mutations, and disease-associated genetic variation.

SD-ASM is explainable by different frequencies of epialleles.

Genetic variants affect the methylation state of neighboring cytosines on the same chromosome. Every WGBS read provides a readout of both a genetic variant (G or T) at a heterozygous locus and the methylation state of neighboring cytosines (“epiallele”). Epiallele frequencies correlate with the affinity of TF binding at regulatory sites.

Abstract

To assess the impact of genetic variation in regulatory loci on human health, we constructed a high-resolution map of allelic imbalances in DNA methylation, histone marks, and gene transcription in 71 epigenomes from 36 distinct cell and tissue types from 13 donors. Deep whole-genome bisulfite sequencing of 49 methylomes revealed sequence-dependent CpG methylation imbalances at thousands of heterozygous regulatory loci. Such loci are enriched for stochastic switching, which is defined as random transitions between fully methylated and unmethylated states of DNA. The methylation imbalances at thousands of loci are explainable by different relative frequencies of the methylated and unmethylated states for the two alleles. Further analyses provided a unifying model that links sequence-dependent allelic imbalances of the epigenome, stochastic switching at gene regulatory loci, and disease-associated genetic variation.

A majority of imbalances in DNA methylation between homologous chromosomes in humans are associated with genetic variation in cis, in which the genetic variants affect the methylation state of neighboring cytosines on the same chromosome (1, 2). Such sequence-dependent allele-specific methylation (SD-ASM) affects at least 8 to 10% of the autosomal genome (35). SD-ASM is an ideally “controlled” natural experiment that provides information about consequences of genetic variation in cis because both “case” and “control” loci can be found within an individual on homologous chromosomes within the same cellular and nuclear environment.

In contrast to association-based methods—such as expression quantitative trait loci (eQTLs) and methylation quantitative trait loci (mQTLs), used to establish the functional effects of common (>5% minor allele frequency) noncoding variants (6)—allelic imbalances (AIs) can be established by profiling a single sample, revealing the functional effects in cis of both relatively low-risk common variants and highly penetrant disease-causing rare and de novo variants (1, 2, 5, 710).

Although the analyses of SD-ASM traditionally involved measurement of average methylation levels across many cells, the epigenome is known to exhibit stochastic cell-to-cell variation and even variation between the two chromosomes within the same nucleus (11). Whole-genome bisulfite sequencing (WGBS) has the potential to provide insights into stochastic variation at the ultimate single-chromosome level of resolution because it provides information about the genetic variant and methylation of neighboring cytosines within the same sequencing read that comes from a single chromosome. Sequencing-based studies of DNA methylation have revealed pervasive stochastic epigenetic polymorphisms within autosomal loci (12, 13).

Methylation patterns evolve along predictable trajectories during normal development (14), being highly stochastic in early metastable stages and stabilizing within differentiated tissues, which results in mosaicism (15). The methylation patterns are maintained in stem cells through a dynamic stochastic epigenetic switching equilibrium (an ergodic process) that provides both epigenetic buffering of environmental noise and responsiveness to specific transcription factors (TFs) (12). By contrast, more differentiated cells (12) and tumor cells (13, 16) appear to use the more error-prone mechanism of direct replication of CpG methylation patterns, which is associated with clonality and mosaicism. Despite the stochastic nature of DNA methylation changes during development, in response to environmental input and in human diseases, the effects of genetic variation on stochasticity and mosaicism of the epigenome remain unexplored.

Results

Patterns of AIs across epigenomic marks

To explore the effects of genetic variation on the epigenome, the National Institutes of Health (NIH) Roadmap Epigenomics Project (17) has now completed whole-genome sequencing (WGS) on genomes of 13 donors and published NIH Roadmap reference epigenomes from 71 combined samples that collectively represent 27 distinct tissue types and nine cell types (fig. S1). For accurate identification of heterozygous genomic loci, we sequenced the donor genomes (18). Eight assays were included in most of the samples and used for AI detection: WGBS, RNA sequencing (RNA-seq), and chromatin immunoprecipitation sequencing (ChIP-seq) for six different histone marks (H3K4me3, H3K4me1, H3K36me3, H3K27me3, H3K9me3, and H3K27ac) (fig. S1). We performed allele-specific methylation (ASM) analysis at heterozygous single-nucleotide polymorphism (SNP) loci within the 49 WGBS methylomes using a threshold of absolute methylation difference of >30% between alleles and by estimating significance by means of Fisher’s exact test on the counts of methylated and unmethylated cytosines observed on the same sequencing read with each of the two SNP alleles (fig. S2A) (18). We performed the identification of AIs for histone marks and transcription using the AlleleSeq pipeline (fig. S2B) (18, 19).

Considering the AIs in all the marks, the imbalances in DNA methylation were by far the most abundant (table S1 and fig. S3), largely because of the genome-wide distribution of DNA methylation, in contrast to the uneven genomic distribution of other marks. Among the histone marks, H3K27ac had more imbalance calls than others (table S1), in part owing to deeper ChIP-seq coverage for H3K27ac (table S1 and fig. S3). At promoters, H3K27ac and H3K4me3 marks were more abundant on the allele with less DNA methylation (Fig. 1A). Conversely, H3K9me3 signal was more abundant on the allele with more methylation in promoters (Fig. 1A). At enhancers, H3K27ac tended to occur more often on the allele with less DNA methylation (Fig. 1A). We also detected, at high specificity, enrichment of AIs in methylation and coordinated changes in transcription and histone marks within a majority of those imprinted loci that included a heterozygous SNP (figs. S4 and S5) (18).

Fig. 1 AIs vary depending on genomic region.

(A) Number of AIs in histone marks and transcription, overlapping ASM loci, over classes of genomic elements. (B to E) Proportions of SD-ASM loci over total heterozygous loci in 200-bp bins near promoters, CpG islands, and enhancers.

We next evaluated the extent of reported SD-ASM. Consistent with genetic effects in cis (6, 2022), co-occurrence of ASM at the same heterozygous locus across different samples was higher than expected by chance under a permutation-based null model (fig. S6A). The degree of co-occurrence of ASM tended to be higher for pairs of samples across tissues of the same individual than between pairs from the same tissue across different individuals, which was higher than for samples without matching tissue or individual (fig. S6B). Low concordance in ASM calls between individuals may be due to local haplotype context, epigenetic drift, or other nongenetic factors (3, 4, 6, 20, 22, 23). Gaussian mixture modeling (18) showed that allelic differences in methylation (above the 30% threshold) at heterozygous SNPs had a tendency to occur in the same direction (the same allele showing higher methylation than the other) across pairs of samples (fig. S6, C to E).

In order to increase the power to detect SD-ASM at high sensitivity, we pooled the reads across all 49 methylomes and applied the same detection method as for individual samples (fig. S7A) (18). The deep coverage of the combined set (1691-fold total coverage in bisulfite sequencing reads in the combined set of 49 methylomes) increased our power to detect those sequence-associated AIs that were detectable across different tissues and donors (fig. S7, B to D), whereas our power to detect tissue-dependent and donor-dependent SD-ASM was reduced (fig. S8, A and B). The number of accessible heterozygous loci (those having at least six counts per allele), for SD-ASM determination, after pooling rose to 4,913,361, increasing our SD-ASM mapping resolution—measured as an average distance between “index hets”—to 600 base pairs (bp). At the 30% methylation difference default threshold, AIs were detected at 5% of index hets; lowering the threshold to 20%, a total of ~8% index hets showed AIs.

Sensitivity of the methylome to genetic variation varies across classes of genomic elements

We next explored whether SD-ASM had the tendency to occur within any particular type of genomic element. Using the reads pooled across the 49 methylomes, we observed depletion of SD-ASM within promoters containing CpG islands (Fig. 1B), as well as within CpG islands in general (Fig. 1C), which is consistent with observations that ASM is depleted in CpG islands (4, 21) and that mQTLs are depleted within promoters of genes within CpG islands (24, 25) and that expression quantitative trait methylation is enriched within CpG island shores and not in CpG islands themselves (26). By contrast, and mirroring previous mQTL patterns (25), promoters of genes not in CpG islands showed high levels of SD-ASM (Fig. 1D). We also observed enrichment of SD-ASM downstream from the promoter and into the gene body (Fig. 1, B and D) and positive association between allele-specific expression (ASE) and ASM over exons (Fig. 1A), which is consistent with higher methylation of actively transcribed regions, including those on the X chromosome (27) and with the enrichment of mQTLs in regions flanking the transcription start site (TSS) (23, 28). One factor contributing to the ASM, particularly near the transcription start sites (Fig. 1D), may be the presence of transcriptional regulatory signals (29).

SD-ASM was also highly enriched within enhancers (Fig. 1E), which is consistent with previous reports (24, 28). The abundance of TF binding sites within enhancers suggests that SD-ASM may result from disruption of TF binding (23). Under that assumption, our data suggest that TF binding at CpG islands and CpG-rich promoters is buffered against genetic perturbations, whereas the TF binding to non-CpG promoters and enhancers is most sensitive. We also observed a somewhat puzzling mild depletion of SD-ASM in the flanking regions of enhancers (Fig. 1E), which also suggests buffering in those regions.

SD-ASM is attributable to differences between allele-specific epiallele frequency spectra

We next asked whether the lack of buffering at SD-ASM loci may result in excess stochasticity and metastability, which is defined by the presence of more than one stable state, each stable state corresponding to an epiallele (single-chromosome methylation pattern). To answer this question, we made use of the deep combined WGBS read coverage across 49 methylomes (table S2) and that each read relates a single variant to a single epiallele. We assessed epialleles by scoring the methylation status of four homozygous CpG sites (42 = 16 possible epialleles) that were the closest to each index het in individual WGBS reads (13, 14) (Fig. 2A). [Our use of the term “epiallele” follows the most recent usage (12, 13) and does not comply with the original definition (30), which implies intergenerational inheritance. Our use of the term “metastability” is consistent with its use in dynamical systems theory and does not imply inheritance of an epiallele during cell division.]

Fig. 2 Differences in epiallele frequency spectra causing SD-ASM.

(A) Example of an epiallele frequency spectrum (bottom) derived from observed epialleles in WGBS reads (top). (B) Histograms of Shannon entropy, in bits, for the epiallele frequency spectra for the hets showing SD-ASM (red) and the nearest (control) hets without SD-ASM (black). (C) Most heterozygous loci with two frequent epialleles show SD-ASM and have entropy larger than 1.7 bits (red portion of the bar), the two epialleles being biphasic (fully methylated or fully unmethylated) 71.7% of the time. The callout on the right provides an example of a het in which the difference between epiallele frequency spectra of allele 1 (A, orange) and allele 2 (G, blue) explains SD-ASM. (D) Histogram of coefficients of constraint for SD-ASM loci with two frequent epialleles. The callouts illustrate an example het (T/C, top right callout) with a low coefficient of constraint, and another (G/C, bottom right callout) with a high coefficient of constraint. (E) Illustration of buffering in contrast to ergodic/periodic and mosaic metastability.

To quantify the amount of stochasticity at index het loci, we used Shannon entropy (18). The entropy values ranged from 0 to 4: An even distribution of frequencies across the 16 possible epiallele patterns produces a maximum entropy score of 4 bits, whereas a complete absence of stochasticity because of maximal “buffering” implies just one epiallele with nonzero frequency and an entropy score of 0 bits. To assess quantitatively any differences in buffering (lack of sensitivity to genetic variation) between SD-ASM and control loci, we identified SD-ASM loci that had sufficient coverage and a close index het without ASM and compared entropies. A total of 6619 (2.7%) of 241,360 loci with SD-ASM met the two criteria (18). We observed a striking difference in entropy, providing a quantitative assessment of the higher stochasticity at the SD-ASM versus control loci (Fig. 2B).

We next examined enrichment for epigenetic polymorphisms at SD-ASM loci. We estimated the number of frequent epialleles for each locus by sorting the epialleles from the most to the least frequent and identified the minimal-size “top-list” of epialleles that accounted for at least 60% of all the reads with ascertained epialleles. In contrast to the control loci, which typically had only one high-frequency epiallele on the “top-list” and were therefore not epigenetically polymorphic, SD-ASM loci showed multiple frequent epialleles—in most cases, just two (Fig. 2C). By examining the top pairs of epialleles, we found that 71.7% of the pairs consisted of one that was completely methylated and another completely unmethylated (Fig. 2C). This is concordant with previous reports of biphasic (fully methylated and fully unmethylated) distributions of methylation in amplicons with high interindividual methylation variance and in polymerase chain reaction clones with bimodal methylation patterns (3, 31). AIs at SD-ASM loci could be traced to shifts in epiallele frequency spectra between alleles, typically shifts in relative frequencies of the fully methylated and fully unmethylated epialleles (Fig. 2C). We validated the observed excess of stochasticity and the enrichment for the biphasic pattern at SD-ASM loci using an independent WGBS dataset from the Encyclopedia of DNA Elements (ENCODE) (fig. S9, A to C) (18).

We next quantified the relationship between genetic variation and stochastic epialleles. At each locus, we estimated the probabilities of epialleles for each allele (higher probabilities are indicated by thicker arrows in Fig. 2, C and D). We then quantified the degree to which genetic alleles determine epiallele frequencies using a coefficient of constraint (18), an information-theoretic measure that is a generalization of the R2 coefficient of determination that is commonly used in genetics and is more appropriate for quantifying genetic determination of stochastic phenotypes. A larger value for the coefficient of constraint value signifies that epigenetic variation is more constrained and determined by genetic variation in cis. Intuitively, a larger coefficient of constraint indicates a larger difference in the epiallelic frequency spectra corresponding to the two alleles, implying a higher degree of determination of epiallele frequency spectra by the genetic alleles (Fig. 2D).

There are two general mechanistic models that could explain the effect of sequence variation in cis on epiallele frequency spectra. The ergodic/periodic model stipulates ongoing switching between metastable states, the transitions being stochastic with a possible component of periodicity, such as circadian oscillations. If a sufficient number of stochastic transitions from one epiallele to another occur, that epiallele frequency spectrum depends largely on the sequence-dependent shape of the current energy landscape (state transition probabilities) and not on the epigenetic memory of past events (Fig. 2E). By contrast, the mosaic model stipulates that epialleles are stably transmitted over time and even during cell division, being “frozen” after a period of initial metastability into one of the stable states. Both models entail a period of metastability, whether past (mosaicism) or current (ergodic/periodic model).

CTCF binding loci show sequence-dependent stochastic switching and looping

Because of its association with DNA methylation at a large number of binding sites, we next examined the role of CCCTC-binding factor (CTCF) in creating the metastable states that correspond to epialleles. Metastability is known to be created by positive (including double-negative) feedback loops (32) that in our case also include interactions in cis, such as the protection against DNA methylation by CTCF binding and reciprocal preference of CTCF for unmethylated DNA (33). The first indication of the role of CTCF binding in metastability came from the observation that the heterozygote with the larger coefficient of constraint (G/C het) also showed larger differences in predicted CTCF binding affinity between the two alleles than the other (T/C het) (Fig. 2D). Considering that the coefficient of constraint is proportional to the differences in epiallele frequency spectra for the two alleles (identical epiallele frequency spectra resulting in coefficient of constraint value of 0), this observation suggested a positive correlation between the coefficient of constraint and the differences in CTCF binding affinity for the two genetic alleles, which was indeed observed (Fig. 3A). In terms of the epigenetic landscape distortion due to genetic variation, we see that sequence variants that show larger differences in CTCF binding affinity also show greater differences in their epigenetic (energy) landscapes, as reflected in the more prominent shifts between alleles in their occupancy of metastable states (as measured by higher values of coefficient of constraint) (Fig. 3A, top). Because CTCF binding and demethylation of its binding site are mutually reinforcing (forming a positive-feedback loop and a metastable state), the model also predicts that the variants associated with higher CTCF binding affinities will show lower methylation, which is indeed the case (Fig. 3A, bottom) as previously observed (23, 24). Taken together, these results suggest sequence-dependent stochastic epigenetic switching between metastable states that is mediated by CTCF binding.

Fig. 3 Correlations between allelic differences in TF binding affinity, coefficient of constraint, and DNA methylation.

(A) (Top) Correlation between absolute CTCF binding affinity differences, based on position weight matrix scores (PWMs), and the coefficient of constraint for predicted CTCF binding sites with SD-ASM, two frequent epialleles, and a biphasic methylation pattern. (Bottom) Correlation between CTCF binding affinity and DNA methylation at predicted CTCF binding sites. (B) SD-ASM is more predictive of allelic looping (28 true positive of 44 predictions) than motif disruption scores (1 true positive of 44 predictions). To control for specificity, thresholds were selected so that both methods predicted the same number of hets (44) to show allelic looping. (C) SD-ASM at binding sites of 377 TFs defined with the SELEX method. The pie chart (left) and the table (right) indicate both enrichments and directionality trends using a shared color code. (D) (Top) Correlation between absolute ELK3 binding affinity differences and the coefficient of constraint for predicted binding sites with SD-ASM, two frequent epialleles, and a biphasic methylation pattern. (Bottom) Correlation between ELK3 binding affinity and DNA methylation at predicted ELK3 binding sites. (E) A mechanistic model of a sequence-dependent energy landscape with two metastable states: allele 1 (top row), corresponding to a landscape where the most frequently occupied metastable state corresponds to a completely unmethylated epiallele, and allele 2 (bottom row), corresponding to a landscape where the most frequently occupied metastable state corresponds to a completely methylated epiallele. Putative positive-feedback loops involving interactions between TF binding and binding site methylation are indicated for CTCF. An alternative model involving competitive binding of two TFs is indicated on the right. Significance of correlations was tested by using Student’s t test.

Because the CTCF TF establishes chromatin loops (34), we asked whether the allelic state of methylation also coincided with allelic looping. Toward this goal, we used a study (35) that reports heterozygous SNP loci that associate both with allelic CTCF binding and allelic chromatin looping, as determined by means of chromatin interaction analysis by paired-end tag sequencing (ChIA-PET). Indeed, a total of 44 of those SNP loci were also present in our dataset. Comparing our signals for the methylation state of CTCF binding sites with the predicted CTCF motif disruption scores suggested that SD-ASM is a more accurate indicator of allelic CTCF binding and looping than the motif disruption score (Fig. 3B) (18).

TF binding sites show sequence-dependent shifts in epiallele frequency spectra and AIs

Analyses of ASM at regulatory elements and eQTLs revealed associations between ASM and allele-specific histone marks with downstream allele-specific transcription (fig. S10, A to F) (18). These results complemented previous studies (22, 23) and suggested involvement of allele-specific TF binding and cofactors in ASM. To examine the role of allele-specific TF binding, we focused on the set of 377 TFs assessed for binding affinity using the high-throughput systematic evolution of ligands by exponential enrichment (SELEX) method (36). As for CTCF, we identified the subset of binding motif loci in a heterozygous state with two frequent epialleles and examined the correlation between coefficient of constraint and difference in predicted allelic binding affinities across these loci for each TF (table S3). Because of the relatively small number of such loci per TF, only 13 showed significant individual P values (Student’s t test, P < 0.05), with only CTCF surviving Bonferroni correction (for testing 377 TFs) (table S3). However, a majority (11 of 13) of the TFs that showed individually significant correlation also showed positive correlation (P = 0.01, binomial test), which is consistent with the pattern observed for CTCF where larger differences in TF binding affinities correspond to larger distortions in the configuration of metastable states within the landscape (table S3).

Likewise, we next examined for all 377 TFs whether disruptions of their predicted binding sites associated with methylation imbalances. A majority (241) showed SD-ASM enrichment within their binding motifs compared with flanking loci (500 bp on each side) (Fig. 3C and table S4), suggesting that TF binding associates with allelic DNA methylation. The SD-ASM outside of the examined motifs may be attributable to sequence variation within undiscovered binding loci, within motifs of noncoding RNAs, or within loci in physical proximity or contact with regions of perturbed TF activity.

We then examined the relation between allelic differences in motif strengths and methylation levels at SD-ASM loci (18). We observed that for more than half of the TFs tested (207), there was an association between motif strength and level of methylation (Fig. 3C). Most TFs (159) showed gain in methylation on the allele with the disrupted motif, which is consistent with the TF binding either protecting a region from passive methylation (37) or causing active demethylation (Fig. 3C) (38). By contrast, a smaller number of TFs (48), including members of TF families that recruit methyltransferases such as the ETS-domain TF family members (39, 40), showed loss of methylation on the allele with the disrupted motif (Fig. 3, C and D). About a quarter of TFs that show enrichment for SD-ASM show no bias in directionality (table S4), the lack of bias being explainable by contextual behavior at different binding loci, such as for nuclear factor of activated T cells 1 (41) or because of competing TFs at overlapping motifs. Our results support that TF motif sequences are predictive of proximal CpG methylation levels (23, 42, 43).

We sought to validate the downstream functional consequences of SD-ASM variants, with predicted allelic differences in TF binding, using a luciferase assay. We prioritized cis-overlapping motifs (CisOMs), including those of c-MYC proto-oncogene (cMYC) and tumor suppressor p53 (TP53) that show competitive binding at many loci (44), because CisOMs provide one of the mechanisms of metastability (Fig. 3E), and also those that may have consequences for human disease (table S5). All four SNP validations showed allelic effects on luciferase expression, including two SNPs within CisOMs for cMYC and TP53 and some falling within disease-associated loci (fig. S11) (18), which suggests that SD-ASM helps identify those disease-associated variants that also have functional consequences.

SD-ASM is enriched near disease-associated loci

We observed that heterozygous variants with SD-ASM were enriched in the neighborhood of variants previously reported as significant in genome-wide association studies (GWASs) of common disease (Fig. 4A) (22, 23, 45). The enrichment was stronger around GWAS variants that have been replicated in multiple studies versus those that have not. To explore more specifically the role of enhancers, we performed a similar enrichment analysis focusing only on GWAS and SD-ASM variants overlapping enhancer elements. Enhancers that contain replicated GWAS variants were significantly (P < 0.0001, χ2 test) more likely to also contain a variant with SD-ASM than enhancers that did not contain replicated GWAS variants (Fig. 4B). Taken together, these results indicate that AIs provide information about the role of specific loci in common diseases, pointing to the loci that are sensitive to the effects of genetic variation and have functional effects. The enrichment of both GWAS loci and AIs at enhancers, and sensitivity of TF binding to genetic variation discussed in previous sections, provide a mechanistic link between AIs and GWAS associations.

Fig. 4 Association of ASM with disease loci and purifying selection.

(A and B) Enrichment of ASM in the proximity of GWAS loci. ASM hets within 1 kb of GWAS loci are compared with colocalized hets without ASM. (C to F) Evidence of purifying selection acting on rare variants with ASM. [(C) and (D)] Proportion of variants associated with ASM compared with those without ASM among the rare (DAF < 1%) variants across individual methylomes. [(E) and (F)] Proportion of loci with ASM over total heterozygous loci over windows of increasing DAF in the combined set of methylomes. (F) This bar chart summary of the data in (E) shows the excess of SD-ASM variants among those with DAF < 1%. χ2 tests were used for significance of enrichments.

Variants showing SD-ASM are under purifying selection

Because the variants with large effects are under purifying selection, they tend to be rare, with frequencies below the detection threshold of association studies such as GWAS, mQTL, and eQTL. By contrast, AIs may provide evidence for functional effects even for rare variants that may be detected in only one individual. On the basis of previous studies that have used signatures of purifying selection such as shifts toward smaller derived allele frequency (DAF) to identify functional variants (46, 47), we would expect that ASM variants would also tend to have a lower DAF than those without ASM. Therefore, we obtained DAF estimates from the 1000 Genomes Project (48), ignoring variants that overlapped regions with low accessibility to variant calling. We observed that in nearly every sample in our dataset, heterozygous variants with ASM were significantly (P < 0.05, χ2 test) more likely to have DAF smaller than 1% than were those without ASM (methylation difference between alleles < 5%) (Fig. 4C). Overall, this analysis found ~130 (median) more rare (DAF < 1%) variants than expected among those with ASM per individual methylome, providing a lower bound on the number of those under purifying selection per individual. When we repeated the analysis for enhancer regions, strong signal was again observed (Fig. 4D), suggesting a median excess of at least 26 enhancer variants under purifying selection per individual.

The lower bounds from individual samples may underestimate the extent of purifying selection because of underdetection of SD-ASM. We therefore investigated whether an enrichment for rare variants could also be seen for those variants associated with SD-ASM from the combined dataset, using neighboring variants as controls (18). We observed that the chance of a locus having SD-ASM decreased as the derived allele frequency increased (Fig. 4E; there were very few variants with DAF > 50%, causing high variance and large confidence intervals). We further tested whether there was a significant enrichment for variants with DAF < 1% among those with SD-ASM and found that such enrichment was indeed significant (odds ratio 1.18; P < 0.0001, χ2) (Fig. 4F). That enrichment represents an excess of 2184 rare variants among those with SD-ASM compared with controls. Considering that this observed excess represents a set of 11 genomes (nine individuals and two cell lines), we estimate at least ~200 variants with SD-ASM under purifying selection per individual donor.

Discussion

Taken together, our findings suggest a mechanistic link between sequence-dependent AIs of the epigenome, stochastic switching at gene regulatory loci, and disease-associated genetic variation. Our allelic epigenome map reveals CpG methylation imbalances exceeding 30% differences at 5% of the loci, which is more conservative than previous estimates in the 8 to 10% range (3, 4); a similar value (8%) is observed in our dataset when we lowered our threshold for detecting AI to 20% methylation difference between the two alleles. We observed an excess of rare variants among those showing ASM, suggesting that an average human genome harbors at least ~200 detrimental rare variants that also show ASM.

The methylome’s sensitivity to genetic variation is unevenly distributed across the genome. The higher buffering of CpG islands, and associated promoters, may be due to their use by housekeeping genes. Conversely, other promoters and enhancers may show lower buffering because of their presence in tissue-specific genes that tend not to be associated with CpG islands (49). Our findings are consistent with the evolutionary advantages that may stem from the buffering of housekeeping genes against the effects of random mutations—while still retaining the potential for evolutionary innovation through changes in the regulation of less essential genes with tissue-specific expression patterns—and refine reports that suggest that all promoters show epigenetic buffering (50). Highest sensitivity to genetic variation at enhancers, and potentially perturbations in general, is consistent with observations of high cell-to-cell variability of their methylation (28, 51). Validated GWAS loci, and the enhancers within those loci, show enrichment for AIs, suggesting that sensitivity of those loci to genetic variation, and potentially also to environmental influences, may at least in part explain their role in common diseases.

Overall, our results suggest an explanation for the conservation of “intermediate methylation” states at regulatory loci (52). These intermediate methylation states reflect the relative frequencies of fully methylated and fully unmethylated epialleles corresponding to biphasic on/off switching patterns (31) at regulatory loci marked with SD-ASM. Moreover, our analyses reveal that the SD-ASM is explainable by allele-specific switching patterns at thousands of heterozygous loci.

Waddington’s epigenetic landscape has served as a guiding metaphor for the emergence of cellular identities during development. The landscape has until now been an abstract construct (53), disconnected from the mechanistic function of gene regulation. Our energy landscapes (Figs. 2D and 3E) may be interpreted as a special type of epigenetic landscape model of gene regulatory interactions in cis in which epialleles correspond to metastable states (attractors) within the landscape (30). As cells transition into a more differentiated state, the cellular epigenome enters a buffered “valley” at a bifurcation point in the landscape (Fig. 2E). Because our current dataset is static, it precludes us from being able to distinguish between the mosaic and stochastic/periodic models that characterize the more and less differentiated states, respectively.

Consistent with proposed theoretical models that define the landscape in terms of potential energy functions (53) and postulate local attractors created by positive-feedback loops (32), our model involves interactions between one or more TFs, DNA methylation, and likely other epigenomic marks. Competitive binding of TFs at CisOMs may be one of the mechanisms of metastability. By putting the metastable states within the landscape in correspondence with epialleles and TF binding, we bring Waddington’s landscapes into correspondence with assayable and quantifiable epiallele frequency spectra and with specific mechanisms of gene regulation.

Our findings are consistent with the role for bistable switching and stochasticity in bacterial gene regulation (54) and extend stochasticity to eukaryotic cells in vivo within their natural tissue context across a diversity of human tissues. One obvious question is the possible purpose of stochasticity at gene regulatory loci across both domains life. The sharp contrast between the “digital” nature of regulatory elements and the “analog” nature of concentrations of upstream TFs, and downstream gene products, suggests that we may be observing a naturally evolved system similar to von Neumann’s “stochastic computer,” an early hybrid analog-digital computer design that can implement regulatory circuits within control systems in which analog quantities are not encoded by using the usual binary or decimal system but are encoded as fractions of “on” states in a stochastic series of on and off states (55). In contrast to a standalone stochastic computer, there is a multitude of cells within a tissue, raising the question about the role of intercellular communication within tissue microenvironment in the establishment of a dynamic equilibrium that results in observed methylation averages.

To promote further data analyses and experimental work by the community, we provide an Allelic Epigenome Atlas, a collection of annotations, including AI scores, for all of the ~4.9 million heterozygous loci analyzed here (18). The breadth of the genome-wide AI scores available in the Allelic Epigenome Atlas, which includes the multitude of different individual tissue and cell types profiled here, may serve as additional layers of functional evidence for prioritization of disease-causal candidate variants, including subthreshold GWAS variants in implicated regions and noncoding variants detected with clinical whole-genome sequencing.

Materials and methods

We identified heterozygous SNP loci using a joint variant-calling pipeline on WGS datasets from 13 donors from the NIH Roadmap Epigenomics Project. We used ChIP-seq and RNA-seq datasets from a total of 71 various tissues of these donors to detect AIs in histone marks and transcription as described (19). We detected allele-specific methylation using an in-house script on 49 WGBS datasets to compare counts of methylated and unmethylated cytosines proximal to each allele. We tested different regulatory regions and GWAS and eQTL loci for enrichment of AIs using nearby heterozygous SNPs without AI as controls. We compared differences in methylation between alleles with the alleles’ predicted TF motif strengths. We analyzed epialleles by quantifying the methylation patterns of the four closest CpGs to the alleles in single WGBS reads and comparing these patterns between alleles. We calculated Shannon entropy at different loci using epiallele patterns to quantify stochasticity at SD-ASM loci. We calculated coefficient of constraints at SD-ASM loci to determine the constraint of epiallele polymorphisms by genetic variants.

Supplementary Materials

www.sciencemag.org/content/361/6409/eaar3146/suppl/DC1

Materials and Methods

Figs. S1 to S12

Tables S1 to S5

References (5678)

References and Notes

  1. Materials and methods are available as supplementary materials.
Acknowledgments: We thank I. Golding for helpful comments and reviewing the paper and Y. Ruan for assisting with allelic looping datasets. Funding: A.M. acknowledges support from the Common Fund of the National Institutes of Health (Roadmap Epigenomics Program, grant U01 DA025956). M.G. acknowledges support from the National Human Genome Research Institute (grant 5U24HG009446-02). W.D.F. acknowledges support from the National Institute of General Medical Sciences (grant GM122030-01). M.K. acknowledges support from the National Institute of Mental Health (grant R01 MH109978) and the National Human Genome Research Institute (grants U01 HG007610 and R01 HG008155). Author contributions: V.O., E.L., and A.M. conceptualized research goals and designed studies; R.A.H. and C.C. performed collection and organization of sequencing datasets; C.C., P.P., and R.Y.P. performed processing of sequencing datasets; V.O., E.L., P.P., R.Y.P., J.R., T.G., Z.H., R.C.A., Z.Z., and L.A. performed data curation; Z.H. created and performed variant-calling pipelines with supervision from F.U.; J.R., T.G., R.C.A., and Z.Z. provided software and participated in allelic histone calling with supervision from M.K. and M.G.; V.O., E.L., and I.C. performed bioinformatics, statistical analyses, and presentation of data with supervision from A.M.; J.W.B. performed luciferase assay validation experiments with supervision from W.D.F.; E.L. performed validation experiments on external datasets; L.A. developed the online resource; V.O., E.L., and A.M. wrote the original draft; and E.L., M.K., and A.M. revised and edited the manuscript. Competing interests: The authors declare no competing interests. Data and materials availability: All allelic datasets are available on https://genboree.org/genboreeKB/projects/allelic-epigenome. Code used for bioinformatic analyses can be found at https://github.com/BRL-BCM/allelic_epigenome. The whole-genome sequencing, whole-genome bisulfite sequencing, and ChIP-sequencing reads that were used are stored on the Roadmap Epigenomics BioProject page (accession nos. PRJNA34535 and PRJNA259585) and dbGaP (accession nos. phs000791.v1.p1, phs000610.v1.p1, and phs0007000.v1.p1).
View Abstract

Navigate This Article