Research Article

Chromatin three-dimensional interactions mediate genetic effects on gene expression

See allHide authors and affiliations

Science  03 May 2019:
Vol. 364, Issue 6439, eaat8266
DOI: 10.1126/science.aat8266

Noncoding variation and gene expression

Natural genetic variation outside of protein coding regions affects multiple molecular phenotypes that can differ across individuals. To examine how genomic variation affects proximal (cis) or distal (trans) gene regulation, Delaneau et al. analyzed gene expression, chromatin, and the three-dimensional conformation of the genome. Clustering regulatory elements and activity across individuals reveals genomic structures termed cis-regulatory domains and trans-regulatory hubs that affect gene expression. Associations between these structures and genes within and across chromosomes contribute to links between noncoding genetic variation and gene expression.

Science, this issue p. eaat8266

Structured Abstract

INTRODUCTION

Genome-wide studies on the genetic basis of gene expression have advanced considerably our understanding of the function of the human genome. Large collections of expression quantitative trait loci (i.e., genetic variations affecting gene expression; eQTLs) are now available across many cell types, tissues, and conditions and are commonly used to better interpret the effects of noncoding genetic variations. Although this constitutes an extraordinary resource to study complex organismal traits and diseases, we still have a poor understanding of how they affect the regulatory machinery, which regulatory elements (REs) they perturb, and how their effects propagate along regulatory interactions.

RATIONALE

In this study, we aimed to characterize the complex and cell type–specific interplay between genetic variation, REs, and gene expression to dissect cis- and trans-regulatory coordination. To this end, we assembled and analyzed a population-scale dataset combining the activity of REs [measured by chromatin immunoprecipitation sequencing (ChIP-seq) for methylated histone 3 at lysine 4 (H3K4me1), trimethylated histone 3 at lysine 4 (H3K4me3), and acetylated histone 3 at lysine 27 (H3K27ac)], the expression of genes (using RNA-seq), and genetic variations for 317 lymphoblastoid and 78 fibroblast cell lines, all from European ancestry.

RESULTS

First, we show that the regulatory activity is structured in 12,583 well-delimited cis-regulatory domains (CRDs) that respect the local chromatin organization into topologically associating domains (TADs) but constitute finer organization. Our work suggests that three-dimensional (3D) organization in cis can be broadly categorized into functionally linked and unlinked domains, with the most-linked ones corresponding to CRDs. In addition, we found 25,315 significant associations between CRDs located on distinct chromosomes that form 30 trans-regulatory hubs (TRHs). These TRHs are consistent with a higher-order chromatin organization into A and B nuclear compartments and show a signal of allelic coordination, suggesting that some of the trans associations are not transcriptionally mediated and result from a complex and higher-order 3D nucleus organization.

Second, we show that CRDs and TRHs essentially delimit sets of active REs involved in the expression of most genes and provide a dense genome-wide map linking REs and genes. We show that these links vary substantially across cell types and are key factors involved in the cis and trans coexpression of genes.

Third, we show that CRDs are under strong genetic control. We discovered a total of 58,968 chromatin peaks affected by nearby genetic variants (cQTLs), 6157 QTLs that affect the activity of CRDs (aCRD-QTLs), and 110 QTLs that affect the correlation structure within CRDs (sCRD-QTLs). These QTLs tend to locate close to their genomic targets, are enriched within functional regions of the genome, and frequently overlap genetic variants associated with complex traits.

Finally, we show that CRDs and TRHs capture complex regulatory networks along which the effects of eQTLs are propagated and synergized to affect gene expression. Overall, we estimate that 75% of the eQTLs also affect the activity of CRDs and describe four specific types of genetic effects that can be mediated by CRDs: (i) cis-eQTLs affecting distal genes, (ii) multiple cis-eQTLs with independent effects, (iii) multiple rare variants that have a cumulative effect, and (iv) trans-eQTLs.

CONCLUSION

We provide a genome-wide map of the coordination between REs and describe how this serves as a backbone for the propagation of noncoding genetic effects in cis and trans onto gene expression. We show how these types of data can reveal higher-order functional attributes of the genome and can serve as an effective prior to boost future association studies at both the discovery and interpretation levels. Overall, our study reveals the complexity and specificity of the cis- and trans-regulatory circuitry and its perturbation by genetic variations.

Interindividual correlation between nearby chromatin peaks forms CRDs in the context of TADs.

CRDs capture the structure of the coordination between nearby REs along which genetic effects combine and propagate to distal genes. A magnified view of a region spanning 1000 chromatin peaks on chromosome (chr) 8 is shown. Interindividual correlation (shades of blue) is given in the context of Hi-C contacts (shades of red). Black triangles indicate CRD calls, and red and blue intervals indicate the genomic locations of TADs and CRDs. bp, base pairs.

Abstract

Studying the genetic basis of gene expression and chromatin organization is key to characterizing the effect of genetic variability on the function and structure of the human genome. Here we unravel how genetic variation perturbs gene regulation using a dataset combining activity of regulatory elements, gene expression, and genetic variants across 317 individuals and two cell types. We show that variability in regulatory activity is structured at the intra- and interchromosomal levels within 12,583 cis-regulatory domains and 30 trans-regulatory hubs that highly reflect the local (that is, topologically associating domains) and global (that is, open and closed chromatin compartments) nuclear chromatin organization. These structures delimit cell type–specific regulatory networks that control gene expression and coexpression and mediate the genetic effects of cis- and trans-acting regulatory variants on genes.

A decade of unbiased genome-wide association studies revealed that most of the genetic determinants of complex traits likely reside in noncoding regions of the genome (13) and are genetic variants modulating gene expression, usually called expression quantitative trait loci (eQTLs) (4). As a consequence, large genome-wide collections of eQTLs across multiple human populations (5), conditions (6), and tissues (7) have been collected and now offer the resources to systematically identify the genes and tissues involved in complex traits (8). We do not understand precisely the upstream and downstream effects of eQTLs on the regulatory machinery of the cell, yet recent developments in high-throughput assays that capture biochemical changes [e.g., chromatin immunoprecipitation sequencing (ChIP-seq) (9)], open regions [e.g., ATAC-seq (10)], and conformation [e.g., Hi-C (11)] at the chromatin level provide the tools to achieve this task. These tools have already provided key insights into the molecular mechanisms underlying eQTL function, and the consensus picture that now emerges (12) suggests that eQTLs perturb binding of transcription factors, which results in variable activity of the corresponding regulatory elements (REs) (13, 14). Then, this effect propagates to distal REs and genes through chromatin interactions (1520). These chromatin interactions may result in short-range dependencies and cross-talk [previously identified within variable chromatin modules (VCMs) (19)], and their possible range has been previously associated with chromatin contact domains [called topologically associating domains (TADs)], which represent the structural units of the three-dimensional (3D) nuclear organization (11, 21, 22).

To formally characterize the specific and fine structure, as well as the interplay between genetic variants, regulatory activity, and gene expression, we assayed a combination of three well-studied histone modifications known to reliably capture enhancer and promoter activities, together with genetic variation and gene expression in 317 lymphoblastoid cell lines (LCLs) and 78 primary fibroblast lines all derived from unrelated European individuals (19). By leveraging the interindividual variability inherent to the population design of this study as functional perturbation, we (i) provide large collections of regulatory variants, (ii) profile the coordinated activity of regulatory elements, (iii) characterize how the coordination is structured around intra- and interchromosomal DNA contacts, and (iv) characterize how genetic variability is mediated by this coordination to affect gene expression.

Population-scale datasets of chromatin activity across two cell types

Chromatin activity for a total of 271 LCLs and 78 primary fibroblast lines from overlapping sets of individuals was obtained by performing ChIP-seq for three histone modifications [acetylated histone 3 at lysine 27 (H3K27ac), methylated histone 3 at lysine 4 (H3K4me1), and trimethylated histone 3 at lysine 4 (H3K4me3)]. By combining data from a previous study (19), we analyzed a total of 317 LCLs and 78 fibroblasts, all of European ancestry, with sufficient sequencing coverage to quantify chromatin activity, that is, enrichment of histone modifications (figs. S1 and S2) (23). The population design of this study requires quantifying each sample with the same coordinates for chromatin peaks. We therefore performed peak calling for each ChIP-seq assay on consensus sequence data files that we generated by subsampling sequence data from individual files. The resulting peak coordinates were then quantified for each sample in turn, which allowed us to integrate all chromatin assays into two matrices comprising 271,417 chromatin peaks quantified in 317 LCLs and 78 fibroblasts (fig. S3). We added gene expression that we measured using RNA-seq and genetic variation from imputed single-nucleotide polymorphism (SNP) arrays to the data. To account for confounding factors in any analysis (figs. S4 to S6), we regressed out multiple covariates in the molecular phenotype data—for example, sex, ancestry [three first principal components (PCs) on genotype data], the cohort of origin, and technical factors that we captured by performing principal components analysis (PCA) on the molecular data (fig. S7).

A collection of 58,968 chromatin quantitative trait loci

Using chromatin activity as a phenotype, we found that a substantial fraction of the chromatin peaks is under genetic control: We discovered 44,492 and 14,476 significant chromatin quantitative trait loci (cQTLs) at a 5% false discovery rate (FDR) for LCLs and fibroblasts, respectively (Table 1). As expected, the cQTLs replicate well in an independent dataset for neutrophils from the Blueprint project (24) (fig. S8), tend to locate near the peaks they are associated with, are enriched in open chromatin regions, and exhibit small allelic biases (fig. S9, A to C). That chromatin peaks associated with cQTLs cluster together (fig. S9D) suggests some coordination between nearby REs (17, 19).

Table 1 Molecular QTLs discovered in LCLs.

The five columns provide (i) the type of molecular QTL, (ii) the association tests that were performed to discover them, (iii) if the tests were done in cis or trans, (iv) the number of tested molecular phenotypes, and (v) the number of significant QTLs discovered at 5% FDR.

View this table:

RE variability reveals thousands of cis-regulatory domains

To study the coordination of REs, we systematically measured the interindividual correlation between chromatin peaks located within a sliding window spanning 250 peaks. This revealed a widespread correlation that rapidly decays with distance, slightly varies across assay pairs, and shows increased cell-type specificity over long range (Fig. 1A and fig. S10). Importantly, this correlation is not specific to this data because we also observe it in an independent dataset for neutrophils from the Blueprint project (24) (fig. S11), with a relatively good degree of agreement (fig. S12). The correlation forms well-delimited domains that we call cis-regulatory domains (CRDs) (Fig. 1B) (19). We produced a genome-wide call set of CRDs using an algorithm based on hierarchical clustering that iteratively groups chromatin peaks into CRDs on the basis of their correlation levels (23). This regrouped 40.9% (n = 111,005) and 16.6% (n = 45,062) of the chromatin peaks within 12,583 and 10,442 CRDs in LCLs and fibroblasts, respectively.

Fig. 1 Intrachromosomal coordination between REs.

(A) Genome-wide map of the squared interindividual correlations between nearby chromatin peaks (shades of blue). The largest chromosomes (chr) are split across multiple rows. (B) Magnified view of the boxed area in (A), a region spanning 2000 chromatin peaks on chromosome 4. Interindividual correlation (shades of blue) is given in the context of Hi-C contacts (scaled between 0 and 1; shades of red). CRD calls are shown with black triangles, and the genomic locations of TADs and CRDs in the region are shown with red and blue intervals, respectively. Mbp, mega–base pairs.

In the case of LCLs, CRDs capture coordinated activity at 13,872 (57.7%) and 55,059 (40.5%) of the putative promoters and enhancers once the chromatin peaks are collapsed into nonoverlapping REs (fig. S13A). On average, a CRD contains 5.6 REs, but this varies substantially (44.6% with 2 REs and 14.2% with >10 REs; fig. S13B). As a result, CRDs help us to study enhancer-promoter coordination (fig. S14A): A promoter, on average, coordinates with 7.8 enhancers [median (md) = 4, standard deviation (sd) = 11.5], whereas an enhancer with 1.9 promoters (md = 1, sd = 2.1; fig. S14B) is in line with previous estimates (25). Only 46.2% of the promoters coordinate with their closest enhancer (fig. S14C), confirming that enhancers should not be assigned to promoters on the basis of proximity alone. In addition, enhancers tend to locate on one or the other side of the promoters, as suggested by the enrichment of promoters at CRD boundaries (fig. S14D).

The analysis of allele-specific effects (ASE) in chromatin peaks reveals that the coordination between REs occurs in a haplotype-specific manner (17, 23). Indeed, we found that distinct REs tend to exhibit haplotypic activity coordination when they belong to the same CRDs (fig. S15). This shows that the coordination between REs discovered at the population level occurs in cis and is observed at the individual level using ASE. Mapping CRDs using population data is, by nature, sample-size dependent. We therefore subsampled the LCL data in groups of 50 individuals to assess our discovery power and found that 317 LCL samples provide reasonable power: Saturation is reached in terms of number of CRDs discovered, whereas more samples would better delimitate their chromatin peak content (fig. S16).

Overall, interindividual correlation between chromatin peaks is able to reveal the coordinated activity of a large fraction of REs in the genome, a coordination that occurs within thousands of CRDs and defines the manner by which genetic effects affect the cis-regulatory landscape of genes.

CRDs capture functional links between REs

Physical interactions between REs are commonly mapped with chromosome conformation capture (3C) techniques, such as Hi-C. In practice, Hi-C measures 3D contacts between distinct DNA regions and therefore potential functional links between REs (26). By contrast, interindividual correlation captures the coordinated activity of REs resulting from (direct or indirect) functional links. We therefore looked at how functional links compare to 3D contacts at three different levels.

First, we used publicly available deep Hi-C data for LCLs (22) and estimated the Hi-C contact intensity between any pair of chromatin peaks (23). We compared the resulting DNA contacts with the interindividual correlations for the same pairs of peaks and observed overall good concordance between the two. We found stronger Hi-C contacts at significantly correlated pairs of peaks (Fig. 2A), a higher chance of significant correlation between peaks with strong Hi-C contact (Fig. 2B), and an overall positive correlation between the two pairwise measures (fig. S17, A and B).

Fig. 2 Functional links versus 3D contacts.

(A) Distributions of the log Hi-C contact intensities at significantly associated (blue) and nonassociated (white) pairs of chromatin peaks within 20 bins of increasing distance between peaks. Increases in the means are shown at the bottom. Error bars indicate ±1.5 times the interquartile range. (B) Percentages of significantly associated pairs of peaks at the 1 and 10% strongest and weakest Hi-C contacts within 20 bins of increasing distance between peaks. (C) Density of REs located in CRDs relative to TAD boundaries. REs are either located at the CRD boundaries (black) or not (gray). (D) Genomic sizes of CRDs and TADs. md, median; mu, mean.

Second, we compared the respective domains (CRDs and TADs) inferred from the two pairwise measures. We found that they frequently overlap (fig. S18, A and B), they often regroup the same pairs of peaks (fig. S18C), and their respective boundaries tend to match (Fig. 2C). However, CRDs remain substantially smaller than TADs (32 versus 185 kb; Fig. 2D), suggesting that CRDs define regulatory units within TADs. To further refine this idea, we classified the chromatin peaks into four categories depending on their location relative to TADs and CRDs (i.e., in or out). This showed higher activity and variability for chromatin peaks within CRDs as opposed to TAD-specific peaks (fig. S18, D and E).

Third, we compared the two types of pairwise measures at the level of protein binding sites for LCLs that we obtained from the ENCODE project (27). We found that the correlation responds to these specific DNA elements the same way that Hi-C does (22), that is, there is a higher density of CCCTC-binding factor (CTCF) binding sites at CRD boundaries (fig. S19, A and B) and the overall level of correlation is higher when measured in between binding sites of structural proteins (CTCF, RAD21, SMC3, and ZNF143; fig. S19C).

All this constitutes evidence that interindividual correlation between peaks reflects the local 3D conformation of chromatin and that CRDs delimit the most active and variable regulatory units in which 3D contacts are realized into functional links and result in coordinated activity between REs.

Interchromosomal coordination between REs reveals trans-regulatory hubs

By using Hi-C, multiple studies have characterized a high-level organization of the chromatin in the nucleus within A and B nuclear compartments [where A is open or expression-active chromatin and B is closed or expression-inactive chromatin (11, 22)]. Here we used population variability of chromatin activity to uncover the interchromosomal coordination between REs. Specifically, we measured interindividual correlations for pairs of chromatin peaks located on distinct chromosomes and found a substantial signal of association (fig. S20, A and B), particularly at the strongest Hi-C contacts (Fig. 3A). When only considering pairs of chromatin peaks with a correlation P value below 1 × 10−6, we found that some chromosomes preferentially associate together (fig. S20C), especially those with high gene density (Fig. 3B).

Fig. 3 Interchromosomal coordination between REs.

(A) Percentages of significant association at 1% FDR between pairs of peaks located on distinct chromosomes when measured in 50 bins of increasing Hi-C contact intensities. (B) Per-chromosome fraction of correlation P values below 1 × 10−6 as a function of gene density. Significance of the correlation between both is given in the top left. (C) Numbers and percentages of CRDs as a function of the number of other CRDs on distinct chromosomes that they are associated with (i.e., degree). (D) Network representation of the 1000 strongest associations between CRDs on distinct chromosomes. Each network node is a CRD, and each edge is a significant association between two CRDs. Nodes are colored depending on their nuclear subcompartment membership. (E) Significant enrichments in each TRH (rows) for CRDs located in either A or B nuclear compartments (columns). Each TRH is identified by the number of CRDs it contains (from 19 to 901).

To define individual associations, we performed interchromosomal correlation analysis between CRD activity rather than individual chromatin peaks. This reduced the number of statistical tests by three orders of magnitude (from n = ~3.4 × 1010 to ~7.5 × 107) and concentrated the analysis on the most informative peaks, because CRDs are enriched for chromatin peaks with interchromosomal correlations (Fisher test odds ratio = 3.2, and P < 1 × 10−300). In practice, this requires quantifying CRD activity per individual, a task that we performed by aggregating together the quantifications of multiple CRD peaks into a single CRD quantification vector (23). This allowed us to discover 25,315 significant interchromosomal associations at 1% FDR, thereby revealing extensive interchromosomal coordination between CRDs. Those associations are unlikely to result from translocation events (fig. S21) and show enrichment or depletion for specific pairs of protein binding sites (fig. S22). Bringing them into a network structure reveals that few CRDs play a central role in linking together multiple CRDs from distinct chromosomes (Fig. 3C): They form trans-regulatory hubs (TRHs). In practice, we identified 30 TRHs containing >10 distinct CRDs using network community detection (23). These TRHs have interesting biological properties. They regroup CRDs that come from distinct chromosomes (fig. S23A), usually from the same nuclear compartment (odds ratio = 4.7; Fisher test P < 1 × 10−300; Fig. 3, D and E, and fig. S23B); often link CRDs located at the telomeres (odds ratio = 1.6; Fisher test P = 5.2 × 10−13; fig. S23, C and D); and exhibit similar profiles in terms of protein binding site density (Spearman’s rho = 0.387, P < 1 × 10−300; fig. S23, E and F). Interchromosomal coordination between REs is therefore abundant in the genome and forms hub structures (TRHs) in the context of the nuclear organization in A and B compartments.

To corroborate the trans CRD associations, we attempted to replicate them in neutrophils, monocytes, and T cells using the Blueprint dataset (24). We defined the same exact coordinates of peaks and CRDs using two of the three overlapping assays between our study and Blueprint and tested the enrichment of low P values for the same 25,315 pairs of associations discovered in LCLs. We observed a proportion of true positives (pi1) = 17.2% (baseline pi1 = 10.1%) for neutrophils, pi1 = 10.0% (baseline pi1 = 7.1%) in monocytes, and pi1 = 20.6% (baseline pi1 = 15.4%) in T cells (fig. S24, A to C). This suggests that, even though trans signals are expected to be highly tissue specific, a substantial fraction is shared between LCLs and the three tissues of the Blueprint dataset (24).

Transcriptional mediation is the most immediate explanation for this trans coordination between REs. However, as suggested by the overlap between strong associations and contacts (Fig. 3A), the enrichment in A and B compartments (Fig. 3D), and some recent work on the regulation of olfactory receptors (28), it seems that there could be other mechanisms involving 3D interactions. To further assess whether some fraction of the trans signal is not transcriptionally mediated, we tested the possibility of allelic coordination between trans regions. If some of the associations are due to 3D interactions, it is conceivable that there is allelic preference in the interaction. Specifically, we tested whether pairs of associated CRDs share high chromatin ASE more frequently than pairs of CRDs from the same sample but without significant association (P > 0.5). We found a modest, but highly significant, signal of allelic coordination between CRDs in trans (fig. S25), thereby suggesting the presence of some functional coordination at the allelic level.

Variability in CRDs and TRHs affects the expression and coexpression of genes

The population-scale design of this study enables the characterization of the interplay between genes and REs at a large scale and high resolution: We can directly test genes for association with CRDs. Doing so in LCLs unravels a strong signal of association (23): 54% of the genes are associated with CRDs in cis (±1 Mb), which represents a total of 12,204 CRD-gene associations at 1% FDR (Fig. 4A and fig. S26A). This captures a widespread interplay between gene expression and CRDs with two notable degrees of complexity.

Fig. 4 The regulatory function of CRDs and TRHs.

(A) Examples of associations (black segments) between activity of CRDs (blue segments) and expression of nearby genes [red arrows represent transcription start sites (TSSs), and black boxes represent gene bodies] within an 11-Mb region on chromosome 20. (B and C) Distribution of the gene TSSs relative to the boundaries of the CRDs they are associated with, for TSSs either inside (B) or outside (C) CRDs. (D) Number and percentage of genes or CRDs as a function of the number of CRDs or genes they are associated with, respectively. (E) Percentages of pairs of genes being coexpressed or not (5% FDR) that associate with identical CRDs. This has been measured in seven bins of increasing distance between the genes. Odds ratios and their significance are given. (F) Percentage of pairs of genes being coexpressed or not (5% FDR) that associate with CRDs coordinated in trans. Odds ratio (OR) and its significance are given.

First, only 53.2% of the genes fall within the boundaries of the associated CRDs (Fig. 4, B and C). Even though genes are enriched at CRD boundaries, some associations capture long-range effects spanning hundreds of kilobases with small effect sizes and high cell-type specificity (fig. S26, B to D). Focusing on those spanning more than 10 kb, we found that they are well supported by chromatin conformation because they occur within TAD boundaries more frequently than expected by chance (fig. S27A). This suggests another layer of coordination between nearby CRDs that likely results from the hierarchical nature of regulatory interactions.

Second, genes are often associated with more than two CRDs (23.3% of genes) and CRDs with more than two genes (20.2% of CRDs; Fig. 4D). As a result, we found that nearby coexpressed genes associate more often with identical CRDs than genes that are not coexpressed (Fig. 4E). By extending this idea in trans, we also found that coexpressed genes in trans tend to associate with coordinated CRDs in trans (Fig. 4F). In addition, CRDs from the A nuclear compartment associate more often with genes than CRDs from the B compartment (fig. S27B). Altogether, this suggests that CRDs and their interchromosomal coordination are key factors involved in cis and trans coexpression of genes.

Activity and structure of CRDs are under strong genetic control

The relatively large sample size of the LCL dataset (n = 317) offers the required statistical power to analyze the effects of genetic variants on CRDs and genes (i.e., QTLs). To discover CRD-QTLs, we quantified (i) the activity of a CRD as the mean activity of the peaks it contains and (ii) its structure by assessing whether the data for each individual contribute positively or negatively to the mean correlation (23). We tested both phenotypes as well as gene expression for association with genetic variants in cis (±1 Mb) and discovered 6157 activity CRD-QTLs (aCRD-QTLs; Fig. 5A), 110 structure CRD-QTLs (sCRD-QTLs; Fig. 5B), and 7658 eQTLs (Table 1 and fig. S28A).

Fig. 5 The genetic control of CRDs and its effect on gene expression.

(A and B) Examples of an aCRD-QTL (A) and a sCRD-QTL (B). For each, the correlation structure (positive and negative correlations in blue and red, respectively) and the normalized activity distribution (shown at the bottom) of all chromatin peaks contained in the CRDs are stratified by the QTL genotype (REF/REF, REF/ALT, and ALT/ALT, where REF and ALT are the reference and nonreference genome alleles, respectively). The numbers of individuals (n) per QTL genotypes are given. (C) MAF spectrum for all types of QTLs. (D) Spearman correlation between all pairs of ASE calls located in the same CRDs. Because phase is known, ASE calls are haplotype specific [imbalance is h0/(h0 + h1), where h0 and h1 are the number of reads mapping to the alleles carried by the first and second haplotypes, respectively]. Correlation has been measured within CRDs without a QTL (gray) or with a QTL in individuals that are homozygous for the major allele (HOM) at the QTL (red) or that carry at least one copy of the minor allele (blue, HET + HOM). CRDs were further stratified depending on the type of QTL (no QTL, aCRD-QTL, or sCRD-QTL) and the sign of the QTL regression slope of the minor allele (+ or −). (E) Percentages of genes associated with one CRD having one aCRD-QTL (gray), one CRD with two or more aCRD-QTLs (blue), or two or more CRDs (green) as a function of the number of eQTLs associated with it. (F) Mean probabilities for each causal model as a function of distance between the eCRD-QTLs and the target genes. Only eCRD-QTLs falling within their target CRD are shown. Q, QTL; G, gene.

All QTLs exhibit the same pattern in terms of genomic distribution: They are stronger and more frequent as we approach the genomic source of the target phenotype (fig. S28B). However, they substantially differ in terms of the minor allele frequency (MAF) spectrum: sCRD-QTLs usually involve genetic variants with a MAF below 20%, as compared with aCRD-QTLs and eQTLs, which span the full MAF range (Fig. 5C). They also differ in their impact on DNA binding motifs: Alternative alleles at sCRD-QTLs involve less protein binding motifs compared with aCRD-QTLs and eQTLs (fig. S28C). These two latter points suggest a stronger selective pressure on sCRD-QTLs and suggest overall that there are two types of genetic variations affecting CRDs: common genetic variants modulating the CRD activity and rare genetic variants disrupting the CRD structure, that is, the local coordination between nearby REs.

To corroborate sCRD-QTLs, we first replicated them in the Blueprint dataset. Given the overlap with only two of the three chromatin assays and the low MAF of sCRD-QTL variants, we were able to test 55 of the 110 sCRD-QTLs. Of those, 10, 18, and 11 had a P < 0.05 in neutrophils, monotcytes, and T cells, respectively; 23 of them had a P < 0.05 in at least one cell type; and 5 had a P < 0.05 in all cell types (fig. S29, A to C). The direction of the effect was shared in the majority of cases (fig. S29, D to F). Given the smaller sample size of the Blueprint dataset and the distinct cell types, the level of replication is substantial. To offer an orthogonal corroboration of the effects of aCRD-QTLs and sCRD-QTLs, we tested for allelic coordination within CRDs (Fig. 5D) and found (i) that the level of allelic coordination varies depending on the genotypic categories for both aCRD-QTLs and sCRD-QTLs, (ii) that this variation depends on the direction of the sCRD-QTL effect conversely to aCRD-QTLs, and (iii) that the direction of the sCRD-QTL effects is consistent with the change in allelic coordination: A positive slope indicates a minor allele that decreases the overall correlation level and thereby the level of allelic coordination.

CRDs combine and mediate cis-eQTL effects over long genomic distances

A logical step forward is to assess the overlap between aCRD-QTLs and eQTLs (Table 1), which is facilitated by the dense map linking genes to CRDs. We therefore tested aCRD-QTLs against relevant genes and discovered that 72% of them also affect gene expression (fig. S30A). Despite this large overlap, aCRD-QTLs and eQTLs are not enriched within the same functional annotations (27, 29): aCRD-QTLs locate more often in open chromatin regions and enhancers and more rarely in promoters and transcribed regions (fig. S30B). This does not change their relevance in studying diseases, as shown by the similar enrichments they exhibit for known genome-wide association studies (13, 30) (fig. S30C). However, this seems to play a role in cell-type specificity: QTLs for CRDs are more cell-type specific than eQTLs (fig. S30D).

We next characterized the role of CRDs in mediating genetic effects on gene expression. First, we asked if the regulatory coordination within a CRD could combine multiple genetic effects on a single gene. We performed conditional analyses to discover genes with multiple eQTLs (23) and found that those genes are often associated with multiple CRDs (odds ratio = 1.42) or with a single CRD having multiple aCRD-QTLs (odds ratio = 1.81; Fig. 5E). This shows that the joint effect of multiple eQTLs on a gene mostly results from the regulatory coordination occurring within or between CRDs.

Second, we asked if the regulatory coordination within CRDs can mediate long-range eQTL effects. A prerequisite of this is to focus exclusively on genetic variants affecting both CRDs and genes. We detected those in an unbiased manner by directly quantifying the 12,204 gene-CRD pairs using PCA. We discovered QTLs for 8706 of them [i.e., expression CRD-QTLs (eCRD-QTLs); Table 1], thereby defining triplets (eCRD-QTL, CRD, and gene) in which we inferred the most likely causal relationships using Bayesian networks (23, 31, 32). We assigned a causal scenario for a vast majority of the cases (fig. S31A) and found that activity changes at CRDs are either causal or reactive to changes in gene expression depending on whether they originate from distal or proximal eCRD-QTLs, respectively (Fig. 5F). Functional elements contribute differently to these two types of causal mechanisms (i.e., causal or reactive): Causal CRDs involve QTLs enriched in enhancers, whereas reactive ones involve QTLs enriched in promoters and transcribed regions (fig. S31B). To summarize, CRDs mediate long-range effects of enhancer-related regulatory variations on genes and reflect short-range effects of promoter-related regulatory variations.

CRDs mediate the cumulative effects of multiple rare cis-regulatory variants

In the context of association studies, having characterized the local coordination between REs and genes is instrumental to study the collective effect of multiple regulatory variants. As a proof of principle, we used whole-genome and transcriptome sequencing data (5) to assess whether the accumulation of rare variants within CRDs is associated with the expression at nearby genes (i.e., rare-eQTL effects). To this aim, we defined a “genetic burden” for each CRD as the number of minor alleles at rare variants (MAF < 5%) in its regulatory regions (23). This was tested against nearby genes and revealed a non-negligible signal of association (Table 1 and Fig. 6, A and B). At 5% FDR, we found 33 CRDs whose baseline activity is perturbed by the accumulation of rare mutations with a level of expression at the downstream gene being decreased in 64% of the cases (fig. S32A). This illustrates the potential gain that future association studies could get from the CRD organization in the context of whole-genome sequencing.

Fig. 6 CRDs enhance association studies.

(A) Example of rare variants within a CRD associated with the expression of the CMBL gene. Shown from top to bottom are (i) the pairwise correlations between the eight chromatin peaks (peaks are identified by numbers), (ii) the genomic locations of the eight peaks and the exonic regions of the CMBL gene (exons in black and H3K27ac, H3K4me1, and H3K4me3 peaks in red, blue, green, respectively), and (iii) the positions of the minor alleles at variants overlapping peaks for each of the 358 tested individuals, sorted by increasing gene expression (low expression at the bottom). (B) Quantile-quantile plot (QQplot) of rare-eQTL P values (burden test). (C) QQplot of trans-eQTL P values in six datasets (EUR for Eurobats and SGX for this study). (D) Replication P values for the 27 significant trans-eQTLs. (E) Illustration of the four most likely causal models (with mean probability above 0.01) out of the 18 tested ones (see fig. S30).

TRHs mediate trans-eQTL effects

We then leveraged the interchromosomal associations between CRDs (i.e., TRHs) to discover trans-eQTL effects and study potential mechanisms for how they exert their effects on gene regulation. To this aim, we combined three different layers of associations: (i) aCRD-QTLs, (ii) interchromosomal associations between CRDs, and (iii) gene-CRD associations. This defined 20,489 candidate variant-gene pairs located on distinct chromosomes to be tested for association (23), a task we performed in six distinct datasets compiling data from two studies [the present study and EUROBATS (33)] and five tissues (blood, fat, LCLs, skin, and fibroblasts). Overall, we only found an association signal in the LCL datasets (Fig. 6C) and discovered nine trans-eQTLs that were well replicated across all LCL datasets (Table 1, Fig. 6D, and fig. S32B). By assembling all the relevant CRDs, genes, and trans-QTLs into networks, we found that these trans-eQTL signals relate to two distinct variants (rs2980236 and rs7758352), the former affecting a total of eight genes in trans, all of them connected by a single TRH. After extensive investigations of the propagation of the genetic effects using Bayesian network modeling (23), we hypothesized that these trans-eQTLs act on genes via interchromosomal cross-talk between REs (Fig. 6E, fig. S33, and table S1). We therefore attempted to experimentally test the physical proximity of the regions involved in three of the trans-eQTL signals with 2D DNA fluorescent in situ hybridization (FISH), but no evidence for physical proximity was obtained. Given the strong statistical support for the proposed causal scenario, it is likely that the CRDs either demonstrate physical proximity in a transient fashion difficult to be detected with 2D DNA FISH or there is a yet-unknown mechanism of trans signaling in the absence of mediating transcription that signals from one region to the other, maybe in an allelic fashion (as observed above in the context of TRHs).

Discussion

By analyzing genome-wide genotype, chromatin activity, and gene expression data, we reconstructed first- and higher-order functional links at the base of complex regulatory networks. Interindividual variability of regulatory activity is structured within thousands of CRDs, and we predict those to reflect the local nuclear organization of chromatin (34). They delimit the most active and variable regulatory units within TADs in which 3D contacts can be translated to functional links and result in coordinated activity between promoters and enhancers. Interindividual correlation of regulatory activity also occurs in between chromosomes and forms TRHs. The most likely mechanism underlying this observation is that TRHs emerge because interindividual variability of transcription factor levels, which is reflected at multiple binding sites of such transcription factors across the genome, leading to the observed coordination between the corresponding chromatin peaks. However, several lines of evidence suggest that this is not the only mechanism at play, and we hypothesize in this work that a non-negligible fraction of the trans associations between CRDs also results from higher-order 3D nucleus organization, as suggested by (i) the higher chance of CRD associations at strongly contacting regions, (ii) the compatibility of TRHs with the A and B compartmentalization of the chromatin in the nucleus, (iii) the signal of trans allelic coordination observed in TRHs, and (iv) the causal signal inferred for the trans-eQTLs we discovered. This hypothesis is in line with recent models of trans interactions (28, 35), phase separation (36, 37), and gene activation (38). Both CRDs and TRHs are key factors involved in the regulation of genes because they are commonly associated with the expression of genes and provide a mechanistic rationale for the coexpression of genes in cis and trans commonly reported in the literature (39). CRDs are under tight genetic control, with effects that either propagate along the functional links they contain (i.e., modulating the activity of CRDs) or, more rarely, disrupt them (i.e., affecting the correlation structure of CRDs), resulting in dynamic changes of the local regulatory circuitry in the population and, in time, by means of natural selection. Assessing whether these genetic effects also involve changes in the chromatin architecture remains to be studied and would likely require population-scale chromatin conformation data. Having characterized the coordination between REs helps to provide mechanistic insights for eQTLs. Specifically, our work suggests that the regulatory coordination within CRDs can (i) mediate long-range genetic effects on genes (i.e., distal eQTLs), (ii) combine multiple genetic effects on a single gene (i.e., independent eQTLs), and (iii) mediate the collective effect of multiple rare variants onto genes (i.e., rare-eQTLs). The latter allows the extension of the widely used burden analysis of rare variants to noncoding regions, typically limited to gene and exon annotations. Regulatory coordination between CRDs can also mediate genetic effects on genes located on different chromosomes (i.e., trans-eQTLs), and we predict that this may allow for the detection of a new class of trans-eQTLs that may not be transcriptionally mediated.

By drawing a dense and high-resolution genome-wide map of the interplay between genetic variations, REs, and gene expression across two cell types, we provide (i) a reference resource to better interpret disease and quantitative trait–associated variants, (ii) a framework to integrate functional genomics of the 3D nucleus and population variability, (iii) a model of how genetics can affect the cis and trans functional links between REs, and (iv) informative priors to enhance present and future association studies. Future studies using such population-scale approaches with larger sample sizes should allow for the capture of even weaker effects, whereas those using multiple tissues and affected patients will permit building cross-tissue patterns of genetically perturbed regulatory networks and characterizing disease-specific regulatory signatures, respectively.

The consideration of genetic effects at local and global scales, as performed in this study, offers mechanistic insights that allow us to start assessing the true cellular impact of genetic variants individually, as well as altogether, and their impact not only at the gene level but also at the cellular level. Overall, our work bridges discoveries made by dissecting the genetic basis of chromatin activity and gene expression variability with those of genome-wide studies focusing on the structural properties of the chromatin and the 3D genome organization. This brings us closer to fully dissecting the properties and interactions of genetic variants defining organismal phenotypes via intra- and intercellular interactions.

Methods

Sample collection and experimental methods

We collected samples from three different sources: (i) 46 LCLs that we generated as part of a previous study (19), (ii) 111 LCLs from the 1000 Genomes Project (40), and (iii) 160 LCLs and 78 fibroblasts from the GenCord collection (41). Altogether, this gives a total of 317 LCLs and 78 fibroblasts all from individuals of European ancestry. We cultured all cells for which no data was already available (271 LCLs and 78 fibroblasts), performed chromatin immuno-precipitation for three different histone modifications (H3K27ac, H3K4me1, and H3K4me3), and extracted total RNA. We sequenced all resulting libraries on a HiSeq2000 machine. In addition, we also collected genotype data for all samples by merging existing sequence data from the 1000 Genomes project (n = 145) with genotype data we generated using Illumina Human OMNI 2.5M SNP arrays (n = 178).

Molecular phenotype and genotype data preparation

All sequence data was mapped onto the human genome (hg19) with either BWA (42) for ChIP-seq data or GEM (43) for RNA-seq data. Gene expression was quantified using QTLtools (44) with GENCODE v19 (45) as reference annotation and filtered in order to only retain protein-coding genes and lincRNAs expressed in more than 90% of the samples. For the chromatin assays, we proceeded in three steps: (i) We built a population-scale sequence file for each ChIP-seq assay by subsampling 1 million reads from 50 LCLs and 50 fibroblasts, (ii) we performed peak calling on the three resulting consensus sequence files using Homer v4.7 (46), and (iii) we then quantified each sample in turn using the resulting consensus peaks coordinates as a reference. To account for confounding factors, we regressed out the following covariates from the gene and chromatin peak quantifications: sex, cohort of origin, genotyping platform, ancestry (captured by PCA on the genotype data), and technical variables (captured by PCA on the phenotype data). For the latter, we used the number of PCs that maximizes the number of cQTLs discovered. To remove potential outlier effects in association testing, we also normalized each molecular phenotype so that the quantifications across individuals match a normal distribution with mean 0 and standard deviation 1. Finally, we merged all chromatin assays together to obtain a total of four quantification matrices: (i) expression at 18,939 genes for 317 LCLs, (ii) activity at 271,467 chromatin peaks for 317 LCLs, (iii) expression at 18,068 genes for 78 fibroblasts, and (iv) activity at 271,467 chromatin peaks for 78 fibroblasts. Concerning the genotype data, we merged together all genotype coming from the SNP array and filtered variants using standard procedures to remove low-quality SNPs (i.e., based on call rate, Hardy-Weinberg, minor allele frequency, and frequency discordance with 1000 Genomes). The resulting genotype matrix was imputed from the 1000 Genomes phase 3 reference panel (40), poorly imputed variants were removed, and the rest were combined with the genotype data coming from the 1000 Genomes sequencing data. This gave us genotype data for 323 individuals at 9,255,024 variants.

Mapping molecular quantitative trait loci

We mapped molecular QTL for each molecular phenotype using the standard procedure implemented in the QTLtools software package (44). To summarize, we performed (i) 1000 permutations to correct for the number of genetic variants being tested in cis per phenotype (±1 Mb from the phenotype boundaries) and (ii) a false discovery rate procedure implemented in the R/qvalue package (47) to correct for the number of phenotypes being tested genome-wide.

Building correlation and contact maps

We built correlation maps for the chromatin assays by systematically measuring interindividual correlation (i.e., Pearson correlation coefficient) between any pair of chromatin peaks located on the same chromosome (intrachromosomal within a 250-peak sliding window) or on distinct chromosomes (interchromosomal). In multiple analyses, we measured the percentage of correlation P values exhibiting association according to various parameters of interest (e.g., distance between peaks or types of peaks) by simply computing the number of significant associations obtained at a given FDR threshold [usually 1% FDR given by the R/qvalue package (47)] divided by the number of statistical tests performed. In addition, we built peak-centered Hi-C contact maps by interpolating the contact intensities between any pair of chromatin peaks from a deep Hi-C experiment made for LCLs (22, 23). Overall, this means that between any two chromatin peaks, we got two pairwise measures that we can compare: (i) the interindividual correlation between activity levels and (ii) the contact intensity interpolated from Hi-C data.

Calling and quantifying CRDs

We called CRDs from the correlation data in two steps. First, we performed hierarchical clustering on the chromatin data to get a binary tree that regroups the chromatin peaks for each chromosome depending on the correlation levels they exhibit. Second, we cut the resulting binary trees at the level maximizing the total correlation mass captured by the clusters it defines (i.e., CRDs). We did this using three empirical criteria: (i) overall correlation that requires the mean level of correlation in a CRD to be at least twice the background, (ii) edge correlation that requires the mean level of correlation of the peaks at the CRD boundaries to be at least twice the background, and (iii) a requirement that the CRD covers at least two nonoverlapping DNA regions. We then quantified CRDs on a per-individual basis at two levels: (i) We measured the overall activity in a CRD by taking the averaged quantification per individual across all peaks contained in the CRD, and (ii) we measured the structure of a CRD by assessing the contribution of each individual onto the mean correlation within the CRD. We did this by using a leave-one-out approach: We take out each individual in turn from the data, recompute the overall correlation within the CRD, and assign the resulting value as quantification for the individual we removed. Individuals highly contributing to the correlation have low values, whereas those contributing poorly have high values.

Replication and validation of QTLs and CRDs

We corroborated molecular QTLs and CRDs at two levels. First, we attempted to replicate them into an independent dataset, Blueprint (24), that comprises genotype data and ChIP-seq data (H3K27ac and H3K4me1) for up to 165 individuals across three immune cell types: neutrophils, monocytes, and T cells. Specifically, we replicated (i) the cQTLs, (ii) the sCRD-QLTs, (iii) the correlation between nearby peaks, and (iv) the correlation between CRDs located on distinct chromosomes. Depending on the analysis, we either performed peak calling and quantification on the raw sequencing data of Blueprint or just quantified our set of chromatin peaks into the Blueprint data. Second, we attempted to validate some discoveries by looking at how they translate in terms of allelic coordination. To do so, we called allele-specific effects (both alleles covered and more than 10 covering reads) into a subset of 145 individuals for which we have both ChIP-seq data and whole-genome DNA sequence data from the 1000 Genomes project. Then, we checked (i) if the correlation between peaks in cis and trans translates into with allelic coordination between them and (ii) how CRD-QTL genotypes affect the allelic coordination within CRDs. We measured allelic coordination by computing the Spearman correlation between relevant pairs of ASE calls. For cis analyses, we defined an ASE call at a position as the proportion of reads mapping to the first haplotype (the score therefore ranges between 0 and 1). For the trans analysis, we defined an ASE call at a position as the proportion of reads mapping to the least-covered allele (the score ranges between 0 and 0.5). Precise details on these analyses are given in the supplementary materials.

Mapping genes and QTLs for CRDs

Having quantified the CRDs, we searched for associations with nearby genes and genetic variants using the permutation procedure implemented in the QTLtools software package (44). This allowed us to efficiently correct for the multiple genetic variants, genes, and CRDs tested genome-wide. Of note, we only searched for associations within 1 Mb of the CRD boundaries. For the genes, we used as genomic coordinates the position of the TSSs. In some cases, we complemented this step by conditional analysis, as implemented in the forward-backward scan of QTLtools in order to find multiple independent associations per phenotype.

Causality inference

To infer the causal relationships between QTLs, CRDs, and genes, we proceeded in two steps. First, we only focused on QTLs affecting both CRDs and genes. To get these, we quantified each gene-CRD pair using a PCA and mapped QTLs using the coordinates on PC1 as quantifications. As a by-product, this gives QTL-CRD-gene triplets onto which we performed Bayesian network analysis to get the posterior probabilities for the three possible causal models we considered in this study: (i) the causal model with QTL > CRD > gene, (ii) the reactive model QTL > gene > CRD, and (iii) the independent model with QTL > CRD and QTL > gene. This has been implemented using the R/bnlearn package (48).

CRD-based burden test

We designed a burden test at the regulatory level that leverages CRDs. This test relies on two steps: (i) We measured the burden of rare mutations for each CRD as the number of rare alleles at variants with a MAF < 5% falling within the REs of the CRD, and (ii) we tested this burden with expression at relevant genes. By relevant genes, we mean genes that have been previously found to be associated with the CRD of interest. We performed this analysis on the GEUVADIS dataset for which we have whole-genome and transcriptome data for 358 individuals (5).

TRH-driven trans-eQTL mapping

We built a map of gene-variant pairs located on distinct chromosomes to be tested for association by leveraging the interchromosomal CRD-CRD associations we discovered in this study. We assembled together (i) gene-CRD associations (cis links), (ii) CRD-CRD associations (trans links), and (iii) QTL-CRD associations (cis links), which effectively forms 20,489 gene-variant pairs to be tested in trans. Then, instead of performing the billions of tests required by standard trans-eQTL analysis (millions of variants times thousands of genes), we only focused on the 20,489 pairs that this approach helped to assemble. We performed these tests in multiple data sets [e.g., EUROBATS (33)] and extracted the significant hits we obtained at 5% FDR. To infer how genetic effects propagate from the genetic variants to the genes located on distinct chromosomes, we used Bayesian networks (48). Specifically, we enumerated all possible five-tuples regrouping the QTL variants, the cis and trans CRDs, and the cis and trans genes. Then, we enumerated 18 possible causal models for the genetic effects propagation. These 18 models allow two contrasting hypotheses: The genetic effects propagate to the gene in trans through the coordination between CRDs or through the gene in cis. Finally, we computed the posterior probability of each model using Bayesian networks.

Supplementary Materials

science.sciencemag.org/content/364/6439/eaat8266/suppl/DC1

Materials and Methods

Figs. S1 to S37

Tables S1 to S4

References (5162)

References and Notes

  1. See supplementary materials.
Acknowledgments: Funding: This work was supported by grants from SystemX.ch, the Swiss Initiative in Systems Biology (SysGenetiX grant 3826), the Swiss National Science Foundation (SNSF) Sinergia grant CRSI33_130326 (to E.T.D. and A.R.), European Research Council grant and Louis-Jeantet Foundation support (to E.T.D.), and SNSF grants (31003A_170096 to E.T.D., 31003A_160203 to A.R., and 163180 to S.E.A.). K.P. was partially supported by the 5 Top 100 Russian Academic Excellence Project at the Immanuel Kant Federal Baltic University and Russian Foundation for Basic Research (project 18-29-13055). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: E.T.D. is chairman and member of the board of directors of Hybridstat Ltd. Author contributions: E.T.D., A.R., S.E.A., S.B., and P.B. designed and supervised the study and contributed to data interpretation. C.B., P.R., and D.H. cultured and processed the fibroblast and lymphoblastoid cell lines. M.Z., G.G., and M.W. designed and executed the ChIP experiments. C.B., L.R., and D.B. prepared ChIP- and RNA-seq libraries and executed the ChIP- and RNA-seq experiments. E.F. and C.B. performed the genotyping. O.D. designed and executed the primary data analysis. G.R., C.H., S.K., H.O., K.P., D.M., and G.A. contributed to data analysis. O.D., E.T.D., and A.R. performed the primary manuscript writing with contributions from S.E.A. Data and materials availability: RNA- and ChIP-seq data for all 1000 Genomes samples, generated as part of the 2015 Waszak et al. study (19) and this study, have been deposited in the ArrayExpress Archive at EMBL-EBI (www.ebi.ac.uk/arrayexpress) under accession numbers E-MTAB-3656 and E-MTAB-3657, respectively. Genotype data for these samples are available on the 1000 Genomes FTP server: sequenced individuals (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) and genotyped individuals (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20120131_omni_genotypes_and_intensities/). RNA-seq, ChIP-seq, and genotype data for the Gencord samples have been deposited at the European Genome-phenome Archive, which is hosted by the EBI and CRG, under accession number EGAS00001003485. Processed data (cQTLs, eQTLs, aCRD-QTLs, sCRD-QTLs, chromatin peaks, and CRDs) are hosted on Zenodo (49). The code underlying CRD analyses is hosted on Zenodo (50).
View Abstract

Stay Connected to Science

Navigate This Article