Five-Vertebrate ChIP-seq Reveals the Evolutionary Dynamics of Transcription Factor Binding


Science  21 May 2010:
Vol. 328, Issue 5981, pp. 1036-1040
DOI: 10.1126/science.1186176


Transcription factors (TFs) direct gene expression by binding to DNA regulatory regions. To explore the evolution of gene regulation, we used chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq) to determine experimentally the genome-wide occupancy of two TFs, CCAAT/enhancer-binding protein alpha and hepatocyte nuclear factor 4 alpha, in the livers of five vertebrates. Although each TF displays highly conserved DNA binding preferences, most binding is species-specific, and aligned binding events present in all five species are rare. Regions near genes with expression levels that are dependent on a TF are often bound by the TF in multiple species yet show no enhanced DNA sequence constraint. Binding divergence between species can be largely explained by sequence changes to the bound motifs. Among the binding events lost in one lineage, only half are recovered by another binding event within 10 kilobases. Our results reveal large interspecies differences in transcriptional regulation and provide insight into regulatory evolution.

The relationship between genetic sequence and transcriptional regulation is central to understanding species-specific biology, disease, and evolution (1). Identifying the divergence and conservation among functional regulatory elements is an important goal of comparative genomic research, and this is often done via DNA sequence comparisons using distant (2) and closely related species (3). Although both approaches have successfully identified conserved regulatory regions, the majority of transcription factor (TF) binding events can change rapidly between closely related species, making them difficult to detect using DNA sequence alone (4–7). For instance, the experimentally determined binding events for homologous TFs found in mouse and human livers are unlikely to align with each other (7), despite conservation of their functional targets (8) and global liver transcription (9). The evolution of mammalian transcriptional regulation remains largely unexplored beyond limited mouse-human comparisons.

We therefore identified the genome-wide binding of two TFs: (i) CCAAT/enhancer-binding protein alpha (CEBPA) in the livers of species representing five vertebrate orders: human (primate), mouse (rodent), dog (carnivora), short-tailed opossum (didelphimorphia), and chicken (galliformes); and (ii) hepatocyte nuclear factor 4 alpha (HNF4A) in livers from humans, mice, and dogs. Chromatin immunoprecipitation experiments were combined with high-throughput sequencing (ChIP-seq) using healthy, nutritionally unstressed adult livers from the heterogametic sex as a functionally and transcriptionally conserved homologous tissue type (Fig. 1 and fig. S1) (8, 10).

Fig. 1

CEBPA binding in vivo in livers isolated from five vertebrate species cross-mapped to the human PCK1 gene locus. A rare ultraconserved binding event is shown surrounded by species-specific and partially shared binding events. On the left is the evolutionary tree of the five study species (Hsap, Homo sapiens; Mmus, Mus musculus; Cfam, Canis familiaris; Mdom, Monodelphis domestica; Ggal, Gallus gallus), with their approximate divergence times in millions of years ago (MYA). The bottom track shows evolutionary conservation measured across 44 vertebrate species, and darker shading represents slower evolution.

CEBPA and HNF4A were selected as representative TFs within the liver-specific regulatory network, because both are conserved and constitutively expressed with well-characterized target genes (10, 11). In addition, they represent distinct TF classes, and the DNA binding domains of each factor’s orthologs are nearly identical among the study species (fig. S2).

The genomic TF occupancy data were reproducible between different individuals of the same species (fig. S3) and were validated by using alternative antibodies (fig. S4). Using a mouse carrying a human chromosome, we confirmed that genetic sequence, and not diet, lifestyle, or environment, is the primary determinant of liver-specific TF binding (fig. S5) (12). Given the greater evolutionary distance to opossum and chicken, contributions from nongenetic sources could be higher in those vertebrates.

We identified TF-bound regions using a dynamic programming algorithm, and our results were robust to different peak-calling thresholds (figs. S6 to S8) (13). To detect TF binding events shared among any combination of the five vertebrates, we used the Ensembl 12-way multispecies alignment (14), which incorporates approximately half of each species’ genome into global alignments. Our findings did not substantially change with an alternate methodology that used pairwise alignments in a separate algorithm (figs. S6 to S8) (13).

Each TF bound between 16,000 and 30,000 locations in each mammalian genome; CEBPA bound approximately half this number in the smaller chicken genome (Fig. 2 and figs. S6, S7, and S9). For both factors, less than a quarter of bound regions were within 3 kb of known transcription start sites (TSSs). Between 30 and 50% of the binding sites of the two TFs overlapped in the genome (table S1). These overlapping sites did not exhibit substantially different characteristics in the conservation of underlying genetic sequence than the sites of CEBPA and HNF4A did when considered individually.
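The promoter-proximal fraction quoted above can be illustrated with a minimal sketch. It assumes each binding event is reduced to a single summit coordinate and that TSS positions are available as sorted lists per chromosome; the function name `fraction_near_tss` and the toy coordinates are hypothetical and are not taken from the study's pipeline.

```python
from bisect import bisect_left


def fraction_near_tss(peaks, tss_by_chrom, window=3000):
    """Return the fraction of peaks whose summit lies within `window` bp of a TSS.

    peaks:        list of (chrom, summit_position) tuples (hypothetical layout)
    tss_by_chrom: dict mapping chrom -> sorted list of TSS positions
    """
    if not peaks:
        return 0.0
    near = 0
    for chrom, summit in peaks:
        tss = tss_by_chrom.get(chrom, [])
        i = bisect_left(tss, summit)
        # The nearest TSS is either just left or just right of the insertion point.
        candidates = tss[max(i - 1, 0): i + 1]
        if any(abs(summit - t) <= window for t in candidates):
            near += 1
    return near / len(peaks)


# Toy data: one of four peaks falls within 3 kb of a TSS.
toy_peaks = [("chr1", 1000), ("chr1", 10000), ("chr2", 500), ("chr1", 7500)]
toy_tss = {"chr1": [2000, 20000]}
print(fraction_near_tss(toy_peaks, toy_tss))
```

The binary search keeps the lookup fast for tens of thousands of peaks, which matches the scale of the binding data reported here.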

Fig. 2

Conservation and divergence of TF binding. For (A) CEBPA and (B) HNF4A, the pairwise distribution and numbers of binding events are shown as a pie chart distributed into the following segments: intergenic (red), intronic (yellow), exonic (blue), and promoter (TSS ±3 kb) (green) regions. The left-most column contains the distributions of the bulk genomes. The right-most pie chart represents all binding events in each species, with the total number of alignable peaks above the total peaks (in parentheses). (C and D) Multispecies CEBPA and HNF4A binding event analysis, where black circles indicate binding in a given species. For instance, there are 764 regions bound by CEBPA only in dog and human (see also figs. S6, S7, and S17 and tables S2 and S6). (E) The DNA sequence constraint beneath binding events was measured by average GERP (20) scores for peaks found: in all five species (5-way), among all the placental mammals (3-way), bound in any two species (shared), within 10 kb of the TSS of functional targets (functional), and all peaks.

For these two liver-specific TFs, binding events appear to be shared 10 to 22% of the time between mammals from any two of the three placental lineages we profiled, separated by approximately 80 million years of evolution (figs. S6 and S7). This result reveals a rapid rate of evolution in transcriptional regulation among closely related vertebrates. Nevertheless, the number of CEBPA and HNF4A TF binding events shared between any two of our five study species is far greater than could have occurred by chance (fig. S10).
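As an illustration of how a pairwise sharing percentage of this kind can be computed, the sketch below assumes that peaks from two species have already been projected into a common alignment coordinate system and treats any interval overlap as a shared event. The function names, the overlap criterion, and the toy intervals are assumptions made for illustration; the study's own procedure is described in its supporting material.

```python
def overlaps(a, b):
    """True if two (block, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def shared_fraction(peaks_a, peaks_b):
    """Fraction of peaks in species A that overlap any peak in species B.

    Both inputs are lists of (alignment_block, start, end) tuples expressed in
    the same alignment coordinates (a simplifying assumption of this sketch).
    """
    if not peaks_a:
        return 0.0
    shared = sum(1 for p in peaks_a if any(overlaps(p, q) for q in peaks_b))
    return shared / len(peaks_a)


# Toy data: one of five species-A peaks aligns with a species-B peak.
species_a = [("aln1", 100, 300), ("aln1", 1000, 1200), ("aln2", 50, 150),
             ("aln3", 0, 100), ("aln3", 500, 600)]
species_b = [("aln1", 250, 400), ("aln2", 200, 300)]
print(shared_fraction(species_a, species_b))
```

The all-against-all comparison is quadratic and is kept only for readability; an interval tree or a sorted sweep would be the usual choice at genome scale.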

We used the genome-wide binding of CEBPA in opossum to test the hypothesis that regulatory regions have diverged substantially between eutherian and metatherian mammals (15). Opossum indeed showed dramatic changes in TF binding, and only between 6 and 8% of the genomic regions that are occupied by CEBPA in opossum liver align with CEBPA binding events also found in mouse, dog, and/or human livers. This divergence was even greater in chickens, which shared only 2% of CEBPA binding with humans, demonstrating extensive and continuous rewiring of gene regulation during vertebrate evolution that corresponds to evolutionary distance.

Ultraconserved noncoding regions are revealed by comparative genomic sequencing (16). We identified ultrashared interactions between CEBPA and the vertebrate genome as binding events that were preserved over the 300 million years of evolution and thus were found in aligned positions in all five species: human, mouse, dog, opossum, and chicken. Using our most stringent threshold, a set of 35 binding events were found to be shared by all five vertebrate species, and these binding events are almost invariably near genes that are central to liver-specific biology (Fig. 2C, tables S2 and S3). Although these ultrashared binding events are close to important liver-specific genes, they make up less than 0.3% of the total CEBPA binding found in humans.

About 250 direct functional HNF4A target genes have recently been identified by using multiple independent methodologies in mouse and human, including perturbation analysis in both species (8). We experimentally identified a similar set of transcriptional target genes whose expression is dependent on CEBPA in adult mouse livers by using a conditional knock-out strategy (17). In mammals, the target genes for both TFs have a disproportionate fraction of binding events that are shared in at least two species (P value < 1 × 10⁻⁵) (table S4). CEBPA binding near direct target genes did not overlap with the binding events shared by five species.

We further compared our results to a set of 53 regulatory sequences within known, authentic liver enhancers in humans (table S5) (18). Thirty-eight of these regulatory sequences were located within nine HNF4A-bound regions. CEBPA binding overlapped with five of these HNF4A-bound regions, and we also found that five of the nine HNF4A binding events were bound by HNF4A in more than one species. Overall, these findings suggest that functional targets are enriched for TF binding events found in multiple species.

Mammalian TF binding studies have suggested that functional enhancers show increased sequence constraint (19). As expected, the relatively few binding events shared among three or five species showed increased sequence constraint. The sequence constraint, which was evaluated by genomic evolutionary rate profiling (GERP) scores (20), in bound regions near functional targets was similar to that for all bound regions for both TFs, and these results were robust to the method applied. Regions bound by both CEBPA and HNF4A have sequence constraint patterns similar to those found for each factor analyzed independently (Fig. 2E and fig. S11). In sum, TF binding events near functional targets showed enhanced sharing between species, without a corresponding increase in sequence constraint.

DNA binding specificities of TFs show remarkable diversity and complexity (21), yet few studies have compared specificities of orthologous TFs among multiple species. The motifs we directly determined from experimental binding data showed that in vivo bound consensus sequences remained virtually unchanged during vertebrate evolution, despite most binding events being species-specific (Fig. 3A and fig. S12). Neither the quality of a bound motif, as determined by its similarity to the consensus, nor the regional ChIP enrichment, as measured by sequencing read depth, was correlated with the conservation of TF binding events (fig. S13).

Fig. 3

DNA binding specificities of CEBPA and HNF4A are highly conserved during vertebrate evolution. (A) The known sequence motifs were identified de novo in each species interrogated (13), and found within almost all binding events (fig. S12). (B) Multiple aligned motif occurrences are highly associated with binding events shared among three or more species. Peaks are categorized by the number of species in which they are shared, and the fraction of peaks with 0 (blue), 1 (gray), and 2 or more (red) aligned motifs are shown.

Searching for the sequence features that are associated with shared binding events, we discovered that binding events shared by more species contain more aligned motifs (Fig. 3B). These shared regions represent examples of deeply conserved regulatory architecture featuring multiple motifs at specific sequence locations maintained through vertebrate evolution. The most conserved of these, the five-way ultrashared sites, also exhibit the strongest sequence constraint (Fig. 2E).

Fig. 4

Lineage-specific loss and turnover of TF binding events. (A) The unbound regions in each placental mammal that align to regions showing TF binding in the other two placental mammals were collected, and the mechanisms by which the underlying motifs were disrupted were summarized. (B) Turnovers occurred near lineage-specific lost binding events approximately half the time; shared turnovers represent cases where a cluster of binding events likely occurred in a common ancestor (fig. S16).

To explore the genetic mechanisms underlying the divergence of TF binding, we identified potentially lost CEBPA and HNF4A binding events. A binding event was assumed to be lost if it was not present in one placental mammal yet was experimentally found at aligned, orthologous regions in the other two placental mammals. Using parsimony, this situation is best explained by an ancestral TF binding event present before the mammalian radiation that was subsequently lost along one lineage.
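The parsimony rule stated above can be written down as a short sketch. It assumes a hypothetical table that maps each aligned region to the set of species in which binding was observed; the names `PLACENTALS` and `infer_lost_events` are illustrative only.

```python
PLACENTALS = ("human", "mouse", "dog")


def infer_lost_events(occupancy):
    """Map each region bound in exactly two placental mammals to the third.

    occupancy: dict of region_id -> set of species with an observed binding
               event at that aligned region (hypothetical data layout).
    Returns a dict of region_id -> placental lineage in which binding is
    inferred to have been lost.
    """
    lost = {}
    for region, bound in occupancy.items():
        missing = [s for s in PLACENTALS if s not in bound]
        # Bound in two placentals and absent from one: by parsimony, an
        # ancestral event that was lost along the unbound lineage.
        if len(missing) == 1:
            lost[region] = missing[0]
    return lost


toy_occupancy = {
    "r1": {"human", "mouse"},           # lost in dog
    "r2": {"human", "mouse", "dog"},    # shared by all three, no loss
    "r3": {"dog"},                      # lineage-specific, not a loss
    "r4": {"mouse", "dog", "opossum"},  # lost in human
}
print(infer_lost_events(toy_occupancy))
```

Binding in non-placental species, such as opossum in region `r4`, is ignored in this sketch because the rule is defined over the three placental mammals only.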

The lost binding events were categorized by the sequence changes to the alignable binding motifs within the orthologous regions of the other species (Fig. 4). Between 20 and 40% of the motifs associated with lineage-specific binding event losses were unchanged. These regions may represent cases of epigenetic redirection, yet-to-be characterized single-nucleotide polymorphisms (SNPs) or indels, or loss of nearby genomic binding partners. A larger fraction of the absent binding events was associated with motifs whose disruption could be assigned to base pair substitutions, indels, and gaps in the alignment. Across all the vertebrate species, indels appear to be associated with loss of the underlying sequence motif a third as often as mismatches. A four-mammal analysis, using opossum as an outgroup, afforded similar results (fig. S14). Analogous mechanisms appear to explain species-specific gains of TF binding events (fig. S15). Taken together, the steady accumulation of small changes in the genetic sequence appears to rapidly remodel thousands of TF binding sites in mammals.
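A simplified version of this categorization is sketched below. It assumes that the bound motif and the orthologous sequence from the unbound species are given as two rows over the same alignment columns, with `-` marking a gap character, and that a row consisting only of gaps (or `None`) means the region did not align. The category labels and the function `classify_motif_change` are hypothetical simplifications of the classes summarized in Fig. 4A.

```python
from collections import Counter


def classify_motif_change(bound_motif, ortholog):
    """Classify how an aligned motif differs in the species lacking binding.

    bound_motif: aligned motif row from a species with binding
    ortholog:    aligned row from the unbound species, or None if unaligned
    """
    if ortholog is None or set(ortholog) <= {"-"}:
        return "alignment gap"
    if len(bound_motif) != len(ortholog):
        raise ValueError("rows must cover the same alignment columns")
    if "-" in bound_motif or "-" in ortholog:
        return "indel"
    if bound_motif.upper() != ortholog.upper():
        return "substitution"
    return "unchanged"


# Toy aligned pairs; the sequences are made up for illustration.
toy_pairs = [
    ("TTGCGCAA", "TTGCGCAA"),
    ("TTGCGCAA", "TTGAGCAA"),
    ("TTGCGCAA", "TTG--CAA"),
    ("TTGCGCAA", "--------"),
]
print(Counter(classify_motif_change(m, o) for m, o in toy_pairs))
```

An "unchanged" call in this scheme corresponds to the cases discussed above in which binding is absent even though the aligned motif is intact.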

Approximately half of lineage-specific losses of TF binding showed evidence of nearby compensatory binding events (Fig. 4B). A quarter of species-specific losses had a nearby (±10 kb) gained binding event that is unique to the same lineage (unshared turnover), and an additional quarter of the losses had a nearby binding event that is shared in one or more other species (shared turnover) (fig. S16). The latter case suggests the existence of a cluster of binding events in the common ancestor. In both cases, the probability of finding a turnover decreased rapidly with distance from the loss (fig. S16), but a shared turnover was typically closer to the site of the loss than was an unshared turnover [P value < 1.0 × 10⁻¹⁰ (CEBPA) and P value < 1 × 10⁻¹⁵ (HNF4A)].
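The two turnover classes can be illustrated with a minimal sketch. It assumes that every binding event in the lineage with the loss is annotated with the set of species sharing it, and, as a simplification not taken from the text, that the nearest event within the ±10 kb window decides the class when several are present. The function name `classify_turnover` and the toy coordinates are hypothetical.

```python
def classify_turnover(loss, lineage, events, window=10_000):
    """Classify a lost binding event by its nearest compensatory event.

    loss:    (chrom, position) of the lost event in the lineage's genome
    lineage: species in which the loss occurred
    events:  list of (chrom, position, species_set) binding events, where
             species_set holds every species sharing that aligned event
    """
    chrom, pos = loss
    candidates = [
        (abs(p - pos), species)
        for c, p, species in events
        if c == chrom and lineage in species and abs(p - pos) <= window
    ]
    if not candidates:
        return "no turnover"
    # Assumption of this sketch: the closest event within the window wins.
    _, species = min(candidates, key=lambda item: item[0])
    return "unshared turnover" if species == {lineage} else "shared turnover"


toy_events = [
    ("chr1", 105_000, {"dog"}),
    ("chr1", 98_000, {"dog", "human"}),
    ("chr1", 150_000, {"dog"}),
    ("chr2", 100_500, {"dog"}),
]
print(classify_turnover(("chr1", 100_000), "dog", toy_events))
print(classify_turnover(("chr1", 145_000), "dog", toy_events))
print(classify_turnover(("chr1", 200_000), "dog", toy_events))
```

Recording the distance to the chosen event as well would allow the distance comparison between shared and unshared turnovers to be reproduced on the same data structure.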

Understanding the evolutionary dynamics of TF binding is essential to understanding the evolution of gene regulation. Many comparative genomics approaches assume that a multispecies alignment of a high-quality motif is indicative of functionality (20, 22–27). Our analysis of experimentally determined in vivo occupancy of two TFs in multiple vertebrates revealed apparent limitations to this model and a number of other insights about the complex relationship between genetic sequence, TF binding, and genome regulation.

First, the vast majority of ChIP-identified TF binding events are unique to each vertebrate species; in mammals, the binding events that occur within species-specific, repetitive DNA are more common than conserved binding events. Second, ultrashared TF binding events, which are the functional counterpart of ultraconserved sequences, appear rarely in vivo among all five vertebrates. Third, only approximately half of the binding events that are lost in one placental mammal yet present in at least two others are potentially recovered by nearby turnover events. Fourth, neither motif quality nor strength of TF binding correlates with conservation of a TF’s genomic occupancy. Alterations in the DNA binding specificity of CEBPA and HNF4A cannot account for rapid binding divergence, nor can species-specific environmental differences (12).

Nevertheless, comparing binding events within 10 kb of the TSS of experimentally determined target genes of CEBPA and HNF4A has shown that binding events near these genes are more likely to be shared with other species, although this does not correspond to an increase in sequence constraint. In fact, the set of the ultrashared, five-way binding events is entirely disjoint from the set of genes that are directly dependent on CEBPA in adult liver. For HNF4A, only 6% of binding events shared across three placental mammals (Fig. 2D) are near the highest-quality functional target genes, namely, those genes that depend on HNF4A for proper expression in both mouse and human. Given that most TFs are active in multiple cell types (28), it is possible that the remaining shared sites are active in other tissues or other developmental stages. Indeed, the ultrashared CEBPA binding events are uniformly found near liver-specific genes that would be expected to be up-regulated upon liver organogenesis. Conversely, those binding events near functional targets in adult liver that are neither shared nor show signs of sequence constraint may represent lineage-specific regulatory interactions.

The preponderance of species-specific binding and the rapid lineage-specific loss of binding events suggest that a sizeable majority of specific TF-DNA interactions could be evolving neutrally. Liver-specific TFs and subsequent gene expression are both highly conserved; the rapid gain and loss of binding events may be indicative of compensatory changes that maintain local concentrations of TF binding near functional targets (29). Indeed, a recent computational approach that uses a high concentration of TF binding motifs, regardless of their alignment, showed improved ability to predict regulatory interactions (30).

Despite the rapid gain and loss of TF binding events in mammals, tissue-specific gene regulation seems to be maintained by identifiable regulatory architectures that can be independent of sequence constraint.

Supporting Online Material

Materials and Methods

Figs. S1 to S17

Tables S1 to S7


References and Notes

  1. Materials and methods are available as supporting material on Science Online.
  2. We thank N. Matthews and J. Hadfield at the Cambridge Research Institute (CRI) Genomics Core; the CRI Bioinformatics Core; W. Howat and the Histopathology Core; T. Davidge, S. Ballantyne (CRI); A. Enright, S. Wilder, and J. Herrero (European Bioinformatics Institute). This work was supported by the European Research Council Starting Grant, the European Molecular Biology Organization Young Investigator Award, Addenbrooke’s Biomedical Research Centre, Hutchinson Whampoa (D.T.O.); Swiss National Science Foundation (C.K.); University of Cambridge (D.S., M.D.W., and D.T.O.); Cancer Research UK (D.S., M.D.W., G.B., C.K., and D.T.O.); the Wellcome Trust (grant nos. WT062023 and WT079643) (B.B. and P.F.); and European Molecular Biology Laboratory (P.C.S. and P.F.). ChIP-seq experiments were deposited into ArrayExpress under the accession number E-TABM-722. CEBPA knockout gene expression experiments were deposited into ArrayExpress under the accession number E-MTAB-178. Author contributions: D.S., M.D.W., and D.T.O. designed experiments; D.S., M.D.W., C.K., S.W., and C.P.M.-J. performed experiments; D.S., M.D.W., B.B., G.B., P.C.S., and P.F. analyzed the data; S.M., C.P.M.-J., I.T., and A.M. provided tissues; D.S., M.D.W., B.B., I.T., P.C.S., P.F., and D.T.O. wrote the manuscript. P.F. and D.T.O. oversaw the work.