Chromatin-Associated Periodicity in Genetic Variation Downstream of Transcriptional Start Sites


Science  16 Jan 2009:
Vol. 323, Issue 5912, pp. 401-404
DOI: 10.1126/science.1163183


Might DNA sequence variation reflect germline genetic activity and underlying chromatin structure? We investigated this question using medaka (Japanese killifish, Oryzias latipes), by comparing the genomic sequences of two strains (Hd-rR and HNI) and by mapping ∼37.3 million nucleosome cores from Hd-rR blastulae and 11,654 representative transcription start sites from six embryonic stages. We observed a distinctive ∼200–base pair (bp) periodic pattern of genetic variation downstream of transcription start sites; the rate of insertions and deletions longer than 1 bp peaked at positions of approximately +200, +400, and +600 bp, whereas the point mutation rate showed corresponding valleys. This ∼200-bp periodicity was correlated with the chromatin structure, with nucleosome occupancy minimized at positions 0, +200, +400, and +600 bp. These data exemplify the potential for genetic activity (transcription) and chromatin structure to contribute to molding the DNA sequence on an evolutionary time scale.

Mutation and repair characteristics of DNA sequence in experimental systems have been shown in a number of cases to reflect structures in chromatin. For one well-studied experimental system, ultraviolet-irradiated yeast (Saccharomyces cerevisiae), repair rates for a set of DNA nucleosome core regions are lower than in the surrounding linker regions (1–4). Correlations between chromatin structure and mutation rates have also been suggested in analysis of human and yeast genomes (5–7). The draft genome sequences of two inbred medaka strains, Hd-rR and HNI (8), provide a remarkable opportunity for extensive comparison between genomic variation and structural features in the genome. The two strains are cross-fertile, yet their genomes are substantially different [∼3.42% single-nucleotide polymorphism (SNP)] (8). For analysis of chromatin and transcriptional effects on genetic variation, tissue samples including totipotent (germline) tissue would be most relevant, as mutational events in the germ line would uniquely contribute to shaping the genome over evolutionary time (9–11).

To characterize transcriptional activity patterns from the medaka genome at embryonic stages, we collected 25-nucleotide (nt) 5′-end mRNA tags for 1-, 2-, 3-, 5-, 10-, and 14-day Hd-rR medaka embryos (12). Among a total of ∼38.5 million 5′-end tags collected, ∼26.2 million (68.14%) were successfully aligned to unique positions in the medaka genome (fig. S1). Starting with a rough assumption that one cell contains ∼300,000 mRNA molecules (13), single-copy-per-cell RNAs would be represented by ∼100 of the ∼26.2 million tags. To define a set of active transcription start sites (TSSs), we used a clustering algorithm yielding 11,654 ≥100-tag clusters. More than 98.4% of neighboring clusters were separated by >100 base pairs (bp) from their nearest neighbor (fig. S3B). A reference TSS for each cluster was defined as the position with the most 5′-end tags.
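The depth estimate and the choice of reference TSS described above can be illustrated with a minimal Python sketch. The function names are hypothetical, the clustering step itself is not reproduced, and the tie-breaking rule (lowest coordinate wins) is an assumption that the text does not specify.

```python
from collections import Counter

def expected_tags_per_single_copy(aligned_tags, mrnas_per_cell):
    """Tags expected for an mRNA present at one copy per cell."""
    return aligned_tags / mrnas_per_cell

def reference_tss(tag_positions):
    """Return the position carrying the most 5'-end tags in one cluster.

    Ties are broken by taking the smallest coordinate (an assumption).
    """
    counts = Counter(tag_positions)
    best = max(counts.values())
    return min(pos for pos, n in counts.items() if n == best)

# ~26.2 million aligned tags, ~300,000 mRNA molecules per cell
depth = expected_tags_per_single_copy(26.2e6, 300_000)
print(round(depth, 1))  # 87.3, i.e. on the order of 100 tags
```

With these inputs, a transcript present at one copy per cell is expected to contribute roughly 87 tags, which is the rationale for the ≥100-tag cluster threshold being of the same order.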

The substitution and indel rates within 1000 bp of the reference TSSs in the 11,654 TSS clusters tend to reach a valley at the TSSs (Fig. 1A), suggesting relative selective constraint within promoters. This is consistent with reports of high conservation around TSS regions in mammals (14). Our analysis in medaka uncovers an additional pattern: The substitution rate (blue line) showed peaks at +100 and +300 bp and valleys at +200 and +400 bp around the TSSs (the same pattern was also seen in the transition and transversion rates). The indel rate [red line (Fig. 1A)] was minimal at the TSSs and maximal at +200 bp; additional peaks were evident at +400 and +600 bp. These peaks define regions where indel mutation rates were significantly greater than the average rate (0.59%) for the entire genome, with the signal weakening with increasing distance from TSSs. The indel dataset was then split into a “1-bp” category (37.46%) and the remaining “>1-bp” category of indels (fig. S4C). The peaks at +200, +400, and +600 bp are generated by the increase in the >1-bp category, whereas the 1-bp indel rate does not yield an evident periodicity (Fig. 1A). Comparisons of genetic variation to TSSs were possible in human to chimpanzee or mouse to rat, although not limited to germline or embryo TSSs (fig. S5). A limited periodicity in substitution rates may be present for these genomes, albeit much smaller in magnitude than that observed with the early transcriptome TSS data from medaka.

Fig. 1.

Diversity rates and nucleosome positions around TSSs. (A) The x axis shows the distance from the representative TSSs in the medaka (Hd-rR) genome. Color key shows rates: blue line, mismatch mutation (substitution) rate; red line, indel mutation rate; and gray line, rate of indels of length 1 bp. For smoothing of lines, a running average over a 23-bp window (one full turn of the helix in each direction) is depicted. (B) (Top) Putative nucleosome dyads (red points, 73 bp from start of sequence read) and cores (gray bars; 147 bp). (Bottom) The distinct meanings of the three nucleosome indicators. (C) Distribution of nucleosomes, substitutions, and indels surrounding a TSS. Black boxes, exons of the gene; blue histograms, distributions of the three nucleosome indicators; green vertical bars, substitutions between the Hd-rR and HNI genomes; red bars, deletions from the Hd-rR genome; blue bars, insertions into the Hd-rR genome; and gray bars and boxes, failure of alignment. (D) The average local dyad positioning score.
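The smoothing mentioned in the legend, a running average over a 23-bp window, can be sketched as follows. This is an illustrative version only: the function name is hypothetical, and truncating the window at the ends of the profile is an assumption, since edge handling is not described.

```python
def running_average(values, half_window=11):
    """Centered running average; 2 * 11 + 1 = 23 positions per window.

    Near the ends of the profile the window is truncated rather than
    padded (an assumption made for this sketch).
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed
```

Applied to a per-position rate profile around the TSSs, a window of this width averages out base-to-base noise while remaining much narrower than the ∼200-bp period.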

The ∼200-bp periodicity of the substitution and indel rates in medaka suggested the involvement of nucleosome structure. We isolated mononucleosome core DNAs from micrococcal nuclease–digested chromatin from blastulae [0.5-day embryos that maintain germline character in some (or all) cells (15)] and sequenced 67 million DNA ends to 36 bp (12, 16, 17). The first 25 bp were sufficient (fig. S6) to map 37.3 million ends (55.7% of sequenced reads) to unique locations in the medaka genome.

The distribution of distances between nucleosome start and end reads (fig. S7B) presents a significant peak at ∼147 bp, coincident with the size of nucleosome cores and indicative of some degree of constraint in nucleosome positioning. To assess nucleosome spacing intervals, we analyzed the distribution of distances between start positions of mapped nucleosome ends (fig. S7A) (16, 17). We observed a small peak at 165 bp, which indicated that adjacent nucleosomes in regions with conserved positioning are likely to be located at approximately 165-bp intervals (∼18-bp linker), while a distinctive ∼200-bp spacing (∼50-bp linker) was seen downstream of TSSs (see below).

Our metric for nucleosome position at individual sites in the genome (Fig. 1B) counts the number of putative nucleosome dyads in a 23-bp “sliding window” and divides this by the total number of nucleosomes impinging on this window to obtain a localized dyad positioning score. The 23-bp window (±1 helical turn) is used to accommodate observed variability in nuclease cleavage around nucleosome termini (see figs. S7B and S8B) (12, 17).
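A minimal Python sketch of the score just described is given below. It assumes that every mapped read has already been converted to the leftmost coordinate of its 147-bp core, with the putative dyad 73 bp from that start as in Fig. 1B; the function and constant names are hypothetical, and details of the published implementation may differ.

```python
CORE_LEN = 147      # nucleosome core length in bp
DYAD_OFFSET = 73    # putative dyad lies 73 bp from the core start
HALF_WINDOW = 11    # 2 * 11 + 1 = 23-bp sliding window

def dyad_positioning_score(core_starts, center):
    """Dyads inside the window divided by cores overlapping the window.

    Returns None when no core touches the window.
    """
    win_lo = center - HALF_WINDOW
    win_hi = center + HALF_WINDOW
    dyads_in_window = 0
    cores_overlapping = 0
    for start in core_starts:
        core_end = start + CORE_LEN - 1
        if start <= win_hi and core_end >= win_lo:
            cores_overlapping += 1
        if win_lo <= start + DYAD_OFFSET <= win_hi:
            dyads_in_window += 1
    if cores_overlapping == 0:
        return None
    return dyads_in_window / cores_overlapping

# Three cores with nearly coincident dyads plus one offset core
print(dyad_positioning_score([100, 102, 98, 180], center=173))  # 0.75
```

A score near 1 means that almost every nucleosome touching the window is centered there, whereas a score near 0 marks a site that cores cover without placing their dyads on it.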

The distribution of nucleosome dyad indicators, substitutions, and indels around several TSS sites is shown in Fig. 1C and fig. S8. For global analysis, positioning scores (X/Y) were taken into account only in areas covered by multiple nucleosome reads [87.1% of genomic positions (fig. S9B); the remaining 12.9% correspond in part to repetitive sequences that occupy 17.5% of the medaka genome (8)]. In unique regions supported with multiple nucleosome core coverage, putative nucleosome dyads that occur reproducibly in a defined neighborhood allow us to define positioned nucleosomes (fig. S9C). The average local dyad positioning score has local minima at positions +200, +400, +600, and +800 bp from the TSSs (Fig. 1D, green line), which suggests the presence of phased arrays of nucleosomes every ∼200 bp downstream of the TSS (9–11, 18–21).

By contrast to the decreased substitution rate in nucleosome linker regions, the indel rate for medaka had peaks at positions +200, +400, and +600 bp from the TSSs, which implied that indels of length >1 bp are more likely to occur at DNA linker regions. One possible explanation is that DNA linker regions have more indel mutations than the rest of the genome; this idea is supported by the higher indel rate on a genome-wide scale (not limited to TSS regions) in the DNA linkers in regions occupied by positioned nucleosomes (Fig. 2). One might wonder if the substitution rate increases toward the positioned dyads in nonpromoter regions; however, this tendency was not observed (Fig. 2A). These observations suggest an interplay of transcription and nucleosome positioning in determining susceptibility to substitutions and indel mutations.

Fig. 2.

Mutational spectra at positions around 8181 positioned dyads that are isolated from their neighboring dyads by >165 bp and are covered by an average of 5.44 putative nucleosome cores on a genome-wide scale (excluding TSSs and coding regions). (A) In nonpromoter regions, where transcription does not occur, the two locations in the distinct strands are positionally equivalent in a nucleosome core if they are the same distance from the dyad. The x axis presents the distance. Color key shows rates. (B) An expanded view of the indel rates enclosed in the green square in Fig. 2A is duplicated in tandem, and the two copies are overlaid for comparison with equivalent measurements relative to TSSs in Fig. 1A. (Bottom) The estimated dyads (arrows) aligned with dyad positioning score near TSSs (expanded from Fig. 1D).

Transcription-coupled DNA repair (TCR), a mechanism that protects transcribed regions from mutations, may contribute to the observed sequence effects (2, 22–24). TCR is thought to work simultaneously with mRNA transcription involving RNA polymerases I and II, resulting in an asymmetric effect with an overabundance of G+T over A+C downstream from the TSSs (through an excess of C-to-T mutations over G-to-A mutations) (22, 23). A significant asymmetry of the base composition is found in examining natural variation in the medaka genome at TSSs (Fig. 3A). Examining reciprocity in frequencies of the 12 possible base substitutions in 319 transcribed loci (121.1 kbp in total; regions where ancestry could be inferred by comparison with sequence data from an outgroup species), only the C-to-T versus G-to-A in the transcribed regions downstream of TSSs showed a significant strand bias (Fig. 3B) (P = 0.044) (12). This is consistent with TCR as one of the factors contributing to the character of natural sequence variation in these regions.
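One simple way to ask whether a substitution and its complement occur at different frequencies is an exact two-sided binomial test on the two counts. The sketch below is illustrative only: the statistical procedure actually used is described in the supporting material (12) and may differ, the function name is hypothetical, and the counts are invented placeholders rather than data from this study. The test also assumes an equal number of C and G sites available for mutation on the strand considered.

```python
from math import comb

def binomial_two_sided_p(k, n):
    """Exact two-sided P value for k successes in n trials under p = 0.5.

    Sums the probabilities of all outcomes that are no more likely
    than the observed one.
    """
    probs = [comb(n, i) / 2**n for i in range(n + 1)]
    observed = probs[k]
    total = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical counts, for illustration only
c_to_t = 60
g_to_a = 40
p_value = binomial_two_sided_p(c_to_t, c_to_t + g_to_a)
print(p_value)
```

A small P value would indicate that the two complementary substitutions are not equally frequent, which is the kind of strand asymmetry expected if repair acts preferentially on one strand.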

Fig. 3.

(A) Base composition surrounding TSSs. Red line, the difference between guanines and cytosines; blue line, the difference between adenines and thymines. (B) Substitution rates around TSSs. Rates for each substitution and its complement and their 95% confidence intervals are indicated side by side for untranscribed and transcribed regions that are upstream and downstream of TSSs, respectively.

Several possible causal and structural relationships may link sequence composition to mutagenesis rates and nucleosome positioning around TSSs. One rather simple explanation for the remarkable periodicity in mutation rates might have been an underlying bias in sequence composition in nucleosome core regions that favored certain types of mutations, whereas distinct sequence composition in linkers would favor other types of mutations. We addressed this possibility by examining sequence composition in general and around sites of genetic variation as a function of positioning relative to nucleosomes and TSSs (fig. S13) (12). This analysis gave no indication that differential mutagenesis could be accounted for by an initial sequence bias. A second intriguing possibility is that mutagenesis rates are influenced toward periodicity, not by the structural constraints of the chromatin template, but by functional constraints related to overall organismal fitness. Thus, for example, it would be conceivable that substitutions might be underrepresented in a critical set of linker sequences that are essential in maintaining specific transcription complexes and nucleosome-based structures downstream of TSSs. We do not favor this explanation for the medaka data, given that indel mutations show an opposite distribution, occurring more frequently in the linker regions. Instead, the biases in genetic variation seem most likely to represent structural constraints of the chromatin template during the mutagenic processes that medaka has encountered during evolutionary time. The mechanistic points at which nucleosomes may have influenced mutagenesis and/or repair processes in medaka evolution are (by definition) not known. The ability of nucleosomes in model assay systems to block repair of certain DNA lesions [e.g., (3)] certainly provides a precedent for the observed higher substitution rates in core regions. The complementary pattern of indels in medaka could reflect any of several conceivable linker and/or core differences (e.g., higher susceptibility of cores to breakage or less precise break repair in linkers).

For any species, the balance of specific mutagenic and repair processes occurring over history would have shaped the genome in potentially unique ways; thus, not all genomes would be expected to show a qualitatively or quantitatively equivalent “shadow” of germline chromatin structure. Our working model for the basis of structural variation between the genomes of these two inbred medaka strains is that chromatin structure influences mutagenesis, which in turn influences genetic variation, to provide the observed periodic pattern near the 5′ ends of germ line–transcribed genomic segments. We expect the influence of chromatin structure to be a general feature of sequence evolution throughout the genome and the biosphere.

Supporting Online Material

Materials and Methods

Figs. S1 to S14

Tables S1 to S3


References and Notes
