Technical Comments

Comment on “Evidence of Abundant Purifying Selection in Humans for Recently Acquired Regulatory Functions”

See allHide authors and affiliations

Science  10 May 2013:
Vol. 340, Issue 6133, pp. 682
DOI: 10.1126/science.1233195


Ward and Kellis (Reports, 28 September 2012, p. 1675; published online 5 September 2012) found altered patterns of human polymorphism in biochemically active but non–mammalian-conserved genomic regions relative to control regions and interpreted this as due to lineage-specific purifying selection. We find on closer inspection of their data that the polymorphism trends are primarily attributable to mutational variation and technical artifacts rather than selection.

Ward and Kellis (1) found reduced single-nucleotide polymorphism (SNP) density, heterozygosity, and derived allele frequency (DAF) in non–mammalian-conserved ENCODE-annotated (2) regions, relative to putative neutrally evolving control regions. Inferring that such differences are due to purifying selection is complicated by the fact that biological (e.g., mutation rates and background selection) as well as technical (e.g., coverage depth of sequencing reads used to detect and genotype SNPs) factors that differentially affect control and target regions may be confounded with signals of selection. The authors controlled for some but not all of these factors, which prompted us to analyze their data in more detail.

We noted that their neutral control regions have a somewhat more extreme nucleotide content than the ENCODE target (36% G+C versus 40%). This is of concern because more extreme nucleotide content is associated with both higher mutation rate [see figure 31 in (3)] and reduced coverage by next-generation reads [see figure S12 in (4)]. We found that the 1000 Genomes Project read-depth distributions for the two regions differ (Fig. 1A) and that—consistent with the fact that the 1000 Genomes Project low-coverage pilot has marginal power to detect rare alleles (5)—the DAF of detected SNPs depends on read depth (Fig. 1B). We also noted that the authors’ filtering of CpG “hotspot” dinucleotides erroneously also removes GpCs (which are not mutation hotspots and are much more common than CpGs). Moreover, in addition to ancestral CpGs and GpCs, they eliminated sites at which the derived allele creates a CpG or GpC; because it is targeted specifically to SNP sites, this affects SNP density and heterozygosity in a manner that depends on G+C content and, in particular, tends to reduce SNP rates in the ENCODE target relative to the neutral control.

Fig. 1 Variation in 1000 Genomes read depth (totaled over 59 Yoruban individuals) and its impact on DAF.

(A) Read-depth distribution for SNPs in neutral control (non-ENCODE) and ENCODE target regions. (B) DAF as a function of read depth, for non-ENCODE SNPs. DAF decreases with increasing depth, due to increasing sensitivity to detect rare variants; the reverse trend at depths above 300 likely reflects the presence of spurious “paralogue-collapse” SNPs .

We then reestimated (Fig. 2) the number of bases under selection, adapting the procedure in (1) to allow several motif-filtering methods [including one that only removes CpGs in the “ancestralized reference” genome obtained by replacing the reference nucleotide at any SNP site by the ancestral allele, when it differs (6)], to distinguish by type (A, C, G, or T) of ancestral nucleotide, and (for the allele frequency analyses) to control for varying read depth (7). We analyzed three region types considered in (1): (i) nonexonic, transcription start site (TSS)–distal, non–mammalian-conserved ENCODE regions [estimated in (1) to contain 111 megabases under selection]; (ii) mammalian-conserved exonic and TSS-proximal regions; and (iii) mammalian-conserved nonexonic, TSS-distal regions. The estimates for mammalian-conserved regions (which have independent evidence of purifying selection from interspecies comparisons) are remarkably robust to choice of motif filter and SNP measure and split roughly proportionately between ancestral A or T and ancestral C or G bases. If the polymorphism trends in the ENCODE target are due to selection, we would expect the ENCODE region estimates to be similarly robust. They are not, instead fluctuating wildly according to choice of motif filter, SNP measure, and ancestral nucleotide. In particular, there is a large difference between the estimates for ancestral A/T versus ancestral C/G; this may reflect a higher rate of biased gene conversion (8) in the ENCODE regions relative to the non-ENCODE control, but in any case it cannot plausibly be attributed to selection.

Fig. 2 Estimated bases under selection for ENCODE and mammalian-conserved regions by motif filter, SNP measure, and ancestral nucleotide.

None, no motifs removed; W&K, ancestral and derived CpG and GpC dinucleotides (1) removed; aCpG, ancestral CpGs removed; aA+T, analysis restricted to ancestral A and T nucleotides; uDAF, analysis based on DAF unadjusted for read depth; 1+2, analysis based on relative frequency of SNPs with derived allele count 1 or 2. DAF (without u) and 1+2 analyses used depth-adjusted measures. a=Ref, analysis using only SNPs for which the reference allele is ancestral. HM3, analysis using only HapMap3 SNPs.

Estimates based on allele frequency measures (right-hand side of Fig. 2) are similarly unstable. Controlling for read-depth variation sharply reduces the DAF-based ENCODE region estimate, from 147 Mb to 86 Mb. However, even the latter estimate appears suspect because it is sensitive to which part of the allele-frequency spectrum is considered: Using the rarest variants (those for which the derived allele occurs at most twice in the Yoruban chromosomes), which should have the greatest power to detect selection, the estimate drops to 9 Mb. Genotyping error is likely a factor, because confining the analysis to those SNPs for which the reference allele is ancestral, or which were typed in HapMap3 (9)—categories both indicated in (5) to have lower genotyping error rates than average—sharply reduces the DAF-based estimates.

Thus, it appears that the SNP-density- and heterozygosity-based estimates of selected sequence in (1) are considerably confounded by mutational trends and that suitably corrected allele-frequency–based estimates are much smaller than in (1). We note that although some lineage-specific functional sequences must underlie primate phenotypic characteristics, the amount need not be great: a “mere” 2 Mb, for example, could comprise some 250,000 8–base pair protein binding sites—i.e., more than 10 sites per gene and seemingly more than enough for massive regulatory rewiring. Thus, it should not be surprising that our analyses suggest a relatively small value.

References and Notes

  1. We used the 1000 Genomes inferred ancestral allele where available (96% of SNPs), and the major (most common) allele at other SNP sites (the major allele is nonancestral for roughly 15% of SNPs, so using it at 4% of SNPs introduces an error rate of 0.6%).
  2. In estimating bases under selection, we modified the authors’ linear interpolation procedure (1): For SNP-density- and heterozygosity-based estimates, we used 25 background selection (B) bins rather than 10. For DAF-based estimates, we did not control for B because for small samples, DAF is only weakly correlated with B. We instead controlled for read-depth variation, as follows. For a given region type, we binned SNPs by total read depth in the 59 YRI individuals, using bins of 0 to 9 reads, 10 to 19 reads, and so on; computed the allele frequency distribution for each bin; and took the weighted average of these distributions across bins, using as bin weights the fraction of ENCODE-region SNPs falling into that bin. (Thus, a single set of weights was applied for all region types.) This average distribution was then used to compute the adjusted DAF and adjusted 1+2 (fraction of SNPs with derived allele count 1 or 2) for the interpolation procedure.
  3. Acknowledgments: We thank L. Ward and M. Kellis for facilitating access to their data, E. Birney and J. Akey for helpful discussions, and NIH for financial support.

Stay Connected to Science


Navigate This Article