Research Article

Multiple Instances of Ancient Balancing Selection Shared Between Humans and Chimpanzees

Science  29 Mar 2013:
Vol. 339, Issue 6127, pp. 1578-1582
DOI: 10.1126/science.1234070

Balancing Humans with Apes

Shared ancestral polymorphisms between species tend to be relatively rare, and studies of trans-species polymorphisms have focused on just a few regions known for balancing selection. Leffler et al. (p. 1578, published online 14 February) performed genome-wide scans among humans and great apes and found shared polymorphisms between chimps and humans. Many of the identified variants seem to be associated with genes involved in pathogen response or defense, suggesting that this widespread balancing selection may reflect the ongoing arms race between pathogens and hosts.

Abstract

Instances in which natural selection maintains genetic variation in a population over millions of years are thought to be extremely rare. We conducted a genome-wide scan for long-lived balancing selection by looking for combinations of SNPs shared between humans and chimpanzees. In addition to the major histocompatibility complex, we identified 125 regions in which the same haplotypes are segregating in the two species, all but two of which are noncoding. In six cases, there is evidence for an ancestral polymorphism that persisted to the present in humans and chimpanzees. Regions with shared haplotypes are significantly enriched for membrane glycoproteins, and a similar trend is seen among shared coding polymorphisms. These findings indicate that ancient balancing selection has shaped human variation and point to genes involved in host-pathogen interactions as common targets.

Balancing selection is a mode of adaptation that leads to the persistence of variation in a population or species in the face of stochastic loss by genetic drift. In humans, examples include the sickle cell hemoglobin polymorphism, maintained by heterozygote advantage in environments in which Plasmodium falciparum is endemic, as well as other cases that likely arose recently in evolution in response to malaria (1). Beyond humans, examples of balancing selection are known in a wide range of organisms and often seem to arise from predator-prey or host-pathogen interactions (2-8). Most are not thought to be due to heterozygote advantage but to negative frequency-dependent selection, as occurs at self-incompatibility loci in plants (5, 9), or to temporally or spatially varying selection, as seen at R genes in Arabidopsis, for example (4). The genetic basis is known only in a small subset of cases, however, and the age-old question (10-12) of how much genetic variation is maintained by balancing selection remains largely open.

When balancing selection pressures result in the stable maintenance of genetic variation in the population for long periods of time, neutral diversity accumulates at nearby sites; in other words, ancient balancing selection leads to deep coalescence times to a common ancestor at the selected site (or sites) and closely linked ones (13). One approach to identify targets is therefore to scan the genome for regions of high diversity or other related features, such as intermediate allele frequencies (14). A challenge is that such patterns of diversity can occur by chance because of the tremendous variance in coalescence times due to genetic drift alone (14). As an illustration, under a simple demographic model with no selection, the probability that two human lineages do not coalesce before the split with chimpanzee is on the order of 10⁻⁴ (15, 16). Although this probability is small, the human genome is large, and so many such regions could occur by chance. To circumvent this difficulty, we looked for cases in which an ancestral polymorphism has persisted to the present time in both humans and chimpanzees, that is, is shared identical by descent between the two species. This outcome is not expected to occur by genetic drift alone because it requires that neither human nor chimpanzee lineages coalesce before the human-chimpanzee ancestor, which is unlikely even in a large genome (16).
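As a rough illustration of the kind of calculation behind this figure, the sketch below uses the standard neutral coalescent result that two lineages in a constant-size diploid population of effective size N stay distinct for t generations with probability exp(-t/(2N)). The function name and all parameter values (split time, generation time, effective sizes) are illustrative assumptions, not the demographic model used in (15, 16), and the result is very sensitive to them.

```python
import math

def prob_no_coalescence(split_time_years, generation_years, effective_size):
    """Probability that two lineages do not coalesce within the given time.

    Assumes a constant-size, neutral, diploid population in which a pair of
    lineages coalesces at rate 1/(2N) per generation.
    """
    generations = split_time_years / generation_years
    return math.exp(-generations / (2.0 * effective_size))

# Assumed, illustrative parameters (not taken from the article).
SPLIT_YEARS = 5e6
GENERATION_YEARS = 25

for n_e in (10_000, 15_000, 20_000):
    p = prob_no_coalescence(SPLIT_YEARS, GENERATION_YEARS, n_e)
    # Sharing by drift needs non-coalescence in both species; squaring p
    # (same assumed parameters for both) shows how much rarer that is.
    print(f"Ne={n_e}: one species {p:.2e}, both species {p * p:.2e}")
```

Under these assumptions the single-species probability is small but, multiplied across a large genome, not negligible, whereas requiring non-coalescence in both species makes the event far rarer, which is the logic of the scan.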

To date, two cases of human polymorphisms shared with other apes have been shown to be identical by descent [additional background is available in fig. S1 and (16)]: variants in the major histocompatibility complex (MHC), a complex encoding cell surface glycoproteins that present peptides to T cells (17), and polymorphisms at ABO, a glycosyltransferase, that underlie the A and B blood groups (18). Ancient balancing selection leaves a narrow footprint in genetic variation (15, 18), however, which may be particularly difficult to detect without dense variation data (19). Thus, the recent availability of genome sequences for multiple humans and chimpanzees provides an opportunity to search comprehensively and with greater power for ancient balancing selection.

Identification of shared SNPs and haplotypes. We examined complete genome sequences from 59 humans from sub-Saharan Africa (Yoruba) (20) and 10 Western chimpanzees (Pan troglodytes verus) (21) in order to identify shared polymorphisms—namely, high-quality orthologous SNPs with identical alleles in the two species (table S1) (16). In total, 33,906 autosomal and 492 X-linked single-nucleotide polymorphisms (SNPs) passed our filters (table S2). The lower proportion of shared SNPs found on the X (in humans, 0.36% of autosomal SNPs versus 0.19% of X-linked SNPs) is expected under neutrality because of the lower mutation rate and the smaller effective population size of the X (22).

The set of shared SNPs has similar properties to those of nonshared SNPs in terms of mapping quality, depth of coverage, and proportion in repeats (fig. S2 and table S2), which is consistent with it containing few artifacts. The shared SNPs include a much higher proportion of CpGs, however: 71.5% of autosomal shared SNPs occur at CpG dinucleotides, whereas only 26.4% of all human SNPs have this property (table S2). Because CpGs are known to have a higher mutation rate than other sites (23), this observation, along with the similarity in allele frequency distributions of shared and nonshared SNPs (fig. S2), suggests that most instances of shared SNPs are due to the independent occurrence of the same mutation in both species, that is, that most shared SNPs are identical by state rather than by descent (16).

Nonetheless, SNPs are shared between humans and chimpanzees 1.3-fold more often than is expected by chance, after controlling for the composition of the adjacent base pairs [the sequence context thought to have the strongest effect on mutation rate variation (23)] (fig. S3). This excess may be explained by residual effects on the mutation rate of the sequence context beyond the adjacent base pairs (fig. S4) or by variation in selective constraint across sites, but could also reflect instances of balancing selection.

Within the set of shared SNPs, we sought to enrich for targets of balancing selection by two approaches (Fig. 1A). First, we considered shared coding SNPs (16), a set that a priori should contain more functional changes subject to purifying selection so is less likely to include polymorphisms shared by chance alone. Second, to home in on cases with unequivocal evidence for balancing selection, we searched for polymorphisms shared because of identity by descent. Where balancing selection acted on a single site and maintained a polymorphism stably since the human-chimpanzee split, a short ancestral segment should persist until the present around the selected site, of expected length less than 4 kilobases (kb) [depending on the recombination rate (16)]. This segment is likely to contain one or more neutral, shared polymorphisms that arose in the ancestral population of humans and chimpanzees and are in strong or complete linkage disequilibrium (LD) with the selected site (Fig. 1B) (15, 18). Thus, this scenario should produce specific patterns of haplotype sharing between species. Guided by these considerations, we focused on cases with two or more shared SNPs within 4 kb and in significant LD in humans and in chimpanzees, with the same coupling of alleles in the two species (henceforth “shared haplotypes”) (16). These LD criteria should almost always be met when a neutral polymorphism has persisted because of close linkage with an ancient balanced polymorphism and yet are expected to filter out the vast majority (>96%) of cases of neutral, recurrent mutations (table S3) (16). These LD criteria should also be met if balancing selection acted on two or more sites and there is epistasis between them (as is the case at ABO), in which case the shared haplotypes may be longer (Fig. 1B).
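The shared-haplotype criterion can be sketched in a few lines. The code below is a minimal illustration, assuming phased haplotypes are given as (allele at SNP 1, allele at SNP 2) tuples; the function names, the toy data, and the use of an r² threshold as a stand-in for the significance test described in (16) are all assumptions made for this example.

```python
def ld_stats(haplotypes, allele_a, allele_b):
    """Return (D, r2) for two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2), one per chromosome.
    allele_a, allele_b: alleles at SNP 1 and SNP 2 that define the sign of D.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == allele_a for h in haplotypes) / n
    p_b = sum(h[1] == allele_b for h in haplotypes) / n
    p_ab = sum(h == (allele_a, allele_b) for h in haplotypes) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, r2

def is_shared_haplotype(human_haps, chimp_haps, pos1, pos2,
                        allele_a, allele_b, max_dist=4000, min_r2=0.5):
    """Check distance, LD in both species, and same coupling of alleles.

    min_r2 is a hypothetical threshold standing in for a formal LD test.
    """
    if abs(pos2 - pos1) > max_dist:
        return False
    d_h, r2_h = ld_stats(human_haps, allele_a, allele_b)
    d_c, r2_c = ld_stats(chimp_haps, allele_a, allele_b)
    same_coupling = d_h * d_c > 0  # D has the same sign in both species
    return same_coupling and r2_h >= min_r2 and r2_c >= min_r2

# Toy data: the same alleles are coupled in both species.
human = [("A", "G")] * 6 + [("T", "C")] * 4
chimp = [("A", "G")] * 3 + [("T", "C")] * 5
print(is_shared_haplotype(human, chimp, 1000, 2500, "A", "G"))
```

Requiring the same sign of D in both species is what distinguishes a haplotype inherited from the common ancestor from two recurrent mutations that happen to be in LD with opposite coupling.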

Fig. 1

Analysis pipeline. (A) Diagram of the pipeline to identify shared coding SNPs and shared haplotypes. Details of the filtering and validation are available in (16). (B) Two possible scenarios of ancient balancing selection that may be detected by our approach. In (i), only one site is under balancing selection, and a second mutation is neutral but persisted as a polymorphism until the present in both species because of tight linkage to the selected site. In (ii), two or more epistatically interacting polymorphic sites are maintained by balancing selection from the ancestral population of human and chimpanzee to the present time. In this case, the ancestral segment could be substantially longer because there is selection against recombinant haplotypes.

We imposed stringent quality control filters on the shared haplotypes and coding SNPs (Fig. 1A) in order to exclude regions with highly similar paralogs present in the reference genomes of humans or chimpanzees as well as artifacts arising from duplicates that are either fixed or polymorphic in the two species but for which one copy is absent from both reference genomes (these filters should also weed out regions that experience paralogous gene conversion) (16). After filtering, we combined regions with shared haplotypes if they had a shared SNP in common (tables S4 and S5).
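The final merging step, in which regions with shared haplotypes are combined whenever they have a shared SNP in common, amounts to grouping overlapping sets. The sketch below is one simple way to do this, assuming each region is represented as a set of (chromosome, position) tuples; the function name and the coordinates are hypothetical.

```python
def merge_regions(regions):
    """Merge regions (sets of SNP identifiers) that share at least one SNP.

    Merging is transitive: if A overlaps B and B overlaps C, all three
    end up in one region.
    """
    merged = []
    for region in regions:
        current = set(region)
        remaining = []
        for existing in merged:
            if existing & current:
                current |= existing  # absorb any overlapping region
            else:
                remaining.append(existing)
        remaining.append(current)
        merged = remaining
    return merged

# Hypothetical SNP pairs in significant LD, as (chromosome, position).
pairs = [
    {("chrA", 100), ("chrA", 900)},
    {("chrA", 900), ("chrA", 2100)},
    {("chrB", 50), ("chrB", 75)},
    {("chrA", 2100), ("chrA", 3000)},
]
for region in merge_regions(pairs):
    print(sorted(region))
```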

Protein variants. Across the genome, the MHC stood out (fig. S5), with 11 shared nonsynonymous and seven shared synonymous SNPs, including six nonsynonymous and three synonymous that were not among the many cases of shared haplotypes in this region (table S6) (16).

Unexpectedly, given that the basis for A and B blood groups is shared between humans and gibbons but not chimpanzees (who lack the B type) (18), we found two SNPs shared between humans and chimpanzees in ABO ~4 kb from the sites that distinguish A and B blood types in humans (fig. S6). Neither shared SNP is nonsynonymous (one is synonymous, the other intronic), and they do not meet our criteria for creating shared haplotypes, but there is a peak of diversity around them within both humans and chimpanzees, suggesting that they may be ancient variants (fig. S6).

In addition, we found 199 synonymous SNPs, 135 nonsynonymous SNPs, and 1 premature stop shared between humans and chimpanzees, distributed among 324 genes (table S5). Notable among these is a nonsynonymous SNP in GP1BA, a gene encoding a glycoprotein present on the membrane of platelets that is responsible for binding to the ABO antigens expressed on the Von Willebrand Factor (VWF) (24). The specific polymorphism in GP1BA shared between humans and chimpanzees, corresponding to the human platelet alloantigen 2 (HPA-2) polymorphism, affects the binding affinity to VWF and is associated with platelet count (25). More generally, the blood glycoprotein VWF is used as a bridge to anchor platelets to injured blood vessels for coagulation, and variants in ABO are strongly associated with protein levels of VWF (24). These findings suggest that two genes associated with the same complex may have been targets of long-lived balancing selection.

Fig. 2

Functional information for three regions with a polymorphism shared identical by descent in humans and chimpanzees. We show the nearby genes and direction of transcription, then a close-up of the region with shared polymorphisms between humans and chimpanzees. The original shared SNPs used to identify shared haplotypes are shown as solid circles. The region resequenced in the validation experiment is indicated with a solid black bar, and the length of the shared haplotypes is indicated with a dashed black bar (16). Sources of the functional annotation tracks shown are available in (16); darker shading for the formaldehyde-assisted isolation of regulatory elements (FAIRE) and DNaseI hypersensitivity tracks indicates a more intense signal. In the bottom panels, we focus on a shared SNP (hg19, chr4:144658471, chr5:8023976, and chr4:57918492, respectively) and show the mean pairwise difference between allelic classes for humans (in blue) and chimpanzees (in red), for a 500-bp sliding window; the mean pairwise difference within an allelic class in humans is in gray. We further indicate the average genome-wide divergence between human and chimpanzee (1.2%) (39) with a dotted black line. Divergence between more distant ape species and a zoom out of diversity levels in each region can be seen in figs. S7 and S9. (A) FREM3. A duplication in chimpanzees that includes the GYPE gene is shown above the gene structure in humans (26). The shared SNPs and eQTLs for GYPE in monocytes (40) are in almost perfect LD, with a pairwise correlation coefficient (r²) ranging from 0.98 to 1. (B) MTRR. The shared SNP represented by a triangle is also seen in a sample of seven gorillas by Sanger resequencing (16); pairwise differences between allelic classes in gorillas are shown in turquoise for the resequenced region. The maximum pairwise r² between a shared SNP and the eQTL for MTRR in monocytes is 0.47 (16, 40). The FAIRE signal is enriched in six cell lines. (C) IGFBP7. In the scan for shared haplotypes, five shared SNPs were found within 4 kb, occurring in two clusters with three and two SNPs, respectively, which are not in LD with each other in humans. Two of the shared SNPs found in the resequencing and a SNP outside the resequenced region constitute an additional instance of shared haplotypes. The FAIRE signal is enriched in four cell lines. Using a focal SNP in the second cluster yields similar results (fig. S7).

Regions with shared haplotypes. We identified 125 regions outside the MHC with shared haplotypes between humans and chimpanzees, whose total lengths span 4 bp to 6649 bp (table S4). In five of the regions (nearest FREM3, MTRR, and PROKR2 and in HUS1 and IGFBP7), there are more than two pairs of shared SNPs in significant LD, which simulations suggest should never occur in the genome by neutral recurrent mutations alone (16).

In the regions nearest FREM3 and MTRR and in IGFBP7, there is a peak of diversity in humans and chimpanzees around the shared SNPs that is comparable with or in excess of the average divergence between the two species (and yet there is no evidence for elevated mutation rates in the region, as assessed by the levels of divergence between more distant outgroup species), which is consistent with the polymorphisms predating the human-chimpanzee split (Fig. 2 and fig. S7). Furthermore, when we built a phylogenetic tree based on these regions, haplotypes from different species that carry the same allele are more closely related to each other than to haplotypes from the same species with the other allele (with high posterior probability and based on 800 bp or more) (Fig. 3, A to C) (16). This clustering pattern establishes that these cases cannot be explained solely by recurrent mutation (16).
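The diversity comparison used here, the mean pairwise difference between allelic classes in sliding windows set against the genome-wide human-chimpanzee divergence, can be sketched as follows. This is a toy illustration only: the function names, the short aligned sequences, and the small window size are assumptions made for the example (the Fig. 2 legend describes 500-bp windows), and real data would need handling of missing bases and phasing.

```python
def mean_pairwise_diff_between(class_a, class_b, start, end):
    """Mean per-site difference between two allelic classes in [start, end).

    class_a, class_b: lists of aligned sequences (equal-length strings),
    grouped by the allele they carry at the focal shared SNP.
    """
    total = 0
    for seq_a in class_a:
        for seq_b in class_b:
            total += sum(x != y for x, y in zip(seq_a[start:end], seq_b[start:end]))
    n_pairs = len(class_a) * len(class_b)
    return total / (n_pairs * (end - start))

def sliding_windows(class_a, class_b, window=500, step=100):
    """Return (window_start, between-class diversity) along the alignment."""
    length = len(class_a[0])
    return [
        (start, mean_pairwise_diff_between(class_a, class_b, start, start + window))
        for start in range(0, length - window + 1, step)
    ]

# Toy alignment: one haplotype per allelic class, 10 sites long.
class_a = ["AAAAAAAAAA"]
class_b = ["AAAAATTAAA"]
DIVERGENCE = 0.012  # genome-wide human-chimpanzee divergence cited in Fig. 2

for start, diversity in sliding_windows(class_a, class_b, window=5, step=5):
    flag = "above" if diversity >= DIVERGENCE else "below"
    print(start, diversity, flag, "divergence")
```

Windows in which between-class diversity within a species reaches or exceeds between-species divergence are the ones consistent with allelic classes that predate the species split.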

Fig. 3

Phylogenetic trees of haplotypes labeled with the same focal SNP considered in Fig. 2 or fig. S8 for (A) FREM3, (B) MTRR, (C) IGFBP7, (D) HUS1, (E) PROKR2, and (F) ST3GAL1. Trees were generated from our resequencing data by using MrBayes, with the median posterior probability of the clade over two runs reported in red (16). Results are for the entire resequenced regions for FREM3 and MTRR, and for the largest regions for which we found strong support in other cases. For FREM3, MTRR, and IGFBP7, the regions on which the trees are based are long (>800 bp), providing strong support for a polymorphism shared identical by descent (16). For HUS1, the tree still clusters by allele when considering 1 kb (with posterior probability 0.58), but for ST3GAL1 and PROKR2, this is not the case [more details are available in (16)].

The shared SNPs nearest FREM3 are in almost perfect LD with several expression quantitative trait loci (eQTLs) for GYPE (~130 kb away) in monocytes (Fig. 2A). Along with GYPA and GYPB, GYPE originated from one copy in the common ancestor of African apes (26). GYPA is a known receptor for Plasmodium falciparum proposed to be under balancing selection in humans, which, together with GYPB, codes for the MNS blood group (26); much less is known about GYPE, but it may also specify the M blood group antigen (27). The shared SNPs ~117 kb from MTRR, a gene involved in the production of methionine and implicated in the regulation of folate metabolism, are also in significant LD with an eQTL in monocytes, for MTRR (Fig. 2B). In turn, the shared SNPs in an intron of IGFBP7 occur in a likely enhancer (Fig. 2C). IGFBP7 has been shown to regulate cell proliferation, cell adhesion, and angiogenesis in cancer cell lines and plays a role in innate immunity by interacting with chemokines implicated in the regulation of lymphocyte trafficking (28).

In the two other regions (in HUS1 and nearest PROKR2) as well as in a region with only one pair of shared SNPs in significant LD (nearest ST3GAL1), diversity levels are unusually high only in humans, but nonetheless a phylogenetic tree for a small subset of the region (300 bp) clusters by allele and not by species (Fig. 3, D to F, and fig. S8). These patterns are consistent with the presence of an ancient balanced polymorphism on an ancestral segment that has been highly eroded by recombination [a more in-depth discussion is available in (16)]. PROKR2 is a receptor that functions as a proinflammatory mediator and whose ligand is able to modulate immune response (29). In turn, ST3GAL1 is a sialyltransferase that modifies the cell-surface glycan structure of dendritic cells (30) and for which knockout mice lack peripheral CD8+ T lymphocytes (31).

To check for possible sequencing or mapping errors, we resequenced the six regions with evidence for a polymorphism shared identical by descent (summarized in table S7) in 11 to 12 humans, 10 to 12 chimpanzees, and four to seven gorillas. In all cases, we confirmed the presence of the expected shared SNPs and the predicted LD patterns among them (16). Additionally, we found that in the MTRR and ST3GAL1 regions, one of the SNPs in the shared haplotypes is also segregating in gorilla (Fig. 2B and fig. S8) (16).

Common properties of ancient balanced polymorphisms. The narrow signature of ancient balancing selection allows the possible causal sites to be delimited to a few kilobases. Of the six regions with evidence for a long-lived balanced polymorphism, those in HUS1 and IGFBP7 and nearest ST3GAL1 likely have regulatory activity (Fig. 2 and fig. S8). More generally, only two of the 125 candidate regions include a shared SNP that is coding (in both cases, synonymous), but at least 10 regions appear to have a regulatory role (table S8) (16). Our findings therefore suggest that balancing selection has targeted regulatory variation in the human genome. The possible mechanisms underlying the maintenance of such polymorphisms are unclear but could involve allele-specific properties that lead to differences in levels of expression, in response to stimuli, or in patterns of expression across tissues [as is the case for B4galnt2 in mice (32)].

To further assess the commonalities among the set of 125 regions, we tested for an enrichment of gene categories for the nearest protein-coding gene (Table 1 and table S9) (16). We found significant enrichments of a number of overlapping categories, driven by the presence of 24 membrane glycoproteins in the test set of 54 genes (P < 10⁻³, corresponding to a 2.4-fold enrichment of glycoproteins over the background and a 1.2-fold enrichment of membrane glycoproteins over a background of only glycoproteins) (Table 1 and tables S10 to S12). Five of the 24 membrane glycoproteins have an immunoglobulin I-set domain (P = 0.006; a 6.3-fold enrichment over a background of membrane glycoproteins). The same trends are seen when considering an almost completely independent set of 335 coding SNPs (only two occur in shared haplotypes, neither of which contributes to these trends): Glycoprotein and cell adhesion are top categories among shared coding SNPs (P < 0.02) (tables S13 and S14). Although the number of genes involved is small, there is also an enrichment of gene ontology categories related to galactosyltransferase activity among genes near shared haplotypes and for categories related to glycosylation among genes with a shared coding SNP (tables S9 and S14).
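For readers unfamiliar with this kind of test, the sketch below shows how a fold enrichment and a one-sided hypergeometric P value can be computed from the four quantities defined in the Table 1 legend (Count, List total, Pop hits, Pop total). The count of 24 and list total of 54 come from the text; the background numbers are hypothetical placeholders, so the output does not reproduce Table 1, and the exact test used by the authors is described in (16) rather than assumed here.

```python
from math import comb

def fold_enrichment(count, list_total, pop_hits, pop_total):
    """Ratio of the category's frequency in the gene set to the background."""
    return (count / list_total) / (pop_hits / pop_total)

def hypergeom_upper_tail(count, list_total, pop_hits, pop_total):
    """P(X >= count) when drawing list_total genes without replacement
    from pop_total genes, of which pop_hits belong to the category."""
    upper = min(list_total, pop_hits)
    numerator = sum(
        comb(pop_hits, i) * comb(pop_total - pop_hits, list_total - i)
        for i in range(count, upper + 1)
    )
    return numerator / comb(pop_total, list_total)

COUNT, LIST_TOTAL = 24, 54           # from the text
POP_HITS, POP_TOTAL = 4000, 20000    # hypothetical background, for illustration

print("fold enrichment:", round(fold_enrichment(COUNT, LIST_TOTAL, POP_HITS, POP_TOTAL), 2))
print("P value:", hypergeom_upper_tail(COUNT, LIST_TOTAL, POP_HITS, POP_TOTAL))
```

Because categories such as "glycoprotein" and "membrane" overlap heavily, P values for individual terms are not independent, which is why the Table 1 legend notes that a Bonferroni correction is conservative.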

Table 1

Enrichment analysis. Gene category enrichment of the closest gene within 20 kb of shared human-chimpanzee haplotypes (16). We show only the top categories, for which P < 10⁻³ (a longer list is available in table S9); because the categories overlap, the Bonferroni correction is conservative. “Count” refers to the number of genes from the gene set with the given property (Term); “List total” refers to number of genes from the gene set that can be annotated in the category; “Pop hits” refers to the number of genes in the background with the given property; and “Pop total” refers to the number of genes from the background that can be annotated in the category.

Given that viruses frequently use host glycans to gain entry into host cells and some bacteria imitate host glycans to evade the host immune system (33-35), these enrichments suggest that the targets of balancing selection that we identified likely evolved in response to pressures exerted by human and chimpanzee pathogens, mirroring what is known about other genes under balancing selection in humans [(1, 17, 18, 36) and references therein]. Moreover, the observation that variation at loci that lie at the interface of host-pathogen interactions was stably maintained for millions of years is consistent with the hypothesis that arms races between hosts and pathogens can result not only in transient polymorphisms but also, in the presence of a cost to resistance, in a stable limit cycle in allele frequencies in the host (4, 9, 37).

We found several instances of ancient balancing selection in humans in addition to the two previously known cases. Our analysis suggests that this mode of selection has not only involved protein changes but also the regulation of genes involved in the interactions of humans and chimpanzees with pathogens and points to membrane glycoproteins as frequent targets. Because we deliberately focused on the subset of cases of balancing selection that are the least equivocal—requiring variation at two or more sites to be stably maintained in the two species from their split to the present—we likely missed balanced polymorphisms with a high mutation rate to new selected alleles [that is, with high allelic turnover (38)], in which the ancestral segment has been too heavily eroded by recombination, as well as any instance in which balancing selection pressures are more recent than the human-chimpanzee split. Thus, it seems likely that many more cases of balancing selection in the human genome remain to be found.

Supplementary Materials

www.sciencemag.org/cgi/content/full/science.1234070/DC1

Materials and Methods

Figs. S1 to S9

Tables S1 to S20

  • § These authors co-supervised this work.

References and Notes

  1. Materials and methods are available as supplementary materials on Science Online.
  2. Acknowledgments: We thank D. Conrad, Y. Lee, M. Nobrega, J. Pickrell, and H. Shim as well as A. Kermany, A. Venkat, and other members of the PPS labs for helpful discussions; I. Aneas, M. Çalışkan, M. Nobrega, and C. Ober for their assistance with experiments; and G. Coop for discussions and comments on an earlier version of this manuscript. E.M.L. was supported in part by NIH training grant T32 GM007197. This work was supported by NIH HG005226 to J.D.W.; Israel Science Foundation grant 1492/10 to G.S.; a Wolfson Royal Society Merit Award, a Wellcome Trust Senior Investigator award (095552/Z/11/Z), and Wellcome Trust grants 090532/Z/09/Z and 075491/Z/04/B to P.D.; Wellcome Trust grant 086084/Z/08/Z to G.M.; and NIH grant GM72861 to M.P. M.P. is a Howard Hughes Medical Institute Early Career Scientist. The data set of shared SNPs is available from http://przeworski.uchicago.edu/wordpress/?page_id=20. Data from the validation experiment are available from GenBank under accession nos. KC541701 to KC542146. The biological material obtained from the San Diego Zoo and used in this study is subject to a materials transfer agreement.