Research Article

Global diversity, population stratification, and selection of human copy-number variation


Science  11 Sep 2015:
Vol. 349, Issue 6253, aab3761
DOI: 10.1126/science.aab3761

Duplications and deletions in the human genome

Duplications and deletions can lead to variation in copy number for genes and genomic loci among humans. Such variants can reveal evolutionary patterns and have implications for human health. Sudmant et al. examined copy-number variation across 236 individual genomes from 125 human populations. Deletions were under more selection, whereas duplications showed more population-specific structure. Interestingly, Oceanic populations retain large duplications postulated to have originated in an ancient Denisovan lineage.

Science, this issue 10.1126/science.aab3761

Structured Abstract

INTRODUCTION

Most studies of human genetic variation have focused on single-nucleotide variants (SNVs). However, copy-number variants (CNVs) affect more base pairs of DNA among humans, and yet our understanding of CNV diversity among human populations is limited.

RATIONALE

We aimed to understand the pattern, selection, and diversity of copy-number variation by analyzing deeply sequenced genomes representing the diversity of all humans. We compared the selective constraints of deletions versus duplications to understand population stratification in the context of the ancestral human genome and to assess differences in CNV load between African and non-African populations.

RESULTS

We sequenced 236 individual genomes from 125 distinct human populations and identified 14,467 autosomal CNVs and 545 X-linked CNVs with a sequence read-depth approach. Deletions exhibit stronger selective pressure and are better phylogenetic markers of population relationships than duplication polymorphisms. We identified 1036 population-stratified copy-number–variable regions, 295 of which intersect coding regions and 199 of which exhibit extreme signatures of differentiation. Duplicated loci were 1.8-fold more likely to be stratified than deletions but were poorly correlated with flanking genetic diversity. Among these, we highlight a duplication polymorphism restricted to modern Oceanic populations yet also present in the genome of the archaic Denisova hominin. This 225–kilo–base pair (kbp) duplication includes two microRNA genes and is almost fixed among human Papuan-Bougainville genomes.

The data allowed us to reconstruct the ancestral human genome and create a more accurate evolutionary framework for the gain and loss of sequences during human evolution. We identified 571 loci that segregate in the human population and another 2026 loci of fixed-copy 2 in all human genomes but absent from the reference genome. The total deletion and duplication load between African and non-African population groups showed no difference after we account for ancestral sequences missing from the human reference. However, we did observe that the relative number of base pairs affected by CNVs compared to single-nucleotide polymorphisms is higher among non-Africans than Africans.

CONCLUSION

Deletions, duplications, and CNVs have shaped, to different extents, the genetic diversity of human populations by the combined forces of mutation, selection, and demography.

Figure Global human CNV diversity and archaic introgression of a chromosome 16 duplication.

(Left) The geographic coordinates of populations sampled are indicated on a world map (colored dots). The pie charts show the continental population allele frequency of a single ~225-kbp duplication polymorphism found exclusively among Oceanic populations and an archaic Denisova. (Right) The ancestral structure of this duplication locus (1) and the Denisova duplication structure (2) are shown in relation to their position on chromosome 16. We estimate that the duplication emerged ~440 thousand years ago (ka) in the Denisova and then introgressed into ancestral Papuan populations ~40 ka.

Abstract

In order to explore the diversity and selective signatures of duplication and deletion human copy-number variants (CNVs), we sequenced 236 individuals from 125 distinct human populations. We observed that duplications exhibit fundamentally different population genetic and selective signatures than deletions and are more likely to be stratified between human populations. Through reconstruction of the ancestral human genome, we identify megabases of DNA lost in different human lineages and pinpoint large duplications that introgressed from the extinct Denisova lineage now found at high frequency exclusively in Oceanic populations. We find that the proportion of CNV base pairs to single-nucleotide–variant base pairs is greater among non-Africans than it is among African populations, but we conclude that this difference is likely due to unique aspects of non-African population history as opposed to differences in CNV load.

In the past decade, genome sequencing has provided insights into demography and migration patterns of human populations (1–4), ancient DNA (5–7), de novo mutation rates (8–10), and the relative deleteriousness and frequency of coding mutations (11, 12). Global human diversity, however, has only been partially sampled, and the genetic architecture of many populations remains uncharacterized. To date, the majority of human diversity studies have focused on single-nucleotide variants (SNVs), although copy-number variants (CNVs) have contributed significantly to hominid evolution (13, 14), adaptation, and disease (15–18). Much of the research into CNV diversity has been performed with single-nucleotide polymorphism (SNP) microarray and array comparative genomic hybridization (aCGH) platforms (19–22), which provide limited resolution. In addition, comparisons of population CNV diversity with heterogeneous discovery platforms may lead to spurious population-specific trends in CNV diversity (22, 23). Although there are many other forms of structural variation (e.g., inversions or mobile element insertions), in this study we focused on understanding the population genetics and normal pattern of copy-number variation by deep sequencing a diverse panel of human genomes.

Results

CNV discovery

We sequenced to high coverage a panel of 236 human genomes representing 125 diverse human populations from across the globe (Fig. 1 and table S2). Sequencing was performed to a mean genome coverage of 41-fold from libraries prepared by using a standard polymerase chain reaction–free protocol on the HiSeq 2000 Illumina (San Diego, CA) sequencing platform (24). The panel includes representation from a broad swathe of human diversity, including individuals from across Siberia, the Indian subcontinent, and Oceania. We also analyzed the high-coverage archaic Neanderthal (25) and Denisova (26) as well as three ancient human genomes to refine the evolutionary origin and timing of CNV differences (24). We applied a read-depth–based digital comparative genomic hybridization (dCGH) approach (13, 24) to identify 14,467 autosomal CNVs and 545 X-linked CNVs among individuals relative to the reference genome (Table 1 and table S1), which we estimate provides breakpoint resolution to ~210 base pairs (bp) (24). CNV calls were validated with SNP microarrays and a custom aCGH microarray that targeted all CNVs identified in 20 randomly selected individuals (24).

Fig. 1 Analysis of CNVs in several world populations.

The geographical locations of the 125 human populations, including two archaic genomes, assessed in this study. Populations are colored by their continental population groups, and archaic individuals are indicated in black.

Table 1 CNVs and SNVs broken down by their intersection with genomic region.

The number of mega–base pairs of exonic and segmentally duplicated CNVs reflects the amount of exonic and segmental duplication sequences affected, respectively, not the total sum of the intersecting CNVs.


The median CNV size was 7396 bp, with 82.2% of events (n = 12,338) less than 25 kbp (24). CNVs mapping to segmental duplications were larger on average (median of 14.4 kbp) than CNVs mapping to the unique portions of the genome (median of 6.2 kbp). Almost one-half of CNV base pairs mapped within previously annotated segmental duplications (a 10-fold enrichment) (Table 1). In total, 217.1 Mbp (7.01%) of the human genome are variable because of CNVs, in contrast to 33.8 Mbp (1.1%) resulting from single-nucleotide variations (Table 1). Deletions (loss of sequence) were less common (representing 85.6 Mbp or 2.77% of the genome) compared with duplications (gain of sequence, 136.1 Mbp or 4.4% of the genome). Furthermore, comparing our data set with other studies of CNVs (21, 27), 67 to 73% of calls we report are unique to our study, whereas we captured 68 to 77% of previously identified CNVs (24).

CNV diversity and selection

African populations are broadly distinguished from non-African populations by a principal component analysis (PCA) for either deletions (Fig. 2A and fig. S20) (24) or duplications (Fig. 2B). In this analysis, we limited the variants to biallelic deletions or biallelic duplications (diploid genotypes of two, three, or four) to eliminate the difficulty of inferring phase from multicopy CNVs. For deletions, PC1 (6.8% of the variance) and PC2 (3.94%) distinguish Africans, West Eurasians, East Asians, and Oceanic populations. PC3 and PC4, describing 2.8% and 2.0% of the total variance, cluster Papuans and populations of the Americas, respectively. Many other populations were predictably distributed along clines between these clusters (e.g., Northern Africans, Siberians, South Asians, Amerindians, and indigenous peoples of the Philippines and North Borneo). PCAs generated from SNVs showed patterns similar to those from deletions. Africans also show much greater heterozygosity (Fig. 2C and Table 2), for instance, ~25% more heterozygous biallelic deletions and more than a twofold difference when compared with Amerindians (θ_African = 535 versus θ_Americas = 209). The archaic Neanderthal and Denisova genomes form an out-group to all humans (24).
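As an illustration of the kind of projection described above, the following is a minimal sketch of extracting the first principal component from a matrix of biallelic genotypes. It is not the pipeline used in the study: the function name `pc1_scores`, the input layout (one row per individual, allele counts of 0, 1, or 2 per variant), and the toy genotypes are all assumptions made for the example, and power iteration is used only to keep the sketch free of external libraries.

```python
import math


def pc1_scores(genotypes, iterations=200):
    """Project individuals onto the first principal component.

    genotypes: one row per individual, one column per biallelic variant,
    each entry an allele count (0, 1, or 2). Hypothetical input layout.
    """
    n, m = len(genotypes), len(genotypes[0])
    # Center every variant (column) on its mean across individuals.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]

    # Power iteration on X^T X, applied implicitly as X^T (X v).
    v = [float(j + 1) for j in range(m)]  # arbitrary non-uniform start vector
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(w * w for w in v))
        if norm == 0.0:  # no variation, or start vector orthogonal to the data
            return [0.0] * n
        v = [w / norm for w in v]
    return [sum(x * w for x, w in zip(row, v)) for row in centered]


# Toy data (invented): two groups carrying different deletion alleles.
group_a = [[2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1]]
group_b = [[0, 0, 2, 2], [0, 0, 1, 2], [0, 1, 2, 2]]
scores = pc1_scores(group_a + group_b)
```

In this toy case the two groups fall on opposite sides of zero along PC1; the sign of the axis itself is arbitrary.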

Fig. 2 Population structure and CNV diversity.

PCA of individuals assessed in this study plotted for biallelic deletions (A) and duplications (B) with colors and shapes representing continental and specific populations, respectively. Individuals are projected along the PC1 and PC2 axes. The deletion (C) and duplication (D) heterozygosity plotted and grouped by continental population. The relationship between SNV heterozygosity and deletion (E) or duplication (F) heterozygosity is compared.

Table 2

Summary statistics of biallelic CNV deletions versus SNVs by continental population group.


Duplication heterozygosity and PCA in general show similar trends (Fig. 2D), albeit with far less definition. Oceanic populations, especially those from Papua New Guinea, Australia, and Bougainville, showed the greatest separation on PC1 by duplication. Biallelic duplications appear to be somewhat less-informative markers of human ancestry, in contrast to SNVs, which provide the greatest resolution (e.g., SNV PCs 1 to 4 describe 5.8, 3.4, 2.6, and 1.7% of the variance, respectively). This difference is also seen when comparing SNV and CNV heterozygosity (Fig. 2, E and F). Whereas heterozygous biallelic deletions were strongly correlated (R = 0.88) with SNV heterozygosity, the correlation between SNVs and duplications was much weaker (R = 0.27). We compared this correlation for duplications located within or adjacent to segmental duplications (within 150 kbp) with that for duplications occurring in unique regions of the genome, which are less likely to be subject to recurrent mutation. Heterozygous duplications occurring in unique regions were better correlated with heterozygous SNVs (r = 0.29) than those adjacent or within segmental duplications (r = 0.17), although the difference was not significant (two-sided Williams’ test P < 0.1).
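The comparison of CNV and SNV heterozygosity can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' code: heterozygous biallelic deletions are taken to be sites with diploid copy number 1 and heterozygous biallelic duplications to be sites with copy number 3, and the helper names and toy values are hypothetical.

```python
import math


def heterozygous_count(copy_numbers, het_state=1):
    """Count heterozygous sites in one individual.

    het_state=1 suits biallelic deletions (copy numbers 0, 1, 2);
    het_state=3 suits biallelic duplications (copy numbers 2, 3, 4).
    """
    return sum(1 for cn in copy_numbers if cn == het_state)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy data (invented): per-individual deletion copy numbers and SNV heterozygosity.
deletion_calls = [[2, 1, 1, 0], [2, 2, 1, 0], [1, 1, 1, 2]]
snv_heterozygosity = [0.0008, 0.0006, 0.0010]
deletion_het = [heterozygous_count(calls) for calls in deletion_calls]
r = pearson_r(deletion_het, snv_heterozygosity)
```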

Studies of larger (>100 kbp) deletion and duplication events indicate that deletions are more deleterious than duplications (28). We reasoned that this may be reflected in the allele frequency spectrum (AFS) of normal genetic variation and compared the AFS of genic versus intergenic deletions and duplications for smaller events (Fig. 3, A and B). Genic deletions were significantly rarer than intergenic deletions (Wilcoxon rank sum test, P = 1.84 × 10⁻⁹), but genic duplications showed no such skew (Wilcoxon rank sum test, P = 0.181). Size also had a significant impact on the AFS of CNVs. Deletions increased in rarity as a function of size (F test, P = 5.02 × 10⁻¹¹) (Fig. 3C), but only a nominally significant trend was observed for duplications (P = 0.031) (Fig. 3D). These data suggest that selection has shaped the extant diversity of deletions and duplications differently during human evolution.
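A folded allele-frequency spectrum of the kind compared here can be sketched in a few lines. The example below is a minimal illustration with invented allele counts; the function names are hypothetical, and the rank-based comparison itself is not implemented (SciPy's `scipy.stats.ranksums` is one option for a Wilcoxon rank sum test on the folded counts).

```python
from collections import Counter


def folded_count(alt_count, n_chromosomes):
    """Minor-allele count: fold the spectrum so ancestral state is not needed."""
    return min(alt_count, n_chromosomes - alt_count)


def folded_spectrum(alt_counts, n_chromosomes):
    """Map each minor-allele count to the number of variants observed at it."""
    return Counter(folded_count(k, n_chromosomes) for k in alt_counts)


def mean_folded_frequency(alt_counts, n_chromosomes):
    """Mean minor-allele frequency across a set of variants."""
    total = sum(folded_count(k, n_chromosomes) for k in alt_counts)
    return total / (len(alt_counts) * n_chromosomes)


# Toy data (invented): alternate-allele counts among 10 sampled chromosomes.
genic_deletions = [1, 1, 2, 9]
intergenic_deletions = [3, 4, 5, 7]
genic_afs = folded_spectrum(genic_deletions, 10)
genic_mean = mean_folded_frequency(genic_deletions, 10)
intergenic_mean = mean_folded_frequency(intergenic_deletions, 10)
```

In this toy set the genic deletions sit at lower folded frequencies than the intergenic ones, which is the direction of skew reported for deletions.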

Fig. 3 Selection on CNVs.

Folded allele-frequency spectra of exon-intersecting deletions (A) and duplications (B). Whereas deletions intersecting exons are significantly rarer than intergenic deletions, exon-intersecting duplications show no difference compared to intergenic duplications. The mean frequency of CNVs beyond a minimum size threshold is plotted for deletions (C) and duplications (D). A strong negative correlation between size and allele frequency is observed for deletions but less so for duplications.

Population stratification

Because population stratification can be indicative of loci under adaptive selection, we calculated Vst statistics for each CNV among all pairs of continental population groups, a metric analogous to Fst (the fixation index) (29). Vst and Fst statistics compare the variance in allele frequencies between populations, with Vst allowing comparison of multiallelic or multicopy CNVs. We identified 1036 stratified copy-number–variable regions (CNVRs with maximum population Vst > 0.2, ~10% of the total), 295 of which intersected the exons of genes and 199 that exhibited extreme stratification (Vst > 0.5) (table S3). After correcting for copy number, duplicated loci were 1.8-fold more likely to be stratified than deletions. This finding is more remarkable in light of the fact that duplications were less discriminatory by PCA, suggesting that a subset of multiallelic duplicated CNVs show large allele frequency differences between different populations (see discussion below). The Vst of stratified duplicated CNVs was weakly correlated with the Fst of flanking SNVs (R² = 0.03, P = 3.27 × 10⁻¹²) in contrast to deletions (R² = 0.2, P < 2 × 10⁻¹⁶). Stratified duplication loci, thus, are far less likely to be tagged by adjacent SNPs through linkage disequilibrium.
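For readers unfamiliar with Vst, a minimal sketch of the statistic follows. It uses the commonly cited form Vst = (V_T − V_S)/V_T, where V_T is the variance in copy number across both populations combined and V_S is the size-weighted average of the within-population variances. Applying it directly to integer copy-number genotypes, the function names, and the toy copy numbers are assumptions of this example; the thresholds of 0.2 and 0.5 are those quoted in the text.

```python
from itertools import combinations
from statistics import pvariance

STRATIFIED_VST = 0.2  # threshold for a stratified CNVR, from the text
EXTREME_VST = 0.5     # threshold for extreme stratification, from the text


def vst(pop_a, pop_b):
    """Vst between two populations given per-individual copy numbers."""
    total_var = pvariance(pop_a + pop_b)
    if total_var == 0:
        return 0.0  # no copy-number variation at all
    within_var = (
        len(pop_a) * pvariance(pop_a) + len(pop_b) * pvariance(pop_b)
    ) / (len(pop_a) + len(pop_b))
    return (total_var - within_var) / total_var


def max_pairwise_vst(groups):
    """Largest Vst over all pairs of population groups (dict: name -> copy numbers)."""
    return max(vst(a, b) for a, b in combinations(groups.values(), 2))


# Toy data (invented): copy numbers at one locus in three groups.
groups = {"A": [2, 2, 2], "B": [4, 4, 4], "C": [2, 2, 3]}
locus_vst = max_pairwise_vst(groups)
is_stratified = locus_vst > STRATIFIED_VST
```

Because the statistic works on copy-number variance rather than allele counts, it handles multicopy loci such as the six-copy haptoglobin alleles without needing phased genotypes.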

Many of the population-differentiated loci were multiallelic and mapped to segmental duplications, including the repeat domain of ANKRD36 and the DUF1220 domain of NBPF (24) (Table 3). Several of these population differences involve genes of medical consequence, such as the multiallelic duplication of CLPS, a pancreatic colipase involved in dietary metabolism of long-chain triglyceride fatty acids (Fig. 4A). Increased expression in mouse models of this gene is negatively correlated with blood glucose levels (30). A duplication of the haptoglobin and haptoglobin-related (HP and HPR) genes expanded exclusively in Africa. The duplication has recently been associated with a possible protective effect against trypanosomiasis in Africa, although only copy 3 and 4 alleles were reported (31). We find this locus has further expanded to five and six copies in Esan, Gambian, Igbo, Mandenka, and Yoruban individuals (Fig. 4A). We also compared the location of our CNVs with disease loci identified by genome-wide association study (GWAS) (32) and sites of potential positive selection (33). Although only a small fraction of our CNVs (1 to 6%) overlapped such functional annotation, we note that 21% of putative adaptive loci intersected with a CNV when compared with 6% of disease GWAS loci (table S4). Because many of the intervals are large, further refinement and investigation are needed to determine the importance of such overlaps.

Table 3 CNVs differentiated between human populations.

CNVs intersecting genes that show dramatic difference in copy number (as measured by Vst) between human populations (see Fig. 1 for definition of populations).

Fig. 4 Population-stratified CNVs and archaic introgression.

(A) Four specific examples of population-stratified CNVs intersecting genes are shown, including LRRIQ3, the pancreatic colipase CLPS, the sperm head and acrosome formation gene DPY19L2, and the haptoglobin and haptoglobin-related genes HP and HPR. Dot plots indicate the copy of the locus in each individual, and pie charts with colors depict the continental population distribution per copy number (see text for details and Figs. 1 and 2 and dot plots for color scheme). (B) Predicted copy number on the basis of read depth for a 73.5-kbp duplication on chromosome 16. It is observed in the archaic Denisovan genome and at 0.84 allele frequency in Papuan and Bougainville populations, yet absent from all other assessed populations. The duplication intersects two microRNAs. The orange arrow corresponds to the position and orientation of this duplication as further highlighted in (C) and (D). (C) A heat map representation of a ~1-Mbp region of chromosome 16p12 (chr16:21518638–22805719). Each row of the heat map represents the estimated copy number in 1-kbp windows of a single individual across this locus. Genes, annotated segmental duplications, arrows highlighting the size and orientation in the reference of the Denisova/Papuan-specific duplication locus (locus D), and three other duplicated loci (A, B, and C) of interest are shown below. (D) The structure of duplications A, B, C, and D [as shown in (C) over the same locus] in the reference genome and the discordant paired-end read placements used to characterize two duplication structures. Structure A/C is found in all individuals, although not present in the reference genome, whereas structure B/D is only found in Papuan and Bougainville individuals, indicating a large (~225 kbp), complex duplication composed of different segmental duplications. Both the A/C and B/D duplication architectures exhibit inverted orientations compared with the reference. The number of reads in all Oceanic and non-Oceanic individuals supporting each structure are indicated. (E) Maximum likelihood tree of the 16p12 duplication locus [duplication D in (B) to (D)] constructed from the locus in orangutan, Denisova, the human reference, and the inferred sequence of the Papuan duplication (24). All bootstrap values are 100%.

Denisovan CNVs are retained and expanded in Oceanic populations

We further searched for highly stratified population-specific CNVs sharing alleles with the archaic Neanderthal and Denisovan individuals assessed in our study. Although no Neanderthal-shared population-specific CNVs were identified, five Oceanic-specific CNVs were identified that shared the Denisova allele at high frequency (24). Papuan genomes have previously been reported to harbor 3 to 6% Denisovan admixture (6, 26). CNVs of putative Denisovan ancestry were at remarkably high frequency in Papuan individuals (all >0.2 allele frequency), with one ~9-kbp deletion lying 2 kbp upstream of the long noncoding RNA LINC00501, another 5-kbp duplication lying 8 kbp upstream of the METTL9 methyltransferase gene, and a 73.5-kbp duplication intersecting the MIR548D2 and MIR548AA2 microRNAs (Fig. 4B).

We determined that the latter two are part of a larger composite segmental duplication that appears to have almost fixed among human Papuan-Bougainville genomes [allele frequency (AF) = 0.84] but has not been observed in any other extant human population (Fig. 4, B and C). We noted three additional duplications proximal to this locus exhibiting strikingly correlated copy number, despite being separated by >1 Mbp in the reference genome (Fig. 4C) (24). We suggest that these constitute a single, larger (~225 kbp) complex duplication composed of different segmental duplications. By using discordantly mapping paired-end reads, we resolved the organization of two duplication architectures not represented in the human reference (Fig. 4D). The first of which (architecture A/C) is present in all individuals assessed in this study (5625 discordant paired-end reads supporting) but not in the human reference genome. The second (B/D) corresponds to the Denisova-Papuan–specific duplication and is only present in these individuals and the Denisova genome. Seventy paralogous sequence variants [markers distinct to paralogous locus (34, 35)] distinguish the Papuan duplication, of which 65/70 (92.9%) were shared with the archaic Denisova genome. On the basis of single-nucleotide divergence, we estimate that the duplication emerged ~440 thousand years ago (ka) and rose to high frequency in Papuan (>0.80 AF) but not Australian genomes, probably over the past 40,000 years after introgression from Denisova (Fig. 4E). This polymorphism represents the largest introgressed archaic hominin duplication in modern humans.

The ancestral human genome

The breadth of the data set allowed us to reconstruct the structure and content of the ancestral human genome before human migration and subsequent gene loss. To identify ancestral sequences potentially lost by deletion, we identified a set of sequences present in chimpanzee and orangutan reference genomes but absent from the human reference genome (20,373 nonredundant loci corresponding to 40.7 Mbp of sequence). Of these, 9666 (27.6 Mbp) were unique (i.e., not composed of common repeats). Because of the inability to accurately genotype copy number for unique segments less than 500 bp by read-depth analysis, we limited our ancestral reconstruction to nonrepetitive sequences greater than this length threshold. Although the majority represented deletions specifically lost in the human lineage since divergence from great apes (6341 loci) or else reference genome artifacts (2026 loci fixed at copy 2 in all individuals assessed, 6.2 Mbp), a small subset of these (n = 571 or 1.55 Mbp) segregate as biallelic polymorphisms in human populations (Fig. 5A). As expected, Africans were more likely to show evidence of these ancestral sequences compared with non-African populations, because the latter have experienced more population bottlenecks and thus retained less of the ancestral human diversity. A comparison to archaic genomes allowed us to identify sequences (50 loci or 104 kbp) that were present in Denisova or Neanderthal but lost in all contemporary humans as well as ancestral sequences present in all humans but not found in Denisova or Neanderthal (17 loci or 33.3 kbp).
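The three-way partition of ancestral loci described above (lost in the human lineage, fixed at copy 2 and therefore likely missing from the reference by artifact, or segregating) can be sketched as a simple rule on genotyped copy numbers. This is a simplified illustration that assumes clean integer diploid copy numbers per individual; real read-depth genotypes are noisy and would need tolerance thresholds, and the category labels and locus names are hypothetical.

```python
from collections import Counter


def classify_ancestral_locus(copy_numbers):
    """Classify one ancestral locus from diploid copy numbers across individuals."""
    observed = set(copy_numbers)
    if observed == {2}:
        return "fixed_copy_2"  # present in everyone: likely a reference artifact
    if observed == {0}:
        return "lost_in_humans"  # absent in everyone: deleted on the human lineage
    return "segregating"  # copy number varies among individuals


# Toy data (invented): copy numbers at four loci in five individuals.
loci = {
    "locus_1": [2, 2, 2, 2, 2],
    "locus_2": [0, 0, 0, 0, 0],
    "locus_3": [2, 1, 0, 2, 1],
    "locus_4": [0, 0, 1, 0, 0],
}
summary = Counter(classify_ancestral_locus(cn) for cn in loci.values())
```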

Fig. 5 The ancestral human genome and CNV burden.

(A) A heat map of the allele frequency of 571 (1.55 Mbp) nonrepetitive sequences absent from the human reference genome yet segregating in at least one population ordered in humans by a maximum likelihood tree (49). Four groups of interest are highlighted: G1, ancestral sequences that have almost been completely lost from the human lineage; G2, ancestral sequences that are largely fixed but rarely deleted (also absent in human reference); G3, ancestral sequences that have become copy-number variable since the divergence of humans and Neanderthals/Denisovans ~700 ka; and G4, sequences potentially lost in Neanderthals and Denisovans since their divergence from humans. (B) The resulting distributions of 10,000 block-bootstrapped estimates of the difference in load between African (AFR) and non-African (nAFR) populations considering only the reference genome (GRCh37) and supplemented by sequence absent from the human reference genome (GRCh37 + NHP) included (see text for details). (C) Violin plots of the distribution of the ratio of deletion base pairs to SNV base pairs differing between every pair of African individuals (AFR-AFR), all pairs of non-African individuals (nAFR-nAFR), and every non-African, African pair (nAFR-AFR). (D) Heat map representation of the mean ratio of deletion to SNV base pairs differing between individuals from pairs of populations.

No difference in the CNV load between Africans and non-Africans

The high coverage and uniformity allowed us to contrast putatively deleterious, exon-removing CNVs among human populations, which are of interest in disease studies (36–38). In our call set, we identified 2437 CNVRs intersecting exons. The distribution of allele counts of these tended toward lower-frequency events with, again, deletions more rare than duplications (Wilcoxon rank sum test, P = 1.25 × 10⁻⁵). Collectively, individuals harbor a mean of 19.2 exon-intersecting deletions per genome (22.8 per diploid genome), with African individuals exhibiting, on average, a mean of 22.4 deletions compared with 18.6 in non-Africans (26.1 and 22.1 per diploid genome, respectively), consistent with the increased diversity of African populations and consistent with data observed for loss-of-function SNVs [(12, 39), ~122 LoF SNVs in Africans versus ~104 in non-Africans].

Whereas non-African individuals exhibited more homozygous deletion variants compared with Africans, among exon-intersecting deletions no such pattern was observed. Exon-intersecting duplications were much more balanced, with African populations showing only a slight excess when compared to non-Africans (98.4 versus 95.2 events per genome). Studies of SNVs have not found consistent evidence of a difference in load between African compared to non-African populations (40–42). We compared the difference in load between African and non-African populations for deletions and duplications, respectively. Here, we defined the difference in load as the difference in the sum of derived allele frequencies between African and non-African populations, $\Delta L = \sum_{i} P_{\mathrm{Afr}}(i) - \sum_{i} P_{\mathrm{nAfr}}(i)$, where $P_{\mathrm{Afr}}(i)$ is the derived allele frequency of a variant i in African populations and $P_{\mathrm{nAfr}}(i)$ is the corresponding frequency in non-African populations. Prima facie Africans exhibited an apparent higher deletion load than non-African populations (Fig. 5B) (P = 0.0003, block bootstrap test), although there was only a nominal difference in the load of exonic deletions (P = 0.0482). Duplications showed no such effect.

We reasoned that this difference might potentially be driven by high-frequency–derived alleles, absent from the human reference genome, which was enriched for clone libraries of non-African ancestry (5). Approaches that rely on identifying CNVs based on read placements to the reference genome would necessarily miss these CNVs, decreasing the number of variants identified in individuals more closely resembling the reference, i.e., non-Africans. To test this hypothesis, we incorporated the biallelic 571 nonrepetitive human CNV loci described above. Copy numbers were estimated for these sequences in each of the individuals and assessed by remapping raw reads against an ancestral human reference genome. As expected, the deletion allele of this sequence was at a high frequency (mean derived allele frequency, DAF = 0.58). After including these sequences, we observed no difference in the CNV load between Africans and non-Africans (95% confidence interval –18.4 to 8.8 load difference as defined above) (Fig. 5B), underscoring the importance of an unbiased human reference for such population genetic assessments.
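The load difference defined above (the sum of derived allele frequencies in Africans minus the sum in non-Africans) and a block-bootstrap interval around it can be sketched as follows. This is an illustrative example only: how variants are grouped into genomic blocks, the percentile interval, the function names, and the toy frequencies are assumptions; the default of 10,000 resamples matches the number of bootstrap estimates mentioned in the Fig. 5B legend.

```python
import random


def load_difference(variants):
    """Sum of derived allele frequencies, African minus non-African.

    variants: iterable of (p_afr, p_nafr) pairs, one per variant.
    """
    variants = list(variants)
    return sum(p_afr for p_afr, _ in variants) - sum(p_nafr for _, p_nafr in variants)


def block_bootstrap(blocks, n_resamples=10000, seed=0):
    """Resample whole blocks with replacement and recompute the load difference."""
    rng = random.Random(seed)
    per_block = [load_difference(block) for block in blocks]
    return [
        sum(rng.choices(per_block, k=len(per_block))) for _ in range(n_resamples)
    ]


def percentile_interval(estimates, level=0.95):
    """Percentile confidence interval from bootstrap estimates."""
    ordered = sorted(estimates)
    last = len(ordered) - 1
    lower = ordered[int(round((1 - level) / 2 * last))]
    upper = ordered[int(round((1 + level) / 2 * last))]
    return lower, upper


# Toy data (invented): three genomic blocks of (p_afr, p_nafr) pairs.
blocks = [
    [(0.60, 0.40), (0.10, 0.20)],
    [(0.30, 0.35)],
    [(0.05, 0.00), (0.50, 0.55)],
]
estimates = block_bootstrap(blocks, n_resamples=1000)
interval = percentile_interval(estimates)
```

Resampling blocks rather than individual variants keeps nearby, correlated variants together, so the interval is not artificially narrowed by linkage. An interval that spans zero, as reported after the ancestral sequences were included, indicates no detectable load difference.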

Although we found no CNV or SNV load differences between populations, we examined whether the relative proportion of base pairs differing among individuals derived from CNVs versus SNVs showed any population-specific trends. We calculated the number of base pairs varying between all pairs of individuals assessed in our study contributed either from SNVs or from deletions, calculating the DEL-bp/SNV-bp ratio. As expected, the number of base pairs differing between individuals by deletions or by SNVs independently was always higher among African individuals when compared with other populations. Unexpectedly, the ratio of deletion-bp to SNV-bp was substantially higher within non-African populations (mean of 1.27 compared to 1.14; Fig. 5, C and D). This relative increase in deleted base pairs was most pronounced among non-African populations, which have experienced more recent genetic bottlenecks (e.g., Siberian and Amerindian). Given the absence of a significant difference in the deletion load comparing African and non-African populations, there is no reason to believe that this finding is due to differences in the effectiveness of selection against deletions since the populations separated. However, selection places a downward pressure on the allele frequencies of both deletions and SNVs, with the pressure being stronger for deletions because the selection coefficients are stronger on average. As has been previously shown for SNVs, different allele-frequency spectra for deletions in contrast to SNVs have the potential to interact with the differences in demographic history across populations—even without differences in the effectiveness of selection after population separation—to contribute to observed differences in the apportionment of genetic variation among human populations (41).
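The pairwise DEL-bp/SNV-bp ratio can be illustrated with a small sketch. The exact counting scheme used in the study is not spelled out in this section, so the example makes a simplifying assumption: a variant contributes its full length whenever two individuals carry different genotypes at it, with each SNV counting as 1 bp. The data layout, function names, and toy values are hypothetical.

```python
from itertools import combinations


def differing_bp(genotypes_a, genotypes_b, lengths):
    """Base pairs at which two individuals differ, summed over variants."""
    return sum(
        length
        for ga, gb, length in zip(genotypes_a, genotypes_b, lengths)
        if ga != gb
    )


def del_to_snv_ratio(ind_a, ind_b, deletion_lengths):
    """Ratio of deletion bp to SNV bp differing between two individuals.

    Returns None when the pair shares every SNV genotype.
    """
    del_bp = differing_bp(ind_a["del"], ind_b["del"], deletion_lengths)
    snv_bp = differing_bp(ind_a["snv"], ind_b["snv"], [1] * len(ind_a["snv"]))
    return del_bp / snv_bp if snv_bp else None


# Toy data (invented): genotypes for three individuals at 2 deletions and 4 SNVs.
deletion_lengths = [1000, 5000]
individuals = {
    "ind_1": {"del": [0, 1], "snv": [0, 1, 2, 0]},
    "ind_2": {"del": [0, 2], "snv": [1, 1, 0, 0]},
    "ind_3": {"del": [1, 2], "snv": [1, 1, 0, 1]},
}
pair_ratios = {
    (a, b): del_to_snv_ratio(individuals[a], individuals[b], deletion_lengths)
    for a, b in combinations(individuals, 2)
}
```

Grouping the resulting pairs by whether both, one, or neither individual is African would give the three distributions compared in Fig. 5C.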

Discussion

Although the mutational properties and selective signatures of SNVs have been explored extensively, similar analyses of CNVs have lagged behind. As a class, duplications show generally poor correlations with SNV density, have poor linkage disequilibrium to SNVs (43, 44), and are less informative as phylogenetic markers but are more likely to be stratified than deletions among human populations. This observation may be explained by the fact that directly orientated duplications show a gradient of elevated mutation rates resulting from nonallelic homologous recombination and, as such, can change their copy-number state more dynamically over short periods of time. This property also makes this class of variation, similar to highly mutable loci such as minisatellites (45), particularly susceptible to homoplasy, that is, identity by state as opposed to identity by descent. Deletions, in contrast, recapitulate most properties of SNVs because they are more likely to exhibit identity by descent as a result of a single ancestral mutation event.

We have provided here sequencing data for the study of human diversity and used this resource to explore patterns of human CNV diversity at a fine scale of resolution (>1 kbp). As expected, human genomes differ more with respect to CNVs than SNVs, and almost one-half of these CNV differences map to regions of segmental duplication. Both deletion and duplication analyses consistently distinguish African, Oceanic, and Amerindian human populations. Africans show the greatest deletion and duplication diversity and have the lowest rate of fixed deletions with respect to ancestral human insertion sequences. Oceanic and Amerindian, in contrast, show greater CNV differentiation, likely as a result of longer periods of genetic isolation and founder effects (46). Among the Oceanic, the Papuan-Bougainville group stands out in sharing more derived CNV alleles in common with Denisova, including a massive interspersed duplication that rose to high frequency over a short period of time.

We find that duplications and deletions exhibit fundamentally different population-genetic properties. Duplications are subjected to weaker selective constraint and are four times more likely to affect genes than deletions (Table 1), indicating that they provide a larger target for adaptive selection. After controlling for reference genome biases, we find no difference in CNV load between human populations when measured on a per-genome basis, which is what matters to disease risk, assuming that CNVs act additively. However, we find that the proportion of human variation that can be ascribed to CNVs rather than to SNVs is greater among non-Africans than among Africans. The biological significance of this difference should be interpreted cautiously and will require association studies to determine its relevance to disease and other phenotypic differences.

Supplementary Materials

www.sciencemag.org/content/349/6253/aab3761/suppl/DC1

Materials and Methods

Supplementary Text

Figs. S1 to S48

Tables S1 to S18

References (52–57)

References and Notes

  1. Materials and methods are available as supplementary materials at Science Online.
  2. Acknowledgments: We are grateful to the volunteers who donated the DNA samples used in this study. This project has been funded in part with federal funds from the National Cancer Institute, NIH, under contract HHSN26120080001E. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does the mention of trade names, commercial products, or organizations imply endorsement by the U.S. government. This research was supported in part by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research. This work was also partly supported by NIH grant 2R01HG002385 and a grant (11631) from the Paul G. Allen Family Foundation to E.E.E. The sequencing for this study was supported by a grant from the Simons Foundation to D.R. (SFARI 280376) and by a HOMINID grant from the NSF to D.R. (BCS-1032255). T.K. is supported by a European Research Council Starting Investigator grant (FP7 - 26213). R.S. and S.D. received support from the Ministry of Education and Science, Russian Federation (14.Z50.31.0010). H.S., E.M., R.V., and M.M. are supported by Institutional Research Funding from the Estonian Research Council IUT24-1 and by the European Regional Development Fund (European Union) through the Centre of Excellence in Genomics to Estonian Biocentre and University of Tartu. S.A.T. is supported by NIH grants 5DP1ES022577 05, 1R01DK104339-01, and 1R01GM113657-01. C.T.-S. is supported by Wellcome Trust grant 098051. C.M.B. is supported by the NSF (award numbers 0924726 and 1153911). E.E.E. and D.R. are investigators of the Howard Hughes Medical Institute. Data are deposited into ENA (PRJEB9586 or ERP010710), and variant calls are deposited in dbVar (PRJNA285786). E.E.E. is on the scientific advisory board of DNAnexus, Incorporated, and is a consultant for Kunming University of Science and Technology (KUST) as part of the 1000 China Talent Program.