Large-Scale Copy Number Polymorphism in the Human Genome


Science  23 Jul 2004:
Vol. 305, Issue 5683, pp. 525-528
DOI: 10.1126/science.1098918


The extent to which large duplications and deletions contribute to human genetic variation and diversity is unknown. Here, we show that large-scale copy number polymorphisms (CNPs) (about 100 kilobases and greater) contribute substantially to genomic variation between normal humans. Representational oligonucleotide microarray analysis of 20 individuals revealed a total of 221 copy number differences representing 76 unique CNPs. On average, individuals differed by 11 CNPs, and the average length of a CNP interval was 465 kilobases. We observed copy number variation of 70 different genes within CNP intervals, including genes involved in neurological function, regulation of cell growth, regulation of metabolism, and several genes known to be associated with disease.

Many of the genetic differences between humans and other primates are a result of large duplications and deletions (1–3). From these observations, it is reasonable to expect that differences in gene copy number could be a significant source of genetic variation between humans. A few examples of large duplication polymorphisms have been reported (4). However, because of previous limitations in the power to determine DNA copy number at high resolution throughout the genome, the extent to which copy number polymorphisms (CNPs) contribute to human genetic diversity is unknown.

In our previous studies of human cancer with the use of representational oligonucleotide microarray analysis (ROMA), we have detected many genomic amplifications and deletions in tumor genomes when analyzed in comparison to an unrelated normal genome (5), but some of these genetic differences could be due to germline CNPs. To correctly interpret genomic data relating to cancer and other diseases, we must distinguish abnormal genetic lesions from normal CNPs.

We used ROMA to investigate the extent of copy number variation between normal individuals. ROMA measures the relative concentration of DNA in two samples by hybridizing differentially labeled samples to a set of probes. Briefly, the complexity of the samples is reduced by making Bgl II genomic representations, consisting of small (200 to 1200 base pair) Bgl II restriction fragments amplified by adaptor-mediated polymerase chain reaction of genomic DNA (6). Oligonucleotide microarray probes are designed in silico from the human genome sequence assembly to be complementary with these fragments and are further optimized by performance (7). Microarrays are used to analyze genomic representations of unrelated individuals. Hybridization data are analyzed with a hidden Markov model (HMM) that is designed to distinguish differences between the DNA copy number and other variation in probe ratios, which can result from experimental noise or sequence polymorphisms at the restriction endonuclease sites used to make the representations (8).
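The role of the HMM can be illustrated with a minimal sketch. The three states, their expected log2 ratios, the noise level, and the transition probability below are illustrative assumptions, not the parameters of the model used in this study (which is described in the supporting material). The sketch only shows the general idea: a run of consecutive shifted probes is called as a copy number change, whereas a single deviating probe, such as one affected by a restriction-site polymorphism, is absorbed as noise.

```python
import math

# Hypothetical three-state model. All values here are assumptions chosen
# for illustration; they are not taken from the study.
STATES = ("loss", "neutral", "gain")
MEANS = {"loss": -0.5, "neutral": 0.0, "gain": 0.5}  # expected log2 ratio per state
SIGMA = 0.25   # assumed standard deviation of probe log2 ratios
P_STAY = 0.99  # assumed probability that adjacent probes share a state


def log_gauss(x, mean, sigma):
    """Log density of a normal distribution at x."""
    return -0.5 * ((x - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def viterbi(log_ratios):
    """Most likely state path for probe log2 ratios given in genome order."""
    if not log_ratios:
        return []
    p_switch = (1.0 - P_STAY) / (len(STATES) - 1)
    log_trans = {
        (a, b): math.log(P_STAY if a == b else p_switch)
        for a in STATES
        for b in STATES
    }
    # Uniform prior over the state of the first probe.
    score = {
        s: math.log(1.0 / len(STATES)) + log_gauss(log_ratios[0], MEANS[s], SIGMA)
        for s in STATES
    }
    backpointers = []
    for x in log_ratios[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_trans[(p, s)])
            pointer[s] = best_prev
            new_score[s] = (
                score[best_prev] + log_trans[(best_prev, s)] + log_gauss(x, MEANS[s], SIGMA)
            )
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path


# Simulated data: one isolated outlier probe and one run of eight shifted probes.
ratios = [0.0] * 10 + [0.5] * 8 + [0.0] * 10
ratios[3] = 0.6
print(viterbi(ratios))
```

With these assumed parameters, the cost of switching states outweighs the evidence from one outlier probe but not the evidence from eight consecutive probes, so only the run is reported as a gain.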

Observed differences in the copy number of genome segments between samples from two individuals could reflect germline differences or somatic variation. Therefore, we sampled multiple tissues and Epstein-Barr virus–immortalized lymphoblastoid cell lines (LCLs) from a subset of the donors in this study (8). By comparing the variants detected in the same donor, we determined that somatic mutations occurring in whole blood and LCLs were located exclusively within gene clusters encoding T cell receptors or immunoglobulins (fig. S1 and table S2), which most likely reflects normal V(D)J-type recombination in T cells and B cells, respectively. Therefore, the use of blood and LCLs as sources of genetic material for this study was not problematic.

In experiments with Bgl II representations, we identified 210 differences in 20 donors (excluding somatic differences, Fig. 1). For the sake of simplicity, overlapping CNPs from different experiments were assumed to represent the same polymorphism even if they did not overlap perfectly. Based on these criteria, we identified a nonredundant set of 71 CNPs (table S1).
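The rule of treating overlapping calls as one polymorphism can be sketched as an interval merge. This is a minimal illustration, assuming each call is a hypothetical `(chromosome, start, end)` tuple and that any overlap, however small, joins two calls; the coordinates in the example are invented and the function name `merge_overlapping` is not from the study.

```python
def merge_overlapping(calls):
    """Collapse overlapping (chrom, start, end) calls into nonredundant regions."""
    merged = []
    for chrom, start, end in sorted(calls):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous region on the same chromosome: extend it.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Invented coordinates (in kb) from three hypothetical experiments.
calls = [
    ("chr6", 100, 500),
    ("chr6", 450, 900),    # overlaps the first call imperfectly
    ("chr8", 100, 200),
    ("chr6", 2000, 2500),
]
print(merge_overlapping(calls))
```

One consequence of this rule is that a chain of partially overlapping calls is reported as a single region spanning all of them, so the nonredundant count is a lower bound on the number of distinct polymorphisms.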

Fig. 1.

Genome-wide map of CNPs identified by ROMA. The position of all CNPs (excluding somatic differences) is shown. CNPs identified in multiple individuals (by Bgl II–ROMA) are indicated in yellow, and CNPs observed in only one individual are indicated in red. Additional CNPs identified by one Hind III–ROMA experiment are indicated in blue. Symbols denoting CNPs are not drawn to scale. Genome assembly gaps in pericentromeric and satellite regions are indicated by gray boxes. Genomic regions where recurring de novo rearrangements cause the developmental disorders Prader-Willi and Angelman syndromes, cat eye syndrome, DiGeorge/velocardiofacial syndrome, and spinal muscular atrophy are labeled A, B, C, and D, respectively.

Nine of twelve CNPs were unambiguously confirmed by cytogenetic analysis (Fig. 2 and fig. S2). Five CNPs were found to be hemizygous deletions, and four were duplications. Figure 2 presents array data and fluorescence in situ hybridization (FISH) confirmation for CNPs 15, 21, 32, and 56, which encompass the full length of genes RAB6C, NT_016297.17, DUSP22, and PPYR1, respectively. By interphase FISH, we confirmed a deletion of RAB6C (Fig. 2B), a duplication of PPYR1 (Fig. 2D), and a deletion of NT_016297.17 (Fig. 2F). By metaphase FISH, CNP32 was determined to involve an interchromosomal duplication of a region containing the DUSP22 gene on 6p25 and 16p11.2 (Fig. 2, G, H, and I). FISH results were inconclusive for CNPs 68, 69, and 73. In these cases, FISH signals were too numerous, and a consensus copy number could not be reached. CNPs 68 and 69 were validated by other means (table S2); thus, 11 of 12 CNPs were validated by one of two methods, which is consistent with a false positive rate of about 10%.

Fig. 2.

Validation of ROMA results by FISH. (A), (C), (E), and (G) show CNPs identified by ROMA and include the CNP identification number, the name of one gene located entirely within the interval, and the experiment name. (B), (D), (F), (H), and (I) show cytogenetic analyses of one or both individuals with probes that target the same CNP intervals. In all panels, the polymorphic probe is labeled red. In interphase cells [(B), (D), and (F)], a control probe (labeled green) was also included to confirm that cells were diploid. (B) CNP15 probe in GM11322 cells; (D) CNP56 probe in GM10470 cells; (F) CNP21 probe in GM10470 cells; (H) CNP32 probe in GM10540 cells; (I) CNP32 probe in SKN1 cells. In (I), one parental copy of chromosome 16 in SKN1 lacks the duplication (arrow).

Additional validation of CNPs was obtained by microarray analysis of genomic representations made with a different restriction enzyme. A pair of individuals analyzed by Bgl II–ROMA (experiment JA437, table S1) was also analyzed with Hind III representations and arrays of Hind III probes (JT393). The results of Bgl II–ROMA and Hind III–ROMA were generally in agreement (8). In addition, because of differences in the genomic distribution of Hind III probes, some unique CNPs were identified, bringing the total of copy number differences identified in this study to 221 and the total of unique CNPs to 76.

Our study population consisted of 20 individuals from a variety of geographic backgrounds. These results provide an indication of the extent of human copy number variation and the frequency of the most common alleles. In all experiments, there were a total of 221 observed copy number differences (not including somatic differences) comprising a nonredundant set of at least 76 CNPs (Fig. 1 and table S2). There was an average of 11 CNPs between two individuals, with an average length of 465 kb and a median length of 222 kb. At least five of these polymorphisms have been described previously (9–13). The overwhelming majority of CNPs were previously unidentified. About half of the above CNPs were recurrent in multiple individuals.

The CNPs observed here represent only a subset of the total CNPs in the population. For example, some CNPs that have previously been reported were not observed in this study (14, 15). Undoubtedly, an increase in the size of our study population would reveal additional CNPs, as would an increase in the density of probe coverage. By comparing Hind III and Bgl II results and analyzing Bgl II results with replicate samples, we estimate that in any given experiment we may miss up to 30% of the large-scale copy number changes that we ought to find (table S3). In addition, there are theoretical limits to the detection of CNPs with only 85,000 probes. Based on Poisson distributions of probes and the probabilities of detecting CNPs of given lengths, we estimate that there are 226 nonredundant CNPs in our study population covering 44 Mb of the genome (table S4).
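The dependence of detection on CNP length can be illustrated with a simple Poisson calculation. This is a sketch under stated assumptions rather than the calculation behind table S4: probes are assumed to fall at random with a mean spacing of 35 kb (the approximate resolution of the arrays used here), and a CNP is assumed to be detectable only if it contains at least `k` probes, where `k` is a hypothetical threshold chosen for illustration.

```python
import math


def prob_at_least_k_probes(length_kb, k, mean_spacing_kb=35.0):
    """P(N >= k) for N ~ Poisson(length / mean spacing).

    N is the number of probes falling inside an interval of the given length.
    The threshold k and the random-placement model are assumptions.
    """
    lam = length_kb / mean_spacing_kb
    p_fewer = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - p_fewer


# Compare short and long intervals under an assumed threshold of 3 probes.
for length_kb in (50, 100, 222, 465):
    p = prob_at_least_k_probes(length_kb, k=3)
    print(f"{length_kb:>4} kb: P(at least 3 probes) = {p:.2f}")
```

Under this model, intervals near the 100 kb scale frequently contain too few probes to be called, which is one reason short CNPs are expected to be underrepresented in the observed set.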

CNPs were widely distributed throughout the genome. Some locations such as 6cen, 8pter, and 15q13-14 contained clusters of three to four CNPs, which may be evidence that these regions are “hotspots” of copy number variation. We observed no CNPs on the X chromosome. This may be due to the underrepresentation of females in our study population (16 donors and SKN1 were male). A larger study would be necessary to determine if selective pressure against copy number variation is greater on the X chromosome than on autosomes, or if it is especially apparent in the X chromosomes present in males.

CNPs were frequently located near other types of chromosomal rearrangements. Some CNPs occurred within genomic regions where recurring de novo rearrangements are causes of developmental disorders, specifically, Prader-Willi and Angelman syndromes, cat eye syndrome, DiGeorge/velocardiofacial syndrome, and spinal muscular atrophy (labeled A, B, C, and D, respectively, in Fig. 1). These CNPs are not directly implicated in the above diseases, but they may reflect the instability of these genomic regions. A preliminary analysis of the duplication content of CNPs determined that 30% of the sequence within intervals of polymorphic deletions consists of segmental duplications, a sixfold enrichment relative to the genome average. As would be expected, a greater enrichment (12-fold) was observed for polymorphic duplications (16). The former is consistent with previous observations of a positive correlation between segmental duplications and microdeletions (17, 18). A more thorough characterization of CNP junctions at the sequence level is necessary to determine a causal relationship between the two. Fixed segmental duplications, unstable regions, and CNPs are probably manifestations of the same underlying process. Just as chromosomal rearrangements have played a significant role in primate evolution and human disease, structural polymorphisms may play an analogous role in determining genetic diversity within the human population.

We observed copy number variation of 70 genes (table S5). Variation in the dosage of individual genes can lead to a profound phenotype; for instance, the familial inheritance of gene copy number variants is a cause of some neurological disorders (19, 20). Notably, one of the donors in this study was determined to carry a deletion of COH1 (CNP48), a gene whose inactivation causes the autosomal recessive disease Cohen syndrome (21). Several additional CNPs contained genes involved in neurodevelopment, such as GTF2H2, ATOH1, CASPR3, CHRFAM7A, and NCAM2. Other compelling examples from table S5 include the Enhancer of Split (TLE1) and RAB6C, which are implicated in leukemia and drug resistance in breast cancer, respectively (22, 23). Lastly, some CNPs identified in this study involve genes with a known influence on “normal” human phenotypes. For example, we observed triplication of the neuropeptide-Y4 receptor (PPYR1, Fig. 2, C and D), a gene that is directly involved in the regulation of food intake and body weight (24). Thus, a relationship between CNPs and susceptibility to health problems such as neurological disease, cancer, and obesity is an intriguing possibility.

Owing to their size and gene content, CNPs are unlikely to be selectively neutral. Indeed, a large proportion of CNPs observed in this study are rare (i.e., they occur once in 20 donors). A preliminary analysis of the comparative frequency of variants (25) suggests that CNPs as a class are under negative selection. However, more data are required to reach this conclusion with confidence.

As is evident from ROMA, there is considerable structural variation in the human genome, most of which was not previously apparent by other methods of genomic analysis. Previous studies using array comparative genomic hybridization have identified a handful of large-scale polymorphisms (26, 27). For example, by using a 1-Mb-resolution bacterial artificial chromosome (BAC) array, Shaw-Smith et al. detected five inherited CNPs from a set of 50 patients with developmental disabilities (27). The ROMA chips used here have a resolution of approximately one probe every 35 kb, which accounts for much of the enhanced sensitivity of our method. Furthermore, by designing oligonucleotide probes that are free of repetitive sequence, by empirically selecting 85,000 probes that yield maximum signal, and by reducing the complexity of the genome, ROMA achieves a ratio of signal-to-background superior to that which can be attained by hybridization of total genomic DNA to an array of BACs. Thus, ROMA has additional advantages even compared with arrays with “complete” coverage of the genome, such as the 32,000-probe tiling-path BAC array (28). Further developments of ROMA are under way, including a 380,000-probe microarray, which promise to reveal a great deal more about large-scale polymorphism in the human genome.
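The relation between probe count and resolution is simple arithmetic, sketched below. The genome size of 3.0 Gb is an assumed round figure, and probes are treated as evenly spread, which real arrays are not.

```python
GENOME_SIZE_BP = 3.0e9  # assumed approximate size of the human genome


def mean_spacing_kb(n_probes, genome_size_bp=GENOME_SIZE_BP):
    """Average distance between probes, in kilobases, if spread evenly."""
    return genome_size_bp / n_probes / 1000.0


for n_probes in (85_000, 380_000):
    print(f"{n_probes:>7} probes: about {mean_spacing_kb(n_probes):.0f} kb between probes")
```

Under these assumptions, 85,000 probes correspond to roughly 35 kb between probes, in line with the resolution quoted above, and 380,000 probes would correspond to roughly 8 kb.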

Supporting Online Material

Materials and Methods

SOM Text

Figs. S1 and S2

Tables S1 to S5

References and Notes
