Human Lineage–Specific Amplification, Selection, and Neuronal Expression of DUF1220 Domains


Science  01 Sep 2006:
Vol. 313, Issue 5791, pp. 1304-1307
DOI: 10.1126/science.1127980


Extreme gene duplication is a major source of evolutionary novelty. A genome-wide survey of gene copy number variation among human and great ape lineages revealed that the most striking human lineage–specific amplification was due to an unknown gene, MGC8902, which is predicted to encode multiple copies of a protein domain of unknown function (DUF1220). Sequences encoding these domains are virtually all primate-specific, show signs of positive selection, and generally become more highly amplified the closer a species is, evolutionarily, to humans, where the greatest number of copies (212) is found. DUF1220 domains are highly expressed in brain regions associated with higher cognitive function and, in the brain, show neuron-specific expression, preferentially in cell bodies and dendrites.

Extreme gene duplication in a species-specific manner, followed by divergence and functional specialization, can be an important factor in the evolution of phenotypic traits unique to that species (1). Copy number variations between human and chimpanzee have been discovered with the use of a draft sequence of the chimpanzee genome (2), although primate outgroup information is currently limited to draft sequence from only one other species, the macaque (3). Draft sequences are prone to misassembly of recently duplicated sequences (4), a limitation that is the most severe for the most evolutionarily recent (i.e., similar) duplications. A complementary approach is cDNA array–based comparative genomic hybridization (aCGH) (5). We previously used cDNA aCGH to carry out genome-wide gene copy number comparisons between human and great ape species (5), and identified 134 genes showing human lineage–specific (HLS) increases and six genes showing HLS decreases.

To obtain an independent estimate of the copy number of each HLS gene, we determined the full insert sequences of the cDNAs (table S1) and used these as BLAT queries to search a recent human genome assembly (Build 35) (6) as well as available genome draft sequences from chimp (2) and macaque (3). The great majority (86.4%) of genes predicted by cDNA aCGH to have an HLS increase in copy number produced more BLAT hits (score >200) in the human genome than in either chimp or macaque (table S2), and 44 of these (31%) had more than five BLAT hits in the human genome (Fig. 1A).
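The comparison described above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the input layout (tuples of query, species, and score) and the function names `count_hits` and `has_human_excess` are assumptions made for the example, and only the score cutoff of 200 comes from the text.

```python
from collections import Counter

def count_hits(hits, min_score=200):
    """Count BLAT hits per (query, species) with score above min_score.

    `hits` is an iterable of (query, species, score) tuples; this layout
    is hypothetical and chosen only for the example.
    """
    counts = Counter()
    for query, species, score in hits:
        if score > min_score:  # the text uses "score >200"
            counts[(query, species)] += 1
    return counts

def has_human_excess(counts, query, others=("chimp", "macaque")):
    """True if the query has more human hits than hits in each other species."""
    human = counts[(query, "human")]
    return all(human > counts[(query, species)] for species in others)

# Made-up hits for one hypothetical cDNA query, for illustration only.
example_hits = [
    ("cDNA_A", "human", 950),
    ("cDNA_A", "human", 410),
    ("cDNA_A", "human", 150),   # below the cutoff, not counted
    ("cDNA_A", "chimp", 900),
    ("cDNA_A", "macaque", 180), # below the cutoff, not counted
]
example_counts = count_hits(example_hits)
```

With these made-up hits, `cDNA_A` has two qualifying human hits, one in chimp, and none in macaque, so `has_human_excess(example_counts, "cDNA_A")` is true.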

Fig. 1.

Cross-species BLAT survey of HLS cDNAs and organization of the MGC8902 gene. (A) BLAT searches were performed using full cDNA insert sequences for 140 HLS genes (5) as queries. The IMAGE clones that yielded >5 BLAT hits in the human genome are shown. BLAT hits with span sizes exceeding the size of the cDNA query were scored as potentially containing introns. Potentially “intronless” BLAT hits are shown in white. The asterisk denotes BLAT hits associated with the ribosomal protein gene RPL23AP7, which had hit totals of 150, 144, and 133 for human, chimp, and macaque, respectively. All of these were intronless. (B) The genomic exon/intron organization of MGC8902 and the predicted domain structure of the translated protein. A representative DUF1220 genomic repeat unit is also shown.

After removal of all BLAT hits predicted to be intronless, one gene, MGC8902 (cDNA IMAGE clone 843276), showed the most striking human-specific increase, with 49, 10, and 4 hits found in human, chimp, and macaque, respectively (Fig. 1A). All human hits associated with MGC8902 (49/49) were predicted to be nonretroposed copies. It was also ranked as the fifth highest HLS aCGH signal out of the 134 genes predicted to have HLS increases in copy number (7), and contains six predicted DUF1220 domains (Fig. 1B). The genomic sequences predicted to encode DUF1220 domains typically show a unique signature of an evenly spaced two-exon repeat unit (Fig. 1B). A recent report treats this exon pair as a new repeat that is part of a gene family termed NBPF (8). The repeat is inclusive of the DUF1220 domain but also contains additional protein-coding sequences that may not share all the biological and evolutionary characteristics of DUF1220.
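The intron filter used here follows the rule given in the Fig. 1 legend: a BLAT hit whose genomic span exceeds the size of the cDNA query is scored as potentially containing introns, and the remaining hits are treated as potentially intronless (that is, candidate retroposed copies). A minimal Python sketch of that rule, assuming each hit is a hypothetical (name, span) pair:

```python
def split_by_intron_status(hits, query_size):
    """Split hits into (potentially intron-containing, potentially intronless).

    `hits` is a list of (name, span) pairs, where span is the genomic length
    covered by the hit; this layout is an assumption for the example.
    A span larger than the cDNA query implies gaps in the alignment,
    which the legend scores as potential introns.
    """
    with_introns = [hit for hit in hits if hit[1] > query_size]
    intronless = [hit for hit in hits if hit[1] <= query_size]
    return with_introns, intronless

# Made-up spans for a hypothetical 1,500-bp cDNA query.
example_hits = [("hit1", 12000), ("hit2", 1480), ("hit3", 30500)]
kept, removed = split_by_intron_status(example_hits, query_size=1500)
```

Only the `kept` list would count toward a comparison such as the 49, 10, and 4 hits reported for MGC8902.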

It has been estimated that 34 different human genes encode DUF1220 domains (table S3). Pfam (Version 17.0) (9) predicts that 60 human DUF1220-containing proteins exist, containing a total of 271 DUF1220 domains (fig. S1) derived from 11 seed domains (10) (fig. S2A). Estimates based on cDNA sequences indicated that 22 genes exist, including six pseudogenes (8). None of these cDNAs showed perfect identity to human genomic sequences, raising the possibility that this count is an underestimate. Recent additional sequencing of chromosome 1 identified at least 15 gene sequences that encode DUF1220 domains, although several sequence gaps still remain in DUF1220-encoding regions (11).

The amino acid sequences of each of the 11 DUF1220 seed domains were next used as BLAT queries against genome sequences from several species (table S4). The 11 seed domains showed no matches outside of mammals, and 10 of the 11 were primate-specific, with the highest number of copies always found in human (fig. S2B). The remaining seed domain (O75042) was found in primate and nonprimate mammals, usually as a single domain encoded by a single-copy gene [Myomegalin/PDE4DIP (12)] that also encodes a spindle-associated domain (fig. S2C). The most human BLAT hits were obtained with three domains found in one predicted protein, Q8IX62. One of these (Q8IX62_Human/17-83) had 90 hits in human but only 16 and 11 in chimp and macaque, respectively. Of the human hits, 37 (41%) were 100% matches (fig. S2B), by far the highest frequency of identical human matches found for any of the 11 seed domains. For macaque, a similar number of BLAT hits was obtained whether the January 2005 or January 2006 sequence assembly was used (table S5).

To provide an independent estimate of the frequency of this domain, we carried out quantitative polymerase chain reaction (QPCR) analysis on multiple individuals from each of several species, using a primer and probe set with sequences that are identical to sequences in the human and chimp genomes (Fig. 2 and table S6). Consistent with BLAT results and previously reported aCGH data, QPCR analysis also found that the human genome had significantly more copies than that of any other species (P < 0.01) and, generally, the evolutionarily closer the species was to human, the more DUF1220 domains were encoded in its genome. Interhominoid cDNA aCGH data reported previously (5) for cDNA IMAGE 843276 [average log2 ratio: bonobo (3) = –1.79; chimpanzee (4) = –1.98; gorilla (3) = –1.19; orangutan (3) = –2.74] are in very close agreement (r = 0.9886) with the cross-species QPCR data presented in Fig. 2. Taken together, aCGH, BLAT, and QPCR data indicate that the number of DUF1220 copies is highly expanded in humans, reduced in African great apes, further reduced in orangutan and Old World monkeys, single-copy in nonprimate mammals, and absent in nonmammalian species. Some intraspecies copy number variability was apparent, although a survey of a limited number of individuals (22 individuals from diverse human populations) revealed no population-specific trends (table S6).
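To make the aCGH values above easier to read, the following Python sketch converts the reported average log2 ratios into linear ratios and provides a small Pearson correlation helper of the kind that could be used for an aCGH-versus-QPCR comparison. It assumes the log2 ratios express ape signal relative to human, which fits the negative values but is not stated explicitly here. The QPCR values are in table S6 and are not reproduced, so no correlation is computed, and array ratios should not be read as exact copy-number ratios.

```python
import math

# Average log2 ratios quoted in the text for cDNA IMAGE 843276.
acgh_log2 = {
    "bonobo": -1.79,
    "chimpanzee": -1.98,
    "gorilla": -1.19,
    "orangutan": -2.74,
}

def log2_ratio_to_fold(log2_ratio):
    """Convert a log2 ratio to a linear ratio (assumed here: ape / human)."""
    return 2.0 ** log2_ratio

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)

folds = {species: log2_ratio_to_fold(v) for species, v in acgh_log2.items()}
```

Under that assumption, all four linear ratios fall below 1, with orangutan lowest and gorilla highest, matching the ordering of the log2 values.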

Fig. 2.

QPCR-based estimation of the number of DUF1220 domains found within different species. QPCR was carried out to survey the frequency of DUF1220 domain (Q8IX62 17-83) sequences in various primate species. Corresponding numerical values can be found in table S6.

Human genomic locations predicted from the BLAT analysis using the 11 seed domains (fig. S3, A and B) were in general agreement with fluorescence in situ hybridization (FISH) analysis (fig. S4) and two recent reports (8, 11), positioning the majority of DUF1220 (NBPF) sequences at 1q21.1, a complex genomic region immediately adjacent to the pericentromeric C-band 1q12. Additional sequences are found at 1p36 and 1p13.3. After eliminating redundant (overlapping) positions, we identified 212, 37, and 30 unique DUF1220-positive BLAT hits in human, chimp, and rhesus, respectively, along with only one each for mouse and rat.
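One common way to eliminate redundant (overlapping) positions is to merge overlapping intervals on each chromosome and count the merged loci. The text does not describe the exact procedure used, so the Python sketch below is an assumption. It also assumes hits are (chromosome, start, end) tuples in half-open coordinates, so intervals that merely touch are counted separately.

```python
def count_unique_loci(hits):
    """Count loci after merging overlapping hits on the same chromosome.

    `hits` is an iterable of (chrom, start, end) tuples with half-open
    coordinates [start, end); this layout is an assumption for the example.
    """
    by_chrom = {}
    for chrom, start, end in hits:
        by_chrom.setdefault(chrom, []).append((start, end))

    total = 0
    for intervals in by_chrom.values():
        intervals.sort()
        current_end = None
        for start, end in intervals:
            if current_end is None or start >= current_end:
                # No overlap with the running locus: start a new one.
                total += 1
                current_end = end
            else:
                # Overlap: extend the running locus if this hit reaches further.
                current_end = max(current_end, end)
    return total

# Made-up coordinates: the first two chr1 hits overlap and collapse into one.
example_hits = [
    ("chr1", 100, 200),
    ("chr1", 150, 250),
    ("chr1", 250, 300),
    ("chr2", 100, 200),
]
unique_loci = count_unique_loci(example_hits)
```

Here the four made-up hits collapse to three loci. Applied to the seed-domain hits of one species, a count of this kind corresponds to the per-species totals reported above.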

Evolutionary analysis (13) was performed with a nonredundant set of the human, chimp, rhesus, mouse, and rat DUF1220 nucleotide sequences derived from the BLAT searches described above (table S7). These sequences were filtered for frame-shift insertions and aligned, and the resulting 256 sequences were used for construction of a phylogenetic tree (fig. S5). In addition, Ka/Ks ratios (ratios of the rate of amino acid substitution to silent substitution) were determined for each pairwise combination of sequences (Fig. 3, A and B, and table S8). The domain that is found as a single copy in nonprimate mammals, O75042, is the likely ancestral domain, consistent with a phylogenetic analysis of the 11 DUF1220 seed domains, with the primate-specific domains appearing more recently. The ladder-like nature of the phylogenetic tree suggests that serial domain amplification and subsequent divergence are the rule in this large set of repeats. Also, 33% (10,583/32,131) of pairwise comparisons showed a Ka/Ks ratio of 1 or greater, a traditional signature of positive selection (14).
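The summary statistic quoted above, the share of pairwise comparisons with Ka/Ks of 1 or greater, can be illustrated with a short Python sketch. It assumes Ka and Ks have already been estimated for each pair by the evolutionary analysis software (13), which is not reproduced here. The function names and the choice to skip pairs with Ks = 0 are assumptions for the example.

```python
def ka_ks_ratio(ka, ks):
    """Return Ka/Ks, or None when Ks is zero and the ratio is undefined."""
    if ks == 0:
        return None
    return ka / ks

def fraction_at_or_above(pairs, threshold=1.0):
    """Fraction of (Ka, Ks) pairs whose defined ratio is >= threshold."""
    ratios = [ka_ks_ratio(ka, ks) for ka, ks in pairs]
    defined = [r for r in ratios if r is not None]
    if not defined:
        return 0.0
    return sum(r >= threshold for r in defined) / len(defined)

# Made-up (Ka, Ks) estimates for four sequence pairs, for illustration only.
example_pairs = [(0.10, 0.20), (0.30, 0.15), (0.05, 0.05), (0.02, 0.0)]
example_fraction = fraction_at_or_above(example_pairs)

# The proportion reported in the text: 10,583 of 32,131 comparisons.
reported_percent = 100 * 10583 / 32131
```

In the made-up data, two of the three defined ratios reach the threshold. The reported counts give about 32.9%, which the text rounds to 33%.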

Fig. 3.

Ka/Ks values of DUF1220-encoding sequences. (A) Sequences with Ka/Ks values 0 through 2.0. Rodent-primate and primate-primate comparison means are shown (arrows); human-specific comparisons produce an even higher mean. Of these comparisons, 3106 have a Ka/Ks value of 0, and 2035 have a Ka/Ks value of 2.0 or greater; the scale here is limited so that the major patterns can be seen. (B) Sequences with Ka/Ks values 2.25 through 8.0; all of these comparisons are primate homologous comparisons, and most are human-human pairwise comparisons.

On average, primate-primate homologous comparisons had a higher ratio of nonsynonymous to synonymous changes (Ka/Ks mean = 0.91) than did rodent-primate homologous comparisons (Ka/Ks mean = 0.61), indicative of either a higher level of positive selection or a relaxation of functional constraint (Fig. 3A). The average Ka/Ks value for primate homologous comparisons was unusually high (0.91) relative to general estimates of primate evolutionary rate (15), with two human-versus-rhesus comparisons producing the highest values (>7.1) (Fig. 3B). In contrast, Ka/Ks analysis of non-DUF1220 sequences from a DUF1220-containing gene did not appear to show evidence of positive selection (8).

Western blot analysis was carried out on a panel of normal adult human tissues, using an affinity-purified antibody directed against a 20–amino acid peptide derived from a primate-specific DUF1220 domain. A heavy band was visible at ∼36 kD in heart, brain, spleen, skeletal muscle, and small intestine (Fig. 4A), which was blocked in all tissues, except in the skeletal muscle, by the adsorption control (fig. S6A). This same band was faintly present in kidney, lung, stomach, colon, and rectum. In addition, other heavy bands were visible between 25 and 40 kD throughout the tissue panel. The same ∼36 kD band was highly expressed in frontal lobe, temporal lobe, parietal lobe, occipital lobe, and cerebellum, whereas it was absent in placenta (Fig. 4B).

Fig. 4.

Western and immunofluorescence analysis of normal adult human tissues with antibody to a peptide derived from a primate-specific DUF1220 domain. (A and B) Western blot analysis of total protein lysates (50 μg) from normal adult human tissues (male and female; ages ranging from 22 to 82 years). Lysates were electrophoresed on 4 to 20% denaturing SDS–polyacrylamide gel electrophoresis gels, transferred to polyvinylidene difluoride membranes, and probed with DUF1220 affinity-purified antibody (A). Further blotting analysis was performed on adult human brain regions with DUF1220 affinity-purified antibody (B). (C to E) Double-label immunofluorescence of DUF1220 antibody in the human cerebellum (30-year-old white female). (C) DUF1220 affinity-purified antibody; Purkinje cells and dendrites were detected. (D) Double labeling with DUF1220 affinity-purified antibody and neurofilament 160 kD; (E) higher magnification of inset in (D). Double labeling with DUF1220 affinity-purified antibody and glial fibrillary acidic protein (GFAP) in (F) hippocampus, (G) cortical regions of the hippocampus, and (H) frontal lobe. (I to K) Neuron-specific DUF1220 signals in (I) temporal lobe, (J) parietal lobe, and (K) occipital lobe. Nuclei are labeled with 4′,6′-diamidino-2-phenylindole (DAPI). P, Purkinje cell; den, dendrite; igl, internal granule layer; ml, molecular layer. Scale bars, 100 μm [(C), (D), and (F) to (I)], 50 μm [(E), (J), and (K)].

Using double-label immunofluorescence, we analyzed normal adult brain regions from several individuals with the same affinity-purified DUF1220 antibody. DUF1220 sequences were consistently found in neurons but not in glia. In the cerebellum, preferential expression was observed in Purkinje cells, where signals were restricted to cell bodies (cytoplasm) and dendrites (Fig. 4, C to E, and fig. S6, B to D). In addition to labeling in the cerebellum, neuron-specific DUF1220 signals were present in the cortical layers of the hippocampus (Fig. 4, F and G). DUF1220 domains were also abundantly expressed in neurons within the neocortex (frontal, parietal, occipital, and temporal lobes), thought to be critical to higher cognitive functions (Fig. 4, H to K).

Although the precise function of genes encoding DUF1220 domains and the domains themselves is at present unknown, the pattern of amplification and location of expression have led us to speculate that the domains and the genes that encode them may be important to cognitive function. In light of the strong DUF1220 expression we observed in neurons of the neocortex, it is intriguing that multiple independent evolutionary processes [brain enlargement, neocortex expansion (16), gene duplication, and domain amplification] can be seen as having individually and cumulatively contributed to increasing the DUF1220-coding potential of the human brain, suggesting that such an increase may have conferred strong selective advantages.

The genomic regions that harbor DUF1220 sequences appear to be particularly complex and, as a result, different genome assemblies differ with respect to the predicted number of DUF1220-encoded sequences. However, two recent genome-wide BAC aCGH cross-species studies (17, 18) independently support the findings reported here that DUF1220-encoding genes show human lineage–specific increases in copy number and appeared with remarkable rapidity. If they indeed are the result of strong positive selection, they may play an important role in human lineage–specific traits (19) and serve to illustrate how certain regions of the genome can undergo episodes of “punctuated” evolution (20).

Supporting Online Material

Materials and Methods

Figs. S1 to S6

Tables S1 to S8


References and Notes
