Report

Coregulation of tandem duplicate genes slows evolution of subfunctionalization in mammals


Science  20 May 2016:
Vol. 352, Issue 6288, pp. 1009-1013
DOI: 10.1126/science.aad8411

Evolutionary maintenance of gene duplications

Genetic redundancy (the maintenance of multiple copies of a gene after duplication) and its relevance to genetic evolution have long been debated. Lan and Pritchard examined gene duplicates within human and other mammalian genomes. The expression of these genes appears to be controlled by dosage balance and tight coregulation of tandem duplicates. They found little evidence for gene copies evincing significantly different expression patterns. However, such changes can evolve later, after gene copies become physically separated within the genome and thus are no longer jointly regulated.

Science, this issue p. 1009

Abstract

Gene duplication is a fundamental process in genome evolution. However, most young duplicates are degraded by loss-of-function mutations, and the factors that allow some duplicate pairs to survive long-term remain controversial. One class of models to explain duplicate retention invokes sub- or neofunctionalization, whereas others focus on sharing of gene dosage. RNA-sequencing data from 46 human and 26 mouse tissues indicate that subfunctionalization of expression evolves slowly and is rare among duplicates that arose within the placental mammals, possibly because tandem duplicates are coregulated by shared genomic elements. Instead, consistent with the dosage-sharing hypothesis, most young duplicates are down-regulated to match expression levels of single-copy genes. Thus, dosage sharing of expression allows for the initial survival of mammalian duplicates, followed by slower functional adaptation enabling long-term preservation.

Gene duplications are a major source of new genes and ultimately of new biological functions (1). However, recently arisen gene duplicates tend to be functionally redundant and thus susceptible to loss-of-function mutations that degrade one of the copies into a pseudogene. The average half-life of new primate duplicates has been estimated at just 4 million years (2). This raises the question of what evolutionary forces govern the persistence of young duplicates.
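To put the 4-million-year half-life in perspective, the following minimal sketch assumes that duplicates are lost at a constant rate, i.e., simple exponential decay. That assumption and the function name `surviving_fraction` are illustrative only and are not taken from reference (2).

```python
def surviving_fraction(age_my: float, half_life_my: float = 4.0) -> float:
    """Fraction of duplicates still intact after `age_my` million years.

    Assumes a constant loss rate (exponential decay), which is an
    illustrative simplification rather than a model from the paper.
    """
    if age_my < 0 or half_life_my <= 0:
        raise ValueError("age must be >= 0 and half-life must be > 0")
    # Each elapsed half-life halves the surviving fraction.
    return 0.5 ** (age_my / half_life_my)
```

Under this assumption, only about 3% of duplicates would remain intact after 20 million years (0.5 ** 5 ≈ 0.031).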

Various models have been proposed to understand why some duplicate pairs do survive over long evolutionary time scales (3). Dosage-balance models focus on the importance of maintaining correct stoichiometric ratios in gene expression between different genes (4–6) and likely explain how gene copies are maintained after whole-genome duplication (WGD), because subsequent gene losses would disrupt dosage balance (6, 7).

Alternatively, functional partitioning of duplicates can occur, either by neofunctionalization (one copy gains new functions) or subfunctionalization (the copies divide the ancestral functions between them). The duplication-degeneration-complementation (DDC) model proposes that complementary degeneration of regulatory elements causes the two copies to be expressed in different tissues, such that both copies are required to provide the overall expression of the ancestral gene (8). Similarly, neofunctionalization of expression could lead to one gene copy gaining function in a tissue where the parent gene was not expressed. Functional divergence may also occur at the protein level (9), but this is thought to be a slow process, with initial divergence more often occurring through changes in gene regulation (10).

It is currently unclear which factors are most important for long-term survival of gene duplications in mammals, where most duplications arise through segmental duplications or retrotranspositions that increase copy numbers of just one or a few genes. These small-scale duplications most likely disrupt overall dosage balance and should thus favor gene loss rather than preservation.

We therefore set out to investigate whether gene expression data across tissues in human and mouse support either model of duplicate preservation. We analyzed RNA-sequencing (RNA-seq) data from 10 individuals for each of 46 diverse human tissues collected by the Genotype-Tissue Expression (GTEx) project (11) and replicated our main conclusions using RNA-seq from 26 diverse mouse tissues (12).

We developed a computational pipeline to identify duplicate gene pairs in the human genome (13). After excluding annotated pseudogenes, we identified 1444 high-confidence reciprocal best-hit duplicate gene pairs with >80% alignable coding sequence and >50% average sequence identity. We used synonymous divergence, dS, as a proxy for divergence time, while noting that divergence of some gene pairs may be affected by nonallelic homologous gene conversion in young duplicates. Additional analyses using the phylogenetic distribution of duplicates to refine date estimates were highly concordant with results based on dS alone (figs. S5 to S7). We estimate that dS for duplicates that arose at the time of the human-mouse split averages ~0.45 and that most pairs with dS > ~0.7 predate the origin of the placental mammals (figs. S3 and S4). Thus, most of our analysis focuses on duplicates that likely arose within the mammalian lineage and postdate the early vertebrate whole-genome duplications.

Accurate measurement of expression in gene duplicates can be challenging if RNA-seq reads map well to both gene copies. Mapping may also be biased if the two copies have differential homology with other genomic locations. To overcome these challenges, we estimated expression ratios using only paralogous positions for which reads from both copies would map uniquely to the correct gene (13). This approach is related to a method for measuring allele-specific expression (14). These strict criteria mean that some very young genes are excluded from our expression analyses as unmappable, but, for the remaining genes, simulations show that our pipeline yields highly accurate, unbiased estimates of expression ratios (fig. S1).

This read-mapping pipeline allowed us to classify duplicates into categories on the basis of their coexpression patterns (13). First, within each pair, we classified the gene with higher overall expression as the “major” gene and its partner as the “minor” gene. We then defined a gene pair as potentially sub- or neofunctionalized if both the major and minor copy are significantly more highly expressed than the other in at least one tissue each (at least a twofold difference and P < 0.001 with paired t test) (Fig. 1A). We refer to pairs with consistent asymmetry as asymmetrically expressed duplicates (AEDs) if the major gene is significantly more highly expressed in at least 1/3 of tissues where either gene is expressed and not expressed at a significantly lower level than its partner in any tissue (Fig. 1B). The remaining duplicates were classified as having no difference, although many of these pairs show weaker levels of asymmetry.
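The classification rules above can be sketched as follows. This is a simplified illustration, not the published pipeline (13): it assumes the per-tissue log2 ratio and paired t-test P value have already been computed, that only tissues where either gene is expressed are passed in, and that the same twofold and P < 0.001 thresholds apply to every "significantly higher" call. The names `TissueTest` and `classify_pair` are hypothetical.

```python
from typing import NamedTuple, Sequence


class TissueTest(NamedTuple):
    log2_major_over_minor: float  # log2(major / minor) expression in one tissue
    p_value: float                # paired t-test P value for that tissue


def classify_pair(tissues: Sequence[TissueTest],
                  min_log2_fold: float = 1.0,
                  max_p: float = 0.001) -> str:
    """Classify one duplicate pair from its per-tissue comparisons.

    `tissues` should contain only tissues where either gene is expressed
    (an assumption of this sketch).
    """
    if not tissues:
        raise ValueError("at least one expressed tissue is required")

    # Tissues where the major copy is significantly (and >= twofold) higher.
    major_higher = sum(
        1 for t in tissues
        if t.p_value < max_p and t.log2_major_over_minor >= min_log2_fold
    )
    # Tissues where the minor copy is significantly (and >= twofold) higher.
    minor_higher = sum(
        1 for t in tissues
        if t.p_value < max_p and t.log2_major_over_minor <= -min_log2_fold
    )

    # Each copy wins in at least one tissue: potential sub-/neofunctionalization.
    if major_higher >= 1 and minor_higher >= 1:
        return "sub_or_neofunctionalized"
    # Major wins in at least 1/3 of tissues and never loses: AED.
    # Integer arithmetic avoids floating-point issues with the 1/3 cutoff.
    if minor_higher == 0 and 3 * major_higher >= len(tissues):
        return "AED"
    return "no_difference"
```

For example, a pair in which the major gene is significantly higher in one of three expressed tissues, and the minor gene is never significantly higher, would be labeled an AED under these assumptions.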

Fig. 1 Expression profiles of duplicate genes.

(A) A gene pair whose expression profile is consistent with sub- or neofunctionalization: i.e., each gene is significantly more highly expressed than the other in at least one tissue. (B) An asymmetrically expressed gene pair. Expression of CBR1 exceeds expression of CBR3 in all tissues. Introns shortened for display purposes. The y axis shows read depth per billion mapped reads. Green regions in the gene models are unmappable.

Few duplicate pairs show evidence of sub- or neofunctionalization of expression (Fig. 2, A to C). Moreover, most gene pairs with such patterns are very old, dating to before the emergence of the placental mammals: For duplicates with dS < 0.7, just 15.2% of duplicates are classified as potentially sub- or neofunctionalized in expression. Given that even modest variation in expression profiles across tissues would meet our criteria for subfunctionalization, the fraction of truly subfunctionalized duplicates may be even lower.

Fig. 2 Properties of subfunctionalized genes.

(A) Classification of gene pairs by expression patterns. For context, note that duplicates arising at the human-mouse split would have dS ~ 0.45. (B) Heat map of expression ratios for duplicate pairs. For each duplicate pair (plotted in columns), the ratios show the tissue-specific expression level of the minor gene relative to its duplicate. Green indicates evidence for subfunctionalization; consistently blue columns indicate AEDs. Black indicates tissue ratios not significantly different from 1 (P > 0.001). (C) Distributions of expression ratios in different tissues (minor genes/major genes). Ratios significantly >1 marked in green. (D) Frequency spectra of human polymorphism data (15) for synonymous and nonsynonymous variants in subfunctionalized duplicates (green) and duplicates without significant expression differences (black). The plots show cumulative derived allele frequencies at segregating sites. The lines that climb more steeply (subfunctionalized genes) have a higher fraction of rare variants, indicating stronger selective constraint. (E) Disease burden of minor genes is highly correlated with degree of subfunctionalization (top) and overall expression relative to major genes (bottom). Data in (B), (C), and (D) are for dS < 0.7.

We also found similar levels of potential subfunctionalization in a mouse data set (12) that, unlike GTEx, includes fetal tissues (fig. S14). We examined whether subfunctionalization might instead be occurring through differential splicing of exons; however, we found little evidence for this (fig. S20). Last, we hypothesized that subfunctionalization might be more prevalent in gene pairs with higher tissue specificity (because they likely have more tissue-specific enhancers), but this is not the case (fig. S13).

Although relatively scarce, the genes identified as potentially subfunctionalized exhibit systematic differences from other duplicates. First, subfunctionalized gene pairs are expected to be under stronger selective constraint than genes without diverged expression, because the two copies are not functionally redundant. Consistent with this, we find that putatively subfunctionalized genes tend to have a higher fraction of rare variants in human polymorphism data (15) (P = 2 × 10⁻⁵ for missense mutations; Kolmogorov-Smirnov test) (Fig. 2D). Second, we hypothesized that if subfunctionalized genes have distinct functions, then they may be associated with distinct genetic diseases. Examining a database of gene associations with disease (16), we found a correlation between the degree of expression subfunctionalization and the number of diseases reported for only one member of the gene pair (P = 5 × 10⁻¹², controlling for relevant covariates; Wald test) (Fig. 2E and table S3).

In sharp contrast to the expectations of subfunctionalization, many duplicate pairs exhibit systematically biased expression, as seen in some species after whole-genome duplication (17). Across all duplicate pairs, the mean expression of the less-expressed gene is 40% that of its duplicate (Fig. 2, B and C) (P ~ 0, relative to a model with no true asymmetry). Among duplicates that likely arose within the placental mammals (dS < 0.7), 52.6% of duplicate pairs are AEDs, compared with just 15.2% that are potentially subfunctionalized. As might be expected, the minor genes at AEDs show evidence of reduced selective constraint relative to their duplicate partners, both within the human population (fig. S23) and between species (fig. S21). Furthermore, in gene pairs with asymmetric expression, the minor genes tend to be associated with significantly fewer diseases (P = 8 × 10⁻⁷; Wald test) (Fig. 2E). Nonetheless, despite their reduced importance, minor genes are not dispensable: 97% of minor genes have dN/dS < 1, a hallmark of protein-coding constraint (fig. S21).

Together, these results show that subfunctionalization of expression evolves slowly. However, we noticed much higher rates of sub- or neofunctionalization for duplicates located on different chromosomes, compared with duplicates in tandem (P = 5 × 10⁻²³; Fisher’s exact test) (fig. S24). We thus wanted to understand whether separation of duplicates enables subfunctionalization or whether the higher rate simply reflects the greater age of separated duplicates. Most duplicates arise as segmental duplications (18) and are close together in the genome: 87% of young gene pairs (dS < 0.1) are on the same chromosome (Fig. 3A). Duplicates may subsequently become separated as the result of chromosomal rearrangements; however, this is a slow process. It is not until dS = 0.6 that half of gene duplicates are found on different chromosomes.

Fig. 3 Coregulation of tandem duplicates.

(A) Numbers of duplicate pairs on the same or different chromosomes, as a function of dS, showing that most young pairs are close in the genome. (B) Correlation of expression profiles of duplicates across tissues, for tandem and separated pairs. (C) Expression correlations for duplicates that are separated in human but not mouse, or vice versa (P = 0.03; one-sided paired t test). (D) Overall distributions of correlations for different classes of genes. (E) Numbers of Hi-C links between neighboring gene pairs. (Gene pairs within 20 kb were excluded due to limited resolution of the assay; singleton pairs were randomly downsampled for plotting.)

Even controlling for duplicate age, however, there is a strong signal that genomic separation is a key factor enabling expression divergence (Fig. 3B). Separated duplicates have roughly 50% lower correlation of expression across tissues: P = 3 × 10⁻³⁰, controlling for age by dS in a multiple regression model (table S4 and fig. S26); P = 6 × 10⁻¹⁸, controlling for age by phylogenetic distribution (fig. S7). Further, we see the same effect in a paired test of duplicates that are separated in human but not mouse, or vice versa (Fig. 3C). Notably, duplicate age itself is a much weaker predictor (P = 2 × 10⁻⁶ for dS) than is genomic separation (P = 3 × 10⁻³⁰) (table S4). [In contrast to correlation across tissues, the asymmetry of mean expression is uncorrelated with whether the duplicates are on the same chromosome or not (P = 0.9, controlling for dS; Wald test).]
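As an illustration of what "correlation of expression across tissues" could mean for a single duplicate pair, the sketch below computes a Pearson correlation between the two copies' log-transformed tissue profiles. The choice of Pearson correlation, the log2 transform, the pseudocount, and the function name `expression_correlation` are assumptions made for this example; the exact statistic used in the analysis is described in the supplementary methods (13).

```python
import math
from typing import Sequence


def expression_correlation(expr_a: Sequence[float],
                           expr_b: Sequence[float],
                           pseudocount: float = 1.0) -> float:
    """Pearson correlation of two log2-transformed tissue expression profiles.

    `expr_a[i]` and `expr_b[i]` are the expression levels of the two
    duplicates in tissue i. The pseudocount is an assumed convention that
    keeps log2 defined when expression is zero.
    """
    if len(expr_a) != len(expr_b) or len(expr_a) < 2:
        raise ValueError("profiles must have equal length of at least 2")

    xs = [math.log2(v + pseudocount) for v in expr_a]
    ys = [math.log2(v + pseudocount) for v in expr_b]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)
```

A tightly coregulated tandem pair would give a value near 1, whereas a pair whose copies are expressed in complementary tissues would give a low or negative value.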

These results echo previous observations that, in general, genes that are close in the genome tend to be coregulated, with correlated expression (19) and often shared expression quantitative trait loci (eQTLs) (20). This effect is yet stronger for duplicates: Gene expression is more correlated for tandem duplicates than for singleton neighbors (P = 10⁻¹⁹; t test) (Fig. 3D), and duplicates share eQTLs at higher rates than matched singletons (P = 6 × 10⁻⁴ and 5 × 10⁻⁴ in two data sets; Fisher’s exact test) (13, 20, 21). Further, duplicates show higher connectivity by whole-genome chromosome conformation capture (Hi-C) (22), including higher numbers of promoter-promoter links than neighboring singletons (Fig. 3E) (mean effect size = 1.7-fold, P = 3 × 10⁻⁶; Wald test) (13). Promoter-promoter links may reflect a tendency of coregulated genes to be transcribed simultaneously within transcription factories (23). In contrast, duplicates on different chromosomes show no evidence of Hi-C linkage. In summary, we hypothesize that tandem duplicates tend to be highly coregulated and that genomic separation is a key factor enabling independent evolution.

Thus far, our results argue that expression subfunctionalization evolves slowly, in large part because tandem duplicates tend to be coregulated. An alternative explanation for the initial survival of duplicates is that they are both necessary to produce the required expression dosage (6). However, in contrast to whole-genome duplications, the small-scale duplications that are typical in mammals would initially disrupt dosage of the duplicated genes relative to all other genes. Thus, if dosage sharing is important in mammals, this would suggest that after tandem duplication, the duplicates should rapidly evolve reduced expression. Subsequent loss of either gene would then cause a deficit of expression and be deleterious.

To evaluate this, we analyzed the expression of human duplicates that arose since the human-macaque split, using RNA-seq data from six tissues in human and macaque (Fig. 4A) (13, 24). Indeed, there is a very clear signal that both human copies tend to evolve reduced expression, such that the median summed expression of the human duplicates is close to the expression of the singleton orthologs in macaque (median expression ratio 1.11; this is significantly less than the 2:1 expression ratio expected on the basis of copy number, P = 3 × 10⁻⁷; t test). Interestingly, polymorphic duplicates also show partial down-regulation, whereas the youngest fixed duplicates are about as down-regulated as older pairs, suggesting that reduced expression occurs rapidly (fig. S19). In contrast, we find no evidence for coding adaptation in these relatively young duplicate pairs (fig. S16). Thus, dosage sharing may be a frequent first step in the preservation of tandem duplicates. However, although dosage sharing evolves quickly, it is notable that duplicate genes remain less conserved than singleton genes over long evolutionary time scales (dS ≤ 0.7, or roughly the age of placental mammals) (Fig. 4B and fig. S22).
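The comparison against the 2:1 expectation can be sketched as below. This is an illustrative simplification: it assumes expression values are already normalized and comparable between species, and it computes a one-sample t statistic on log2 ratios, which is one plausible reading of "t test" here rather than a confirmed detail of the analysis. The function names `summed_expression_ratio` and `dosage_summary` are hypothetical.

```python
import math
from statistics import mean, median, stdev
from typing import Sequence, Tuple


def summed_expression_ratio(major: float, minor: float, ortholog: float) -> float:
    """Summed expression of both human copies relative to the single-copy ortholog."""
    if ortholog <= 0:
        raise ValueError("ortholog expression must be positive")
    return (major + minor) / ortholog


def dosage_summary(ratios: Sequence[float],
                   expected_ratio: float = 2.0) -> Tuple[float, float]:
    """Return (median ratio, one-sample t statistic vs. the expected ratio).

    The t statistic is computed on log2 ratios (an assumption of this
    sketch). A strongly negative value indicates summed expression well
    below the copy-number expectation, i.e., down-regulation.
    """
    if len(ratios) < 2:
        raise ValueError("at least two ratios are required")
    log_ratios = [math.log2(r) for r in ratios]
    expected_log = math.log2(expected_ratio)
    standard_error = stdev(log_ratios) / math.sqrt(len(log_ratios))
    t_statistic = (mean(log_ratios) - expected_log) / standard_error
    return median(ratios), t_statistic
```

A median ratio near 1 with a strongly negative t statistic would correspond to the pattern reported here, in which the two copies together produce roughly the expression of the ancestral single-copy gene.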

Fig. 4 Long-term survival of duplicate genes.

(A) Expression levels of young duplicates compared to their macaque orthologs in six tissues (24), for human duplicates that are single-copy genes in macaque. Sum shows the summed expression of both duplicates, relative to expression of the macaque orthologs in the same tissues. “Major” and “Minor” show corresponding ratios for major and minor genes separately, classified using GTEx data. The green data show a random set of singleton orthologs. Each tissue-gene expression ratio is plotted separately. (B) The strength of purifying selection in humans increases with duplicate age. The fraction of rare missense variants in a large human data set (15) is used as a proxy for the strength of purifying selection. (C) Conceptual model of duplicate gene evolution. Other transitions not explicitly shown would occur at lower but nonzero rates.

We propose that down-regulation is a key first step enabling the initial survival of duplicates, followed by dosage sharing, as suggested for WGDs (Fig. 4C) (6). In this view, the early survival of young duplicates is a race between down-regulation to achieve dosage balance versus mutational degradation of one copy. If dosage balance is achieved, then the relative expression levels of the two genes evolve slowly as a random walk due to constraint on their combined expression (7, 25). Both copies tend to evolve under reduced constraint, especially for minor genes of AEDs. Genomic separation frees expression of the duplicates to evolve independently and may also encourage protein adaptation, potentially leading to true functional differentiation and long-term survival. In summary, we find that subfunctionalization of expression evolves slowly in mammals due to coregulation of tandem duplicates and that rapid evolution of dosage sharing may be the most frequent first step to duplicate preservation.

Supplementary Materials

www.sciencemag.org/content/352/6288/1009/suppl/DC1

Materials and Methods

Supplemental Text

Figs. S1 to S27

Tables S1 to S5

Supplementary Files S1 to S3

References (26–79)

References and Notes

  1. See supplementary materials and methods on Science Online.
Acknowledgments: This work was funded by NIH grants ES025009 and MH101825 and by the Howard Hughes Medical Institute. We thank H. Fraser for prepublication access to data (12) and H. Fraser, A. Fu, A. Harpak, Y. I. Li, D. Petrov, P. C. Phillips, M. Przeworski, A. Stoltzfus, and the anonymous reviewers for comments or discussion. J.K.P. is on advisory boards for 23andMe and DNAnexus, with stock options in both.