Origin and Spread of de Novo Genes in Drosophila melanogaster Populations

Science  14 Feb 2014:
Vol. 343, Issue 6172, pp. 769-772
DOI: 10.1126/science.1248286


Comparative genomic analyses have revealed that genes may arise from ancestrally nongenic sequence. However, the origin and spread of these de novo genes within populations remain obscure. We identified 142 segregating and 106 fixed testis-expressed de novo genes in a population sample of Drosophila melanogaster. These genes appear to derive primarily from ancestral intergenic, unexpressed open reading frames, with natural selection playing a significant role in their spread. These results reveal a heretofore unappreciated dynamism of gene content.

Losses and Gains

In order to better understand the process by which de novo genes originate, Zhao et al. (p. 769, published online 23 January) examined testis-based gene expression among Drosophila melanogaster strains and identified both fixed and polymorphic de novo genes. The results suggest that spontaneous activation of previously noncoding DNA may be an important factor in generating genetic novelty.

Although the vast majority of genes present in any species descend from a gene present in an ancestor, recent analyses suggest that some genes originate from ancestrally nongenic sequences (1–3). Evidence for these “de novo” genes has generally derived from a combination of phylogenetic and genomic/transcriptomic analyses that reveal evidence of lineage- or species-specific transcripts associated with nongenic orthologous sequences in sister species. De novo genes, which were first identified in Drosophila (1–3), have also been identified in humans, rodents, rice, and yeast (4–9). In Drosophila, de novo genes tend to be specifically expressed in tissues associated with male reproduction (2, 10), which suggests that sexual or gametic selection may be important (1–3, 9), although other functional roles may evolve (10, 11). Because previous studies of de novo gene evolution used comparative rather than population genetic approaches, the earliest steps in de novo gene origination remain mysterious. Here, we used population genomic and transcriptomic data from Drosophila melanogaster and its close relatives to investigate the origin and spread of de novo genes within populations.

Illumina paired-end RNA sequencing (RNA-seq) and de novo and reference-guided assembly and alignment were used to characterize the testis transcriptome of six previously sequenced inbred Raleigh (RAL) D. melanogaster strains (12); an average of 65 million paired-end reads were produced for each strain (table S1). We inferred (13) the presence of 142 polymorphic de novo candidate genes that are expressed in at least one RAL strain but are not known on the basis of publicly available data from D. melanogaster. The median number of segregating de novo genes carried per strain was 49. Reverse transcription polymerase chain reaction (RT-PCR) and 5′ and 3′ rapid amplification of cDNA ends (RACE) in a subset of genes supported inferences from RNA-seq analysis (table S2). These candidate polymorphic genes correspond to unique, intergenic sequence in the D. melanogaster reference sequence (table S3), are alignable to unique orthologous regions in the D. simulans and D. yakuba reference sequences, and show no significant BLASTP hits to the NCBI nr (nonredundant) protein database. The candidate genes exhibited expression neither in testis RNA-seq data from three D. simulans and two D. yakuba strains (table S1 and fig. S1) nor in whole male and female RNA-seq data from 59 D. simulans strains (13). None of the candidates showed significant expression in whole females from the same D. melanogaster strains used for testis RNA-seq (table S4). These data support the hypothesis that the 142 candidates are new, male-specific, de novo genes still segregating in D. melanogaster. Expression levels of the candidate genes greatly exceed levels of background transcription in intergenic sequence (fig. S2) (13); several additional attributes of these genes, as described below, support the hypothesis that the observed transcripts are biologically meaningful.

Segregating de novo genes were moderately expressed (Fig. 1A and Table 1), but their expression was significantly lower than that of annotated male-biased genes (13) (Table 1) or annotated genes (table S6). We observed no enrichment of polymorphic de novo genes near annotated male-biased genes and no significant correlation between the strand (+/−) of polymorphic de novo genes and that of their immediate annotated neighbors [χ² test, P > 0.1 (table S5 and fig. S3), supported by simulations (13)]. There was a marginally significant underrepresentation of X chromosome segregating de novo genes relative to annotated male-biased genes (10 genes are X-linked; t test, P = 0.01; Fig. 1B). This result stands in contrast to speculation based on a small sample of older, fixed de novo genes (2, 3) that de novo male-biased genes are overrepresented on the X chromosome.

Fig. 1 Basic properties of segregating de novo genes.

(A) Expression estimates of segregating de novo genes, fixed de novo genes, all annotated genes, and annotated male-biased genes in D. melanogaster. (B) Simulation of de novo gene locations. The boxplot for each chromosome is the simulated number of genes from intergenic regions. The black dot is the observed number. The X chromosome is the only chromosome arm that deviates from the expected number of genes (t test, P = 0.01). (C) Pie chart of segregating de novo gene frequency.

Table 1 Properties of segregating and fixed de novo genes and comparison with annotated male-biased genes in D. melanogaster.

Wilcoxon test, ***P < 0.001, **P < 0.01, *P < 0.05; ns, not significant. For segregating de novo genes, P values are comparisons of segregating versus fixed genes and segregating versus male-biased genes. For fixed de novo genes, P values are comparisons of fixed de novo genes versus male-biased genes. Male-biased genes are as defined in (13). All estimates are medians, except for exon number (mean).

As expected, de novo genes were significantly shorter and simpler than annotated genes and annotated male-biased genes (Table 1 and table S6). This pattern is likely due mostly to the larger proportion of polymorphic de novo genes that are single-exon (57.0%) compared to the proportion of annotated single-exon (table S6) or single-exon male-biased genes (Table 1) (13). Among the 61 multi-exon de novo genes, the majority of splice events (98%) were associated with canonical sites; rare noncanonical splice sites were found in four genes as minor isoform splice events, which were similar to those previously observed in D. melanogaster (14). Alternative splicing was observed in 20 of the 61 multi-exon segregating de novo genes (table S7), with conserved reading frames across alternative isoforms. Genes associated with alternative splicing generally exhibited multiple isoforms across strains that expressed the corresponding gene, with no evidence of genetic variation for alternative splice use.

Of 142 polymorphic genes, 134 (94%) had a minimum open reading frame (ORF) of at least 150 base pairs (bp) and were classified as potentially coding. To determine the likelihood that the high proportion of genes harboring long ORFs occurred by chance, we investigated the coding potential of intergenic regions in the reference sequence, focusing on single-exon ORFs. We observed that 59.9% of random 800-bp intergenic sequences were associated with a ≥150-bp single-exon ORF, whereas 97.5% of the observed single-exon de novo genes were associated with such an ORF (P < 0.01). Moreover, the mean length of single-exon de novo gene ORFs was substantially greater than that expected in random intergenic sequence (P < 0.05). These observations further support the idea that the observed transcripts are unlikely to be explained simply as random noise. The eight polymorphic de novo genes that did not satisfy our arbitrary minimum ORF criterion were autosomal and slightly smaller (mean transcript length = 743 bp) than ORF-containing polymorphic genes. Because orthologous sequences from expressing and non-expressing D. melanogaster lines have similar coding potential, most segregating de novo genes are likely to have resulted from the recruitment of small, preexisting, unexpressed ORFs (1). For D. simulans and D. yakuba orthologous sequences, 70% and 45%, respectively, contained ORFs similar to those observed for segregating genes in D. melanogaster. Of the 134 predicted de novo proteins, 41.8% may be intrinsically unfolded (fig. S4, A to D) and 50% of these have predicted binding regions (fig. S4E); both observations are consistent with potential biological function (15). For putative protein-coding genes, the average 5′ and 3′ untranslated region (5′UTR and 3′UTR) lengths (248 bp and 364 bp, respectively) were slightly shorter than the average lengths for annotated D. melanogaster genes but were slightly longer than the averages for annotated male-biased genes (Table 1). The incidence of the two major polyadenylation signals (AAUAAA and AUUAAA) in or near the putative 3′UTRs of segregating de novo genes was similar to, but slightly lower than, the incidence in the whole genome (table S8). Overall, polymorphic de novo genes have structural organization consistent with small protein-coding genes in the species.
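
As an illustration only, the minimum-ORF criterion described above can be sketched in Python. The sketch assumes that an ORF runs from an ATG to the first in-frame stop codon and that both strands are scanned; the study's own ORF definition may differ. The function names (reverse_complement, longest_orf, has_min_orf) are hypothetical and are not taken from the study's pipeline.

```python
# Illustrative sketch, not the study's pipeline.
# Assumption: an ORF is ATG ... first in-frame stop codon (stop included).

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T)."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq):
    """Length in bp of the longest ATG-to-stop ORF on either strand."""
    seq = seq.upper()
    best = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None  # position of the first ATG of the currently open ORF
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    best = max(best, i + 3 - start)
                    start = None
    return best


def has_min_orf(seq, min_len=150):
    """True if the sequence carries an ORF of at least min_len bp."""
    return longest_orf(seq) >= min_len
```

Applying has_min_orf to many randomly drawn 800-bp intergenic windows would give a null proportion comparable in spirit to the 59.9% figure reported above, although the exact value depends on the ORF definition and on how the windows are sampled.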

Segregating de novo genes either were expressed at a relatively high level in expressing strains or showed almost no evidence of expression in other strains. Hartigan’s dip test on transcript abundance estimates rejected unimodality for 134 of 142 genes and was consistent with bimodal expression across lines for most genes. We used a cutoff of two fragments per kilobase of exon per million fragments mapped (FPKM > 2) for inferring expression of a transcript in a line (16) to determine the proportion of strains, from 0.17 (1/6) to 1.0 (6/6), expressing each transcript. Because no candidates show expression in the reference sequence strain, the genes expressed in all six RAL strains are considered to be polymorphic in the species. More than half the genes (55%) were not rare in the Raleigh sample, as they were expressed in at least two of the six RAL strains (Fig. 1C); 29.5% were definitely common, being expressed in three or more strains, which is inconsistent with mutation-selection balance. We observed 106 unannotated male-specific transcripts expressed in all six strains and in the reference strain (table S9) but not in the outgroup strains. The corresponding “fixed” de novo genes were not included in downstream analyses relating to segregating genes.
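
The expression cutoff and the resulting frequency classes can be illustrated with a minimal Python sketch. The FPKM > 2 cutoff and the singleton and ≥ 3/6 classes come from the text; the strain labels, the example FPKM values, the "intermediate" label, and the function names are hypothetical.

```python
# Illustrative sketch: call expression per strain with an FPKM cutoff and
# convert the number of expressing strains into a sample frequency.

def expression_frequency(fpkm_by_strain, cutoff=2.0):
    """Return (proportion of strains expressing, list of expressing strains)."""
    expressing = [s for s, fpkm in fpkm_by_strain.items() if fpkm > cutoff]
    return len(expressing) / len(fpkm_by_strain), expressing


def frequency_class(n_expressing):
    """Bin a gene by the number of strains in which it is expressed."""
    if n_expressing == 0:
        return "not expressed"
    if n_expressing == 1:
        return "singleton"
    if n_expressing >= 3:
        return "high-frequency"
    return "intermediate"  # hypothetical label for genes seen in exactly 2 strains


# Made-up FPKM values for one candidate gene in six strains.
example = {"strain_1": 15.2, "strain_2": 0.1, "strain_3": 8.7,
           "strain_4": 0.0, "strain_5": 0.3, "strain_6": 1.9}
freq, strains = expression_frequency(example)
print(round(freq, 2), strains, frequency_class(len(strains)))
```

Treating this proportion as an allele-frequency estimate is only reasonable when expression is close to all-or-none across strains, which is what the bimodality result above suggests.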

We extracted the 100 bp upstream and 50 bp downstream of the inferred transcription start site (TSS) from the genome sequences of the expressing strains for each of the 61 multi-exon genes. MDscan identified and clustered motifs in these flanking sequences; sequence logos were then generated. We observed four common consensus sequence motifs (8 or 10 bp; Fig. 2A), each of which was found associated with roughly half the segregating de novo genes (13) (table S10). In total, 371 annotated male-biased genes (23.3%) were also associated with at least one of these motifs, which suggests that the de novo genes share regulatory features with known male-biased genes. We identified 67 annotated male-biased genes (table S11) that have two or more motifs in the 5′ regions. However, GO (Gene Ontology) enrichment analysis (fig. S5) provided no insight into the possible functions of de novo genes.

Fig. 2 Regulation and population genetics of segregating de novo genes.

(A) Potential cis-regulatory elements. The most common shared 8- and 10-bp consensus motifs in 5′ flanking regions are listed. From top to bottom, 34, 29, 25, and 30 multiple-exon genes show these motifs. (B) Nucleotide diversity (π) for de novo genes and flanking regions. The red line is the ratio of π-expressing lines to π–non-expressing lines; the green line shows expected values from resampling of intergenic DNA conditional on the same derived allele frequency distribution as the observed de novo genes. π estimates for 5′ and 3′ flanking regions of genes were incremented in 5-kb windows. (C) A gene (Gene_X_141) that may have experienced a hard selective sweep. Gray box denotes expressing lines. The TSS region contains a derived allele fixed in expressing strains and absent in non-expressing strains; flanking regions are homozygous in expressing strains. (D) A gene (Gene_3L_079) showing no evidence of hard sweep. Gray box denotes expressing lines. The TSS region includes a derived allele fixed in expressing lines, but the flanking regions of expressing chromosomes retain nucleotide variation.

These data support the hypothesis that de novo gene expression is influenced by cis-acting variants in the regions corresponding to the 5′ flanking regions of expressing chromosomes. In the simplest case that de novo gene expression is due to a single noncoding nucleotide change, one would predict an excess of fixed differences between expressing and non-expressing chromosomes in flanking regions compared to random samples of intergenic sequences. We focused on the 32 genes expressed in more than two strains and for which our genetic analysis (13) supported cis-acting variation driving de novo gene expression. Of these genes, 31.2% exhibited a fixed, derived single-nucleotide polymorphism (SNP) within 500 bp upstream of the TSS, whereas only 8.43% of simulated “genes” (intergenic regions defined by harboring derived SNPs with the same frequency distribution as the 32 observed genes) exhibited a fixed SNP in the comparable 5′ region (P < 0.01). More generally, divergence between expressing and non-expressing chromosomes for these 500-bp regions was significantly greater than divergence in simulated data (P = 0.048); this finding supports the hypothesis that cis-regulatory changes play a role in de novo gene origination.
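
The notion of a fixed, derived difference between expressing and non-expressing chromosomes can be made concrete with a short Python sketch. It assumes pre-aligned, equal-length sequences and an inferred ancestral sequence (for example, from an outgroup); the function name and the toy sequences are hypothetical, and the study's own procedure may differ in detail.

```python
# Illustrative sketch: find sites where every expressing chromosome carries
# the same derived allele and no non-expressing chromosome carries it.

def fixed_derived_sites(expressing, non_expressing, ancestral):
    """Return 0-based positions of fixed, derived differences.

    expressing, non_expressing: lists of aligned sequences (equal length).
    ancestral: inferred ancestral sequence of the same length.
    """
    sites = []
    for pos, anc in enumerate(ancestral):
        expr_alleles = {seq[pos] for seq in expressing}
        other_alleles = {seq[pos] for seq in non_expressing}
        if len(expr_alleles) != 1:
            continue  # not fixed among expressing chromosomes
        allele = next(iter(expr_alleles))
        if allele in "ACGT" and allele != anc and allele not in other_alleles:
            sites.append(pos)
    return sites


# Toy alignment: position 2 carries a derived G in all expressing sequences.
expressing = ["ACGTA", "ACGTA"]
non_expressing = ["ACCTA", "ACCTT"]
ancestral = "ACCTA"
print(fixed_derived_sites(expressing, non_expressing, ancestral))
```

Running the same function on intergenic regions matched for derived allele frequency would provide the kind of null comparison described in the paragraph above.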

Under this hypothesis, segregating genes should be associated with allele-specific expression. We thus measured allelic imbalance (17, 18) in the testis in a set of three unique F1 genotypes created by crossing the six RAL strains (table S1) (13). For the 59 autosomal genes for which one parent expressed the gene and the other did not, expression patterns in the heterozygote for 51 genes were explained completely by cis-acting variation (i.e., allelic imbalance was complete); 7 genes showed evidence of regulation by both cis-acting and trans-acting factors. Only 1 of the 59 genes showed no evidence of allelic imbalance, consistent with expression driven solely by trans-acting variation (table S12). More generally, for genes expressed in both parents, the expression of alleles in F1 was consistent with expression levels in each parental line (table S13), further supporting the importance of cis-acting expression variants. The roughly bimodal expression patterns and the dominant role of cis effects support the idea that the proportion of lines expressing a gene provides an estimate of its population frequency.
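
The logic of the allelic imbalance comparison can be sketched as follows. The sketch classifies a gene from allele-specific read counts in an F1 heterozygote whose parents differ in expression; the thresholds (0.95 for complete imbalance and ±0.10 around 0.5 for balance) and the function name are hypothetical choices made for illustration and are not the criteria used in the study.

```python
# Illustrative sketch with hypothetical thresholds: classify regulation from
# allele-specific read counts in an F1 heterozygote.

def classify_regulation(reads_expressing_allele, reads_other_allele,
                        complete=0.95, balanced_tolerance=0.10):
    """Return 'cis', 'trans', or 'cis+trans'.

    reads_expressing_allele: reads assigned to the allele inherited from the
        parent that expresses the gene.
    reads_other_allele: reads assigned to the allele from the other parent.
    """
    total = reads_expressing_allele + reads_other_allele
    if total == 0:
        raise ValueError("no allele-informative reads")
    fraction = reads_expressing_allele / total
    if fraction >= complete:
        return "cis"        # expression comes almost entirely from one allele
    if abs(fraction - 0.5) <= balanced_tolerance:
        return "trans"      # both alleles expressed about equally
    return "cis+trans"      # partial imbalance (any other fraction)


print(classify_regulation(98, 2), classify_regulation(52, 48),
      classify_regulation(80, 20))
```

A real analysis would also need a statistical test on the read counts and a correction for mapping bias, which this sketch omits.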

One population genetic explanation for polymorphic de novo genes is that singleton genes (45% of genes) are primarily deleterious and that higher-frequency genes are primarily neutral. If the deleterious nature of de novo genes were due to the cost of transcription or translation, or to toxic interactions of the resulting RNAs or proteins with other molecules, then lower-frequency genes should be more abundantly expressed and longer than higher-frequency genes. However, contrary to this expectation, lower-frequency genes were expressed at a lower level, were shorter, and were less complex than higher-frequency genes (table S6) (13). The different properties of rare versus common de novo genes (Table 2) (13) support the idea that de novo genes having certain properties (e.g., greater expression, longer transcripts, more exons) are more likely to spread under selection.

Table 2 Properties of segregating genes differ across frequency classes.

Wilcoxon test, ***P < 0.001, **P < 0.01, *P < 0.05; P values are comparison of singleton versus nonsingleton genes and singleton versus high-frequency (≥ 3/6) genes. FPKM and transcript length estimates are medians; exon numbers are means.

We investigated the role of directional selection on polymorphic de novo genes by determining whether they are associated with reduced nucleotide diversity (19, 20). For each de novo gene expressed in at least two strains, we compared the nucleotide diversity (π) for expressed sequence (expressing strains) versus non-expressed orthologous sequence (non-expressing strains) and compared the observed differences to a frequency-corrected expected value from resampling of intergenic sequence from the six RAL strains (13). For 46 of 65 genes, π was lower in the expressed lines (mean = 0.0060) than in the non-expressed lines (mean = 0.0092) and exhibited a roughly 38% reduction relative to non-expressed orthologous sequence over the 65 genes (Wilcoxon test, P = 0.003). For 30 genes, π was significantly lower in the expressed lines (Wilcoxon test, P < 0.05). The region of reduced heterozygosity near expressed sequences is on the scale of 5 kb or less (Fig. 2B and fig. S6), which is contrary to the expectation of strong selection on new mutations (19) but consistent with weaker selection (20) or soft selective sweeps (21) (Fig. 2, C and D). Polymorphic de novo genes were significantly (Wilcoxon test, P < 0.001) (13) more likely to be differentially expressed between populations (29 of 142, or 17%) relative to annotated genes (4.5%) and male-biased genes (6.3%), which also supports the idea that selection may play a role in their spread.
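
For readers unfamiliar with the statistic, nucleotide diversity (π) is the average number of pairwise differences per site among sampled sequences. A minimal Python sketch follows, assuming pre-aligned, equal-length sequences without gaps or missing data; the function names and toy sequences are hypothetical, and the frequency-corrected resampling step described above is not reproduced.

```python
# Illustrative sketch: nucleotide diversity as mean pairwise differences
# per site, and the expressing / non-expressing ratio.
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average pairwise differences per site among aligned sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    n_pairs = 0
    n_diffs = 0
    for a, b in combinations(seqs, 2):
        n_diffs += sum(x != y for x, y in zip(a, b))
        n_pairs += 1
    return n_diffs / (n_pairs * length)


def diversity_ratio(expressing, non_expressing):
    """Ratio of pi in expressing lines to pi in non-expressing lines."""
    return nucleotide_diversity(expressing) / nucleotide_diversity(non_expressing)


# Toy example: pairwise differences are 1, 2, and 1 over 4 sites.
print(nucleotide_diversity(["AAAA", "AAAT", "AATT"]))
```

A ratio below 1 for a given gene indicates lower diversity among expressing chromosomes. Note that diversity_ratio is undefined when the non-expressing sequences are identical, and that at least two sequences are needed in each class.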

We used the Hudson-Kreitman-Aguade–like (HKAl) test statistic (22, 23) to compare the heterozygosity/divergence ratio for genomic regions associated with fixed de novo genes to that observed for appropriately sampled intergenic regions (13, 20). The HKAl for fixed regions (mean = –0.48) was significantly smaller than that expected for comparable random intergenic regions (mean = 0.12; Wilcoxon test, P < 0.001). Moreover, regions corresponding to fixed genes associated with higher expression (FPKM > 10) exhibited a smaller HKAl statistic relative to regions associated with fixed genes having lower (FPKM ≤ 10) expression (HKAl = –0.33 versus –0.86; Wilcoxon test, P < 0.001). These observations also support the hypothesis that de novo genes have been influenced by directional selection.

Our analyses suggest that there are many polymorphic de novo male-specific genes in D. melanogaster populations, likely recruited by selection primarily from ancestral, unexpressed ORFs (fig. S7). Given the small number of genotypes investigated for a single tissue and our strict filtering criteria, we have likely substantially underestimated the number of polymorphic de novo genes. Our results also suggest the existence of many more fixed de novo D. melanogaster genes than previously inferred (2, 4, 10), which supports the idea that a substantial genetic component of male reproductive biology in this species remains completely unexplored.

More generally, our results suggest that important attributes of an organism’s biology cannot be accurately represented or investigated without knowledge of de novo gene variation within species. In the absence of gene loss, de novo gene gain would lead to a long-term increase in gene number. Although our analyses are consistent with substantial numbers of polymorphic gene losses, we observed no population genetic evidence that losses result from directional selection (13). Thus, de novo genes may often spread under selection, while gene loss may occur primarily as a result of drift associated with loss of ancestral gene function. However, important details of such processes remain obscure, and much additional work is required to clarify the dynamics, biochemical and genetic properties, and phenotypic effects of young de novo genes and the processes underlying gene loss in natural populations.

Supplementary Materials

Materials and Methods

Supplementary Text

Figs. S1 to S10

Tables S1 to S17

References (24–58)

References and Notes

  1. See supplementary materials on Science Online.
  2. Acknowledgments: We thank the Begun lab for valuable comments, especially N. Svetec, J. Cridland, A. Sedghifar, T. Seher, and Y. Brandvain. We also thank three anonymous reviewers and M. W. Hahn, C. H. Langley, J. Anderson, and A. Kopp for comments and suggestions. We acknowledge Phyllis Wescott (1950–2011) for more than 30 years of service in the Drosophila media kitchen at UC Davis. Funded by NIH grant GM084056 (D.J.B.) and NSF grant 0920090 (C.D.J and D.J.B.). Author contributions: L.Z. and D.J.B. conceived and designed the experiments; L.Z., P.S., and C.D.J. generated data; L.Z. performed data analysis; P.S. and L.Z. did molecular experiments; L.Z. and D.J.B. wrote the paper; and L.Z., D.J.B., and C.D.J. revised the paper. Illumina reads produced in this study are deposited at NCBI BioProject under accession number PRJNA210329.
