The coffee genome provides insight into the convergent evolution of caffeine biosynthesis

Science  05 Sep 2014:
Vol. 345, Issue 6201, pp. 1181-1184
DOI: 10.1126/science.1255274

Coffee, tea, and chocolate converge

Caffeine has evolved multiple times among plant species, but no one knows whether these events involved similar genes. Denoeud et al. sequenced the Coffea canephora (coffee) genome and identified a conserved gene order (see the Perspective by Zamir). Although this species underwent fewer genome duplications than related species, the relevant caffeine genes experienced tandem duplications that expanded their numbers within this species. Scientists have seen similar but independent expansions in distantly related species of tea and cacao, suggesting that caffeine might have played an adaptive role in coffee evolution.

Science, this issue p. 1181; see also p. 1124


Coffee is a valuable beverage crop due to its characteristic flavor, aroma, and the stimulating effects of caffeine. We generated a high-quality draft genome of the species Coffea canephora, which displays a conserved chromosomal gene order among asterid angiosperms. Although it shows no sign of the whole-genome triplication identified in Solanaceae species such as tomato, the genome includes several species-specific gene family expansions, among them N-methyltransferases (NMTs) involved in caffeine production, defense-related genes, and alkaloid and flavonoid enzymes involved in secondary compound synthesis. Comparative analyses of caffeine NMTs demonstrate that these genes expanded through sequential tandem duplications independently of genes from cacao and tea, suggesting that caffeine in eudicots is of polyphyletic origin.

With more than 2.25 billion cups consumed every day, coffee is one of the most important crops on Earth, cultivated across more than 11 million hectares. Coffee belongs to the Rubiaceae family, which is part of the Euasterid I clade and the fourth largest family of angiosperms, consisting of more than 11,000 species in 660 genera (1). We sequenced Coffea canephora (2n = 2x = 22 chromosomes), an outcrossing, highly heterozygous diploid, and one of the parents of C. arabica (2n = 4x = 44 chromosomes), which was derived from hybridization between C. canephora and C. eugenioides (2). A total of 54.4 million Roche 454 single and mate-pair reads and 143,605 Sanger bacterial artificial chromosome–end reads were generated from a doubled haploid accession, representing ~30× coverage of the 710-Mb genome (3). Additional Illumina sequencing data (60×) were used to improve the assembly (table S1) (4). The resulting assembly consists of 25,216 contigs and 13,345 scaffolds with a total length of 568.6 Mb (80% of 710 Mb), including 97 Mb (17%) of intercontig gaps. Eighty percent of the assembly is in 635 scaffolds, and the scaffold N50 (the scaffold size above which 50% of the total length of the sequence assembly can be found) is 1.26 Mb (table S2). A high-density genetic map covering 349 scaffolds and comprising ~64% of the assembly (364 Mb) and 86% of the annotated genes was anchored to the 11 C. canephora chromosomes (4). More than 96% of the scaffolds larger than 1 Mb were anchored (Fig. 1A).
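The scaffold N50 defined above can be illustrated with a short sketch. The function name and the toy scaffold lengths are illustrative assumptions and are not taken from the C. canephora assembly.

```python
def n50(lengths):
    """Return the N50 of a set of sequence lengths.

    The N50 is the length L such that sequences of length >= L
    together contain at least half of the total assembled length.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths (arbitrary units), not real assembly data.
toy_scaffolds = [100, 80, 60, 40, 20]
print(n50(toy_scaffolds))
```

For the toy input, the two longest scaffolds (100 and 80) already cover more than half of the total of 300, so the sketch reports 80.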

Fig. 1 Structure of the C. canephora genome.

(A) Alignment of the pseudochromosome 1 sequence with the genetic map of C. canephora and genomic overview. Correspondences between the genetic linkage map and the DNA pseudomolecule are shown at left (oriented and nonoriented scaffolds are indicated in blue and green, respectively; gray lines denote consistent data; orange lines indicate markers with an approximate genetic location). The relative proportions (percentage of nucleotides) in sliding windows (1-Mb size, 500-kb step) of transposable elements (Copia in red, Gypsy in green) and genes (exons in blue, introns in dark blue) are shown at right. (B) Coffee chromosomal blocks descending from the seven ancestral core eudicot chromosomes. The three paralogous descendants of the seven ancestral chromosomes are shown in shared colors but different textures. (C) Comparison of three grapevine chromosomes (descendants of the prehexaploidization core eudicot chromosome) mapped to a single coffee chromosome and three regions in the tomato genome. (D) Phylogeny and genome duplication history of core eudicots. Arrowheads indicate tetraploidization (blue) or hexaploidization (green) events. Red lines trace lineages of six species that have not undergone further polyploidization. Bar graphs and colors reflect gene-order differences (table S17) between each of the six species (column labels) and the entire set, showing the gene order conservatism of coffee, especially among asterids, and of peach and cacao among rosids.

We annotated 25,574 protein-coding genes (4) (table S6), 92 microRNA precursors, and 2573 organellar-to-nuclear genome transfers (4). Transposable elements account for ~50% of the genome (4), of which ~85% are long terminal repeat (LTR) retrotransposons. Large-scale comparison between C. canephora LTR retrotransposons and those of reference plant genomes shows outstanding conservation of several Copia groups across distantly related genomes, suggesting that horizontal mobile element transfers may be more frequent than generally recognized (5–8).

Structurally, the coffee genome shows no sign of a whole-genome polyploidization in its lineage since the γ triplication at the origin of the core eudicots (9) (Fig. 1B). Coffee contains exactly three paralogous regions for each of the seven pre-γ ancestral chromosomes (Fig. 1B). Coffee chromosomal regions show unique one-to-one correspondences with grapevine chromosomes (Fig. 1C and fig. S12) and a one-to-three correspondence with the tomato genome, which underwent a second lineage-specific triplication during its evolutionary history (10). Although grapevine, a rosid, is the most conservative core eudicot in terms of integrity of gross chromosomal structure, coffee displays less gene-order divergence to all other rosids, despite being an asterid itself (9). Coffee also shows little syntenic divergence relative to other sequenced asterids (Fig. 1D, table S17, and supplementary text).

To classify gene families in the C. canephora genome, we ran OrthoMCL on inferred protein sequences from coffee, grapevine, tomato, and Arabidopsis (4), generating 16,917 groups of orthologous genes (fig. S5). To examine coffee-specific gene family expansions with potential adaptive value, we fit different branch models implemented in BadiRate (11) to these orthogroups (4). In the coffee lineage, 202 orthogroups clustering 1270 genes were supported as expanded (Akaike information criterion > 2.7). Among gene ontology (GO) terms annotating these, 98 out of 4300 generic terms were significantly over- or underrepresented (table S14). Most GOs enriched in C. canephora (P < 0.05) belonged to two main functional categories: defense response and metabolic process, the latter including different catalytic activities (table S15).
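As an illustration of how over-representation of a GO term in a gene set can be assessed, the sketch below uses a one-sided hypergeometric test. This is one common approach and is an assumption here; the procedure actually used is described in the supplementary materials (4). All counts in the example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric.

    N: genes in the background set
    K: background genes annotated with the GO term
    n: genes in the set of interest (e.g., expanded orthogroups)
    k: genes in the set of interest annotated with the GO term
    """
    upper = min(K, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / comb(N, n)


# Hypothetical counts: 1000 background genes, 50 carry the term,
# 100 genes are in the expanded set, and 12 of those carry the term.
p_value = hypergeom_upper_tail(k=12, K=50, n=100, N=1000)
print(f"enrichment P = {p_value:.3g}")
```

Under-representation could be tested analogously with the lower tail, and a multiple-testing correction would normally be applied when thousands of terms are examined.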

Among defense response functions, there is a clear expansion of nucleotide binding site disease-resistance genes (12, 13) in the C. canephora genome (4). Most genes that grouped together within single orthogroups were tandemly arrayed, suggesting that R genes evolved by tandem duplication and divergence of linked gene families (supplementary text). Several gene functions involved in secondary metabolite biosynthesis are significantly expanded in the C. canephora genome, including enzymes associated with the production of phenylpropanoids such as flavonoids and isoflavones (naringenin 3-dioxygenase, isoflavone 2′-hydroxylase), alkaloids (strictosidine synthase, tropine dehydrogenase), monoterpenes (e.g., menthol dehydrogenase), and caffeine [N-methyltransferases (NMTs)] (Fig. 2). For example, indole alkaloids such as the monoamine oxidase inhibitor yohimbine and antimalaria drug quinine are prominent secondary compounds of the coffee family and its parent order, Gentianales (14), and the GO term indole biosynthetic process was highly enriched (P < 0.001) in coffee relative to tomato, grapevine, and Arabidopsis.
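The notion of tandemly arrayed members of one orthogroup can be sketched as follows. The input layout, the function name, and the `max_intervening` threshold are hypothetical choices made for illustration; they are not the criteria used in the study, which are given in the supplementary text.

```python
from collections import defaultdict


def tandem_arrays(genes, max_intervening=1):
    """Group genes of one orthogroup that lie close together on a chromosome.

    genes: iterable of (gene_id, chromosome, gene_order_index, orthogroup).
    max_intervening: number of unrelated genes tolerated between two
        consecutive members of an array (hypothetical threshold).
    Returns a list of arrays, each a list of gene ids in chromosomal order.
    """
    by_key = defaultdict(list)
    for gene_id, chrom, index, group in genes:
        by_key[(chrom, group)].append((index, gene_id))

    arrays = []
    for members in by_key.values():
        members.sort()
        current = [members[0]]
        for previous, following in zip(members, members[1:]):
            if following[0] - previous[0] - 1 <= max_intervening:
                current.append(following)
            else:
                if len(current) > 1:
                    arrays.append([gene for _, gene in current])
                current = [following]
        if len(current) > 1:
            arrays.append([gene for _, gene in current])
    return arrays


# Hypothetical gene positions, not real annotations.
toy_genes = [
    ("a", "chr1", 10, "OG1"), ("b", "chr1", 11, "OG1"),
    ("c", "chr1", 13, "OG1"), ("d", "chr1", 30, "OG1"),
    ("e", "chr2", 5, "OG1"), ("f", "chr1", 12, "OG2"),
]
print(tandem_arrays(toy_genes))
```

In the toy data, genes a, b, and c form one array because at most one unrelated gene separates consecutive members, whereas d and e are too distant or on another chromosome.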

Fig. 2 Evolution of caffeine biosynthesis.

(A) The principal caffeine biosynthetic pathway. Three methylation steps are necessary to produce caffeine from xanthosine, involving the successive action of three NMTs: xanthosine methyltransferase (XMT), theobromine synthase [7-methylxanthine methyltransferase (MXMT)], and caffeine synthase [3,7-dimethylxanthine methyltransferase (DXMT)]. SAM, S-adenosylmethionine; SAH, S-adenosylhomocysteine. (B) Evolutionary position of caffeine-producing plants with respect to other eudicots (phylogeny adapted from (C) ML phylogeny of coffee, tea, and cacao NMTs. Bootstrap support values (percentages) from 1000 replicates are shown next to relevant clades. Branch lengths are proportional to expected numbers of nucleotide substitutions per site. Colors identify genes assignable to the genomic blocks denoted in (D). (D) (Left) A model summarizing the duplication history of coffee NMT genes, following the phylogeny in (C). Three distinct tandem gene arrays evolved in situ on chromosome 1 from nearby gene duplicates (bold squares). The red and green blocks, colored as in (C), translocated (to chromosome 9) or rearranged (to elsewhere on chromosome 1) from their ancestral locus (blue region), respectively. (Right) Gene orders on modern chromosomes. Translocation of the red block, containing the putative caffeine NMT metabolic cluster, left the phylogenetically derived CcDXMT gene behind. Similarly, CcNMT19 is a derived gene within its own NMT clade that remained in place following movement of the green block. Numbers at branches indicate relative times since major duplication events or diversification times of the tandem arrays, calculated from approximately neutral synonymous substitution rates. (E) Expression profiles (reads per kilobase per million reads mapped) of known Coffea canephora NMTs. The genes in the putative metabolic cluster (along with CcDXMT and CcMXMT) exhibit similar expression patterns, higher in perisperm than endosperm. Data are plotted as log2 values. DAP, days after pollination.
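The expression unit used in Fig. 2E, reads per kilobase per million reads mapped (RPKM), and its log2 transform can be sketched as below. The function names, the example counts, and the optional pseudocount are assumptions for illustration; the caption does not state whether a pseudocount was applied.

```python
import math


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def log2_rpkm(read_count, gene_length_bp, total_mapped_reads, pseudocount=0.0):
    """log2 of RPKM; a pseudocount (an assumption) avoids log2(0)."""
    value = rpkm(read_count, gene_length_bp, total_mapped_reads) + pseudocount
    return math.log2(value)


# Hypothetical gene: 500 reads on a 2000-bp transcript, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))       # 25.0
print(log2_rpkm(500, 2000, 10_000_000))  # log2(25), roughly 4.64
```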

Caffeine is a purine alkaloid synthesized by several eudicot plants, including coffee, cacao (Theobroma cacao), and tea (Camellia sinensis) (Fig. 2). Caffeine is synthesized in both coffee leaves, where it has insecticidal properties (15), and fruits and seeds, where it inhibits seed germination of competing species (16). The late steps in caffeine biosynthesis are mediated by a series of NMTs (Fig. 2A) (17).

Among coffee-expanded genes, NMT activity is one of the more highly enriched GO terms (table S15). A single gene family (ORTHOMCL170) clusters 23 genes in coffee, but none in grapevine, tomato, or Arabidopsis (table S12), and this cluster contains genes encoding known enzymes of the caffeine biosynthetic pathway (18, 19). Maximum likelihood (ML) phylogenetic analysis of ORTHOMCL170 with tea and cacao NMTs that have similar activities reveals species-specific gene clades (Fig. 2C). We analyzed these relationships in a broader evolutionary context by including genome-wide samples of NMTs from coffee, cacao, and other eudicot species. ML trees show that the genes encoding the closest Arabidopsis NMT relatives of coffee caffeine biosynthetic enzymes are involved in benzoic, salicylic, and nicotinic functions (4) (supplementary text). Caffeine biosynthetic NMTs from coffee nested within a gene clade distinct from those of cacao or tea, which group together as sister lineages. Thus, a minimum of two independent origins of caffeine biosynthetic NMT activity can be inferred, as proposed previously (20).

Microsynteny analyses of ORTHOMCL170, which includes three tandem arrays, show that some known and putative coffee caffeine synthase genes, namely CcXMT (encoding xanthosine N-methyltransferase), CcMTL, and CcNMT3, form a tight assemblage of coexpressed tandem duplicates (Fig. 2D) reminiscent of a metabolic gene cluster (21, 22). Given that some plant metabolic gene clusters are of relatively recent origin (23), we sought to further unravel the role of gene duplication in the expansion of the coffee NMT gene family (Fig. 2D) (supplementary text). The three main coffee NMT clades in ORTHOMCL170 are distributed among a minimum of three genomic blocks; however, some phylogenetically recent tandem duplicates have moved away from their original positions via block rearrangements (Fig. 2D). One such movement involving the putative metabolic cluster appears to have left the CcDXMT gene (encoding 3,7-dimethylxanthine methyltransferase) behind, physically separated from its ancestral tandem array. In cacao, the functionally characterized TcBCS1 gene has a tandem duplicate, but this pair of genes evolved independently from the NMT tandem arrays found in C. canephora (fig. S29). We also examined the role of positive selection (PS) in the evolution of caffeine biosynthesis among coffee, tea, and cacao (4) (supplementary text). We found significant evidence for PS [likelihood ratio test for PAML (Phylogenetic Analysis by Maximum Likelihood) branch-site test, P = 5.78 × 10⁻³ (24)] only for the coffee NMT lineage, indicating that the independent evolution of caffeine biosynthesis in coffee was adaptive and probably involved specific amino acid changes fixed by PS. These results highlight the distinct acquisition of caffeine biosynthesis in the coffee plant, providing an example of convergent evolution of secondary metabolic pathways encoded by tandemly duplicated genes.
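A likelihood ratio test of the kind reported for the branch-site analysis compares the log-likelihoods of a null and an alternative model. The sketch below assumes the statistic is referred to a chi-square distribution with one degree of freedom, a common convention for branch-site tests; the log-likelihood values are hypothetical and are not those of the coffee NMT analysis.

```python
import math


def lrt_statistic(lnl_null, lnl_alt):
    """Twice the log-likelihood difference between nested models."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 d.f.

    For one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))


def lrt_pvalue(lnl_null, lnl_alt):
    """P value of the likelihood ratio test under the 1-d.f. assumption."""
    return chi2_sf_df1(lrt_statistic(lnl_null, lnl_alt))


# Hypothetical log-likelihoods for the null and alternative models.
print(lrt_pvalue(lnl_null=-1000.0, lnl_alt=-996.0))
```

With these toy values the statistic is 8.0, which corresponds to a P value of about 0.0047 under the stated assumption.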

Genomic functional diversification via tandem duplication may have helped shape other aspects of coffee bean chemical composition. Linoleic acid, which is produced by the oleate desaturase FAD2, is the major polyunsaturated fatty acid in the coffee bean (25, 26), where it contributes to aroma composition and flavor retention after roasting (4). Coffee has six FAD2 genes compared with one in Arabidopsis, and most of these have arisen from tandem duplications on chromosome 1 (fig. S33). RNA sequencing data suggest transcriptional specialization for two of the six FAD2 copies, with CcFAD2.3 being actively transcribed in developing endosperm (supplementary text). Peak transcript abundance coincides with the dramatic increase in linoleic acid content that occurs during seed development at the perisperm-endosperm transition (27).

Our analysis of the adaptive genomic landscape of C. canephora identifies the convergent evolution of caffeine biosynthesis among plant lineages and establishes coffee as a reference species for understanding the evolution of genome structure in asterid angiosperms.

Correction (8 October 2014): Figure 1D has been updated in the PDF version.

Supplementary Materials

Materials and Methods

Supplementary Text

Figs. S1 to S33

Tables S1 to S27

References (28–175)

References and Notes

  1. Materials and methods are available as supplementary materials on Science Online.
  2. Acknowledgments: We acknowledge the following sources for funding: ANR-08-GENM-022-001 (to P.L.); ANR-09-GENM-014-002 (to P.W.); Australian Research Council (to R.J.H.); Natural Sciences and Engineering Research Council of Canada (to D.S.); CNR-ENEA Agrifood Project A2 C44 L191 (to G.Gi.); FINEP-Qualicafé, INCT-CAFÉ (to A.C.A.); NSF grants 0922742 (to V.A.A.) and 0922545 (to R.M.); and the College of Arts and Sciences, University at Buffalo (to V.A.A.). We thank P. Facella (ENEA) for Roche 454 sequencing and Instituto Agronômico do Paraná (Paraná, Brazil) for fruit RNA. This work was supported by the high-performance cluster of the SouthGreen Bioinformatics platform (UMR AGAP) CIRAD. The C. canephora genome assembly and gene models are available on the Coffee Genome Hub and the CoGe platform. Sequencing data are deposited in the European Nucleotide Archive under the accession numbers CBUE020000001 to CBUE020025216 (contigs), HG739085 to HG752429 (scaffolds), and HG974428 to HG974439 (chromosomes). Gene family alignments and phylogenetic trees for BAHD acyltransferases and NMTs are available in the GreenPhylDB under the gene family IDs CF158535 and CF158539 to CF158545, respectively. We declare no competing financial interests.