Special Research Articles

Comparative Genomics of Trypanosomatid Parasitic Protozoa

Science  15 Jul 2005:
Vol. 309, Issue 5733, pp. 404-409
DOI: 10.1126/science.1112181

Abstract

A comparison of gene content and genome architecture of Trypanosoma brucei, Trypanosoma cruzi, and Leishmania major, three related pathogens with different life cycles and disease pathology, revealed a conserved core proteome of about 6200 genes in large syntenic polycistronic gene clusters. Many species-specific genes, especially large surface antigen families, occur at nonsyntenic chromosome-internal and subtelomeric regions. Retroelements, structural RNAs, and gene family expansion are often associated with syntenic discontinuities that—along with gene divergence, acquisition and loss, and rearrangement within the syntenic regions—have shaped the genomes of each parasite. Contrary to recent reports, our analyses reveal no evidence that these species are descended from an ancestor that contained a photosynthetic endosymbiont.

The protozoan pathogens Leishmania major, Trypanosoma cruzi, and Trypanosoma brucei (family Trypanosomatidae, order Kinetoplastida) collectively cause disease and death in millions of humans and countless infections in other mammals, primarily in developing countries in tropical and subtropical regions (1). There are no vaccines for these diseases and only a few drugs, which are inadequate because of toxicity and resistance. Although the three pathogens (referred to here as the “Tritryps”) share many general characteristics, including subcellular structures such as the kinetoplast and glycosomes, each is transmitted by a different insect and has its own life-cycle features, different target tissues, and distinct disease pathogenesis in their mammalian host [box 1 in (2) and fig. S1]. They also use different immune evasion strategies: L. major alters the function of the macrophages it infects, T. cruzi expresses a complex variety of surface antigens from within the cells it infects, and T. brucei remains extracellular but circumvents the host immune response by the periodic switching of its major surface protein (3).

The availability of the three draft genome sequences (4–6) allows better understanding of the genetic and evolutionary bases of the shared and distinct parasitic modes and lifestyles of these pathogens. In the accompanying Research Articles, the discussion of each species reflects the current state of knowledge for each organism. Thus, the Research Article by Berriman et al. (4) emphasizes metabolism and biochemical pathways of T. brucei; the Research Article by Ivens et al. (5) highlights fundamental aspects of molecular biology (transcription, translation, posttranslational modification, and proteolysis) of L. major; and the Research Article by El-Sayed et al. (6) focuses on repeats and retroelements, DNA replication and repair, and signaling pathways of T. cruzi. Here, we compare gene content and genome architecture, composition, and organization of protein domains encoded by each genome and offer an analysis of the rates of gene evolution.

Core proteome. The T. brucei, L. major, and T. cruzi haploid genomes contain between 25 and 55 megabases (Mb) distributed over 11 to 36 (generally) diploid chromosomes, and encode about 8100, 8300, and 12,000 protein-coding genes, respectively (Table 1). An “all-versus-all” basic local alignment search tool (BlastP) comparison of the predicted protein sequences within each of the three genomes was made using a suite of algorithms designed to collapse closely related paralogous genes. In the case of T. cruzi, all alleles were included because of the hybrid nature of this genome (2, 6). The mutual best BlastP hits between the three collapsed proteomes were grouped as clusters of orthologous genes (COGs). Iteration of this process with manual inspection and reannotation, especially of two-way COGs (i.e., those with members in only two of the Tritryps), resulted in 6158 three-way COGs, which defined the Tritryp core proteome, as well as 1014 two-way COGs (Table 1, Fig. 1A, and table S1). Amino acid sequence alignment of a large sample of three-way COGs reveals an average of 57% identity between T. brucei and T. cruzi, and 44% identity between L. major and the two other trypanosomes, reflecting expected phylogenetic relationships (7–10). The intracellular parasites, L. major and T. cruzi, appear to share slightly more two-way COGs than do T. brucei and T. cruzi and considerably more than do L. major and T. brucei. The remainder of each proteome is composed of species-specific members (table S1), of which T. cruzi (32%) and T. brucei (26%) have a much greater proportion than L. major (12%). Because the majority of the species-specific proteins appear to be members of surface antigen families, the different numbers may relate to different strategies of survival and immune evasion used in each organism. Other species-specific proteins carry out distinct metabolic and physiological functions, some of which are discussed below [see also (4–6)].
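The grouping of mutual best hits into two-way and three-way COGs can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: it assumes the paralog collapsing and the BlastP searches have already been run, that gene identifiers are unique across species, and that the best hits are supplied as plain dictionaries. The names `mutual_best`, `build_cogs`, and the toy gene identifiers are hypothetical.

```python
from itertools import combinations

SPECIES = ("Tb", "Tc", "Lm")  # T. brucei, T. cruzi, L. major

def mutual_best(best_ab, best_ba):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}

def build_cogs(best):
    """best[(X, Y)] maps each protein of species X to its best hit in Y.

    Returns (three_way, two_way): triples whose three pairs are all
    mutual best hits, and mutual best pairs outside any such triple.
    """
    rbh = {(x, y): mutual_best(best[(x, y)], best[(y, x)])
           for x, y in combinations(SPECIES, 2)}
    tb_tc = dict(rbh[("Tb", "Tc")])
    tb_lm = dict(rbh[("Tb", "Lm")])
    tc_lm = rbh[("Tc", "Lm")]
    three_way = {(tb, tc, tb_lm[tb]) for tb, tc in tb_tc.items()
                 if tb in tb_lm and (tc, tb_lm[tb]) in tc_lm}
    in_three = {gene for cog in three_way for gene in cog}
    two_way = {pair for pairs in rbh.values() for pair in pairs
               if not set(pair) & in_three}
    return three_way, two_way

# Toy best-hit tables (illustrative only, not data from the study).
best = {
    ("Tb", "Tc"): {"tb1": "tc1", "tb2": "tc2"},
    ("Tc", "Tb"): {"tc1": "tb1", "tc2": "tb2"},
    ("Tb", "Lm"): {"tb1": "lm1", "tb2": "lm9"},
    ("Lm", "Tb"): {"lm1": "tb1", "lm9": "tb3"},
    ("Tc", "Lm"): {"tc1": "lm1", "tc2": "lm9"},
    ("Lm", "Tc"): {"lm1": "tc1", "lm9": "tc5"},
}
three_way, two_way = build_cogs(best)
```

In this toy input, `tb1`, `tc1`, and `lm1` are mutual best hits in every pairing and form a three-way COG, whereas `tb2` and `tc2` match only each other and form a two-way COG. The manual inspection and reannotation described above has no counterpart in the sketch.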

Fig. 1.

Distribution of genes and domains among the kinetoplastid parasites and other organisms. (A) Gene distribution, calculated with the use of Jaccard-filtered COGs (2). (B) Domain distribution calculated with the use of TIGRFAMs and Pfam domains. The numbers indicate all domains that score above the trusted cutoff after manual curation with the three-way genome comparisons (2). (C) Pfam domain distribution among the three kingdoms of life and the Tritryps. Numbers in small circles indicate the number of domains that occur more than once in Tritryp parasite genomes. The numbers above the small circles indicate Pfam domains that are not present in the Tritryps.

Table 1.

General features of the Tritryp genomes. We found 5812 syntenic three-way COGs and 346 nonsyntenic three-way COGs. Mbp, megabase pairs; NC, not computed.

| Feature | T. brucei | T. cruzi | L. major |
|---|---|---|---|
| Haploid genome size (Mbp) | 25* | 55 | 33 |
| No. of chromosomes (per haploid genome) | 11* | ∼28† | 36 |
| No. of genes (per haploid genome) | 9068‡ | ∼12,000§ | 8311∥ |
| Total regions with synteny blocks (Mbp) | 19.9 | NC | 30.7 |
| Mean CDS size (bp) in syntenic three-way COGs | 1511 | 1457 | 1731 |
| Mean inter-CDS size (bp) between syntenic three-way COGs | 721 | 561 | 1431 |

* Excluding ∼100 mini- and intermediate-sized chromosomes (totaling ∼10 Mb).
† The exact number is not known and homologs can differ substantially in size.
‡ Includes 904 pseudogenes.
§ The exact number of haploid genes has not been determined in T. cruzi.
∥ Includes 34 pseudogenes.

Species-specific protein domains. A comparison of the Pfam and TIGRFAMs protein domains in the Tritryp genomes revealed very few that are unique to individual organisms (Fig. 1B). Of the 1617 protein domains identified in the Tritryp genomes, fewer than 5% are unique to a single species (table S2). For example, the macrophage migration inhibitory factor domain (Pfam accession number PF01187), restricted to L. major, may inhibit macrophage activation and consequent destruction of the parasites, as described in Brugia malayi (5, 11). Another domain specific to L. major (PF04133) is involved in vacuolar transport, suggesting that the protein may act to divert proteases within the host phagolysosome. These domains are not seen in T. cruzi, which escapes from the lysosomal compartment into the cytoplasm soon after invasion, or in T. brucei, which is extracellular.

The variant surface glycoprotein (VSG) expression site–associated gene (ESAG) domains ESAG1 (PF03238) and ESAG6-7 (PF05446) are restricted to T. brucei. Likewise, the AOX domain (PF01786), which acts as an alternative terminal oxidase in mitochondria, and the LigB domain (PF02900), which is involved in aromatic compound metabolism, account for some of the few metabolic capabilities of T. brucei that are not found in L. major or T. cruzi (4–6).

T. cruzi has a serine carboxypeptidase S28 domain (PF05577) not found in T. brucei or L. major. Several lines of evidence indicate that T. cruzi secretes a small peptide processed by a serine peptidase, which interacts with a host-cell receptor in a wide variety of mammalian cells (12). This interaction leads to a calcium signaling reaction that triggers lysosome migration to the host-cell plasma membrane, enabling parasite entry (13). T. cruzi also contains a number of hormone-type domains, such as PF00220 (neurohypophysial hormones, N-terminal domain) and PF02044 (bombesin-like peptide), which are not found in T. brucei or L. major, but the functional significance of these domains is uncertain.

Specific domain expansion and loss. Several interesting examples of domain expansion or contraction (table S3) were revealed (fig. S2), similar to those seen in other parasites such as Plasmodium (14). Many of these proteins appear to be involved in host interactions and often are encoded in tandem arrays, typically at species-specific subtelomeric locations (4–6). For example, T. brucei has expanded ESAG4 proteins that contain adenylate and guanylate cyclase catalytic domains and proteins containing leucine-rich repeat domains. T. cruzi has expanded bacterial neuraminidase/Asp-box repeat, mucin-like glycoprotein, leishmanolysin, and trypanosome retrotransposon hot spot (RHS) domains in trans-sialidases, mucins, glycoprotein (gp) 63 proteases, and RHS proteins, respectively. L. major contains a large tandem array of amastin surface glycoproteins but also possesses expanded protein families containing mitochondrial carrier protein, adenosine 5′-triphosphate–binding cassette transporters, and heat shock protein (HSP) 90. Interestingly, compared with T. brucei and T. cruzi, L. major has a marked underrepresentation of domains involved in RNA binding (pumilio and zinc finger domains), protein-protein interaction (leucine-rich and tetratricopeptide repeats), and calcium signaling (calmodulin and EF hand), suggesting a reduced role or alternate pathways for these activities.

Large-scale synteny. Despite having diverged 200 to 500 million years ago (15–18) and thus predating the emergence of mammals (19), the genomes of the trypanosomatid species are highly syntenic (i.e., show conservation of gene order). Of all the genes in T. brucei and L. major, 68 and 75%, respectively, remain in the same genomic context. Moreover, almost all (94%) of the three-way COGs that form the core proteome fall within regions of conserved synteny. The T. brucei and L. major genomes (2) show 110 blocks of synteny spanning 19.9 and 30.7 Mb, respectively (Fig. 2). Detailed examination of the synteny breakpoints revealed that 40% were associated with expansions of multigene families, retroelements and/or structural RNAs (Plate 1, fig. S3, and table S4). Enrichment of segmental duplication in regions of synteny breakpoints has also been observed in mammals (20, 21), but the implications are unknown. Interestingly, 43% of the synteny breakpoints in T. brucei and L. major (excluding chromosome ends) occur at or very close to the strand-switch regions separating the directional gene clusters (DGCs), which are characteristic of and unique to trypanosomatid genomes (5). Thus, there appears to be strong selective pressure to maintain gene order and to keep the DGCs intact, despite the extensive sequence divergence between the genes themselves. This may also be related to the relatively low incidence of sexual recombination in these organisms (22), which would limit opportunities for rearrangement during meiosis.

    Fig. 2.

    Synteny maps. The 36 different colors in the T. brucei (left) panel represent the locations of the indicated synteny blocks in the 36 chromosomes of L. major, and the 11 colors in the L. major (right) panel depict the locations of the indicated synteny blocks in the 11 chromosomes of T. brucei. Each synteny block is named using a double nomenclature that refers to the chromosomal location of the block in both species. Labels on the left outside margin of the synteny blocks denote the block number in the reference genome. Labels within synteny blocks refer to their location on the other genome. For example, synteny block Tb1.1 of T. brucei chromosome 1 (Tb1; lower right of left panel) is synteny block Lm20.1 of L. major chromosome 20 (Lm20). As another example, all of the yellow synteny blocks in the L. major panel (blocks Tb7.1 to Tb7.7) are on T. brucei chromosome 7 (Tb7). Synteny blocks are defined as groups of five or more T. brucei genes that possess an ortholog on the same L. major chromosome (2). The entire map contains 7974 T. brucei and 7466 L. major protein-coding genes in 110 synteny blocks. Plate 1 and fig. S3, A to K, show more detailed views and table S4, A and B, has complete lists of genes and block coordinates.
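The block definition given in the legend (five or more T. brucei genes with an ortholog on the same L. major chromosome) can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure described in (2): genes are taken in order along a single T. brucei chromosome, genes without an L. major ortholog are skipped and do not break a run, and orientation and position within the L. major chromosome are ignored. The function name `synteny_blocks` and the gene identifiers are hypothetical.

```python
from itertools import groupby

def synteny_blocks(tb_order, ortholog_chrom, min_genes=5):
    """Group consecutive T. brucei genes by the L. major chromosome
    that carries their ortholog, keeping runs of at least min_genes.

    tb_order: gene IDs in order along one T. brucei chromosome.
    ortholog_chrom: gene ID -> L. major chromosome of its ortholog;
        genes absent from the mapping have no ortholog (assumption:
        they are skipped and do not interrupt a block).
    """
    with_ortholog = [g for g in tb_order if g in ortholog_chrom]
    blocks = []
    for chrom, run in groupby(with_ortholog, key=ortholog_chrom.get):
        genes = list(run)
        if len(genes) >= min_genes:
            blocks.append((chrom, genes))
    return blocks

# Toy chromosome (illustrative only): g4 has no L. major ortholog.
tb_order = [f"g{i}" for i in range(1, 13)]
ortholog_chrom = {
    "g1": "Lm20", "g2": "Lm20", "g3": "Lm20", "g5": "Lm20", "g6": "Lm20",
    "g7": "Lm13", "g8": "Lm13",
    "g9": "Lm24", "g10": "Lm24", "g11": "Lm24", "g12": "Lm24",
}
blocks = synteny_blocks(tb_order, ortholog_chrom)
```

Only the Lm20 run reaches the five-gene threshold here; the Lm13 and Lm24 runs are too short to count as blocks under this definition.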

    Localized chromosomal rearrangements. Despite the marked overall conservation, many local insertions, deletions, or substitutions were seen within otherwise syntenic regions. Although evidence for all three processes was found in the Tritryp genomes, gene insertions or substitutions (which result in species-specific genes and nonsyntenic three-way COGs) were more common than gene loss (two-way COGs that include a L. major gene). Some of these events can result in substantial physiological and biochemical differences between these parasites.

One example of an insertion involves two genes in T. brucei encoding subunits (ESAG6 and ESAG7) of the heterodimeric transferrin receptor (23) in the syntenic block Tb7.3/Lm22.1 (Plate 1, inset A). The two genes are more similar (97% amino acid identity) to one another than they are to any of the subtelomeric copies (73 to 77% identity), indicating they encode a different form of the receptor than the telomeric copies. The surrounding region in this synteny block contains other insertions specific to L. major and T. cruzi and several translocated genes in the three genomes. T. cruzi seems to have undergone four separate insertions of genes belonging to a metabolic pathway that converts L-histidine to L-glutamate (5). Intriguingly, the gene (hutG) for the final enzyme has been previously found only in bacteria, and this may mark a horizontal gene transfer from bacteria, where the genes occur in a single operon.

Interestingly, a component of the RNA interference (RNAi) pathway is present only in T. brucei (Tb10.406.0020/TbAGO1) (24, 25) and a gene (LmjF33.0290) encoding a glucose transporter (26) is present only in L. major within the same synteny block (Plate 1, inset B). This region is also associated with a cluster of tRNA genes and HSP83 arrays (27, 28) of variable lengths in the three genomes. In eukaryotes, RNAi maintains genome integrity and prevents invasion by nucleic acid, by means of double-stranded RNAs that induce the degradation of homologous mRNAs (25). None of the Tritryps possesses an obvious homolog of Dicer, an essential component of the RNAi pathway in other organisms, but two predicted proteins (named TbRN3A and TbRN3B; table S5) containing a single ribonuclease III domain are present in T. brucei (but are not seen in L. major or T. cruzi) and represent potential Dicer candidates. Thus, gene insertions are likely responsible for the RNAi activity seen in T. brucei but not T. cruzi or L. major (28), although the source(s) of the gene insertions remains unknown.

Both L. major and T. cruzi contain a gene encoding 5-oxoprolinase in synteny block Tb10.15/Lm18.4, which is absent from T. brucei (Plate 1, inset C) and thus appears to represent gene deletion. This enzyme catalyzes the formation of L-glutamate from 5-oxo-L-proline in the glutathione metabolism pathway. Further examples of small synteny breaks occur immediately upstream of this same region: T. brucei contains a receptor-type adenylate cyclase (GRESAG4) gene; T. cruzi contains one or two dispersed gene family 1 (DGF-1) pseudogenes, as well as an RHS pseudogene on one allele; and L. major contains a degenerate Ingi/L1Tc-related retroelement (DIRE). These sequences are often associated with frequent recombination in the Tritryps (6) and large synteny breaks.

Nonsyntenic and subtelomeric regions. Plotting the location of two-way and three-way COGs across the L. major and T. brucei genomes revealed that synteny extends over most of both genomes (Fig. 2 and fig. S4). L. major contains a few large chromosome-internal nonsyntenic regions, which mostly correspond to tandem arrays of both protein-coding and RNA genes. By contrast, T. brucei contains large blocks of nonsyntenic genes at the telomeres of all chromosomes (fig. S4), which can be several hundred kilobases (kb) in size and contain large arrays of species-specific VSG (pseudo)genes and ESAGs, as well as a large number of retroelements and RHS genes (4). The subtelomeric regions of T. cruzi are also large and nonsyntenic, consisting mostly of interspersed arrays of trans-sialidase superfamily, DGF-1 and RHS (pseudo)genes, as well as vestigial interposed retroelements (VIPER), short interspersed repetitive elements (SIRE), T. cruzi L1Tc, T. cruzi nonautonomous non-LTR retrotransposons (NARTc), and/or DIRE retroelements (6). Another intriguing feature of T. cruzi is the presence of large (up to 600 kb) nonsyntenic “islands” of genes coding for surface proteins such as trans-sialidase, mucin, mucin-associated surface protein (MASP), and gp63 peptidase, along with retrotransposons and RHS genes. Although the precise location of these “islands” is not certain, they appear often to lie between chromosome-internal synteny blocks. The L. major subtelomeric regions are quite short (<20 kb), with relatively few repetitive sequences, although there is evidence that there may be some recombination between telomeres (5). Nevertheless, the most telomere-proximal genes in L. major are often nonsyntenic, although usually not specific to L. major. Thus, the organization and gene content of the subtelomeric regions is quite different in each genome (Fig. 3).

    Fig. 3.

    Prototypes of Tritryp subtelomeric regions. Subtelomeric regions are defined here as the area that extends from the telomeric hexamer repeats to the first nonrepetitive sequence. Boxes indicate genes and/or gene arrays. Genes and/or gene arrays shown above the line are oriented toward the telomeres, whereas those shown below the line are oriented in the opposite direction. The size range of the subtelomeric regions in each genome is indicated on the right. The TS and TS/GP-85 boxes depict the trans-sialidase and GP-85 trans-sialidase superfamilies, respectively.

Chromosome evolution. A comparison of the Tritryp genomes provides interesting insights into the karyotype of their common ancestor. T. brucei has only 11 large diploid chromosomes (plus numerous small chromosomes that contain largely repetitive sequence), and T. cruzi and L. major contain ∼28 and 36 pairs of smaller chromosomes, respectively. Most rearrangements of synteny blocks represent inversions and/or translocations (Fig. 2 and fig. S3), but there appear to be several cases of chromosome fusions in T. brucei. Twenty of the 36 L. major chromosomes are almost entirely syntenic within a substantially larger T. brucei chromosome, except for a few instances of synteny block inversion or shuffling. In 10 further cases, there is only a single segmental translocation that has moved one end of the L. major chromosome to a different T. brucei chromosome. Although the T. cruzi genome contains gaps, many nearly chromosome-sized scaffolds were defined by virtue of arrays of telomeric repeats and characteristic subtelomeric genes at one end (6). Notably, many of the L. major chromosomes that are syntenic with these T. cruzi scaffolds also contain telomeric sequences at the corresponding end. In contrast, the syntenic T. brucei regions at the corresponding position generally represent internal chromosome regions with no typical telomeric structures. For example, the ends of the two L. major and T. cruzi chromosomes appear to have joined to form a single T. brucei chromosome at the junction between synteny blocks Tb11.7/Lm13.1 and Tb11.8/Lm24.1 (Plate 1). Interestingly, this synteny break region in T. brucei contains RHS, DIRE, and Ingi sequences, often associated with T. brucei subtelomeres, pointing to a telomeric origin for this region. Other examples of similar apparent chromosome fusions in T. brucei can be seen between synteny blocks Tb7.3/Lm22.1 and Tb7.4/Lm14.1, as well as Tb7.5/Lm6.1 and Tb7.6/Lm.1. Thus, the current chromosomal architecture of T. brucei seems to have derived from an ancestor with the more fragmented genomic organization of L. major and T. cruzi. This evolutionary topology supports the prevailing view of an early divergence of the Leishmania genus and the monophyly of the Trypanosoma genus (7–10).

    The marked difference in the gene size and density between the Tritryp genomes is notable. The average L. major protein-coding sequence (CDS) is considerably longer than in T. brucei or T. cruzi (Table 1), often in regions specifying low-complexity amino acid insertions or expansions. The length differential is even more extreme in noncoding regions, with the average inter-CDS length in L. major being almost twice that in T. brucei and three times that in T. cruzi (Table 1). Consequently, gene density in L. major is considerably less than in T. brucei and T. cruzi (251 versus 319 and 385 genes/Mb, respectively). Thus, genome compaction does not appear to be associated with an intracellular lifestyle in the Tritryps, in contrast to the suggestion for Encephalitozoon cuniculi (29, 30).

    The Tritryp chromosomes exhibited systematic purine excess, GC bias, and AT skew, correlated with the coding strand (31). This phenomenon is associated with replication in Eubacteria and Archaea (32), but local skews are linked to mutational bias arising from transcription in eukaryotes (33). Although this seems to be the case for L. major, GC skew has the opposite correlation with the coding strand in T. brucei and T. cruzi (34). The AT skew correlation is the same in all three species. Thus, there may be differences in DNA repair and/or transcriptional processes between Leishmania and Trypanosoma, which may account for their different GC contents (60% in L. major, 46 to 51% in T. brucei and T. cruzi).
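For readers unfamiliar with the quantities, the following Python sketch computes GC content, GC skew, and AT skew for a single strand using the conventional definitions, (G - C)/(G + C) and (A - T)/(A + T). These formulas are the commonly used ones and are an assumption here; the exact procedure and any windowing used for the analysis cited above (31) are not described in this article. The function name `strand_composition` is hypothetical.

```python
from collections import Counter

def strand_composition(seq):
    """Return GC content, GC skew, and AT skew for one DNA strand.

    Conventional definitions (assumed, not taken from the article):
        GC content = (G + C) / (A + C + G + T)
        GC skew    = (G - C) / (G + C)
        AT skew    = (A - T) / (A + T)
    Characters other than A, C, G, T (e.g. N) are ignored.
    """
    counts = Counter(seq.upper())
    a, c, g, t = (counts[base] for base in "ACGT")

    def skew(x, y):
        # Guard against division by zero for empty or one-sided input.
        return (x - y) / (x + y) if (x + y) else 0.0

    total = a + c + g + t
    return {
        "gc_content": (g + c) / total if total else 0.0,
        "gc_skew": skew(g, c),
        "at_skew": skew(a, t),
    }

# Toy sequence (illustrative only): 3 G, 1 C, 2 A, 2 T.
result = strand_composition("GGGCAATT")
```

Computing the skews on the coding strand of each directional gene cluster, and then comparing the sign of the GC skew between species, is one way to examine the kind of strand correlation described above.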

    No evidence for ancestral plastid endosymbiont. On the basis of recently reported evidence of plantlike traits associated with the metabolism of Trypanosoma parasites (35) and the close phylogenetic relatedness of kinetoplastids to Euglena (a photosynthetic protist), it has been suggested that the common ancestor of these protists harbored the same endosymbiotic green alga that gave rise to the secondary plastid in Euglena. Our data show that the protein domain content of the Tritryps is not consistent with large-scale horizontal transfer of genetic material from plants, given that we did not observe a large number of plant-specific domains in the Tritryps (Fig. 1C and table S6).

    We used phylogenetic analyses to search for genes of cyanobacterial or green algal ancestry in the Tritryp genomes. Phylogenetic trees were made for all L. major genes (because this genome has the fewest protein-coding genes) with the use of alignments against proteins from all available completed genomes (2). Although some genes appeared to branch with plants or cyanobacteria in an initial screen, these relationships were not supported by more sophisticated Bayesian methods and a more comprehensive sampling of protein sequences (2). These analyses included the genes previously reported to have plantlike traits (35), as shown in fig. S5. We conclude from our analyses that the genome data provide no unambiguous support for the hypothesis that trypanosomatids have acquired genes from the endosymbiont that gave rise to the Euglena secondary plastid, suggesting that it was acquired subsequent to speciation with the kinetoplastida.

    Gene evolution. Pathogen proteins involved in interaction with the host are often rapidly evolving, and can be identified by comparison of the number of synonymous mutations per synonymous site (dS) and the number of nonsynonymous mutations per nonsynonymous site (dN) (36). As the Tritryp genomes are too divergent to accurately estimate dS, we calculated dN using pairwise comparisons for every COG where there was a simple 1:1:1 orthologous relationship between genomes (or 1:1:2 in cases where both T. cruzi alleles were present), because this effectively gives a measure of how rapidly each protein sequence is diverging between species (2). Categorization of these genes by gene ontology (GO) term for biological processes (Fig. 4) showed that those with no functional annotation had the highest median dN value, suggesting that they were subject to positive selection causing active accumulation of mutations or that they were under neutral evolution allowing the sequences to drift. Such genes of unknown function probably include trypanosomatid-specific genes involved in unique processes (including interaction with the host) or highly variable genes that elude annotation by homology.

    Fig. 4.

Median dN values for genes categorized by GO process annotation. Each bar corresponds to a pairwise comparison of sets of orthologous genes from two species. Amino acid sequences were aligned for each gene and converted to nucleotide sequences to calculate nonsynonymous substitutions (40). Genes were annotated with GO process terms in T. brucei and then transferred to the other species with the use of orthology. Results from the selective constraint analyses are in table S7, A to C.

    Genes in the transport category also had a relatively high median dN value for L. major versus T. brucei (Fig. 4). Rapid evolution of transport proteins may be due to their surface location (and consequent exposure to the host immune system) but may also reflect the different niches occupied by each parasite within their hosts and requirement for different nutrient uptake from their environment. Conversely, genes representing metabolism, cell growth, and maintenance have low dN values, probably reflecting the core processes common to the Tritryps.
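The summary shown in Fig. 4 amounts to grouping per-gene dN values by GO process term and taking the median of each group. A minimal Python sketch of that aggregation step follows; it assumes dN has already been estimated for each orthologous pair and does not attempt the estimation itself. The function name `median_dn_by_category`, the "no annotation" label, and all numeric values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import median

def median_dn_by_category(records):
    """Median dN per GO process category for one species pair.

    records: iterable of (go_term, dn) tuples; go_term is None for
    genes with no functional annotation.
    """
    by_category = defaultdict(list)
    for go_term, dn in records:
        by_category[go_term or "no annotation"].append(dn)
    return {cat: median(values) for cat, values in by_category.items()}

# Made-up dN values (illustrative only, not results from the study).
records = [
    ("transport", 0.40), ("transport", 0.60),
    ("metabolism", 0.20), ("metabolism", 0.10), ("metabolism", 0.30),
    (None, 0.70), (None, 0.90), (None, 0.80),
]
medians = median_dn_by_category(records)
```

Running the same function once per species pair would give one set of medians per pairwise comparison, matching the layout described in the figure legend.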

    Concluding remarks. Although the majority of trypanosomatid genes in the same genomic context are conserved, there are substantial differences, which presumably reflect specific adaptations to distinct species-specific selection pressures and the distinct pathophysiologies and survival strategies of each organism. Antigenic variation and diversity are characteristic of T. brucei and T. cruzi, and the localization of large arrays of genes encoding surface proteins at or near telomeres and/or the presence of numerous retroelements within these regions may enhance recombination frequency and provide for rapid sequence variation. Thus, colocalization of previously uncharacterized genes (e.g., T. cruzi MASPs and DGF-1, as well as RHS in both T. cruzi and T. brucei) in these regions leads to the suspicion that they may also be involved with immune evasion or survival in different hosts. The frequent recombination in these regions results in large (up to 2 Mb) size polymorphisms between homologous chromosomes seen in T. brucei and T. cruzi.

The frequent correlation between conserved synteny blocks and the large DGCs characteristic of the Tritryps may also reflect their unique linkage of transcription with subsequent RNA processing by trans-splicing and polyadenylation. Transcription of protein-coding genes has been postulated to initiate at only a few sites on each chromosome (37–39), suggesting that there may be selective pressure against synteny breaks within the polycistronic gene clusters downstream of these sites. It is also possible (and not necessarily unrelated) that the synteny breaks associated with strand-switch regions may reflect higher rates of recombination at these sites, possibly as a result of linkage with replication processes.

    The identification of numerous Tritryp-conserved and species-specific genes provides the opportunity for development of previously unexplored chemotherapeutic approaches against these parasites. Drugs designed against conserved core processes hold the advantage of being potentially useful against all three organisms, provided that they are sufficiently divergent from mammalian host proteins.

    Supporting Online Material

    www.sciencemag.org/cgi/content/full/309/5733/404/DC1

    Materials and Methods

    Figs. S1 to S5

    Tables S1 to S7
