Chromosome 2 Sequence of the Human Malaria Parasite Plasmodium falciparum


Science  06 Nov 1998:
Vol. 282, Issue 5391, pp. 1126-1132
DOI: 10.1126/science.282.5391.1126

This article has a correction.


Chromosome 2 of Plasmodium falciparum was sequenced; this sequence contains 947,103 base pairs and encodes 210 predicted genes. In comparison with the Saccharomyces cerevisiae genome, chromosome 2 has a lower gene density, introns are more frequent, and proteins are markedly enriched in nonglobular domains. A family of surface proteins, rifins, that may play a role in antigenic variation was identified. The complete sequencing of chromosome 2 has shown that sequencing of the A+T-rich P. falciparum genome is technically feasible.

Malaria, a disease caused by protozoan parasites of the genus Plasmodium, is one of the most dangerous infectious diseases affecting human populations. Approximately 300 million to 500 million people are infected annually, and 1.5 million to 2.7 million lives are lost to malaria each year, with most deaths occurring among children in sub-Saharan Africa (1). Of the four species that cause malaria in humans, P. falciparum is the greatest cause of morbidity and mortality. The resistance of the malaria parasite to drugs and the resistance of mosquitoes to insecticides have resulted in a resurgence of malaria in many parts of the world and a pressing need for vaccines and new drugs. The identification of new targets for vaccine and drug development is dependent on the expansion of our understanding of parasite biology; this understanding is hampered by the complexity of the parasite life cycle. The sequencing of the Plasmodium genome may circumvent many of these difficulties and rapidly increase our knowledge about these parasites.

The P. falciparum genome is ∼30 Mb in size; has a base composition of 82% A+T; and contains 14 chromosomes, which range from 0.65 to 3.4 Mb. Chromosomes from different wild isolates exhibit extensive size polymorphism. Mapping studies have indicated that the chromosomes contain central domains that are conserved between isolates and polymorphic subtelomeric domains that contain repeated sequences. P. falciparum also contains two organellar genomes. The mitochondrial genome is a 5.9-kb, tandemly repeated DNA molecule; a 35-kb circular DNA molecule, which encodes genes that are usually associated with plastid genomes, is located within the apicoplast [an organelle of uncertain function in Plasmodium and the related parasite Toxoplasma (2)].

Chromosome 2 (GenBank accession number AE001362) was sequenced with the shotgun sequencing approach, which was previously used to sequence several microbial genomes (3, 4), with modifications to compensate for the A+T richness of P. falciparum DNA (5). These modifications included the following: the extraction of DNA from agarose under high-salt conditions to prevent the DNA from melting at a high temperature, the avoidance of ultraviolet (UV) light, the use of the “vector plus insert” protocol for library construction, sequencing with dye-terminator chemistry, the use of a reduced extension temperature in polymerase chain reactions (PCRs), and the use of a transposon-insertion method for the closure of gaps that are very rich in AT. The assembly software was also modified to minimize the misassembly of A+T-rich sequences. The complete sequence included portions of both telomeres and had an average redundancy of 11-fold; colinearity of the final sequence and genomic DNA was proven with optical restriction and yeast artificial chromosome (YAC) maps.

Chromosome 2 of P. falciparum (clone 3D7) is 947 kb in length and has an overall base composition of 80.2% A+T. The chromosome contains a large central region that encodes single-copy genes and several duplicated genes, subtelomeric regions that contain variant antigen genes (var) (6–8), repetitive interspersed family (RIF)–1 elements (9) and other repeats, and typical eukaryotic telomeres (Fig. 1). The terminal 23-kb portions of the chromosome are noncoding and exhibit 77% identity in opposite orientations. The left and right telomeres consist of tandem repeats of the sequence TT(TC)AGGG (10) and total 1141 and 551 nucleotides (nt), respectively. The subtelomeric regions do not exhibit repeat oligomers until ∼12 to 20 kb into the chromosome, where rep20 (11) (a 21-bp tandem direct repeat found exclusively in these regions) occurs 134 and 96 times in the left and right ends of the chromosome, respectively. The sequence similarity that was observed between the subtelomeric regions supports previous suggestions that recombination between chromosome ends may be one mechanism by which genetic diversity is generated. A region with centromere functions could not be identified on the basis of sequence similarity to S. cerevisiae or other eukaryotic centromeres (12). However, several regions of up to 12 kb are devoid of large open reading frames (ORFs) and might contain the centromere. Alternatively, centromeric functions may be defined by higher order DNA structures and chromatin-associated protein complexes (13).
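The telomere repeat described above, TT(TC)AGGG, can be written as a simple pattern in which the third position is either T or C. The following is a minimal sketch, not the method used in the study: the function name and the toy sequence are hypothetical, and the unit counts for the two telomeres assume idealized, uninterrupted 7-nt repeats.

```python
import re

# Telomere repeat unit from the text: TT(TC)AGGG, i.e. TTTAGGG or TTCAGGG.
TELOMERE_UNIT = re.compile(r"TT[TC]AGGG")


def count_telomere_units(seq: str) -> int:
    """Count non-overlapping telomere repeat units in a DNA string."""
    return len(TELOMERE_UNIT.findall(seq.upper()))


# Toy sequence (hypothetical, not taken from chromosome 2):
# three repeat units followed by four unrelated bases.
toy = "TTTAGGGTTCAGGGTTTAGGGACGT"
n_units = count_telomere_units(toy)
print("units in toy sequence:", n_units)

# Upper bound on the number of complete units in each telomere,
# assuming perfect 7-nt tandem repeats (an idealization).
UNIT_LENGTH_NT = 7
for name, length_nt in (("left", 1141), ("right", 551)):
    print(name, "telomere: at most", length_nt // UNIT_LENGTH_NT, "units")
```

Under the idealized assumption, the reported telomere lengths of 1141 and 551 nt would correspond to at most 163 and 78 complete units, respectively.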

Figure 1

Gene map of P. falciparum chromosome 2. Predicted coding regions are shown on each strand. Exons of protein-encoding genes are indicated by rectangles, and lines linking rectangles represent introns. The single tRNAGlu gene is indicated by a cloverleaf structure. Genes are color-coded according to broad role categories as shown in the key. Gene identification numbers correspond to PF numbers in Table 2. The letters CC, NG, and TM followed by numerals indicate the number of predicted coiled-coil, nonglobular, and transmembrane domains in the proteins, respectively.

Two hundred and nine protein-encoding genes and a gene for tRNAGlu (Fig. 1 and Table 1) were predicted (14) on chromosome 2, giving a gene density of one gene per 4.5 kb, which is a value between that observed in yeast (one gene per 2 kb) and in Caenorhabditis elegans (one gene per 7 kb). Of the 209 protein-encoding genes, 43% contain at least one intron. This percentage is an estimate because some introns may have been missed by the gene-finding method. Most spliced genes consist of two or three exons. In terms of intron content and gene density, the Plasmodium genome, as assessed by analysis of the first completed chromosome sequence, appears to be intermediate between the condensed yeast genome and the intron-rich genomes of multicellular eukaryotes.
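The gene density quoted above follows from the chromosome length and the number of predicted genes. The short sketch below reproduces that arithmetic; the variable names are illustrative, and the yeast and C. elegans values are the rounded figures given in the text.

```python
# Figures taken from the text.
CHR2_LENGTH_BP = 947_103   # chromosome 2 length in base pairs
PREDICTED_GENES = 210      # 209 protein-encoding genes plus one tRNA gene

# Average spacing of genes along the chromosome, in kilobases per gene.
kb_per_gene = CHR2_LENGTH_BP / PREDICTED_GENES / 1000
print(f"P. falciparum chr 2: one gene per {kb_per_gene:.1f} kb")

# Rounded comparison values quoted in the text (kb per gene).
YEAST_KB_PER_GENE = 2.0
C_ELEGANS_KB_PER_GENE = 7.0

# Chromosome 2 sits between the two reference genomes.
is_intermediate = YEAST_KB_PER_GENE < kb_per_gene < C_ELEGANS_KB_PER_GENE
print("intermediate between yeast and C. elegans:", is_intermediate)
```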

Table 1

Summary of features of P. falciparum chromosome 2 (P. f. chr 2) and comparison to S. cerevisiae chromosome 3 (S. c. chr 3). Protein structural features were predicted as described (14). ND, not determined. Numbers in parentheses indicate the percentage of the total genes or proteins with the specified properties.


The proteins encoded in chromosome 2 (Table 2) fall into the following three categories: (i) 72 proteins (34%) are conserved in other genera and contain one or more distinct globular domains; (ii) 47 proteins (23%) belong to Plasmodium-specific families with identifiable structural features and, in some cases, known functions; and (iii) 90 predicted proteins (43%) have no detectable homologs, although many contain structural features such as signal peptides and transmembrane domains. Homologs outside Plasmodium were detected for 87 (42%) of the 209 predicted proteins. These include proteins in the first category, in addition to those proteins in the second category that possess a conserved domain or domains that are arranged in a manner unique to Plasmodium. The percentage of evolutionarily conserved proteins is about half that found for other genomes, mainly because most of the remaining proteins were predicted to consist primarily of nonglobular domains (15) (Table 1). The abundance of nonglobular domains in Plasmodium proteins is very unusual; the proportion of proteins with predicted large nonglobular domains in other eukaryotes, such as S. cerevisiae (Table 1) or C. elegans (16), is approximately half that observed in Plasmodium. Furthermore, 13 of the 87 conserved proteins on chromosome 2 appear to contain large nonglobular structures (>30 amino acids) that are inserted directly into globular domains, as determined by alignment with homologs from other species.
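As a consistency check on the breakdown above, the three categories can be tallied against the 209 predicted proteins. This is only an illustrative sketch: the category labels are shorthand, and the shares are printed to one decimal place, so they differ slightly from the whole-number percentages quoted in the text.

```python
TOTAL_PROTEINS = 209

# Counts per category, as given in the text.
categories = {
    "conserved, globular domains": 72,
    "Plasmodium-specific families": 47,
    "no detectable homologs": 90,
}

# The three categories should account for every predicted protein.
category_sum = sum(categories.values())
print("sum of categories:", category_sum)

# Share of each category, to one decimal place.
shares = {name: 100 * n / TOTAL_PROTEINS for name, n in categories.items()}
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")

# Proteins with homologs outside Plasmodium (87 of 209).
conserved_share = 100 * 87 / TOTAL_PROTEINS
print(f"homologs outside Plasmodium: {conserved_share:.1f}%")
```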

Table 2

Identification of genes on P. falciparum chromosome 2. The PF number is the systematic name assigned according to a method adapted from S. cerevisiae (14). The description contains the name (if known) and prominent features of the gene. The table includes genes with homologs in other species and members of Plasmodium gene families. An expanded version of this table with additional information is available on the World Wide Web. Prt, protein; OO, organellar origin; TP, transit peptide; ATP, adenosine triphosphate; euk., eukaryotic; nt, nucleotide.


To determine whether nonglobular domains and proteins are expressed in P. falciparum, we performed a reverse transcriptase (RT)–PCR on 11 nonglobular domains and on two genes that encoded predominantly nonglobular proteins, using total blood-stage RNA as a template. In all cases, RT-PCR products were the same size as those that were amplified from genomic DNA, and the sequence of RT-PCR products matched the genomic DNA sequence (17). Thus, it is likely that most, if not all, predicted nonglobular domains in chromosome 2 genes are expressed. One example of the insertion of a nonglobular domain into a well-defined globular domain is seen in a protein containing a 5′-3′ exonuclease (Fig. 2). The alignment of the Plasmodium sequence with four bacterial exonucleases revealed a 176–amino acid insertion in a region between a strand and a helix in the three-dimensional structure of this protein (18). This suggests that eukaryotic proteins can accommodate inserts that may be excluded from the protein core folding without impairing the protein function. The propagation of nonglobular domains in Plasmodium suggests that such proteins provide specific selective advantages to the parasite. A structural analysis of Plasmodium proteins that contain nonglobular inserts may be valuable for understanding the general principles of protein folding.

Figure 2

Multiple alignment of the predicted 5′-3′ exonuclease (PFB0180w) encoded in chromosome 2 with homologous bacterial exonuclease domains showing the large nonglobular insert in Plasmodium. The alignment was constructed with the profile alignment option of CLUSTALW (34). The alignment column shading is based on a 100% consensus, which is shown underneath the alignment; h indicates hydrophobic residues (A, C, F, I, L, M, V, W, and Y), u indicates “tiny” residues (G, A, and S), o indicates hydroxy residues (S and T), c indicates charged residues (D, E, K, R, and H), and + indicates positively charged residues (K and R) (35). The aspartates involved in metal coordination have a red background and inverse type. Secondary structure elements derived from the crystal structure of Thermus aquaticus DNA polymerase (18) are shown above the alignment (H indicates α helix, and E indicates extended conformation, or β strand). 5′-3′-exo_Aae is a stand-alone exonuclease from Aquifex aeolicus, and the remaining bacterial sequences are the NH2-terminal domains of DNA polymerase I.
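The consensus symbols in the caption can be illustrated with a small sketch that assigns a symbol to a single alignment column. The residue classes are those listed in the caption; everything else is an assumption made for illustration, including the function name, the rule that an identical residue takes priority, the choice of the smallest matching class, and the use of "." for columns without a consensus. It is not the procedure used to produce the figure.

```python
# Residue classes as listed in the figure caption.
CLASSES = {
    "h": set("ACFILMVWY"),  # hydrophobic
    "u": set("GAS"),        # "tiny"
    "o": set("ST"),         # hydroxy
    "c": set("DEKRH"),      # charged
    "+": set("KR"),         # positively charged
}


def column_consensus(column: str, threshold: float = 1.0) -> str:
    """Return a consensus symbol for one alignment column.

    Assumed rules (not taken from the paper): a single residue shared by
    at least `threshold` of the sequences is reported as itself; otherwise
    the smallest class covering at least `threshold` of the sequences is
    reported; otherwise "." is returned. Gaps ("-") never count as members.
    Use a threshold above 0.5 so that at most one residue can qualify.
    """
    residues = [r.upper() for r in column]
    n = len(residues)
    # First preference: one identical residue across the column.
    for aa in sorted(set(residues)):
        if aa != "-" and residues.count(aa) / n >= threshold:
            return aa
    # Second preference: the most specific (smallest) matching class.
    for symbol, members in sorted(CLASSES.items(), key=lambda kv: len(kv[1])):
        if sum(r in members for r in residues) / n >= threshold:
            return symbol
    return "."


# Toy columns (hypothetical), one character per aligned sequence.
for col in ("DDDD", "KRKR", "DEKR", "ILVM", "STSS", "GASA", "DIKG"):
    print(col, "->", column_consensus(col))
```

Passing `threshold=0.95` would mimic a 95% consensus on the same assumed rules.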

Of the 87 conserved proteins that are encoded on chromosome 2, 71 (83%) show the greatest similarity to eukaryotic homologs (Table 2). In contrast, the remaining 16 proteins are most similar to bacterial proteins, and 4 of these represent the first eukaryotic members of protein families that have previously been seen only in bacteria. At least some of these 16 genes may have been transferred to the nuclear genome from an organellar genome after the divergence of the phylum Apicomplexa from other eukaryotic lineages. Several of these proteins appear to contain NH2-terminal organellar import peptides (19) and may function within the apicoplast or the mitochondrion. One such gene encodes 3-ketoacyl–acyl carrier protein (ACP) synthase III (FabH), which catalyzes the condensation of acetyl–coenzyme A and malonyl-ACP in type II (dissociated) fatty acid synthase systems. Type II synthase systems are restricted to bacteria and the plastids of plants, confirming previous hypotheses that the Plasmodium apicoplast contains metabolic pathways that are distinct from those of the host (20, 21).

Because the phylum Apicomplexa represents a deep branch in the eukaryotic tree, the presence of eukaryotic-specific genes in P. falciparum suggests the appearance of these genes early in eukaryotic evolution. Most of these genes code for proteins that are involved in DNA replication, repair, transcription, or translation (Table 2) and include the origin recognition complex subunit 5, excision repair proteins ERCC1 and RAD2, and proteins involved in chromatin dynamics (such as the BRAHMA helicase, an ortholog of the DRING protein containing the RING finger domain, and chromatin protein SNW1). Furthermore, several eukaryotic proteins involved in secretion are encoded in chromosome 2 (such as the SEC61 γ subunit, the coated pit coatomer subunit, and syntaxin), suggesting an early emergence of the eukaryotic secretory system.

Proteins of the DnaJ superfamily act as cofactors for HSP70-type molecular chaperones and participate in protein folding and trafficking, complex assembly, organelle biogenesis, and initiation of translation (22). Five proteins containing DnaJ domains are present on chromosome 2, which suggests multiple roles for this domain in the Plasmodium life cycle. Two of these proteins consist primarily of the DnaJ domain, whereas the other three also contain a large nonglobular domain. Several proteins containing a DnaJ domain have been detected on other chromosomes, indicating that this is a large gene family in Plasmodium (23). One of its members, the ring-infected erythrocyte surface antigen, binds to the cytoplasmic side of the erythrocyte membrane, suggesting that DnaJ domains perform chaperone-like functions in the formation of protein complexes at this location (24). DnaJ domains in some P. falciparum proteins contain substitutions in the His-Pro-Asp signature that is required for interaction with HSP70-type proteins, which may indicate a modification of the typical chaperone function.

Chromosome 2 contains five protein families that are unique to Plasmodium in terms of their distinct domain organization, although three of them contain domains that are conserved in other genera. The genes encoding the Plasmodium-specific families are primarily located near the ends of the chromosome. A single var gene was identified in each subtelomeric region. The var genes encode large transmembrane proteins (PfEMP1) expressed in knobs on the surface of schizont-infected red cells. PfEMP1 proteins exhibit extensive sequence diversity; are clonally variant; and are involved in antigenic variation, cytoadherence, and rosetting (6–8). In addition to the full-length var genes, six small ORFs were identified in the subtelomeric regions that were similar to var sequences. Five of these ORFs resembled the var exon II cDNAs or the Pf60.1 sequences that were reported previously (7, 25).

The largest Plasmodium-specific family found on chromosome 2 encodes proteins that were dubbed rifins, after the RIF-1 repetitive element. RIF-1 contained a 1-kb ORF but no initiation codon, was found on most chromosomes, and was transcribed in late blood-stage parasites (9). The function of the RIF-1 element was unknown. Eighteen ORFs with similarities to RIF-1 were found in the subtelomeric regions of chromosome 2, centromeric to the var genes. An inspection of the sequence upstream of these ORFs revealed exons encoding signal peptides, which indicated that the RIF-1 elements were actually genes consisting of two exons. These genes encode potential transmembrane proteins of 27 to 35 kD, with an extracellular domain that contains conserved Cys residues that might participate in disulfide bonding, a transmembrane segment, and a short basic COOH-terminus. The extracellular domain also contains a highly variable region (Fig. 3). RT-PCR with schizont RNA showed that one of six rifin genes that were tested was transcribed. The function of the rifins is unknown, but their sequence diversity, predicted cell surface localization, and expression in schizont stages suggest that, like var genes, they may be clonally variant. Multiple rifin genes were detected in the telomeric regions of chromosomes 3 and 14, suggesting that rifin genes have propagated as clusters in the course of Plasmodium evolution (26). If the number found on chromosome 2 is representative of other chromosomes, there may be 500 or more rifin genes in the P. falciparum genome (∼7% of all protein-coding genes), making it the most abundant gene family in this organism. The presence of var and rifin genes and other ORFs in subtelomeric regions of P. falciparum chromosomes confirms that the subtelomeric regions are not transcriptionally silent (27).
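The text does not state how the genome-wide rifin estimate was derived. One way a reader might reproduce its order of magnitude is to scale the chromosome 2 count by the ratio of genome size to chromosome size, and to estimate the total gene number from the chromosome 2 gene density. The sketch below does exactly that; the scaling rule is an assumption made here for illustration, and the inputs are the rounded figures quoted elsewhere in the article.

```python
# Rounded figures quoted in the article.
RIFINS_ON_CHR2 = 18
CHR2_LENGTH_KB = 947
GENOME_LENGTH_KB = 30_000      # genome size of roughly 30 Mb
KB_PER_GENE = 4.5              # gene density observed on chromosome 2

# Assumption: rifin genes occur genome-wide at the chromosome 2 rate per kb.
estimated_rifins = RIFINS_ON_CHR2 * GENOME_LENGTH_KB / CHR2_LENGTH_KB
print(f"estimated rifin genes: {estimated_rifins:.0f}")

# Assumption: the chromosome 2 gene density holds for the whole genome.
estimated_genes = GENOME_LENGTH_KB / KB_PER_GENE
print(f"estimated genes in genome: {estimated_genes:.0f}")

# Share of the genome represented by 500 rifin genes under these assumptions.
share_of_500 = 100 * 500 / estimated_genes
print(f"500 rifins as share of all genes: {share_of_500:.1f}%")
```

Under these assumptions the scaled count is about 570, consistent with "500 or more", and 500 genes would be about 7.5% of the estimated total, close to the ∼7% quoted.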

Figure 3

Multiple sequence alignment of rifins encoded on chromosome 2. The predicted coding regions were aligned with CLUSTALW (34) using the default settings. The alignment column shading is based on a 95% consensus, which is shown underneath the alignment; h indicates hydrophobic residues (A, C, F, I, L, M, V, W, and Y), p indicates polar residues (D, E, H, K, N, Q, R, S, and T), b indicates “big” residues (F, I, L, M, V, W, Y, K, R, Q, and E), and+ indicates positively charged residues (K and R) (35). The cysteines conserved in subsets of rifins are shown by inverse type.

Another family of membrane-associated proteins, serine repeat antigens (SERAs), contains a papain protease-like domain. A cluster of three SERA genes, which were all transcribed in the same direction (from centromere to telomere), was known to be on chromosome 2 (28); at least one SERA has been evaluated for use in blood-stage vaccines. These genes are part of an eight-gene cluster; seven genes have a similar four-exon structure, but the gene at the 3′ end of the cluster contains only three exons. The protease domains in these proteins are unusual because five of the eight contain serine instead of cysteine in the active nucleophile position, suggesting that they are serine proteases with a structure that is typical of cysteine proteases (29).

Two proteins (MSP-4 and MSP-5) that contain an epidermal growth factor (EGF) module in their extracellular domains were identified (30, 31). In organisms that are not classified in the animal kingdom, MSP-4, MSP-5, and MSP-1 (a multi-EGF domain protein encoded on chromosome 3) and two Plasmodium sexual-stage antigens (32) are the only proteins that contain EGF repeats, which suggests that Plasmodium obtained the sequence for this domain from its animal host. The plasmodial EGF domains may be involved in parasite adhesion to host cells.

In addition to the families of Plasmodium-specific proteins, chromosome 2 contains genes for many secreted and membrane proteins. One of these genes encodes a protein with a modified thrombospondin domain and was transcribed in blood-stage parasites (17). Other Plasmodium proteins containing thrombospondin domains, such as sporozoite surface protein 2/TRAP and circumsporozoite protein, are involved in the parasitic invasion of host cells (33), suggesting that this protein may be involved in the binding of infected red cells to host-cell ligands.

Determination of the first P. falciparum chromosome sequence demonstrates that the A+T richness of P. falciparum DNA will not prevent the sequencing of the genome. Although technical difficulties not observed during the sequencing of other microbial genomes were encountered, solutions to these problems were found that will facilitate sequencing of the remaining chromosomes. The genome sequence should be of value in the study of Plasmodium biology and in the development of new drugs and vaccines for the treatment and prevention of malaria. In addition to these practical benefits, the Plasmodium genome sequence should provide broader biological insights, particularly in regard to the plasticity of the eukaryotic genome that is manifest in the preponderance of the predicted nonglobular domains in plasmodial proteins.

  • * Present address: ARIAD Pharmaceuticals, 26 Landsdowne Street, Cambridge, MA 02139, USA.

  • Present address: Celera Genomics, 45 West Gude Drive, Rockville, MD 20850, USA.

  • To whom correspondence should be addressed. E-mail: hoffmans{at}

