Microbial Genes in the Human Genome: Lateral Transfer or Gene Loss?


Science  08 Jun 2001:
Vol. 292, Issue 5523, pp. 1903-1906
DOI: 10.1126/science.1061036


The human genome was analyzed for evidence that genes had been laterally transferred into the genome from prokaryotic organisms. Protein sequence comparisons of the proteomes of human, fruit fly, nematode worm, yeast, mustard weed, eukaryotic parasites, and all completed prokaryote genomes were performed, and all genes shared between human and each of the other groups of organisms were collected. About 40 genes were found to be exclusively shared by humans and bacteria and are candidate examples of horizontal transfer from bacteria to vertebrates. Gene loss, combined with sample size effects and evolutionary rate variation, provides an alternative, more biologically plausible explanation.

Studies of the evolution of species long assumed that gene flow between species is a minor contributor to genetic makeup and occurs only between closely related species. This picture changed when researchers began to study the genetics of microorganisms. Genes, including those encoding antibiotic resistance, can be exchanged between even distantly related bacterial species (horizontal or lateral gene transfer). A growing body of evidence suggests that lateral gene transfer may be a much more important force in prokaryotic evolution than was previously realized (1). Lateral gene transfers involving eukaryotes have also been well documented, in most cases involving transfers from organellar genomes into the eukaryotic nucleus (2).

Analysis of the rough draft of the human genome recently led to the suggestion (3) that 223 bacterial genes have been laterally transferred into the human genome sometime during vertebrate evolution. Such a possibility is of interest because it implies that bacterial infections have led to permanent transfer of genes into their hosts. One possible implication is that bacteria might be manipulating the human genome for their own benefit and that this process may be continuing. Such an event would require (i) that genes be transferred into the germ cell lineage, not just into any somatic cell, and (ii) that the transferred genes be stably maintained in the host cell, either by insertion into a chromosome or as extrachromosomal elements. For these genes to spread through the population, they need either to provide a selective advantage to their host or to exhibit some kind of “selfish” properties, such as the ability to duplicate and transpose.

Although the possibility of lateral gene transfer has gained much support in recent years from analysis of complete genome sequences (1, 4, 5), the inference of such gene transfer events is still fraught with difficulty, because of problems with methods and with the data analyzed (6, 7). As in the recent study (3), we focused on detecting possible gene transfers from bacteria to vertebrates by analysis of gene distribution patterns across taxa. Those genes found in bacteria and vertebrates but not in nonvertebrates are considered possible cases of lateral transfer (putative bacteria to vertebrate transfers, or BVTs). Our study differed in that it included the human proteome reported by Venter et al. (8) and it included proteins from parasite lineages not included in the previous study (9).

We focused on analyzing complete genome sequences because the absence of a gene from a species cannot be inferred from incomplete genome sequences. Human genes for which homologs are found in completed prokaryotic genomes were identified by searching against all publicly available complete genome sequences. For our analysis of the human proteome, we used the Ensembl set, containing 31,780 proteins (3), and the Celera set, containing 26,544 proteins (8). In the Ensembl proteome, 4388 genes have BlastP matches with E-values less than 10⁻¹⁰ to a protein from a complete prokaryotic genome. Likewise, 3915 genes from the Celera proteome match at least one prokaryotic gene with the same E-value threshold (Table 1). As in (3), transfers into vertebrates were ruled out if a homolog of a gene was found in a nonvertebrate eukaryotic genome.
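
The screening rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the input is assumed to be a list of (human gene, taxon group, E-value) tuples already parsed from Blast output, and the function name `candidate_bvts`, the group labels, and the gene names are all hypothetical.

```python
# Illustrative sketch only; gene names, group labels, and the input layout
# are assumptions, not data or code from the study.
E_CUTOFF = 1e-10  # same threshold for prokaryotic and nonvertebrate matches


def candidate_bvts(hits, cutoff=E_CUTOFF):
    """Return human genes with a prokaryotic match but no nonvertebrate match.

    hits: iterable of (human_gene, taxon_group, e_value), where taxon_group
    is either "prokaryote" or "nonvertebrate".
    """
    hits = list(hits)  # allow generators; we scan the hits twice
    prokaryote_shared = {g for g, grp, e in hits if grp == "prokaryote" and e < cutoff}
    nonvertebrate_shared = {g for g, grp, e in hits if grp == "nonvertebrate" and e < cutoff}
    # A homolog in any nonvertebrate eukaryote rules out a candidate transfer.
    return prokaryote_shared - nonvertebrate_shared


toy_hits = [
    ("geneA", "prokaryote", 1e-40),
    ("geneA", "nonvertebrate", 1e-25),  # has a nonvertebrate homolog: excluded
    ("geneB", "prokaryote", 1e-15),     # prokaryote only: candidate
    ("geneC", "prokaryote", 1e-5),      # too weak to count as shared
    ("geneD", "prokaryote", 1e-30),
    ("geneD", "nonvertebrate", 1e-8),   # weaker than the cutoff, so not excluded
]
print(sorted(candidate_bvts(toy_hits)))  # ['geneB', 'geneD']
```

The toy gene `geneD` shows how a fixed cutoff can leave a gene in the candidate set even though a weaker nonvertebrate match exists.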

Table 1

Proteome sizes and number of genes shared with each of the human protein sets, with a Blast cutoff of 10⁻¹⁰.


If the pattern of genes shared between prokaryotic and eukaryotic species is a robust measure of lateral gene transfer, then we would expect that the total number of true BVTs would be independent of which and how many nonvertebrate genomes have been sampled. However, as the number of nonvertebrate proteomes screened against human increased, the number of BVTs decreased (Fig. 1). The two plots show comparable results for the Ensembl and Celera protein sets, and each line shows the effect with a different starting proteome. Subsequent points on the plots show averages after removing one more proteome; for example, the “fruit fly” line shows the average number of genes remaining in the BVT set after removing all Drosophila melanogaster genes plus one, two, three, and four additional protein sets. After removal of all genes found in complete nonvertebrate genomes, only 135 Ensembl genes and 89 Celera genes remained as possible BVTs.
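
One way to compute a curve of the kind plotted in Fig. 1 is sketched below. It assumes that each point is an average over every possible choice of additional proteomes, which is one plausible reading of the description above rather than a confirmed detail of the analysis. The function `sampling_curve`, the proteome names, and the gene identifiers are hypothetical.

```python
from itertools import combinations
from statistics import mean


def sampling_curve(prok_shared, matches, start):
    """Average number of candidate genes left after removing proteomes.

    prok_shared: set of human genes shared with prokaryotes.
    matches: dict mapping proteome name -> set of human genes matched there.
    start: the proteome removed first; point k of the curve averages over
           all ways of removing k further proteomes.
    """
    others = [name for name in matches if name != start]
    curve = []
    for k in range(len(others) + 1):
        counts = []
        for extra in combinations(others, k):
            removed = set(matches[start]).union(*(matches[n] for n in extra))
            counts.append(len(prok_shared - removed))
        curve.append(mean(counts))
    return curve


# Toy data (assumed, for illustration only).
prok_shared = {"g1", "g2", "g3", "g4", "g5", "g6"}
matches = {
    "fly": {"g1", "g2"},
    "worm": {"g2", "g3"},
    "yeast": {"g3", "g4"},
}
print(sampling_curve(prok_shared, matches, "fly"))
```

Because removing a proteome can only shrink the candidate set, every curve produced this way is non-increasing, which is the pattern the text describes.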

Figure 1

Genes shared by humans and prokaryotes after removing successive proteome sets from five nonvertebrates and a collection of miscellaneous nonvertebrates (“Other”). (Top) Ensembl protein set. (Bottom) Celera protein set.

The downward trend of the plot in Fig. 1 suggests that the number of BVTs might decrease further if more nonvertebrate genomes are added to the analysis. Our analysis confirms this: Searching through all proteins in GenBank from numerous other eukaryotic nonvertebrates (labeled “Other” in Fig. 1), most of which have a relatively small number of characterized genes, identified matches to organisms such as Suberites domuncula (sponge), soybean, and Aspergillus terreus. As a result of this filtering, 21 genes were removed from the Ensembl BVTs and 21 from the Celera BVTs, leaving only 114 and 68 genes in the two sets, respectively.

One explanation for the species-sampling effect shown in Fig. 1, and the reason why species distribution patterns must be interpreted with great caution, is the phenomenon of gene loss. It is likely that many genes shared by the eukaryotic common ancestor have been lost in some lineages. This seems especially likely in some of the species analyzed here, such as Arabidopsis thaliana, which was chosen for genome sequencing in part because of its small genome size, and Saccharomyces cerevisiae, for which extensive gene loss has been documented (10). A simple computation illustrates the possible contribution of gene loss to the pattern. Suppose the five eukaryotic genomes analyzed all resulted from a single adaptive radiation. If this common ancestor started with 10,000 genes [see Rubin et al. (11) for a discussion of “core proteome” sizes] and each lineage independently lost 30% of its genes, then the probability that any one gene was lost from all four nonvertebrate lineages is (0.3)⁴ = 0.0081, or 81 genes lost from all four of the nonvertebrate lineages. Of course, some genes are probably less likely to be lost than others (e.g., DNA polymerase genes). Supposing that 20% of a proteome cannot be lost, then 30% loss among the remaining 8000 genes translates into about 65 genes lost in all four lineages. It appears likely that gene loss alone could account for a large proportion of the BVT set.
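
The arithmetic in this paragraph can be checked with a short calculation. The sketch below assumes that losses are independent across lineages and that, in the second scenario, the 30% loss rate applies only to the genes that can be lost; that reading reproduces the figure of 65. The function name and its parameters are introduced here purely for illustration.

```python
def expected_lost_in_all(ancestral_genes, loss_fraction, lineages, protected_fraction=0.0):
    """Expected number of ancestral genes lost independently from every lineage.

    protected_fraction is the share of genes assumed impossible to lose;
    the loss fraction is applied only to the remaining genes (an assumption).
    """
    losable = ancestral_genes * (1.0 - protected_fraction)
    return losable * loss_fraction ** lineages


# 10,000 ancestral genes, 30% loss per lineage, four nonvertebrate lineages.
print(round(expected_lost_in_all(10_000, 0.3, 4)))       # 81
# Same, but 20% of genes cannot be lost (8000 losable genes).
print(round(expected_lost_in_all(10_000, 0.3, 4, 0.2)))  # 65
```

Under these assumptions the expected count is sensitive to the per-lineage loss rate: raising it from 30% to 40% roughly triples the first figure, since 0.4⁴ = 0.0256.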

Another important aspect of the species-sampling effect is the phylogenetic bias in the data sets being analyzed. All of the eukaryotic complete genomes are from so-called “crown” eukaryotes: animals, plants, and fungi. In addition, three of these (Caenorhabditis elegans, D. melanogaster, and Homo sapiens) are animals, further limiting the sample of evolutionary diversity. In contrast, the sampling of prokaryotic evolutionary diversity is much broader, containing representatives from many widely divergent bacterial and Archaeal lineages (12). It seems likely that the sequencing of a broader variety of eukaryotic genomes will lead to a further reduction in the number of BVTs.

The rate of nucleotide substitution varies for different genes within a genome as well as for the same gene in different species. This rate variation is due to a combination of factors, including variation in DNA replication accuracy, DNA repair, selection, recombination, genetic drift, and generation time (13). Because of the effects of rate variation, sequence similarity alone is not an accurate measure of evolutionary relatedness (14, 15). Thus, Blast E-values, which are measures of sequence similarity, should not be used to measure evolutionary relatedness (15). This is particularly true in analyses of complete genomes, where it can be expected that at least some genes will be nonessential, with low selective pressure allowing more rapid mutation. In the analysis used to support the claim that 223 genes have been laterally transferred into human (3), a gene was considered a BVT if the Blast score for the bacterial match was at least 10⁻⁹-fold smaller than the nonvertebrate match score. From a statistical perspective, the null hypothesis should be that two genes with sufficiently high sequence similarity share a common ancestor. Our analysis used the same threshold for prokaryotic and nonvertebrate matches, with a maximum E-value cutoff of 10⁻¹⁰ (i.e., the likelihood that any Blast hit was due to chance is less than 1 in 10¹⁰). The use of any fixed E-value cutoff, though, will miss genes with slightly weaker similarity to nonvertebrate proteins. Because the weaker alignment scores may simply be the result of more rapid mutation in the invertebrate lineage, it is impossible to rule out common ancestry on the basis of this evidence alone. By relaxing the E-value cutoff for nonvertebrate genes to 10⁻⁷, we reduced the size of the Ensembl BVT set to 74 genes and the Celera BVT set to 56 genes. In addition, after comparing the 74 Ensembl BVTs to invertebrate mitochondrial genomes, we found two genes of mitochondrial origin, reducing that BVT set to 72 genes.
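
The effect of using a less stringent cutoff for nonvertebrate matches can be illustrated with a small sketch. It assumes that the best prokaryotic and best nonvertebrate E-values have already been collected for each human gene; the function `bvt_set`, the dictionary layout, and the gene names are hypothetical and the E-values are invented for the example.

```python
def bvt_set(best_hits, prok_cutoff=1e-10, nonvert_cutoff=1e-10):
    """Genes with a prokaryotic match and no sufficiently strong nonvertebrate match.

    best_hits maps gene -> (best prokaryotic E-value, best nonvertebrate
    E-value), with None where no hit of that kind was found.
    """
    kept = set()
    for gene, (prok_e, nonvert_e) in best_hits.items():
        has_prok = prok_e is not None and prok_e < prok_cutoff
        has_nonvert = nonvert_e is not None and nonvert_e < nonvert_cutoff
        if has_prok and not has_nonvert:
            kept.add(gene)
    return kept


# Invented values for illustration only.
best_hits = {
    "gene1": (1e-50, None),   # no nonvertebrate hit at all
    "gene2": (1e-30, 1e-8),   # weak nonvertebrate hit, between the two cutoffs
    "gene3": (1e-20, 1e-12),  # strong nonvertebrate hit
    "gene4": (1e-12, 1e-3),   # nonvertebrate hit too weak under either cutoff
}
strict = bvt_set(best_hits)                        # same 1e-10 cutoff for both
relaxed = bvt_set(best_hits, nonvert_cutoff=1e-7)  # weaker matches also exclude
print(sorted(strict), sorted(relaxed))
```

Here `gene2` drops out only under the 10⁻⁷ cutoff, which mirrors how genes with slightly weaker nonvertebrate similarity leave the candidate set in the analysis above.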

If a gene was transferred from a prokaryotic lineage into the vertebrate lineage, this likely occurred within the past 400 to 500 million years, after most of the major prokaryotic phyla were established. Therefore, any transferred gene should be more closely related to its donor lineage than to any other prokaryotic lineage, which would be detectable in phylogenetic trees. For example, phylogenetic trees built from genes that have been transferred from mitochondrial or plastid genomes to eukaryotic nuclei (16–18) indicate that the transferred genes branch with α-proteobacteria and cyanobacteria, respectively. We generated phylogenetic trees for genes from the BVT sets for which sufficient numbers of related genes were available and found that most did not show patterns consistent with bacterial to vertebrate gene transfer. One such example is shown in Fig. 2, which shows a phylogenetic tree of three human hyaluronan synthase paralogs, all from the BVT set reported in (3). The phylogenetic analysis reveals that the vertebrate genes do not branch within any particular prokaryotic lineage. Instead, the placement of groups in the tree is consistent with normal vertical inheritance; the absence of the gene from nonvertebrate lineages may be due either to gene loss or rate variation.

Figure 2

Phylogenetic tree of homologs of three human hyaluronan synthase (HAS) proteins that were proposed as lateral transfers from bacteria to vertebrates (3). Homologs of the human HAS genes were identified with iterative Blastp searches of a low-redundancy protein database and aligned with clustalW. More distantly related proteins were used as outgroups to root the tree. The tree was generated from the alignment (variable regions and gaps excluded) with the neighbor-joining algorithm implemented by Phylip (25) with a PAM-based distance matrix. Species names, major evolutionary groupings, gene names if available, and sequence IDs (gi for Genpept and sp for Swissprot) are indicated in the tree. Scale bar corresponds to estimated evolutionary distance units. The presence of multiple HAS genes in different vertebrate species is likely due to duplication in vertebrates.

The absence of a gene from the annotation for fruit fly, nematode, or any other organism is not proof that the gene is missing from that organism's genome. First, not all of these genomes are complete. Second, the annotation of the completed portions of some eukaryotic genomes is still in progress, and the state of the art in eukaryotic gene finding is imperfect. To check for genes missing from the annotation, we used TBlastN to search the human proteins from the initial BVT sets against the nucleotide sequences of the complete eukaryotic genomes. This analysis resulted in two matches between Ensembl BVTs and A. thaliana and three matches to Caenorhabditis elegans, all with E-values of 10⁻³² or lower. Three of these five genes had already been removed in the steps that reduced the set to 72 BVTs; removal of the other two left 70 Ensembl BVTs.

The Ensembl proteome set has been further curated, and numerous genes have been removed from the 31,780 used for the analysis in (3). The October release (version 8.0), containing 29,304 genes, has eliminated some genes (including possible contaminants), collapsed multiple genes into one, and otherwise improved the data. We screened the 70 BVTs against the newer proteome and found that 23 genes had been eliminated, reducing the BVT set to 47 genes. If the original 135 Ensembl BVTs are screened against the newer release, this set is reduced to 89 genes. There were also 89 genes in the initial Celera BVT set.

Comparing the 47 Ensembl BVTs against the 56 Celera BVTs yields some interesting final reductions in the data set. Both sets contain genes not included in the other set; more interesting, though, are the genes shared between the two sets. In most cases, the sequences do not match exactly, and the differences in the gene models sometimes yield further matches to nonvertebrate genes. Of the 56 Celera BVTs, 10 genes match an Ensembl protein that in turn matches one or more nonvertebrates; six of these match all four of the complete nonvertebrate genomes. This reduces the Celera BVT set to 46 genes. Of the 47 Ensembl BVTs, five genes match Celera proteins that in turn match nonvertebrates, and one short (115 amino acid) protein falls on an 825–base pair unmapped contig, which appears to be a contaminant. This reduces the Ensembl BVT set to 41 genes.
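
The successive reductions reported above can be tallied to confirm that the counts are mutually consistent. The sketch below uses only the set sizes stated in the text; the step labels are paraphrases, and the helper `removed_per_step` is introduced for illustration.

```python
# Remaining candidate counts after each filtering step, as stated in the text.
ENSEMBL = [
    ("no match in complete nonvertebrate genomes", 135),
    ("GenBank matches to other nonvertebrates", 114),
    ("nonvertebrate cutoff of 1e-7", 74),
    ("mitochondrial origin", 72),
    ("TBlastN matches in eukaryotic genomes", 70),
    ("newer Ensembl release", 47),
    ("Celera cross-matches and one contaminant", 41),
]
CELERA = [
    ("no match in complete nonvertebrate genomes", 89),
    ("GenBank matches to other nonvertebrates", 68),
    ("nonvertebrate cutoff of 1e-7", 56),
    ("Ensembl cross-matches", 46),
]


def removed_per_step(steps):
    """Return (label, genes removed) for each step after the first."""
    return [(label, prev - cur) for (_, prev), (label, cur) in zip(steps, steps[1:])]


for name, steps in (("Ensembl", ENSEMBL), ("Celera", CELERA)):
    print(name)
    for label, removed in removed_per_step(steps):
        print(f"  {label}: -{removed}")
```

The differences match the removals the text states explicitly (21, 2, 2, 23, and 5 + 1 for Ensembl; 21 and 10 for Celera); the counts removed at the 10⁻⁷ step are implied by the totals rather than stated.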

After careful reexamination of the human proteome, we find only 46 genes in the Celera protein set, and 41 in the Ensembl set, that are candidates for possible lateral transfer between bacteria and human (19). The evidence presented here provides several plausible biological explanations for the presence of these genes in the human genome. The argument for lateral gene transfer (3) is essentially a statistical one, necessarily so because of the inherent impossibility of observing events that may have occurred in the distant past. As with all statistical arguments, great care needs to be exercised to confirm assumptions and explore alternative hypotheses. In cases where equally if not more plausible mechanisms exist, extraordinary events such as horizontal gene transfer do not provide the best explanation. The more probable explanation for the existence of genes shared by humans and prokaryotes, but missing in nonvertebrates, is a combination of evolutionary rate variation, the small sample of nonvertebrate genomes, and gene loss in the nonvertebrate lineages.

  • * To whom correspondence should be addressed. E-mail: salzberg{at}

