Report

# Comparative Genome Sequencing for Discovery of Novel Polymorphisms in Bacillus anthracis

See allHide authors and affiliations

Science  14 Jun 2002:
Vol. 296, Issue 5575, pp. 2028-2033
DOI: 10.1126/science.1071837

## Abstract

Comparison of the whole-genome sequence ofBacillus anthracis isolated from a victim of a recent bioterrorist anthrax attack with a reference reveals 60 new markers that include single nucleotide polymorphisms (SNPs), inserted or deleted sequences, and tandem repeats. Genome comparison detected four high-quality SNPs between the two sequenced B. anthracischromosomes and seven differences among different preparations of the reference genome. These markers have been tested on a collection of anthrax isolates and were found to divide these samples into distinct families. These results demonstrate that genome-based analysis of microbial pathogens will provide a powerful new tool for investigation of infectious disease outbreaks.

On 4 October 2001, the Centers for Disease Control reported a highly unusual case of inhalational anthrax in a photo editor at a West Palm Beach, Florida, media organization (1). This turned out to be the first in a series of letter-based attacks over several weeks. The attacks resulted in five fatalities (including the first-diagnosed victim) and several cases of severe inhalational anthrax. This wave of bioterrorism caused several billion dollars of loss to the U.S. economy and inflicted further anxiety on a population suffering after the World Trade Center tragedy.

Bacillus anthracis, the causative agent of anthrax, is a highly monomorphic species with genes from different isolates that typically have greater than 99% nucleotide sequence identity (2). The most accurate strain-typing tool available examines copy-number modulation of variable-number tandem repeat (VNTR) markers (3). VNTR analysis of the Florida, New York, and Washington, DC, isolates used in the letter attacks suggested that they were from the same source (4), namely, a strain originally isolated from a dead cow in Texas in 1981. This strain of B. anthracis, designated Ames, was subsequently sent to the U.S. Army Medical Research Institute (USAMRIID) in Fort Detrick, Maryland, where it was actively used in the U.S. defensive biological weapons program and also provided to other research laboratories in the United States and Europe (Fig. 1).

Though VNTR analysis examines genomic regions known to be highly variable, this method typically samples only a limited number of loci. The availability at TIGR (The Institute for Genomic Research) of a nearly complete genome sequence of a B. anthracis Ames isolate (the Porton isolate) that lacks the pXO1 and pXO2 virulence plasmids (5) provided an opportunity to use a comparative genomics approach to identify potential polymorphic sites in this genome (6).

The assembled sequences from the Florida isolate were aligned to the Porton chromosome and to previously sequenced B. anthracis plasmids (6) by MUMmer (7). The index pXO1 plasmid [181,654 base pairs (bp)] is from the Sterne strain, and the pXO2 plasmid (96,231 bp) is from the Pasteur strain (8, 9). Alignments were processed further to identify three types of genetic differences: (i) SNPs, (ii) VNTRs, and (iii) inserted or deleted sequences (indels). These differences were compiled, and the underlying chromatogram files were examined. The average level of sequence coverage for the main Florida chromosome was 6-fold (i.e., each position in the genome was contained, on average, in six independent sequencing reads). The coverages for the pXO1 and pXO2 virulence plasmids were approximately 20-fold and 14-fold, respectively. This level of coverage suggests a molecular ratio of 3:2:1 for pXO1:pXO2:chromosome in the Florida isolate of B. anthracis.

The accuracy of any base pair in the assembled genome can be computed as a function of the accuracy of each of the underlying sequences. Because all sequence reads are independent samples of the genome, the probabilities derived from each read can be multiplied together to obtain a joint probability that the consensus sequence is correct. Base-calling software assigns an error probability to each base pair in a sequence read (10–12).

TIGR's error rate for finished genomes has been independently measured at less than 1 error in 88,000 nucleotides (13). At this rate, the 5.2-Mbp B. anthracis chromosome would be expected to have approximately 60 sequencing errors, and a comparison of the two isolates would be expected to yield 120 differences. The error rate in DNA sequencing can be dramatically reduced by eliminating regions of low sequence coverage, because error rate is not constant across the entire genome sequence or even across a single read (14). To limit the number of false positives due to sequencing errors, the comparative analysis in this study used only regions with at least 3-fold coverage in both the Florida and Porton sequences; within each isolate, only positions for which the sequence chromatograms agreed with each other were used. More precise estimates of accuracy were computed for each of the polymorphisms described (12).

Only four differences were discovered between the main chromosomes of the Florida and Porton isolates (Table 1). Two of these are SNPs and two are short indels. CS-2 (Table 1) is an AT indel that was seen only in particular clones from the Porton1 library. Seven additional polymorphisms were discovered within the Porton isolate chromosome. These differences can be explained by the fact that TIGR received DNA from two different cultures of the same isolate (5), separated chronologically by about 3 years.

Table 1

Polymorphisms between main chromosomes of Porton and Florida isolates of B. anthracis, Ames strain. Porton1 and Porton2 are derived from different cultures of the Porton isolate (5). CI-x indicates a chromosomal indel. X indicates depth of coverage. The probability that the polymorphism was correctly sequenced is computed as described and shown under P. NS indicates lack of evidence from particular isolate (i.e., all reads were from the other isolate). Not shown in those tables are more than 150 apparent SNPs in regions of low coverage; on the basis of statistical analysis, it is estimated that no more than three of these represent real differences.

View this table:

Because the Porton isolate was cured of the pXO plasmids, we were only able to compare the Florida plasmids against the B. anthracis Sterne strain pXO1 (8) and Pasteur strain pXO2 (9). A total of 38 SNPs, 8 VNTRs, and 3 large indels (Table 2) were found. Fewer plasmid SNPs occur in pXO1 than in pXO2 (15 versus 23), though pXO1 is twice as large as pXO2. However, this data cannot be used to infer a higher mutation rate in pXO2, because the Ames strain is more closely related to Sterne than to the Pasteur strain (3).

Table 2

Insertions distinguishing the plasmids of the Florida isolate of B. anthracis from the plasmids in the Sterne and Pasteur strains. IXy-z indicates indel in plasmid PXO y. X indicates depth of coverage. Repeat units are indicated in brackets. For IX1-1 and IX1-2, which are long insertions in pXO1, only the flanking 10- and 9-bp repeats are shown, with the length of the omitted sequence indicated by brackets (e g., 125 bp). Supporting online material includes a 1000-bp region centered on each of these polymorphisms.

View this table:

Larger polymorphisms included both VNTRs (Table 3) and simple indels (Table 2). The 135-bp inserted sequence in pXO1, IX1-2, is homologous to a region from Bacillus thuringiensis (15), demonstrating that the Sterne lineage has lost this sequence. The 85-bp indel in pXO1, IX1-1, has 84% nucleotide identity (and identical flanking repeats) to a DNA sequence from Bacillus cereus [American Type Culture Collection (ATCC) 10987; T. Read, unpublished data] and is deleted in all but one of the Ames isolates tested (Table 4). This indicates that IX1-1 has been lost very recently within the Ames lineage. The conservation of as many as half of the pXO1 plasmid genes at greater than 40% nucleotide similarity in some B. cereus and B. thuringiensisstrains has been reported in recent hybridization studies (15, 16).

Table 3

VNTRs distinguishing the plasmids of the Florida isolate of B. anthracis from the plasmids in the Sterne and Pasteur strains. VXy-z indicates VNTR numberz in plasmid pXOy. X indicates depth of coverage. Repeat units are indicated in brackets, and superscripts denote fractional copy number (e.g., VX1-4 consists of seven copies of the sequence TTA in one strain and nine copies in the other). For VX2-1, the 3.75 copy number refers to an additional copy containing only six of the eight bases. The probability that the polymorphism was correctly sequenced exceeds 10−97 for all entries. Supporting online material includes a 1000-bp region centered on each of these polymorphisms.

View this table:
Table 4

Distribution of the polymorphisms presented in B. anthracis isolates. Primers were designed to sequences flanking SNPs and indels and were used to amplify from a range of isolates. VNTRs were analyzed by a direct sizing method (3). For reference, the sequence of the Florida isolate is shown under “Florida SEQ.” The sequence of the locus for each isolate is shown. Entries not marked by an asterisk were not tested because the plasmid was missing from that strain. NS, no sequence of sufficient quality from the reaction available. Δ indicates that an indel was deleted from the sequence. The values in the VNTR columns are numbers of units. For VX2-1, the 3.75 copy number refers to an additional copy containing only six of the eight bases. Isolates A to D, 1925 Iowa cow, 2001 California cow, and 1997 Texas goat are described in the text.

View this table:

The largest difference is the 1.4-kbp IX2-7 indel, which may have been lost from the Ames pXO2 plasmid through recombination, as suggested by two flanking copies of a 143-bp direct repeat (17). Both IX1-1 and IX1-2 are also flanked by much shorter (9 to 10 bp) direct repeats. Among the VNTRs, VX2-1 has a more complex structure. This repeat has one extra copy in the Pasteur strain, but both strains contain another three tandem copies of the repeat approximately 100-bp downstream from the first set. The polymorphic variation occurs only in the first set of copies. VX2-1 and VX2-4 both contain an additional, inverted copy of the repeat adjacent to the tandem direct repeats. VX1-4 and VX2-5, the short TTA and AT repeat polymorphisms, were known previously (3); the remaining nine VNTRs and indels are novel.

SNP PS-33 from pXO2 occurs within a 42-bp repeat that occurs five times, in four variations. If we label these variants A to D, then the order of the repeats is ABCBD in the Florida isolate and ABCDD in the Sterne strain, with the SNP occurring in the fourth copy of the repeat.

In addition to the SNPs and VNTRs detected in this analysis, we identified two large inversions in pXO1 of the Florida isolate in relation to the previously sequenced Sterne strain. The largest (44.8 kbp) occurs between coordinates 117,178 and 162,008 (using the Sterne strain coordinates). It is flanked by inverted copies of an IS1627 sequence and is centered on the pXO1 “pathogenicity island” (11), which includes the genes for the tripartite lethal factor toxin (18). Inversion of the pathogenicity island has been described (19); our sequence coverage data and polymerase chain reaction (PCR) amplification across the junctions show that the Florida isolate contains a mixture of both orientations.

The second pXO1 inversion event occurs between coordinates 43,233 and 48,988 (using the Sterne strain coordinates). This 5755-bp region, flanked by two perfect 929-bp inverted repeats of degenerate transposase genes, contains genes pXO1-35 through pXO1-40 (8), all of which encode proteins of unknown function. PCR reactions across the junctions of this region indicate that the Florida isolate also contains plasmids with both orientations of this region. The inversions detected in this comparative analysis provide another marker for genotyping anthrax strains. In both cases, inversions appear to have been the result of recombination between nearly identical, more than 900-bp inverted segments of transposase genes.

Although the differences between the Porton and Florida isolates are relatively few, many of the SNPs in Table 1 have a potential phenotypic effect. These changes are possibly a result of relaxed selection ofB. anthracis in the laboratory environment. CI-1 and CS-2 result in frameshifts in ABC (ATP-binding cassette) transport genes in the Florida isolate. CS-3 results in nonsynonymous nucleotide change within a conserved domain of a histidine kinase sensor protein. SNP CS-1 causes a histidine (Porton) to tryptophan (Florida) residue change within a conserved “TGS” domain of the predicted guanine pyrophosphokinase (RelA) protein. This gene product plays a key role in response to environmental stress in bacteria (20). Another nonsynonymous mutation (CS-8) within the Porton strain is found in the cell differentiation regulator spo0A. Each of the eight plasmid VNTRs (Tables 2 and 3) occurs in an intragenic region. The VX1-4 mutation may be a major factor in pathogenicity because it occurs immediately upstream of the lethal toxin precursor gene.

To evaluate the utility of the novel polymorphisms as genetic markers for strain discrimination, we screened genomic DNA from a set ofB. anthracis isolates (Table 4) (21). In addition to the Florida and Porton isolates used as controls, the set included five Ames isolates that were indistinguishable from each other on the basis of existing genotype information (3). Four are laboratory isolates obtained before the October 2001 attack that have been designated A to D. The fifth is a 1997 isolate from a goat in Texas that is the only other known natural occurrence of the Ames genotype. Two non-Ames B. anthracis isolates that caused North American bovine anthrax infections were included for comparison: one was isolated in Iowa in 1925, the other in California in 2001. The genotypes defined by the polymorphisms allowed us to divide these nine isolates into six distinct groups [Table 4 and Fig. 2; the Porton strain was cured of plasmids and is not represented in Fig. 2]. Data from typing of DNA from the Porton and Florida isolates validated the differences identified from genomic sequence comparisons (Table 4).

Of the plasmid markers tested, nine SNPs and six VNTRs or indels varied between B. anthracis isolates (Table 4). All but three (PS-1, PS-32, and IX1-1) were found in the pXO2 plasmid, suggesting that this molecule is the source of much short-term variation in the B. anthracis genome. The chromosomal polymorphisms were identical in sequence to the Florida isolate in all Ames isolates tested. The fact that the chromosomal SNPs were unique to the Porton isolate is evidence that these mutations occurred after transfer of the strain to Porton Down. The B. anthracisstrain isolated from a cow in 2001 has 15 differences from the Florida isolate (Table 4) with no recent genetic ties to the Florida or Porton isolates. The 1925 isolate (1925 cow, Table 4) has 12 differences from the Florida isolate, many of them matching the 2001 cow genotype. Within the Ames group, the Texas goat isolate varied at IX1-1 (retaining the 85-bp insert), VX2-1, VX2-3, and PS-52 by a transition mutation. Strain A formed a fifth distinct group, missing pXO1 and containing two extra adenines at VX2-3. Strain D differed only by single nucleotide at VX2-3 (36 adenines compared with 35 in Florida isolate) and cannot be considered a distinct group from B, C, and the controls. Figure 2 shows the relationships between five groups of isolates as a phylogenetic tree. The Porton strain was classified as a sixth group (Table 4) on the basis of its unique chromosomal SNPs. This comparison does not permit any conclusions about mutation rates. The calculation of mutation rates would require specific evidence about the number of generations separating the two samples, which is unknown, particularly with regard to the Florida isolate.

Comparative genomics approaches that are based on whole genome sequence data provide the most comprehensive picture of species diversity and lack the bias of methods such as multilocus sequence typing that sample only a small part of the genome (22). Genome-wide comparisons identify polymorphic loci that can be used for cost-efficient PCR-based sampling of variation within bacterial populations (3). Of the VNTRs identified in this study, only one with any variation among the strains tested, VX2-5, was known before comparative genomic sequencing. VX2-5 can distinguish the two most distantly related isolates, the 1925 Iowa cow and the 2001 California cow, from the Florida strain but not from each other (Table 4). In addition, comparative genomic sequencing can reveal novel, lineage-specific markers that enhance the resolution of genotyping schemes. In this study, we found 11 DNA sequence differences between the chromosomes of the Florida and Porton isolates of B. anthracis (Table 1). These isolates were indistinguishable by their VNTR profiles. Our results indicate that polymorphisms can appear after relatively few generations, because these strains were derived from a common ancestor in the mid-1980s and have been kept frozen.

Comparison of completed genome sequences also allows for elucidation of differences in gene content that may underlie observed phenotypic differences in infectivity and virulence between closely related strains or isolates. It may also facilitate the identification of genes that have been deliberately introduced into potential biowarfare agents. Intergenome comparisons are more efficient when at least one of the genomes is completely finished, rather than when the representations are the multiple, unordered assemblies typical of a “draft” project. When looking for differences between two very closely related strains, as we did here, both genomes must be sequenced to a high level of coverage in order to obtain accurate data. With the rapidly diminishing costs of DNA sequencing (currently at $0.03 per base pair in total costs to generate 8-fold coverage and$0.10 per base pair for a fully finished and annotated sequence), the increasing breadth of databases, and the likelihood of future advances in technology, this approach will become increasingly important.

To evaluate whether SNPs or other small genetic differences are genuine, the underlying quality values need to be available for every sequence that was used in an assembly. We are releasing on the TIGR Web site (www.tigr.org) four sets of B. anthracis Florida isolate genome data: (i) the set of all assembled contigs of all sizes, (ii) the set of individual sequences that were used to create those contigs, (iii) a file containing the quality values for each individual sequence, and (iv) a set of 1000-bp regions centered on each SNP that will allow investigators to identify and test these markers on any other anthrax strain. In addition, the genomic scaffold representing the main chromosome of the Florida isolate has been deposited with GenBank (accession number AAAC01000000), and the complete plasmid sequences are available as accession numbers AE011190 and AE011191. Shotgun electropherogram traces from the chromosome and plasmids were deposited in the National Center for Biotechnology Information Trace Archive (trace IDs 107051265 to 107116771).

Genome-based analysis will provide a powerful new tool for investigation of unexpected outbreaks of infectious disease, whether these represent biological warfare attacks, emerging agents, or more familiar pathogens. Because even highly monomorphic species such asB. anthracis are known to exhibit phenotypic variation in infection potential, genomic comparisons can be used to investigate the genetic basis of pathogenesis. To lay the groundwork for future investigations, databases derived from genome-based surveys of natural variation in all major pathogens, not just potential biological warfare agents, should be constructed.

• * To whom correspondence should be addressed. E-mail: anthrax{at}tigr.org

View Abstract