Technical Comments

Patterns of Genome Organization in Bacteria

Science  20 Mar 1998:
Vol. 279, Issue 5358, pp. 1827
DOI: 10.1126/science.279.5358.1827a

Frederick R. Blattner et al. (1), when describing the complete sequence of the Escherichia coli chromosome, correlated an overall DNA property, “GC skew” [the quantity (G − C)/(G + C) averaged over a sliding window of arbitrary length 10 kb], with the direction of DNA replication. GC skew for replichore 1 (rightwards from the origin on the presented strand) oscillates considerably, yet remains almost entirely positive for its entire length, while replichore 2 shows the opposite behavior. Kunst et al. (2) did not present such an analysis for the sequence of the Bacillus subtilis chromosome, but did note that the GC skew changes sign at the origin, an observation made earlier by Lobry (3), who documented it for the replication origins of E. coli, Haemophilus influenzae, B. subtilis, and Mycoplasma genitalium and for the terminus of H. influenzae.
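As a rough illustration of the GC skew quantity described above, the following Python sketch computes (G − C)/(G + C) for successive windows of a sequence. The function name `gc_skew`, the `step` parameter, and the choice to report 0.0 for windows with no G or C are assumptions made for this example, not details taken from the cited analyses.

```python
def gc_skew(seq, window=10_000, step=None):
    """Return (window_start, skew) pairs, with skew = (G - C) / (G + C).

    `window` defaults to the 10 kb length quoted in the text; `step`
    (an assumption of this sketch) defaults to non-overlapping windows.
    """
    seq = seq.upper()
    step = step or window
    result = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # Windows without any G or C are reported as 0.0 (assumed convention).
        skew = (g - c) / (g + c) if (g + c) else 0.0
        result.append((start, skew))
    return result
```

Passing a `step` smaller than `window` gives overlapping (sliding) windows; a sign change in the resulting series is the feature that the cited authors associate with the replication origin.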

In contrast to GC skew, which is a derivative function of the base composition along a DNA sequence, we have computed three integral functions of the sequences of nine complete prokaryotic genomes (Table 1). Composite graphs for three of these genomes are presented (Fig. 1), and the remainder are available on a linked website. We define “purine excess” as the sum of all purines minus the sum of all pyrimidines encountered in a walk along the sequence up to the point plotted (4). “Keto excess” is the same function calculated for the keto bases (GT) minus the amino bases (AC), and “coding-strand excess” is the sum of all nucleotides encountered along the sequence that are in coding sequences, minus those that have complements (on the opposite strand) that are in coding sequences; bases in non-coding regions add zero to this sum. Graphs of these functions reveal nonrandom patterns, the most striking of which is the clear correlation between purine excess and the origins and termini of DNA replication (Fig. 1). In every case where independent information is available, the minimum in the purine-excess curve corresponds to the origin (Table 1). We suggest that this regularity may hold for most prokaryotic genomes. Conversely, the maxima of the purine-excess curves (Fig. 1) correlate strongly with known or suspected replication termini (5). Keto-excess curves reflect the same correlation, although for most genomes the minima and maxima (thus, predicted origins and termini) are not as sharply defined as for the purine-excess functions. Haemophilus influenzae represents a notable exception to this rule (compare the keto-excess curve in Fig. 1B).
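The three integral functions defined above can be sketched as running sums. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, ambiguous bases (such as N) are assumed to contribute zero, and genes are assumed to be supplied as (start, end, strand) tuples with 0-based, half-open coordinates, a format chosen here for convenience rather than taken from the source.

```python
from itertools import accumulate

# +1 for purines (A, G), -1 for pyrimidines (C, T)
PURINE_STEP = {"A": 1, "G": 1, "C": -1, "T": -1}
# +1 for keto bases (G, T), -1 for amino bases (A, C)
KETO_STEP = {"G": 1, "T": 1, "A": -1, "C": -1}


def cumulative_excess(seq, step_table):
    """Running sum of per-base steps; unknown symbols add zero (assumption)."""
    return list(accumulate(step_table.get(base, 0) for base in seq.upper()))


def purine_excess(seq):
    return cumulative_excess(seq, PURINE_STEP)


def keto_excess(seq):
    return cumulative_excess(seq, KETO_STEP)


def coding_strand_excess(length, genes):
    """Running sum of +1 for positions coding on the presented strand and
    -1 for positions whose complement is coding; non-coding positions add 0.

    `genes` is an iterable of (start, end, strand) with strand "+" or "-"
    (hypothetical input format for this sketch).
    """
    forward = [False] * length
    reverse = [False] * length
    for start, end, strand in genes:
        target = forward if strand == "+" else reverse
        for i in range(start, end):
            target[i] = True
    return list(accumulate(int(f) - int(r) for f, r in zip(forward, reverse)))
```

For example, `purine_excess("AGCT")` yields `[1, 2, 1, 0]`: the curve rises over the two purines and falls back over the two pyrimidines.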

Table 1

Completely-sequenced bacterial genomes analyzed for base and coding asymmetries and their origins and termini of replication.

Figure 1

Purine excess (blue curves), keto excess (red curves), and coding-strand excess (green curves) for the complete genomes of (A) E. coli (1), (B) H. influenzae (16), and (C) M. jannaschii (17). Known origins and termini of replication are marked. Abscissa represents the genomic sequence position from the beginning to the end of the genome; left ordinate represents the count of purine and keto excesses; right ordinate represents the Watson coding-strand excess count at a given position. Green histograms across the bottom of each graph display the correlation coefficients between purine excess and coding-strand excess for 25 kb windows. Graphs of six additional genomes (Table 1) can be viewed on the Web at

Other genome features stand out in these graphs. The relatively smooth, featureless curve for E. coli contrasts with the much rougher patterns displayed by H. influenzae and Synechocystis PCC6803 (see linked website for data). This likely reflects a greater tendency of the latter organisms to take up foreign DNA and integrate it into the chromosome (6, 7), a point supported by the correlation of the density of DNA-uptake sequences in H. influenzae (6) with many of the inflection points of the purine-excess curve (8). Likewise, the sites of μ prophage integration in H. influenzae cluster most densely around the pronounced minimum in the purine-excess curve adjacent to the terminus (Fig. 1B). The larger megaplasmid (pNGR234a) of Rhizobium sp. NGR234 also displays similar behavior (8), in keeping with its recognized characteristics as a “transposon trap” (9).

Examination of the relationship between base-composition and coding asymmetries at the whole-genome level shows close parallels between coding-strand and purine excess for seven out of nine genomes. E. coli shows typical behavior (Fig. 1A). Haemophilus influenzae and Synechocystis display much weaker correlations on this scale. At a finer level of detail, there are substantial correlations between these functions for all the genomes we studied, but the results for the two archaebacteria, M. jannaschii and M. thermoautotrophicum, are particularly striking (Fig. 1C), showing strong correspondence between coding-strand and purine excess.
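One way to examine the finer-scale relationship between two excess curves is a correlation coefficient computed window by window, in the spirit of the histograms described in the Fig. 1 caption. The sketch below is an assumption-laden illustration: the function names are hypothetical, the Pearson coefficient is assumed as the correlation measure, windows are non-overlapping, and a window in which either curve is constant is reported as `None`.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # a constant curve has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def windowed_correlation(curve_a, curve_b, window=25_000):
    """Return (window_start, r) for consecutive non-overlapping windows.

    The 25 kb default mirrors the window size mentioned in the figure
    caption; non-overlapping windows are an assumption of this sketch.
    """
    if len(curve_a) != len(curve_b):
        raise ValueError("curves must have the same length")
    result = []
    for start in range(0, len(curve_a) - window + 1, window):
        end = start + window
        result.append((start, pearson(curve_a[start:end], curve_b[start:end])))
    return result
```

The inputs would typically be a purine-excess curve and a coding-strand-excess curve of the same genome, indexed by sequence position.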

What forces might give rise to the long-range patterns of strand asymmetry in bacterial genomes? There is a prominent correlation between purine excess and replication direction, which suggests as an explanation asymmetrical errors in DNA synthesis. In the absence of transpositions and insertions, a bias favors accumulation of purines in the leading strand. However, this contradicts expectations that lagging strand synthesis should be more error-prone (10), and thus that most purine substitutions (the principal cause of transversions) should occur there. Francino and Ochman (11) have argued, on the other hand, that transcriptional effects can account for DNA strand asymmetry because transcription-coupled repair will remove the most frequent types of DNA damage (deaminated cytosines and pyrimidine-dimers), thereby reducing harmful mutations. This only occurs on the transcribed (that is, template) strand, which therefore will become pyrimidine-rich. In addition, the template strand is significantly protected against DNA damage during transcription, whereas the coding strand is exposed. Under this model, evolutionary selection should increase the less mutationally vulnerable purine content of the coding strand.

Mycoplasma genitalium conforms to the predictions of the transcription-coupled repair model particularly well: in replichore 1, 85% of the open reading frames (ORFs) correspond to the presented (purine-rich) strand up to the putative terminus (maximum in the purine-excess curve). For the other replichore, 77% of the ORFs occur in the complementary strand. In E. coli, strand preference is less pronounced: only 55% of the genes are aligned with the replication direction (1). However, Francino has analyzed the codon adaptation index (CAI), a measure strongly associated with the extent of gene expression in E. coli, and finds that 74% of the genes with CAI ≥ 0.5 and 84% of those with CAI ≥ 0.6 are situated on the leading strand (11), that is, with the direction of transcription the same as replication (12). In addition to favoring transcriptional repair, a major advantage to this arrangement is that head-on collisions between replication and transcription complexes will be reduced (13).
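The strand-preference percentages quoted above amount to counting genes whose transcription direction matches the direction of replication. A minimal Python sketch of such a count follows. It rests on several assumptions not stated in the source: the genome is circular, replichore 1 runs rightwards from the origin to the terminus on the presented strand, each gene is assigned to a replichore by its midpoint, and genes are given as hypothetical (start, end, strand) tuples with 0-based coordinates.

```python
def fraction_co_oriented(genes, origin, terminus, genome_length):
    """Fraction of genes transcribed in the same direction as replication.

    A gene counts as co-oriented if it lies in replichore 1 (origin ->
    terminus, rightwards) on the "+" strand, or in replichore 2 on the
    "-" strand. Replichore membership uses the gene midpoint (assumption).
    """
    replichore1_span = (terminus - origin) % genome_length
    co_oriented = 0
    total = 0
    for start, end, strand in genes:
        midpoint = (start + end) // 2
        # Distance rightwards from the origin, wrapping around the circle.
        offset = (midpoint - origin) % genome_length
        in_replichore1 = offset < replichore1_span
        if (in_replichore1 and strand == "+") or (not in_replichore1 and strand == "-"):
            co_oriented += 1
        total += 1
    return co_oriented / total if total else 0.0
```

Restricting `genes` to a subset, for instance those above some CAI threshold, would give the kind of stratified fractions discussed in the paragraph above.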

Functions like those described here promise to be revealing tools for whole-genome analysis (4). For example, in the absence of any other information, the global minimum of the purine excess locates the probable origin of replication, and its maximum is the likely terminus for prokaryotic genomes. Similar regularities may emerge from the impending deluge of eukaryotic DNA sequences. We have already shown that the patterns of purine-excess plots correlate well with phylogenetic position for mitochondrial DNAs (14), and graphs of coding-strand excess in the Saccharomyces cerevisiae genome tend to match the purine-excess curves (15).
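The origin and terminus prediction described above can be sketched directly: compute the purine-excess curve and report the positions of its global minimum and maximum. The function name and the tie-breaking rule (first occurrence wins) are assumptions of this example, and the result is only a candidate location, not a confirmed origin or terminus.

```python
from itertools import accumulate


def predict_origin_terminus(seq):
    """Return (origin_index, terminus_index) as the 0-based positions of the
    global minimum and maximum of the purine-excess curve.

    Ties are resolved in favor of the first occurrence (assumption).
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    step = {"A": 1, "G": 1, "C": -1, "T": -1}
    # Cumulative purines minus pyrimidines; unknown symbols add zero.
    curve = list(accumulate(step.get(base, 0) for base in seq.upper()))
    positions = range(len(curve))
    origin = min(positions, key=curve.__getitem__)
    terminus = max(positions, key=curve.__getitem__)
    return origin, terminus
```

On a toy sequence such as `"CCCC" + "A" * 8 + "CCCC"`, the curve falls for four bases, rises for eight, and falls again, so the sketch reports the minimum at index 3 and the maximum at index 11.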

