Technical Comments

Genome Arithmetic


Science  25 Sep 1998:
Vol. 281, Issue 5385, pp. 1923
DOI: 10.1126/science.281.5385.1923a

Plotting integral purine (A+G versus T+C), keto (G+T versus A+C), and coding-strand excess for nine genomes, James M. Freeman et al. (1) found global peaks in many cases near the replication origin (ori) and terminus (ter) sites. They mention earlier findings of similar strand asymmetries with GC skew (2, 3), calculated as (G-C)/(G+C) in a window sliding along a sequence.
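As a rough illustration of the windowed GC skew mentioned above, the following Python sketch computes (G-C)/(G+C) per window and a running sum of those values. The function names, the default non-overlapping step, and the choice to treat undefined windows as contributing zero to the running sum are assumptions made for this example, not details taken from the cited papers.

```python
def gc_skew_windows(seq, window, step=None):
    """Return (G-C)/(G+C) for windows of `seq`; None where G+C is zero."""
    seq = seq.upper()
    step = step or window  # assumption: non-overlapping windows by default
    skews = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        # The ratio is undefined when the window contains no G or C.
        skews.append((g - c) / (g + c) if g + c else None)
    return skews


def cumulative(values):
    """Running sum of window skews; undefined windows add nothing."""
    total = 0.0
    out = []
    for v in values:
        if v is not None:
            total += v
        out.append(total)
    return out


skews = gc_skew_windows("GGGCAATT", 4)
curve = cumulative(skews)
```

In this toy input the second window holds only A and T, so its skew is undefined; real genomes raise the same issue whenever the window is short relative to AT-only stretches.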

In numerical integration with very small windows, purine excess is practically equivalent to the sum of the GC and AT skews, and keto excess to their difference. This arithmetic is important, as seen from the differences between the cumulative excess (1) and cumulative skew plots (4). DNA strand properties with respect to replication and repair switch at ori/ter, and the leading strand has been shown to contain more G than C in 12 out of 14 microbial genomes (4). This is not the case with the leading strand AT skew: A is less than T in six genomes, for example, Escherichia coli, but A is greater than T in others, for example, Bacillus subtilis. This variation, and the fact that global switches in AT skew often occur (six cases) far away from ori/ter (4), may negatively affect identification of these sites with the use of the aggregate values of purine and keto excess (1).
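The arithmetic can be checked base by base. The sketch below assumes a window of a single base, with the GC and AT terms reduced to plain counts (G-C and A-T) rather than normalized ratios; `step_values` is a hypothetical helper name used only for this example.

```python
def step_values(base):
    """Per-base contributions to the GC, AT, purine, and keto sums."""
    gc = {"G": 1, "C": -1}.get(base, 0)
    at = {"A": 1, "T": -1}.get(base, 0)
    purine = 1 if base in ("A", "G") else -1  # A+G versus T+C
    keto = 1 if base in ("G", "T") else -1    # G+T versus A+C
    return gc, at, purine, keto


for base in "ACGT":
    gc, at, purine, keto = step_values(base)
    assert purine == gc + at  # purine excess = GC term + AT term
    assert keto == gc - at    # keto excess = GC term - AT term
```

Because both identities hold for each base, they also hold for any cumulative sum along a sequence, which is why opposite-signed AT and GC trends can partly cancel in the purine or keto curves.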

Evolutionary forces seem to affect AT and GC skews differently. Pertinent to transcription and selection, coding-strand excess correlates strongly with AT skew in the first codon position (5), for example, in Haemophilus influenzae, where there is only a weak correlation with purine excess (1). GC skew may be linked with replication and repair (2, 3), because it changes linearly with the time the template spends in a single-stranded state during replication of vertebrate mitochondria and viruses (4).

The “rough” plot patterns observed for some species (1) have been explained by uptake of foreign DNA or prophage integration (a common interpretation of A+T–rich islands in A+T content plots). In addition, I would like to suggest another explanation: Some plot distortions correspond to recent inversions, as has been demonstrated (4) for two strains of E. coli (6).


Table 1

Comparison of bacterial replication-origin predictions by different cumulative strand-asymmetry analyses.*

Table 2

Comparison of bacterial replication-terminus predictions by different cumulative strand-asymmetry analyses.


Response: In order to reveal biologically relevant features, complete genome sequence data require suitable methods of graphical analysis and display. Our recent technical comment (1) described the use of cumulative strand asymmetries of purines, keto bases, and coding sequences to reveal the correlations between these functions and replication origins-termini and the directions of gene transcription in nine complete bacterial genomes. We also suggested that the plots indicate positions of DNA segments recently acquired by phage-transposon integration or uptake of transforming DNA.

Grigoriev (2) has used a similar approach by adapting the GC-skew method of Lobry (3) to calculate cumulative AT- and GC-skew curves that also show these features (4). He points out that discontinuities in cumulative base-asymmetry curves may correlate with sequence inversions, and correctly observes that amalgamating the A and G (or G and T) bases to create purine- (or keto-) excess curves may adversely affect their utility for locating origins and termini of replication because the strand asymmetry patterns for A track the replication direction less consistently than the patterns for G.

Window size (taken by Grigoriev as an arbitrary parameter) limits the resolution of cumulative base-skew curves, and its minimum in turn depends on the details of the sequence: any window that can be filled by an exclusively AT-containing segment of the sequence will produce a (mathematically undefined) singularity in the GC-skew plot when such a segment is encountered (and vice versa for the AT skew). Grigoriev does not furnish details of his window size beyond stating that it was always less than 0.5% of the genome length, which in E. coli, for example, equals 23 kb. In order to compare the different types of genome plot, we have optimized his approach by determining the minimum window size possible for each of the nine complete genomes described in our earlier comment (1) and then used these to compute cumulative GC- and AT-skew plots at the maximum possible resolution. For the three bacterial genomes for which the positions of DNA replication termini are known, we also computed two “single-base excess” curves (which have the same selectivity as the cumulative GC- and AT-skews without the singularity problem). To calculate the “A excess,” for example, we walked along the sequence and counted every A as +1, every T as −1, and G's and C's as 0.
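The single-base excess walk described above is simple enough to sketch directly. In the Python example below, `base_excess` follows the stated rule (+1, −1, 0), while `longest_run` illustrates one way the minimum usable window might be estimated, under the assumption that windows may start at any position, so the smallest singularity-free GC-skew window is one base longer than the longest AT-only stretch. Both function names and the toy sequence are hypothetical.

```python
def base_excess(seq, plus, minus):
    """Cumulative walk: +1 for `plus`, -1 for `minus`, 0 for other bases."""
    total = 0
    walk = []
    for base in seq.upper():
        if base == plus:
            total += 1
        elif base == minus:
            total -= 1
        walk.append(total)
    return walk


def longest_run(seq, alphabet):
    """Length of the longest stretch made only of bases in `alphabet`."""
    best = current = 0
    for base in seq.upper():
        current = current + 1 if base in alphabet else 0
        best = max(best, current)
    return best


seq = "AATTGACCAT"
a_excess = base_excess(seq, "A", "T")       # A counts +1, T counts -1
g_excess = base_excess(seq, "G", "C")       # G counts +1, C counts -1
min_gc_window = longest_run(seq, "AT") + 1  # assumption: any start offset
```

Because the walk never divides by a base count, it is defined at every position, which is the property the text contrasts with the windowed skews.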

The essential results of this exercise are summarized in Table 1 (for replication origins) and Table 2 (for termini). For each genome and each method, we have chosen the best match to the reported origin or terminus and calculated the deviation between the predicted and observed chromosomal feature (5). We conclude that (i) with one exception (H. pylori), the optimized cumulative skew method comes closer to pinpointing origins than does the cumulative two-base excess method, provided that the correct function is selected (alternate skew functions, which are not shown, do not have minima or maxima closer to the targets than the two-base excess functions listed); (ii) in one case (of three) the cumulative-skew method comes closer to predicting the correct terminus than the two-base excess method; (iii) the “one-base excess” method comes closest of all to the targets; it also works better than the cumulative skew method because it avoids the need to optimize windows and by definition works at single-base resolution; (iv) the best function to use (AG versus GT excess, GC versus AT skew, or A versus G excess) is not a priori clear and, in particular, does not correlate with overall GC content.
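The exact deviation calculation is given in a note (5) that is not reproduced here, so the following Python sketch is only one plausible reading: it assumes a circular chromosome, measures the shorter way around between a predicted and a reported position, and picks the candidate extremum with the smallest such distance. The function names and the numbers in the usage lines are illustrative, not data from the tables.

```python
def circular_deviation(predicted, observed, genome_length):
    """Shortest distance between two positions on a circular chromosome."""
    d = abs(predicted - observed) % genome_length
    return min(d, genome_length - d)


def best_match(candidates, observed, genome_length):
    """Return (position, deviation) for the candidate nearest `observed`."""
    return min(
        ((pos, circular_deviation(pos, observed, genome_length))
         for pos in candidates),
        key=lambda pair: pair[1],
    )


# Illustrative values only: two curve extrema on a 1000-base circle.
position, deviation = best_match([100, 600], observed=950, genome_length=1000)
```

Here the extremum at position 100 is chosen because it lies 150 bases from the reported site when measured across the coordinate origin, closer than the 350 bases separating the site from position 600.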

Whole-genome strand asymmetry analyses should prove useful for a number of purposes in addition to origin and terminus location. In particular, the striking correlations between coding strand selection and purine excess (1) suggest that this function may be helpful in open reading frame verification, and both the cumulative skew and cumulative excess plots reveal important locations of genome rearrangements. Qualitative and quantitative analysis of the “roughness” of these plots should aid in the understanding of the differing dynamics of genomes in different organisms as well as fundamental differences in the behavior of their replication, transcription, and repair machinery. This is a broad field and many other variations of these analytic methods can be developed, some of which may be highly revealing of significant genome features. Given the relative ease of creating such “pictures” of chromosomes, it may well be advisable to combine several methods to attack any specific problem.

