Technical Comments

Comment on "Computational Improvements Reveal Great Bacterial Diversity and High Metal Toxicity in Soil"


Science  18 Aug 2006:
Vol. 313, Issue 5789, p. 918
DOI: 10.1126/science.1126593


Gans et al. (Reports, 26 August 2005, p. 1387) provided an estimate of soil bacterial species richness two orders of magnitude greater than previously reported values. Using a re-derived mathematical model, we reanalyzed the data and found that the statistical error exceeds the estimate by a factor of 26. We also note two potential sources of error in the experimental data collection and measurement procedures.

Using previously published DNA reassociation kinetics (Cot curve) data (1), Gans et al. (2) estimated bacterial species richness (one aspect of diversity) in a soil sample to be 8.3 × 10⁶. However, the authors' calculation of error for this estimate is unrealistically low. We re-derived the mathematical model of reassociation kinetics from first principles [arriving at a model similar to Gans et al. (2)] and applied standard nonlinear regression analysis to fit the model to the original data. We obtained a similar richness estimate (7.4 × 10⁶), but a formal statistical error 26 times as large as the estimate itself. Furthermore, we note potential sources of error in the original experimental and measurement protocol that may contribute to the unreliability of the richness estimate.

Let us assume a DNA extract containing sequences from S ≥ 1 bacterial species, with no interspecific sequence similarity and no intraspecific repeat sequences (both false). Applying certain simplifying assumptions (3, 4), we obtain a mathematical model for the Cot (observed reassociation) data points (uₗ, yₗ), l = 1, ..., n: [Eq. 1], where W is a random variable representing the species' proportions and reassociation rates, and γ = 0.45 and kᵣ = 5.19 are taken to be constants (2). Nonlinear regression then produces parameter estimates, standard errors (SEs), and goodness-of-fit tests (5, 6).

We fitted Eq. 1 to the noncontaminated soil data provided by Sandaa (2). We tested 10 distributions for W (7) and found that only one yielded a convincing fit to the observed points. This had the form P(W = λᵢ) = pᵢ, λᵢ > 0, i = 1, 2, 3; p₁ + p₂ + p₃ = 1, with p₁ = 1.30 × 10⁻⁴, λ₁ = 2.60 × 10³, p₂ = 2.40 × 10⁻⁶, λ₂ = 7.20 × 10⁴, p₃ = 9.998676 × 10⁻¹, and λ₃ = 4.892648 × 10⁻¹ (8). The fit was excellent (Fig. 1), with a sum of squared errors (SSE) of 1.05 × 10⁻³. The estimate of S was 7.4 × 10⁶, but with a SE of 192.1 × 10⁶.
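The reported numbers can be sanity-checked with a few lines of arithmetic. The sketch below uses only the values quoted above; the variable names are illustrative, and the mean of W is an extra check that the text itself does not report.

```python
# Fitted three-point-mass distribution for W, as quoted in the text.
p = [1.30e-4, 2.40e-6, 9.998676e-1]    # weights p1, p2, p3
lam = [2.60e3, 7.20e4, 4.892648e-1]    # support points lambda1..lambda3

# A valid probability distribution needs weights that sum to 1.
weight_sum = sum(p)

# Mean of W under the fitted distribution (extra check, not from the text).
mean_w = sum(pi * li for pi, li in zip(p, lam))

# Reported richness estimate and its standard error.
s_hat = 7.4e6
se_s = 192.1e6
ratio = se_s / s_hat

print(f"sum of weights = {weight_sum:.7f}")
print(f"mean of W      = {mean_w:.4f}")
print(f"SE / estimate  = {ratio:.1f}")
```

The ratio comes out close to 26, which matches the factor quoted in the abstract.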

Fig. 1.

Cot curves fitted to noncontaminated soil data (2) by nonlinear regression. (A) Mixture-of-three-point-masses species-abundance model, used in equation 11 in (15), with parameters estimated by nonlinear least-squares regression, yields function shown by solid line; data points are overlaid. (B) Fitted curve extended to complete (100%) reassociation. Nonconstant curvature is due to the mixture of Cot curves with varying reassociation rates. Extension of estimated curves far beyond available data is statistically inadvisable.
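The caption's remark about nonconstant curvature can be illustrated with a small sketch. It is not Eq. 1 of this Comment: it assumes a generic second-order form, (1 + rate × Cot)^(−γ), for each component, and the weights and rates are hypothetical values chosen only to show how mixing components with very different rates stretches the curve over many decades of Cot.

```python
GAMMA = 0.45  # exponent quoted in the text


def component_curve(u, rate, gamma=GAMMA):
    """Assumed second-order form: fraction not yet reassociated at Cot value u."""
    return (1.0 + rate * u) ** (-gamma)


def mixture_curve(u, weights, rates):
    """Weighted average of component curves (weights should sum to 1)."""
    return sum(w * component_curve(u, r) for w, r in zip(weights, rates))


# Hypothetical mixture: three components whose rates differ by orders of magnitude.
weights = [0.6, 0.3, 0.1]
rates = [1e2, 1e0, 1e-2]

for exponent in range(-4, 5):
    u = 10.0 ** exponent
    remaining = mixture_curve(u, weights, rates)
    print(f"Cot = 1e{exponent:+d}  fraction remaining = {remaining:.3f}")
```

Each component falls off at its own Cot scale, so the mixture has no single characteristic curvature. Under these assumptions it also shows why the tail of the curve, where the slowest components dominate, is poorly determined by data that stop early.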

The model used by Gans et al. (2) can be rewritten as [Eq. 2], where μ > 0 and [inline expression] is assumed known. The authors applied a minimum χ² procedure that does not yield SEs for the parameter estimates. Their report lacked certain details that prevented us from replicating their results exactly, but our model and fit are comparable. Our richness estimate is close to theirs, but the statistical SE is far higher than their informal calculation of a factor of, at most, 8.2. An SE of this magnitude makes intercommunity comparisons (e.g., richness in pristine versus polluted environments) statistically meaningless, because the range of possible values of the (unknown) richness of this community is virtually unbounded.

These results are sensitive to model assumptions to an unknown degree. For example, if γ is estimated from the data, then a simpler model fits very well with SSE 1.22 × 10⁻³, but γ and S are estimated as 0.1095 (SE, 0.003) and 629 (SE, 120), respectively. Until such robustness issues are clarified, any results must be regarded as contingent on numerous questionable assumptions.
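To make the contrast between the two fits concrete, the sketch below turns each reported estimate and SE into a rough interval. The normal-approximation (Wald) interval is an assumption made here for illustration; the Comment reports only estimates and SEs, and such an interval is crude when the SE dwarfs the estimate. The function name and the truncation at S = 1 are likewise illustrative.

```python
def wald_interval(estimate, se, z=1.96, lower_bound=1.0):
    """Approximate 95% interval, truncated below at the smallest admissible richness."""
    low = max(lower_bound, estimate - z * se)
    high = estimate + z * se
    return low, high


# (estimate of S, SE) for the two fits reported in the text.
fits = {
    "gamma fixed at 0.45": (7.4e6, 192.1e6),
    "gamma estimated": (629.0, 120.0),
}

for label, (s_hat, se) in fits.items():
    low, high = wald_interval(s_hat, se)
    print(f"{label}: S = {s_hat:.3g}, SE/S = {se / s_hat:.2f}, "
          f"interval = ({low:.3g}, {high:.3g})")
```

With γ fixed, the interval runs from the lower bound of 1 to several hundred million species, so it excludes almost nothing. With γ estimated, the interval is narrow but sits four orders of magnitude lower, which is the sensitivity to assumptions described above.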

We also noted certain debatable aspects of the original experimental protocol and measurement procedure. First, Gans et al. (2) assumed that the DNA analyzed in the Cot analysis of Sandaa et al. (1) was bacterial in nature. We tested the bacterial extraction technique described (1) and observed considerable contamination of the bacterial pellet with eukaryotic cells/tissues. The presence of eukaryotic genomes in the DNA extract would introduce substantial error into estimates of bacterial richness using reassociation kinetics data. Second, DNA reassociation was estimated by measuring changes in hypochromicity (Δh), a practice that can greatly underestimate the reassociation of repetitive sequences in complex DNA mixtures (9, 10) (Fig. 2). A population of soil bacteria may be dominated by a few species (11, 12) whose sequences would effectively reassociate like eukaryotic repetitive elements; in fact, our estimated abundance distribution shows just this structure. In this case, normal variation in homologous DNA sequences would result in formation of duplexes with partial strand mismatch, which is believed to underlie the reduced Δh of renatured eukaryotic repeats (9). Extrapolation of partial Δh Cot curves to “completion,” as was done by Gans et al. (2), amplifies these errors.

Fig. 2.

Equating Δh with DNA reassociation in complex samples can produce misleading results. (A) DNA extracted from a soil sample represents numerous bacterial species/strains as shown, with 90% of the DNA contributed by several strains of Species G. For simplicity, assume that different species share no notable sequence homology but that DNA from strains of the same species can form duplexes during reassociation (with occasional base mismatches due to modest sequence divergence). (B) A hydroxyapatite chromatography–based Cot curve of the soil DNA extract would show rapid reassociation of Species G DNA (red portion of curve) compared with DNA of other species (blue portion). Although the Species G genome may contain little repetitive sequence, its relative abundance in the DNA extract would cause it to reassociate at least 100 times as fast as DNA of any other species. The gap in relative sequence redundancy between Species G and DNA sequences of other species would result in a flat region of the curve where there would be no notable DNA reassociation (black portion). (C) Cot curve prepared from the same soil extract, in which Δh data are used to estimate DNA reassociation. For simplicity, assume that Δh from complete native double-stranded DNA to complete denaturation accounts for a 27% change in absorbance (9) and that repetitive DNA (here, Species G DNA duplexes) exhibits half the Δh of native DNA, as is typical of eukaryotic repeats (9, 10). As a result of its relatively low hypochromicity, reassociation of Species G DNA will occupy only 12% of the abscissa (0.27 × 0.5 × 0.9 = 0.12). At high Cot values (e.g., 10⁴ M·s), reassociation of soil extract DNA will appear to be far from completion (i.e., 100% hypochromicity), when in reality it may have finished reassociating. (D) Reassociation of Species G DNA at relatively low Cot coupled with its reduced Δh may cause some researchers to discount its renaturation as a “collapse” hypochromicity effect; see (16) for definition. Consequently, they may entirely omit it from their Cot curve, as shown. Extrapolation of the curve to 100% hypochromicity (dotted blue line) would amplify the error.
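The arithmetic in panel (C) can be sketched as follows. The first product reproduces the caption's figure of about 0.12. The second step goes beyond the caption: it assumes that the remaining 10% of the DNA recovers the full Δh on reassociation, and asks how complete the reaction would then appear once everything has in fact reassociated. Variable names are illustrative.

```python
full_delta_h = 0.27      # absorbance change, native to fully denatured (caption)
repeat_dh_factor = 0.5   # Species G duplexes show half the native delta-h (caption)
species_g_share = 0.9    # fraction of extract DNA from Species G (caption)

# Absorbance change attributable to Species G once it has fully reassociated.
species_g_signal = full_delta_h * repeat_dh_factor * species_g_share

# ASSUMPTION (not in the caption): the other 10% of DNA regains the full delta-h.
other_signal = full_delta_h * (1.0 - species_g_share)

# Apparent completion if delta-h is equated with reassociation.
apparent_completion = (species_g_signal + other_signal) / full_delta_h

print(f"Species G signal    = {species_g_signal:.4f}")
print(f"apparent completion = {apparent_completion:.0%}")
```

Under these assumptions the fully reassociated extract would read as only about 55% complete, which is the sense in which the Δh curve appears far from completion when the reaction may already have finished.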

Current soil bacterial species richness estimates range from < 100 (13) to almost 10⁷ (2). Many of these estimates may be correct, although imprecise: When SE ≈ 2 × 10⁸, an estimate may assume almost any value and remain correct, although uninformative. Informative estimation of species richness by DNA reassociation kinetics will require more precise parameter estimation, a more realistic physical model (14), and analysis of sensitivity to assumptions and constants.

Supporting Online Material

Materials and Methods

SOM Text

Fig. S1


References and Notes
