## Abstract

Gans *et al*. (Reports, 26 August 2005, p. 1387) provided an estimate of soil bacterial species richness two orders of magnitude greater than previously reported values. Using a re-derived mathematical model, we reanalyzed the data and found that the statistical error exceeds the estimate by a factor of 26. We also note two potential sources of error in the experimental data collection and measurement procedures.

Using previously published DNA reassociation kinetics (Cot curve) data (*1*), Gans *et al*. (*2*) estimated bacterial species richness (one aspect of diversity) in a soil sample to be 8.3 × 10^{6}. However, the authors' calculation of error for this estimate is unrealistically low. We re-derived the mathematical model of reassociation kinetics from first principles [arriving at a model similar to Gans *et al*. (*2*)] and applied standard nonlinear regression analysis to fit the model to the original data. We obtained a similar richness estimate (7.4 × 10^{6}), but a formal statistical error 26 times as large as the estimate itself. Furthermore, we note potential sources of error in the original experimental and measurement protocol that may contribute to the unreliability of the richness estimate.

Let us assume a DNA extract containing sequences from *S* ≥ 1 bacterial species, with no interspecific sequence similarity and no intraspecific repeat sequences (both false). Applying certain simplifying assumptions (*3*, *4*), we obtain a mathematical model for the Cot (observed reassociation) data points (*u*^{l}, *y*^{l}), *l* = 1...*n*: (1) where *W* is a random variable representing the species' proportions and reassociation rates, and γ = 0.45 and *k*_{r} = 5.19 are taken to be constants (*2*). Nonlinear regression then produces parameter estimates, standard errors (SEs), and goodness-of-fit tests (*5*, *6*).
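As a rough illustration of how nonlinear regression yields a parameter estimate together with a standard error, the sketch below fits a single rate constant by Gauss-Newton least squares. It does not implement Eq. 1: the model (ideal second-order kinetics for one species), the synthetic data points, and the names `model`, `fit_rate`, `us`, and `ys` are all assumptions made for illustration.

```python
import math


def model(u, k):
    # Hypothetical one-species model: fraction of DNA still single-stranded
    # at Cot value u under ideal second-order kinetics. This is NOT Eq. 1.
    return 1.0 / (1.0 + k * u)


def dmodel_dk(u, k):
    # Sensitivity of the model to the rate constant k.
    return -u / (1.0 + k * u) ** 2


def fit_rate(us, ys, k0=1.0, iters=50):
    # One-parameter Gauss-Newton iteration for least squares.
    k = k0
    for _ in range(iters):
        resid = [y - model(u, k) for u, y in zip(us, ys)]
        jac = [dmodel_dk(u, k) for u in us]
        k += sum(j * r for j, r in zip(jac, resid)) / sum(j * j for j in jac)
    resid = [y - model(u, k) for u, y in zip(us, ys)]
    jac = [dmodel_dk(u, k) for u in us]
    sse = sum(r * r for r in resid)
    # Residual variance with n - 1 degrees of freedom (one fitted parameter).
    s2 = sse / (len(us) - 1)
    se = math.sqrt(s2 / sum(j * j for j in jac))
    return k, se, sse


# Synthetic data: true k = 2.0 plus small fixed perturbations (assumed values).
us = [0.01, 0.1, 1.0, 10.0, 100.0]
noise = [0.01, -0.01, 0.005, -0.005, 0.0]
ys = [model(u, 2.0) + e for u, e in zip(us, noise)]

k_hat, se_k, sse = fit_rate(us, ys)
print(f"k = {k_hat:.3f}, SE = {se_k:.3f}, SSE = {sse:.2e}")
```

In this one-parameter case the SE is the square root of the residual variance divided by the summed squared sensitivities, so a parameter to which the fitted curve is insensitive over the observed range receives a large SE.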

We fitted Eq. 1 to the noncontaminated soil data provided by Sandaa (*2*). We tested 10 distributions for *W* (*7*) and found that only one yielded a convincing fit to the observed points. This had the form *P*(*W* = λ_{i}) = *p*_{i}, λ_{i} > 0, *i* = 1, 2, 3; *p*_{1} + *p*_{2} + *p*_{3} = 1, with *p*_{1} = 1.30 × 10^{–4}, λ_{1} = 2.60 × 10^{3}, *p*_{2} = 2.40 × 10^{–6}, λ_{2} = 7.20 × 10^{4}, *p*_{3} = 9.998676 × 10^{–1}, and λ_{3} = 4.892648 × 10^{–1} (*8*). The fit was excellent (Fig. 1), with a sum of squared errors (SSE) of 1.05 × 10^{–3}. The estimate of *S* was 7.4 × 10^{6}, but with an SE of 192.1 × 10^{6}.
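The reported values can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the paragraph above; the variable names are ours.

```python
# Fitted three-point distribution for W, values as quoted in the text.
p = [1.30e-4, 2.40e-6, 9.998676e-1]
lam = [2.60e3, 7.20e4, 4.892648e-1]

# Constraints stated for the distribution: probabilities sum to 1, lambdas > 0.
total = sum(p)
assert all(x > 0 for x in lam)

# Richness estimate and its standard error, as quoted in the text.
s_hat = 7.4e6
se_s = 192.1e6
ratio = se_s / s_hat

print(f"sum of p_i    = {total:.7f}")  # 1.0000000
print(f"SE / estimate = {ratio:.1f}")  # 26.0
```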

The model used by Gans *et al*. (*2*) can be rewritten as (2) where μ > 0 is assumed known. The authors applied a minimum χ^{2} procedure that does not yield SEs for the parameter estimates. Their report lacked certain details, which prevented us from replicating their results exactly, but our model and fit are comparable. Our richness estimate is close to theirs, but the statistical SE is far higher than their informal calculation of a factor of, at most, 8.2. An SE of this magnitude makes intercommunity comparisons (e.g., richness in pristine versus polluted environments) statistically meaningless, because the range of possible values of the (unknown) richness of this community is virtually unbounded.
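To give a sense of what "virtually unbounded" means in practice, the following sketch computes an approximate 95% interval from the reported estimate and SE. The normal (Wald) approximation and the 1.96 multiplier are our assumptions for illustration; the text does not state which interval construction, if any, was used.

```python
# Reported richness estimate and standard error (from the text).
s_hat = 7.4e6
se_s = 192.1e6

# Wald-type 95% interval under a normal approximation (an assumption here).
z = 1.96
raw_lower = s_hat - z * se_s
upper = s_hat + z * se_s

# Richness cannot fall below one species (S >= 1), so truncate the lower end.
lower = max(1.0, raw_lower)

print(f"untruncated lower bound: {raw_lower:.3e}")
print(f"interval for S: [{lower:.0f}, {upper:.3e}]")
```

Under these assumptions the untruncated lower bound is negative, so the interval runs from a single species to several hundred million.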

These results are sensitive to model assumptions to an unknown degree. For example, if γ is estimated from the data, then a simpler model fits very well with SSE 1.22 × 10^{–3}, but γ and *S* are estimated as 0.1095 (SE, 0.003) and 629 (SE, 120), respectively. Until such robustness issues are clarified, any results must be regarded as contingent on numerous questionable assumptions.
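The size of this sensitivity can be made concrete by comparing the two fits side by side. The sketch below uses only the estimates and SEs quoted in the text; the dictionary layout and labels are ours.

```python
# Richness estimates and standard errors quoted in the text for the two fits.
fits = {
    "gamma fixed at 0.45": {"S": 7.4e6, "SE": 192.1e6},
    "gamma estimated as 0.1095": {"S": 629.0, "SE": 120.0},
}

# Relative standard error (SE divided by estimate) for each fit.
rel_se = {name: f["SE"] / f["S"] for name, f in fits.items()}

# Fold difference between the two richness estimates.
fold = fits["gamma fixed at 0.45"]["S"] / fits["gamma estimated as 0.1095"]["S"]

for name, value in rel_se.items():
    print(f"{name}: SE / S = {value:.2f}")
print(f"fold difference in S: {fold:.0f}")
```

The two fits have similar SSE values, yet the richness estimates differ by roughly four orders of magnitude.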

We also noted certain debatable aspects of the original experimental protocol and measurement procedure. First, Gans *et al*. (*2*) assumed that the DNA analyzed in the Cot analysis of Sandaa *et al*. (*1*) was bacterial in nature. We tested the bacterial extraction technique described (*1*) and observed considerable contamination of the bacterial pellet with eukaryotic cells/tissues. The presence of eukaryotic genomes in the DNA extract would introduce substantial error into estimates of bacterial richness using reassociation kinetics data. Second, DNA reassociation was estimated by measuring changes in hypochromicity (Δ*h*), a practice that can greatly underestimate the reassociation of repetitive sequences in complex DNA mixtures (*9*, *10*) (Fig. 2). A population of soil bacteria may be dominated by a few species (*11*, *12*) whose sequences would effectively reassociate like eukaryotic repetitive elements; in fact, our estimated abundance distribution shows just this structure. In this case, normal variation in homologous DNA sequences would result in formation of duplexes with partial strand mismatch, which is believed to underlie the reduced Δ*h* of renatured eukaryotic repeats (*9*). Extrapolation of partial Δ*h* Cot curves to “completion,” as was done by Gans *et al.* (*2*), amplifies these errors.

Current soil bacterial species richness estimates range from < 100 (*13*) to almost 10^{7} (*2*). Many of these estimates may be correct, although imprecise: When *SE* ≈ 2 × 10^{8}, an estimate may assume almost any value and remain correct, although uninformative. Informative estimation of species richness by DNA reassociation kinetics will require more precise parameter estimation, a more realistic physical model (*14*), and analysis of sensitivity to assumptions and constants.

**Supporting Online Material**

www.sciencemag.org/cgi/content/full/313/5789/918c/DC1

Materials and Methods

SOM Text

Fig. S1

References