## Abstract

The complexity of soil bacterial communities has thus far confounded effective measurement. However, with improved analytical methods, we show that the abundance distribution and total diversity can be deciphered. Reanalysis of reassociation kinetics for bacterial community DNA from pristine and metal-polluted soils showed that a power law best described the abundance distributions. More than one million distinct genomes occurred in the pristine soil, exceeding previous estimates by two orders of magnitude. Metal pollution reduced diversity by more than 99.9%, revealing the highly toxic effect of metal contamination, especially for rare taxa.

For any complex system, the number and relative abundance of parts is fundamental to a quantitative description of the system. Quantification provides a framework to compare equilibrium and dynamic properties and, for biological communities, to evaluate perturbations such as pollution, global climate change, and foreign species encroachment. To quantify plant and animal communities, ecologists survey the number and relative abundance of species (i.e., species-abundance distributions) (*1*, *2*). However, effective measurement of bacterial species-abundance distributions has eluded microbiologists owing to the overwhelming complexity of bacterial communities.

Surveys of bacterial communities are typically attempted by counting small subunit rRNA (16S rRNA) gene sequences. Aside from the technical difficulties and biases (*3*), the survey size required for accurate analysis of soil communities is impractically large. Accurately estimating diversity in a community with a log-normal species-abundance distribution requires sampling about 80% of the species (*4*, *5*). For a typical gram of soil containing a billion bacterial cells, a survey of at least 10^{6} 16S rRNA gene sequences, three orders of magnitude larger than current survey efforts, would be required to sample 80% of diversity in a community with 10,000 species.

Measuring total genetic diversity overcomes the limitations of surveys (*6*). By using two simplifying assumptions, genetic diversity can be translated into species diversity. Genetic diversity can be inferred from DNA reassociation kinetics of pooled genomic DNA from a bacterial community. The length of time for reassociation is proportional to the number and relative abundance of distinct sequence fragments (*7*). In 1990, landmark reassociation studies with bacterial community DNA provided the basis for the now widely accepted paradigm of “10,000 bacterial species per gram soil” (*6*). However, genetic diversity has been grossly underestimated as a result of the use of an analytical approach that implicitly assumes all bacterial species in a sample are equally abundant.

Using previously published data for community DNA from pristine and metal-contaminated soils (*8*), we demonstrated an approach that enables quantitative comparison of different species-abundance models. The original reassociation study was performed to assess the effects of heavy metal pollution in soil caused by the repeated application of sewage sludge (*8*). The authors purified bacterial cells from soil samples, extracted DNA from the bacterial cells, extensively purified the DNA by repeated hydroxyapatite chromatography, and then monitored DNA reassociation in sealed cuvettes by optical absorbance (*8*).

When all sequences are equally abundant, the reassociation of DNA, monitored spectroscopically, follows pseudo-second-order reaction kinetics (*9*). For samples containing DNA fractions that differ in relative abundance, a modified version of the basic equation for reassociation kinetics was developed that allows *n* fractions (abundance classes) with different reassociation rates but does not enable direct comparison of different abundance models (*10*), i.e.,

$$\frac{[C]}{[C_0]} = \sum_{i=1}^{n} f_i \left(1 + k_i [C_0]\, t\right)^{-\gamma} \qquad (1)$$

where [*C*] is the concentration of single-stranded DNA, [*C*_{0}] is the initial concentration of single-stranded DNA, *t* is time, γ (the “retardation factor”) is a heuristic DNA-sequence-independent constant (*9*), *k*_{i} and *f*_{i} are the reassociation rate and relative abundance of the *i*th DNA fraction, and *n* is the number of fractions.
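As an illustration of the multi-fraction kinetics described above, the following minimal sketch evaluates the fraction of DNA remaining single-stranded, assuming each fraction decays as (1 + *k*_{i}[*C*_{0}]*t*)^{-γ} weighted by *f*_{i}. The function name and the default value of `gamma` are illustrative placeholders, not values taken from the study.

```python
def remaining_fraction(t, c0, rates, fractions, gamma=0.45):
    """Fraction of DNA still single-stranded at time t.

    rates[i] and fractions[i] are k_i and f_i for the i-th abundance class.
    gamma=0.45 is only a placeholder for the retardation factor (assumption).
    """
    if len(rates) != len(fractions):
        raise ValueError("rates and fractions must have the same length")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    # Sum the contribution of every abundance class.
    return sum(f * (1.0 + k * c0 * t) ** (-gamma)
               for k, f in zip(rates, fractions))


# Two hypothetical classes: a fast (abundant) one and a slow (rare) one.
curve = [remaining_fraction(t, 1.0, [10.0, 0.01], [0.3, 0.7])
         for t in (0.0, 1.0, 10.0, 100.0)]
```

With a single class and `gamma=1.0` the expression reduces to ideal second-order kinetics, 1/(1 + *k*[*C*_{0}]*t*), which is a convenient sanity check.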

We recast this equation in terms of the total number of species (*S*_{t}) in a community and the relative abundance of each species to allow direct comparison of different abundance models. Using a number of substitutions and approximations (*11*), we obtained

$$\frac{[C]}{[C_0]} = \frac{1}{\langle N \rangle} \int_{0}^{\infty} N\, P(N) \left(1 + \beta N [C_0]\, t\right)^{-\gamma} dN \qquad (2)$$

where *N* is the number of individuals, *P*(*N*)*dN* is the normalized species-abundance distribution, 〈*N*〉 = *N*_{t}/*S*_{t} is the average number of individuals per species, and β is the ratio of the reassociation rate of a reference genome (e.g., *Escherichia coli*) and the total number of individuals (*N*_{t}) in a sample.
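The recasting in terms of species abundances can be sketched in discrete form. This is a simplified illustration under stated assumptions, not the authors' implementation: every species is assumed to have the same genome size, so that its DNA fraction is *N*_{i}/*N*_{t} and its rate is β*N*_{i} with β = `k_ref` / *N*_{t}. The names `community_curve` and `k_ref` and the default `gamma` are hypothetical.

```python
def community_curve(t, c0, abundances, k_ref, gamma=0.45):
    """Single-stranded fraction for a community given per-species counts.

    abundances: number of individuals N_i of each species.
    k_ref: reassociation rate of a reference genome (assumed known).
    gamma: placeholder retardation factor (assumption).
    """
    n_total = sum(abundances)
    beta = k_ref / n_total  # rate per individual
    return sum((n / n_total) * (1.0 + beta * n * c0 * t) ** (-gamma)
               for n in abundances)


# Same total number of cells, different evenness (hypothetical counts).
even = community_curve(4.0, 1.0, [10, 10, 10, 10], k_ref=1.0, gamma=1.0)
skewed = community_curve(4.0, 1.0, [37, 1, 1, 1], k_ref=1.0, gamma=1.0)
```

In this toy case the skewed community has reassociated further than the even one at the same time point, because its dominant species anneals quickly. The same effect is why assuming equal abundances biases diversity estimates.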

The evaluation of Eq. 2 requires values for β and γ (*11*) and an a priori form for the species-abundance distribution. In the absence of a strong justification for a particular distribution, we adopted a variety drawn from macroecology (*11*). To provide a relatively unbiased, heuristic estimate of *P*(*N*), we also compared a piece-wise linear approximation (Eq. 3). This “model-free” approximation has the form of a histogram with geometric bar widths. There are *n* bars, with heights *p*_{i} and widths Δ^{i}*N*_{0}, where *N*_{0} is the location of the left edge of the first bar and θ is the Heaviside step function. This yields *n*+2 free parameters to be determined by fitting Eq. 3 to experimental Cot curves. Unlike the standard abundance models, the model-free approximation has a flexible shape that does not require symmetry or continuity, and it consequently provides a useful baseline for assessing the fit of standard models.
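The histogram form of the model-free approximation can be illustrated with a short sketch. The indexing is an assumption: bar *i* (for *i* = 1 to *n*) is given width Δ^{i}*N*_{0}, with the first bar starting at *N*_{0}. The function names are hypothetical.

```python
def bar_edges(n_bars, delta, n0):
    """Edges of n_bars histogram bars whose widths grow geometrically."""
    edges = [n0]
    for i in range(1, n_bars + 1):
        edges.append(edges[-1] + delta ** i * n0)  # width of bar i (assumed)
    return edges


def normalize_heights(heights, edges):
    """Rescale bar heights so the histogram integrates to 1."""
    area = sum(p * (hi - lo)
               for p, lo, hi in zip(heights, edges, edges[1:]))
    return [p / area for p in heights]


def density(n, heights, edges):
    """Evaluate the piecewise-constant P(N) at abundance n."""
    for p, lo, hi in zip(heights, edges, edges[1:]):
        if lo <= n < hi:
            return p
    return 0.0  # outside the histogram support


edges = bar_edges(2, 2.0, 1.0)                 # [1.0, 3.0, 7.0]
heights = normalize_heights([1.0, 1.0], edges)
```

Here the free parameters are the *n* bar heights together with `delta` and `n0`, matching the *n*+2 count in the text.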

Using this framework, we reanalyzed the three published (*8*) reassociation data sets for bacterial communities. Two observations were noteworthy. First, we were able to describe the general shape of the abundance distribution. Second, we were able to estimate improved boundaries for the total amount of genetic diversity.

Empirical data and simulations both demonstrate that DNA reassociation kinetics can accurately identify different abundance patterns, although the resolving power depends on the completeness of the reassociation curve. For example, a delta function (a distribution in which all components, e.g., genes, are equally abundant) provided the best fit (as expected) for experimental, single-species *E. coli* DNA reassociation curves. For contrast, we simulated DNA reassociation for a theoretical bacterial community with 5000 species following a lognormal abundance distribution (*11*). After adding Gaussian noise to the reassociation curve equal to the noise seen in the soil DNA data sets, we fit the curve with a variety of abundance models and compared the fits using χ^{2}_{ν} (i.e., reduced χ^{2}, which accounts for differences in the number of free parameters between models). Even with a reassociation curve only 50% complete, reduced χ^{2} values clearly identified the underlying distribution as lognormal while other models were easily excluded (fig. S1).
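The model comparison statistic mentioned above can be sketched as follows. This is a generic reduced χ^{2} calculation, assuming a single, known measurement standard deviation `sigma` for all points. The function name and the sample numbers are illustrative only.

```python
def reduced_chi_square(observed, predicted, sigma, n_params):
    """Chi-square per degree of freedom for a fitted curve.

    n_params is the number of free parameters of the model, so models
    with more parameters are penalized through fewer degrees of freedom.
    """
    dof = len(observed) - n_params
    if dof <= 0:
        raise ValueError("need more data points than free parameters")
    chi2 = sum(((o - p) / sigma) ** 2 for o, p in zip(observed, predicted))
    return chi2 / dof


# Hypothetical data: four points, each off by exactly one sigma.
value = reduced_chi_square([1.0, 2.0, 3.0, 4.0],
                           [1.1, 1.9, 3.1, 3.9],
                           sigma=0.1, n_params=2)
```

A value near 1 indicates residuals comparable to the measurement noise, and a lower value for one model than another, at a given noise level, indicates the better fit after accounting for parameter count.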

For the soil bacterial communities (*8*), the value of reduced χ^{2} obtained from fitting each model ruled out the delta, top hat, geometric, and neutral models for the species-abundance distribution (Fig. 1). The lognormal distribution was also discounted because it consistently produced larger reduced χ^{2} values than the zipf distribution (*11*). The remaining models were qualitatively similar. For all three soil DNA reassociation curves, the model-free curve provided the best fit (Fig. 1), followed closely by the zipf and log-Laplace models (which were statistically indistinguishable). The fluctuation in the model-free DNA reassociation curve for the noncontaminated soil (Fig. 1) reflected a deviation in the shape of the species-abundance distribution (Fig. 2), not a significant increase in species diversity compared with the zipf model.

The zipf and log-Laplace distributions shared the same power-law form describing the most abundant (large *N*) bacterial species. The power-law envelope defined by the zipf distribution had the form *P*(*N*) ∼ *N*^{z}, where *z* was approximately –2 (*z* = –1.96 ± 0.02, –2.11 ± 0.01, and –2.08 ± 0.03 for the noncontaminated, low-metal, and high-metal data sets, respectively). Power laws have described the abundance distribution of artificial life forms (*z* = –2, most commonly) (*12*), marine phages (*z* = –1.64 and –1.73) (*13*, *14*), and plant communities (*15*) and may arise from a variety of mechanisms (*12*, *16*, *17*). Alternatively, a log-Laplace distribution, which would appear as a power law when measured by DNA reassociation, may arise from an ensemble of lognormals (*18*, *19*) that individually describe the abundance distribution of different functional groups (e.g., denitrifiers, iron reducers, and sulfate reducers).
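To make the consequence of a power law with *z* near –2 concrete, the sketch below computes the fraction of species whose abundance falls below a cutoff, assuming a continuous density *P*(*N*) ∝ *N*^{z} truncated to a finite range. The range limits `n_min` and `n_max` and the cutoff are hypothetical values chosen for illustration, not parameters from the fits.

```python
def fraction_below(cutoff, z, n_min, n_max):
    """Fraction of species with abundance < cutoff when P(N) ~ N**z
    on the interval [n_min, n_max] (continuous approximation)."""
    if z == -1:
        raise ValueError("z = -1 requires the logarithmic form")
    cutoff = min(max(cutoff, n_min), n_max)  # clamp to the support
    a = z + 1.0

    def integral_to(x):
        # Integral of N**z from n_min to x, up to the common factor 1/a.
        return x ** a - n_min ** a

    return integral_to(cutoff) / integral_to(n_max)


# Hypothetical range of 1 to 10**6 cells per species.
rare_share = fraction_below(100.0, -2.0, 1.0, 1e6)  # about 0.99
```

Under these assumed limits, roughly 99% of species have fewer than 100 individuals, which illustrates how a steep power law concentrates diversity in rare taxa.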

The zipf and log-Laplace distributions differed mathematically in describing the rare species (Fig. 3). This difference between the two functions was not apparent in the values of reduced χ^{2} because of the incompleteness of the curves and the magnitude of the measurement error, which masks small changes in the shape. The ambiguous shape of the distribution for rare species demonstrates that a portion of the community is veiled. Although a reasonable estimate can be obtained of the minimum number of species in the community (including the veiled fraction), additional work is required to obtain a fully accurate description of the entire species-abundance distribution.

Although the shape of the abundance distribution is of fundamental importance, the total diversity is often of greatest interest in environmental assessment and regulatory policy. For each soil, the model-free, zipf, and log-Laplace estimates of *S*_{t} agreed within a factor of two (Fig. 4). Given the qualitative and quantitative similarity of these distributions, we averaged the three to obtain an estimate for each soil. Thus, the noncontaminated, low-metal, and high-metal soils respectively contained about 8.3 × 10^{6}, 6.4 × 10^{4}, and 7.9 × 10^{3} species among approximately 10^{10} cells [or 10 g of soil; this represents the quantity of DNA used in the reassociation experiments (*11*)]. Our estimates of *S*_{t} were larger by a factor of 4 to 500 than the original estimates of 1.6 × 10^{4}, 6.4 × 10^{3}, and 2.0 × 10^{3} species.

On the basis of our estimates, metal pollution reduced diversity by more than 99.9%. Interestingly, total bacterial biomass remained unchanged at about 2 × 10^{9} cells per gram of soil despite metal exposure (*8*). Our abundance models were consistent with this observation and indicated that the major effect of metal exposure was the elimination of rare taxa (Fig. 2). In the pristine soil, taxa with abundance values <10^{5} cells per gram accounted for 99.9% of the diversity, and genetic diversity from this fraction of the community appears to have been purged by high metal pollution. The functional importance of these rare taxa for soil nutrient cycling and ecosystem resilience is unknown.
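The ratios quoted above follow directly from the reported species counts. The short check below uses only the numbers given in the text; the variable names are illustrative.

```python
# Species estimates: noncontaminated, low-metal, high-metal soils.
new_estimates = [8.3e6, 6.4e4, 7.9e3]
original_estimates = [1.6e4, 6.4e3, 2.0e3]

# How much larger the revised estimates are (roughly 519, 10, and 4).
fold_increase = [new / old
                 for new, old in zip(new_estimates, original_estimates)]

# Diversity loss from pristine to high-metal soil under each analysis.
new_fold_loss = new_estimates[0] / new_estimates[2]             # about 1050
old_fold_loss = original_estimates[0] / original_estimates[2]   # 8
percent_reduction = 100.0 * (1.0 - new_estimates[2] / new_estimates[0])
```

The revised loss of about a factor of 1000 corresponds to a reduction slightly above 99.9%, whereas the original estimates implied only a factor of 8.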

To assess the overall error for *S*_{t}, we calculated the net impact of all error sources, including measurement error, Cot curve completeness, calibration rate, and hybridization of mismatched DNA (*11*). The relatively minor effects of the first two factors were included in the error estimates for *S*_{t} shown in Fig. 4 and could be reduced further (fig. S3). Given that all error sources were random and uncorrelated, the total error for *S*_{t}, calculated by standard propagation of errors (*20*), was a factor of 8.2. As this error range affects *S*_{t} but is not expected to influence the relative differences between the soils, we are confident of the relative impact of metal pollution.

Comparing the ability of numerous species-abundance distributions to reproduce experimental DNA reassociation data showed that the soil bacterial communities were naturally best represented by the model-free approximation, followed closely by the zipf (i.e., power law) distribution. Hence, the original study substantially underestimated the species diversity of pristine soil bacterial communities. Moreover, heavy metal pollution reduced bacterial diversity not by a factor of 8, as previously suggested, but by a factor of about 1000, with rare species impacted the most. Although the minimum number of species in the soils can be estimated, the exact shape of the abundance distribution for rare species remains ambiguous and is an area for additional work. Overall, the improved analytical approach demonstrates that rigorous DNA reassociation studies can address otherwise intractable problems in microbial ecology, such as monitoring environmental perturbations and mapping diversity geographically.

**Supporting Online Material**

www.sciencemag.org/cgi/content/full/309/5739/1387/DC1

Materials and Methods

Figs. S1 to S3

Table S1

References