Single-Cell Genomics Reveals Hundreds of Coexisting Subpopulations in Wild Prochlorococcus

See allHide authors and affiliations

Science  25 Apr 2014:
Vol. 344, Issue 6182, pp. 416-420
DOI: 10.1126/science.1248575

Cyanobacterial Diversity

What does it mean to be a global species? The marine cyanobacterium Prochlorococcus is ubiquitous and, arguably, the most abundant and productive of all living organisms. Although to our eyes the seas look uniform, to a bacterium the ocean's bulk is a plethora of microhabitats, and by large-scale single-cell genomic analysis of uncultured cells, Kashtan et al. (p. 416; see the Perspective by Bowler and Scanlan) reveal that Prochlorococcus has diversified to match. This “species” constitutes a mass of subpopulations—each with million-year ancestry—that vary seasonally in abundance. The subpopulations in turn have clades nested within that show covariation between sets of core alleles and variable gene content, indicating flexibility of responses to rapid environmental changes. Large sets of coexisting populations could be a general feature of other free-living bacterial species living in highly mixed habitats.


Extensive genomic diversity within coexisting members of a microbial species has been revealed through selected cultured isolates and metagenomic assemblies. Yet, the cell-by-cell genomic composition of wild uncultured populations of co-occurring cells is largely unknown. In this work, we applied large-scale single-cell genomics to study populations of the globally abundant marine cyanobacterium Prochlorococcus. We show that they are composed of hundreds of subpopulations with distinct “genomic backbones,” each backbone consisting of a different set of core gene alleles linked to a small distinctive set of flexible genes. These subpopulations are estimated to have diverged at least a few million years ago, suggesting ancient, stable niche partitioning. Such a large set of coexisting subpopulations may be a general feature of free-living bacterial species with huge populations in highly mixed habitats.

The cyanobacterium Prochlorococcus is the smallest and most abundant photosynthetic cell in the oligotrophic oceans, contributing substantially to global photosynthesis (1). A single species by traditional measures, Prochlorococcus can be divided into several major clades, or ecotypes, defined by the intergenic transcribed spacer (ITS) region of their ribosomal RNA (rRNA) genes. These ecotypes are physiologically distinct (24); display distinctive seasonal, depth, and geographic patterns (3); and, like other microorganisms (510), embody tremendous genotypic and phenotypic diversity (4). To begin to understand the scope and limits of ecologically meaningful diversity within the canonical Prochlorococcus ecotypes, we examined cell-by-cell genomic diversity within a small sample of seawater and explored how it shifts in a dynamic environment.

We applied single-cell genome sequencing (1114) to wild Prochlorococcus cells from samples collected at the Bermuda-Atlantic Time-series Study (BATS) site at three separate times of year (November 2008, February 2009, and April 2009) (Fig. 1A) (15). Because light, temperature, nutrients, and co-occurring communities change with winter deep mixing (15, 16) (Fig. 1A), cells experience substantial environmental changes over tens of generations, enough to cause shifts in abundance of ITS-defined ecotypes (2, 15, 17). Flow sorting and DNA amplification (1114) of more than 1000 co-occurring Prochlorococcus cells allowed us to explore the cell-by-cell genomic composition of these wild populations. We were able to identify coherent subpopulations at the whole-genome level and their relationship to those defined by the ITS region, explore finely resolved diversity patterns within and between subpopulations, and examine shifting abundances with seasonal changes in the habitat.

Fig. 1 Cell-by-cell Prochlorococcus population composition in samples from three separate times of year at the BATS site.

Cells were collected within the mixed layer at 60 m depth in November 2008, February 2009, and April 2009 [see (15)]. (A) Schematic of seasonal dynamics at BATS and sampling design. (Top) A typical mixed-layer depth profile and context of our three samples. (Bottom) Typical average dynamics of light [smoothed mean surface photosynthetically active radiation (PAR) from 2004 to 2009], temperature, and nitrogen (within the mixed layer, averaged from 1999 to 2009) experienced by cells (15). Winter deep mixing brings cold nutrient-rich water to the surface. (B) Phylogenetic trees from pairwise genetic distances of ITS-rRNA sequences of individual cells from each sample [based on multiple alignment (15)]. The relevant sub-tree range of the known ecotypes (2) is marked above each tree if cells belonging to that ecotype were found, as is the division into low-light–adapted (LL) and high-light–adapted (HL) groups (2). (C) Heat maps describing the pairwise distance matrix between ITS-rRNA sequences of individual cells from each sample. Rows and columns are arranged according to the order of leaves of the trees shown in (B). The color map represents genetic distances as a percentage of base substitutions per site (log scale), such that the blue blocks identify very closely related ITS ribotypes. ITS sequences from cultured isolates with completely sequenced genomes are denoted by asterisks centered on the relevant line. Names of the largest clusters are marked in bold (e.g., cN2).

We first examined the population composition by sequencing the ITS regions of hundreds of Prochlorococcus cells in each sample, revealing the presence of finely resolved clusters within the broadly defined ecotypes (Fig. 1B). The populations were composed of tens to hundreds of nearly identical ITS clusters (>98% similar) within the coarse-grained ecotypes (Fig. 1, B and C). The relative abundance of cells belonging to the different clusters changed with season (Fig. 1, A to C) (15), suggesting shifts in their relative fitness in response to environmental changes.

To study the fine-scale genomic variation and compare it with the ITS-defined clusters, we sequenced the partial genomes (representing, on average, 70% of the total genome) of 90 individual cells (30 per sample) from the largest nearly identical ITS cluster, cN2 (Figs. 1C and 2), as well as 6 cells from two other clusters, cN1 and c9301. For each time of year, cells were randomly selected for genome sequencing from within the major ITS ribotypes (>99% similar) within cluster cN2 (C1 to C5) (30 cells), as well as from c9301-C8 and cN1-C9 (one cell each), as detailed in (15). We used a modified mediator genome reference assembly approach (15, 18) to analyze between-cell variation in the partial genomes recovered. The topologies of the ITS and genomic trees were highly congruent (Fig. 2), indicating that ITS sequences can serve as a proxy for genome sequences in Prochlorococcus at a much finer level of resolution than previously demonstrated (4, 19). The genomic data further revealed that the largest cluster cN2 is divided into five major clades [C1 to C5 (Fig. 2)] and a few additional minor clades represented by only one cell each. The delineation of clades C1 to C5 was highly robust and also observed in trees constructed from genomic position subsets (figs. S1 and S2).

Fig. 2 ITS-rRNA sequence and whole-genome neighbor-joining phylogenetic trees at a fine resolution of diversity.

(A) Phylogenetic tree based on ITS-rRNA sequences of 96 single cells (90 cN2 ribotypes, three cN1 ribotypes, and three c9301 ribotypes), as well as additional five high-light–adapted cultured strains. (B) Phylogenetic tree of the 96 single cells based on whole-genome sequences. The colored symbols to the left of the leaf labels in (A) and (B) represent the different clades depicted from the deep branches observed in the whole-genome tree. The sample origin of each cell is marked with red, blue, and green squares (representing autumn, winter, and spring, respectively) on the right. Distance units are base substitutions per site (see scale bar) (15). Bootstrap values <80 are marked as black dots on the internal nodes in (B) (fig. S1). Cells marked with # fall into an ITS clade that differs from the genome-defined clade. Neighbor-joining trees in (A) and (B) were constructed using p-distance.

To explore the evolutionary forces that shaped the cN2 C1 to C5 clades, we examined differences in nucleotide sequences within and between clades. For example, the C1 and C3 subpopulations (Fig. 2B) differ in 52,885 dimorphic single-nucleotide polymorphisms (SNPs), which represent 3.2% of their genomes (Fig. 3A, blue). The dimorphic SNPs between C1 and C3 are scattered across the genomes, occurring in 1519 out of 1974 genes (most of them core genes); 8% of these SNPs are found in intergenic regions (9% of the genome is noncoding). Of the intragenic SNPs, 37% are nonsynonymous, thus affecting the amino acid sequences of the proteins they encode. In contrast to the scattered nature of the sequence variation between the C1 and C3 clades, the polymorphism within them is confined to a few regions of the genome (Fig. 3A, black), indicating that most regions along the genome are conserved within clades and are different between them (15), which is true for all pairwise comparisons within C1 to C5 (figs. S3 and S4).

Fig. 3 Evidence for distinct genomic backbones defining Prochlorococcus subpopulations.

(A) Polymorphic sites within the cN2 clades C1 and C3 (black) and dimorphic sites between the two clades (blue) (15). The black-striped line below each bar graph marks positions with sufficient data for evaluation of site statistics. Genomic islands (ISL1, ISL2, etc.) (table S9) are shaded gray. (B) Genome-wide distributions of FST of all genes in the cN2-C1 composite genome, as computed for the five cN2 clades (C1 to C5), based on nucleotide sequences. Also shown is a representative FST distribution from coalescent simulations of neutral evolution (15). Genes with high FST exhibit higher sequence variation between the clades than within the clades. (C) Gene-by-gene profile of genetic differentiation between backbone subpopulations (FST). FST is estimated by the proportion of interpopulational gene diversity (γST) (20). Heat maps above are displayed for a few core genes with high FST. Each heat map shows the percentage of amino acid sequence substitutions between single cells, as well as cultured high-light–adapted strains.

This emerging pattern was further supported by a standard measure of genetic differentiation between populations, the fixation index (FST) (20), applied at gene-by-gene resolution to the five cN2 clades, C1 to C5 (Fig. 3, B and C). Seventy-five percent of the core genes had high FST values (>0.8), (Fig. 3, B and C) (15), meaning different clades contained significantly different alleles. Some of the differentiated core genes have functions involved in the interaction between the cell and environmental stimuli [e.g., transporters, genes that affect oxidative stress responses, and cell surface biosynthesis and modification (Data S1)]; that is, they are not all simply “housekeeping genes” that control central metabolism. For example, alleles of phosphoglucosamine mutase, which is involved in the biosynthesis of outer membrane lipopolysaccharides (21), differ by an average of 10% of their amino acid sequences (Fig. 3C), with substitutions in the hydrophilic center of the enzyme (21), possibly affecting its specificity and kinetics.

We next asked whether different clade subpopulations carry distinct sets of flexible genes. Using de novo assemblies to capture regions unmapped by the reference assemblies (15), we found that each subpopulation carries a small set of distinct genes, typically in the form of cassettes within genomic islands (Table 1). Cassettes containing genes in the glycosyltransferase family account for much of the gene content variation between these clade subpopulations (Table 1 and table S1). The gene content in these cassettes suggests involvement in outer membrane modifications, possibly affecting phage attachment (22), recognition by grazers (23), cell-to-cell communication, or interactions with other bacteria (24).

Table 1 Flexible gene cassettes associated with different cN2 backbone subpopulations highlighting gene content that may contribute to ecological differentiation.

GT, glycosyltransferase; ABC-T, adenosine triphosphate–binding cassette (ABC) transporter; HLIP, high-light–inducible protein; CO, Cytochrome oxidase c subunit VIb; HlpA, histone-like protein; CpsL, polysaccharide biosynthesis protein. In the “Selected gene annotations” column, numbers before gene annotation refer to number of that type of gene. A complete list of the genes in each cassette is described in table S1 (15).

View this table:

We conclude that these clade subpopulations have distinct “genomic backbones” (and are henceforth referred to as “backbone subpopulations”) consisting of highly conserved (within subpopulation) alleles of the majority of core genes and a small distinct set of flexible genes that is linked with a particular backbone. This covariation between the core alleles and flexible gene content, and its fine scale resolution, represents a new dimension of microdiversity within wild Prochlorococcus populations. It is noteworthy that similar patterns have been identified in cultured isolates and metagenomic assemblies within coexisting members of a few other microbial species with very different ecologies (510, 25), suggesting that differentiated genomic backbones may be a feature of diverse types of microbial populations.

At a finer resolution of diversity, we observed that cells within the five cN2 backbone subpopulations differ by 19,000 nucleotide positions on average, in comparison to 77,000 positions between backbone subpopulations (equivalent to 1.2 and 4.7% of the genome, respectively) (Fig. 2B). The most similar pairs of individual cell genomes in our samples differ in a few hundred base pairs [close to the detection limit when one considers single-cell processing and sequencing error (15)]; some of these pairs likely have identical gene content (15). Except for these few pairs, each cell carries at least one gene cassette not found in any other. In some cases, a few closely related cells (a subclade) within backbones share a distinct gene cassette. Among these genes are, again, glycosyltransferase genes, as well as transporters and genes involved in nucleotide binding and processing. In a few cases, cells from different backbone subpopulations carry similar flexible gene cassettes [e.g., high-light–related genes (Table 1) and phosphonate related genes], demonstrating the combinatorial nature of backbones and flexible genes.

If backbone subpopulations have differential fitness, we would expect their relative abundance to change with changing environmental conditions (Fig. 1). Accordingly, the majority of the largest subpopulations exhibited significant seasonal abundance variation (Fig. 4A), higher than expected by chance (15), consistent with the hypothesis that this reflects selection, but more data are needed to draw that conclusion. Backbone subpopulations maintain their genomic composition between seasons (tested for C1) (15), which we would expect, as the establishment of new mutations and the acquisition and loss of genes are not likely to be in play on these time scales (15).

Fig. 4 Abundance profiles of backbone subpopulations and the estimated number of coexisting subpopulations within samples.

(A) Relative abundance profiles of the 11 largest backbone subpopulations in our samples within the ITS clusters cNATL, cMED4, cN1, cN2, and c9301. A, autumn; W, winter; S, spring. Backbone names are marked near the relevant cluster on the ITS heat map. Backbone subpopulations were predicted by 99% ITS similarity for the full set of 1381 ITS sequences. Error bars represent SEM. NS, no significant changes between seasons (false discovery rate, α = 0.05) (15). (B) Rarefaction curves estimating the number of coexisting backbone subpopulations within samples (15). Backbones predicted as in (A). Error bars represent 95% confidence intervals. (Inset) Rarefaction curve of all samples combined.

The congruency of genomic and ITS phylogenies in Prochlorococcus at both coarse (4, 19) and fine resolution (Fig. 2) suggests that ITS-ribotype clusters coincide, in most cases, with distinct genomic backbones (15). This allowed us to estimate the number of coexisting backbone subpopulations in our samples through rarefaction analysis, revealing at least hundreds of coexisting subpopulations with distinct backbones (Fig. 4B) in each sample. These backbone subpopulations are estimated to have diverged at least a few million years ago (15), suggesting ancient, stable niche partitioning. That they have different alleles of genes associated with environmental interactions, carry a distinct set of flexible genes, and differ in relative abundance profiles as the environment changes suggests strongly that they are ecologically distinct.

Enormous population sizes and immense physical mixing probably played a role in the evolution of diverse genomic backbones in Prochlorococcus. A simple fluid mechanics model bridging the micrometer and kilometer scales for a typical ocean suggests that just-divided cells will be centimeters apart within minutes, tens of meters apart within an hour, and a few kilometers apart within a week (15). Thus, Prochlorococcus populations are expected to be well mixed over large water parcels (~10 km2 area by 3 m depth) on ecologically relevant time scales (~1 week) (15). This mixing and a stable collective Prochlorococcus population density of 107 to 108 cells liter−1 (17) make the size of each backbone subpopulation in such parcels enormous (>1013 cells) (15). The effective population size is arguably close to this census population size (15), implying that Prochlorococcus evolution is governed by selection, not genetic drift [based on population genetics theory (26)]. Consistent with this argument, the difference in the observed FST distribution from that estimated for no selection (Fig. 3B) provides further evidence that the differentiation of genomic backbones in Prochlorococcus is a product of selection (15).

The correlation between phylogeny and flexible gene content (Table 1, tables S1 and S13, and fig. S5) leads us to propose that the emergence of a genomic backbone is initiated by the acquisition of a beneficial flexible gene cassette, followed by slow fine-adjustment of the core gene alleles to the new niche dimension afforded by the acquired cassette. Given the huge effective population size, even extremely weak fitness differentials among alleles (27) can facilitate fine-adjustment of core genes (15) over the millions of years of evolution after divergence.

The diverse set of hundreds of subpopulations with distinct genomic backbones probably plays an important role in the dynamic stability of the Prochlorococcus “collective” in the global oceans (fig. S6). Small fitness differentials, niche differentiation, and selective phage and grazer predation, in the context of temporal and spatial environmental variation, help to explain their coexistence (28, 29). On seasonal time scales, the Prochlorococcus collective maintains a relatively stable population size through temporal and local adjustments in the relative abundance of backbone subpopulations (Figs. 1C and 4A and fig. S6D). On longer time scales (decades to millions of years), the collective may respond to shifting selective pressures through the exchange of gene cassettes between and within backbone subpopulations, and through the evolution of the backbones themselves. The coherence of the collective population holds as long as subpopulations do not diverge to the point where they are no longer able to exchange flexible genes and backbone extinction and emergence rates are relatively balanced. If Prochlorococcus backbone subpopulations were designated as distinct species (30), it would imply that the global collective is an assortment of thousands of species. It is likely that such a large set of coexisting subpopulations with distinct genomic backbones is a characteristic feature of free-living bacterial species with very large population sizes living in highly mixed habitats.

Supplementary Materials

Materials and Methods

Figs. S1 to S21

Tables S1 to S13

References (3191)

Data S1

References and Notes

  1. Materials and methods are available as supplementary materials on Science Online.
  2. Acknowledgments: We thank S. Itzkovitz, P. H. R. Calil, D. Sher, R. Milo, P. M. Berube, A. P. Yelton, R. Braakman, and particularly M. F. Polz for comments on the manuscript. We thank the Bermuda Atlantic Time-series Study for sample collection, the Bigelow Laboratory Single Cell Genomics Center for single-cell sorting and whole-genome amplification, and the BioMicroCenter facility at MIT for their contributions to the generation of genomic data. N.K. acknowledges the Rothschild Foundation (Yad Hanadiv) and the National Oceanic and Atmospheric Administration “Climate and Global Change” Postdoctoral Research Fellowships. This work was supported by grants to S.W.C. from the NSF Evolutionary Biology Section and Biological Oceanography Section, the NSF Center for Microbial Oceanography Research and Education (C-MORE), the U.S. Department of Energy (DOE)–GTL, and the Gordon and Betty Moore Foundation Marine Microbiology Initiative; to R. Stepanauskas from the NSF Biological Oceanography Section; and to R.R.M from the DOE (contract number DE-AC02-05CH11231). Genomic data have been deposited in National Center for Biotechnology Information GenBank under accession numbers KJ477896 to KJ479276 and JFKN00000000 to JFOE00000000. Additional data files have been deposited to Dryad (doi:10.5061/dryad.9r0p6).
View Abstract

Navigate This Article