A sparse covarying unit that describes healthy and impaired human gut microbiota development

Characterizing the organization of the human gut microbiota is a formidable challenge given the number of possible interactions between its components. Using a statistical approach initially applied to financial markets, we measured temporally conserved covariance among bacterial taxa in the microbiota of healthy members of a Bangladeshi birth cohort sampled from 1 to 60 months of age. The results revealed an “ecogroup” of 15 covarying bacterial taxa that provide a concise description of microbiota development in healthy children from this and other low-income countries, and a means for monitoring community repair in undernourished children treated with therapeutic foods. Features of ecogroup population dynamics were recapitulated in gnotobiotic piglets as they transitioned from exclusive milk feeding to a fully weaned state consuming a representative Bangladeshi diet. INTRODUCTION Ecosystems such as the human gut microbiota are typically described by a “parts list” with enumeration of component members. Accordingly, the abundances of community components are commonly used as a metric for relating its configuration to features of its habitat and to the biological state of the host. Although this approach has provided much insight, the structure and function of biological systems are emergent, arising from the collective action of constituent parts rather than each part acting in isolation. This characteristic demands a different approach to describing the form of a microbiota—one that takes into consideration the abundances as well as the interactions between members. RATIONALE Borrowing from the fields of econophysics and protein evolution, where identification of conserved covariation has provided insights about the organization of complex dynamic systems, we searched for features amidst the seemingly intractable complexity of human gut microbial communities that could serve as a framework for understanding how they assemble and function. RESULTS A statistical workflow was developed to identify conserved bacterial taxon-taxon covariance in the gut communities of healthy members of a Bangladeshi birth cohort who provided fecal samples monthly from postnatal months 1 to 60. The results revealed an “ecogroup” of 15 bacterial taxa that together exhibited consistent covariation by 20 months of age and beyond. Ecogroup taxa also described gut microbiota development in healthy members of birth cohorts residing in Bangladesh, India, and Peru to an extent comparable to what is achieved when considering all detected bacterial taxa; this finding suggests that the ecogroup network is a conserved general feature of microbiota organization. Moreover, the ecogroup provided a framework for characterizing the state of perturbed microbiota development in Bangladeshi children with severe acute malnutrition (SAM) and moderate acute malnutrition (MAM), as well as a quantitative metric for defining the efficacy of standard versus microbiota-directed therapeutic foods in reconfiguring their gut communities toward a state seen in age-matched healthy children living in the same locale. These results highlight the importance of the ecogroup as a descriptor, both for fundamental and practical uses. A consortium of cultured ecogroup taxa, introduced into gnotobiotic piglets, reenacted changes in their relative abundances that were observed in human communities as the animals transitioned from exclusive milk feeding to a fully weaned state consuming a prototypic Bangladeshi diet. This pattern of change correlated with the representation of a sparse set of metabolic pathways in the genomes of these organisms and, in the fully weaned state, with their expression. CONCLUSION The ecogroup represents a simplified feature of community organization and components that could play key roles in community assembly and function. As the gut microbiota constantly faces environmental challenges, “embedding” a sparse network of covarying taxa in a larger framework of independently varying organisms could represent an elegant architectural solution developed by nature to maintain robustness while enabling adaptation. The approach used to identify and characterize the sparse network of covarying ecogroup taxa is, in principle, generalizable to a wide variety of ecosystems. Ecogroup as a concise description of microbiota form. (Top) Network diagram of covarying taxa where node (taxon) color indicates ecogroup (green) or nonecogroup (gray), node size indicates number of mutually covarying taxa, and connection between nodes indicates covariance between two taxa. (Bottom) Measuring the representation of ecogroup taxa reveals that children with SAM treated with standard therapeutic foods have an ecogroup profile similar to that of children with untreated MAM, indicating persistent perturbations in their gut community relative to healthy children. In contrast, children with MAM treated with a therapeutic food designed to target the microbiota (MDCF-2) have an ecogroup profile that overlaps nearly entirely with that of healthy children.


SUPPLEMENTARY RESULTS
The effect of bacterial load on ecogroup definition Given a set of N total fecal samples where each fecal sample (microbiota) contains a set of taxa, the fractional representation of any taxon can be calculated as " " = " (1) where b i and x i and X i represent the 'bacterial load', fractional abundance, and total abundance, respectively, of taxon 'x' for microbiota i. The covariance between taxon 'x' and taxon 'y' can be represented as The average fractional abundance of a taxon 'x', for instance, can be expressed in terms of bacterial load as the following The fractional abundance of taxon 'x' for any microbiota i can be expressed as total abundance and fractional abundance from (1) as " = " " (5) Substituting this into (4) Given the expression shown in (5), we can now address the case where (1) bacterial load is constant across all fecal samples, and (2) bacterial load is different across fecal samples.

Case 1: All bacterial loads are equal across all N microbiota
In the case that bacterial loads are equal across all N fecal samples, Thus, b i can be substituted for b, a constant bacterial load across all N. Substituting this into (5) gives , = 1 " − 1 3 + 5 + ⋯ +  (9) which is equal to Covariance calculated using absolute bacterial load between two taxa, 'X' and 'Y' is Thus from (10) and (11) , = 1 ( , ) (12) The result of (12) illustrates that when taking into account a constant bacterial load across an ensemble of fecal samples, the covariance computed between taxa 'x' and 'y' and between 'X' and 'Y' are related to each other by a constant-the inverse of the bacterial load.
In our statistical approach, temporally conserved taxon-taxon covariance is computed using fractional abundance measurements from month 20 to 60 of postnatal life across the healthy Mirpur cohort. If we were to take into account a constant bacterial load across all samples, this covariance matrix would scale in a directly proportionate fashion as (12) demonstrates.
The next step in our approach is to apply PCA to the temporally weighted covariance matrix. The first step of PCA is to compute the eigenvectors and eigenvalues of the input matrix. We can ask what is the effect of proportionately scaling data with respect to identifying eigenvalues and eigenvectors of a matrix? Given the temporally weighted covariance matrix C, the way to identify the eigenvalues of C is by solving where 'det' means determinant, I is the identity matrix of the same dimension as C and represents the eigenvalues to be solved. As an example, if C is a 2x2 matrix defined as Computing (17)  Using the quadratic formula to solve for in (22) gives the following solution for the eigenvalues of C Using the definition of the trace and determinant of matrix C from (20) and (21), (27) can be expressed as 5 − Tr + 5 det ( ) = 0 (28) Using the quadratic formula to solve for in (28) gives the following solution for the eigenvalues of C scaled by b.
Thus, scaling the matrix C does not affect the eigenvectors of the matrix, but only affects their scaling, and is a well-known result of linear algebra. An example of this result is shown in fig. S15A,B.

Case 2: All bacterial loads differ across all N microbiota
If bacterial loads are different between samples, the simplification from (6) to (8) no longer holds. Thus, as a simple example of how different bacterial loads affect covariance between taxa, assume N = 2. Therefore, eigenvalues for each eigenvector scale non-linearly, the eigenvectors themselves remain unchanged. Mathematically this is consistent with equation (32). If bacterial load is taken into consideration, the input values to the covariance calculation will be absolutely different, as shown by (1), but a relative measure of whether two taxa co-vary does not change. Mathematically, by equation (32), the set of eigenvalues for each eigenvector can be different, but the existence of the set of eigenvectors v (a set of transformed axes to represent the data) remains unchanged. The practical interpretation of this result is that taking into consideration differential bacterial load will only change the amount of variance represented by the principal component. Thus, identification of ecogroup taxa is invariant to differential bacterial load.

Dietary practices
The Generating RF-derived models of gut microbiota development in healthy members of birth cohorts representing geographically distinct regions and anthropologic characteristics MAL-ED is a network of eight study sites, located in low-income countries, dedicated to assessing the impact of enteric infections that alter gut function and impair the growth and development of infants and children. To define the extent to which age-discriminatory taxa are shared between infants and young children, we generated V4-16S rDNA datasets from fecal samples collected monthly for the first 2 postnatal years from members of MAL-ED birth cohorts with healthy growth phenotypes living in Loreto, Peru, Vellore, India, Fortaleza, Brazil and Venda, South Africa [n=22.4±2.8 (mean ± SD) fecal samples/child; total of 1639 samples; table S4]. 'Healthy' in these sites was defined as height-for-age and weight-for-height Z-scores (HAZ, WHZ) consistently no more than 1.5 standard deviations below the median calculated from a WHO reference healthy growth cohort (36). Bacterial V4-16S rDNA reads were grouped into 97%ID OTUs. Using the 16S rDNA dataset and a sparse 2-year, 30 OTU RF-derived model generated from 25 healthy members of the Bangladeshi birth cohort in Gehrig et al. (21), we determined that a minimum of 12 individuals would be required to construct a model with comparable performance ( fig. S9A-C). Based on this result, we generated RF-derived models of gut development from the sufficiently powered Indian and Peruvian datasets ( fig. S9D,E; table S12). Limiting models to 30 OTUs with the top ranked feature importance scores had only minimal impact on accuracy (i.e., the models were within 1% of the mean squared error obtained using all OTUs). Therefore, our subsequent analyses used sparse site-specific RFgenerated models that were each comprised of their 30 top-ranked 97% ID OTUs. The Peruvian and Indian models shared 13 OTUs, and 16 and 15 OTUs with the Bangladeshi model, respectively ( fig.  S9D,E).
We created a sparse 'aggregate' model from bacterial 16S rDNA datasets generated from all but the Bangladeshi birth cohort (i.e., the MAL-ED cohorts from India, Peru, Brazil and South Africa) ( fig.  S9F,G). To balance the representation of each site's contribution to the aggregate model, seven of the most densely sampled healthy individuals from each of the four sites were selected (see Methods; n=599 fecal samples). The resulting RF-derived aggregate model contained 17 of the 30 OTUs present in the sparse 2-year Bangladeshi RF-derived model, and 18 and 16 of the OTUs in the sparse Indian and Peruvian models, respectively (also see fig. S9H).

Sensitivity analyses of the workflow for identifying ecogroup taxa
We performed a sensitivity analysis of the workflow described in Fig. 1A-C. Specifically, we compared the projections along PC1 shown in Fig. 1C with results obtained (i) using unrarefied 16S rDNA data, (ii) considering compositional data from postnatal months 1 to 60 versus months 20 to 60, and (iii) using different thresholds for monthly covariance binarization.
The merits of rarefaction have been the source of extensive discussion in analyzing microbiota compositional data. While certain methods advocate using unrarefied, raw count data, other studies have argued rarefaction is a useful normalization method prior to ordination (6,37,38). The output of our workflow is the projection of each taxon onto PC1 of a temporally weighted covariance matrix (Fig. 1C). We found a linear relationship between the PC1 projections computed using unrarefied versus rarefied compositional data (Pearson r 2 value of 0.98; fig. S16A, table S13A) indicating that rarefaction does not affect our identification of consistently co-varying taxa in the ecogroup.
The framework of our workflow was to calculate conserved covariance within microbiota that had achieved a degree of stability with respect to their community structure. This required us to perform iPCA on the longitudinal Bangladeshi birth cohort in order to identify a starting month for the analysis. Fig.  S4B,C shows why we selected months 20 to 60. Choosing months 1 through 60 as compared to months 20 through 60 results in a similar pattern of taxa projections along PC1 but compresses the dynamic range of these projections ( fig. S16B, table S13B). Choosing months 25 or 30 through 60 as compared to months 20 through 60 results in a similar pattern of taxa projections along PC1 ( fig. S16C,D, table S13B).
In our workflow, monthly covariance values were binarized according to the top and bottom 10% of the distribution of covariance values. We performed a perturbation analysis of this threshold, evaluating the taxa projections onto PC1 for the top and bottom 5%, 20%, and 30%. Identification of ecogroup taxa is robust to changes in threshold choice; fig. S16E-G demonstrates that varying this threshold affects the dynamic range but not the order of taxa projections along PC1, particularly at the lower threshold (30%) (table S13C). Fig. 1A with two other methods, SPIEC-EASI and SparCC, for defining taxon interaction networks in the gut microbiota SPIEC-EASI (10) seeks to create an interaction graph using cross-sectional data by first applying the centered log-ratio transform then inferring the interaction graph by computing the inverse of a taxontaxon covariance matrix. The mathematical basis of this method is a well-known approach for solving what is termed 'the inverse problem'; i.e. inferring system interactions from correlations (39,40). SparCC (9) seeks to infer correlations between the abundances of taxa by first log-transforming community compositional data, thereby maintaining sub-compositional coherence, and then mathematically solving for the correlation coefficient that emerges from computing the variance of the log-ratio between abundances of any two taxa. Importantly, SparCC is meant to be used on observed counts, since normalization may generate unreliable results for rare OTUs (9).

Comparing the approach described in
We applied each method to 16S rDNA datasets generated from members of the healthy Bangladeshi birth cohort from postnatal months 1 to 60. Taxon-taxon monthly interaction matrices generated by SPIEC-EASI and covariance matrices produced by SparCC were then averaged from months 20 to 60 (table S6A,C). PCA was performed on the resulting matrices to identify 15 taxa that interact (SPIEC-EASI) or co-vary (SparCC) in a temporally conserved fashion (table S6B,D). These taxa were validated using the same criteria as for the ecogroup; namely, the ability to describe (i) healthy gut microbial development in children residing in Bangladesh, (ii) the configurations of SAM and MAM microbiota, and (iii) the effects of standard treatment on SAM microbiota and MDCF interventions on MAM microbiota.
Taxon projections along PC1 resulting from SparCC show a high degree of concordance with the taxon projections resulting from our workflow shown in  . S11B); this result is driven primarily by the presence of B. longum (OTU 559527). Moreover, configurational changes in the microbiota of children with SAM before and after treatment defined by the 15 SparCC identified taxa demonstrate a similar pattern of movement in PCA space as that described by the 15 taxa identified in our workflow (compare fig. S11C with Fig. 3A). This result is due to the inclusion of taxa that characterize differences in the SAM microbiota as a function of time and treatment, namely B. longum (559527), Bifidobacterium (484304), S. gallolyticus (349024), and F. prausnitzii (514940) (Fig.  3C). Reducing the stringency of inclusion to the 21 SparCC taxa that project most significantly onto PC1 captures the two ecogroup P. copri OTUs (588929 and 840914).
PCA performed on the temporally averaged interaction matrix produced by SPIEC-EASI revealed two significant principal components ( fig. S12A,B, table S6D); PC1 isolates Prevotella species while PC2 distinguishes P. copri OTUs 588929 and 840914. The 15 taxa that project significantly onto PC1 computed using the SPIEC-EASI workflow comprise 6 ecogroup taxa and 9 non-ecogroup taxa (table  S6E). See the main text and fig. S11 and fig. S12 for a comparison of these approaches and the approach shown in Fig. 1A in characterizing (i) healthy gut microbiota development and (ii) the effects of treatment on the configurations of fSAM-and MAM-associated microbiota.

SUPPLEMENTARY FIGURES
fig. S1. A comparison of taxonomic assignments generated by QIIME and Amplicon Sequence Variants (ASVs) by DADA2. Sensitivity analysis directly comparing taxonomic assignments using OTUs versus amplicon sequence variants (ASVs) (15,16). (A) Summary of workflow. An ASV database is created by running all datasets described in this report and Gehrig et al. (21) through the DADA2 pipeline. Fifteen ASVs are randomly chosen; for each ASV, a V4-16S rDNA sequence that has 100% identity with the ASV is defined as the 'primary OTU sequence.' For each 'primary OTU sequence', a library of 1000 sequences with at least 97% identity is generated. Each sequence in each library is then compared to the ASV database and the ASV with the maximum sequence identity (MSI) is noted. If the MSI ASV corresponds to the ASV from which the 'primary OTU sequence' was generated (defined as the 'correct ASV'), the sequence in the library is given a '1'; otherwise, the sequence in the library is given a '0'. An average score is computed for the entire library. The process of library generation and sequence score designation is repeated 10 times and an overall average is computed. This average represents the probability of assigning the 'correct ASV' to the 'primary OTU sequence' given a sequence divergence of ≤ 3%.  1 through 60. (A,B) UniFrac, a beta-diversity dissimilarity metric that measures the degree to which any two communities share branch length on a bacterial phylogenetic tree, was used to calculate the degree of dissimilarity between each sampled child's fecal microbiota at each timepoint of fecal collection (n = 36 individuals; 1961 samples) relative to samples profiled from unrelated adults who also lived in Mirpur (n = 12 males, 49 samples). Unweighted (panel A) and weighted UniFrac (panel B) distances are plotted as mean values ± SD. As a reference control, the distances between adult samples relative to one another are shown. (C,D) Alpha-diversity metrics [Shannon diversity index (SDI) and Phylogenetic diversity (PD)] plotted as mean values ± SD for each monthly age bin and for adult samples. Clostridiales sp.
Clostridiales sp.   (Fig. 1C). (E) Heatmap of the monthly distribution of relative abundances of age-discriminatory taxa.  Fig. 2A and fig. S7. PCA plots are shown for the Indian cohort (panels A and B) and the Peruvian cohort (panels D and E). Comparing panels C (India) and F (Peru) with Fig. 2B reveals similar temporal patterns of change in the fractional representation of ecogroup taxa in healthy members of the Indian, Peruvian, and Bangladeshi birth cohorts.

fig. S11. SparCC-based analyses. (A)
The SparCC computational workflow was applied to the Bangladeshi birth cohort (see Supplementary Results for details). Monthly covariance matrices generated for postnatal months 20 to 60 were averaged and the resulting covariance matrix was subject to PCA. Taxon projection along PC1 as computed using the SparCC workflow and temporally conserved covariance is plotted. (B) The 15 taxa that project most significantly onto PC1 computed by SparCC were used to analyze the temporal pattern of gut microbiota development analogous to that described in Fig.  2A. For postnatal months 4, 10, 20, 40 and 60, fecal microbiota are plotted on a PCA space to illustrate temporal changes in the community. (C) The 15 taxa that most significantly project onto PC1 computed by SparCC are used to ordinate a PCA plot, analogous to the one shown in Fig. 3A in order to compare the fecal microbiota of children with SAM and MAM prior to and after administration of therapeutic foods. Abbreviation; d/c, discharge from in-hospital nutritional rehabilitation unit.
fig. S12. SPIEC-EASI-based analysis. The SPIEC-EASI computational workflow was applied to the 5year healthy Bangladeshi birth cohort. Monthly taxon-taxon interaction matrices were generated for postnatal months 20 to 60 and averaged. The resulting temporally-averaged interaction matrix was subject to PCA. (A) Eigenspectrum of the temporally averaged interaction matrix. Two principal components capture 39% of the data variance. (B) Two P. copri OTUs and two Prevotella OTUs contribute significantly to taxon projections along PC1 and PC2; these two P. copri strains and one of the two Prevotella OTUs are also ecogroup taxa. (C) The 15 taxa that most significantly project onto PC1 computed by SPIEC-EASI were used to analyze the temporal pattern of gut microbiota development in a a manner analogous to the approach described in Fig. 2A. (D) The 15 taxa that most significantly project onto PC1 were used to ordinate a PCA plot in a manner analogous to