Supplementary Materials

Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity

Alexandra Zhernakova, Alexander Kurilshikov, Marc Jan Bonder, Ettje F. Tigchelaar, Melanie Schirmer, Tommi Vatanen, Zlatan Mujagic, Arnau Vich Vila, Gwen Falony, Sara Vieira-Silva, Jun Wang, Floris Imhann, Eelke Brandsma, Soesma A. Jankipersadsing, Marie Joossens, Maria Carmen Cenit, Patrick Deelen, Morris A. Swertz, LifeLines cohort study, Rinse K. Weersma, Edith J. M. Feskens, Mihai G. Netea, Dirk Gevers, Daisy Jonkers, Lude Franke, Yurii S. Aulchenko, Curtis Huttenhower, Jeroen Raes, Marten H. Hofker, Ramnik J. Xavier, Cisca Wijmenga, Jingyuan Fu

Materials/Methods, Supplementary Text, Tables, Figures, and/or References

  • Materials and Methods
  • Figs. S1 to S13
  • Captions for Additional Data tables S1 to S19
  • References (31–54)
Data Tables S1 to S19
Table S1. Description and summary statistics of 207 factors. We selected 207 factors and categorized them into 5 groups: 41 intrinsic factors, 39 diseases, 5 smoking factors, 78 dietary factors, and 44 drug categories. The data type can be continuous (e.g. age, BMI, blood lipids), binary (e.g. gender, disease status, medication), or categorical (e.g. food intake frequency). For continuous traits, we tested the normality of the distribution and performed log-transformation if necessary. A brief description of each factor and the summary statistics for males, females, and all samples are provided. The gender difference was tested using the t-test for continuous data, the Chi-squared test for binary data, and Wilcoxon’s test for categorical data. If the data were log10-transformed, the summary statistics for the transformed data are also provided, and the gender difference was tested on the transformed data.

Table S2. Pairwise Spearman correlation coefficients of 207 factors. To understand the correlation structure of our 207 factors, we computed pairwise Spearman correlations. (A): The pairwise Spearman correlation coefficients. (B): The P values of the pairwise Spearman correlation analysis. (C): The P values adjusted for multiple tests using the Benjamini and Hochberg method.
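
For readers unfamiliar with the adjustment used for panel (C) and for the p.adj columns in later tables, the Benjamini and Hochberg procedure can be sketched as follows. This is a minimal illustration, not the authors' code; the function name benjamini_hochberg and the example P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(p_values)
    # Indices of the P values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that the adjusted values
    # never increase as the raw P values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw P values for four tests.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw P value is multiplied by the number of tests and divided by its rank, and the result is capped at 1 and made monotone.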

Table S3. The predicted abundance of 632 species. A total of 632 species were predicted in the fecal samples from the LifeLines-DEEP cohort. Together, these 632 species accounted for 99.98% of the microbial composition. For each species, the table shows the mean proportion (mean (%)), the standard deviation (SD), the number of subjects in which the species was present (# presented samples), and whether the species was tested for single species association. A total of 170 species were selected for single species association. Each accounted for at least 0.01% of the microbial composition and was present in more than 10 participants. These 170 species accounted for an average of 99.3% of the predicted microbial composition.
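
The selection rule for the 170 species can be illustrated with a short sketch. It assumes that "at least 0.01% of the microbial composition" refers to the mean proportion across samples, which the caption does not state explicitly; the function name select_species, its parameters, and the toy data are hypothetical.

```python
def select_species(abundance, min_mean_pct=0.01, min_present=10):
    """Pick species that pass both abundance and prevalence thresholds.

    abundance maps a species name to its per-sample relative
    abundances, expressed in percent (assumed input layout).
    """
    selected = []
    for species, values in abundance.items():
        mean_pct = sum(values) / len(values)
        # A species counts as present in a sample if its abundance is non-zero.
        n_present = sum(1 for v in values if v > 0)
        # Mean of at least 0.01% and present in MORE than 10 participants.
        if mean_pct >= min_mean_pct and n_present > min_present:
            selected.append(species)
    return selected


if __name__ == "__main__":
    toy = {
        "common_species": [1.0] * 12,           # passes both thresholds
        "rare_species": [1.0] * 5 + [0.0] * 7,  # present in only 5 samples
        "low_species": [0.001] * 12,            # mean below 0.01%
    }
    print(select_species(toy))
```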

Table S4. Association of 207 factors with Bray-Curtis distance. The association of each factor with inter-individual distance in microbial composition (Bray-Curtis distance) was assessed using the adonis function from the R package vegan. The summary statistics of the adonis analysis are given in the table, including degrees of freedom (Df), sequential sums of squares (SumOfSqs), mean squares (MeanSqs), F statistics (F.Model), partial R-squared (R2), P values from 10,000 permutations (Pr.F.), and the P value adjusted for 207 factors using the Benjamini and Hochberg method (p.adj).
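
As a reminder of what the distance measures, the Bray-Curtis dissimilarity between two samples is the sum of absolute abundance differences divided by the total abundance of both samples. The sketch below computes only this distance for one pair of samples; it does not reproduce the adonis permutation test or any abundance transformation, and the function name bray_curtis is hypothetical.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors.

    x and y hold the abundances of the same species, in the same order.
    Returns 0.0 for identical samples and 1.0 for samples sharing no species.
    """
    if len(x) != len(y):
        raise ValueError("abundance vectors must have the same length")
    total = sum(x) + sum(y)
    if total == 0:
        # Two empty samples: treat them as identical (a convention chosen here).
        return 0.0
    return sum(abs(a - b) for a, b in zip(x, y)) / total


if __name__ == "__main__":
    # Hypothetical abundances of three species in two individuals.
    print(bray_curtis([6, 7, 4], [10, 0, 6]))
```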

Table S5. Replication of the association with Bray-Curtis distance on 16S data. For the same set of subjects, 16S rRNA gene sequencing data are available. The Bray-Curtis distance was calculated on the transformed abundance levels of 1,155 OTUs, and the associations with Bray-Curtis distance were validated on these 16S rRNA gene sequencing data. At FDR<0.1, 72% of associations were replicated in 16S rRNA data from the same subjects.

Table S6. Association with Shannon’s diversity index. The Spearman correlation was computed between each factor and Shannon’s diversity index. The table summarizes the Spearman correlation coefficient (Spearman.r), P value, and P value adjusted for multiple tests using the Benjamini and Hochberg method (p.adj).
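
For reference, Shannon’s diversity index of a sample is H = -sum(p_i * ln(p_i)), where p_i is the relative abundance of species i. The sketch below uses the natural logarithm, which is an assumption since the caption does not state the base; the function name shannon_index is hypothetical.

```python
import math


def shannon_index(abundances):
    """Shannon's diversity index (natural log) of one sample."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("sample has no non-zero abundances")
    h = 0.0
    for a in abundances:
        if a > 0:  # absent species contribute nothing to the index
            p = a / total
            h -= p * math.log(p)
    return h


if __name__ == "__main__":
    # Four equally abundant species give the maximum for four species, ln(4).
    print(shannon_index([1, 1, 1, 1]), math.log(4))
```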

Table S7. Replication of the association with Shannon’s diversity index on 16S sequencing data. For the same set of subjects, 16S rRNA data are available. Shannon’s diversity index was calculated on the abundance levels of 1,155 OTUs, and the associations with Shannon’s diversity index were validated on these 16S rRNA gene sequencing data. At FDR<0.1, 80% of associations were replicated in 16S rRNA data from the same subjects.

Table S8. Association with gene richness. The Spearman correlation was computed between each factor and gene richness. The table summarizes the Spearman correlation coefficient (Spearman.r), P value, and P value adjusted for multiple tests using the Benjamini and Hochberg method (p.adj).

Table S9. Association with COG richness. The Spearman correlation was computed between each factor and COG richness. The table summarizes the Spearman correlation coefficient (Spearman.r), P value, and P value adjusted for multiple tests using the Benjamini and Hochberg method (p.adj).

Table S10. Replication rate in 16S rRNA gene sequencing dataset. This table summarizes the replication rate between the MGS and 16S data, assessed at different significance levels. At adjusted P <0.1, 90% of the Bray-Curtis associations and 83% of the associations with diversity were replicated in the 16S rRNA data. At adjusted P <0.05, 94% of the Bray-Curtis associations and 100% of the associations with diversity were replicated in the 16S rRNA data.

Table S11. The significant associations with individual species (FDR<0.1). Multivariate analysis was performed using MaAsLin; age, gender, and sequence depth were included as covariates. The table summarizes the factor, species, association coefficient, total number of samples, number of non-zero subjects, association P value (P-value), and Q-value adjusted for the total number of tests for 207 factors and 170 species.

Table S12. The significant associations with individual MetaCyc pathways (FDR<0.1). Multivariate analysis was performed using MaAsLin; age, gender, and sequence depth were included as covariates. The table summarizes the factor, pathway, association coefficients, total number of samples (N), number of non-zero subjects (N of non-zero), association P value (P-value), and Q-value adjusted for the total number of tests for 207 factors and 215 MetaCyc pathways.

Table S13. Association of species (FDR<0.1) after correcting for all potentially confounding effects. Multivariate analysis was performed using MaAsLin. To address the dependence of data and to reveal the most dominant effect, we included a boosting step in MaAsLin. This step selected all potential cofactors as covariates in the model so that the reported associations were corrected for the influence of other factors. The table summarizes the factor, species, association coefficients, total number of samples (N), number of non-zero subjects (N of non-zero), association P value (P-value), and Q-value adjusted for the total number of tests for 207 factors and 170 species.

Table S14. Association of MetaCyc pathways (FDR<0.1) after correcting for all potentially confounding effects. Multivariate analysis was performed using MaAsLin. To address the dependence of data and to reveal the most dominant effect, we included a boosting step in MaAsLin. This step selected all potential cofactors as covariates in the model so that the reported associations were corrected for the influence of other factors. The table summarizes the factor, pathway, association coefficients, total number of samples (N), number of non-zero subjects (N of non-zero), association P value (P-value), and Q-value adjusted for the total number of tests for 207 factors and 215 pathways.

Table S15. The selection of SNPs cis-affecting CHGA (encoding CgA) gene expression. Seven cis-eQTL SNPs were extracted from the literature. The table summarizes the chromosome (Chr), base pair position (Loc), source, minor allele frequency (MAF), whether the SNP passed quality control (PassesQC), and the reference link (Ref Link). One SNP, rs9658667, failed quality control. Thus, six SNPs were included in further analysis.

Table S16. The association of CHGA cis-acting SNPs with fecal CgA level. We tested the association between fecal CgA level and six cis-eQTL SNPs of CHGA. No significant association was detected at FDR<0.1.

Table S17. The association of CHGA cis-acting SNPs with 170 species. We tested the association between 170 species and six cis-eQTL SNPs of CHGA. No significant association was detected at FDR<0.1.

Table S18. Spearman correlations of species abundance with species richness. The Spearman correlation was computed between the abundance of each species and the total number of species in each individual (species richness). The table summarizes the Spearman correlation coefficient (Spearman.r), P value, and P value adjusted for multiple tests using the Benjamini and Hochberg method (p.adj).
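
The quantity reported in this table can be sketched as follows: species richness is the count of species with non-zero abundance in an individual, and the Spearman coefficient is the Pearson correlation of the ranks (ties receive average ranks). This is an illustrative pure-Python version under those assumptions, not the authors' pipeline, and it does not compute the P value; the names average_ranks, spearman_r, and species_richness are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(vx * vy)


def species_richness(sample):
    """Number of species with non-zero abundance in one individual."""
    return sum(1 for a in sample if a > 0)


if __name__ == "__main__":
    # Hypothetical data: rows are individuals, columns are species.
    samples = [[5, 0, 0], [3, 2, 0], [1, 1, 1], [4, 0, 1]]
    richness = [species_richness(s) for s in samples]
    first_species = [s[0] for s in samples]
    print(spearman_r(first_species, richness))
```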

Table S19. Drug ATC codes used for 44 drug groups. To group the drugs based on their function, we used Anatomical Therapeutic Chemical (ATC) codes, which are summarized in this table.