Research Article

A metagenomic strategy for harnessing the chemical repertoire of the human microbiome

See allHide authors and affiliations

Science  13 Dec 2019:
Vol. 366, Issue 6471, eaax9176
DOI: 10.1126/science.aax9176

Prospecting for drugs in the microbiome

The microbiome is an important source of natural products that can profoundly influence health and disease in the host. Sugimoto et al. constructed a modular, probabilistic strategy called MetaBGC to uncover biosynthetic gene clusters (BGCs) in human microbiome samples (see the Perspective by Henke and Clardy). The authors found geographic and strain-specific distributions of BGCs. By zeroing in on two type II aromatic polyketides, the native organisms were identified, the BGCs were reconstructed in Streptomyces, and the products were characterized. When expressed in Bacillus subtilis, the products resembled currently used anticancer drugs and antibiotics. These polyketides were not cytotoxic but had inhibitory activity against oral Gram-positive bacteria, which may reflect the niche and ecology of the originating organisms.

Science, this issue p. eaax9176; see also p. 1309

Structured Abstract


The human microbiome has been correlated with several health and disease conditions, but the molecular mechanisms underlying these correlations remain largely unexplored. Biologically active small molecules that are produced by the human microbiome offer an important route for exploring these mechanisms because they often mediate important microbe-microbe and microbe-host interactions. In bacterial genomes, small-molecule biosynthetic genes are usually encoded in distinct clusters known as biosynthetic gene clusters (BGCs), which enables scientists to use computational tools for recognizing them and predicting their products. Here, we present a hybrid strategy that uses computational and synthetic biology tools for discovering microbiome-encoded small molecules.


Previous efforts to discover small-molecule BGCs from the human microbiome relied mainly on analyzing genomic data of sequenced bacterial isolates. Although this approach has revealed the enormous and largely untapped diversity of microbiome-encoded BGCs, it fails to report on the biosynthetic potential of members of the human microbiome that have not yet been cultured or isolated: the majority of species in metagenomic sequencing data. Therefore, we sought to develop a computational algorithm that discovers small-molecule BGCs directly in complex metagenomic sequencing data of the human microbiome: metagenomic identifier of biosynthetic gene clusters (MetaBGC). First, high-performance probabilistic models for identifying homologs of a biosynthetic enzyme of interest are built specifically for use with complex metagenomic datasets (MetaBGC-Build). Next, these models are used to identify biosynthetic genes in thousands of metagenomic datasets of the human microbiome at the single-read level (MetaBGC-Identify). Finally, identified biosynthetic reads are quantified in the entire cohort of samples (MetaBGC-Quantify) and clustered into biosynthetic read bins on the basis of their abundance profiles across samples (MetaBGC-Cluster). To evaluate the utility of this approach, we used it to discover BGCs for type II polyketides, a clinically relevant class of small molecules, directly from metagenomic sequencing data of the human microbiome.


We applied MetaBGC to 3203 metagenomic samples of the human microbiome originating from Western (subjects from the United States, Spain, and Denmark) and non-Western (subjects from China and Fiji) populations and from every major human body site (gut, mouth, skin, and vagina). Overall, we discovered 13 complete BGCs that potentially encode type II polyketides; eight of these were encoded by diverse bacterial isolates of the human microbiome in a strain-specific manner and five could not be assigned to any sequenced species. Type II polyketide BGCs are found in three major human body sites, gut, mouth, and skin, and at least six of them are transcribed under host-colonization conditions and widely distributed in different human populations (e.g., 46% of healthy subjects from the United States encode at least one BGC in their gut, oral, or skin microbiome). Next, we selected two of the identified BGCs for experimental characterization, one from the oral microbiome and another from the gut microbiome. We used a synthetic biology strategy in which metagenomically discovered BGCs are genetically engineered and expressed in various heterologous hosts without the need for cultivation of the native producer. Using this strategy, we successfully purified and solved the structures of five new type II polyketide molecules as the products of the two characterized BGCs. Finally, we show that two of the discovered molecules exert strong antibacterial activities against members of the human microbiome that occupy the same niche as their producer, implying a possible role in microbe-microbe competition.


We developed a hybrid strategy that combines computational and experimental techniques for discovering and characterizing small-molecule BGCs directly from complex datasets of the human microbiome. Using this strategy, we discovered that a clinically relevant class of molecules, type II polyketides, are widely encoded in the human microbiome and that human microbiome–derived polyketides resemble in structure and biological activity clinically used ones. Our approach is generally applicable to other classes of small molecules and can be used to systematically unveil the chemical potential of the human microbiome—a goal that is useful for both mechanistic microbiome explorations and drug discovery.

Small-molecule discovery from the human microbiome.

A hybrid computational and synthetic biology approach was developed to discover small molecules encoded by the human microbiome. Biosynthetic gene clusters were discovered at the single metagenomic read level from thousands of samples using a new algorithm, MetaBGC, and expressed in a synthetic biology platform to yield previously unknown small molecules.


Extensive progress has been made in determining the effects of the microbiome on human physiology and disease, but the underlying molecules and mechanisms governing these effects remain largely unexplored. Here, we combine a new computational algorithm with synthetic biology to access biologically active small molecules encoded directly in human microbiome–derived metagenomic sequencing data. We discover that members of a clinically used class of molecules are widely encoded in the human microbiome and that they exert potent antibacterial activities against neighboring microbes, implying a possible role in niche competition and host defense. Our approach paves the way toward a systematic unveiling of the chemical repertoire encoded by the human microbiome and provides a generalizable platform for discovering molecular mediators of microbiome-host and microbiome-microbiome interactions.

The human microbiome harbors thousands of bacterial species, varies in composition between different sites of the human body and between individuals, and has been correlated with several diseases (15). Recently, scientists have begun to examine these correlations at a mechanistic level, often by identifying and characterizing microbiome-derived small molecules that are responsible for a specific phenotype (6, 7). These molecules may mediate their effects directly, by targeting human cells or receptors (microbe-host interactions) (814), or indirectly, by affecting other members of the microbiome in a competitive or collaborative manner (microbe-microbe interactions) (1518). With the importance of microbiome-derived small molecules becoming apparent, there is a dire need to develop systematic approaches for discovering and characterizing them.

We previously undertook a systematic approach to describe the general capacity of the microbiome to produce small molecules (15) by identifying the biosynthetic gene cluster (BGC) repertoire encoded in genomes of thousands of bacterial strains isolated originally from humans and characterizing the structure and biological activity of a subset of their products (9, 15). Although this approach revealed an enormous and largely unexplored capacity of the microbiome to produce small molecules, it suffered from one major limitation: Its starting point relied on analyzing assembled genomes from readily cultured and sequenced isolates of the human microbiome. Therefore, this approach misses BGCs encoded in sequenced clinical samples but not in genomes of previously isolated bacteria. This limitation is noteworthy for three main reasons.

First, most large culturing efforts have focused on human gut microbiome samples, especially those from healthy and Western populations, resulting in a relatively limited representation of reference genomes from other human body sites and from diseased and non-Western cohorts (1922). By contrast, thousands of metagenomic sequencing datasets have been produced from all human body sites (1, 2), for cohorts of several human diseases (35, 2325), and from non-Western populations (26, 27). Second, recent metagenomic binning studies revealed thousands of new bacterial species in the human microbiome, representing rare, not yet cultivated or not yet sequenced members (2830). These new species expanded not only the taxonomical space of the human microbiome, but also its functional capacity beyond what has been previously observed in the genomes of cultured isolates. Finally, BGCs are commonly encoded on mobile elements as strain-specific traits (15) and capturing them would require deep sampling within single species, which is not often the goal in culturing efforts. It is evident that our previous approach, which relied solely on discovering BGCs from genomes of cultured isolates, is limited in its ability to fully represent the biosynthetic potential of the collective human microbiome.

A read-based algorithm for the detection of BGCs in metagenomic data

Realizing the above limitation, we set out to develop an algorithm that allows for the identification of small-molecule BGCs directly in human microbiome–derived metagenomic sequencing data without the need for bacterial cultivation or sequencing; we called this algorithm MetaBGC (metagenomic identifier of biosynthetic gene clusters). Our algorithm needed to fulfill three main criteria: (i) be able to detect BGCs de novo from metagenomic sequencing data without any prior knowledge of bacterial species in the sample and therefore show no bias toward sequenced isolates; (ii) be able to detect new BGCs at the single metagenomic read level [~100 base pairs (bp)] and therefore show no bias toward abundant or easy-to-assemble species that often dominate the metagenomic assemblies; and (iii) be computationally efficient without sacrificing sensitivity so that it can be used to profile thousands of metagenomic samples simultaneously.

A simple and fast method for detecting homologs of a given protein family is through profile hidden Markov models (pHMMs) (31, 32). In pHMMs, the probability of finding a given amino acid, insertion, or deletion is calculated at each position of an alignment of interest (training set) and then used to construct a probability profile. New sequences (search set) are then scored on the basis of their fit to that profile. pHMMs are typically constructed from full-length proteins in the training set, whereas most Illumina-based metagenomic reads are ~100 bp in length (~33 amino acids). When used in a search set, these short sequences are aligned locally to different regions of their respective pHMMs. Depending on the sequence complexity and conservation at these local regions, varying specificities and sensitivities may be obtained across a given pHMM. Therefore, we sought to adapt pHMMs for use in metagenomic applications by developing what we refer to here as segmented pHMMs (spHMMs). spHMMs are built on 30–amino acid segments of aligned, full-length protein homologs, resulting in probability profiles with lengths that match those of the sequences in the search set. Because of the small size of each segment, spHMMs are able to distinguish regions of the aligned proteins that are of high complexity from repetitive or low-complexity ones that are commonly found in sequences, and regions of high conservation among homologs from ones that are more variable. This differentiation would result in models of varying performances depending on their intervals (Fig. 1A). Models are evaluated using a synthetic dataset and spHMMs that perform poorly (with a high false-positive, high false-negative, or low true-positive rate) are eliminated. This first module of the algorithm is named MetaBGC-Build. Only high-performance spHMMs are selected for the next step and used to search metagenomic reads from clinical samples (MetaBGC-Identify).

Fig. 1 Overview of MetaBGC.

(A) The first step of MetaBGC is the development of high-performance spHMMs that are specific for a given class of BGCs (MetaBGC-Build). Homologs of the protein family of interest are aligned and the alignment is segmented into 30–amino acid fragments with a 10–amino acid window shift. spHMMs are built using the segmented alignments, which are then evaluated using synthetic metagenomes composed of a predominantly negative background and spiked with positive BGCs. spHMMs with F1 scores ≥0.5 form the basis of MetaBGC-Identify and those with F1 scores <0.5 are excluded. (B) The remaining three steps of MetaBGC are as follows: (i) MetaBGC-identify detects biosynthetic reads from complex metagenomic datasets of the human microbiome using spHMMs [as described in (A)]; (ii) unique (nonredundant) biosynthetic reads are then quantified in all samples (MetaBGC-Quantify) and an abundance profile is generated for each read; and (iii) nonredundant biosynthetic reads are finally clustered on the basis of their abundance profiles (MetaBGC-Cluster) to produce read “bins” that originate from specific BGCs. For example, biosynthetic reads colored in red or blue are detected from 10 metagenomic samples, quantified in the same samples, and finally clustered to produce two distinct bins (one red and one blue) that originate from bgc1 or bgc2, respectively. Nonbiosynthetic reads are colored in gray.

Reads identified by the selected spHMMs (scored above a defined threshold) are then deemed “biosynthetic” and passed on to a third module of the algorithm: MetaBGC-Quantify (Fig. 1B). In this module, biosynthetic reads are de-replicated and quantified in all samples of the entire cohort(s) and an abundance matrix is generated for all nonredundant reads against all samples. Because reads that originate from a single BGC should have an even coverage across metagenomic samples, we devised a clustering strategy to produce “bins” of identified reads on the basis of their abundance profiles in different samples. In this strategy, we use DBSCAN (density-based spatial clustering of applications with noise) (33) to cluster reads with similar abundance profiles across different metagenomic samples into distinct bins on the basis of their pairwise Pearson correlation distance (see the supplementary materials). This fourth module is called MetaBGC-Cluster. This final strategy not only decreases the total number of hits that need to be analyzed, but also presents an opportunity for a targeted assembly of the reads that originate from the same BGC and end up in the same bin. In summary, our spHMM-based algorithm identifies, quantifies, and clusters microbiome-derived BGCs at the single metagenomic read level (Fig. 1B).

Test case for MetaBGC: Type II polyketide synthase BGCs

To evaluate the utility of this approach, we focused on one class of small-molecule BGCs that has never been reported from members of the human microbiome: Type II polyketide synthases (TII-PKS BGCs). TII-PKS BGCs are relatively uncommon in bacterial genomes and almost always encode small molecules with interesting biological activities (including the clinically used anticancer drug doxorubicin and the clinically used antibiotic drug tetracycline) (34, 35). To first test whether TII-PKS BGCs can be identified at all in human metagenomic sequencing data, we performed de novo metagenomic assemblies on 759 samples from the Human Microbiome Project 1 (HMP-1-1) (1) and used a common BGC identification tool, antiSMASH, to detect TII-PKS BGCs in assembled scaffolds >5000 bp (36). Using this strategy, we identified six new TII-PKS BGCs in three body sites (mouth, gut, and skin) (bgc1 to bgc6; see below), which indicated that this class of molecules is indeed encoded in human-derived metagenomes despite not having been reported from common isolates of the human microbiome. Motivated by this finding, we proceeded toward adapting our MetaBGC strategy for the systematic discovery and quantification of TII-PKS BGCs directly in human-derived metagenomic sequencing data, at the single-read level and without relying on metagenomic assemblies. This strategy would largely eliminate the contingency of discovery success on the abundance of the encoding organism or the ability to properly assemble its genome from complex metagenomic samples.

Four essential enzymes are universally present in TII-PKS BGCs: two ketosynthases (KSα and KSβ), which are responsible for the elongation of the polyketide chain and chain length determination, respectively; an acyl carrier protein (ACP), also known as a thiolation domain (T), on which the growing chain is attached through a thioester bond; and at least one of four types of cyclases or aromatases (OxyN, TcmN, TcmI, and TcmJ types), which are responsible for the final cyclization/aromatization of the polyketide chain through a series of aldol condensation reactions (Fig. 2A) (35, 37). Although KS and ACP domains exist in other BGCs (e.g., iterative, TI-PKS, and TIII-PKS BGCs), cyclase domains are relatively specific to TII-PKS BGCs and can be used as a proxy for identifying them using MetaBGC (37). Therefore, we aligned selected, diverse homologs of each of the four types of TII-PKS cyclases and built spHMMs for all intervals at a window size of 30 amino acids and a window shift of 10 amino acids (see the supplementary materials, data table S1, and figs. S1 to S5).

Fig. 2 Using MetaBGC to detect TII-PKS BGCs in synthetic metagenomic samples.

(A) Typical organization of a TII-PKS BGC composed of genes encoding for two ketosynthases (KSα and KSβ) responsible for the elongation and chain-length determination of the growing polyketide chain, respectively; a thiolation domain (T), on which the growing chain is attached; and at least one type of cyclase/aromatase (Cyc), which is responsible for the cyclization of the linear chain. Through additional tailoring reactions, more elaborate structures are formed, such as the clinically used antibiotic tetracycline. (B) Four types of cyclases or aromatases are specifically found in TII-PKS BGCs and are chosen as protein families for MetaBGC. For each of the four protein families, segmented alignments corresponding to 30–amino acid intervals are used to build spHMMs. Each spHMM is then evaluated using synthetic metagenomes designed to simulate several expected compositions in human microbiome samples (synthetic dataset 1). The F1 score (y axis) of each spHMM per interval (x axis) is calculated, which is indicative of its accuracy (considering both precision and recall). Values shown represent the maximum F1 score after testing various spHMM score thresholds (fig. S8). Note that some spHMMs have very low F1 scores, whereas others have high ones (only spHMMs with F1 scores ≥0.5 were included in MetaBGC-Identify). The F1 score of each unsegmented cyclase pHMM is shown in green. (C) Stacked bar graph showing 11 bins (>50 nonredundant biosynthetic reads each) that MetaBGC produced when applied to the 140 synthetic metagenomes simulated in this study (synthetic dataset 1). Colors indicate the types of cyclases that the biosynthetic reads belong to within each bin. Blue indicates false-positive reads (FP). The heatmap on top indicates the number of BGCs represented per bin and the number of genomes from which these BGCs originate. Most bins represent a single BGC, and when two BGCs are represented by the same bin, they are expectedly derived from the same spiked genome. Results from synthetic dataset 2 can be found in figs. S12 to S14.

To evaluate the performance of MetaBGC, we applied it to a carefully designed synthetic metagenomic dataset. We simulated 140 metagenomic samples containing 42 (low-diversity) or 126 (high-diversity) human microbiome–derived genomes that contain no TII-PKS BGCs. We then spiked in simulated reads from 10 diverse bacterial genomes that harbor in total 13 TII-PKS BGCs, none of which was part of the spHMM training sets (see the supplementary materials, fig. S6, and data table S2). Overall, this synthetic dataset (synthetic dataset 1) was constructed to simulate several conditions that are expected in human microbiome samples: very low abundance of a given BGC (~1× coverage), medium abundance (~10× coverage), several BGCs per sample, and no BGCs per sample (fig. S7). We then computed the F1 scores (harmonic mean between precision and recall) to individually evaluate each spHMM of all four cyclases. As expected, MetaBGC-Build revealed two types of spHMMs for each alignment: high-performance spHMMs (F1 scores ≥0.5) and low-performance spHMMs (F1 scores <0.5) (Fig. 2B). After eliminating low-performance models and tuning spHMM scores of the high-performance ones (fig. S8), we reached a final set of 40 spHMMs to be used in the MetaBGC-Identify module. Moreover, as expected, the number of reads detected by these models for a given cyclase positively correlates with the coverage of the spiked genomes harboring the same cyclase (fig. S9). In summary, we used synthetic metagenomic data to evaluate the performance of each of the cyclase spHMMs in MetaBGC and to select and tune the best-performing ones for next steps.

To evaluate the remaining components of MetaBGC, we subjected the identified “biosynthetic reads” from the 40 high-performance models to the quantification and clustering modules and analyzed the resulting bins for composition. In total, MetaBGC produced 11 bins with >50 nonredundant biosynthetic reads (range 51 to 616 reads, average 278 reads per bin), 10 of which were true-positive bins, and the smallest of them (bin 6, 51 reads) contained only false-positive reads (fig. S10). Collectively, the 10 true-positive bins harbored reads from all 37 spiked cyclases (100% cyclase recovery rate). Next, we investigated how many BGCs were represented in each of the 10 true-positive bins. Seven out of the 10 bins corresponded to reads that originated exclusively from one of seven nonredundant BGCs, whereas the remaining three bins harbored reads from two BGCs each (Fig. 2C). Each pair of BGCs that shared a bin was encoded in the same spiked genome and thus had the same coverage and representation in the metagenomic dataset and was clustered together by MetaBGC-Cluster. In conclusion, MetaBGC recovered reads from all cyclases of all spiked BGCs and correctly clustered them into their BGC- and genome-specific bins. Moreover, MetaBGC produced only 4% false-positive reads after clustering, 40% of which were binned together into a single, easily identifiable false-positive bin (Fig. 2C).

To test whether a wider distribution of coverages for the TII-PKS–positive genomes would affect the performance of MetaBGC, we simulated a new synthetic dataset (synthetic dataset 2) in which the proportion of each TII-PKS–positive and -negative genome in a given sample is sampled from a log-normal distribution and then normalized such that the sum of proportions equals one (data table S2 and fig. S11). Despite a wide range of coverages for the TII-PKS–positive genomes (0 to >1000×) in this new dataset, MetaBGC performed in the same manner as before: overall, it produced 12 bins with >50 nonredundant biosynthetic reads in each (10 true- and two false-positive ones), 95% true-positive reads after binning, 100% cyclase recovery rate, and 100% correct binning of cyclase reads into their corresponding BGCs and genomes (figs. S12 to S14). These promising results illustrate the validity of MetaBGC for the identification, quantification, and clustering of TII-PKS BGCs in human-derived metagenomes.

Tuning of MetaBGC using metagenomic data from three large cohorts

After evaluating MetaBGC on simulated datasets of the human microbiome, we sought to tune its performance on clinically derived metagenomic data. We applied it to 2544 human-derived metagenomic samples from three large studies [the Human Microbiome Project (HMP-1-1 and HMP-1-2) and the Metagenomics of the Human Intestinal Tract Consortium, (MetaHIT) (1, 2, 38)]. These samples originated from all major human body sites (skin and airways, 293 samples; gut, 872 samples; vagina, 215 samples; and mouth, 1164 samples) and collectively harbored 1.09 × 1011 reads. By comparing MetaBGC results with the National Center for Biotechnology Information (NCBI) nonredundant protein sequence database, 10 of our spHMM score cutoffs were further fine-tuned to eliminate false-positive hits resulting from genomes commonly observed in the metagenomic datasets but not initially included in our synthetic dataset, and one model was eliminated (see the supplementary materials and data table S1). Our final, most tuned algorithm detected 18 × 103 reads (1.66 × 10−7 hit rate), which clustered into 19 bins of at least 10 nonredundant biosynthetic reads (data tables S3 to S5).

As expected, six of these bins mapped to our initial set of six BGCs (bgc1 to bgc6) and 13 appeared to originate from new ones (Fig. 3 and figs. S15 to S17). Only one bin, bin-W6, appeared to harbor two groups of reads with different abundance profiles (the first group was found on average in 72 stool samples, whereas the second was found only in two stool samples), which were easily separated into bins W6a and W6b using more stringent clustering parameters (Fig. 3A, supplementary materials, and data table S5). Various assembly approaches on 14 selected samples that have the highest abundance of the 14 new bins provided 10 complete BGCs and four partial ones. Of the new, complete ones, six were clearly TII-PKS BGCs that have not been previously described (Fig. 4 and data table S6). Four of the six newly discovered TII-PKS BGCs (bgc9 to bgc12) could not be identified by antiSMASH (36), even after fully assembling them, further highlighting the sensitivity of our read-based discovery approach. For all complete TII-PKS BGCs, whenever a cyclase read was detected by an spHMM for a given interval, this read was mapped correctly to the same interval in the fully assembled cyclase gene (see the supplementary materials, data tables S7 to S10, and fig. S18). The remaining six completely assembled bins do not represent TII-PKS BGCs but instead represent several new architectures of OxyN-containing BGCs with no previously characterized function (named here: nonpolyketide cyclase BGCs or NPCs) (fig. S15).

Fig. 3 Testing MetaBGC on human microbiome–derived metagenomic samples from three large cohorts.

(A) MetaBGC results from profiling 2544 metagenomic samples of three cohorts (HMP-1-1, HMP-1-2, and MetaHIT). A selection of bins (n = 13) corresponding to TII-PKS BGCs is shown, for which each point represents a nonredundant biosynthetic read that was detected by MetaBGC. For all other bins, see fig. S16. The color, shape, and size of each point indicate the type of cyclase to which each read belongs, the body site from which each read originated, and the number of metagenomic samples where each read was detected following the key on the right. All bins harbor biosynthetic reads from one BGC. Bins W6a and W6b were separated using a refined clustering strategy (see supplementary materials). (B) Heatmap indicating the abundance of each of 11 bins that correspond to complete TII-PKS BGCs (data table S6) in 380 metagenomic samples. For a similar heatmap of all other bins, see fig. S17. Bin abundance is calculated as the sum of abundances for all biosynthetic reads in a given bin (in log10) and is indicated on a gradient color scale as shown above (with a minimum bin abundance of 10 reads). The body site and cohort that each sample belongs to are indicated following the color code on the left. The bar graph shown at the bottom indicates the prevalence of each corresponding biosynthetic bin in subjects from the three cohorts analyzed here. Percentages are shown out of the total number of subjects with available samples at a given body site (mouth, 219 subjects; gut, 646 subjects; skin, 161 subjects). (C) Venn diagram indicating the number of HMP-1-1 and HMP-1-2 subjects that are positive for complete TII-PKS BGCs in one, two, or three body sites (with a minimum bin abundance of 10 reads). (D) Longitudinal analysis of HMP-1-1 and HMP-1-2 subjects. Samples from up to three visits of the same subject are deemed TII-PKS–positive (dot, with a minimum bin abundance of 10 reads) or TII-PKS–negative (no dot). Red dots indicate that the subject had the same TII-PKS bin at subsequent visits (longitudinal consistency); black dots indicate that the subject had a TII-PKS bin at only one visit. X indicates visits for which no metagenomic data are available.

Fig. 4 Genetic organization of human-derived TII-PKS BGCs.

Colored arrows in each BGC indicate genes following the color code on top. Typical TII-PKS biosynthetic domains (KSα, KSβ, T, and the four possible types of Cyc) are shown to the right of each BGC. The main body site where each BGC was found, the subject cohort(s) in which it was discovered, and the bacterial species with the closest match to it are also indicated. Note that several BGCs are only found in metagenomic data and have no species designation. For more details, see data table S6.

TII-PKS BGCs are widespread in both Western and non-Western cohorts

We then used MetaBGC-Quantify to study the representation of all TII-PKS bins that produced complete BGCs in the 2544 human samples (660 subjects). With a cutoff bin abundance of 10 reads per sample (see the supplementary materials), we detected at least one TII-PKS BGC in 122 (46%) of HMP-1-2 and HMP-1-2 subjects at any body site (total of 265 subjects). Of these TII-PKS–positive subjects, 93 (76.2%) harbored TII-PKS-BGCs in the oral, gut, or skin microbiome, whereas 26 (21.3%) and three (2.5%) harbored them at two or all three body sites, respectively (Fig. 3, B and C, figs. S16 and S17, and data table S4). By analyzing longitudinal samples that originated from the same subject over different visits (collected a few months apart), we found that 52% (oral), 52% (gut), and 19% (skin) of subjects who were positive for a given TII-PKS bin at one visit were positive for the same bin at a subsequent one (Fig. 3D and data table S11). Finally, in the MetaHIT cohort in which only samples from the gut microbiome were available, 73 (18.5%) of the 395 subjects harbored at least one TII-PKS bin (Fig. 3B and data table S4).

After the analysis of the HMP-1-1, HMP-1-2, and MetaHIT cohorts (subjects from the United States, Denmark, and Spain), we investigated whether TII-PKS BGCs are also widespread in non-Western cohorts. To answer this question, we analyzed two additional cohorts (Fijicomp, a collection of 434 oral and fecal samples from Fijian subjects, and a Chinese cohort of 225 fecal samples) (3, 26). From 28 × 109 total reads, MetaBGC detected 10 × 103 reads (3.6 × 10−7 hit rate), which were clustered into 19 bins of >10 nonredundant biosynthetic reads (figs. S19 and S20 and data tables S3 to S5). Six bins mapped to TII-PKS BGCs and NPCs previously identified in the Western cohorts; two bins contained only false-positive reads and 11 appeared to be new ones. Of these, assemblies of 11 samples enabled the discovery of one complete and four partial TII-PKS BGCs and five NPCs (Fig. 4, fig. S15, and data table S6). Compared with the Western cohorts, 15 (8%) and 56 (25%) of the Fijian and Chinese subjects, respectively, harbored at least one TII-PKS BGC in their gut microbiome, whereas 38 (13%) of the Fijian subjects harbored a TII-PKS BGC in either their gut or oral microbiome and four (1.4%) of them were positive in both body sites (figs. S19 and S20 and data table S4). Taken together, these results indicate that TII-PKS BGCs are widespread in the human microbiome of different populations at three major body sites (mouth, gut, and skin) and are harbored by consistent colonizers of the human body (stable over several months).

Because MetaBGC infers the abundance of a given TII-PKS BGC from quantifying metagenomic reads derived solely from its cyclase genes, we sought to further verify the validity of this inference. For all complete TII-PKS BGCs discovered in a given cohort, we mapped metagenomic reads from all samples in the same cohort to the entire length of the BGC using an independent read recruitment method and investigated whether TII-PKS genes other than the four types of cyclases could be detected (see the supplementary materials). In >91% of the instances where a sample was deemed positive by MetaBGC for a given TII-PKS BGC, reads that mapped to both cyclase and noncyclase genes within the same BGC were detected in the sample (data tables S12 to S16). In addition, a strong positive correlation was observed between the bin abundances calculated by MetaBGC and the independently calculated abundances for the entire BGCs expressed in RPKM (reads per kilobase pair per million sequenced reads) values: the Pearson correlation coefficient was 0.85 and the p-value was 2.2 × 10−16 (data tables S12 to S16). These results further validate MetaBGC as a sensitive, read-based algorithm for the discovery and quantification of TII-PKS BGCs in metagenomic samples of the human microbiome.

TII-PKS BGCs are encoded by diverse members of the human microbiome and expressed under host colonization conditions

Altogether, we discovered 13 complete TII-PKS BGCs directly from metagenomic datasets of the human microbiome, each of which was found predominantly in one body site (mouth, skin, or gut) (Figs. 3B and 4). To gain more insight into the members of the human microbiome with genomes that encode these BGCs, we searched the 13 BGCs against databases of reference genomes in NCBI and the Integrated Microbial Genomes and Microbiomes (IMG) (39) and successfully mapped eight of them to previously sequenced bacterial genomes (the remaining five had no matches; see the supplementary materials). We found that TII-PKS BGCs are encoded by a diverse suite of Firmicutes (Streptococcus sp., Staphylococcus sp., Lactobacillus sp., and Blautia sp.) and Actinobacteria (Propionibacterium sp., and Rothia sp.) (Fig. 4 and data table S6). The mapped BGCs have strain-specific representation even among extensively sampled genera: bgc1 and bgc9 exist in only one strain each out of ~3000 sequenced Streptococcus isolates, bgc5 exists in only one strain out of >5500 sequenced Staphylococcus isolates, and bgc13 exists in only one strain out of ~1000 sequenced Lactobacillus isolates. These results indicate that human microbiome–derived TII-PKS BGCs are encoded by both sequenced and not-yet-sequenced members of the human microbiome in a strain-specific manner, further emphasizing the importance of discovering these pathways directly from metagenomic sequencing data.

To determine whether the discovered TII-PKS BGCs are indeed expressed in the human body, we mapped publicly available metatranscriptomic data from different human oral and fecal samples to the 13 complete TII-PKS BGCs discovered here (see the supplementary materials). Overall, we observed in vivo transcription of at least five oral BGCs (bgc1, bgc2, bgc3, bgc8, and bgc9) and one gut BGC (bgc6) in at least two different samples each (fig. S21 and data table S17). These results suggest that TII-PKS BGCs are not only encoded in the human microbiome, but are also expressed under host colonization conditions.

Experimental characterization of TII-PKS BGCs from the oral and gut microbiome

To determine whether human microbiome–derived TII-PKS BGCs produce similar molecules (in structure and biological activity) to their previously characterized counterparts from environmental bacteria, we selected two of the in vivo–expressed BGCs for experimental characterization: an oral one (bgc3) and a gut one (bgc6). Because a native strain harboring bgc3 had not yet been isolated at the beginning of this work, we undertook a synthetic biology strategy for its characterization. We synthesized the coding sequence of bgc3 optimized for the heterologous expression in Streptomyces sp., a widely used host for the expression of small-molecule BGCs (including TII-PKS BGCs) (40). bgc3 was synthesized in two overlapping fragments, with strong promoters driving the expression of its two potential operons (see the supplementary materials and Fig. 5A). The fragments were then assembled into an Escherichia coliStreptomyces–yeast shuttle vector by transformation-associated recombination in Saccharomyces cerevisiae and finally integrated into a phage attachment site on the chromosome of Streptomyces albus by bacterial conjugation (see the supplementary materials) (41). We then compared the organic extracts of cultures of recombinant S. albus::bgc3 with those of empty vector controls using high-performance liquid chromatography coupled with mass spectrometry (HPLC-MS), revealing the presence of several bgc3-specific peaks (Fig. 5B).

Fig. 5 Experimental characterization of bgc3.

(A) The DNA sequence of bgc3, discovered from human oral metagenomic data, was codon optimized for Streptomyces sp. and synthesized, assembled, and cloned into the heterologous host S. albus. Terminator (black dots) and constitutive promoter (black arrows) sequences were engineered to control the expression of bgc3 in S. albus (see the supplementary materials). (B) HPLC-MS analysis of a chemical extract from the culture of S. albus::bgc3 (red) compared with one from S. albus with an empty vector control (blue). An HPLC chromatogram at the absorbance of 400 nm is shown for both samples, indicating four bgc3-specific peaks produced by S. albus::bgc3 and not the control (1 to 4, highlighted in yellow). (C) Molecular structure of metamycins A to D (1 to 4), the products of bgc3. Asterisks indicate the relative configuration at the chiral carbons.

To further characterize the bgc3 products, we scaled up the fermentation of S. albus::bgc3 (27 L) and then isolated, purified, and solved the structure of four new type II polyketides from its organic extract (compounds 1 to 4, named metamycins A to D, respectively; Fig. 5), using a combination of HPLC–high-resolution tandem MS (HPLC-HR-MS/MS), 1D and 2D nuclear magnetic resonance (NMR), and x-ray crystallography (see the supplementary text, figs. S22 to S37, and tables S1 and S2). Metamycins A (1) and B (2) share the same aromatic, tricyclic backbone (C16), but differ in their starter unit at C16 (a sec-butyl moiety in 2 instead of an isopropyl one in 1). Metamycins C (3) and D (4) share a longer, tetracyclic aromatic backbone (C18) and similarly differ in their starter units.

The second BGC that we characterized was bgc6 (Fig. 6A). This BGC is of special interest to us because of its variable prevalence in human fecal samples from different cohorts, from 7% in subjects from Fiji, to 17% in subjects from Denmark or Spain, to 23% in subjects from China, and finally to 28% in subjects from the United States (data table S4). In addition, bgc6 is encoded in the genome of the gut isolate Blautia wexlerae DSM 19850 from the class Clostridia. This is unusual because TII-PKS BGCs are rarely encoded in anaerobic bacteria and only two anaerobes are known to produce aromatic polyketides (4244). To characterize the product of bgc6, we amplified its coding sequence from the genomic DNA of B. wexlerae DSM 19850, cloned it in an E. coliBacillus subtilis shuttle vector under a strong constitutive promoter, and integrated it into the genome of B. subtilis 168-sfp by natural transformation and double-crossover homologous recombination to yield B. subtilis-168-sfp::bgc6 (see the supplementary materials). B. subtilis was chosen in this case because it belongs to the same phylum as B. wexlerae (Firmicutes), harbors an average guanine-cytosine (GC) content in its genomic DNA that is similar to that of B. wexlerae (GC content is 41.5% in B. wexlerae and 43.5% in B. subtilis-168), and has been previously used as a heterologous host for the expression of type I polyketides (e.g., 6-deoxyerythronolide B), so it should not be limited in primary polyketide substrates such as acetate and malonate (45).

Fig. 6 Experimental characterization of bgc6.

(A) The DNA sequence of bgc6 was cloned from the gut isolate B. wexlerae DSM 19850 into the heterologous host B. subtilis-168-sfp. The black arrow indicates a constitutive promoter that was engineered to control the expression of bgc6 in B. subtilis-168-sfp (see the supplementary materials). (B) HPLC-MS analysis of a chemical extract from the culture of B. subtilis-168-sfp::bgc6 (red) compared with one from B. subtilis-168-sfp with an empty vector control (blue). An HPLC chromatogram at the absorbance of 400 nm is shown for both samples, indicating a single bgc6-specific peak produced by B. subtilis-168-sfp::bgc6 and not the control (5, highlighted in yellow). (C) Molecular structures of wexrubicin (5), the product of bgc6, and the related anticancer drug, doxorubicin. (D) Heatmap indicating the antimicrobial activity of metamycins A to D (1 to 4) and wexrubicin (5) against a panel of representative oral, skin, and gut isolates of the human microbiome. The activity is measured in micromolar as the minimum inhibitory concentration on agar (MIC-A). The antibiotic tetracycline (T) was tested in the same manner, and its activity was compared with that of the newly discovered polyketides. Note that only metamycin C (3) and metamycin D (4) exert antimicrobial activities against several isolates, with metamycin D activity being similar to that of tetracycline in several cases.

We then compared the organic extracts of cultures of B. subtilis-168-sfp::bgc6 and an empty vector control using HPLC-MS, revealing a single bgc6-specific peak (Fig. 6B). Next, we scaled up the B. subtilis-168-sfp::bgc6 culture (31L) and then isolated, purified, and elucidated the structure of a single new molecule (compound 5, named wexrubicin) using a combination of HPLC-HR-MS/MS and NMR (see supplementary text, figs. S38 to S44, and table S3). Wexrubicin has a tetracyclic, anthracycline ring system (C21) connected with a β-glucose moiety at C4, which is consistent with an encoded glycosyl transferase in bgc6 (Fig. 6C).

Biological activity of type II polyketides encoded in the human microbiome

The discovery of 1 to 5 not only shows that the human microbiome encodes previously undescribed type II polyketides, but also that some of these molecules resemble to a great extent clinically used drugs. Wexrubicin (5), for example, falls into the group of anthracycline polyketides, which includes the clinically used anticancer drugs doxorubicin and danuribicin, as well as the antitumor antibiotic molecules tetracenomycin and elloramycin (46, 47). Conversely, metamycins A and B (1, 2), and C and D (3, 4) resemble the previously isolated antibiotics setomimycin from S. pseudovenezuelae and oviedomycin from S. antibioticus, respectively (48, 49). To determine whether microbiome-derived polyketides exert the same biological activity as their closely related derivatives, we tested them in two types of assays: cytotoxicity against mammalian cell lines and antimicrobial activity against selected bacteria and fungi (see the supplementary materials). Compared with doxorubicin as a positive control, none of the tested molecules showed notable cytotoxicity against HeLa cell lines (fig. S45).

In antimicrobial assays, however, metamycins C and D (3, 4) showed strong inhibitory activity against several Gram-positive bacteria at a magnitude similar to that of the clinically used antibiotic tetracycline in some cases. These molecules were most potent against oral isolates of Streptococcus, Atopobium, Actinomyces, Rothia, and Corynebacterium sp. (Fig. 6D). Being the products of an oral BGC (bgc3), these results suggest that metamycins C and D (3, 4) may play a role in niche competition in the oral cavity or in host protection against pathogens. This is consistent with the fact that bgc3 is expressed in human supragingival plaque samples during early biofilm formation, as determined by metatranscriptomic analysis (fig. S21 and data table S17). Because no cytotoxicity or antibacterial activity was detected for metamycins A and B (1, 2) or wexrubicin (5), further studies are needed to provide insights into their biological role.


Although substantial efforts have been spent on documenting and characterizing diverse small molecules encoded by human-associated bacteria, no aromatic polyketides have been previously described from the human microbiome. Here, we used a new computational strategy for the discovery of small-molecule BGCs directly from human microbiome–derived metagenomic sequencing data, revealing the unexpectedly wide distribution of BGCs encoding for this relatively rare class of molecules in the human microbiome (~50% of subjects from the United States carry at least one BGC of this class). We then combined this in silico approach with a synthetic biology strategy to heterologously express the identified BGCs and directly discover their chemical products. The structural diversity observed in the products of two of the 13 BGCs discovered here, and the fact that they resemble in structure or biological activity clinically used drugs, clearly motivate further functional investigations into this class. These investigations will not only serve as an important avenue for elucidating microbiome-host interactions at the molecular level, but they will also serve as an unprecedented resource for drug discovery from within the human body.

Our computational strategy relies on repurposing and tailoring established probabilistic models for the use with short-read metagenomic data, which achieves high-sensitivity and high-specificity detection of target protein families. We focus on not one but four distinct protein families in this study as a proof of principle (TII-PKS cyclases and aromatases), which demonstrates the generality of our method. The same approach could be easily adapted to any protein family of interest, including ones that are specific to other types of BGCs. To illustrate this point, we applied our same strategy to two new protein families that are specific to BGCs that encode two additional types of small molecules (siderophores and lanthipeptides). In both cases, we were successful in separating low- and high-performance spHMMs and in using the four modules of MetaBGC using synthetic dataset 2 (see the supplementary materials, data table S18, and figs. S46 to S49).

The performance and limitations of MetaBGC depend on several factors. First, to be used as a discovery and quantification method for a given BGC, it is important that spHMMs are built for a protein family that is both specific for and universally found in the BGC of interest. Second, a relatively high-performance, full-size pHMM needs to be generated before segmentation, which relies in turn on the availability of sufficient numbers and a proper alignment of sequenced homologs for the protein family of interest. Third, a carefully designed synthetic dataset needs to be generated to reflect as much as possible the expected complexity in the real metagenomic datasets of interest and to evaluate the spHMMs and tailor their score cutoffs. Finally, several flexible parameters need to be optimized on the basis of the specific study goals, including the size of the spHMM segments (depending on the read length of the search data), the F1 and spHMM score cutoffs (depending on how conservative the discovery goals are), the minimum number of nonredundant biosynthetic reads to define a new bin, the minimum bin abundance to define the “presence” or “absence” of a given bin in a metagenomic sample (depending on how sensitive the study is designed to be), and, finally, the minimum Pearson correlation distance used in the clustering step, which dictates the stringency of the binning. We have provided example settings for these parameters in our current study, which can be used as starting points for new applications and further optimized as needed.

In conclusion, we present a general strategy that is useful for quickly profiling metagenomic datasets from large clinical cohorts and prioritizing candidate BGCs for experimental characterization. More broadly, we see this as a systematic strategy for unveiling the chemical repertoire encoded by the human microbiome, a much-needed step for understanding its role in human health and disease.

Materials and methods summary

An expanded materials and methods section can be found in the supplementary materials.

Development of MetaBGC

MetaBGC is composed of four modules. In MetaBGC-Build, spHMMs are developed for every protein family of interest and evaluated using a synthetic metagenomic dataset that harbors reads from two groups of bacterial genomes: genomes that encode diverse members of the protein family of interest (but not included in the spHMMs) and genomes that lack any. In MetaBGC-Identify, high-performance spHMMs are used to search billions of translated metagenomic reads from thousands of metagenomic samples to identify “biosynthetic reads.” In MetaBGC-Quantify, identified biosynthetic reads are de-replicated and quantified in all samples of a given cohort and an abundance profile matrix is built for all reads in all samples. Finally, in MetaBGC-Cluster, biosynthetic reads are clustered into “biosynthetic read bins” on the basis of their abundance profiles across samples. These bins are then used to calculate “biosynthetic read bin abundance” in each sample and as seeds for targeted or untargeted BGC assembly efforts from the sample with the highest abundance of a given bin. For exact details about the development of MetaBGC, see the supplementary materials.

Using MetaBGC to search metagenomic sequencing data of the human microbiome for TII-PKS BGCs

High-performance spHMMs for four types of cyclases and aromatases commonly found in TII-PKS BGCs were developed and used to search quality-filtered and translated metagenomic sequencing reads from five large cohorts (HMP-1-1, HMP-1-2, MetaHIT, Chinese, and Fijicomp). After quantification, clustering, and assembly of biosynthetic read bins, complete TII-PKS BGCs were annotated and two were selected for experimental characterization. For details about the complete set of analyses performed on the five cohorts and the discovered TII-PKS BGCs, see the supplementary materials.

Experimental characterization of two human microbiome–derived TII-PKS BGCs

Out of the 13 identified TII-PKS BGCs, we selected two, bgc3 and bgc6, for experimental characterization on the basis of their profiles in metagenomic and metatranscriptomic datasets. The metagenomic DNA sequence for bgc3 was codon optimized, synthesized, and engineered for heterologous expression in S. albus, and that of bgc6 was amplified from genomic DNA isolated from B. wexlerae DSM 19850 and engineered for heterologous expression in B. subtilis. Heterologous expression lines harboring the BGCs or empty vector controls were cultured, chemically extracted, and analyzed using HPLC-MS. Large-scale cultivation experiments of the expression lines were used to provide enough materials for the isolation and purification of the five type II polyketide molecules reported here. Structural elucidation of the purified molecules was performed using a combination of NMR, x-ray crystallography, and MS. Biological activity of the purified molecules was assessed using cytotoxicity assays against HeLa cells and minimum inhibitory concentration (MIC)-on-agar assays against several pathogenic and commensal bacteria. For additional details, see the supplementary materials.

Supplementary Materials

Materials and Methods

Supplementary Text

Figs. S1 to S49

Tables S1 to S3


Data Tables S1 to S18

References and Notes

Acknowledgments: We thank M. Cahn for assistance with metagenomic data analysis; B. Javdan, S. Rath, and A. Korennykh for assistance with mammalian cell culture; the crystallography core facility at the Department of Molecular Biology, Princeton University; S. Chatterjee for general assistance; and the rest of the Donia laboratory for useful discussions. Funding: Funding for this project was provided by an NIH Director’s New Innovator Award (ID: 1DP2AI124441) to M.S.D., the Pew Biomedical Scholars Program to M.S.D., and a Focused Research Team on Precision Antibiotics Award by the School of Engineering and Applied Science at Princeton University through the generosity of Helen Shipley Hunt *71. Author contributions: M.S.D., Y.S., and F.R.C. designed the study. Y.S., F.R.C., S.W., P.C., A.O., A.B., P.D.J., and M.S.D. performed experiments and analyzed the data. M.S.D., F.C., and Y.S. wrote the manuscript with input from the rest of the authors. Competing interests: M.S.D. is a member of the scientific advisory board of DeepBiome Therapeutics and a consultant for Flagship Pioneering. Data and materials availability: All data are available in the main text or the supplementary materials. The code for MetaBGC is archived at Zenodo (50). Crystallography data were deposited at the Cambridge Crystallographic Data Centre under accession numbers CCDC 1878185, CCDC 1878186, and CCDC 1878187.

Stay Connected to Science

Navigate This Article