Research Articles

A prominent glycyl radical enzyme in human gut microbiomes metabolizes trans-4-hydroxy-l-proline

Science  10 Feb 2017:
Vol. 355, Issue 6325, eaai8386
DOI: 10.1126/science.aai8386

Chemically guided functional profiling

The big challenge posed by the microbiota living in or on humans is working out what they do for us. Microorganisms generate large quantities of peptides and proteins that may have profound systemic effects on the host. Levin et al. took microbial metagenome data and used a combination of bioinformatic tools to generate a network that clusters sequences of enzymes sharing similar biological functions (see the Perspective by Glasner). Experiments verified these homology and structural-chemical inferences. The analysis identified enzymes involved in anaerobic short-chain fatty acid production and L-proline biosynthesis, both of which are key mediators of healthy microbiota-host symbioses.

Science, this issue p. eaai8386; see also p. 577

Structured Abstract


The microbes that live in and on our bodies (the human microbiome) profoundly affect human health and disease. For example, within the lower gastrointestinal tract, microbes employ powerful enzymatic chemistry to access recalcitrant nutrients and generate metabolites that mediate interactions with host cells. Given the vast amounts of available sequencing data from human microbiomes, we know surprisingly little about the precise mechanisms by which these activities influence human biology. This knowledge gap arises in part from our poor understanding of microbial enzymes and metabolic processes. Collectively, the genes present in microbiomes (metagenomes) encode millions of uncharacterized enzymes, and approaches are needed to connect these genes to biochemical functions.


Efforts to identify the microbial activities encoded within metagenomes (functional profiling) have largely focused on assigning protein sequences found in these data sets to overarching processes (e.g., “vitamin biosynthesis”) or large enzyme superfamilies whose members carry out many different chemical reactions. These methods therefore provide limited information about specific enzymes of interest and cannot easily differentiate superfamily members with known and unknown functions. Addressing this problem requires incorporating a mechanistic understanding of how amino acid sequence influences enzymatic activity into metagenomic analyses. We envisioned developing a “chemically guided” functional profiling strategy that would use protein sequence similarity network (SSN) analysis to distinguish functionally distinct members of large enzyme superfamilies and integrate this information into quantitative metagenomics. This method would not only quantify different types of enzymes in metagenomic and metatranscriptomic data sets, but also pinpoint enzymes of unknown function in communities, prioritizing them for further study on the basis of their abundance and distribution. We initially applied this workflow to profile the glycyl radical enzyme (GRE) superfamily, which is one of the most enriched protein families in the human gut microbiome. GREs are O2-sensitive enzymes that catalyze key transformations in anaerobic microbial metabolism, including carbohydrate utilization and DNA synthesis. Although the activities of certain gut microbial GREs have been connected to heart, liver, and kidney diseases, as well as autism, numerous members of this superfamily have not yet been biochemically characterized.


We determined the abundance of individual types of GREs in 378 metagenomes from healthy humans, including two aerobic body sites (vagina and skin), three microaerobic body sites (tongue, inner cheek, and dental plaque), and one anaerobic body site (gut). The human gut microbiome contained the largest number of distinct GREs, many of which have unknown functions. Our analysis provided new information about known GRE-mediated activities, including production of the disease-associated metabolites trimethylamine and p-cresol. In vitro studies of abundant, uncharacterized GREs from the human gut revealed that radical-based dehydration chemistry is widespread in this environment and led to the discovery of trans-4-hydroxy-l-proline (Hyp) dehydratase. This enzyme enables gut commensals and human pathogens like Clostridium difficile to metabolize Hyp, a nonproteinogenic amino acid that is rare in bacteria but is an abundant posttranslational modification in eukaryotes. The universal distribution of this activity in human gut microbiomes suggests that it plays an important role in this habitat, setting the stage for future hypothesis-driven research.


By accurately identifying enzymes present in microbial communities, this workflow allows ecological context to inform enzyme characterization, uncovering widespread but previously unappreciated metabolic activities. We are now poised to apply this strategy to examine various patient populations, additional protein superfamilies, and other microbiomes.

Chemically guided functional profiling enables enzyme discovery in microbiomes.

Combining protein sequence similarity network (SSN) analysis with quantitative metagenomics reveals the abundance of both characterized and uncharacterized members of enzyme superfamilies. An analysis of glycyl radical enzymes in healthy human microbiomes facilitated the discovery of trans-4-hydroxy-l-proline dehydratase, a universally distributed but previously unknown gut microbial enzyme.


The human microbiome encodes vast numbers of uncharacterized enzymes, limiting our functional understanding of this community and its effects on host health and disease. By incorporating information about enzymatic chemistry into quantitative metagenomics, we determined the abundance and distribution of individual members of the glycyl radical enzyme superfamily among the microbiomes of healthy humans. We identified many uncharacterized family members, including a universally distributed enzyme that enables commensal gut microbes and human pathogens to dehydrate trans-4-hydroxy-l-proline, the product of the most abundant human posttranslational modification. This “chemically guided functional profiling” workflow can therefore use ecological context to facilitate the discovery of enzymes in microbial communities.

Communities of microorganisms (microbiomes) occupy nearly every environment on Earth, and these complex assemblages carry out metabolic processes that affect surrounding habitats and organisms (1). For example, the human gut microbiome metabolizes nondigestible dietary components, produces essential vitamins and nutrients, and synthesizes metabolites that are linked to human disease (2, 3). Despite their importance, we have extremely limited knowledge of the specific biochemical reactions performed by microbiomes and the precise mechanisms by which this chemistry shapes microbial ecosystems (4).

This deficit stems from our incomplete understanding of the microbial enzymes that catalyze these chemical transformations. Collectively, the genomes of the organisms that comprise microbiomes (metagenomes) encode vast numbers of enzymes, most of which are uncharacterized. This issue complicates efforts to predict the metabolic activities present within these communities (functional profiling). For instance, 78 to 86% of genes in Human Microbiome Project (HMP) metagenomes cannot be assigned a metabolic function, and ~50% cannot be given any annotation (4, 5). Moreover, genes that can be annotated are typically mapped to large enzyme superfamilies without considering that a single superfamily can catalyze many different chemical reactions and that as many as 80% of enzymes within a superfamily can be uncharacterized or misannotated (6, 7). Thus, functional profiling strategies that can accurately identify enzymes in microbiomes are needed, including both characterized enzymes and enzymes of unknown function that play important but unrecognized roles in these habitats.

The importance of this problem can be appreciated by considering the difficulties associated with studying the activities and roles of glycyl radical enzymes (GREs) in the human gut microbiome. GREs use protein-based radicals to accomplish challenging chemical transformations (Fig. 1A) (8), with the key glycine-centered radical installed posttranslationally by a radical S-adenosylmethionine enzyme (9). These enzymes participate in evolutionarily ancient, anaerobic primary metabolism, including carbohydrate utilization [pyruvate formate-lyase and related α-ketolyases (PFL)] (Fig. 1B) and deoxyribonucleotide synthesis (class III ribonucleotide reductase) (8, 10). Previous metagenomic and metaproteomic studies have indicated that the GREs are one of the most abundant protein superfamilies in the human gut microbiome (11–13). Furthermore, activities of characterized GREs from gut microbes are strongly linked to human health. Production of trimethylamine (TMA) from choline (choline trimethylamine-lyase, CutC) (14) is associated with heart (15) and liver diseases (16). Decarboxylation of p-hydroxyphenylacetate gives p-cresol (p-hydroxyphenylacetate decarboxylase, HPAD) (17), which interferes with human drug metabolism and is elevated in children with autism (18, 19). Despite these intriguing connections to human biology, little is known about the abundance and distribution of different types of GREs in human microbiomes. Efforts to accurately identify these enzymes in microbiomes, including attempts to detect CutC in stool metagenomes (20), have been complicated by the high amino acid sequence similarities of GREs and the many superfamily members with unknown functions.

Fig. 1 An overview of the glycyl radical enzyme (GRE) superfamily.

(A) Shared mechanistic features of GREs. SAM, S-adenosylmethionine. (B) Chemical reactions catalyzed by selected characterized GREs.

Here we show that integrating information about enzymatic chemistry into quantitative metagenomics can improve our ability to detect both known and uncharacterized members of enzyme superfamilies in microbiomes. Using a workflow that combines protein sequence similarity network (SSN) analysis with quantitative metagenomics, we first determined the abundance and distribution of individual members of the GRE superfamily in healthy human microbiomes. We identified and quantified biochemically characterized GREs as well as uncharacterized family members that are locally abundant, widespread, or unique in given body sites, prioritizing them for further study on the basis of their ecological context. Employing this strategy, we discovered that the most abundant uncharacterized GRE in HMP stool metagenomes is a trans-4-hydroxy-l-proline dehydratase. This previously unknown enzyme is found in all subjects and thus likely plays a prominent role in the human gut microbiome.

Chemically guided functional profiling incorporates an understanding of enzymatic activity into quantitative metagenomics

Our approach, which we call “chemically guided functional profiling” (Fig. 2), begins by identifying an enzyme superfamily of interest, comparing the amino acid sequences of all family members to one another, and then visualizing the resulting pairwise relationships as an SSN (21). Guided by an understanding of how amino acid residues of characterized family members contribute to their activities, mechanisms, and structures, we can construct an SSN that clusters together sequences of enzymes that likely share the same biochemical function. Notably, this analysis can differentiate family members with distinct activities, regardless of whether or not an enzyme’s function is known.

Fig. 2 Chemically guided functional profiling incorporates chemical information into metagenomic analyses to reveal the abundance and distribution of individual members of enzyme superfamilies in microbial communities.

The SSN is then used to interpret data generated by the quantitative metagenomic analysis tool ShortBRED (Short, Better Representative Extract Dataset) (22). Given the amino acid sequences of the enzyme superfamily as input, ShortBRED identifies sequence markers unique to similar family members and quantifies their relative abundance in raw metagenomic sequencing data with high specificity. Mapping the sequence markers and abundance data produced by ShortBRED back to the clusters of enzymes in the SSN then reveals the abundance of individual superfamily members in a microbial community, including enzymes of both known and unknown function. While SSN analysis has been used to study uncharacterized enzymes found in microbial genome-sequencing projects (21, 23, 24), these efforts have not examined the presence of these enzymes in communities. Likewise, although sequences from assembled metagenomes have been incorporated into SSNs to expand the diversity of an enzyme superfamily (25), to our knowledge these networks have not been applied to large scale, quantitative metagenomic analyses.

Construction of an SSN for the GRE superfamily

We envisioned using this workflow to assess the distribution of the GRE superfamily across healthy human microbiomes. To begin our analysis, we used the web-based Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST) to build an SSN using 6343 sequences from InterPro family IPR004184, which includes enzymes that have the so-called “PFL domain” and encompasses all functionally characterized GREs except for the phylogenetically distinct ribonucleotide reductases (26). Our initial network was constructed such that connected sequences shared an alignment score of at least 10⁻³⁰⁰. We iteratively refined this network by adding different percent identity (ID) filters, removing edges that did not meet each threshold to generate multiple SSNs (figs. S1 and S2). By searching sequence databases and the literature, we then mapped biochemically characterized GREs onto each of these networks, using characteristic conserved active site amino acids known to be involved in substrate binding and catalysis to confirm our assignments (Fig. 3A). For instance, we annotated sequences that likely encode PFLs by looking for the catalytically essential active site Cys-Cys motif that is found in all GREs with this activity (>20 different proteins) (27).

Fig. 3 Construction of a sequence similarity network for the GRE superfamily.

(A) Multiple sequence alignment of selected GREs. The regions shown contain residues that occupy the active sites of structurally characterized GREs and homology models of uncharacterized GREs. The residues at the positions marked with asterisks are conserved in different characterized GREs and are known to play roles in substrate binding or catalysis, making them useful for both identifying known GREs and revealing uncharacterized GREs with potentially distinct activities. Numbering corresponds to PD from R. inulinivorans (uncharacterized GRE cluster 16); accession numbers are from UniProt. (B) An SSN of the GRE superfamily (InterPro version 53.0; IPR004184, PFL domain) was constructed with an initial score of 10⁻³⁰⁰. The edge score was then refined such that nodes are connected by an edge if the pairwise sequence identity is ≥62% ID. Each of the 1843 nodes within the resulting SSN contains sequences with >95% amino acid identity. Single-letter abbreviations for the amino acid residues are as follows: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu; M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W, Trp; and Y, Tyr.

Ultimately, we chose a minimum edge threshold for the SSN (62% ID) that separates GREs with biochemically verified activities into different clusters (Fig. 3B). Notably, this edge threshold also differentiates uncharacterized GREs that may possess disparate biochemical activities based on differences in predicted active site residues and genomic contexts. For example, at lower edge thresholds (e.g., 55% ID), glycerol dehydratase (GD) from Clostridium butyricum clusters with two uncharacterized GREs that are predicted to share only a subset of active site residues with GD (fig. S3). Whereas GD is encoded next to a 1,3-propanediol dehydrogenase (28), these other GREs are colocalized with additional genes predicted to encode microcompartment proteins, an aldehyde dehydrogenase, and a phosphate propanoyltransferase (fig. S3). These distinct genomic contexts suggest that the activities of the uncharacterized GREs may differ from that of GD. At a minimum edge threshold of 62% ID, our SSN resolves these three enzymes into distinct clusters. Though we cannot know for certain that all of the GRE clusters in our SSN are isofunctional, the separation of these highly similar GREs indicates a strong likelihood that each cluster contains enzymes with the same biochemical activity. The presence of uncharacterized GREs reflects a larger trend within the SSN: 195 of the 241 clusters in the final SSN have no assignable biochemical function, suggesting that this enzyme superfamily contains substantial unexplored diversity.
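The effect of the edge threshold on cluster membership can be illustrated with a minimal sketch. It treats an SSN as a graph whose edges are kept only when pairwise identity meets the threshold, and reports the connected components as clusters. The sequence names and identity values below are hypothetical and chosen only to mimic the GD example; this is not the EFI-EST implementation, which works from all-by-all BLAST comparisons.

```python
from collections import defaultdict


def cluster_by_identity(pairwise_identity, threshold):
    """Return clusters (connected components) of a sequence similarity network.

    pairwise_identity: dict mapping (seq_a, seq_b) -> percent identity.
    threshold: minimum percent identity required to keep an edge.
    """
    # Collect every sequence so that unconnected ones become singleton clusters.
    nodes = set()
    for a, b in pairwise_identity:
        nodes.update((a, b))

    # Union-find structure with path compression.
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Merge the two endpoints of every edge that passes the threshold.
    for (a, b), identity in pairwise_identity.items():
        if identity >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for n in nodes:
        clusters[find(n)].add(n)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical identities (illustrative values only, not data from the study).
identities = {
    ("GD", "unknown_1"): 58.0,
    ("GD", "unknown_2"): 56.0,
    ("unknown_1", "unknown_2"): 60.0,
}

print(cluster_by_identity(identities, 55))  # all three sequences in one cluster
print(cluster_by_identity(identities, 62))  # three separate clusters
```

With these made-up values, the 55% threshold merges all three sequences while the 62% threshold resolves them, which is the qualitative behavior described for GD and its two uncharacterized neighbors.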

Integrating the SSN with ShortBRED reveals the distribution and abundance of GREs in human microbiomes

With an SSN in hand, we used ShortBRED to profile the abundance of the entire GRE superfamily in 378 high-quality, first-visit metagenomes from healthy participants sequenced during the HMP (5), focusing on six body sites: stool (reflective of the lower gastrointestinal tract), buccal mucosa (oral), supragingival plaque (oral), tongue dorsum (oral), anterior nares (skin), and posterior fornix (vaginal). These body sites range from aerobic (skin and vaginal) to microaerobic (oral) to anaerobic (gut) environments. ShortBRED-Identify first found unique protein sequence markers for highly similar GREs (85% amino acid identity) (table S2). ShortBRED-Quantify then measured the abundance of each marker in the unassembled metagenomic reads. By tabulating the sequence markers belonging to each cluster of sequences in our SSN, we determined the abundance of each group of GREs within each metagenome. Finally, we normalized these abundance values using previously calculated average microbial genome sizes for each metagenomic sample (29).
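The tabulation and normalization steps can be sketched as follows. The sketch assumes that per-family abundances are available in RPKM (reads per kilobase of marker per million reads) and that a single-copy gene in a community with average genome size G base pairs has an expected RPKM of 10⁹/G, so that dividing by this value gives approximate copies per genome. Both the scaling formula and all names and numbers are assumptions made for illustration; the exact normalization used in the study is not specified in this section.

```python
from collections import defaultdict


def cluster_abundance(family_rpkm, family_to_cluster):
    """Sum the abundances of all marker families that belong to each SSN cluster."""
    totals = defaultdict(float)
    for family, rpkm in family_rpkm.items():
        totals[family_to_cluster[family]] += rpkm
    return dict(totals)


def copies_per_genome(rpkm, avg_genome_size_bp):
    """Convert RPKM to approximate gene copies per microbial genome.

    Assumption: a single-copy gene has an expected RPKM of 1e9 / genome size,
    so copies per genome = RPKM * genome size / 1e9.
    """
    return rpkm * avg_genome_size_bp / 1e9


# Hypothetical inputs for one metagenome (not data from the study).
family_rpkm = {"fam_a": 2.0, "fam_b": 3.0, "fam_c": 1.5}
family_to_cluster = {"fam_a": "cluster_15", "fam_b": "cluster_15", "fam_c": "cluster_16"}
avg_genome_size_bp = 3_000_000

per_cluster = cluster_abundance(family_rpkm, family_to_cluster)
normalized = {
    cluster: copies_per_genome(rpkm, avg_genome_size_bp)
    for cluster, rpkm in per_cluster.items()
}
print(per_cluster)
print(normalized)
```

Normalizing per genome rather than per read makes abundances comparable between samples whose communities differ in average genome size.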

This chemically guided functional profiling workflow revealed the abundance and distribution of individual GRE clusters in microbiomes from healthy human subjects (Fig. 4A and tables S3 to S5). We detected sequences belonging to 75 of the 241 GRE clusters from our SSN, implying that the human host supports a wide range of GRE-mediated chemistry. We found GREs in all oral and stool metagenomes and a subset of samples from the other body sites. PFL is the most abundant family member in all GRE-containing samples, consistent with its role in anaerobic glucose metabolism (Fig. 4B). The presence of PFL in many facultative anaerobes and the existence of mechanisms for repairing oxygen-damaged PFL may explain its occurrence in both anaerobic and aerobic environments (30, 31). We observed a unique set of GREs in stool samples compared to the other body sites and identified significantly more GREs per microbial genome in this body site [P < 10⁻⁵⁸, Kruskal-Wallis (KW); all Ps < 10⁻⁸, Dunn’s multiple comparisons (DMC) test]. Additionally, a larger number of distinct GRE clusters were located in the gut (75 versus 5 to 15 for other body sites) (fig. S4), indicating that this environment harbors a wider range of anaerobic metabolic processes.

Fig. 4 Chemically guided functional profiling of GREs in the human microbiome.

(A) Heatmap showing the abundance and distribution of the 50 most abundant GRE clusters in 378 HMP metagenomes from six body sites as quantified using ShortBRED. Biochemically characterized GRE clusters are shown in bold type, and GRE clusters characterized in this study are shown in red. Boxplots showing per-site abundance of (B) PFL, (C) HPAD, and (D) CutC across six body sites.

These results provide new insights about the ecological contexts of biochemically characterized GREs, including HPAD and CutC. Whereas HPAD is found almost exclusively in stool samples (Fig. 4C), CutC is present with similar frequency in stool, supragingival plaque, buccal mucosa, and tongue dorsum samples (Fig. 4D and table S5). Identifying this disease-linked enzyme in the oral microbiome is intriguing as periodontal disease and invasion of the gastrointestinal tract by oral bacteria are associated with heart and liver diseases (32, 33). This finding, which could not have been predicted by the distribution of CutC in sequenced genomes (20), implies that the oral microbiome may be a reservoir for TMA-producing bacteria. Unlike PFL, HPAD and CutC are detected in only a subset of stool metagenomes, which is consistent with the observed variability in the amounts of downstream metabolites p-cresol sulfate and trimethylamine-N-oxide in humans (14, 17) and could potentially contribute to interindividual differences in drug metabolism and disease susceptibility.

We also obtained information about the abundance of uncharacterized GREs in human microbiomes, and our data suggest that many unappreciated GRE-mediated activities exist in the human gut. GREs of unknown function represent 9 out of the 10 most abundant GRE clusters in stool metagenomes and, excluding PFL, outnumber characterized family members 63-fold. The 9th and 10th most abundant unknown GRE clusters are widely distributed in stool metagenomes (>50% of samples), yet each is represented by a single sequence in the SSN. This observation serves as a reminder that proteins poorly represented in sequence databases may be widespread in biological habitats. This analysis also helped us to prioritize specific GREs for further study. We focused on the two most broadly distributed and abundant uncharacterized GREs in the human gut microbiome: cluster 16, which is found in 96% of stool samples, is the third-most abundant GRE in stool metagenomes and is enriched in this habitat relative to other body locations (P < 10⁻⁷², KW; all Ps < 10⁻¹⁵, DMC; Fig. 5A and table S5); and cluster 15, which is present in every stool sample, is the second-most abundant GRE in stool metagenomes and is also enriched in the gut (P < 10⁻⁶⁰, KW; all Ps < 10⁻¹¹, DMC; Fig. 6A and table S5).

Fig. 5 Identification and characterization of propanediol dehydratase reveals amino acids involved in dehydration.

(A) Per-site abundance of GRE cluster 16 across six body sites. (B) Hypothesized role of GRE cluster 16 in l-fucose metabolism. (C) Kinetic analysis of PD. Error bars represent the mean ± SD of three replicates. (D) Comparison of PD homology model (green) with GD crystal structure (yellow) identifies a characteristic set of active site residues required for dehydration. (E) GC-MS analysis of assays with wild-type PD or PD active site mutants and (S)-1,2-propanediol (time = 20 min). (F) Abundance of PD and B12-dependent propanediol dehydratase (PduC) in HMP stool metagenomes.

Fig. 6 An abundant, uncharacterized GRE in the human gut is a trans-4-hydroxy-l-proline dehydratase (t4lHypD).

(A) Per-site abundance of GRE cluster 15 across six body sites. (B) Conserved genomic context of GRE cluster 15 in Clostridiales. (C) Hypothesized pathway for anaerobic Hyp metabolism involving uncharacterized GRE cluster 15. (D) EPR spectrum of the glycine-centered radical of activated t4lHypD. An average of 0.51 ± 0.01 (mean ± SD) glycyl radical per t4lHypD monomer was observed with hyperfine coupling A = 1.44 mT. (E) LC-MS/MS detection of l-proline produced in vitro from Hyp by t4lHypD and P5C reductase (time = 1 hour). Error bars represent the mean ± SD of three replicates. AE, t4lHypD-activating enzyme.

The high abundance and wide distribution of these two GREs in metagenomes suggested that they might play prominent functional roles in the healthy human gut. To investigate whether these genes were expressed in gut microbiomes, we applied our chemically guided functional profiling workflow to analyze paired stool metagenomes and stool metatranscriptomes from eight healthy human subjects (34). Clusters 15 and 16 were present and transcribed in all samples (fig. S5), indicating that these GREs are likely produced and active in the human gut. Collectively, these observations imply that these two enzymes perform core functions within the healthy human gut and are distinctive of this habitat. We therefore set out to characterize the biochemical functions of these GREs.

Characterization of cluster 16 reveals a ubiquitous dehydratase motif within the GRE superfamily

We readily connected cluster 16 to anaerobic l-fucose utilization, a microbial metabolic activity that plays an important role in maintaining gut microbial-host symbiosis. Human gut bacteria consume l-fucose derived from host glycans, producing beneficial short-chain fatty acids like propionate as end products (35, 36). A key transformation required for bacteria to convert l-fucose to propionate is the dehydration of (S)-1,2-propanediol to propionaldehyde by B12-dependent propanediol dehydratase (37). The l-fucose–metabolizing human gut bacterium Roseburia inulinivorans lacks this enzyme and instead encodes a member of GRE cluster 16. This GRE was hypothesized to be a B12-independent propanediol dehydratase (PD) based on its colocalization with other fucose utilization genes in the R. inulinivorans genome and upregulation during growth on l-fucose (Fig. 5B) (38). However, when we began our study, the role of this GRE had not been biochemically validated.

We verified this proposal by characterizing R. inulinivorans PD and its activating enzyme (PD-AE) in vitro (fig. S6). Electron paramagnetic resonance (EPR) spectroscopy showed that PD-AE could generate a glycine-centered radical on PD (fig. S7). Gas chromatography–mass spectrometry (GC-MS) assays confirmed that activated PD converted (S)-1,2-propanediol to propionaldehyde (fig. S8). Kinetic analyses showed a 26-fold difference in specificity for (S)- versus (R)-1,2-propanediol [catalytic rate constant (kcat) = 1500 ± 100 s⁻¹, Michaelis constant (Km) = 7.8 ± 0.6 mM, kcat/Km = (1.9 ± 0.2) × 10⁵ M⁻¹ s⁻¹ versus kcat = 330 ± 40 s⁻¹, Km = 44 ± 4 mM, kcat/Km = (7.5 ± 0.8) × 10³ M⁻¹ s⁻¹], a stereochemical preference in accordance with PD’s proposed role in l-fucose metabolism (Fig. 5C). These findings agree qualitatively with a recently reported study of R. inulinivorans PD (39).
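The reported specificity difference follows directly from the kinetic constants. The short calculation below recomputes kcat/Km for each enantiomer from the central values given above (uncertainties are ignored, and the function name is introduced here only for illustration).

```python
def catalytic_efficiency(kcat_per_s, km_millimolar):
    """Return kcat/Km in M^-1 s^-1, converting Km from mM to M."""
    return kcat_per_s / (km_millimolar * 1e-3)


# Central values reported for R. inulinivorans PD.
eff_s = catalytic_efficiency(1500, 7.8)  # (S)-1,2-propanediol
eff_r = catalytic_efficiency(330, 44)    # (R)-1,2-propanediol
fold = eff_s / eff_r

print(f"(S): {eff_s:.2e} M^-1 s^-1")
print(f"(R): {eff_r:.2e} M^-1 s^-1")
print(f"fold preference for (S): {fold:.1f}")
```

The ratio of the two efficiencies comes to roughly 25.6, which rounds to the 26-fold preference for the (S)-enantiomer quoted in the text.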

Identifying active site residues from PD that facilitate dehydration helped us to predict functions of additional uncharacterized GREs. We constructed a homology model of PD, docked both (S)- and (R)-1,2-propanediol into its active site, and compared these models to a crystal structure of the related GRE GD (Fig. 5D and fig. S9) (40). Key active site amino acids from PD that are conserved in GD include G817 and C438, the sites of the radical intermediates thought to initiate the reaction via hydrogen atom abstraction from C1 of the substrate; E440, a general base that may deprotonate the C1-hydroxyl group; and H166, which for GD is predicted computationally to protonate the departing C2-hydroxyl group (41). Our model and docking agree well with a recently reported crystal structure of PD bound to (S)-1,2-propanediol (root mean square deviation of 0.56 Å) (fig. S9) (39). Site-directed mutagenesis experiments confirmed that these four residues are critical for activity (Fig. 5E). We therefore reason that this combination of amino acids, which is not found in GREs that perform other transformations, constitutes a “dehydratase motif” that is predictive of enzyme function (fig. S10). We uncovered this motif in 100 out of 195 uncharacterized clusters in the GRE SSN, indicating that dehydration is likely a widespread activity in this enzyme family (fig. S11).
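Screening clusters for the dehydratase motif amounts to checking four alignment positions. The sketch below assumes that each query sequence has already been aligned to R. inulinivorans PD and that the residue occupying each PD-numbered position has been extracted; that alignment step is not shown. The example sequences and their residues are hypothetical.

```python
# The four residues identified above, keyed by PD numbering.
DEHYDRATASE_MOTIF = {166: "H", 438: "C", 440: "E", 817: "G"}


def has_dehydratase_motif(residue_at, motif=DEHYDRATASE_MOTIF):
    """Return True if the query carries every motif residue.

    residue_at: dict mapping PD-numbered position -> single-letter residue
    found in the query at the aligned column.
    """
    return all(residue_at.get(pos) == aa for pos, aa in motif.items())


# Hypothetical queries (illustrative only, not sequences from the study).
candidate_a = {166: "H", 438: "C", 440: "E", 817: "G"}  # full motif
candidate_b = {166: "H", 438: "C", 440: "Q", 817: "G"}  # general base replaced
candidate_c = {438: "C", 817: "G"}                      # two positions unaligned

for name, residues in [("a", candidate_a), ("b", candidate_b), ("c", candidate_c)]:
    print(name, has_dehydratase_motif(residues))
```

A match on all four positions is only a prediction of dehydratase activity; as the cluster 15 example shows, additional active site residues determine which substrate is dehydrated.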

The discovery that PD was present at high abundance in 96% of the HMP stool metagenomes led us to investigate whether this enzyme or its B12-dependent counterpart propanediol dehydratase (PduC) was more abundant in the human gut microbiome. PduC was discovered in the 1960s, and certain gut pathogens, including Salmonella spp., use this enzyme to catabolize 1,2-propanediol to propionate (37, 38). Though these two enzymes catalyze the same dehydration reaction, they differ in their sensitivity to oxygen, making it unclear whether one type of enzyme would predominate in the largely anaerobic environment of the healthy human gut. We used ShortBRED to determine the abundance of PduC in the 80 HMP stool metagenomes analyzed above. Although we find that both dehydratases are widely distributed in human gut microbiomes (PD and PduC are present in 96 and 87% of stool samples, respectively), PD is significantly more abundant than PduC (P < 10⁻⁴, Mann-Whitney U test) (Fig. 5F). Furthermore, by examining the abundance of PD and PduC within each gut metagenome, we established that the median ratio of PD to PduC across all subjects was 5.2 to 1 (fig. S12). This observation suggests that PD may make a greater contribution to propionate production from l-fucose in the healthy human gut. However, the presence of both enzymes indicates that this gut microbial metabolic process may also proceed under conditions of increased oxygen, such as during inflammation (42). Overall, this analysis demonstrates how chemically guided functional profiling can provide insights into the ecology of enzymes that are well characterized biochemically.
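The two summary statistics used in this comparison, prevalence and the median per-sample ratio, can be computed as in the sketch below. The abundance values are hypothetical, and restricting the ratio to samples in which both enzymes are detected is an assumption made here to avoid division by zero; how such samples were handled in the study is not stated in this section.

```python
from statistics import median


def prevalence(abundances):
    """Fraction of samples in which the enzyme is detected (abundance > 0)."""
    return sum(1 for a in abundances if a > 0) / len(abundances)


def median_ratio(numerator, denominator):
    """Median of per-sample ratios, using only samples where both values are > 0."""
    ratios = [n / d for n, d in zip(numerator, denominator) if n > 0 and d > 0]
    return median(ratios)


# Hypothetical per-sample abundances (illustrative only, not data from the study).
pd_abundance = [12.0, 8.0, 0.0, 20.0, 5.0]
pduc_abundance = [2.0, 4.0, 1.0, 0.0, 1.0]

print(prevalence(pd_abundance))
print(prevalence(pduc_abundance))
print(median_ratio(pd_abundance, pduc_abundance))
```

Taking the ratio within each sample before summarizing, rather than dividing the two cohort-wide medians, keeps the comparison paired by subject.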

A prominent gut microbial GRE dehydrates trans-4-hydroxy-l-proline

Our analysis of dehydratases in the SSN revealed the characteristic dehydratase motif in sequences from cluster 15, the most abundant uncharacterized GRE in the human gut (fig. S11). However, inspection of multiple sequence alignments and a homology model of this enzyme uncovered additional predicted active site residues that differ from those of GD and PD, suggesting that it might dehydrate a different substrate (Fig. 3 and fig. S13). Using sequences from cluster 15 as search queries, we located this GRE in more than 850 sequenced bacterial and archaeal genomes deposited in the National Center for Biotechnology Information (NCBI) genome database, including prominent gut and oral commensals (Parabacteroides spp. and Clostridiales) and human pathogens such as Clostridium difficile (>97% of sequenced isolates, NCBI database) (fig. S14).

The genomic context of this putative dehydratase sheds light on its biochemical function. In the genomes of Clostridiales, the gene encoding this GRE is often clustered with genes encoding a GRE-activating enzyme and a predicted Δ1-pyrroline-5-carboxylate (P5C) reductase (Fig. 6B). P5C reductase reduces P5C to l-proline as the final step in l-proline biosynthesis (43). Hypothesizing that these enzymes might participate in the same pathway, we considered the nonproteinogenic amino acid trans-4-hydroxy-l-proline (Hyp) as a potential substrate for the GRE (Fig. 6C). Dehydration of Hyp could generate P5C, which would be converted to l-proline by the P5C reductase. Many Clostridiales can use l-proline as an electron acceptor in amino acid fermentations (44). Notably, certain l-proline–fermenting strains, including C. difficile, also use Hyp as an electron acceptor, but the enzymes that mediate this process have not been identified (45). Our proposed pathway would account for this metabolic activity and is consistent with the observation that expression of d-proline reductase, a key enzyme required for l-proline metabolism, is up-regulated when C. difficile grows in the presence of Hyp (45).

In vitro characterization of the putative Hyp dehydratase (t4lHypD), its partner activating enzyme (t4lHypD-AE), and the colocalized P5C reductase from C. difficile 70-100-2010 confirmed this hypothesis (fig. S15). We first used a spectrophotometric assay to verify that P5C reductase could interconvert P5C and l-proline (fig. S16). Electron paramagnetic resonance (EPR) experiments then showed that t4lHypD-AE could install a glycine-centered radical on t4lHypD (51 ± 1% activation, Fig. 6D), establishing that these enzymes are an activating enzyme-GRE pair. Finally, incubation of activated t4lHypD, P5C reductase, reduced nicotinamide adenine dinucleotide (NADH), and Hyp resulted in the full conversion of this amino acid to proline as detected by liquid chromatography–tandem mass spectrometry (LC-MS/MS) (Fig. 6E and fig. S17). Although each component of the full assay mixture was essential for production of proline, consumption of Hyp was still observed in assays lacking either P5C reductase or NADH (fig. S17). This pattern of activity indicates that t4lHypD catalyzes the dehydration of Hyp to produce P5C and that this reaction does not require the presence of the downstream P5C reductase. t4lHypD displayed undetectable or greatly reduced activity toward other hydroxyproline stereoisomers based on the quantification of proline by LC-MS/MS in samples from end-point assays (fig. S18). The kinetic parameters of t4lHypD further support the physiological relevance of this reaction [kcat = 45 ± 1 s⁻¹, Km = 1.2 ± 0.1 mM, kcat/Km = (3.8 ± 0.3) × 10⁴ M⁻¹ s⁻¹] (fig. S19) (46). Likewise, experiments with sequenced Clostridiales isolates showed improved growth in Hyp-containing media and the accompanying consumption of Hyp only in strains encoding t4lHypD (fig. S20).

Taken together, these experiments show that this abundant, universally distributed human gut microbial GRE is a Hyp dehydratase and define a pathway for anaerobic 4-hydroxyproline metabolism. The reaction performed by t4lHypD differs substantially from those of all other characterized hydroxyproline dehydratases, which accept 3-hydroxyproline. The hydroxyl group of 3-hydroxyproline is adjacent to the α-carbon of this amino acid, which has a relatively acidic proton (pKa ~ 29). In contrast, the hydroxyl substituent of 4-hydroxyproline cannot be readily eliminated using acid-base catalysis, as it is positioned between two carbon atoms that bear nonacidic protons (pKa ~ >40). The use of a radical enzyme provides an elegant solution to this chemical challenge.

The discovery of t4lHypD also reveals a previously unappreciated host-gut microbe metabolic interaction (Fig. 7). Many host and dietary proteins contain Hyp, including collagen, the most abundant host protein, and hydroxyproline-rich glycoproteins, the major proteinaceous component of higher plant and algal cell walls (47). In eukaryotes, Hyp is generated posttranslationally by prolyl 4-hydroxylases, members of the nonheme iron–dependent dioxygenase family (47). Although C4-hydroxylation of l-proline is the most common posttranslational modification in the human proteome, it is rare in bacteria. Unlike most posttranslational modifications, C4-hydroxylation of l-proline is considered to be irreversible by human metabolism. Instead, Hyp is oxidized to yield pyruvate and glyoxylate without forming l-proline (48). Remarkably, the actions of t4lHypD and P5C reductase allow bacteria to chemically “reverse” proline hydroxylation. t4lHypD’s activity is also notable from an evolutionary perspective because Hyp formation requires molecular oxygen, a substrate that inactivates GREs and was not present during the evolution of ancestral GRE family members. t4lHypD therefore likely emerged after the oxygenation of Earth’s atmosphere in response to the evolution of this posttranslational modification in eukaryotic organisms.

Fig. 7 The intersection between gut microbial Hyp metabolism and host metabolism.

The universal distribution and high abundance of t4lHypD in stool metagenomes suggest that it plays a critical role in the healthy human gut. In addition to supporting microbial energy production, the conversion of Hyp to P5C and l-proline could supply the microbiome with sources of carbon and nitrogen. These products may also be further processed to provide amino acid building blocks for protein synthesis. Hyp metabolism might affect l-proline availability for the host, which is intriguing given this amino acid’s role in host cell stress responses and apoptosis (49). Gut microbes may liberate Hyp from collagen or collagen-derived peptides of host or dietary origin, affecting collagen homeostasis and Hyp availability (48). Finally, the distribution of t4lHypD in both gut commensals and human pathogens implies that Hyp utilization could contribute to colonization resistance or pathogenicity. Further experiments are needed to explore the many potential biological implications of this activity.


In summary, we have incorporated knowledge of enzymatic chemistry into quantitative metagenomics, designing and implementing a chemically guided functional profiling strategy. Our analysis of the GRE superfamily in human microbiomes provided both new insights about GREs of known activity, including enzymes linked to human disease, and the ability to identify enzymes of unknown activity in these communities, revealing intriguing targets for further study. A combination of bioinformatic analyses and in vitro biochemical experiments proved critical for linking these highly abundant, uncharacterized sequences to corresponding microbial metabolic processes. In particular, the many questions raised by the activity and distribution of t4lHypD illustrate how enzyme discovery efforts can inspire hypothesis-driven microbiome research.

Chemically guided functional profiling changes how we discover microbial enzymes by both facilitating their identification in complex multi’omics sequence data sets and prioritizing them for characterization on the basis of their abundance, distribution, and expression in communities. The use of ecological context to guide characterization of unknown enzymes represents a striking departure from methods that have focused on targets present in sequenced organisms without considering their distributions in microbial habitats. This general strategy may be applied broadly to investigate the chemistry present in microbial communities. Our workflow can be used to profile metagenomes and metatranscriptomes obtained from any environment. Moreover, it can be readily extended to identify other types of enzymes, including the numerous enzyme superfamilies that have already been subjected to SSN analysis (26), provided that some superfamily members have been biochemically characterized. Further chemically guided functional profiling could uncover novel metabolic interactions both within microbiomes and between microbes and hosts. For example, we are now poised to detect GREs present in patient populations, searching for known functions such as p-cresol and TMA production, as well as new metabolic activities that may influence disease progression. By expanding our knowledge of microbial enzymes and metabolism, this approach will advance progress toward a deeper mechanistic understanding of microbiomes.

Materials and methods

Expanded Materials and Methods can be found in the supplementary materials.

Construction of GRE SSNs

SSNs were generated via the EFI-EST webtool (26) using IPR004184 (the pyruvate formate-lyase domain, version 53.0 of UniProt, accessed on 9 October 2015) as the input for option B with a minimum sequence length of 500 amino acids and no maximum length specified. Networks were subsequently generated with initial edge values of 10⁻⁵⁰ or 10⁻³⁰⁰. The resulting representative node networks were visualized with Cytoscape 3.2 (50). Edge scores were further refined in Cytoscape, and additional details related to the process of refining the edge threshold can be found in the supplementary materials.
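The effect of the edge threshold on an SSN can be illustrated with a small Python sketch. This is not the EFI-EST or Cytoscape implementation: the sequence identifiers and scores are hypothetical, and edges are stored as the negative base-10 logarithm of an E-value (so a threshold of 10⁻⁵⁰ becomes 50), a choice made here only to avoid floating-point underflow at very small E-values.

```python
def cluster_ssn(nodes, edges, min_score):
    """Group nodes into clusters joined by edges with score >= min_score.

    edges is a list of (node_a, node_b, score) tuples, where score is
    -log10(E-value) in this sketch. Uses a simple union-find.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in edges:
        if score >= min_score:
            parent[find(a)] = find(b)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical sequences and pairwise scores, for illustration only.
nodes = ["seqA", "seqB", "seqC", "seqD"]
edges = [("seqA", "seqB", 310), ("seqB", "seqC", 120), ("seqC", "seqD", 40)]

permissive = cluster_ssn(nodes, edges, min_score=50)
stringent = cluster_ssn(nodes, edges, min_score=300)
print(permissive)
print(stringent)
```

Raising the threshold removes weaker edges, so a single large cluster at the permissive setting separates into smaller clusters at the stringent one. This is the behavior that threshold refinement relies on to separate functionally distinct superfamily members.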

Quantification of enzyme abundances in metagenomes

ShortBRED was used to quantify the abundance of the GREs in metagenomes (22). All ShortBRED computations were performed on the Odyssey cluster supported by the Faculty of Arts and Science (FAS) Division of Science Research Computing Group at Harvard University. First, ShortBRED-Identify was used to find markers for all of the sequences from the GRE SSN. UniRef90 was used as the reference list (51), and the markers generated were specific to sequences in the SSN and were absent from UniRef90. ShortBRED-Identify was run with the default parameters, with the exception of the “--threads” flag, which was increased to run effectively on the Odyssey cluster. With markers generated, ShortBRED-Quantify was then used to determine the abundance of the GREs in metagenomes generated as part of the HMP (5). We analyzed 378 high-quality, first-visit metagenomes from healthy human participants. The output from ShortBRED-Quantify was normalized to counts per microbial genome using previously computed average genome sizes for each sample (29). In addition to the HMP metagenomes, this analysis was repeated in the same manner with matched metagenomes and metatranscriptomes from eight individuals, except that the output was not normalized to counts per microbial genome (34). ShortBRED was also used to quantify the abundances of the B12-dependent diol dehydratases (IPR003206) in the HMP stool metagenomes in the same manner as it was used to quantify the abundances of the GREs, except that SSN analysis was not performed. This InterPro family contains the B12-dependent propanediol and glycerol dehydratases. Because both enzymes are known to dehydrate (S)-1,2-propanediol, we did not attempt to distinguish between them. Therefore, our values represent upper limits for PduC abundance.
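The normalization to counts per microbial genome can be sketched as follows. This Python example is an illustration under stated assumptions, not the procedure used in this study: it assumes abundances are reported as RPKM (reads per kilobase of marker per million reads) and that an average genome size (AGS) in base pairs is available for each sample. The conversion factor, sample names, and values are hypothetical.

```python
def copies_per_genome(rpkm, avg_genome_size_bp):
    """Convert an RPKM abundance to approximate gene copies per genome.

    RPKM = 1e9 * reads / (marker_length_bp * total_reads), and copies per
    genome = reads * AGS / (marker_length_bp * total_reads), so
    copies per genome = RPKM * AGS / 1e9 (an assumption of this sketch).
    """
    return rpkm * avg_genome_size_bp / 1e9

# Hypothetical per-sample values, for illustration only.
samples = {
    "sample_1": {"rpkm": 250.0, "ags_bp": 3.2e6},
    "sample_2": {"rpkm": 100.0, "ags_bp": 4.0e6},
}

normalized = {
    name: copies_per_genome(values["rpkm"], values["ags_bp"])
    for name, values in samples.items()
}
for name, copies in normalized.items():
    print(f"{name}: {copies:.2f} copies per genome")
```

Normalizing by AGS makes abundances comparable across samples whose communities differ in typical genome size, because a given read-level abundance corresponds to more copies per cell when genomes are larger.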

Code availability

The relevant scripts and instructions for performing “chemically guided functional profiling” with different SSNs or meta’omics data sets can be found at

Plasmid construction

The plasmids used in this study allowed for isopropyl-β-d-thiogalactopyranoside (IPTG)–inducible protein overexpression in Escherichia coli heterologous expression hosts. All plasmids were constructed with standard molecular biology techniques, including polymerase chain reaction, restriction enzyme digestion, ligation, Gibson assembly, and site-directed mutagenesis. Primers were purchased from Integrated DNA Technologies and are listed in table S1. All plasmid constructs were confirmed by DNA sequencing (Beckman Coulter Genomics). Genes encoding PD (UniProt ID: Q1A666) and PD-AE (UniProt ID: Q1A665) were amplified from R. inulinivorans DSM 16841 (DSMZ), and genes encoding t4lHypD (UniProt ID: A0A031WDE4), t4lHypD-AE (UniProt ID: A0A069AMK2), and P5C reductase (UniParc ID: UPI000235AE56) were amplified from C. difficile 70-100-2010 (BEI Resources).

Protein overexpression and purification

All recombinant proteins used in this study were individually overexpressed in E. coli strains [BL21(DE3) or BL21-CodonPlus(DE3)-RIL ΔproC::aac(3)IV], followed by purification by affinity chromatography for quantification of glycyl radical species by EPR, in vitro activity assays, and kinetics experiments. PD-AE was overexpressed in E. coli BL21(DE3) cotransformed with pPH149 encoding E. coli IscSUA-HscBA-Fd genes (52). All purified proteins were rendered anoxic prior to assays either by sparging or through repeated vacuum-refill cycles with argon as the inert gas.

Glycyl radical generation and quantification by EPR spectroscopy

PD and t4lHypD were activated by their partner activating enzymes in the presence of S-adenosylmethionine and 5-deazariboflavin or acriflavine, respectively. Glycyl radicals in activated samples were detected by EPR spectroscopy at 77 K and quantified using K2(SO3)2NO standards. Simulated spectra for glycyl radicals were obtained from experimental data using EasySpin (53), a MATLAB toolbox (MathWorks).

End-point enzymatic activity assays

PD and t4lHypD were first activated by their partner activating enzymes under the same conditions used for EPR studies. Activated GREs were incubated with their respective substrates under anaerobic conditions and at room temperature until quenching for product detection. Headspace GC-MS was used for the detection of propionaldehyde in PD activity assays. LC-MS/MS was used for the detection of proline in t4lHypD activity assays.

Coupled spectrophotometric assays for kinetics

The activity of PD was coupled to horse liver alcohol dehydrogenase (Sigma), and the activity of t4lHypD was coupled to P5C reductase for the reduction of respective products. Absorbance of NADH at 340 nm was recorded over time to calculate initial rates and kinetic parameters.
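The conversion of an absorbance trace into an initial rate can be sketched in Python as below. This is an illustrative calculation, not the authors' analysis script: it assumes a 1-cm path length, the standard molar absorptivity of NADH at 340 nm (6220 M⁻¹ cm⁻¹), and a hypothetical, perfectly linear trace. The function names are illustrative.

```python
NADH_EPSILON_340 = 6220.0  # M^-1 cm^-1, molar absorptivity of NADH at 340 nm

def linear_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def initial_rate_uM_per_s(times_s, a340, path_cm=1.0):
    """NADH consumption rate (uM/s) from the early, linear part of a trace.

    Uses the Beer-Lambert law: d[NADH]/dt = (dA/dt) / (epsilon * path).
    The sign is flipped because absorbance falls as NADH is oxidized.
    """
    slope = linear_slope(times_s, a340)            # absorbance units per second
    return -slope / (NADH_EPSILON_340 * path_cm) * 1e6

# Hypothetical trace: A340 falls by 0.0622 every 10 s.
times_s = [0.0, 10.0, 20.0, 30.0, 40.0]
a340 = [1.0000, 0.9378, 0.8756, 0.8134, 0.7512]

rate = initial_rate_uM_per_s(times_s, a340)
print(f"initial rate = {rate:.3f} uM NADH per second")
```

Initial rates measured this way at several substrate concentrations are what a Michaelis-Menten fit would take as input to estimate kcat and Km.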

Growth experiments and metabolite analyses

Terrisporobacter glycolicus DSM 1288 (DSMZ), Clostridium sporogenes ATCC 15579 (ATCC), Clostridium difficile 70-100-2010 (BEI Resources), and Clostridium sticklandii DSM 519 (DSMZ) were grown at 37°C under an atmosphere of 5% H2–95% N2. All media used for growth experiments in this study are modified from a previously reported phosphate- and carbonate-based medium with a minimal composition of amino acids (54). OD600 measurements of 5-ml cultures grown in Hungate tubes were taken until stationary phase. Hydroxyproline and proline content in spent media were quantified using LC-MS/MS.

Supplementary Materials

Materials and Methods

Figs. S1 to S20

Tables S1 to S5

References (55–88)

References and Notes

Acknowledgments: Financial support was provided by Harvard University, the Packard Fellowship for Science and Engineering (E.P.B.), the George W. Merck Fund (E.P.B.), the National Institutes of Health (U54DE023798) (C.H.), and the National Science Foundation (DBI-1053486 and EAGER 1453942) (C.H.). B.J.L. acknowledges support from the NSF Graduate Research Fellowship Program (DGE1144152), and Y.W. acknowledges support from the Agency for Science, Technology and Research (A*STAR) Singapore. The computations in this paper were run on the Odyssey cluster supported by the FAS Division of Science, Research Computing Group at Harvard University. We thank J. Stubbe for helpful advice and feedback on this manuscript. We also thank the Klinman lab (Univ. of California Berkeley) for providing the plasmid pPH149, J. Nicoludis (Gaudet lab, Harvard University) for providing P1 phage lysate, J. Wang for assistance with GC-MS experiments, K. Chatman for assistance with LC-MS/MS experiments, and H. Nakamura for help with the synthesis of 5-deazariboflavin. A protocol that allows users to perform chemically guided functional profiling with their own data sets is available at