Three-Dimensional Structural View of the Central Metabolic Network of Thermotoga maritima

Science  18 Sep 2009:
Vol. 325, Issue 5947, pp. 1544-1549
DOI: 10.1126/science.1174671


Metabolic pathways have traditionally been described in terms of biochemical reactions and metabolites. With the use of structural genomics and systems biology, we generated a three-dimensional reconstruction of the central metabolic network of the bacterium Thermotoga maritima. The network encompassed 478 proteins, of which 120 had structures determined by experiment and 358 were modeled. Structural analysis revealed that proteins forming the network are dominated by a small number (only 182) of basic shapes (folds) performing diverse but mostly related functions. Most of these folds are already present in the essential core (~30%) of the network, and its expansion by nonessential proteins is achieved with relatively few additional folds. Thus, integration of structural data with network analysis generates insight into the function, mechanism, and evolution of biological networks.

The advent of genome sequencing has enabled development of computational and experimental tools to investigate complete biological systems, but it has also highlighted the difficulty in integrating complex information for the hundreds to thousands of different molecules that compose even the smallest biological networks. Such integration presents many challenges, especially when assembling data from diverse fields, such as biochemistry and structural biology, that use different operational languages and conceptual frameworks. Biochemistry has traditionally focused on individual reactions and pathways, but recent advances in genomics have led to more rapid growth in the reconstruction and modeling of metabolic networks on a genome-wide scale (1–3). Thus, biochemical reactions, pathways, and networks can now be described in the context of entire cells, thereby enabling more realistic simulations of the behavior of metabolic networks in a growing number of organisms (4–7). Nevertheless, metabolism is still generally defined in terms of the chemical names and identity of substrates, products, and reactions. It does not explicitly consider the three-dimensional structures of its components, although such knowledge is required for a comprehensive understanding, not only of the individual reactions, but more importantly, of metabolic networks as a whole. Without such knowledge, we cannot rigorously define enzyme mechanisms or predict the effects of mutations or drugs; on the global level, we cannot understand the evolutionary relationships between different pathways, how new metabolic capabilities are acquired, and how individual organisms adapt to their particular ecological niches and respond to environmental pressures.

Such an understanding can be provided by structural biology, which has traditionally focused on individual proteins outside of their full, system-level, biological context. The emergence of large-scale structural genomics projects, such as the Protein Structure Initiative (8), has provided an exciting new opportunity for structural biology to contribute on a scale similar to that of genomics.

Thermotoga maritima, one of the first discovered hyper-thermophilic bacteria (9), represents the deepest known lineage of eubacteria (9, 10), has one of the smallest genomes for a free-living organism (11), and has been the subject of extensive experimental analysis (12, 13), making it an ideal model organism for systems biology and for integration of biochemical and structural approaches (14).

We constructed a metabolic model of T. maritima by a bottom-up approach, which first identified all known biochemical reactions and substrates from almost 150 publications (table S3), providing direct biochemical, genomic, and physiological evidence for more than 50% of the metabolic reactions. We then identified the remaining reactions from high confidence, homology-based annotation databases (15, 16) and from experimental or modeled protein structures (see below). We used flux balance analysis (17) to test the completeness of the network, revealing gaps, such as missing enzymes or redundant functional assignments, which were then resolved by manual curation for individual cases. We continued iterative evaluation of the network until its performance reproduced, in silico, the experimentally determined metabolic capabilities of T. maritima (tables S9 and S10) (18).

Our resulting metabolic reconstruction included 478 metabolic genes, 503 unique metabolites, and 562 intracellular and 83 extracellular metabolic reactions (18), and it reproduced T. maritima's ability to grow on diverse carbohydrates (table S9) and to produce known metabolic by-products, e.g., acetate and hydrogen. The overall scope, content, and quality of this metabolic reconstruction were comparable with state-of-the-art reconstructions for other model organisms (table S6). Although the current model does not yet provide an exhaustive description of T. maritima metabolism, it represents a major step in an iterative process of annotation and modeling of this organism.

The T. maritima metabolic reconstruction (mr) defines a specific set of proteins (mrTM) that carry out the biochemical functions that make up a self-sustaining, metabolic network. Of 478 proteins in this mrTM set, structures of 120 proteins have been determined experimentally (12), and 358 were predicted and modeled with a variety of computational approaches (18). The quality of the modeled structures spans the spectrum from those comparable to low-resolution, experimental structures (190 were built on templates with more than 30% identity to the targets) to very approximate (52 were based only on fold predictions). For three (TM1444, TM0788, and TM0540), the automated structure prediction approach failed, and approximate structures were modeled by combining several different fold prediction algorithms with manual refinement (18). Quality control, as based on public benchmarks in modeling and fold recognition, suggests high confidence that all models are correct at the fold-assignment level (18). Thus, these combined approaches allowed us to achieve complete structural coverage for the mrTM set (Fig. 1).

Fig. 1

Combining metabolic reconstruction and structural genomics approaches for an integrated annotation of the T. maritima central metabolic network. Underlying genomics information (Bottom) enabled both a metabolic reconstruction (Left) and an atomic-level structure determination/modeling of T. maritima proteins (Right). Integration of these two approaches enabled detailed information to be acquired for every reaction in the network (Top); an example from the T. maritima serine degradation pathway is illustrated (32).

The information from structural determination of T. maritima proteins and their homologs provided additional support for functional assignment of 181 individual genes. A total of 41 experimental structures of T. maritima proteins contained relevant metabolites, and 140 crystal structures (used as templates for homology modeling) were also determined as complexes with ligands, all of which support the functional assignment in the reconstruction. In at least two cases, TM0449 (19–22) and TM1643 (23), structural analysis was critical for identification of enzymatic function and, in many other cases, substantially contributed to assignment of function.

A metabolic reconstruction can be described not only in a matrix format that can be used directly for metabolic simulations to predict essential genes or growth rates, but also as a graph. Because the reconstruction represents a fully functional, cell-level model of a metabolic network, analysis of the topology of this graph allows us to answer many interesting questions, especially when combined with knowledge of structures or models for all proteins in the network. For instance, what is the dominant mechanism for expansion of a metabolic network in a single organism? In the “patchwork” hypothesis (24), network expansion is driven by recruitment of proteins that perform similar reactions but are present in distinct pathways. Conversely, in the “retrograde” hypothesis (25), proteins evolve, after duplication, to perform dissimilar reactions within the same pathway or neighboring part of the network. Analysis of fold conservation as a function of network topology, therefore, addresses this key issue. Similar analyses have been performed previously on a small set of known pathways (26, 27), but our integrated approach allowed us to analyze the complete set of pathways that form the fully functional, self-sustained metabolic network of a single organism.
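As a minimal sketch of the two representations mentioned above, the toy example below derives both a stoichiometric matrix and a reaction graph from the same reaction list. The three reactions, the metabolite names, and all function names are hypothetical and chosen only for illustration; they are not taken from the T. maritima reconstruction.

```python
# Toy network (hypothetical): uptake -> A, A -> B, B -> export.
# Negative coefficients are substrates, positive coefficients are products.
REACTIONS = {
    "R_uptake": {"A": 1},
    "R_convert": {"A": -1, "B": 1},
    "R_export": {"B": -1},
}


def stoichiometric_matrix(reactions):
    """Return (metabolites, reaction_ids, S), where S[i][j] is the
    coefficient of metabolite i in reaction j."""
    mets = sorted({m for stoich in reactions.values() for m in stoich})
    rxns = list(reactions)
    S = [[reactions[r].get(m, 0) for r in rxns] for m in mets]
    return mets, rxns, S


def is_steady_state(S, fluxes):
    """Check the mass-balance constraint S . v = 0 for a flux vector v."""
    return all(sum(c * v for c, v in zip(row, fluxes)) == 0 for row in S)


def reaction_graph(reactions):
    """Directed edge r1 -> r2 whenever a product of r1 is a substrate of r2."""
    edges = set()
    for r1, s1 in reactions.items():
        products = {m for m, c in s1.items() if c > 0}
        for r2, s2 in reactions.items():
            if r1 != r2 and any(s2.get(m, 0) < 0 for m in products):
                edges.add((r1, r2))
    return edges


mets, rxns, S = stoichiometric_matrix(REACTIONS)
balanced = is_steady_state(S, [2, 2, 2])    # equal flux through all steps
unbalanced = is_steady_state(S, [2, 1, 2])  # metabolite A would accumulate
edges = reaction_graph(REACTIONS)
```

The matrix form is what a constraint-based simulation operates on, whereas the edge set is the form suited to topological questions such as which reactions are adjacent in a pathway.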

We then established an automated protocol to classify metabolic reactions into three categories: (i) similar, (ii) connected, and (iii) unrelated (Fig. 2 and fig. S6). Enzymes that catalyze similar types of reactions have a sixfold higher probability of having the same fold than enzymes catalyzing connected reactions (Fig. 2C), supporting the patchwork hypothesis (24). However, it should be noted that proteins catalyzing connected reactions still have a higher chance of having the same fold than those catalyzing unrelated reactions, suggesting a role for gene duplication within pathways during pathway evolution (i.e., the retrograde model). More importantly, the patchwork hypothesis can account for only 11% of the observed structural similarity between mrTM proteins of similar function, indicating that convergent evolution of similar reaction mechanisms [i.e., nonhomologous gene displacement (28), where two nonhomologous proteins perform the same or similar metabolic function] is not a rare event and substantially contributes to evolution of the central metabolic network.

Fig. 2

Classification of metabolic reactions. (A) Examples of similar (S), connected (C), and unrelated (U) reactions from the arginine and lysine biosynthesis pathways. ArgB and LysC share a co-substrate [adenosine triphosphate (ATP)] that undergoes the same transformation [to adenosine diphosphate (ADP) + Pi]. Similarly, ArgC and Asd transform the reduced form of NADP+ (NADPH) to nicotinamide adenine dinucleotide phosphate (NADP+). By these criteria, both pairs are classified as similar. At the same time, reaction pairs ArgB/ArgC and LysC/Asd are adjacent in the pathway, because the product of the first reaction is the substrate for the next. These reaction pairs are classified as connected. All other pairs of reactions (ArgB/Asd and ArgC/LysC) are classified as unrelated. In this example, only the enzymes classified as similar (ArgB/LysC and ArgC/Asd) have the same fold. (B) Detailed information on the enzymes in (A). (C) Bars representing the relative number of pairs with the same fold in each category of reactions.
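The classification illustrated in Fig. 2A can be sketched in a few lines. This is not the authors' automated protocol (which is described in the supporting material); it is a simplified illustration that assumes a fixed list of co-substrates, simplified reaction equations, and a precedence rule in which "similar" is tested before "connected". All of these choices are assumptions made for the example.

```python
from itertools import combinations

# Assumed list of co-substrates ("currency" metabolites); illustrative only.
CURRENCY = {"ATP", "ADP", "Pi", "NADPH", "NADP+"}

# Simplified reaction equations for the four enzymes in Fig. 2A (assumed).
REACTIONS = {
    "ArgB": {"substrates": {"N-acetylglutamate", "ATP"},
             "products": {"N-acetylglutamyl phosphate", "ADP"}},
    "ArgC": {"substrates": {"N-acetylglutamyl phosphate", "NADPH"},
             "products": {"N-acetylglutamate semialdehyde", "NADP+", "Pi"}},
    "LysC": {"substrates": {"aspartate", "ATP"},
             "products": {"aspartyl phosphate", "ADP"}},
    "Asd": {"substrates": {"aspartyl phosphate", "NADPH"},
            "products": {"aspartate semialdehyde", "NADP+", "Pi"}},
}


def cosubstrate_change(rxn):
    """Co-substrates consumed and produced by a reaction."""
    return (frozenset(rxn["substrates"] & CURRENCY),
            frozenset(rxn["products"] & CURRENCY))


def classify(a, b):
    """Label a reaction pair as similar, connected, or unrelated."""
    change_a, change_b = cosubstrate_change(a), cosubstrate_change(b)
    # Similar: the same co-substrate undergoes the same transformation.
    if change_a == change_b and change_a[0]:
        return "similar"
    # Connected: a main product of one reaction is a main substrate of the other.
    a_to_b = (a["products"] - CURRENCY) & (b["substrates"] - CURRENCY)
    b_to_a = (b["products"] - CURRENCY) & (a["substrates"] - CURRENCY)
    if a_to_b or b_to_a:
        return "connected"
    return "unrelated"


labels = {
    (x, y): classify(REACTIONS[x], REACTIONS[y])
    for x, y in combinations(REACTIONS, 2)
}
```

Under these assumptions the six pairs fall into the same categories as in the caption: ArgB/LysC and ArgC/Asd are similar, ArgB/ArgC and LysC/Asd are connected, and ArgB/Asd and ArgC/LysC are unrelated.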

Another interesting question is the distribution and frequency of protein folds in this mrTM set. The 478 proteins contain 714 domains, but only 182 distinct folds, which is significantly fewer than would be expected (~300) for an equivalent random set of proteins with known structures (fig. S8). The surprisingly small number of folds arises from the fact that the most popular folds [e.g., the P-loop, triosephosphate isomerase (TIM) barrel, and Rossmann folds] are overrepresented as compared with their frequency in the general protein population (Fig. 3). Some relatively rare folds, abundant in the mrTM set, such as the biotin synthetase and the thiamin diphosphate binding folds, include groups of enzymes that perform specific but essential functions, such as tRNA aminoacylation or carbon metabolism.
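One way to picture the comparison between the observed and expected number of distinct folds is a small Monte Carlo sketch: draw the same number of domains at random from a background fold frequency distribution and count how many distinct folds appear. The background distribution below is invented (1000 folds with Zipf-like weights), and the function name and parameters are hypothetical; this is not the procedure behind fig. S8.

```python
import random


def expected_distinct_folds(background, n_domains, n_trials=200, seed=0):
    """Mean number of distinct folds observed when n_domains are drawn
    at random (with replacement) according to background frequencies."""
    rng = random.Random(seed)
    folds = list(background)
    weights = [background[f] for f in folds]
    total = 0
    for _ in range(n_trials):
        sample = rng.choices(folds, weights=weights, k=n_domains)
        total += len(set(sample))
    return total / n_trials


# Hypothetical background: 1000 folds, the i-th with relative frequency 1/i.
background = {f"fold_{i}": 1.0 / i for i in range(1, 1001)}

# 714 is the number of domains reported for the mrTM set.
expected = expected_distinct_folds(background, n_domains=714)
```

A set that draws more heavily on the most frequent folds than the background does would show fewer distinct folds than this random expectation, which is the pattern reported for the mrTM set.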

Fig. 3

Distribution of folds in the mrTM protein set with the most overrepresented folds, illustrated by structural ribbon diagrams. Fold codes from the Structural Classification of Proteins (SCOP) database (33) are shown on the x axis with the observed frequency on the y axis. The expected frequency for each fold in the NCBI nonredundant database (31) is shown as a magenta trace. TIM, triosephosphate isomerase.

The most obvious interpretation of this skewed fold distribution is that the mrTM set, which covers the most fundamental protein functions, consists of the most ancient and, thus, the most abundant protein families. To probe this interpretation further, we analyzed the fold distribution for the core of the T. maritima metabolic network, as represented by the set of essential proteins. We identified essential proteins by a reductive evolution simulation approach (18, 29), where iterative simulations are performed to identify a minimal network by randomly eliminating genes from the model until additional elimination would result in a nonviable network. Each simulation led to a different minimal network, of size anywhere between 200 and 300 genes (i.e., corresponding to 42 to 63% of the mrTM set). Statistical analysis of 1000 such minimal networks in independent simulations in glucose minimal medium (18) allowed the classification of genes from the mrTM set into three categories: (I) core- or unconditional-essential genes that are always present, (II) nonessential genes that never appear, and (III) “synthetic lethal” or “conditional-essential” genes (30) that appear only in some simulations, but not in others, depending on which other genes are removed or retained in a particular network minimization. For example, if two genes have the same essential function, the deletion of either gene would not be lethal, but at least one gene has to be present in the minimal network. The frequency of such genes in multiple simulations reflects the topology of the network and the relative redundancy of gene functions in the network. It is important to emphasize that the core-essential genes would not be sufficient to maintain a viable metabolic network, as all of the many possible minimal networks contain constant (core-essential genes) and variable (subset of conditionally essential genes) components. 

The mrTM set consists of 177 core-essential, 203 nonessential, and 98 conditional-essential genes. Proteins in these three sets have very different fold distributions (Fig. 4). The number of folds in the core-essential group is surprisingly large for its sample size (111 folds for 177 proteins) as compared with the nonessential group, which contains more proteins but a smaller number of folds (92 folds for 203 proteins). This trend is inverse to that observed when mrTM is compared with nonredundant sequences in the National Center for Biotechnology Information (NCBI) database (31) (fig. S8), where the mrTM set was more abundant in popular folds. These analyses suggest that core-essential proteins perform unique chemical functions that are strongly associated with specific folds and are so fundamental that their deletion would result in a nonviable network.
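The reductive evolution procedure and the resulting three-way classification can be sketched as follows. The toy model here is an assumption made for illustration: viability is decided by a simple rule (every required function must keep at least one gene), which stands in for the model simulations used in the study, and the gene and function names are hypothetical.

```python
import random

# Hypothetical toy model: each required function lists the genes able to perform it.
REQUIRED_FUNCTIONS = {
    "f1": {"gA"},        # only one gene can do this
    "f2": {"gB", "gC"},  # two redundant genes
}
ALL_GENES = ["gA", "gB", "gC", "gD"]  # gD performs no required function


def is_viable(genes):
    """Stand-in viability test: every required function keeps a gene."""
    return all(genes & providers for providers in REQUIRED_FUNCTIONS.values())


def minimal_network(rng):
    """Remove genes in random order, keeping any whose loss is lethal.
    One pass suffices here because removing genes never restores viability."""
    genes = set(ALL_GENES)
    order = ALL_GENES[:]
    rng.shuffle(order)
    for gene in order:
        if is_viable(genes - {gene}):
            genes.discard(gene)
    return genes


def classify_genes(n_runs=1000, seed=0):
    """Count how often each gene survives and label it accordingly."""
    rng = random.Random(seed)
    counts = {gene: 0 for gene in ALL_GENES}
    for _ in range(n_runs):
        for gene in minimal_network(rng):
            counts[gene] += 1
    labels = {}
    for gene, count in counts.items():
        if count == n_runs:
            labels[gene] = "core-essential"
        elif count == 0:
            labels[gene] = "nonessential"
        else:
            labels[gene] = "conditional-essential"
    return labels, counts


labels, counts = classify_genes()
```

In this toy model gA is retained in every minimal network, gD in none, and exactly one of the redundant pair gB and gC survives each run, which mirrors the example of two genes sharing one essential function.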

Fig. 4

Fold composition of the nonessential, synthetic lethal, and core-essential protein sets (see text for details) illustrated by colors associated with different folds (see fig. S9 for details). The x axis represents the number of simulations that resulted in identification of core-essential (1000 appearances in 1000 simulations), synthetic lethal (from 999 to 1), and nonessential genes (0); and the y axis indicates their classification into SCOP fold categories. (Inset) Cumulative fold coverage of core-essential and nonessential protein sets (blue, core-essential; magenta, nonessential). The fold distribution differs among all three groups, although the core-essential and nonessential sets are weakly similar to each other, more so than either is to the synthetic lethal set.

We have presented here the integration of a metabolic and structural view of the central metabolic network of the thermophilic bacterium T. maritima. Achieving a complete description on these two levels is an important milestone that now enables large-scale analyses, such as the network-scale comparison of correlations between fold conservation and biochemical function. From our study, not only can we provide a quantitative estimate of the dominance of the patchwork model (24) versus the retrograde model (25) of metabolic evolution, but we can also illustrate the importance of convergent or parallel evolution in proteins carrying out similar biochemical functions. Furthermore, we show that the set of proteins responsible for the central metabolism in T. maritima is highly nonrandom and dominated by a small number of folds that significantly exceed their already dominant distribution in the protein universe, suggesting that the central metabolism network has evolved mainly from a set of the most ancient proteins that have had sufficient time to develop divergent functionalities and, hence, expand into the very large and very diverse protein families that we observe today. At the same time, the subset of core-essential proteins reverses this trend and is relatively more diverse than an equivalent subset of nonessential proteins. This counterintuitive situation is attributable to the presence of some specific folds with functions that are so unique that it is impossible to replace them with other existing folds.

Supporting Online Material

Materials and Methods

Figs. S1 to S9

Tables S1 to S13


Metabolic reconstruction in SBML and MATLAB formats

  • * These authors contributed equally to this work.

  • Present address: Center of Systems Biology, University of Iceland, IS-101 Reykjavik, Iceland.

References and Notes

  1. Materials and methods are available as supporting material on Science Online.
  2. TM0449 is a flavin adenine dinucleotide–dependent thymidylate synthase, and our structure has contributed to new developments in functional studies of this and related proteins [see (21, 22) and references therein].
  3. We specifically acknowledge the invaluable work of individual crystallographers at the JCSG and other Protein Structure Initiative (PSI) centers, as well as individual research groups, who have solved structures analyzed here, either directly or that we used as modeling templates. The full list of these proteins is provided in the supporting online material. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences (NIGMS). This work was supported by the NIH PSI grants P20 GM076221 (JCCM) and U54 GM074898 (JCSG) from the NIGMS; grant DE-FG02-08ER64686 from the Office of Science (Biological and Environmental Research), U.S. Department of Energy; and the Gordon and Betty Moore Foundation CAMERA project.