An evolution-based model for designing chorismate mutase enzymes

See allHide authors and affiliations

Science  24 Jul 2020:
Vol. 369, Issue 6502, pp. 440-445
DOI: 10.1126/science.aba3304

Learning from evolution

Protein sequences contain information specifying their three-dimensional structure and function, and statistical analysis of families of sequences has been used to predict these properties. Building from sequence data, Russ et al. used statistical models that take into account conservation at amino acid positions and correlations in the evolution of pairs of amino acids to predict new artificial sequences that will have the properties of the protein family. For the chorismate mutase family of metabolic enzymes, the authors demonstrate experimentally that the artificial sequences display natural-like catalytic function. Because the models access an enormous space of diverse sequences, such evolution-based statistical approaches may guide the search for functional proteins with altered chemical activities.

Science, this issue p. 440


The rational design of enzymes is an important goal for both fundamental and practical reasons. Here, we describe a process to learn the constraints for specifying proteins purely from evolutionary sequence data, design and build libraries of synthetic genes, and test them for activity in vivo using a quantitative complementation assay. For chorismate mutase, a key enzyme in the biosynthesis of aromatic amino acids, we demonstrate the design of natural-like catalytic function with substantial sequence diversity. Further optimization focuses the generative model toward function in a specific genomic context. The data show that sequence-based statistical models suffice to specify proteins and provide access to an enormous space of functional sequences. This result provides a foundation for a general process for evolution-based design of artificial proteins.

Approaches for protein design typically begin with atomic structures and physical models for forces between atoms, but the marked expansion of protein sequence databases and the growth of new computational methods (14) have opened new strategies to this problem. Statistical analyses of homologs comprising a protein family have recently enabled successful prediction of protein structure (58), protein-protein interactions (913), and mutational effects (1418) and, for a family of small protein interaction modules, have led to the successful design of artificial amino acid sequences that fold and function in a manner similar to their natural counterparts (1921). These findings indicate that sequence-based statistical models may represent a general approach for a purely data-driven strategy for the design of complex proteins that can work in vivo, in specific organismal contexts. A first key step is to demonstrate the sufficiency of these models for specifying functional proteins.

An approach for evolution-inspired protein design is shown in Fig. 1, A to C, and is based on direct coupling analysis (DCA), a method originally conceived to predict contacts between amino acids in protein three-dimensional (3D) structures (1). The starting point is a large and diverse multiple sequence alignment (MSA) of a protein family, from which we estimate the observed frequencies (fia) and pairwise co-occurrences (fijab) of all amino acids (a,b) at positions (i,j), the first- and second-order statistics (Fig. 1A). From these quantities, we infer a model comprising a set of intrinsic amino acid propensities [fields hi(a)] and a set of pairwise interactions [couplings Jij(a,b)] that optimally account for the observed statistics (Fig. 1A). This model defines a probability P for each amino acid sequence (a,,aL) of length L: P(a1,,aL)exp[H(a1,,aL)](1)with the Hamiltonian H(a1,,aL)=ihi(ai)i<jJij(ai,aj) representing a statistical energy (which provides a quantitative log-likelihood score for each sequence; see materials and methods). Lower energies are associated with higher probability. Monte Carlo (MC) sampling from the model allows for generating nonnatural sequence repertoires (Fig. 1B), which can then be screened for desired functional activities (Fig. 1C). If positional conservation and pairwise correlation suffice in general to capture the information content of protein sequences and if the model inference is sufficiently accurate, then the artificial sequences should recapitulate the functional diversity and properties of natural proteins.

Fig. 1 Evolutionary data-driven protein engineering.

(A) MSA of M natural homologs provides empirical first- and second-order statistics of amino acids (fia,fijab), which are used to infer a statistical model with the bmDCA method. The probability of sequence a=(a1,,aL) is an exponential function of a Hamiltonian, or statistical energy, parameterized by intrinsic fields hi(a) and couplings Jij(a,b) acting on amino acids. (B and C) The model is used to generate NM artificial sequences that can be tested in a high-throughput assay for desired functions. (D) CM is an enzyme occurring at the central branch point in the shikimate pathway that leads to the synthesis of Tyr and Phe. (E) Members of the AroQα and AroQδ families of CMs fold into a domain-swapped dimer (PDB ID 1ECM). Active site residues are shown with yellow stick bonds and arise from both subunits (dark and light blue). A bound substrate analog is shown in magenta.

To test this process, we chose the AroQ family of chorismate mutases (CMs), a classic model for understanding principles of catalysis and enzyme design (2224). These enzymes occur in bacteria, archaea, fungi, and plants and operate at the central branch point of the shikimate pathway leading to the biosynthesis of tyrosine (Tyr) and phenylalanine (Phe) (Fig. 1D). CMs catalyze the conversion of the intermediary metabolite chorismate to prephenate through a Claisen rearrangement, leading to more than a million-fold acceleration of the reaction rate (25), and are necessary for growth of bacteria in minimal media. For example, Escherichia coli strains lacking a CM are auxotrophic for Tyr and Phe, with both the degree of supplementation of these amino acids and the expression level of a reintroduced CM gene quantitatively determining the growth rate (23). Structurally, AroQα subfamily members exemplified by EcCM, the CM domain of the CM-prephenate dehydratase from E. coli, form a domain-swapped dimer of relatively small protomers (~100 amino acids) (Fig. 1E) (26, 27). Their size, essentiality for bacterial growth, and the existence of good biochemical assays make AroQ CMs an excellent target for testing the power of statistical models inferred from MSAs.

We used DCA to make a statistical model (Eq. 1) for an alignment of 1259 natural AroQ protein domains that broadly encompasses the diversity of bacterial, archaeal, fungal, and plant lineages. Deducing the exact parameters (hi,Jij) from the observed statistics in the MSA (fi,fij) for any protein is computationally intractable, but a number of approximation algorithms are available (1). Here, we used bmDCA, a computationally quite expensive but highly accurate method based on Boltzmann machine learning (28). For example, sequences generated by MC sampling from the model reproduced the empirical first- and second-order statistics of natural sequences used for fitting (Fig. 2, A and B). We also found that the model recapitulates higher-order statistical features in the MSA that were never used in inferring the model (see materials and methods). These features include three-way residue correlations (Fig. 2C) and the inhomogeneously clustered phylogenetic organization of the protein family in sequence space (Fig. 2D). Our findings suggest that the statistical model goes beyond just being a good fit to the first- and second-order statistics to capture the essential rules governing the divergence of natural CM sequences through evolution. By contrast, a simpler profile model that retains only the intrinsic propensities of amino acids at sites [hi(a), fitted to reproduce frequencies of amino acids fia] (fig. S1A) but leaves out pairwise couplings fails to reproduce even the second-order statistics of the MSA (fig. S1B) and does not account well for the pattern of sequence divergence in the natural CM proteins (fig. S1D).

Fig. 2 Design and testing of artificial CM sequences.

(A to C) 2D histograms showing the relationship of first- (A), second- (B), and third-order (C) statistics of natural and bmDCA-designed sequences. The color scale indicates the number of counts per bin. (D) The top two principal components of the pairwise sequence distance matrix of natural homologs (blue circles) overlaid with a projection of artificial CM sequences (black circles); the position of EcCM from E. coli is marked with a red plus sign. Artificial sequences both recapitulate data used for fitting (A and B) and also account for statistical features of natural data not used for fitting (C and D). (E) Workflow for functional characterization of CM activity. CM-deficient E. coli cells carrying libraries of variants were grown under selective conditions in minimal medium, after which we performed deep sequencing of input and selected populations and calculation of the r.e. of each variant. (F) The relationship of r.e. to catalytic power [log10(kcat/Km)] for a number of CM variants yields a “standard curve.” The assay shows a hyperbolic relationship over the range from complete lack of CM activity to wild-type EcCM activity.

These findings raise the possibility that bmDCA may be a generative model, meaning that natural sequences and sequences sampled from the probability distribution P(a) are, despite considerable divergence, equivalent. To test this, we developed a high-throughput, quantitative in vivo complementation assay to monitor CM activity in E. coli that is suitable for studying large numbers of natural and designed CMs in a single experiment. Briefly, libraries of CM variants (natural and/or artificial, see below) were made using a custom de novo gene synthesis protocol that is capable of fast and relatively inexpensive assembly of novel DNA sequences at large scale (29) (see materials and methods). For example, we made gene libraries comprising nearly every natural CM homolog in the MSA (1130 out of 1259 total) and more than 1900 artificial variants by exploring various design parameters of the bmDCA model (see materials and methods). These libraries were expressed in a CM-deficient bacterial host strain (KA12/pKIMP-UAUC) (23), and all transformants were grown together as a single population in selective medium lacking Phe and Tyr to select for those variants exhibiting CM activity (Fig. 2E). Deep sequencing of the population before and after selection allowed us to compute the log frequency of each allele relative to that of wild-type EcCM. This quantity is called the relative enrichment (r.e.), which under particular conditions of gene induction, growth time, and temperature quantitatively and reproducibly reports the catalytic CM activity (Fig. 2F and fig. S2). This “select-seq” assay is monotonic over a broad range of catalytic power and serves as an effective tool to rigorously compare large numbers of natural and artifical variants for functional activity in vivo, in a single internally controlled experiment.

The first study examined the performance of the natural CM homologs in the select-seq assay as a positive control for bmDCA-designed sequences. Natural sequences showed a unimodal distribution of bmDCA statistical energies centered close to the value of EcCM (defined as zero, Fig. 3A), but it was not obvious how they would function in the particular E. coli host strain and experimental conditions used in our assay. For example, members of the CM family may vary in unknown ways with regard to activity in any particular environment, and the MSA includes some fraction of paralogous enzymes that carry out a related but distinct chemical reaction (30, 31). The select-seq assay showed that the 1130 natural CM homologs exhibit a bimodal distribution of complementation in the assay, with one mode comprising ~38% of the sequences centered close to the level of wild-type EcCM and the remainder comprising a mode centered close to the level of the null allele (Fig. 3B). A green fluorescent protein–tagged version of the library suggests that the bimodality of complementation is not obviously related to differences in expression levels compared to the E. coli variant (fig. S3); instead, the bimodality presumably originates from nonlinearities linking sequence to growth rate in the host strain and from the inclusion of some functionally distinct paralogous sequences. For the purpose of this study, the bimodality allows normalization of r.e. scores by the two modes and by Gaussian mixture modeling to meaningfully group sequences into those that are functionally either like wild-type EcCM (norm. r.e. > 0.42) or like the null allele in our assay (Fig. 3B). The standard curve shows that this quantity is a stringent test of high CM activity (Fig. 2F).

Fig. 3 Functional analysis of natural and artificial CMs.

(A) Distribution of bmDCA statistical energies for 1130 tested natural AroQ homologs, relative to the value for EcCM. The data show a unimodal distribution centered close to EcCM. (B) The distribution of functional complementation by natural AroQ sequences is bimodal, with ~38% of sequences in one mode near that of EcCM and the rest in another mode close to the r.e. of the null allele. The bimodality is used to normalize the raw r.e. scores between zero and the mean of the near-null mode for all libraries in panels B, F to H, and J. (C to E) Distributions of statistical energies for artificial sequences. (F to H) Distributions of corresponding r.e. values sampled at T{0.33,0.66,1}, respectively. (I and J) Distributions of statistical energy and r.e. for artificial sequences retaining the intrinsic propensities of amino acids at positions but leaving out all correlations. Taken together, the data show that bmDCA is a generative model.

To evaluate the generative potential of the bmDCA model, we used MC sampling to randomly draw sequences from the model that span a wide range of statistical energies relative to the natural MSA (Fig. 3, C to E), with the hypothesis that sequences with low energy (i.e., high probability) may be functional CMs. To sample sequences with low energy, we introduced a formal computational “temperature” of T1 in our model:PT(a,,aL)exp[H(a1,aL)/T]which, in exact analogy to temperature in statistical physics, serves to decrease the mean energy when set to values below unity. For example, sampling at T{0.33, 0.66} produced sequences with statistical energies that closely reflected the natural distribution (Fig. 3D) or reached even lower values (Fig. 3C). By contrast, sequences sampled at T=1 showed a broad distribution of statistical energies that deviated significantly from the natural distribution (Fig. 3E) toward higher energies. This deviation is, among other factors, due to statistical adjustments [regularization (see materials and methods)] used during model inference for compensating for the limited sampling of sequences in the input MSA.

We made and tested libraries of 493 to 615 artificial sequences sampled at T{0.33, 0.66, 1.0} from bmDCA models inferred at two regularization strengths (Fig. 3, F to H). The data showed that, overall, these sequences also displayed a bimodal distribution of complementation, with many complementing function near the level of the wild-type EcCM sequence. Consistent with our hypothesis, the probability of complementation is well-predicted by the bmDCA statistical energy, with low-energy sequences drawn from T{0.33, 0.66} essentially recapitulating or even somewhat exceeding the performance of natural sequences (Fig. 3, F and G). By contrast, sequences drawn from T=1 showed poor performance, consistent with deviation in bmDCA statistical energies (Fig. 3H). Overall, 481 artificial sequences out of 1618 total (~30%, norm r.e. > 0.42) rescued growth in our assay, comprising a range of top-hit identities to any natural CM of 42 to 92% (table S1 and fig. S4, A and B). These included 46 sequences with <65% identity to proteins in the MSA, corresponding to at least 33 mutations away from the closest natural counterpart. Sequence divergence from EcCM ranged from 14 to 82% (fig. S4, C and D). A representation of the positions in the EcCM protein that contribute most to the bmDCA statistical energy highlights residues distributed both within the active site and extending through the AroQ tertiary structure to include the dimer interface (blue spheres, Fig. 4F).

Fig. 4 General and system-specific constraints in CM.

(A) The overall relationship of bmDCA statistical energies and catalytic function, with functional sequences in black and nonfunctional sequences in blue. The data expose a strong nonlinear relationship, with functional sequences strictly below a sharp threshold (EDCA<50). This value is the limit of statistical energies for natural homologs (Fig. 3A). (B) The two top principal components of sequence variation in natural homologs, with sequences complementing the E. coli CM auxotroph in black. (C) The same as for panel b, but for artificial sequences, showing a similar pattern. Sequences chosen for in-depth biochemical characterization are indicated in panels B and C (see Table 1). (D and E) The r.e. distributions for all artificial sequences with EDCA<50 (D) or with an additional statistical constraint derived from the pattern of rescue of natural homologs (P(x=1|a); see text for details) (E). The additional constraint identifies sequences functional in E. coli in our selection assay. (F) Spatial architecture of functional constraints in CM enzymes mapped onto the EcCM structure. Blue spheres show positions constrained in the bmDCA model. Yellow spheres show the extra constraints required for E. coli–specific function, highlighting a peripheral shell around active site residues important for CM catalysis.

Are the abilities of artificial CM sequences to rescue just a function of their sequence distance from their natural counterparts? To test this, we made 326 sequences with the same distribution of top sequence identities as bmDCA-designed sequences but preserving only the first-order statistics (position-specific conservation) and leaving out correlations. These sequences expectedly showed high bmDCA energies and displayed no complementation at all (Fig. 3, I and J). Thus, enzyme function is not simply about the magnitude of sequence variation and not even about conservation of sites taken independently; instead, it fundamentally depends on the pattern of correlations imposed by the couplings in Jij (see Eq. 1).

We selected five natural and five artificial CMs that complement growth in our assay for in-depth biochemical studies. The natural sequences occur in organisms representing a broad range of phylogenetic groups and diverse environments, and the artificial sequences were chosen to sample regions that reflect this diversity (Fig. 4, B and C). The selected CM genes were expressed in an E. coli strain that is deficient for endogenous CM to eliminate any possibility of contamination with the wild-type enzyme, and catalytic parameters of the purified proteins were determined using a spectrophotometric assay following the consumption of the substrate chorismate (23). The data showed that all 10 CM genes were expressed similarly and that the natural CMs displayed catalytic parameters similar to those of the previously characterized E. coli (32) and Methancoccus jannaschii (33) enzymes (Table 1). Consistent with their natural-like complementation, the artificial CMs showed catalytic parameters that closely recapitulated those of natural CMs. Thus, we conclude that the bmDCA-designed CMs are bona fide synthetic orthologs of the CM family.

Table 1

Biochemical properties of natural and artificial CM enzymes. Construct numbers correspond to the numbering presented in tables S1 to S3, which provide additional information about these CM proteins.

View this table:

Putting all the data together, we found a steep relationship between bmDCA statistical energy and CM activity (Fig. 4A and fig. S5). Forty-five percent of artificial sequences rescued the CM null phenotype when the statistical energy was below a threshold value set by the distribution of statistical energies observed for natural sequences (EDCA<50) (Fig. 4A), and essentially no sequences (<3%) were functional above this value. Thus, bmDCA infers an effective generative model, capable of designing natural-like enzymatic activity with considerable sequence diversity if statistical energies are within the range of natural homologs. The extent of sequence variation from natural homologs highlights the sparsity of the essential constraints on folding and biochemical function.

The bmDCA model captures the overall statistics of a protein family and does not focus on specific functional activities of individual members of the family. Thus, just like natural CM homologs, most bmDCA-designed sequences do not complement function under the specific conditions of our assay (Fig. 4A). But, might it be possible to improve the generative model to deduce the extra information that makes a protein sequence optimal for a specific phenotype? A bit of insight comes from the study of how sequences that rescue function in our assay occupy the sequence space spanned by natural CM sequences. Natural CMs that complement function in our E. coli host strain are distributed in several diverse clusters (Fig. 4B), but functional artificial sequences also follow the same pattern (Fig. 4C). This suggests that information about CM function in the specific context of the E. coli assay conditions exists in the statistics of natural sequences and might be learned. If so, knowledge gained in the first experimental trial might be added to formally train a classifier to predict artificial sequences that encode particular protein phenotypes and organismal environments.

To test this idea, we annotated the sequences in the natural MSA with a binary value x indicating their ability to function in our assay (x=1 if functional, x=0 if not). Due to its formal similarity to the DCA framework, we used logistic regression on the annotated MSA to develop a model providing a probability for any sequence a=(a1,,aL) to function in the E. coli select-seq assay; that is, P(x=1|a). Figure 4, D and E, shows that for low-energy CM-like artificial sequences sampled from the naïve, unsupervised bmDCA model (Fig. 4D), the extra condition P(x=1|a)>0.8 efficiently predicted the subset that complements in the context of our assay (83%) (Fig. 4E). These results support an iterative design strategy for specific protein phenotypes in which the bmDCA model is updated with each round of selection to optimize desired phenotypes.

What structural principles underlie the general constraints on CM function and the extra constraints for system-specific function? Mapping the positions that contribute most significantly to E. coli–specific function of CM sequences showed an arrangement of amino acids peripheral to the active site, within a poorly conserved secondary shell around active site positions (Fig. 4F). Thus, these positions work allosterically or otherwise indirectly to control catalytic activity, a mechanism to provide context-dependent fine-tuning of reaction parameters.

The results described here validate and extend the concept that pairwise amino acid correlations in practically available sequence alignments of protein families suffice to specify protein folding and function (19, 20). The bmDCA model is one approach to capture these correlations, but there is more work to be done to fully understand these models. Currently, the interpretation of DCA is focused on the relatively few highest-magnitude terms in the matrix of couplings (Jij), because these terms identify direct structural contacts between amino acids in protein tertiary structures (34). Indeed, the top terms in Jij for CMs do nicely correspond to contacts in the tertiary structure (fig. S6). However, contact terms in Jij alone do not suffice to reproduce either the alignment statistics in the AroQ family (fig. S7) or the functional effects of mutations (17). Instead, function in proteins seems to depend also on many weaker, noncontacting terms in Jij that currently have no simple physical interpretation. Similar findings have been made in the case of predicting protein-protein interaction specificity (35). The weaker terms in Jij seem to describe the collective evolution of amino acids within the structure, a property that may be related to patterns elucidated by other approaches to sequence coevolution (1, 3640). A key next goal is to further refine the topology and best representation of sequence correlations that underlie the physics of protein structure and function.

However, even pending these necessary refinements, the data presented here provide the foundations for a general data-driven approach to protein engineering. This approach is similar to directed evolution in that it works without the use of physics-based potentials or atomic structures, but the computational models access a sequence space of functional proteins that is vastly larger than currently understood. Indeed, the results presented here permit a lower-bound estimate of the size of the sequence space consistent with the evolutionary rules for specifying members of a protein family. For example, at T=0.66, we conservatively compute a total space of 1025 sequences that could be synthetic homologs of the AroQ family (see materials and methods). Given that ~30% of sequences randomly sampled from this pool rescue CM function under our assay conditions, this amounts to more than 1024 sequences that can operate in a specific genomic and experimental context. These numbers are enormous in absolute terms but are infinitesimally unlikely in a sequence space searched without any model (~10125) or with models capturing only first-order constraints (1085; see materials and methods). These considerations suggest that it will be of great interest to use evolution-based statistical models to guide the search for functional proteins with altered or even novel chemical activities.

Supplementary Materials

Materials and Methods

Figs. S1 to S7

Tables S1 to S3

References (4153)

References and Notes

Acknowledgments: We thank O. Rivoire, C. Nizak, A. Ferguson, C. Feinauer, G. Uguzzoni, A. Pagnani, and members of the Ranganathan lab for discussions, A. George for computational work, and the high-performance computing (BioHPC) and flow cytometry facilities at UT Southwestern Medical Center. Funding: This work was supported by NIH grant RO1GM12345 (R.R.), Robert A. Welch Foundation grant I-1366 (R.R.), a Data Science Discovery award from the University of Chicago Center for Data and Computing (R.R.), the Green Center for Systems Biology at UT Southwestern Medical Center (R.R.), the EU H2020 Research and Innovation Programme MSCA-RISE-2016, Grant Agreement 734439 (InferNet) (M.W.), grant number CE30-0021-01 RBMpro from the Agence Nationale de la Recherche (R.M. and S.C.), and Swiss National Science Foundation grants 310030M_182648 (P.K.) and 310030B_176405 (D.H.). Author contributions: W.P.R., M.W., R.M., S.C., and R.R. conceived the idea for the project; M.F., P.B.-C., S.C., and M.W. built the computational models; W.P.R., with contributions from P.K. and D.H., developed the in vivo CM assay; W.P.R. built the synthetic gene libraries and carried out the in vivo assays; C.S., with contributions from M.S., P.K., and D.H., obtained the kinetic parameters in Table 1; S.C. and R. M. carried out the sequence entropy calculations; and W.P.R., M.W., and R.R. wrote the paper with contributions from all authors. Competing interests: R.R. is a cofounder and scientific advisor of Evozyne and is an inventor on a patent application filed by the University of Chicago on technologies for data-driven molecular engineering. Data and materials availability: Materials are available from the authors by request, and codes for bmDCA are available at and at Zenodo (41).

Correction (3 December 2020): A new reference 21 has been added, and the citations have been renumbered accordingly.

Stay Connected to Science


Navigate This Article