Research Article

De Novo Protein Design: Fully Automated Sequence Selection

See allHide authors and affiliations

Science  03 Oct 1997:
Vol. 278, Issue 5335, pp. 82-87
DOI: 10.1126/science.278.5335.82


The first fully automated design and experimental validation of a novel sequence for an entire protein is described. A computational design algorithm based on physical chemical potential functions and stereochemical constraints was used to screen a combinatorial library of 1.9 × 1027 possible amino acid sequences for compatibility with the design target, a ββα protein motif based on the polypeptide backbone structure of a zinc finger domain. A BLAST search shows that the designed sequence, full sequence design 1 (FSD-1), has very low identity to any known protein sequence. The solution structure of FSD-1 was solved by nuclear magnetic resonance spectroscopy and indicates that FSD-1 forms a compact well-ordered structure, which is in excellent agreement with the design target structure. This result demonstrates that computational methods can perform the immense combinatorial search required for protein design, and it suggests that an unbiased and quantitative algorithm can be used in various structural contexts.

Significant advances have been made toward the design of stable, well-folded proteins with novel sequences (1). These efforts have generated insight into the factors that control protein folding and have suggested new approaches to biotechnology (2). In order to broaden the scope and power of protein design techniques, several groups have developed and experimentally tested systematic quantitative methods for protein design directed toward developing general design algorithms (3, 4). These techniques, which have been used to screen possible sequences for compatibility with the desired protein fold, have been focused mostly on the redesign of protein cores.

We have sought to expand the range of computational protein design to residues of all parts of a protein: the buried core, the solvent-exposed surface, and the boundary between core and surface (4-6). Our goal is an unbiased, quantitative design algorithm that is based on the physical properties that determine protein structure and stability and that is not limited to specific folds or motifs. Such a method should escape the lack of generality of design approaches based on system-specific heuristics or subjective considerations or both. We have developed our algorithm by combining theory, computation, and experiment in a cycle that has improved our understanding of the physical chemistry governing protein design (4). We now report the successful design by the algorithm of an original sequence for an entire protein and the experimental validation of the protein's structure.

Sequence selection. Our design methodology begins with a backbone fold and we attempt to select an amino acid sequence that will stabilize this target structure. The method consists of an automated side-chain selection algorithm that explicitly and quantitatively considers specific interactions between (i) side chain and backbone and (ii) side chain and side chain (4). The side chain selection algorithm screens all possible amino acid sequences and finds the optimal sequence and side-chain orientations for a given backbone. In order to correctly account for the torsional flexibility of side chains and the geometric specificity of side-chain placement, we consider a discrete set of all allowed conformers of each side chain, called rotamers (7). The sizable search problem presented by rotamer sequence optimization is overcome by application of the dead-end elimination (DEE) theorem (8). Our implementation of the DEE theorem extends its utility to sequence design and rapidly finds the globally optimal sequence in its optimal conformation (4).

Previously we determined the different contributions of core, surface, and boundary residues to the scoring of a sequence arrangement. The sequence predictions of a scoring function, or a combination of scoring functions, were experimentally tested in order to assess the accuracy of the algorithm and to derive improvements to it. We successfully redesigned the core of a coiled coil and of the streptococcal protein G β1 (Gβ1) domain using a van der Waals potential to account for steric constraints and an atomic solvation potential favoring the burial and penalizing the exposure of nonpolar surface area (4, 6). Effective solvation parameters and the appropriate balance between packing and solvation terms were found by systematic analysis of experimental data and feedback into the simulation. Solvent-exposed residues on the surface of a protein were designed with the use of a hydrogen-bond potential and secondary structure propensities in addition to a van der Waals potential. Coiled coils designed with such a scoring function were 10° to 12°C more thermally stable than the naturally occurring analog (5). Residues that form the boundary between the core and surface require a combination of the core and the surface scoring functions. The algorithm considers both hydrophobic and hydrophilic amino acids at boundary positions, whereas core positions are restricted to hydrophobic amino acids and surface positions are restricted to hydrophilic amino acids.

In order to assess the capability of our design algorithm, we have computed the entire amino acid sequence for a small protein motif. We sought a protein fold that would be small enough to be both computationally and experimentally tractable, yet large enough to form an independently folded structure in the absence of disulfide bonds or metal binding. We chose the ββα motif typified by the zinc finger DNA binding module (9). Although this motif consists of fewer than 30 residues, it does contain sheet, helix, and turn structures. The ability of this fold to form in the absence of metal ions or disulfide bonds has been demonstrated by Imperiali and co-workers, who designed a 23-residue peptide, containing an unusual amino acid (d-proline) and a nonnatural amino acid [3-(1,10-phenanthrol-2-yl)-l-alanine], which achieved this fold (10); our initial characterization of a partially computed sequence indicated that it also forms this fold (11). In computing the full sequence for this target fold, we use the scoring functions from our previous work without modification (12). The ββα motif was not used in any of our prior work to develop the design methodology and therefore provides a test of the algorithm's generality.

The sequence selection algorithm requires structure coordinates that define the target motif's backbone (N, Cα, C, and O atoms and Cα-Cβ vectors). The Brookhaven Protein Data Bank (PDB) (13) was examined for high-resolution structures of the ββα motif, and the second zinc finger module of the DNA binding protein Zif268 was selected as our design template (9,14). In order to assign the residue positions in the template structure into core, surface, or boundary classes, the orientation of the Cα-Cβ vectors was assessed relative to a solvent-accessible surface computed with only the template Cα atoms (15). The small size of this motif limits to one (position 5) the number of residues that can be assigned unambiguously to the core, whereas seven residues (positions 3, 7, 12, 18, 21, 22, and 25) were classified as boundary and the remaining 20 residues were assigned to the surface. Whereas three of the zinc binding positions of Zif268 are in the boundary or core, one residue, position 8, has a Cα-Cβ vector directed away from the geometric center of the protein and is classified as a surface position. As in our previous studies, the amino acids considered at the core positions during sequence selection were Ala, Val, Leu, Ile, Phe, Tyr, and Trp; the amino acids considered at the surface positions were Ala, Ser, Thr, His, Asp, Asn, Glu, Gln, Lys, and Arg; and the combined core and surface amino acid sets (16 amino acids) were considered at the boundary positions. Two of the residue positions (9 and 27) have φ angles greater than 0° and are set to Gly by the sequence selection algorithm to minimize backbone strain.

The total number of amino acid sequences that must be considered by the design algorithm is the product of the number of possible amino acids at each residue position. The ββα motif residue classification described above results in a virtual combinatorial library of 1.9 × 1027 possible amino acid sequences (16). This library size is 15 orders of magnitude larger than that accessible by experimental random library approaches. A corresponding peptide library consisting of only a single molecule for each 28-residue sequence would have a mass of 11.6 metric tons (17). In order to accurately model the geometric specificity of side-chain placement, we explicitly consider the torsional flexibility of amino acid side chains in our sequence scoring by representing each amino acid with a discrete set of allowed conformations, called rotamers (18). As a result, the design algorithm must consider all rotamers for each possible amino acid at each residue position. The total size of the search space for the ββα motif is therefore 1.1 × 1062 possible rotamer sequences. We use a search algorithm based on an extension of the DEE theorem to solve the rotamer sequence optimization problem (4, 8). Efficient implementation of the DEE theorem has made complete protein sequence design tractable for about 50 residues on current parallel computers in a single calculation. The rotamer optimization problem for the ββα motif required 90 CPU hours to find the optimal sequence (19, 20).

The optimal sequence (Fig. 1) is called full sequence design (FSD-1). Even though all of the hydrophilic amino acids were considered at each of the boundary positions, the algorithm selected only nonpolar amino acids. The eight core and boundary positions are predicted to form a well-packed buried cluster. The Phe side chains selected by the algorithm at positions 21 and 25, the zinc-binding His positions of Zif268, are more than 80 percent buried, and the Ala at position 5 is 100 percent buried but the Lys at position 8 is more than 60 percent exposed to solvent (Fig.2). The other boundary positions demonstrate the steric constraints on buried residues by packing similar side chains in an arrangement similar to that of Zif268 (Fig.2). The calculated optimal configuration for core and boundary residues buries ∼1150 Å2 of nonpolar surface area. On the helix surface, the algorithm places Asn14 with a hydrogen bond between its side-chain carbonyl oxygen and the backbone amide proton of residue 16. The eight charged residues on the helix form three pairs of hydrogen bonds, although in our coiled-coil designs, helical surface hydrogen bonds appeared to be less important than the overall helix propensity of the sequence (5). Positions 4 and 11 on the exposed sheet surface were selected by the program to be Thr, one of the best β-sheet forming residues (21).

Figure 1

Sequence of FSD-1 aligned with the second zinc finger of Zif268. The bar at the top of the figure shows the residue position classifications: the solid bar indicates the single core position, the hatched bars indicate the seven boundary positions and the open bars indicate the 20 surface positions. The alignment matches positions of FSD-1 to the corresponding backbone template positions of Zif268. Of the six identical positions (21 percent) between FSD-1 and Zif268, four are buried (Ile7, Phe12, Leu18, and Ile22). The zinc binding residues of Zif268 are boxed. Representative nonoptimal sequence solutions determined by means of a Monte Carlo simulated annealing protocol are shown with their rank. Vertical lines indicate identity with FSD-1. The symbols at the bottom of the figure show the degree of sequence conservation for each residue position computed across the top 1000 sequences: filled circles indicate more than 99 percent conservation, half-filled circles indicate conservation between 90 and 99 percent, open circles indicate conservation between 50 and 90 percent, and the absence of a symbol indicates less than 50% conservation. The consensus sequence determined by choosing the amino acid with the highest occurrence at each position is identical to the sequence of FSD-1. Single-letter abbreviations for amino acid residues as follows: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu; M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W, Trp; and Y, Tyr.

Figure 2

Comparison of Zif268 (9) and computed FSD-1 structures. (A) Stereoview of the second zinc finger module of Zif268 showing its buried residues and zinc binding site. (B) Stereoview of the computed orientations of buried side chains in FSD-1. For clarity, only side chains from residues 3, 5, 8, 12, 18, 21, 22, and 25 are shown. Color figures were created with MOLMOL (38).

Alignment of the sequences for FSD-1 and Zif268 (Fig. 1) indicates that only 6 of the 28 residues (21 percent) are identical and only 11 (39 percent) are similar. Four of the identities are in the buried cluster, which is consistent with the expectation that buried residues are more conserved than solvent-exposed residues for a given motif (22). A BLAST (23) search of the FSD-1 sequence against the nonredundant protein sequence database of the National Center for Biotechnology Information did not reveal any zinc finger protein sequences. Further, the BLAST search found only low identity matches of weak statistical significance to fragments of various unrelated proteins. The highest identity matches were 10 residues (36 percent) with P values ranging from 0.63 to 1.0, where P is the probability of a match being a chance occurrence. Random 28-residue sequences that consist of amino acids allowed in the ββα position classification described above produced similar BLAST search results, with 10- or 11-residue identities (36 to 39 percent) and P values ranging from 0.35 to 1.0, further suggesting that the matches for FSD-1 are statistically insignificant. The low identity with any known protein sequence demonstrates the novelty of the FSD-1 sequence and underscores that no sequence information from any protein motif was used in our sequence scoring function.

In order to examine the robustness of the computed sequence, we used the sequence of FSD-1 as the starting point of a Monte Carlo simulated annealing run. The Monte Carlo search revealed high scoring, suboptimal sequences in the neighborhood of the optimal solution (4). The energy spread from the ground-state solution to the 1000th most stable sequence is about 5 kcal/mol, an indication that the density of states is high. The amino acids comprising the core of the molecule, with the exception of position 7, are essentially invariant (Fig. 1). Almost all of the sequence variation occurs at surface positions, and typically involves conservative changes. Asn14, which is predicted to form a stabilizing hydrogen bond to the helix backbone, is among the most conserved surface positions. The strong sequence conservation observed for critical areas of the molecule suggests that, if a representative sequence folds into the design target structure, then many sequences whose variations do not disrupt the critical interactions may be equally competent. Even if billions of sequences would successfully achieve the target fold, they would represent only a very small proportion of the 1027 possible sequences.

Experimental validation. FSD-1 was synthesized in order to allow us to characterize its structure and assess the performance of the design algorithm (24). The far-ultraviolet (UV) circular dichroism (CD) spectrum of FSD-1 shows minima at 220 nm and 207 nm, which is indicative of a folded structure (Fig. 3A) (25). The thermal melt is weakly cooperative, with an inflection point at 39°C (Fig.3B), and is completely reversible. The broad melt is consistent with a low enthalpy of folding which is expected for a motif with a small hydrophobic core. This behavior contrasts the uncooperative thermal unfolding transitions observed for other folded short peptides (26). FSD-1 is highly soluble (greater than 3 mM), and equilibrium sedimentation studies at 100 μM, 500 μM, and 1 mM show the protein to be monomeric (27). The sedimentation data fit well to a single species, monomer model with a molecular mass of 3630 at 1 mM, in good agreement with the calculated monomer mass of 3488. Also, far UV-CD spectra showed no concentration dependence from 50 μM to 2 mM, and nuclear magnetic resonance (NMR) COSY spectra taken at 100 μM and 2 mM were essentially identical.

Figure 3

Circular dichroism (CD) measurements of FSD-1. (A) Far-UV CD spectrum of FSD-1 at 1°C. The minima at 220 and 207 nm indicate a folded structure. (B) Thermal unfolding of FSD-1 monitored by CD. The melting curve has an inflection point at 39°C. To illustrate the cooperativity of the thermal transition, the melting curve was fit to a two-state model [(39) and the derivative of the fit is shown (inset)]. The melting temperature determined from this fit is 42°C.

The solution structure of FSD-1 was solved by means of homonuclear 2D 1H NMR spectroscopy (28). NMR spectra were well dispersed, indicating an ordered protein structure and easing resonance assignments. Proton chemical shift assignments were determined with standard homonuclear methods (29). Unambiguous sequential and short-range NOEs (Fig.4) indicate helical secondary structure from residues 15 to 26 in agreement with the design target. Representative long-range NOEs from the helix to Ile7 and Phe12 indicate a hydrophobic core consistent with the desired tertiary structure (Fig. 4B).

Figure 4

NOE contacts for FSD-1. (A) Sequential and short-range NOE connectivities. The d denotes a contact between the indicated protons. All adjacent residues are connected by Hα-HN, HN-HN, or Hβ-HN NOE crosspeaks. The helix (residues 15 to 26) is well defined by short-range connections, as is the hairpin turn at residues 7 and 8. (B) Representative NOE contacts from aromatic to methyl protons. Several long-range NOEs from Ile7 and Phe12 to the helix help define the fold of the protein. The starred peak has an ambiguous F1 assignment, Ile22 Hd1 or Leu18 Hd2.

The structure of FSD-1 was determined from 284 experimental restraints (10.1 restraints per residue) that were nonredundant with covalent structure including 274 NOE distance restraints and 10 hydrogen bond restraints involving slowly exchanging amide protons (30). Structure calculations were performed with X-PLOR (31) with the use of standard protocols for hybrid distance geometry-simulated annealing (32). An ensemble of 41 structures converged with good covalent geometry and no distance restraint violations greater than 0.3 Å (Fig.5 and Table1). The backbone of FSD-1 is well defined with a root-mean-square (rms) deviation from the mean of 0.54 Å (residues 3 to 26). Consideration of the buried side chains (Tyr3, Ala5, Ile7, Phe12, Leu18, Phe21, Ile22, and Phe25) along with the backbone gives an rms deviation of 0.99 Å, indicating that the core of the molecule is well ordered. The stereochemical quality of the ensemble of structures was examined with PROCHECK (33). Apart from the disordered termini and the glycine residues, 87 percent of the residues fall in the most favored region and the remainder in the allowed region of φ,ψ space. Modest heterogeneity is evident in the first strand (residues 3 to 6), which has an average backbone angular order parameter, 〈S〉 (34), of 0.96 ± 0.04 compared to the second strand (residues 9 to 12) with an 〈S〉 = 0.98 ± 0.02 and the helix (residues 15 to 26) with an 〈S〉 = 0.99 ± 0.01. Overall, FSD-1 is notably well ordered and, to our knowledge, is the shortest sequence consisting entirely of naturally occurring amino acids that folds to a well-ordered structure without metal binding, oligomerization, or disulfide bond formation (35).

Figure 5

Solution structure of FSD-1. Stereoview showing the best-fit superposition of the 41 converged simulated annealing structures from X-PLOR (31). The backbone Cα trace is shown in blue and the side-chain heavy atoms of the hydrophobic residues (Tyr3, Ala5, Ile7, Phe12, Leu18, Phe21, Ile22, and Phe25) are shown in magenta. The amino terminus is at the lower left of the figure and the carboxyl terminus is at the upper right of the figure. The structure consists of two antiparallel strands from positions 3 to 6 (back strand) and 9 to 12 (front strand), with a hairpin turn at residues 7 and 8, followed by a helix from positions 15 to 26. The termini, residues 1, 2, 27, and 28 have very few NOE restraints and are disordered.

Table 1

NMR structure determination: distance restraints, structural statistics, and atomic root-mean-square (rms) deviations. 〈SA〉 are the 41 simulated annealing structures,SA is the average structure before energy minimization, (SA)r is the restrained energy minimized average structure, and SD is the standard deviation.

View this table:

The packing pattern of the hydrophobic core of the NMR structure ensemble of FSD-1 (Tyr3, Ile7, Phe12, Leu18, Phe21, Ile22, and Phe25) is similar to the computed packing arrangement. Five of the seven residues have χ1angles in the same gauche+, gauche or trans category as the design target, and three residues match both χ1 and χ2angles. The two residues that do not match their computed χ1 angles are Ile7 and Phe25, which is consistent with their location at the less constrained open end of the molecule. Ala5 is not involved in its expected extensive packing interactions and instead exposes about 45 percent of its surface area because of the displacement of the strand 1 backbone relative to the design template. Conversely, Lys8 behaves as predicted by the algorithm with its solvent exposure (60 percent) and χ1 and χ2 angles matching the computed structure. Because there are few NOEs involving solvent-exposed side chains, most of these side chains are disordered in the solution structure, a state that precludes examination of the predicted surface residue hydrogen bonds. However, Asn14 forms a hydrogen bond from its side chain carbonyl oxygen as predicted, but to the amide of Glu17, not Lys16 as expected from the design. This hydrogen bond is present in 95 percent of the structure ensemble and has a donor-acceptor distance of 2.6 ± 0.06 Å. In general, the side chains of FSD-1 correspond well with the design algorithm predictions, but further refinement of the scoring function and rotamer library should improve sequence selection and side chain placement and improve the correlation between the predicted and observed structures.

We compared the average restrained minimized structure of FSD-1 and the design target (Fig. 6). The overall backbone rms deviation of FSD-1 from the design target is 1.98 Å for residues 3 to 26 and only 0.98 Å for residues 8 to 26 (Table2). The largest difference between FSD-1 and the target structure occurs from residues 4 to 7, with a displacement of 3.0 to 3.5 Å of the backbone atom positions of strand 1. The agreement for strand 2, the strand-to-helix turn, and the helix is remarkable, with the differences nearly within the accuracy of the structure determination. For this region of the structure, the rms difference of φ, ψ angles between FSD-1 and the design target is only 14 ± 9°. In order to quantitatively assess the similarity of FSD-1 to the global fold of the target, we calculated their supersecondary structure parameter values (Table 2) (36,37), which describe the relative orientations of secondary structure units in proteins. The values of θ, the inclination of the helix relative to the sheet, and Ω, the dihedral angle between the helix axis and the strand axes (see legend to Table 2), are nearly identical. The height of the helix above the sheet, h, is only 1 Å greater in FSD-1. A study of protein core design as a function of helix height for Gβ1 variants demonstrated that up to 1.5 Å variation in helix height has little effect on sequence selection (37). The comparison of supersecondary structure parameter values and backbone coordinates highlights the excellent agreement between the experimentally determined structure of FSD-1 and the design target, and demonstrates the success of our algorithm at computing a sequence for this ββα motif.

Figure 6

Comparison of the FSD-1 structure (blue) and the design target (red). Stereoview of the best-fit superposition of the restrained energy minimized average NMR structure of FSD-1 and the backbone of Zif268. Residues 3 to 26 are shown.

Table 2

Comparison of the FSD-1 experimentally determined structure and the design target structure. The FSD-1 structure is the restrained energy minimized average from the NMR structure determination. The design target structure is the second DNA binding module of the zinc finger Zif268 (9)

View this table:

The quality of the match between FSD-1 and the design target demonstrates the ability of our algorithm to design a sequence for a fold that contains the three major secondary structure elements of proteins: sheet, helix, and turn. Since the ββα fold is different from those used to develop the sequence-selection methodology, the design of FSD-1 represents a successful transfer of our algorithm to a new motif. Further tests of the performance of the algorithm on several different motifs are necessary, although its basis in physical chemistry and the absence of heuristics and subjective considerations should allow the algorithm to be used in many different structural contexts. Also, the generation of various kinds of backbone templates for use as input to our fully automated sequence selection algorithm could enable the design of new protein folds. Recent results indicate that the sequence selection algorithm is not sensitive to even fairly large perturbations in backbone geometry and should be robust enough to accommodate computer-generated backbones (37).

The key to using a quantitative method for the FSD-1 design, and for the continued development of the methodology, is the tight coupling of theory, computation, and experiment used to improve the accuracy of the physical chemical potential functions in our algorithm. When combined with these potential functions, computational optimization methods such as DEE can rapidly find sequences for structures too large for experimental library screening or too complex for subjective approaches. Given that the FSD-1 sequence was computed with only a 4-GigaFLOPS computer (19), and that TeraFLOPS computers are now available with PetaFLOPS computers on the drawing board, the prospect for pursuing even larger and more complex designs is excellent.

  • * To whom correspondence should be addressed. E-mail: steve{at}

  • Present address: Xencor, Pasadena, CA 91106, USA.


View Abstract

Navigate This Article