Research Article

Design of a Novel Globular Protein Fold with Atomic-Level Accuracy

See allHide authors and affiliations

Science  21 Nov 2003:
Vol. 302, Issue 5649, pp. 1364-1368
DOI: 10.1126/science.1089427


A major challenge of computational protein design is the creation of novel proteins with arbitrarily chosen three-dimensional structures. Here, we used a general computational strategy that iterates between sequence design and structure prediction to design a 93-residue α/β protein called Top7 with a novel sequence and topology. Top7 was found experimentally to be folded and extremely stable, and the x-ray crystal structure of Top7 is similar (root mean square deviation equals 1.2 angstroms) to the design model. The ability to design a new protein fold makes possible the exploration of the large regions of the protein universe not yet observed in nature.

There are a large but finite number of protein folds observed thus far in nature, and it is not clear whether the structures not yet observed are physically unrealizable or have simply not yet been sampled by the evolutionary process or characterized by a structural biologist. Methods for de novo design of novel protein structures provide a route to resolving this question and, perhaps more importantly, a possible route to novel protein machines and therapeutics.

There has been considerable progress in the development of computational methods for identifying amino acid sequences compatible with a target structure (13), notably the pioneering complete redesign of a zinc finger protein by Mayo and co-workers (1). In general, these methods have not been used to create new protein structures but rather to redesign naturally occurring proteins so that they have enhanced stability or new functionality (46). Because of the strong steric restrictions in the cores of globular proteins, the packing of side chains in redesigned proteins is often quite similar to that in the original native protein (1, 7), and hence high-resolution protein backbone coordinates contain some memory of the original native sequence (812). When creating a new protein from scratch, there is no such sequence memory to aid the process, and it is not even known whether the target backbone is designable. Thus, the computational design of novel protein structures is a more rigorous test of current force fields and optimization methodology than the redesign of naturally occurring proteins.

Because it is unlikely that any arbitrarily chosen protein backbone will be designable, it is essential that the design procedure include a search of nearby conformational space in addition to sequence space. With the exception of the method used by Desjarlais and Handel (2) to redesign the hydrophobic core of a small naturally occurring protein, most previous approaches have either optimized the amino acid sequence for a large number of fixed backbone conformations (4, 1214) or, as in the landmark design by Harbury and colleagues of coiled coil oligomers with a right-handed superhelical twist (15), refined the backbone conformation for a large number of fixed amino acid sequences (15, 16). The range of sequence-structure pairs that can be searched with the use of these approaches is restricted by the need to specify, in advance, a limited number of backbone conformations or amino acid sequences.

We have developed a general procedure for identifying very low free energy sequence-structure pairs that iterates between sequence optimization and structure prediction and can be applied to the design of any desired target structure. The same energy function is used to guide the search at all stages, and at each stage only the lowest energy sequence or structure identified in the previous iteration is optimized, thereby avoiding the large-scale and computationally expensive enumeration of alternative backbones or alternative sequences. Unlike the genetic algorithm of Desjarlais and Handel (2) in which randomly selected torsion angles and residue identities were simultaneously perturbed, our procedure iterates between full-scale optimization of sequence for a fixed backbone conformation and gradient-based optimization of the backbone coordinates for a fixed sequence. We used this approach to create a 93-residue α/β protein with a topology not present in the Protein Structure Database (PDB).

Generation of starting models. The target structure for the de novo design process can range from a detailed backbone model to a back-of-the-envelope sketch. Because we aimed to create a novel protein fold, we selected a topology not present in the PDB according to the Topology of Protein Structure (TOPS) server (17). A rough two-dimensional diagram was created of the target structure (Fig. 1), and constraints were identified that define the topology (Fig. 1, arrows). Three-dimensional models satisfying the constraints were then generated by assembling three- and nine-residue fragments from the PDB with secondary structures consistent with the diagram using the Rosetta de novo structure prediction methodology (18), leading to 172 backbone-only models that had the desired topology and secondary structure content and had root mean square deviations (RMSDs) from each other of 2 to 3 Å.

Fig. 1.

A two-dimensional schematic of the target fold (hexagon, strand; square, helix; circle, other). Hydrogen bond partners are shown as purple arrows. The amino acids shown are those in the final designed (Top7) sequence.

Generation of starting sequences. A sequence was designed for each model with the use of the RosettaDesign (9) Monte Carlo search protocol and energy function, which is dominated by a 12-6 Lennard-Jones potential, an orientation-dependent hydrogen bonding term (19), and an implicit solvation model (20, 21). All amino acids except for cysteine were allowed at 71 of the 93 positions [∼110 rotamers from Dunbrack's library (22) per position]; the remaining 22 surface β-sheet positions were restricted to polar amino acids (∼75 rotamers per position). The search through the 11071 × 7522 (>10186) rotamer combinations took ∼10 min for each model on a Pentium III (Intel) processor.

Because the starting backbone conformations were generated without regard to side-chain packing, it was anticipated that sequences with very low free energies might not exist (i.e., the structures would not be designable). Indeed, the lowest energy sequences selected for the starting structures had energies considerably higher than those of native proteins of roughly the same size. In particular, the Lennard-Jones interaction energies for core residues were on average 0.8 kcal mol–1 less favorable than the interaction energies for the same residues in native protein cores. The finding that low-energy sequences do not exist for protein backbones generated without regard to side-chain packing emphasizes the need to couple sequence design with backbone flexibility for general protein design.

Simultaneous optimization of sequence and structure. The critical feature of the design protocol is the cycling between sequence design, as described above, and backbone optimization. The goal of the backbone optimization step, to identify the lowest free energy backbone conformation for a fixed amino acid sequence, is formally analogous to the high-resolution structure prediction problem, and we used the Rosetta program (23), which we developed for structure prediction. The backbone torsion angles were optimized with the use of a Monte Carlo minimization protocol (24) in which each move has the following parts. (i) An initial perturbation, consisting of either a small random change in the torsion angles of one to five randomly selected residues or a substitution of the backbone torsion angles of one to three consecutive residues with torsion angles from a structure in the PDB. In the latter case, the torsion angles of neighboring residues were varied to minimize the displacement of the downstream portion of the chain. (ii) A rapid optimization of side-chain conformation for all residue positions that had a higher energy after step 1 by cycling through each rotamer at each position in turn and replacing the current side-chain conformation with the lowest energy rotamer conformation. (iii) Optimization of the backbone torsion angles in a 10-residue window surrounding the site of insertion by energy minimization using a quasi-Newton method (25). Moves were accepted or rejected on the basis of the energy difference between the final minimized structure and the starting structure according to the Metropolis criterion. The same energy function was used for backbone optimization and sequence design. Each round of backbone relaxation consisted of several thousand such Monte Carlo minimization moves; a full combinatorial optimization of side-chain rotamer conformations was carried out with the use of a Monte Carlo procedure every 20 moves.

For each starting structure, five independent simulations, each with 15 cycles of sequence design and backbone optimization, were used to obtain low-energy structure sequence pairs. Final energies were comparable to those observed for naturally occurring proteins. Proteins designed with the use of an initial version of the protocol with a damped Lennard-Jones repulsive term and a Monte Carlo optimization without the minimization step were observed experimentally to be quite stable but appeared to have somewhat molten cores (26). To optimize steric packing, the atomic radii were reparameterized on the basis of the distances of closest approach of atom pairs in high-resolution protein structures, explicit protons were included on all atoms, the penalty for atom-atom overlaps was greatly steepened, and the full Monte Carlo minimization protocol was used for varying the backbone, resulting in the generation of much lower energy sequence-structure pairs (20% of the final 860 models had more favorable Lennard-Jones energies than an average protein in the PDB). With these improvements, the protocol was used to design a protein sequence called Top7 (27).

The average Lennard-Jones energies for the buried residues in Top7 become favorable during the relaxation process (table S1), and, although the structural changes during the iterative refinement process are modest (the final protein backbone model has an RMSD of 1.1 Å from the starting model), they bring about dramatic changes in the designed sequence: Only 31% of the Top7 residues are identical to those in the starting sequence. Neither the Top7 sequence nor the sequence before the iterative sequence-structure refinement process have significant similarity to any naturally occurring protein sequence; the closest match to the Top7 sequence found with the use of PSI-BLAST (28) in the Non-Redundant protein sequence database is weaker than would be expected by random chance (E value = 1.6).

Biophysical and structural characterization of Top7. The folding, stability, and structure of the Top7 protein (29) were analyzed with the use of a variety of biophysical methods. The Top7 protein is highly soluble (at 25 to 60 mg ml–1) and is monomeric as determined by gel filtration chromatography. The circular dichroism (CD) spectrum of Top7 is characteristic of α/β proteins (Fig. 2A), and the protein is remarkably thermally stable: The CD spectrum at 98°C is nearly indistinguishable from that at 25°C. At intermediate concentrations (∼5 M) of guanidine hydrochloride (GuHCl), Top7 unfolds cooperatively with an increase in temperature and exhibits cold denaturation (Fig. 2B). Fitting these data to the Gibbs-Helmholtz equation gave a change in heat capacity at constant pressure (Δ°Cp) per residue associated with unfolding of about 10 cal deg–1 mol–1, a typical value for well-folded proteins of this size (30). The GuHCl-induced chemical denaturation of Top7 is cooperative, and the steep transition is characteristic of the two-state unfolding expected for a small, monomeric, single-domain protein (Fig. 2C). Fitting the chemical denaturation data to a two-state unfolding model yields a free energy of unfolding of 13.2 kcal mol–1 at 25°C, indicating that Top7 is more stable than most proteins of similar size (31). The nuclear Overhauser effect spectroscopy (NOESY) and heteronuclear single-quantum coherence (HSQC) spectra of Top7 (Fig. 2, D and E) exhibit features characteristic of a folded protein with substantial β-sheet content. The HSQC spectrum contains the expected number of cross peaks, and the dispersion is comparable to that of α/β proteins of similar size. Strong backbone NH-Hα cross peaks and the observation of Hα resonances downfield of the water signal (to 6 parts per million) indicate the presence of a β sheet, whereas NH-NH peaks are consistent with a partial helical character for the protein.

Fig. 2.

Biophysical characterization of Top7. (A) The far-ultraviolet (UV) CD spectrum of 20 μM Top7 in 25 mM tris-HCl, 30 mM NaCl, and pH = 8.0 at varying temperatures and concentrations of GuHCl. (B) CD signal at 220 nm as a function of temperature and GuHCl for 8 μM TOP7 in 25 mM tris-HCl, 30 mM NaCl, pH = 8.0, in a 2-mm cuvette. (C) CD signal at 220 nm as a function of GuHCl concentration for 5 μM protein in 25 mM tris-HCl, 30 mM NaCl, pH = 8.0, at 25°C in a 1-cm cuvette. (D) The NOESY spectrum of ∼1 mM Top7 at pH = 6.0 recorded at 298 K, 500 Mhz, and 200-ms mixing time with the use of Watergate suppression. ω, frequency. (E) The HSQC spectrum of ∼1 mM 15N-Top7 at pH = 6.0 recorded at 298 K and 500 Mhz with the use of the fast HSQC scheme of Mori et al. (43).

Crystallization trials with Top7 yielded crystals that diffracted to 2.5 Å. Remarkably, a strong molecular replacement (MR) solution to the phase problem was found with the use of the design model. This suggested immediately that the design model was quite close to the true structure: Even the small deviations of nuclear magnetic resonance (NMR) solution structures from x-ray crystal structures can make molecular replacement searches fail. To obtain unbiased phase information, we produced and crystallized a selenomethionyl (SeMet)-substituted variant of Top7 with a surface lysine at position 37 mutated to methionine, and we solved the x-ray crystal structure to 2.5 Å by direct rebuilding into an unbiased single-wavelength anomalous difference (SAD) electron density map (Fig. 3B) and residual difference Fourier maps (32). The final Rwork and Rfree were 0.268 and 0.293, respectively (table S2).

Fig. 3.

Schematic representation of Top7 in unbiased SAD density. (A and B) Stick representations of residues 46 to 76 from the computationally designed Top7 (left, green) and from the 2.5 Å x-ray structure (right, red) are shown in unbiased density (blue). The map was generated from SAD phasing from a single SeMet-substituted variant of Top7, followed by density modification. (C and D) Ribbon diagrams of Top7 with residues 46 to 76 highlighted in red. The two diagrams are related by a 90° rotation around the vertical axis.

The high-resolution crystal structure reveals that the Top7 protein adopts the designed topology (Fig. 4A). Indeed, the structure is strikingly similar to the design model at atomic resolution (1.17 Å RMSD over all backbone atoms). The overall protein structure is very well ordered, with the exception of two turns (comprising residues 11 to 15 and 24 to 31), each of which exhibit elevated B-factors and poor quality electron density. The first of these two turns and the immediately adjoining residues from its neighboring strand deviate the most from the computational model. However, even in this region, the all-atom RMSD between the two models does not exceed 2.8 Å. In contrast, the C-terminal half of the x-ray structure is well ordered and very similar to the computational model; for example, the region from Asp78 to Gly85 has an all-atom RMSD of 0.79 Å (Fig. 4B). Many side chains in the core of the solved structure are effectively superposable with those of the designed Top7 (Fig. 4C).

Fig. 4.

Comparison of the computationally designed model (blue) to the solved x-ray structure (red) of Top7. (A) C-α overlay of the model and structure in stereo (backbone RMSD = 1.17 Å). (B) The C-terminal halves of the x-ray structure and model are extraordinarily similar. The representative region shown (Asp78 to Gly85) has an all-atom RMSD of 0.79 Å and a backbone RMSD of 0.54 Å. (C) Stereorepresentation of the effectively superposable side chains in the cores of the designed model and the solved structure.

Like the design model, the Top7 crystal structure is judged to be a novel topology by the TOPS server. The strongest structural similarity found in a Dali search of the PDB (33) is to a discontinuous portion of the 668-residue protein S-adenosylmethionine decarboxylase, which has a large 68-residue insertion between strands 1 and 2, and the third and fourth strands are connected by an unrefined loop instead of a helix. According to A. Murzin, the curator of the Structural Classification of Proteins (SCOP) database, the Top7 fold is not present in SCOP (34, 35).

Implications. The 1.17-Å backbone atom RMSD between the Top7 design model and the crystal structure implies that deep minima in the free energy function used in design correspond to deep minima in the actual free-energy landscape and hence are an important validation of the accuracy of current potential functions. This atomic-level accuracy contrasts sharply with the low accuracy of ab initio structure predictions for naturally occurring sequences: The most accurate structure predictions in the Critical Assessment of Structure Prediction experiments for 90- to 100-residue proteins have RMSDs greater than 4 Å from the experimentally determined structure. Why does the simultaneous optimization of sequence and structure identify the global free energy minimum, whereas the optimization of structure for fixed sequence does not? The answer may involve both of the challenges facing ab initio structure prediction, the vast size and ruggedness of the conformational space to be searched and the limited accuracy of current potential functions. The capability to alter the sequence and hence reconfigure the landscape may greatly facilitate the search for low-free-energy protein structures as compared to standard ab initio prediction, where the sequence is fixed. In addition, Top7 lacks functional constraints, which can lead to locally suboptimal regions in native structures that are particularly challenging for structure prediction, and the more extensive optimization of the folding free energy may partially compensate for inaccuracies in the potential functions. Finally, it should be noted that the design process focused entirely on minimizing the free energy of the folded monomeric structure; attaining a highly stable new structure did not require extensive negative design against possible alternative conformations (36, 37) nor consideration of the kinetic process of protein folding (38).

The design of Top7 shows that globular protein folds not yet observed in nature not only are physically possible but can be extremely stable. This extends the earlier observation that helical coiled coil geometries not found in nature can be generated by computational protein design (15). The protein therapeutics and molecular machines of the future should thus not be limited to the structures sampled by the biological evolutionary process. The methods used to design Top7 are, in principle, applicable to any globular protein structure and open the door to the exploration and use of a vast new world of protein structures and architectures.

Supporting Online Material

Materials and Methods

Tables S1 to S6

References and Notes

View Abstract

Navigate This Article