Pathways to a Protein Folding Intermediate Observed in a 1-Microsecond Simulation in Aqueous Solution

See allHide authors and affiliations

Science  23 Oct 1998:
Vol. 282, Issue 5389, pp. 740-744
DOI: 10.1126/science.282.5389.740


An implementation of classical molecular dynamics on parallel computers of increased efficiency has enabled a simulation of protein folding with explicit representation of water for 1 microsecond, about two orders of magnitude longer than the longest simulation of a protein in water reported to date. Starting with an unfolded state of villin headpiece subdomain, hydrophobic collapse and helix formation occur in an initial phase, followed by conformational readjustments. A marginally stable state, which has a lifetime of about 150 nanoseconds, a favorable solvation free energy, and shows significant resemblance to the native structure, is observed; two pathways to this state have been found.

Elucidation of the mechanism of protein folding is an important step in understanding the relation between sequence and structure of proteins. Understanding of the mechanism should allow more accurate prediction of protein structures, with wide-ranging implications in biochemistry, genetics, and pharmaceutical chemistry. The recent hypothesis of folding-related diseases is another example of the significance of folding (1). Yet despite great progress made by a variety of experimental and theoretical studies after decades of extensive research, it has been difficult to establish detailed descriptions of the folding process and mechanism (2).

Computer simulation of molecular systems can provide rich information at various levels of resolution, and this approach has been important in attempts to understand protein folding mechanisms. A simplified representation might treat the protein residues as (one or two) linked beads (3). Higher resolution models represent most or all of the atoms of the protein explicitly, with an implicit representation of the solvent (4, 5). At an even higher level of detail are molecular dynamics (MD) simulations with full atomic representation of both protein and solvent. Such calculations are uniquely suited to the study of protein folding because of their resolution and accuracy. The simulation parameters (that is, the force field) are derived from experiments and from gas-phase quantum mechanical calculations and have been tested in smaller systems in many critical comparisons with experimental results (6). Because of the complexity of the representation, the large number of atoms (often exceeding 10,000), and the need to take time steps of 1 to 2 fs, such simulations have, to date, been limited to a few nanoseconds (7) (or a few million integration steps). This has precluded the simulation of even the early stages of protein folding. Nevertheless, insights have been gained from unfolding simulations (8) of the denaturation process. Attempts have also been made to construct the folding free-energy landscape from unfolding simulations (9). Direct folding simulations with this approach, however, have been limited to small peptide fragments and have been carried out for as long as 50 ns (10). Direct simulation of the protein folding process with such an approach has not been considered possible (11) “either now or in the foreseeable future” (12, p. 29). By using a Cray T3E, a massively parallel supercomputer consisting of hundreds of central processing units (CPUs) connected by low-latency, high-speed, and high-availability networks, with an efficiently parallelized program that scales well to the 256-CPU level for small protein–solvent systems and is six times faster than a typical current state-of-the-art program (13), we have conducted a 1-μs simulation at 300 K on the villin headpiece subdomain, a 36-residue peptide (HP-36) (14, 15), starting from a fully unfolded extended state (Fig. 1A), including ∼3000 water molecules (16–18). The simulation time scale is close to that required to fold small proteins. The simulation shows a mechanism for the protein to find a folding intermediate, an important step in proceeding to its fully folded state.

Figure 1

Ribbon representations of (A) the unfolded, (B) partially folded (at 980 ns), and (C) native structures, and (E) a representative structure of the most stable cluster and (D) the overlap of the native (red) and the most stable cluster (blue) structures, generated with UCSF MidasPlus. Color code [except (D)]: red, main chain atoms and oxygen; black, non–main chain carbon; blue, non–main chain nitrogen; gray, hydrogen; yellow, sulfur.

Proteins can have marginally stable nonnative states that are difficult to observe experimentally (19). Computer simulation can play an important role in identifying these structures because of its extremely high time resolution and detailed atomic level representation. Recent experimental studies suggest that the time for (small) proteins to reach their marginally stable states with partially formed secondary structures is on the order of 10 μs (20). It also has been shown that a small protein can fold within 20 μs (21), and it has been estimated that the lower limit of the folding time is 1 μs (22).

HP-36 is one of the smallest proteins that can fold autonomously. It contains only naturally occurring amino acids and does not require disulfide bonds, oligomerization, or ligand binding for stabilization; its melting temperature is above 70°C in aqueous solution (14). The estimated folding time of the protein is between 10 and 100 μs (23), which would make it one of the fastest folding proteins. Nuclear magnetic resonance (NMR) studies of the 36-residue subdomain revealed three short helices (Fig. 1C) (15). We refer to them as helices 1, 2, and 3, for residues 4 to 8, 15 to 18, and 23 to 30, respectively, as found in the NMR structure. They are held together by a loop (residues 9 to 14), a turn (residues 19 to 22), and a closely packed hydrophobic core. In the following, we number the residues from 1 to 36, where our residue 1 corresponds to residue 41 in the NMR structure. The unfolded starting structure (Fig. 1A), which was generated from the NMR native structure by a 1.0-ns MD simulation at 1000 K, was in an extended state with very few native contacts (<3%) and no helical content.

In addition to the 1-μs simulation, a control simulation was conducted for 100 ns at 300 K (16), starting from the native NMR structure (15). In this 100-ns simulation, the NH2-terminal helix 1 rotated ∼30° outward while maintaining its helical structure. The COOH-terminal residue Phe36, which was disordered in the NMR structure, also exhibited large-scale movement. Phe36 was initially in the solvent, as found in the NMR structure. Together with Leu35, it soon moved toward the COOH-terminus of helix 1 and loosely packed against the middle of Lys8, forming a small hydrophobic cluster comprising Lys8, Leu35, and Phe36. Judging from the reduction of the hydrophobic surface, the formation of these contacts appears energetically reasonable. The overall structure, particularly the middle portion (helices 2 and 3) and the hydrophobic core, remained stable in the simulation. The average root mean square deviation (rmsd) from the NMR structure was 1.5 Å for the main chain atoms of residues 9 to 32 in the last 50 ns of the trajectory, whereas this rmsd varied from 3.0 to 8.8 Å during the last 800 ns of our 1-μs folding trajectory. The fact that the core of the native structure remained near the NMR structure indicated that our simulation protocol was adequate to study protein folding.

In the 1-μs trajectory, the radius of gyration (R γ) fluctuated (Fig. 2C) between 16 Å, which represents extended states, and 8.7 Å, which represents highly compact states, compared with 9.4 Å of the native structure. The main-chain rmsd (Fig. 2C) of all residues (1 to 36) varied between 12.4 and 4.5 Å; that of the middle portion (residues 9 to 32) fluctuated between 8.8 and 3.0 Å. Up to 80% of the native helical content (Fig. 2A) and up to 62% of the native contacts (Fig. 2B) were formed. The solvation component of the free energy (SFE) (Fig. 2D) also reached levels comparable to that of the native structure. More importantly, a marginally stable state was reached, as can be seen from the rmsd's and theR γ, which had a residence time of longer than 150 ns, much longer than typical MD simulations conducted to date.

Figure 2

Time evolution of (A) fractional native helical content, (B) fractional native contacts, (C) R γ and the main chain rmsd from the native structure, and (D) SFE of the protein. The helical content and the native contacts are plotted on a logarithmic time scale. The helical content was measured by the main chain φ-ψ angle (−60° ± 30°, −40° ± 30°). The native contacts were measured as the number of neighboring residues present in 80% of the last 50 ns of the native simulation. Residues are taken to be in contact if any of the atom pairs are closer than 2.8 Å, excluding residues i and i+1, which always have the contacts through main chain atoms. The SFE was calculated as described by Eisenberg and McLachlan (31) using their parameters (0.0163, −0.00637, 0.02114, −0.02376, and −0.05041, in kcal mol Å−2, for the surface areas of nonpolar, polar, sulfur, charged oxygen, and charged nitrogen, respectively). The straight line represents the SFE of the native structure.

An important feature of most of the trajectory is its high degree of fluctuation, exhibited by essentially all the features measured, including native helical content, native contacts, rmsd, andR γ (Fig. 2). Such a large degree of fluctuation is in contrast to the relatively small fluctuations found during the simulation beginning from the native structure and during the time when the marginally stable state was reached in the folding simulation. This high degree of fluctuation is an indication of the rugged and shallow free-energy landscape associated with early stages of folding. This shallow landscape enables the protein to search the early-stage folding free-energy surface easily.

The folding began with a “burst” phase, characterized by a steady rise in native helical content (Fig. 2A) and in native contacts (Fig. 2B), and the decrease of the SFE (Fig. 2D), which lasted from the beginning of the trajectory to ∼60 ns. Within this period of time, the native helical content increased to ∼60% from an initial value of zero; meanwhile, the native contacts increased to about 45% from an initial value of 3%, and the SFE was reduced by nearly 14 kcal/mol, reaching a level comparable to that of the native structure. Analysis of the correlations between various energy terms andR γ indicated that the initial phase of the 300 K simulation was driven by the burial of exposed hydrophobic surface (13). Therefore, this phase can be seen as an initial hydrophobic collapse. However, given the concomitant rise of the helical content, it appears that hydrophobic collapse occurs on the same time scale as formation of some secondary structure. This makes physical sense in that a protein, as it buries its hydrophobic groups, tries to avoid burying its hydrogen bonding functionalities, and secondary structure formation provides a way to do this. The time required to reach 50% helical content was about 60 ns, in excellent agreement with recent kinetic measurements on apo-myoglobin (48 ns) (24) and on alanine peptide (16 to 180 ns) (25). Given the diverse folding rates observed in different sequences in experiments (16 ns for alanine peptide and 48 ns for apo-myoglobin, with the same method), the small difference observed here may be the result of sequence dependence.

Alonso and Daggett (26) showed that almost all nonnative conformations of ubiquitin generated in unfolding simulations moved to a lower R γ when the temperature was lowered. Their least native structure went through two cycles of expansion and collapse in 2 ns. Our simulations expand on these findings by showing that cycles of expansion and collapse can extend even into the microsecond regime and that these expanded and collapsed structures get more nativelike as folding proceeds.

This burst phase was followed by a slower adjustment phase. The slower phase started from a sharp drop of the helical content, from an average of more than 50% down to about 20%, while theR γ decreased slightly and the SFE became more favorable. Meanwhile, the native contacts remained at a level similar to the one at the end of the burst phase. After this initial drop of the helical content, both the helical content and the native contacts remained at their respective levels. They showed a steady but slow rise after about 200 ns. A two-phase folding process has also been observed in the kinetic measurement of apo-myoglobin in which a fast 48-ns burst phase was followed by a 132-μs slow phase that was interpreted as the tertiary-contacts formation phase (24).

To reach their folded states from fully unfolded states, proteins can (27) sample marginally stable states, which can be identified by clustering methods (28). The population of snapshots in each cluster reflects the likelihood of each cluster being sampled in the duration of the simulation. There are 13 clusters that each contain more than 1000 snapshots (or 2% of the trajectory, equivalent to 20 ns). Among these 13 most populous clusters, 10 are compact, with values of R γ between 9.1 to 10.4 Å (or 96 to 110% of the native R γ). A common feature of these clusters is the formation of a helix at the NH2-terminus of helix 3, residues 23 to 28. Helix 2 is also partially formed in 12 of the 13 most populous clusters. This suggests that helix 2 and the NH2-terminus of helix 3 are the initiation sites of folding.

The core regions of most of these 13 populous clusters are packed and inaccessible to water. Their average buried surface areas were more than 50% for residues 10 to 30, which make up the core region, compared with 58% burial of the same residues in the native structure. Only 4 of the 13 clusters showed more than 50% accessible surface for the same region. This is in contrast to the initial 200 ns of the trajectory, in which half of the compact structures showed a solvated core (13), indicating a shift toward compact structures that are less accessible to water as the folding progresses.

The most populous cluster had a population of 8765 snapshots, or 17.5% of the trajectory. Most of the snapshots in this cluster are found between 240 to 400 ns, while the rmsd andR γ are very stable, and the SFE is comparable to that of the native structure. It is also the most compact cluster whose R γ of 9.1 Å is smaller than that of the native structure (9.4 Å). Representative structures from this cluster show a marked similarity to the native structure (Fig. 1D). Among the secondary structure elements, helix 2 is well formed, helix 1 and 3 are partially formed, and the loop connecting helix 1 and 2 starts to form. The main chain rmsd of the structure shown in Fig. 1E relative to the NMR structure is 5.7 Å for all residues and 4.0 Å for residues 9 to 32. The SFE of this cluster is 7.8 ± 2.3 kcal/mol, close to the 7.1 ± 2.2 kcal/mol of the 100-ns native simulation trajectory and the 6.4 kcal/mol found for the experimental NMR structure. A notable feature of this cluster is its high stability. The longest residence time in the cluster is about 150 ns, much longer than that of any other state (typically a few nanoseconds).

While the simulation is in the most highly populated, nativelike state, the side chains do not reach their native positions (Fig. 1E). This is not surprising. It is unrealistic to expect the protein to fold to the native structure within our simulation time of 1 μs, which is still much shorter than the lower bound of the estimated folding time, 10 μs (23). On the other hand, the estimated folding time is consistent with the formation of a marginally stable state that contains many of the features of the native structure. Because the residence time of the marginally stable state is longer than 150 ns, only two orders of magnitude shorter than the lower limit of the estimated folding time, and is highly nativelike, it may well be an intermediate state. We speculate that a number of intermediates such as the one we have observed will form and dissipate, until one forms that allows the precise side chain packing that will lead to the native state.

These states, identified as clusters, when linked together by the transitions, show the pathways of the folding events, whereas the number of transitions between clusters indicates the likelihood of such transitions in the simulation time scale. We have identified the most populated clusters and the transitions between them (Fig. 3). These transitions are bidirectional (that is, the number of forward and backward transitions are similar). A noteworthy feature is that the access to the marginally stable state (discussed above) is limited, and only two primary pathways to it were observed in the simulation. This is in contrast to other states with similar (but smaller) populations that are much more accessible. For example, the second most populated state can be readily accessed through five pathways. Consequentially, such states are kinetically less stable and have a much shorter residence time (a few nanoseconds) than the marginally stable state, even though on the basis of our limited sampling they are only slightly less favorable thermodynamically [−k B Tln(4782/8765), or about 0.4 kcal/mol]. We speculate that limited access to the folded state [such that folding takes place by way of a few pathways or a dominant pathway (5)] may serve to provide kinetic stability to the thermodynamically stable folded state. More importantly, through the transition network, early states can readily transit between one another, resulting in thoroughly tangled multiple pathways. Therefore, the emergence of such pathways may be a key feature of the funnel-shaped folding landscape, with the role of the folding intermediate being to merge the multiple pathways.

Figure 3

Pathways of folding events. Circles represent the most populous clusters and arrows represent transitions between them. The circles are shaded by k B Tln(n), where n is the number of snapshots in the cluster (see the legend), T is the temperature, andk B is the Boltzmann constant. The label in each circle indicates the number of snapshots in the cluster.

Among the native contacts, nine were in contact for more than 50% of the simulation time; they were “local” contacts (that is, less than four residues apart along the chain). Ten of the native contacts were formed for less than 10% of the simulation time, and seven of these poorly formed contacts were tertiary contacts (that is, more than five residues apart). Our simulation indicates that the tertiary contacts are less likely to form and be maintained in the early stages of folding. Therefore, the formation of tertiary contacts is likely to be the bottleneck of the folding process. These results are consistent with kinetic measurements of Plaxco et al. who found that the folding speed is primarily determined by the contact order (29)—the more “nonlocal” contacts a protein has, the slower it folds—suggesting the contribution of chain entropy loss to the free-energy barrier of folding.

Our results show that microsecond-scale simulations of small proteins in a fully solvated environment can be used to probe the early stages of the folding process. Although we have presented only a single trajectory here whose statistical significance cannot be assessed, we have carried out a second trajectory on HP-36 and one on protein G starting from unfolded states for ∼100 ns (30). The nature of the burst phase (hydrophobic collapse accompanied by secondary structure formation) was similar to that reported here. In addition, even though the repeated increases and decreases in theR γ during the 1-μs trajectory do not represent statistically independent events, they can be viewed as steps in the process of the protein finding its way from the fully unfolded to its fully folded state. With the further development of massively parallel supercomputers and constant improvement of the simulation methods, simulation times may be extended to cover the entire folding process of small proteins (tens of microseconds) within a few years. Equally exciting are the methods that have allowed experimentalists to study protein folding on a submicrosecond time scale. Thus, direct and realistic comparisons between experimental and simulation studies of protein folding may soon be made on the same time scale. These comparisons should provide a useful and critical assessment of the accuracy of the simulation models such as force fields and boundary conditions, and yield a microscopic understanding of the folding process valuable to theoreticians and experimentalists alike.

  • * To whom correspondence should be addressed. E-mail: pak{at}


View Abstract

Navigate This Article