Inferential Structure Determination

See allHide authors and affiliations

Science  08 Jul 2005:
Vol. 309, Issue 5732, pp. 303-306
DOI: 10.1126/science.1110428


Macromolecular structures calculated from nuclear magnetic resonance data are not fully determined by experimental data but depend on subjective choices in data treatment and parameter settings. This makes it difficult to objectively judge the precision of the structures. We used Bayesian inference to derive a probability distribution that represents the unknown structure and its precision. This probability distribution also determines additional unknowns, such as theory parameters, that previously had to be chosen empirically. We implemented this approach by using Markov chain Monte Carlo techniques. Our method provides an objective figure of merit and improves structural quality.

A major difficulty in the determination of three-dimensional macromolecular structures is that experimental data are indirect. We observe physical effects that depend on the atomic geometry and use a forward model to relate the observed data to the atomic coordinates. For example in nuclear magnetic resonance (NMR), the intensity Ii of peaks in nuclear Overhauser effect spectroscopy (NOESY) data is proportional to the inverse sixth power of the distance di of two spins: Math (1). This isolated spin pair approximation (ISPA) involves an unknown scaling factor γ. It seems straightforward to obtain the structure in the example: simply use the observed intensities to calculate sufficient distances to define the structure.

In realistic applications, this approach runs into difficulties. One problem is that the forward model is usually inherently degenerate, meaning that different conformations can lead to the same observations and therefore cannot be distinguished experimentally, and even a formally invertable forward model is practically degenerate if the data are incomplete. A further complication is that there are uncertainties in both the data and the forward model: Data are subject to experimental errors, and theories rest on approximations. Moreover, the forward model typically involves parameters that are not measurable. Algorithms for structure calculation from x-ray reflections, NMR spectra, or homology-derived restraints should account for these fundamental difficulties in some way.

Structure determination in general is an ill-posed inverse problem, meaning that going from the data to a unique structure is impossible. However, the current paradigm in structure calculation is to attempt an inversion of the forward model. Most algorithms minimize a hybrid energy Ehybrid = Ephys + wdataEdata (2), where a nonphysical energy Edata uses the forward model and a restraining function to assess the agreement between data and structure. A force field Ephys describes the physical properties of the macromolecule, such as bonded and nonbonded interactions between the atoms, and partially removes the degeneracy of the problem. The rationale is that minimization of the hybrid energy effectively inverts the forward model, yielding the “true” structure.

This strategy works in the case of many data of good quality. In less favorable situations, the ill-posed nature of the inverse problem becomes apparent. Specifically, it remains unclear how to choose auxiliary parameters like the weight wdata or theory parameters such as the scaling factor γ in the ISPA. Because the hybrid energy minimization paradigm offers no principle to settle these issues, such parameters need to be determined heuristically.

The principal difficulty in structure determination by NMR is the lack of information that is indispensible to reconstruct the structure unambiguously. By formulating an optimization problem (“search for the minimum of Ehybrid”), one however implicitly assumes that there is a unique answer. Repeating the optimization procedure multiple times to obtain several “unique” solutions hides but does not solve the ambiguity and makes it difficult to judge the validity and precision of NMR structures in an objective way.

We suggest that it is a misconception to use structure calculation methods that are only appropriate if the objective is to obtain a unique structure. Instead, we view structure determination as an inference problem, requiring reasoning from incomplete and uncertain information. We consider the entire conformational space and use the data only to rank the molecule's possible conformations. We assign a number Pi to every conformation Xi. If Pi > Pj, conformation Xi is more supported by the data than Xj. Cox (3) proved that such rankings are equivalent to a probability and that probability theory is the only consistent calculus to solve inference problems. The distribution of the probabilities Pi reflects the information content of the data. If all but one Pi vanish, the data determine the structure uniquely. If Pi are uniform throughout conformational space, the data are completely uninformative with respect to the structure.

Any inferential structure determination (ISD) is solved by calculating the probabilities Pi. We demand the probabilities to be objective in the sense that they only depend on data D and on relevant prior information I (such as the forward model or knowledge about physical interactions). Thus, Pi is a conditional probability, Pi = P(Xi|D,I); it is not a frequency of occurence but a quantitative representation of our state of knowledge. In the case of a continuous parametrization of conformations, such as Cartesian coordinates, Pi is a density p(X|D,I).

A direct consequence of probability calculus is Bayes' theorem (4), which formally solves our inference problem. The posterior density Math(1) factorizes into two natural components: The likelihood function p(D|X, I) combines a forward model and an error distribution and quantifies the likelihood of observing data D given a molecular structure X. Because we model deviations between measurements and predictions explicitly, the precision of the co-ordinates depends on the quality of the data and on the accuracy of the forward model. In the ideal case of a uniquely invertible model, the likelihood function is only peaked at the structure that satisfies the data (i.e., the conventional approach is contained as limiting case). The prior density p(X|I) takes prior knowledge about biomolecular structures into account and is determined by the physical energy and the temperature of the system (5).

The error distribution and the forward model typically contain auxiliary parameters ξ that are unavailable from the data but necessary in order to describe the problem adequately. In Bayesian theory, such nuisance parameters are treated in the same way as the coordinates: They are estimated from the experimental data by replacing X with (X, ξ) in Eq. 1. Assuming independence of X and ξ, the joint posterior density for all unknown parameters is Math(2) Equation 2 provides a unique rule to determine any quantity that is not accessible by experiment.

To demonstrate the practical feasibility of the ISD approach, we infer the molecular structure of the Fyn SH3 domain (59 amino acids length). Experimental distances between amide protons were derived from a series of NOESY spectra on a {15N, 2H} enriched protein (6). The data set is sparse: It comprises 154 measurements, of which on average only one per amino acid provides long-range structural information. The forward model Math defined by the ISPA does not account for experimental errors and systematic effects like spin diffusion (1) and internal dynamics (7); hence, observed intensities will deviate from theoretical predictions. A log normal distribution (5) describes these deviations and introduces a second nuisance parameter σ that quantifies their magnitude. Thus, we have two nuisance parameters, ξ = (γ,σ).

Although given in analytically closed form (5), it is practically impossible to evaluate the posterior density p(X,γ,σ|D,I) over all conformational space. Therefore, in our view, structure calculation comprises posterior simulation, which samples only regions that carry a considerable amount of probability mass. We have developed a Markov chain Monte Carlo (MCMC) algorithm based on the replica-exchange method (8) to simulate the joint posterior density of a structure determination problem (5, 9) (Fig. 1 and fig. S1).

Fig. 1.

Replica-exchange Monte Carlo algorithm. (A) We generate a stochastic sample (X(k), γ(k), σ(k)) from the joint posterior distribution in an iterative fashion by using Gibbs sampling (20). The nuisance parameters γ and σ are consecutively drawn from their conditional posterior distributions, with the values of the other parameters being fixed to their previously generated values. Coordinates are updated by using the hybrid Monte Carlo method (21). (B) To overcome energy barriers, we embed this scheme in a replica-exchange strategy, which simulates a sequence of heated copies of the system. Samples of the target distribution are generated in the low-temperature copy (Tlow) and propagate via stochastic exchanges between intermediate copies (Tlow < Ti < Thigh) to the high-temperature system (Thigh). The temperature Thigh is chosen such that the polypeptide chain can move freely in order to escape local modes of the probability density.

The most pronounced features of the posterior density can be represented in a set of conformational samples. Although this looks at first glance like a conventional structure ensemble, the rationale behind our approach to obtain conformational samples is very different. The uncertainty of atomic positions is directly influenced by the uncertainty of nuisance parameters and by the quality of the data. Effects not described in the ISPA, such as protein dynamics, tend to increase the deviations between predicted and measured peak intensities. This is reflected in an increase of the error σ and consequently leads to a loss in structural precision. However, unless the forward model incorporates experimental information on protein dynamics, we cannot discriminate motion from imprecisions due to experimental errors or lack of data.

Compared with conventional structure ensembles, our conformational samples are much better defined and systematically closer to the structure obtained with x-ray crystallography (10) (Fig. 2). A comparison of the 20 most probable conformations with the x-ray structure yields a backbone heavy atom rmsd (root mean square deviation) of 1.84 ± 0.20 Å for all residues and 1.36 ± 0.19 Å for the secondary structural elements. This is a considerable improvement over conventional techniques used in (6), where an ensemble with an overall rmsd of 2.86 ± 0.33 Å and an rmsd of 2.01 ± 0.28 Å for secondary structure elements was obtained. This improvement originates in the calculation of structures by random sampling, which searches conformational space more exhaustively and suppresses topologically unlikely conformations. Misfolds such as mirror images can only be realized in a small number of ways; thus, they are entropically suppressed and do not show up in a statistical ensemble. Discriminating such conformations on the basis of the hybrid energy is more difficult, in particular if the data are sparse.

Fig. 2.

Structure ensembles. Calculated structures (gray) were superimposed onto the x-ray structure of the SH3 domain (black). Superposition and plotting of the structures was carried out with MOLMOL (24). (A) Backbone traces (atoms N, Cα, and C) of the 20 most likely conformations obtained with our sampling algorithm. For comparison, two conventional structure ensembles were calculated by repeatedly running a standard simulated annealing protocol (22) with fixed weight wdata. We used the program CNS (23) to calculate 200 conformers for two different restraining potentials; the panels show the 20 lowest energy conformations. (B) In the first calculation, distance bounds and a flat-bottom harmonic-wall potential with linear asymptotes (25) were used, resulting in an ensemble with a backbone heavy atom rmsd to the x-ray structure (10) of 3.07 ± 0.53 Å for all residues and 1.93 ± 0.34 Å for secondary structure elements. (C) In the second calculation, we used a harmonic restraining potential on the distances Embedded Image, leading to an overall rmsd of 2.98 ± 0.46 Å and 2.15 ± 0.41 Å for secondary structure elements.

A probabilistic structure ensemble is exclusively determined by the data and the working hypotheses that enter the analysis (which are in the presented example the ISPA, the log-normal error distribution, and our choice of the force field). Modifications will, of course, lead to changes in the structures. The atom positions, for example, are sensitive to the parameters and the functional form of the force field used in the conformational prior density. This also holds for conventional approaches, which are based on analogous assumptions. However, in addition, conventional methods require empirical rules to treat nuisance parameters, because they cannot be determined from the hybrid energy alone. Cross-validation (11, 12) and maximum likelihood methods (13), for example, have successfully been applied in NMR and crystallographic refinement to determine certain nuisance parameters such as the weight wdata. The ISD approach goes beyond these techniques. Once the working hypotheses are made, Eq. 2 provides definite rules to determine any nuisance parameter, including its uncertainty, directly from the data (Fig. 3). Therefore heuristics and other subjective elements are superfluous.

Fig. 3.

Estimation of nuisance parameters. Posterior histograms compiled from MCMC samples for the scaling factor γ in the ISPA and for the width σ of the log normal error distribution. (A) Posterior histogram p–⅙|D,I) for the inverse sixth power of γ. This factor corrects interproton distances to match the experimental distances best. (B) Posterior histogram p(σ|D,I) for the error σ. In conventional approaches, this analog to the weight (wdata ∝ σ–2) can only be estimated via cross-validation or must be set empirically.

Because conventional structure ensembles depend on user-specific parameter settings and on the minimization protocol, it is difficult if not impossible to assign statistically meaningful error bars to atomic coordinates. In contrast, stochastic samples drawn from the joint posterior density p(X, γ, σ|D,I) are statistically well defined and can directly be used to calculate estimates of mean values and standard deviations (14). As a special case, we can derive an average structure with atom-wise error bars and are thus able to define an objective figure of merit for NMR structures (Fig. 4).

Fig. 4.

Conformational uncertainty. MOLMOL “sausage” plot of the mean structure with atom-wise error bars indicated by the thickness of the sausage. The 20 most probable conformations (also shown in Fig. 2A) from the simulation of the joint posterior distribution p(X,γ,σ|D,I) were used to calculate the average structure and its precision. The local precision ranges from 0.6 Å for secondary structure elements to 4.6 Å for loop regions (bottom and right-hand side) and termini (top). The average precision is 1.07 Å. The average precision of the structure ensembles calculated with CNS is 4.93 Å for the flat-bottom harmonic-wall potential and 5.04 Å for the harmonic potential.

Bayesian and maximum likelihood approaches have already proven useful for data analysis and partial aspects of structure refinement in NMR spectroscopy and x-ray crystallography (15, 16, 13, 17). Our results suggest that structure determination can be solved entirely in a probabilistic framework.

It is straightforward to apply our approach to other NMR parameters. In case of three-bond scalar coupling constants, for example, an appropriate forward model is the Karplus curve (18) involving three coefficients that are treated as nuisance parameters. However, our method is not restricted to NMR data and can be applied to other structure determination problems. Besides theoretical coherence, a rigorous probabilistic approach has decisive practical advantages. It has no free parameter and is stable for many more than the two nuisance parameters used in the example (19). Hence, tedious and time-consuming searches for optimal values are no longer necessary. Once the forward model to describe the data has been chosen, probability calculus uniquely determines the posterior distribution for all unknowns. It is then only a computational issue to generate posterior samples. Further intervention is not required, and structure determination attains objectivity.

Supporting Online Material

Materials and Methods

Fig. S1

References and Notes

View Abstract

Navigate This Article