Linking Crystallographic Model and Data Quality

Science  25 May 2012:
Vol. 336, Issue 6084, pp. 1030-1033
DOI: 10.1126/science.1218231


In macromolecular x-ray crystallography, refinement R values measure the agreement between observed and calculated data. Analogously, Rmerge values reporting on the agreement between multiple measurements of a given reflection are used to assess data quality. Here, we show that despite their widespread use, Rmerge values are poorly suited for determining the high-resolution limit and that current standard protocols discard much useful data. We introduce a statistic that estimates the correlation of an observed data set with the underlying (not measurable) true signal; this quantity, CC*, provides a single statistically valid guide for deciding which data are useful. CC* also can be used to assess model and data quality on the same scale, and this reveals when data quality is limiting model improvement.

Accurately determined protein structures provide insight into how biology functions at the molecular level and also guide the development of new drugs and protein-based nanomachines and technologies. The large majority of protein structures are determined by x-ray crystallography, where measured diffraction data are used to derive a molecular model. It is surprising that, despite decades of methodology development, the question of how to select the resolution cutoff of a crystallographic data set is still controversial, and the link between the quality of the data and the quality of the derived molecular model is poorly understood. Here, we describe a statistical quantity that addresses both of these issues and will lead to improved molecular models.

The measured data in x-ray crystallography are the intensities of reflections, and these yield structure factor amplitudes, each with unique h, k, and l indices that define the lattice planes. The standard indicator for assessing the agreement of a refined model with the data is the crystallographic R value, defined as

$$R = \frac{\sum_{hkl} \left| F_{obs}(hkl) - F_{calc}(hkl) \right|}{\sum_{hkl} F_{obs}(hkl)} \tag{1}$$

where Fobs(hkl) and Fcalc(hkl) are the observed and calculated structure factor amplitudes, respectively. R is 0.0 for perfect agreement with the data, and R is near 0.59 for a random model (1). Because R can be made arbitrarily low for models having sufficient parameters to overfit the data, Brünger (2) introduced Rfree as a cross-validated R on the basis of a small subset of reflections not used during refinement. The R for the larger “working” set of reflections is then referred to as Rwork.
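As a minimal sketch of how these quantities are evaluated, the code below computes R as the sum of |Fobs − Fcalc| divided by the sum of Fobs, and then splits the reflections into working and free sets. The function names, the boolean flag convention, and the toy amplitudes are assumptions made for illustration; the scaling of Fcalc to Fobs that refinement programs apply is assumed to have been done already.

```python
from typing import Sequence, Tuple


def r_value(f_obs: Sequence[float], f_calc: Sequence[float]) -> float:
    """R = sum(|Fobs - Fcalc|) / sum(Fobs) over matching reflections."""
    if len(f_obs) != len(f_calc) or len(f_obs) == 0:
        raise ValueError("need two equally long, non-empty amplitude lists")
    numerator = sum(abs(fo - fc) for fo, fc in zip(f_obs, f_calc))
    return numerator / sum(f_obs)


def r_work_free(f_obs: Sequence[float], f_calc: Sequence[float],
                is_free: Sequence[bool]) -> Tuple[float, float]:
    """Return (Rwork, Rfree).

    Hypothetical convention: is_free[i] is True for reflections that were
    held out of refinement (the cross-validation set).
    """
    work_obs = [fo for fo, fr in zip(f_obs, is_free) if not fr]
    work_calc = [fc for fc, fr in zip(f_calc, is_free) if not fr]
    free_obs = [fo for fo, fr in zip(f_obs, is_free) if fr]
    free_calc = [fc for fc, fr in zip(f_calc, is_free) if fr]
    return r_value(work_obs, work_calc), r_value(free_obs, free_calc)


# Toy amplitudes (made up for this example).
toy_obs = [10.0, 20.0, 30.0, 40.0]
toy_calc = [12.0, 18.0, 33.0, 36.0]
toy_flags = [False, False, False, True]
toy_r_work, toy_r_free = r_work_free(toy_obs, toy_calc, toy_flags)
```

In real use the free set is a small random subset chosen before refinement begins, so that Rfree stays independent of the parameters being fitted.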

Crystallographic data quality is commonly assessed by an analogous indicator Rmerge [originally (3) Rsym], which measures the spread of n independent measurements of the intensity of a reflection, Ii(hkl), around their average, Ī(hkl):

$$R_{merge} = \frac{\sum_{hkl} \sum_{i=1}^{n} \left| I_i(hkl) - \bar{I}(hkl) \right|}{\sum_{hkl} \sum_{i=1}^{n} I_i(hkl)} \tag{2}$$

In 1997, it was discovered that because Ii(hkl) values influence Ī(hkl), the Rmerge definition must be adjusted by a factor of √[n/(n – 1)] to give values that are independent of the multiplicity (4). The multiplicity-corrected version, called Rmeas, reliably reports on the consistency of the individual measurements. A further variant, Rpim (5), reports on the expected precision of Ī(hkl) and is lower by a factor of √(1/n) compared with Rmeas. Because the strength of diffraction decreases with resolution, a high-resolution cutoff is applied to discard data considered so noisy that their inclusion might degrade the quality of the resulting model. Data are typically truncated at a resolution before the Rmerge (or Rmeas) value exceeds ~0.6 to 0.8 and before the empirical signal-to-noise ratio, Ī/σ(Ī), drops below ~2.0 (6) (fig. S1). The uncertainty associated with these criteria is illustrated by a recent review that concluded “an appropriate choice of resolution cutoff is difficult and sometimes seems to be performed mainly to satisfy referees” (6).
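The three merging statistics differ only in a per-reflection multiplicity weight, which a short sketch can make concrete. The code below uses the square-root forms in which these statistics are usually defined, √[n/(n – 1)] for Rmeas and √[1/(n – 1)] for Rpim. The function name, the dictionary layout, the toy intensities, and the choice to skip singly measured reflections are assumptions of this example, not a description of any particular data-reduction program.

```python
from math import sqrt
from typing import Dict, List, Tuple

Hkl = Tuple[int, int, int]


def merging_r_values(unmerged: Dict[Hkl, List[float]]) -> Dict[str, float]:
    """Return Rmerge, Rmeas and Rpim for unmerged intensities.

    Reflections measured only once are skipped (an assumption of this
    sketch), because they carry no information about the spread.
    """
    num_merge = num_meas = num_pim = denom = 0.0
    for intensities in unmerged.values():
        n = len(intensities)
        if n < 2:
            continue
        mean = sum(intensities) / n
        abs_dev = sum(abs(i - mean) for i in intensities)
        num_merge += abs_dev
        num_meas += sqrt(n / (n - 1)) * abs_dev  # multiplicity-corrected
        num_pim += sqrt(1 / (n - 1)) * abs_dev   # precision of the mean
        denom += sum(intensities)
    return {"Rmerge": num_merge / denom,
            "Rmeas": num_meas / denom,
            "Rpim": num_pim / denom}


# Toy unmerged data (made up): two reflections with multiplicity 2 and 3.
toy_unmerged = {(1, 0, 0): [100.0, 110.0], (0, 1, 0): [50.0, 40.0, 60.0]}
toy_stats = merging_r_values(toy_unmerged)
```

When every reflection has the same multiplicity n, the weights reduce to a single factor and Rpim equals Rmeas divided by √n.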

That these criteria result in high-resolution cutoffs that are too conservative is illustrated here using an example data set (EXP) collected for a cysteine-bound complex of cysteine dioxygenase (CDO); the EXP data have an average intensity about 7% as strong as the data originally used to determine the structure at 1.42 Å resolution (PDB 3ELN; Rwork/Rfree = 0.135/0.177) (7, 8). Standardized model refinements starting with a 1.5 Å resolution unliganded CDO structure (PDB code 2B5H) (9) were carried out against the EXP data for a series of high-resolution cutoffs between 2.0 and 1.42 Å resolution (table S1). As R value comparisons are only meaningful if calculated at the same resolution, we evaluated paired refinements made with adjacent resolution limits using Rwork and Rfree values calculated at the poorer resolution limit. Improvement is indicated by drops in Rfree or increases in Rwork at the same Rfree (meaning the model is less overfit). This analysis revealed that every step of added data improved the resulting model (Fig. 1). Consistent with this, difference Fourier maps show a similar trend in signal versus resolution (fig. S2), and geometric parameters of the resulting models improve with resolution (table S2).
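The paired-refinement decision rule described above can be written down as a small comparison. This is only a sketch: the type and function names and the tolerance used to decide that two Rfree values are "the same" are hypothetical, and the R values in the example are made-up numbers chosen so that the differences are 0.38 and 0.34 percentage points.

```python
from typing import NamedTuple


class RValues(NamedTuple):
    """Rwork and Rfree, both calculated at the poorer resolution limit."""
    r_work: float
    r_free: float


def added_data_helped(narrow: RValues, wide: RValues, tol: float = 1e-4) -> bool:
    """Compare a refinement at limit X (narrow) with one at limit Y (wide).

    Both sets of R values must have been computed on the same reflections,
    i.e. only out to the poorer limit X. The tolerance is an assumption.
    """
    d_free = wide.r_free - narrow.r_free
    d_work = wide.r_work - narrow.r_work
    if d_free < -tol:
        return True  # Rfree dropped: the model improved
    # Rfree unchanged within tol while Rwork rose: the model is less overfit
    return abs(d_free) <= tol and d_work > tol


# Hypothetical R values for one step of added resolution.
example_step = added_data_helped(RValues(0.2000, 0.2400), RValues(0.1962, 0.2366))
```

Running this comparison for each adjacent pair of resolution limits gives one verdict per step of added data.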

Fig. 1

Higher-resolution data, even if weak, improve refinement behavior. For each incremental step of resolution from X → Y (top legend), the pair of bars gives the changes in overall Rwork (blue) and Rfree (red) for the model refined at resolution Y with respect to those for the model refined at resolution X, with both R values calculated at resolution X. The first pair of bars shows that Rwork and Rfree dropped 0.38% and 0.34%, respectively, upon isotropic refinement when the refinement resolution limit was extended from 2.0 to 1.9 Å; the other pairs of bars show the improvement upon anisotropic refinement.

The proven value of the data out to 1.42 Å resolution contrasts strongly with the Rmeas and Ī/σ(Ī) values at that resolution (>4.0 and ~0.3, respectively) (Fig. 2), which are far beyond the limits currently associated with useful data. Applying the typical standards described above, this data set would have been truncated at ~1.8 Å resolution, which would halve the number of unique reflections in the data set (table S1) and would yield a worse model.

Fig. 2

Data quality R values behave differently than those from crystallographic refinement, and useful data extend well beyond what standard cutoff criteria would suggest. Rmeas (squares) and Rpim (circles) are compared with Rwork (blue) and Rfree (red) from 1.42 Å resolution refinements against the EXP data set. Ī/σ(Ī) (gray) is also plotted. (Inset) A close-up of the plot beyond 2 Å resolution.

It is striking to observe the different behavior at high resolution of the crystallographic versus the data-quality R values, with the former remaining below 0.40 and the latter diverging toward infinity (Fig. 2). Consideration of the Rmerge formula rationalizes this divergence, because the denominator (the average net intensity) approaches zero at high resolution, but the numerator becomes dominated by background noise and is essentially constant. Thus, despite their similar names and mathematical definitions, data-quality R values are not comparable to R values from model refinement, and there is no valid basis for the commonly applied criterion that data are not useful beyond a resolution where Rmeas (or Rmerge or Rpim) rises above ~0.6. As suggested by Wang (10), Ī/σ(Ī) at a much lower level than generally recommended could be used to define the cutoff, but this has the problem that σ(Ī) values can be misestimated (6, 11).
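A small simulation can illustrate why Rmerge diverges. The sketch below measures each reflection twice with constant Gaussian noise while the mean true intensity falls from shell to shell. The exponential intensity distribution, the noise level, the shell means, and all names are assumptions of this toy model, not properties of the EXP data.

```python
import random
from typing import List


def simulated_rmerge(mean_intensity: float, sigma: float = 1.0,
                     n_reflections: int = 5000, seed: int = 0) -> float:
    """Rmerge for reflections measured twice, with constant Gaussian noise."""
    rng = random.Random(seed)
    numerator = denominator = 0.0
    for _ in range(n_reflections):
        # Assumed distribution of true intensities within the shell.
        true_i = rng.expovariate(1.0 / mean_intensity)
        i1 = true_i + rng.gauss(0.0, sigma)
        i2 = true_i + rng.gauss(0.0, sigma)
        mean = 0.5 * (i1 + i2)
        numerator += abs(i1 - mean) + abs(i2 - mean)  # depends on noise only
        denominator += i1 + i2                        # shrinks with the signal
    return numerator / denominator


# Hypothetical mean intensities for four shells of increasing resolution.
shell_means: List[float] = [100.0, 10.0, 1.0, 0.1]
rmerge_by_shell = [simulated_rmerge(m) for m in shell_means]
```

Because the numerator depends only on the noise and the denominator only on the mean signal, the simulated Rmerge should grow roughly in inverse proportion to the mean intensity and exceed 1 once the signal falls well below the noise.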

With current standards not serving as reliable guides for selecting a high-resolution cutoff, we investigated the use of the Pearson correlation coefficient (CC) (12) as a parameter that could potentially assess both data accuracy and the agreement of model and data on a common scale. Pearson’s CC is already used in crystallography, in that a CC value of 0.3 between independent measurements of anomalous signals has become the recommended criterion for selecting the high-resolution cutoff of the data to be used for defining the locations of the anomalous scatterers (13). Following a procedure suggested earlier (4), we divided the unmerged EXP data into two parts, each containing a random half of the measurements of each unique reflection. Then, the CC was calculated between the average intensities of each subset. This quantity, denoted CC1/2, is near 1.0 at low resolution and drops to near 0.1 at high resolution (Fig. 3). According to Student’s t test (12), the CC1/2 of 0.09 for the ~2100 reflection pairs in the highest resolution bin is significantly different from zero (P = 2 × 10⁻⁵).
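The half-data-set procedure and the significance check can be sketched as follows. The function names and data layout are assumptions of this example. The P value is obtained from a normal approximation to Student's t distribution and taken as a one-sided tail; both choices are assumptions of this sketch, since the text does not state how the test was set up.

```python
import random
from math import erfc, sqrt
from typing import Dict, List, Sequence, Tuple

Hkl = Tuple[int, int, int]


def pearson_cc(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def cc_half(unmerged: Dict[Hkl, List[float]], seed: int = 0) -> float:
    """Split each reflection's measurements randomly in two and correlate
    the two sets of half-data-set averages."""
    rng = random.Random(seed)
    half_a, half_b = [], []
    for intensities in unmerged.values():
        if len(intensities) < 2:
            continue  # cannot be split
        shuffled = list(intensities)
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        half_a.append(sum(shuffled[:mid]) / mid)
        half_b.append(sum(shuffled[mid:]) / (len(shuffled) - mid))
    return pearson_cc(half_a, half_b)


def cc_significance(cc: float, n_pairs: int) -> Tuple[float, float]:
    """t statistic for CC != 0 and a one-sided P from a normal approximation."""
    t = cc * sqrt((n_pairs - 2) / (1.0 - cc * cc))
    return t, 0.5 * erfc(t / sqrt(2.0))


t_stat, p_value = cc_significance(0.09, 2100)
```

For CC1/2 = 0.09 and 2100 pairs the t statistic is a little above 4, and the approximate P value is of the same order as the value quoted in the text.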

Fig. 3

Signal as a function of resolution as measured by correlation coefficients. Plotted as a function of resolution for the EXP data are CC1/2 (diamonds) and the CC for a comparison with the 3ELN reference data set (triangles). Ī/σ(Ī) (gray) is also shown. All determined CC1/2 values shown have expected standard errors of <0.025 (21, 22).
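The expected standard error mentioned in the caption can be checked with the usual large-sample approximation for a Pearson CC, σ(CC) = (1 – CC²)/√(n – 1). The function name is an assumption of this sketch; the inputs are the CC1/2 of 0.09 and the ~2100 reflection pairs reported for the highest resolution bin.

```python
from math import sqrt


def cc_standard_error(cc: float, n: int) -> float:
    """Large-sample standard error of a Pearson CC: (1 - CC^2) / sqrt(n - 1)."""
    if n < 2:
        raise ValueError("need at least two observations")
    return (1.0 - cc * cc) / sqrt(n - 1)


# Highest resolution bin of the EXP data: CC1/2 = 0.09 from ~2100 pairs.
se_highest_bin = cc_standard_error(0.09, 2100)
```

The result is about 0.022, below the 0.025 bound stated in the caption.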

This high significance occurs even though CC1/2 should be expected to underestimate the information content of the data. This is because for weak data, CC1/2 measures the correlation of one noisy data set (the first half-data set) with another noisy data set (the other half-data set), whereas the true level of signal would be measured by what could be called CCtrue, the correlation of the averaged data set (less noisy because of the extra averaging) with the noise-free true signal. Although the true signal would normally not be known, for the EXP test case, the 3ELN data provide a reference that has much lower noise and should be much closer to the underlying true data. The CC calculated between the EXP and 3ELN data sets is indeed uniformly higher than CC1/2 (Fig. 3), dropping only to 0.31 in the highest resolution bin (Student’s t test P = 10⁻⁶⁴).
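The gap between CC1/2 and CCtrue is easy to reproduce in a toy simulation where the true signal is known. In the sketch below, the exponential distribution of true intensities, the noise level, the number of reflections, and all names are assumptions chosen for illustration; they are not derived from the EXP or 3ELN data.

```python
import random
from math import sqrt
from typing import Sequence, Tuple


def pearson_cc(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate_half_vs_true(n_reflections: int = 20000, noise_sigma: float = 3.0,
                          seed: int = 1) -> Tuple[float, float]:
    """Return (CC1/2, CCtrue) for a weak simulated data set."""
    rng = random.Random(seed)
    true_i = [rng.expovariate(1.0) for _ in range(n_reflections)]
    # Two half-data-set averages with independent noise of equal size.
    half_a = [t + rng.gauss(0.0, noise_sigma) for t in true_i]
    half_b = [t + rng.gauss(0.0, noise_sigma) for t in true_i]
    merged = [0.5 * (a + b) for a, b in zip(half_a, half_b)]
    return pearson_cc(half_a, half_b), pearson_cc(merged, true_i)


cc_half_sim, cc_true_sim = simulate_half_vs_true()
```

With unit signal variance and a noise variance of 9 in each half, CC1/2 is expected to be close to 0.1, whereas the merged data should correlate with the true signal at about 0.43, several times higher.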

We next sought an analytical relation between CC1/2 and CCtrue. Using only the assumption that errors in the two half-data sets are random and, on average, of similar size (see supplementary text), we derived the relation

$$CC^{*} = \sqrt{\frac{2\,CC_{1/2}}{1 + CC_{1/2}}} \tag{3}$$

where CC* estimates the value of CCtrue, based on a finite-size sample. Equation 3 has been used in electron microscopy studies for a similar purpose (14) and is also related to the Spearman-Brown prophecy formula used in psychometrics to predict what test length is required to achieve a certain level of reliability (15). CC*, when computed with Eq. 3, agrees reasonably well with the CC for the EXP data compared with the 3ELN reference data, which shows that systematic factors influencing a real data set are not large enough to greatly perturb this relation (Fig. 4A). CC* provides a statistic that not only assesses data quality but also allows direct comparison of crystallographic model quality and data quality on the same scale. In particular, CCwork and CCfree (the standard and cross-validated correlations of the experimental intensities with the intensities calculated from the refined molecular model) can be directly compared with CC* (Fig. 4B). A CCwork larger than CC* implies overfitting, because, in that case, the model agrees better with the experimental data than the true signal does. A CCfree smaller than CC* (such as is seen at low resolution) indicates that the model does not account for all of the signal in the data. A CCfree closely matching CC*, such as at high resolution in Fig. 4B, implies that data quality is limiting model improvement. In this high-resolution region, the model, which was refined against EXP, correlates much better with the more accurate 3ELN than with the EXP data (Fig. 4B). This shows that, as is common for parsimonious models (16), the constructed molecular model is a better predictor of the true signal than are the experimental data from which it was derived. On a related point, because current estimates of a model’s coordinate error do not take the data errors into account (17–19), the model accuracy is actually better than these methods indicate.
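Equation 3, CC* = √[2 CC1/2 / (1 + CC1/2)], and the comparisons with CCwork and CCfree can be sketched in a few lines. The helper functions express CC1/2 and CCtrue in terms of q, the ratio of noise variance in one half-data set to signal variance, under the stated assumption of independent errors of equal size. This is one way to see why the relation holds, not a reproduction of the supplementary derivation. All names, the tolerance used for "closely matching", and the handling of non-positive CC1/2 are assumptions of this example.

```python
from math import sqrt


def cc_star(cc_half: float) -> float:
    """Estimate CCtrue from CC1/2 (Eq. 3)."""
    if cc_half <= 0.0:
        return 0.0  # assumption: treat a non-positive CC1/2 as no signal
    return sqrt(2.0 * cc_half / (1.0 + cc_half))


def cc_half_from_noise_ratio(q: float) -> float:
    """CC1/2 when each half has noise variance q times the signal variance."""
    return 1.0 / (1.0 + q)


def cc_true_from_noise_ratio(q: float) -> float:
    """CCtrue of the merged data, whose noise variance is q / 2."""
    return 1.0 / sqrt(1.0 + q / 2.0)


def interpret(cc_work: float, cc_free: float, cc_star_value: float,
              tol: float = 0.02) -> str:
    """Apply the comparisons described in the text; tol is hypothetical."""
    if cc_work > cc_star_value + tol:
        return "overfitting"
    if abs(cc_free - cc_star_value) <= tol:
        return "data quality limits the model"
    if cc_free < cc_star_value - tol:
        return "model does not explain all signal"
    return "undetermined"


# Highest resolution bin of the EXP data: CC1/2 = 0.09.
cc_star_highest_bin = cc_star(0.09)
```

For CC1/2 = 0.09 this gives CC* of about 0.41, which shows how a half-data-set correlation that looks negligible can correspond to a substantial correlation with the underlying signal.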

Fig. 4

The CC1/2/CC* relation and the utility of comparing CC* with CCwork and CCfree from a refined model. (A) Plotted is the analytical relation (Eq. 3) between CC1/2 and CC* (black curve). Also roughly following the CC* curve are the CC values for the EXP data compared with 3ELN (triangles) as a function of CC1/2. (B) Plotted as a function of resolution are CC* (black solid) for the EXP data set, as well as CCwork (blue dashed) and CCfree (red dashed) calculated on intensities from the 1.42 Å refined model. Also shown are values for CCwork (blue dotted) and CCfree (red dotted) between the 1.42 Å refined model and the 3ELN data set.

We verified, using a simulated data set (20) and two further test cases, that these findings are not specific to the EXP data (tables S3, S4, and S5, and fig. S3). Thus, CC* (or CC1/2) is a robust, statistically informative quantity useful for defining the high-resolution cutoff in crystallography. These examples show that with current data reduction and refinement protocols, it is justified to include data out to well beyond currently employed cutoff criteria (fig. S4), because the data at these lower signal levels do not degrade the model, but actually improve it. Advances in data-processing and refinement procedures, which until now have not been optimized for handling such weak data, may lead to further improvements in model accuracy. Finally, we emphasize that the analytical relation (Eq. 3) between CC1/2 and CC* is general, and thus, CC* may have similar applications for data- and model-quality assessment in other fields of science involving multiply measured data.

Supplementary Materials

Materials and Methods

Supplementary Text

Figs. S1 to S4

Tables S1 to S5

References (23–29)

References and Notes

  1. Materials and methods are available as supplementary materials on Science Online.
  2. Ten independent random partitionings of the data into the two subsets for calculating CC1/2 yielded standard deviations of <0.02 in all resolution ranges, and agreed reasonably with the expected standard error as calculated by σ(CC) = (1 – CC²)/√(n – 1), where n is the number of observations contributing to the CC calculation (21).
  3. Acknowledgments: This work was supported in part by the Alexander von Humboldt Foundation, the Konstanz Research School Chemical Biology, and NIH grants GM083136 and DK056649. We thank R. Cooley for providing the EXP data images, V. Lunin for help with deriving Eq. 3, and A. Gittleman for help with mathematical notation. We also thank M. Junk, W. Kabsch, K. Schäfer, D. Tronrud, M. Wells and W. Welte for critically reading the manuscript. The program HIRESCUT is available upon request. P.A.K. and K.D. designed and performed the research and wrote the paper. The authors declare no competing financial interests.