phenix.model_vs_data: a high-level tool for the calculation of crystallographic model and data statistics

Application of phenix.model_vs_data to the contents of the Protein Data Bank shows that the vast majority of deposited structures can be automatically analyzed to reproduce the reported quality statistics. However, the small fraction of structures that elude automated re-analysis highlight areas where new software developments can help retain valuable information for future analysis.


Introduction
A tool for quickly obtaining an overview of crystallographic model quality, diffraction data statistics and indicators of the fit of the model to the data is very helpful at all stages of structure solution and validation. Such a tool requires the application of multiple complex and diverse algorithms. For example, it must be capable of processing different representations of atomic displacement parameters including translation-libration-screw (TLS) information (Schomaker & Trueblood, 1968), analysis of both X-ray and neutron data and data collected from twinned crystals, as well as handling novel ligands or nonstandard residues, Protein Data Bank (PDB; Bernstein et al., 1977;Berman et al., 2000) files with multiple models or alternative conformations, and the many reflection data file formats currently in use. We have developed a new program, phenix.model_vs_data, which is a part of the PHENIX project (Adams et al., , 2010. This program automatically handles a large variety of inputs with minimal user intervention. The high degree of automation and ease of use make it possible to routinely run phenix.model_vs_data for quick but comprehensive evaluations with results presented in a concise form. We have tested phenix.model_vs_data extensively by automatically processing all PDB models (Joosten, Womack et al., 2009;Joosten, Salzemann et al., 2009) for which experimental data are available. Here we describe this new tool and illustrate its use. Running phenix.model_vs_data across the whole PDB database we observe that there are a number of entries for which the reported statistics are not reproduced; the reasons for this are discussed, highlighting the difficulties that can be encountered in reproducing statistical quality metrics.

phenix.model_vs_data input and output
phenix.model_vs_data reads a model file in PDB format (Bernstein et al., 1977;Berman et al., 2000) and a file with experimental, reduced reflection data. For example, phenix:model vs data model:pdb data:hkl Many commonly used reflection file formats are supported directly, such as MTZ (CCP4 suite; Collaborative Computational Project, Number 4, 1994), X-plor/CNS (Brü nger et al., 1998), SHELX (Sheldrick, 2008) and SCALEPACK (Otwinowski & Minor, 1997). If multiple reflection data sets are detected, the user is prompted to specify which data array to use. It is also possible to pass multiple reflection files, for example a file with experimental data and a separate file with free-R flags (Brü nger, 1992).
The phenix.model_vs_data output contains four main sections: (1) model validation statistics, (2) data statistics, (3) a fit of the model to the diffraction data and (4) additional information extracted from the PDB file header if available. The output is plain text (Fig. 1). The statistics can be inspected from the output to the screen, or from the Python script level by accessing the corresponding attributes of the returned phenix.model_vs_data object.
If requested, an electron (for X-ray data) or nuclear (for neutron data) density map can be created by specifying a map type. Supported are regular or maximum-likelihood weighted maps ( A map; Read, 1986;Urzhumtsev et al., 1996) such as 2mF obs -DF calc , 3F obs -2F calc , anomalous difference maps, average kick maps (Pražnikar et al., 2009) and the replacement of missing F obs with DF calc [for more details see Murshudov et al. (1997) and Adams et al. (2010), and references therein]. The output file is in MTZ format and contains Fourier map coefficients that can be readily displayed in the COOT program (Emsley & Cowtan, 2004).
Another option is the computation of map correlation coefficients. The two maps that are correlated are the 2mF obs -DF calc map and the F calc map. The latter is computed as the Fourier transform of only the F calc for which there are corresponding experimental observations available to account for the effects of finite resolution and possible incompleteness of the experimental data. Depending on the resolution of the input data, the correlation coefficients are shown per atom or per residue. Since the correlation alone is not always conclusive, density values of normalized ('sigma-scaled') 2mF obs -DF calc and mF obs -DF calc maps are shown along with each correlation coefficient (the maps are normalized using the standard deviation, as is common practice). This facilitates quick assessment of local model-to-density fits characterized by regions with a poor map correlation and low 2mF obs -DF calc density values or high absolute densities in the mF obs -DF calc map.
2.2. phenix.model_vs_data algorithms phenix.model_vs_data makes extensive use of the CCTBX library . For example, input PDB files are processed with the comprehensive PDB library implemented in the CCTBX. The Monomer Library ) is used to obtain geometry restraints (bond, angle, dihedral, chirality, planarity and nonbonded restraints). If an input model contains residues not defined in the Monomer Library, for example a novel ligand or nonstandard residue, phenix.ready_set (N. W. Moriarty, unpublished), which uses eLBOW (Moriarty et al., 2009) internally, is used to automatically generate suitable restraints. The second part of the model-quality section contains summary statistics similar to those generated by the MolProbity web site (Davis et al., 2007;Chen et al., 2010), by using the tools integrated into PHENIX. phenix.ramalyze is used to compute the number of Ramachandran outliers, as well as favored and allowed residues (Lovell et al., 2003), and phenix.cbetadev is used to compute the number of residues with >0.25 Å deviation from ideal C positions (Lovell et al., 2003). phenix.rotalyze calculates the percent sidechain rotamer outliers (Lovell et al., 2000). phenix.reduce and phenix.probe are used to add H atoms and calculate the all-atom clashscore (Word et al., 1999).
phenix.xtriage (Zwart et al., 2005) is used to detect possible twinning (see, for example, Parsons, 2003;Helliwell, 2008). In the presence of possible twin laws, the R factors are computed without any twin law and then by taking each twin law into account. The  Example phenix.model_vs_data output (for PDB entry 3dcv). Model information includes composition and geometry statistics. Data information includes completeness in resolution shells. Model-to-data fit information includes R factors calculated for the whole set of structure factors using an optimized bulk-solvent model, anisotropic scaling, and TLS and twinning if applicable. R factors are also recalculated after applying the resolution limits and cutoffs reported in the PDB header.
twin-related calculations can be relatively time consuming, but provide a more robust basis for deciding if twinning needs to be included.
If a model was previously refined using TLS parameters, the ATOM and ANISOU records in the coordinate section of the PDB file may contain either total or residual atomic displacement parameters, depending on the refinement program used. The nature of the atomic displacement parameters is often not clear from the TLS information stored as REMARK records in the PDB file header. Therefore two alternatives are tested: R factors are computed assuming (i) total atomic displacement parameter values and (ii) residual atomic displacement parameter values in the coordinate section of the PDB file. The outcome with the lowest R factor is taken to be correct. Typical R-factor differences are 2-10%. The phenix.tls (P. V. Afonine, unpublished) module in the CCTBX is used to extract the TLS information (selections, origins, matrices) from the PDB file header. Two commonly used formats are automatically distinguished: phenix.refine (Afonine et al., 2005a) and REFMAC (Murshudov et al., 1997).
R factors are computed after performing bulk-solvent correction and anisotropic scaling as described by Afonine et al. (2005b). The Wilson B factor shown in the output is computed using a likelihood procedure (Zwart et al., 2005). Reflection data outliers are automatically detected (Read, 1999) and removed from subsequent calculations. The number of outliers is reported in the output.
phenix.model_vs_data also supports PDB files with multiple models [see, for example, Burling & Brü nger (1994), Levin et al. (2007), Terwilliger et al. (2007), and references therein]. In addition a list of PDB files can be given as input, facilitating the computation of statistics for very large structures that are currently typically split across multiple files in the PDB.

Running phenix.model_vs_data for entries in the PDB archive
The phenix.model_vs_data program has been thoroughly tested by analyzing all PDB entries for which experimental structure factors are available. This was performed in two steps: first the phenix.cif_as_mtz tool (P. V. Afonine, unpublished) was used to extract and convert all mmCIF structure factor data files into MTZ format (structure factors, values and free-R flags). Then phenix.model_vs_data was run using the generated MTZ files with the associated coordinate files. The conversion of CIF format reflection data automatically distinguishes between structure factor intensities or amplitudes, as well as X-ray or neutron data. If possible, the algorithm automatically extracts the free-R flags.
The result of analyzing the whole PDB yielded a wealth of useful information currently not always present in PDB depositions: twinning diagnostics, bulk-solvent and scale parameters (Afonine et al., 2005b), number of reflection outliers, MolProbity statistics, and Wilson B factors. For a number of structures we observed significant discrepancies between the archived metrics (e.g. R factors) and their recomputed values. Fig. 2 shows a histogram of the differences between reported R work (as found in the PDB file header) and the recomputed value. In the following section we discuss the factors that can lead to differences in the R factors. A somewhat similar discussion is presented by Kleywegt et al. (2004). We note that numerical considerations, such as the method used to calculate structure factors (i.e. direct versus fast Fourier transformation) have little impact on the results and the difference between R factors computed using the different methods is typically less than 0.01%.
3.1. Reasons for R-factor discrepancies 3.1.1. Missed twinning. Our analysis of the PDB indicates that approximately 3% of all crystal structures are affected by twinning [see Lebedev et al. (2006) for the results of a similar survey of the PDB]. In at least 120 cases, taking twinning into account reduced the R factors by 5-20% points.
3.1.2. Variations in bulk-solvent and anisotropic scaling model and related parameters. There are two bulk-solvent models generally used in crystallographic software. One is based on the Babinet principle and is used in the SHELXL (Sheldrick, 2008) and TNT (Tronrud, 1987) programs. The second is a mask-based method based on the flat bulk-solvent model (see Jiang & Brü nger, 1994, and references therein) and is used in programs such as CNS, REFMAC and phenix.refine. In addition, this correction is typically convoluted with overall anisotropic scaling of the diffraction data. There are two different approaches used to perform this anisotropic scaling: using an exponential function (Sheriff & Hendrickson, 1987;Murshudov et al., 1998;used in CNS, REFMAC and phenix.refine) or using a polynomial (Parkin et al., 1995;Usó n et al., 1999;used in SHELXL). The mask-based bulk-solvent model has been shown to be superior (Jiang & Brü nger, 1994) and recent methods have been developed to increase the stability of its calculation in combination with anisotropic scaling (Fokine & Urzhumtsev, 2002;Afonine et al., 2005b;Brü nger, 2007). Clearly, recalculation of R factors using different bulksolvent and anisotropic scaling algorithms from those originally used will most likely result in differences. Table 1  Histogram of differences between R work reported in the PDB file header and the value calculated with phenix.model_vs_data. Resolution and cutoffs were applied in the calculation if available.
trates, for a few selected structures taken at different resolutions, how large the deviations can be.
3.1.3. Missing anisotropic atomic displacement parameters. We observed 14 structures at a resolution higher than 1.0 Å that had all isotropic atomic displacement parameters. For these structures the recomputed R factors are several percentage points higher than those reported (see Table 2 for an example). A review of the literature indicated that at least five of these structures were refined using anisotropic atomic displacement parameters.
3.1.4. Nonphysical anisotropic atomic displacement parameters. To make physical sense, a symmetric matrix representing anisotropic atomic displacement parameters has to be positive definite. We observed several hundred entries with negative-definite anisotropic displacement parameters. The impact on R factors depends on the percentage of such atoms in a structure. Considering all cases we observe an average R-factor increase of $2.5% points, and in the worst case changes of 10% and more. Zero atomic displacement parameter values for H atoms (see x3.1.5) also fall into this category.
3.1.5. Missing H atoms. Analysis of deposited structures indicates that even if H atoms were used in refinement (e.g. using a riding model) they are often removed prior to structure deposition. To assess the impact of removing H atoms we selected 275 deposited structures that contain H atoms. Fig. 3 shows the difference between R work factors computed using the original structures and those with all H atoms removed. The contribution from the H atoms is significant, ranging from approximately 0.5 to 2.0 points in R work , and is essentially independent of resolution. Those structures where removal of the H atoms leads to a decrease in R work (i.e. negative differences) typically have nonphysical parameters (e.g. atomic displacement parameter values of zero for all H atoms). We then assessed the impact of adding H atoms back to those 275 structures. We restored the H atoms using ideal parameters and recomputed the R factors. Our observation is that the recomputed R factors do not match the original ones, as shown in Fig. 4 Table 1 Comparison of published (column 3) R factors and solvent parameters with those recomputed using default parameters (column 4), recomputed using published values of k sol and B sol (column 5), and recomputed using slightly different values of r shrink and r solv (those used in REFMAC; last column).

Figure 3
Differences between R work computed for the original structures with H atoms and the same structures after removal of the H atoms, shown as function of resolution. See x3.1.5 for details.

Figure 4
Differences between R work values (shown as function of resolution) computed for structures without H atoms and the same structures with restored H atoms based on ideal geometry. The atomic displacement parameter and occupancy of each restored H atom was set to be identical to those of the bonded atom. See x3.1.5 for details.
atoms at a nuclear position derived from neutron scattering experiments (Allen, 1986) or placing them at a shorter distance where the electron density peak is truly observed (as it is implemented in SHELXL). The assignment of the H-atom displacement parameters further complicates the calculation. H atoms can inherit the exact atomic displacement parameters of the atoms to which they are bound, or they can take this value multiplied by a factor between 1.0 and 1.5 (see the SHELXL manual, for example). At subatomic resolution the H-atom atomic displacement parameters may have been refined to unique values for each atom. 3.1.6. Missing water molecules. We observed a number of structures refined at resolutions better than 2 Å that do not possess any solvent atoms and for which the recalculated R factors are different from those originally reported. We selected a few such structures and automatically processed them with phenix.refine in order to add water atoms and then recompared R factors. Table 3 summarizes the results. The table suggests that the difference between published and recomputed R factors is due to missing solvent atoms. In many cases the differences in solvent structure are small (a few missing water molecules), while in other cases the absence of water molecules results in a very large discrepancy (e.g. structure 1ejg).
3.1.7. The use of very high resolution refinement methods: multipolar refinement and interatomic scatterers. At subatomic resolution (better than $1 Å ) a multipolar (Hansen & Coppens, 1978) or an interatomic scattering model (Afonine, 2004 can be used to model residual bonding density that is typically visible at such resolutions. Currently, there is no mechanism in the PDB file format to preserve this information, and therefore the R-factor statistics obtained in such a refinement cannot be reproduced from the deposited structure. An example is 1ejg, a structure refined at 0.54 Å resolution using multipolar methods.
3.1.8. Structures refined using the TLS model. When TLS refinement is used, the total atomic displacement parameter is typically approximated by the sum of three contributions: the residual atomic displacement parameter representing local atomic vibrations, the component representing the rigid-body displacements modeled through TLS, and the component representing lattice vibrations, which is usually modeled as part of the overall anisotropic scaling.
There are at least two types of PDB files where the TLS information is represented differently: entries where each atom participating in a TLS group has its total atomic displacement parameter reported (for example, structures refined with phenix.refine) and entries where only residual atomic displacement parameters are reported for each atom and the TLS component is stored as TLS matrices in the file header (typically, structures refined with REFMAC). To recompute the R factors, it is essential that the displacement information for each atom be correctly retrieved from the PDB file and the total atomic displacement parameter for each used. This in turn makes it vital for the structures where residual atomic displacement parameters are reported that the TLS information, namely TLS origin, values of the TLS matrices and the TLS group definition, can be correctly extracted from the PDB file header.
As of December 2009, there are 8278 structures (out of a total of 62 305) that contain TLS information. For 730 of these entries the TLS information cannot be correctly extracted. The typical problems in TLS records can be classified into three categories: (a) missing, empty, duplicate, ambiguous or syntactically incorrect TLS group selections; (b) missing or incorrectly defined TLS group origins; (c) problems with the TLS matrices (for example, incorrect formatting).
3.1.9. Other factors. Other possible reasons for discrepancies between reported and recalculated R factors are as follows: (a) Absence of test set (cross-validation) flags, so phenix.-model_vs_data uses all (work and test) reflections to compute the R factor.
(b) Some programs allow refinement of f 0 and f 00 for anomalous scatterers. However, the refined f 0 and f 00 values are typically not preserved in the PDB file header, and therefore they are not used in structure factor calculations.
(c) Various manipulations on F obs , such as removing outliers and applying anisotropic corrections.
(d) Incomplete, missing or incorrect information in the file header about data cutoffs used in statistics calculation (by resolution, ).
(e) Running a final refinement against all data (instead of excluding the R-free set) before deposition.

Special cases
Most crystallographic entries in the PDB are derived from X-ray diffraction data, and are represented as a single atomic model. However, there are special cases, which constitute only a very small fraction of all the entries: structures determined using neutron data, multiple model entries or extremely large  Table 3 Example of PDB entries with missing water molecules. structures. While most crystallographic software can seamlessly handle single-model X-ray structures (given that the appropriate libraries for nonstandard items, such as ligands, are provided), handling these special cases can be a challenging problem. The phenix.model_vs_data program was developed to handle such special cases with the results described below.
3.2.1. Multiple model entries. There are 125 crystal structures in the PDB that are represented by multiple models; 114 of them have experimental data available. Among those 114 data sets, nine contain Miller indices that are not unique under the symmetry with several hundreds of redundant reflections, and 49 files contain multiple data sets, making automated interpretation uncertain. Table 4 shows the summary of running phenix.model_vs_data for the remaining 56 entries. For 28 of these the recalculated R factor was within 5% of the reported value. Seven entries (2g0v, 2g0x, 2g0z, 2g10, 2g11, 2g12, 2g14) report R values obtained after difference refinement (Terwilliger & Berendzen, 1995) that reflect the agreement between model differences and data differences. Therefore it is not possible to reproduce these deposited R values; however, the computed R factors are all within the 17.3-18.5% range. Ten structures (1yrq, 1zev, 2ce2, 2cl6, 2cl7, 2clc, 2cld, 2evw, 2gn0, 3cmy) have recalculated R factors about five percentage points higher than the reported values. This is because these structures were subject to TLS refinement, but the TLS selections in the PDB file headers do not unam-biguously define the TLS groups, making it impossible to reproduce the total atomic displacement parameters of the affected atoms (for these structures only residual atomic displacement parameters are present in ATOM records). The 2ull entry has high recomputed R factors. The corresponding PDB file contains 16 models, and each protein atom within each model has the occupancy of 0.06, making the total occupancy $0.96 (16 Â 0.06). This should not pose problems if the overall occupancies are identical for each model and the number of models is less than 100: the overall scale factor will account for this numerical rounding. However, for PDB entry 2ull the solvent structure is identical for each of 16 models, but unlike the occupancies of the protein atoms those of the solvent atoms are not scaled to sum to one. We consider this the main reason for the R-factor mismatch.
3.2.2. Large structures spread across multiple files. There are 52 structures in PDB that are split across multiple files; 45 of them are crystal structures. Of these, 40 crystallographic structures have the experimental data deposited. Three of the 40 entries were excluded from tests because we could not extract the data (2zuo, 2zv4, 2zv5), or the data files are not unique under symmetry (1jyy, 1jyz, 1jz0, 1jz1). phenix.-model_vs_data could reproduce the R factors for the remaining 37 (results not shown).
3.2.3. Structures determined using neutron data. Currently 32 structures in the PDB were determined using neutron diffraction data, 26 of which have experimental structure factor data available.  Table 5 Crystal structures solved using neutron data.
R work and R free as extracted from PDB file header (second column), and as recalculated using phenix.model_vs_data (third column).  recomputed using phenix.model_vs_data. In only six cases out of the total of 26 did the recomputed R work not match the published values. In four of these cases this is because R work was not available in PDB file header. However, we still observed situations that make it challenging to recompute the R values: (a) The sum of occupancies for exchangeable H/D sites (see, for example, Niimura et al., 2006) is smaller than 1.
(b) Incorrect or missing information in the PDB file header, such as missing R factors or cutoff values.
(c) H/D exchange is not modeled or is incompletely modeled. For example, the molecule is fully deuterated but the corresponding PDB file contains all H atoms instead of D atoms. In a number of cases only a small fraction of the potentially exchangeable sites are modeled.
(d) In some cases the reflection data intensities are mislabeled as amplitudes or vice versa. We note that this problem is not limited to neutron diffraction data.
( f) Atoms with an undefined scattering type, e.g. labeled as X.

Conclusion
The output of the phenix.model_vs_data program is designed to enable easy validation of model and data files, and of commonly reported model/data statistics, in particular as found in PDB file headers. To assure a high degree of automation and robustness, the phenix.model_vs_data program is routinely tested by processing all PDB entries for which experimental data are available. The statistics generated are actively used in the development of the PHENIX system. An example of an application of this database is the POLYGON program (Urzhumtseva et al., 2009), which provides a concise graphical comparison of model quality measures with similar entries found in the PDB.
Application of phenix.model_vs_data to the contents of the PDB shows that the vast majority of deposited structures can be automatically analyzed to reproduce the reported quality statistics. However, there remain a small fraction of structures that elude automated re-analysis. These highlight areas where new developments in structure deposition tools and refinement software can help retain valuable information for future analysis.
phenix.model_vs_data is available as part of the PHENIX package, which can be obtained from http://www.phenixonline.org.